


Smoothing noisy data using Local Regression and Likelihood.
arguments still to add: dc renorm itype mint maxit debug
Usage: fit = locfit(x,y) % local regression fit of x and y.
fit = locfit(x) % density estimation of x.
Smoothing with locfit is a two-step procedure. The locfit()
function evaluates the local regression smooth at a set of points
(can be specified through an evaluation structure). Then, use
the predict() function to interpolate this fit to other points.
Additional arguments to locfit() are specified as 'name',value pairs, e.g.:
locfit( x, 'alpha',[0.7,1.5] , 'family','rate' , 'ev','grid' , 'mg',100 );
Data-related inputs:
x is a vector or matrix of the independent (or predictor) variables.
Rows of x represent subjects, columns represent variables.
Generally, local regression would be used with 1-4 independent
variables. In higher dimensions, the curse-of-dimensionality,
as well as the difficulty of visualizing higher dimensional
surfaces, may limit usefulness.
y is the column vector of the dependent (or response) variable.
For density families, 'y' is omitted.
NOTE: x and y are the first two arguments. All other arguments require
the 'name',value notation.
'weights' Prior weights for observations (reciprocal of variance, or
sample size).
'cens' Censoring indicators for hazard rate or censored regression.
The coding is '1' (or 'TRUE') for a censored observation, and
'0' (or 'FALSE') for uncensored observations.
'base' Baseline parameter estimate. If a baseline is provided,
the local regression model is fitted as
Y_i = b_i + m(x_i) + epsilon_i,
with Locfit estimating the m(x) term. For regression models,
this effectively subtracts b_i from Y_i. The advantage of the
'base' formulation is that it extends to likelihood
regression models.
'scale' A scale to apply to each variable. This is especially
important for multivariate fitting, where variables may be
measured in non-comparable units. It is also used to specify
the frequency for variables with the 'a' (angular) style.
'sty' Character string (length d) of styles for each predictor variable.
n denotes `normal'; a denotes angular (or periodic); l and r
denotes one-sided left and right; c is conditionally parametric.
Smoothing Parameters and Bandwidths:
The bandwidth (or more accurately, half-width) of the smoothing window
controls the amount of smoothing. Locfit allows specification of constant
(fixed), nearest neighbor, certain locally adaptive variable bandwidths,
and combinations of these. Also related to the smoothing parameter
are the local polynmial degree and weight function.
'nn' 'Nearest neighbor' smoothing parameter. Specifying 'nn',0.5
means that the width of each smoothing neighborhood is chosen
to cover 50% of the data.
'h' A constant (or fixed) bandwidth parameter. For example, 'h',2
means that the smoothing windows have constant half-width
(or radius) 2. Note that h is applied after scaling.
'pen' penalty parameter for adaptive smoothing. Needs to be used
with care.
'alpha' The old way of specifying smoothing parameters, as used in
my book. alpha is equivalent to the vector [nn,h,pen].
If multiple componenents are non-zero, the largest corresponding
bandwidth is used. The default (if none of alpha,nn,h,pen
are provided) is [0.7 0 0].
'deg' Degree of local polynomial. Default: 2 (local quadratic).
Degrees 0 to 3 are supported by almost all parts of the
Locfit code. Higher degrees may work in some cases.
'kern' Weight function, default = 'tcub'. Other choices are
'rect', 'trwt', 'tria', 'epan', 'bisq' and 'gauss'.
Choices may be restricted when derivatives are
required; e.g. for confidence bands and some bandwidth
selectors.
'kt' Kernel type, 'sph' (default); 'prod'. In multivariate
problems, 'prod' uses a simplified product model which
speeds up computations.
'acri' Criterion for adaptive bandwidth selection.
Derivative Estimation.
Generally I recommend caution when using derivative estimation
(and especially higher order derivative estimation) -- can you
really estimate derivatives from noisy data? Any derivative
estimate is inherently more dependent on an assumed smoothness
(expressed through the bandwidth) than the data. Warnings aside...
'deriv' Derivative estimation. 'deriv',1 specifies the first derivative
(or more correctly, an estimate of the local slope is returned.
'deriv',[1 1] specifies the second derivative. For bivariate fits
'deriv',2 specifies the first partial derivative wrt x2.
'deriv',[1 2] is mixed second-order derivative.
Fitting family.
'family' is used to specify the local likelihood family.
Regression-type families are 'gaussian', 'binomial',
'poisson', 'gamma' and 'geom'. If the family is preceded
by a q (e.g. 'qgauss', or 'qpois') then quasi-likelihood is
used; in particular, a dispersion estimate is computed.
Preceding by an 'r' makes an attempt at robust (outlier-resistant)
estimation. Combining q and r (e.g. 'family','qrpois') may
work, if you're lucky.
Density estimation-type families are 'dens', 'rate' and 'hazard'
(hazard or failure rate). Note that `dens' scales the output
to be a statistical density estimate (i.e. scaled to integrate
to 1). 'rate' estimates the rate or intensity function (events
per unit time, or events per unit area), which may be called
density in some fields.
The default family is 'qgauss' if a response (y argument) has been
provided, and 'dens' if no response is given.
'link' Link function for local likelihood fitting. Depending on the
family, choices may be 'ident', 'log', 'logit',
'inverse', 'sqrt' and 'arcsin'.
Evaluation structures.
By default, locfit chooses a set of points, depending on the data
and smoothing parameters, to evaluate at. This is controlled by
the evaluation structure.
'ev' Specify the evaluation structure. Default is 'tree'.
Other choices include 'phull' (triangulation), 'grid' (a grid
of points), 'data' (each data point), 'crossval' (data,
but use leave-one-out cross validation), 'none' (no evaluation
points, effectively producing the global parametric fit).
Alternatively, a vector/matrix of evaluation points may be
provided.
(kd trees not currently supported in mlocfit)
'll' and 'ur' -- row vectors specifying the upper and lower limits
for the bounding box used by the evaluation structure.
They default to the data range.
'mg' For the 'grid' evaluation structure, 'mg' specifies the
number of points on each margin. Default 10. Can be either a
single number or vector.
'cut' Refinement parameter for adaptive partitions. Default 0.8;
smaller values result in more refined partitions.
'maxk' Controls space assignment for evaluation structures. For the
adaptive evaluation structures, it is impossible to be sure
in advance how many vertices will be generated. If you get
warnings about `Insufficient vertex space', Locfit's default
assigment can be increased by increasing 'maxk'. The default
is 'maxk','100'.
'xlim' For density estimation, Locfit allows the density to be
supported on a bounded interval (or rectangle, in more than
one dimension). The format should be [ll;ul] (ie, matrix with
two rows, d columns) where ll is the lower left corner of
the rectangle, and ur is the upper right corner.
One-sided bounds, such as [0,infty), are not supported, but can be
effectively specified by specifying a very large upper
bound.
'what' is used by other functions calling locfit(),
and should not be used directly.
The output of locfit() is a Matlab Cell array:
fit{1} = data.
fit{2} = evaluation structure.
fit{3} = smoothing parameter structure.
fit{4}{1} = fit points matrix.
fit{4}{2} = matrix of fitted values etc.
Note that these are not back-transformed, and may have the
parametric component removed.
fit{4}{3} = various details of the evaluation points.
fit{4}{4} = fit limits.
fit{4}{5} = family,link.
fit{5} = parametric component values.

0001 function fit=locfit(varargin) 0002 0003 % Smoothing noisy data using Local Regression and Likelihood. 0004 % 0005 % arguments still to add: dc renorm itype mint maxit debug 0006 % 0007 % Usage: fit = locfit(x,y) % local regression fit of x and y. 0008 % fit = locfit(x) % density estimation of x. 0009 % 0010 % Smoothing with locfit is a two-step procedure. The locfit() 0011 % function evaluates the local regression smooth at a set of points 0012 % (can be specified through an evaluation structure). Then, use 0013 % the predict() function to interpolate this fit to other points. 0014 % 0015 % Additional arguments to locfit() are specified as 'name',value pairs, e.g.: 0016 % locfit( x, 'alpha',[0.7,1.5] , 'family','rate' , 'ev','grid' , 'mg',100 ); 0017 % 0018 % 0019 % Data-related inputs: 0020 % 0021 % x is a vector or matrix of the independent (or predictor) variables. 0022 % Rows of x represent subjects, columns represent variables. 0023 % Generally, local regression would be used with 1-4 independent 0024 % variables. In higher dimensions, the curse-of-dimensionality, 0025 % as well as the difficulty of visualizing higher dimensional 0026 % surfaces, may limit usefulness. 0027 % 0028 % y is the column vector of the dependent (or response) variable. 0029 % For density families, 'y' is omitted. 0030 % NOTE: x and y are the first two arguments. All other arguments require 0031 % the 'name',value notation. 0032 % 0033 % 'weights' Prior weights for observations (reciprocal of variance, or 0034 % sample size). 0035 % 'cens' Censoring indicators for hazard rate or censored regression. 0036 % The coding is '1' (or 'TRUE') for a censored observation, and 0037 % '0' (or 'FALSE') for uncensored observations. 0038 % 'base' Baseline parameter estimate. If a baseline is provided, 0039 % the local regression model is fitted as 0040 % Y_i = b_i + m(x_i) + epsilon_i, 0041 % with Locfit estimating the m(x) term. For regression models, 0042 % this effectively subtracts b_i from Y_i. The advantage of the 0043 % 'base' formulation is that it extends to likelihood 0044 % regression models. 0045 % 'scale' A scale to apply to each variable. This is especially 0046 % important for multivariate fitting, where variables may be 0047 % measured in non-comparable units. It is also used to specify 0048 % the frequency for variables with the 'a' (angular) style. 0049 % 'sty' Character string (length d) of styles for each predictor variable. 0050 % n denotes `normal'; a denotes angular (or periodic); l and r 0051 % denotes one-sided left and right; c is conditionally parametric. 0052 % 0053 % 0054 % Smoothing Parameters and Bandwidths: 0055 % The bandwidth (or more accurately, half-width) of the smoothing window 0056 % controls the amount of smoothing. Locfit allows specification of constant 0057 % (fixed), nearest neighbor, certain locally adaptive variable bandwidths, 0058 % and combinations of these. Also related to the smoothing parameter 0059 % are the local polynmial degree and weight function. 0060 % 0061 % 'nn' 'Nearest neighbor' smoothing parameter. Specifying 'nn',0.5 0062 % means that the width of each smoothing neighborhood is chosen 0063 % to cover 50% of the data. 0064 % 0065 % 'h' A constant (or fixed) bandwidth parameter. For example, 'h',2 0066 % means that the smoothing windows have constant half-width 0067 % (or radius) 2. Note that h is applied after scaling. 0068 % 0069 % 'pen' penalty parameter for adaptive smoothing. Needs to be used 0070 % with care. 0071 % 0072 % 'alpha' The old way of specifying smoothing parameters, as used in 0073 % my book. alpha is equivalent to the vector [nn,h,pen]. 0074 % If multiple componenents are non-zero, the largest corresponding 0075 % bandwidth is used. The default (if none of alpha,nn,h,pen 0076 % are provided) is [0.7 0 0]. 0077 % 0078 % 'deg' Degree of local polynomial. Default: 2 (local quadratic). 0079 % Degrees 0 to 3 are supported by almost all parts of the 0080 % Locfit code. Higher degrees may work in some cases. 0081 % 0082 % 'kern' Weight function, default = 'tcub'. Other choices are 0083 % 'rect', 'trwt', 'tria', 'epan', 'bisq' and 'gauss'. 0084 % Choices may be restricted when derivatives are 0085 % required; e.g. for confidence bands and some bandwidth 0086 % selectors. 0087 % 0088 % 'kt' Kernel type, 'sph' (default); 'prod'. In multivariate 0089 % problems, 'prod' uses a simplified product model which 0090 % speeds up computations. 0091 % 0092 % 'acri' Criterion for adaptive bandwidth selection. 0093 % 0094 % 0095 % Derivative Estimation. 0096 % Generally I recommend caution when using derivative estimation 0097 % (and especially higher order derivative estimation) -- can you 0098 % really estimate derivatives from noisy data? Any derivative 0099 % estimate is inherently more dependent on an assumed smoothness 0100 % (expressed through the bandwidth) than the data. Warnings aside... 0101 % 0102 % 'deriv' Derivative estimation. 'deriv',1 specifies the first derivative 0103 % (or more correctly, an estimate of the local slope is returned. 0104 % 'deriv',[1 1] specifies the second derivative. For bivariate fits 0105 % 'deriv',2 specifies the first partial derivative wrt x2. 0106 % 'deriv',[1 2] is mixed second-order derivative. 0107 % 0108 % Fitting family. 0109 % 'family' is used to specify the local likelihood family. 0110 % Regression-type families are 'gaussian', 'binomial', 0111 % 'poisson', 'gamma' and 'geom'. If the family is preceded 0112 % by a q (e.g. 'qgauss', or 'qpois') then quasi-likelihood is 0113 % used; in particular, a dispersion estimate is computed. 0114 % Preceding by an 'r' makes an attempt at robust (outlier-resistant) 0115 % estimation. Combining q and r (e.g. 'family','qrpois') may 0116 % work, if you're lucky. 0117 % Density estimation-type families are 'dens', 'rate' and 'hazard' 0118 % (hazard or failure rate). Note that `dens' scales the output 0119 % to be a statistical density estimate (i.e. scaled to integrate 0120 % to 1). 'rate' estimates the rate or intensity function (events 0121 % per unit time, or events per unit area), which may be called 0122 % density in some fields. 0123 % The default family is 'qgauss' if a response (y argument) has been 0124 % provided, and 'dens' if no response is given. 0125 % 'link' Link function for local likelihood fitting. Depending on the 0126 % family, choices may be 'ident', 'log', 'logit', 0127 % 'inverse', 'sqrt' and 'arcsin'. 0128 % 0129 % Evaluation structures. 0130 % By default, locfit chooses a set of points, depending on the data 0131 % and smoothing parameters, to evaluate at. This is controlled by 0132 % the evaluation structure. 0133 % 'ev' Specify the evaluation structure. Default is 'tree'. 0134 % Other choices include 'phull' (triangulation), 'grid' (a grid 0135 % of points), 'data' (each data point), 'crossval' (data, 0136 % but use leave-one-out cross validation), 'none' (no evaluation 0137 % points, effectively producing the global parametric fit). 0138 % Alternatively, a vector/matrix of evaluation points may be 0139 % provided. 0140 % (kd trees not currently supported in mlocfit) 0141 % 'll' and 'ur' -- row vectors specifying the upper and lower limits 0142 % for the bounding box used by the evaluation structure. 0143 % They default to the data range. 0144 % 'mg' For the 'grid' evaluation structure, 'mg' specifies the 0145 % number of points on each margin. Default 10. Can be either a 0146 % single number or vector. 0147 % 'cut' Refinement parameter for adaptive partitions. Default 0.8; 0148 % smaller values result in more refined partitions. 0149 % 'maxk' Controls space assignment for evaluation structures. For the 0150 % adaptive evaluation structures, it is impossible to be sure 0151 % in advance how many vertices will be generated. If you get 0152 % warnings about `Insufficient vertex space', Locfit's default 0153 % assigment can be increased by increasing 'maxk'. The default 0154 % is 'maxk','100'. 0155 % 0156 % 'xlim' For density estimation, Locfit allows the density to be 0157 % supported on a bounded interval (or rectangle, in more than 0158 % one dimension). The format should be [ll;ul] (ie, matrix with 0159 % two rows, d columns) where ll is the lower left corner of 0160 % the rectangle, and ur is the upper right corner. 0161 % One-sided bounds, such as [0,infty), are not supported, but can be 0162 % effectively specified by specifying a very large upper 0163 % bound. 0164 % 0165 % 'what' is used by other functions calling locfit(), 0166 % and should not be used directly. 0167 % 0168 % 0169 % The output of locfit() is a Matlab Cell array: 0170 % fit{1} = data. 0171 % fit{2} = evaluation structure. 0172 % fit{3} = smoothing parameter structure. 0173 % fit{4}{1} = fit points matrix. 0174 % fit{4}{2} = matrix of fitted values etc. 0175 % Note that these are not back-transformed, and may have the 0176 % parametric component removed. 0177 % fit{4}{3} = various details of the evaluation points. 0178 % fit{4}{4} = fit limits. 0179 % fit{4}{5} = family,link. 0180 % fit{5} = parametric component values. 0181 % 0182 0183 0184 0185 % Minimal input validation 0186 if nargin < 1 0187 error( 'At least one input argument required' ); 0188 end 0189 xdata = double(varargin{1}); 0190 d = size(xdata,2); 0191 n = size(xdata,1); 0192 if ((nargin>1) && (~ischar(varargin{2}))) 0193 ydata = double(varargin{2}); 0194 if (any(size(ydata) ~= [n 1])); error('y must be n*1 column vector'); end; 0195 family = 'qgauss'; 0196 na = 3; 0197 else 0198 ydata = 0; 0199 family = 'density'; 0200 na = 2; 0201 end; 0202 if mod(nargin-na,2)==0 0203 error( 'All arguments other than x, y must be name,value pairs' ); 0204 end 0205 0206 0207 wdata = ones(n,1); 0208 cdata = 0; 0209 base = 0; 0210 style = 'n'; 0211 scale = 1; 0212 xl = zeros(2,d); 0213 0214 alpha = [0 0 0]; 0215 deg = 2; 0216 link = 'default'; 0217 acri = 'none'; 0218 kern = 'tcub'; 0219 kt = 'sph'; 0220 0221 ev = 'atree'; 0222 ll = zeros(1,d); 0223 ur = zeros(1,d); 0224 mg = 10; 0225 maxk = 100; 0226 deriv=0; 0227 cut = 0.8; 0228 mdl = 'std'; 0229 0230 while na < length(varargin) 0231 inc = 0; 0232 if (varargin{na}=='y') 0233 ydata = double(varargin{na+1}); 0234 family = 'qgauss'; 0235 inc = 2; 0236 if (any(size(ydata) ~= [n 1])); error('y must be n*1 column vector'); end; 0237 end 0238 if (strcmp(varargin{na},'weights')) 0239 wdata = double(varargin{na+1}); 0240 inc = 2; 0241 if (any(size(wdata) ~= [n 1])); error('weights must be n*1 column vector'); end; 0242 end 0243 if (strcmp(varargin{na},'cens')) 0244 cdata = double(varargin{na+1}); 0245 inc = 2; 0246 if (any(size(cdata) ~= [n 1])); error('cens must be n*1 column vector'); end; 0247 end 0248 if (strcmp(varargin{na},'base')) % numeric vector, n*1 or 1*1. 0249 base = double(varargin{na+1}); 0250 if (length(base)==1); base = base*ones(n,1); end; 0251 inc = 2; 0252 end 0253 if (strcmp(varargin{na},'style')) % character string of length d. 0254 style = varargin{na+1}; 0255 inc = 2; 0256 end; 0257 if (strcmp(varargin{na},'scale')) % row vector, length 1 or d. 0258 scale = varargin{na+1}; 0259 if (scale==0) 0260 scale = zeros(1,d); 0261 for i=1:d 0262 scale(i) = sqrt(var(xdata(:,i))); 0263 end; 0264 end; 0265 inc = 2; 0266 end; 0267 if (strcmp(varargin{na},'xlim')) % 2*d numeric matrix. 0268 xl = varargin{na+1}; 0269 inc = 2; 0270 end 0271 if (strcmp(varargin{na},'alpha')) % row vector of length 1, 2 or 3. 0272 alpha = [varargin{na+1} 0 0 0]; 0273 alpha = alpha(1:3); 0274 inc = 2; 0275 end 0276 if (strcmp(varargin{na},'nn')) % scalar 0277 alpha(1) = varargin{na+1}; 0278 inc = 2; 0279 end 0280 if (strcmp(varargin{na},'h')) % scalar 0281 alpha(2) = varargin{na+1}; 0282 inc = 2; 0283 end; 0284 if (strcmp(varargin{na},'pen')) % scalar 0285 alpha(3) = varargin{na+1}; 0286 inc = 2; 0287 end; 0288 if (strcmp(varargin{na},'acri')) % string 0289 acri = varargin{na+1}; 0290 inc = 2; 0291 end 0292 if (strcmp(varargin{na},'deg')) % positive integer. 0293 deg = varargin{na+1}; 0294 inc = 2; 0295 end; 0296 if (strcmp(varargin{na},'family')) % character string. 0297 family = varargin{na+1}; 0298 inc = 2; 0299 end; 0300 if (strcmp(varargin{na},'link')) % character string. 0301 link = varargin{na+1}; 0302 inc = 2; 0303 end; 0304 if (strcmp(varargin{na},'kern')) % character string. 0305 kern = varargin{na+1}; 0306 inc = 2; 0307 end; 0308 if (strcmp(varargin{na},'kt')) % character string. 0309 kt = varargin{na+1}; 0310 inc = 2; 0311 end; 0312 if (strcmp(varargin{na},'ev')) % char. string, or matrix with d columns. 0313 ev = varargin{na+1}; 0314 if (isnumeric(ev)); ev = ev'; end; 0315 inc = 2; 0316 end; 0317 if (strcmp(varargin{na},'ll')) % row vector of length d. 0318 ll = varargin{na+1}; 0319 inc = 2; 0320 end; 0321 if (strcmp(varargin{na},'ur')) % row vector of length d. 0322 ur = varargin{na+1}; 0323 inc = 2; 0324 end; 0325 if (strcmp(varargin{na},'mg')) % row vector of length d. 0326 mg = varargin{na+1}; 0327 inc = 2; 0328 end; 0329 if (strcmp(varargin{na},'cut')) % positive scalar. 0330 cut = varargin{na+1}; 0331 inc = 2; 0332 end; 0333 if (strcmp(varargin{na},'what')) % string. 0334 mdl = varargin{na+1}; 0335 inc = 2; 0336 end; 0337 if (strcmp(varargin{na},'maxk')) % positive integer. 0338 maxk = varargin{na+1}; 0339 inc = 2; 0340 end; 0341 if (strcmp(varargin{na},'deriv')) % numeric row vector, up to deg elements. 0342 deriv = varargin{na+1}; 0343 inc = 2; 0344 end; 0345 if (inc==0) 0346 disp(varargin{na}); 0347 error('Unknown Input Argument.'); 0348 end; 0349 na=na+2; 0350 end 0351 0352 data = {xdata ydata wdata cdata base style scale xl}; 0353 evs = {ev mdl ll ur mg cut maxk deriv}; 0354 if (alpha==0); alpha = [0.7 0 0]; end; 0355 sp = {alpha acri deg family link kern kt}; 0356 [fpc pcomp] = mexlf(data,evs,sp); 0357 fit = {data evs sp fpc pcomp}; 0358 0359 return;