


Smoothing noisy data using Local Regression and Likelihood.
arguments still to add: dc renorm itype mint maxit debug
Usage: fit = locfit(x,y) % local regression fit of x and y.
fit = locfit(x) % density estimation of x.
Smoothing with locfit is a two-step procedure. The locfit()
function evaluates the local regression smooth at a set of points
(can be specified through an evaluation structure). Then, use
the predict() function to interpolate this fit to other points.
Additional arguments to locfit() are specified as 'name',value pairs, e.g.:
locfit( x, 'alpha',[0.7,1.5] , 'family','rate' , 'ev','grid' , 'mg',100 );
Data-related inputs:
x is a vector or matrix of the independent (or predictor) variables.
Rows of x represent subjects, columns represent variables.
Generally, local regression would be used with 1-4 independent
variables. In higher dimensions, the curse-of-dimensionality,
as well as the difficulty of visualizing higher dimensional
surfaces, may limit usefulness.
y is the column vector of the dependent (or response) variable.
For density families, 'y' is omitted.
NOTE: x and y are the first two arguments. All other arguments require
the 'name',value notation.
'weights' Prior weights for observations (reciprocal of variance, or
sample size).
'cens' Censoring indicators for hazard rate or censored regression.
The coding is '1' (or 'TRUE') for a censored observation, and
'0' (or 'FALSE') for uncensored observations.
'base' Baseline parameter estimate. If a baseline is provided,
the local regression model is fitted as
Y_i = b_i + m(x_i) + epsilon_i,
with Locfit estimating the m(x) term. For regression models,
this effectively subtracts b_i from Y_i. The advantage of the
'base' formulation is that it extends to likelihood
regression models.
'scale' A scale to apply to each variable. This is especially
important for multivariate fitting, where variables may be
measured in non-comparable units. It is also used to specify
the frequency for variables with the 'a' (angular) style.
'sty' Character string (length d) of styles for each predictor variable.
n denotes `normal'; a denotes angular (or periodic); l and r
denotes one-sided left and right; c is conditionally parametric.
Smoothing Parameters and Bandwidths:
The bandwidth (or more accurately, half-width) of the smoothing window
controls the amount of smoothing. Locfit allows specification of constant
(fixed), nearest neighbor, certain locally adaptive variable bandwidths,
and combinations of these. Also related to the smoothing parameter
are the local polynmial degree and weight function.
'nn' 'Nearest neighbor' smoothing parameter. Specifying 'nn',0.5
means that the width of each smoothing neighborhood is chosen
to cover 50% of the data.
'h' A constant (or fixed) bandwidth parameter. For example, 'h',2
means that the smoothing windows have constant half-width
(or radius) 2. Note that h is applied after scaling.
'pen' penalty parameter for adaptive smoothing. Needs to be used
with care.
'alpha' The old way of specifying smoothing parameters, as used in
my book. alpha is equivalent to the vector [nn,h,pen].
If multiple componenents are non-zero, the largest corresponding
bandwidth is used. The default (if none of alpha,nn,h,pen
are provided) is [0.7 0 0].
'deg' Degree of local polynomial. Default: 2 (local quadratic).
Degrees 0 to 3 are supported by almost all parts of the
Locfit code. Higher degrees may work in some cases.
'kern' Weight function, default = 'tcub'. Other choices are
'rect', 'trwt', 'tria', 'epan', 'bisq' and 'gauss'.
Choices may be restricted when derivatives are
required; e.g. for confidence bands and some bandwidth
selectors.
'kt' Kernel type, 'sph' (default); 'prod'. In multivariate
problems, 'prod' uses a simplified product model which
speeds up computations.
'acri' Criterion for adaptive bandwidth selection.
Derivative Estimation.
Generally I recommend caution when using derivative estimation
(and especially higher order derivative estimation) -- can you
really estimate derivatives from noisy data? Any derivative
estimate is inherently more dependent on an assumed smoothness
(expressed through the bandwidth) than the data. Warnings aside...
'deriv' Derivative estimation. 'deriv',1 specifies the first derivative
(or more correctly, an estimate of the local slope is returned.
'deriv',[1 1] specifies the second derivative. For bivariate fits
'deriv',2 specifies the first partial derivative wrt x2.
'deriv',[1 2] is mixed second-order derivative.
Fitting family.
'family' is used to specify the local likelihood family.
Regression-type families are 'gaussian', 'binomial',
'poisson', 'gamma' and 'geom'. If the family is preceded
by a q (e.g. 'qgauss', or 'qpois') then quasi-likelihood is
used; in particular, a dispersion estimate is computed.
Preceding by an 'r' makes an attempt at robust (outlier-resistant)
estimation. Combining q and r (e.g. 'family','qrpois') may
work, if you're lucky.
Density estimation-type families are 'dens', 'rate' and 'hazard'
(hazard or failure rate). Note that `dens' scales the output
to be a statistical density estimate (i.e. scaled to integrate
to 1). 'rate' estimates the rate or intensity function (events
per unit time, or events per unit area), which may be called
density in some fields.
The default family is 'qgauss' if a response (y argument) has been
provided, and 'dens' if no response is given.
'link' Link function for local likelihood fitting. Depending on the
family, choices may be 'ident', 'log', 'logit',
'inverse', 'sqrt' and 'arcsin'.
Evaluation structures.
By default, locfit chooses a set of points, depending on the data
and smoothing parameters, to evaluate at. This is controlled by
the evaluation structure.
'ev' Specify the evaluation structure. Default is 'tree'.
Other choices include 'phull' (triangulation), 'grid' (a grid
of points), 'data' (each data point), 'crossval' (data,
but use leave-one-out cross validation), 'none' (no evaluation
points, effectively producing the global parametric fit).
Alternatively, a vector/matrix of evaluation points may be
provided.
(kd trees not currently supported in mlocfit)
'll' and 'ur' -- row vectors specifying the upper and lower limits
for the bounding box used by the evaluation structure.
They default to the data range.
'mg' For the 'grid' evaluation structure, 'mg' specifies the
number of points on each margin. Default 10. Can be either a
single number or vector.
'cut' Refinement parameter for adaptive partitions. Default 0.8;
smaller values result in more refined partitions.
'maxk' Controls space assignment for evaluation structures. For the
adaptive evaluation structures, it is impossible to be sure
in advance how many vertices will be generated. If you get
warnings about `Insufficient vertex space', Locfit's default
assigment can be increased by increasing 'maxk'. The default
is 'maxk','100'.
'xlim' For density estimation, Locfit allows the density to be
supported on a bounded interval (or rectangle, in more than
one dimension). The format should be [ll;ul] (ie, matrix with
two rows, d columns) where ll is the lower left corner of
the rectangle, and ur is the upper right corner.
One-sided bounds, such as [0,infty), are not supported, but can be
effectively specified by specifying a very large upper
bound.
'what' is used by other functions calling locfit(),
and should not be used directly.
The output of locfit() is a Matlab Cell array:
fit{1} = data.
fit{2} = evaluation structure.
fit{3} = smoothing parameter structure.
fit{4}{1} = fit points matrix.
fit{4}{2} = matrix of fitted values etc.
Note that these are not back-transformed, and may have the
parametric component removed.
fit{4}{3} = various details of the evaluation points.
fit{4}{4} = fit limits.
fit{4}{5} = family,link.
fit{5} = parametric component values.

0001 function fit=locfit(varargin) 0002 0003 % Smoothing noisy data using Local Regression and Likelihood. 0004 % 0005 % arguments still to add: dc renorm itype mint maxit debug 0006 % 0007 % Usage: fit = locfit(x,y) % local regression fit of x and y. 0008 % fit = locfit(x) % density estimation of x. 0009 % 0010 % Smoothing with locfit is a two-step procedure. The locfit() 0011 % function evaluates the local regression smooth at a set of points 0012 % (can be specified through an evaluation structure). Then, use 0013 % the predict() function to interpolate this fit to other points. 0014 % 0015 % Additional arguments to locfit() are specified as 'name',value pairs, e.g.: 0016 % locfit( x, 'alpha',[0.7,1.5] , 'family','rate' , 'ev','grid' , 'mg',100 ); 0017 % 0018 % 0019 % Data-related inputs: 0020 % 0021 % x is a vector or matrix of the independent (or predictor) variables. 0022 % Rows of x represent subjects, columns represent variables. 0023 % Generally, local regression would be used with 1-4 independent 0024 % variables. In higher dimensions, the curse-of-dimensionality, 0025 % as well as the difficulty of visualizing higher dimensional 0026 % surfaces, may limit usefulness. 0027 % 0028 % y is the column vector of the dependent (or response) variable. 0029 % For density families, 'y' is omitted. 0030 % NOTE: x and y are the first two arguments. All other arguments require 0031 % the 'name',value notation. 0032 % 0033 % 'weights' Prior weights for observations (reciprocal of variance, or 0034 % sample size). 0035 % 'cens' Censoring indicators for hazard rate or censored regression. 0036 % The coding is '1' (or 'TRUE') for a censored observation, and 0037 % '0' (or 'FALSE') for uncensored observations. 0038 % 'base' Baseline parameter estimate. If a baseline is provided, 0039 % the local regression model is fitted as 0040 % Y_i = b_i + m(x_i) + epsilon_i, 0041 % with Locfit estimating the m(x) term. For regression models, 0042 % this effectively subtracts b_i from Y_i. The advantage of the 0043 % 'base' formulation is that it extends to likelihood 0044 % regression models. 0045 % 'scale' A scale to apply to each variable. This is especially 0046 % important for multivariate fitting, where variables may be 0047 % measured in non-comparable units. It is also used to specify 0048 % the frequency for variables with the 'a' (angular) style. 0049 % 'sty' Character string (length d) of styles for each predictor variable. 0050 % n denotes `normal'; a denotes angular (or periodic); l and r 0051 % denotes one-sided left and right; c is conditionally parametric. 0052 % 0053 % 0054 % Smoothing Parameters and Bandwidths: 0055 % The bandwidth (or more accurately, half-width) of the smoothing window 0056 % controls the amount of smoothing. Locfit allows specification of constant 0057 % (fixed), nearest neighbor, certain locally adaptive variable bandwidths, 0058 % and combinations of these. Also related to the smoothing parameter 0059 % are the local polynmial degree and weight function. 0060 % 0061 % 'nn' 'Nearest neighbor' smoothing parameter. Specifying 'nn',0.5 0062 % means that the width of each smoothing neighborhood is chosen 0063 % to cover 50% of the data. 0064 % 0065 % 'h' A constant (or fixed) bandwidth parameter. For example, 'h',2 0066 % means that the smoothing windows have constant half-width 0067 % (or radius) 2. Note that h is applied after scaling. 0068 % 0069 % 'pen' penalty parameter for adaptive smoothing. Needs to be used 0070 % with care. 0071 % 0072 % 'alpha' The old way of specifying smoothing parameters, as used in 0073 % my book. alpha is equivalent to the vector [nn,h,pen]. 0074 % If multiple componenents are non-zero, the largest corresponding 0075 % bandwidth is used. The default (if none of alpha,nn,h,pen 0076 % are provided) is [0.7 0 0]. 0077 % 0078 % 'deg' Degree of local polynomial. Default: 2 (local quadratic). 0079 % Degrees 0 to 3 are supported by almost all parts of the 0080 % Locfit code. Higher degrees may work in some cases. 0081 % 0082 % 'kern' Weight function, default = 'tcub'. Other choices are 0083 % 'rect', 'trwt', 'tria', 'epan', 'bisq' and 'gauss'. 0084 % Choices may be restricted when derivatives are 0085 % required; e.g. for confidence bands and some bandwidth 0086 % selectors. 0087 % 0088 % 'kt' Kernel type, 'sph' (default); 'prod'. In multivariate 0089 % problems, 'prod' uses a simplified product model which 0090 % speeds up computations. 0091 % 0092 % 'acri' Criterion for adaptive bandwidth selection. 0093 % 0094 % 0095 % Derivative Estimation. 0096 % Generally I recommend caution when using derivative estimation 0097 % (and especially higher order derivative estimation) -- can you 0098 % really estimate derivatives from noisy data? Any derivative 0099 % estimate is inherently more dependent on an assumed smoothness 0100 % (expressed through the bandwidth) than the data. Warnings aside... 0101 % 0102 % 'deriv' Derivative estimation. 'deriv',1 specifies the first derivative 0103 % (or more correctly, an estimate of the local slope is returned. 0104 % 'deriv',[1 1] specifies the second derivative. For bivariate fits 0105 % 'deriv',2 specifies the first partial derivative wrt x2. 0106 % 'deriv',[1 2] is mixed second-order derivative. 0107 % 0108 % Fitting family. 0109 % 'family' is used to specify the local likelihood family. 0110 % Regression-type families are 'gaussian', 'binomial', 0111 % 'poisson', 'gamma' and 'geom'. If the family is preceded 0112 % by a q (e.g. 'qgauss', or 'qpois') then quasi-likelihood is 0113 % used; in particular, a dispersion estimate is computed. 0114 % Preceding by an 'r' makes an attempt at robust (outlier-resistant) 0115 % estimation. Combining q and r (e.g. 'family','qrpois') may 0116 % work, if you're lucky. 0117 % Density estimation-type families are 'dens', 'rate' and 'hazard' 0118 % (hazard or failure rate). Note that `dens' scales the output 0119 % to be a statistical density estimate (i.e. scaled to integrate 0120 % to 1). 'rate' estimates the rate or intensity function (events 0121 % per unit time, or events per unit area), which may be called 0122 % density in some fields. 0123 % The default family is 'qgauss' if a response (y argument) has been 0124 % provided, and 'dens' if no response is given. 0125 % 'link' Link function for local likelihood fitting. Depending on the 0126 % family, choices may be 'ident', 'log', 'logit', 0127 % 'inverse', 'sqrt' and 'arcsin'. 0128 % 0129 % Evaluation structures. 0130 % By default, locfit chooses a set of points, depending on the data 0131 % and smoothing parameters, to evaluate at. This is controlled by 0132 % the evaluation structure. 0133 % 'ev' Specify the evaluation structure. Default is 'tree'. 0134 % Other choices include 'phull' (triangulation), 'grid' (a grid 0135 % of points), 'data' (each data point), 'crossval' (data, 0136 % but use leave-one-out cross validation), 'none' (no evaluation 0137 % points, effectively producing the global parametric fit). 0138 % Alternatively, a vector/matrix of evaluation points may be 0139 % provided. 0140 % (kd trees not currently supported in mlocfit) 0141 % 'll' and 'ur' -- row vectors specifying the upper and lower limits 0142 % for the bounding box used by the evaluation structure. 0143 % They default to the data range. 0144 % 'mg' For the 'grid' evaluation structure, 'mg' specifies the 0145 % number of points on each margin. Default 10. Can be either a 0146 % single number or vector. 0147 % 'cut' Refinement parameter for adaptive partitions. Default 0.8; 0148 % smaller values result in more refined partitions. 0149 % 'maxk' Controls space assignment for evaluation structures. For the 0150 % adaptive evaluation structures, it is impossible to be sure 0151 % in advance how many vertices will be generated. If you get 0152 % warnings about `Insufficient vertex space', Locfit's default 0153 % assigment can be increased by increasing 'maxk'. The default 0154 % is 'maxk','100'. 0155 % 0156 % 'xlim' For density estimation, Locfit allows the density to be 0157 % supported on a bounded interval (or rectangle, in more than 0158 % one dimension). The format should be [ll;ul] (ie, matrix with 0159 % two rows, d columns) where ll is the lower left corner of 0160 % the rectangle, and ur is the upper right corner. 0161 % One-sided bounds, such as [0,infty), are not supported, but can be 0162 % effectively specified by specifying a very large upper 0163 % bound. 0164 % 0165 % 'what' is used by other functions calling locfit(), 0166 % and should not be used directly. 0167 % 0168 % 0169 % The output of locfit() is a Matlab Cell array: 0170 % fit{1} = data. 0171 % fit{2} = evaluation structure. 0172 % fit{3} = smoothing parameter structure. 0173 % fit{4}{1} = fit points matrix. 0174 % fit{4}{2} = matrix of fitted values etc. 0175 % Note that these are not back-transformed, and may have the 0176 % parametric component removed. 0177 % fit{4}{3} = various details of the evaluation points. 0178 % fit{4}{4} = fit limits. 0179 % fit{4}{5} = family,link. 0180 % fit{5} = parametric component values. 0181 % 0182 0183 0184 0185 % Minimal input validation 0186 if nargin < 1 0187 error( 'At least one input argument required' ); 0188 end 0189 0190 xdata = double(varargin{1}); 0191 d = size(xdata,2); 0192 n = size(xdata,1); 0193 if ((nargin>1) && (~ischar(varargin{2}))) 0194 ydata = double(varargin{2}); 0195 if (any(size(ydata) ~= [n 1])) error('y must be n*1 column vector'); end; 0196 family = 'qgauss'; 0197 na = 3; 0198 else 0199 ydata = 0; 0200 family = 'density'; 0201 na = 2; 0202 end; 0203 if mod(nargin-na,2)==0 0204 error( 'All arguments other than x, y must be name,value pairs' ); 0205 end 0206 0207 0208 wdata = ones(n,1); 0209 cdata = 0; 0210 base = 0; 0211 style = 'n'; 0212 scale = 1; 0213 xl = zeros(2,d); 0214 0215 alpha = [0 0 0]; 0216 deg = 2; 0217 link = 'default'; 0218 acri = 'none'; 0219 kern = 'tcub'; 0220 kt = 'sph'; 0221 0222 ev = 'atree'; 0223 ll = zeros(1,d); 0224 ur = zeros(1,d); 0225 mg = 10; 0226 maxk = 100; 0227 deriv=0; 0228 cut = 0.8; 0229 mdl = 'std'; 0230 0231 while na < length(varargin) 0232 inc = 0; 0233 if (varargin{na}=='y') 0234 ydata = double(varargin{na+1}); 0235 family = 'qgauss'; 0236 inc = 2; 0237 if (any(size(ydata) ~= [n 1])) error('y must be n*1 column vector'); end; 0238 end 0239 if (strcmp(varargin{na},'weights')) 0240 wdata = double(varargin{na+1}); 0241 inc = 2; 0242 if (any(size(wdata) ~= [n 1])) error('weights must be n*1 column vector'); end; 0243 end 0244 if (strcmp(varargin{na},'cens')) 0245 cdata = double(varargin{na+1}); 0246 inc = 2; 0247 if (any(size(cdata) ~= [n 1])) error('cens must be n*1 column vector'); end; 0248 end 0249 if (strcmp(varargin{na},'base')) % numeric vector, n*1 or 1*1. 0250 base = double(varargin{na+1}); 0251 if (length(base)==1) base = base*ones(n,1); end; 0252 inc = 2; 0253 end 0254 if (strcmp(varargin{na},'style')) % character string of length d. 0255 style = varargin{na+1}; 0256 inc = 2; 0257 end; 0258 if (strcmp(varargin{na},'scale')) % row vector, length 1 or d. 0259 scale = varargin{na+1}; 0260 if (scale==0) 0261 scale = zeros(1,d); 0262 for i=1:d 0263 scale(i) = sqrt(var(xdata(:,i))); 0264 end; 0265 end; 0266 inc = 2; 0267 end; 0268 if (strcmp(varargin{na},'xlim')) % 2*d numeric matrix. 0269 xl = varargin{na+1}; 0270 inc = 2; 0271 end 0272 if (strcmp(varargin{na},'alpha')) % row vector of length 1, 2 or 3. 0273 alpha = [varargin{na+1} 0 0 0]; 0274 alpha = alpha(1:3); 0275 inc = 2; 0276 end 0277 if (strcmp(varargin{na},'nn')) % scalar 0278 alpha(1) = varargin{na+1}; 0279 inc = 2; 0280 end 0281 if (strcmp(varargin{na},'h')) % scalar 0282 alpha(2) = varargin{na+1}; 0283 inc = 2; 0284 end; 0285 if (strcmp(varargin{na},'pen')) % scalar 0286 alpha(3) = varargin{na+1}; 0287 inc = 2; 0288 end; 0289 if (strcmp(varargin{na},'acri')) % string 0290 acri = varargin{na+1}; 0291 inc = 2; 0292 end 0293 if (strcmp(varargin{na},'deg')) % positive integer. 0294 deg = varargin{na+1}; 0295 inc = 2; 0296 end; 0297 if (strcmp(varargin{na},'family')) % character string. 0298 family = varargin{na+1}; 0299 inc = 2; 0300 end; 0301 if (strcmp(varargin{na},'link')) % character string. 0302 link = varargin{na+1}; 0303 inc = 2; 0304 end; 0305 if (strcmp(varargin{na},'kern')) % character string. 0306 kern = varargin{na+1}; 0307 inc = 2; 0308 end; 0309 if (strcmp(varargin{na},'kt')) % character string. 0310 kt = varargin{na+1}; 0311 inc = 2; 0312 end; 0313 if (strcmp(varargin{na},'ev')) % char. string, or matrix with d columns. 0314 ev = varargin{na+1}; 0315 if (isnumeric(ev)) ev = ev'; end; 0316 inc = 2; 0317 end; 0318 if (strcmp(varargin{na},'ll')) % row vector of length d. 0319 ll = varargin{na+1}; 0320 inc = 2; 0321 end; 0322 if (strcmp(varargin{na},'ur')) % row vector of length d. 0323 ur = varargin{na+1}; 0324 inc = 2; 0325 end; 0326 if (strcmp(varargin{na},'mg')) % row vector of length d. 0327 mg = varargin{na+1}; 0328 inc = 2; 0329 end; 0330 if (strcmp(varargin{na},'cut')) % positive scalar. 0331 cut = varargin{na+1}; 0332 inc = 2; 0333 end; 0334 if (strcmp(varargin{na},'what')) % string. 0335 mdl = varargin{na+1}; 0336 inc = 2; 0337 end; 0338 if (strcmp(varargin{na},'maxk')) % positive integer. 0339 maxk = varargin{na+1}; 0340 inc = 2; 0341 end; 0342 if (strcmp(varargin{na},'deriv')) % numeric row vector, up to deg elements. 0343 deriv = varargin{na+1}; 0344 inc = 2; 0345 end; 0346 if (inc==0) 0347 disp(varargin{na}); 0348 error('Unknown Input Argument.'); 0349 end; 0350 na=na+2; 0351 end 0352 0353 data = {xdata ydata wdata cdata base style scale xl}; 0354 evs = {ev mdl ll ur mg cut maxk deriv}; 0355 if (alpha==0) alpha = [0.7 0 0]; end; 0356 sp = {alpha acri deg family link kern kt}; 0357 [fpc pcomp] = mexlf(data,evs,sp); 0358 fit = {data evs sp fpc pcomp}; 0359 0360 return;