Home > chronux_1_1 > locfit > locfit.m

locfit

PURPOSE ^

Smoothing noisy data using Local Regression and Likelihood.

SYNOPSIS ^

function fit=locfit(varargin)

DESCRIPTION ^

 Smoothing noisy data using Local Regression and Likelihood.

 arguments still to add: dc renorm itype mint maxit debug

  Usage: fit = locfit(x,y)   % local regression fit of x and y.
         fit = locfit(x)     % density estimation of x.

  Smoothing with locfit is a two-step procedure. The locfit()
  function evaluates the local regression smooth at a set of points
  (can be specified through an evaluation structure). Then, use
  the predict() function to interpolate this fit to other points.

  Additional arguments to locfit() are specified as 'name',value pairs, e.g.:
  locfit( x, 'alpha',[0.7,1.5] , 'family','rate' , 'ev','grid' , 'mg',100 ); 


  Data-related inputs:

    x is a vector or matrix of the independent (or predictor) variables.
      Rows of x represent subjects, columns represent variables.
      Generally, local regression would be used with 1-4 independent
      variables. In higher dimensions, the curse-of-dimensionality,
      as well as the difficulty of visualizing higher dimensional
      surfaces, may limit usefulness.

    y is the column vector of the dependent (or response) variable.
      For density families, 'y' is omitted. 
 NOTE: x and y are the first two arguments. All other arguments require
        the 'name',value notation.

    'weights' Prior weights for observations (reciprocal of variance, or
           sample size). 
    'cens' Censoring indicators for hazard rate or censored regression.
           The coding is '1' (or 'TRUE') for a censored observation, and
           '0' (or 'FALSE') for uncensored observations. 
    'base' Baseline parameter estimate. If a baseline is provided,
           the local regression model is fitted as
                        Y_i = b_i + m(x_i) + epsilon_i,
           with Locfit estimating the m(x) term. For regression models,
           this effectively subtracts b_i from Y_i. The advantage of the
           'base' formulation is that it extends to likelihood
           regression models. 
    'scale' A scale to apply to each variable. This is especially
           important for multivariate fitting, where variables may be
           measured in non-comparable units. It is also used to specify
           the frequency for variables with the 'a' (angular) style.
     'sty' Character string (length d) of styles for each predictor variable.
           n denotes `normal'; a denotes angular (or periodic); l and r
           denotes one-sided left and right; c is conditionally parametric.
 

  Smoothing Parameters and Bandwidths:
  The bandwidth (or more accurately, half-width) of the smoothing window
  controls the amount of smoothing. Locfit allows specification of constant
  (fixed), nearest neighbor, certain locally adaptive variable bandwidths,
  and combinations of these. Also related to the smoothing parameter
  are the local polynmial degree and weight function.

    'nn' 'Nearest neighbor' smoothing parameter. Specifying 'nn',0.5
         means that the width of each smoothing neighborhood is chosen
         to cover 50% of the data.

     'h' A constant (or fixed) bandwidth parameter. For example, 'h',2
         means that the smoothing windows have constant half-width
         (or radius) 2. Note that h is applied after scaling.

   'pen' penalty parameter for adaptive smoothing. Needs to be used
         with care.

  'alpha' The old way of specifying smoothing parameters, as used in
         my book. alpha is equivalent to the vector [nn,h,pen].
         If multiple componenents are non-zero, the largest corresponding
         bandwidth is used. The default (if none of alpha,nn,h,pen
         are provided) is [0.7 0 0].

   'deg' Degree of local polynomial. Default: 2 (local quadratic).
         Degrees 0 to 3 are supported by almost all parts of the
         Locfit code. Higher degrees may work in some cases. 
 
  'kern' Weight function, default = 'tcub'. Other choices are
         'rect', 'trwt', 'tria', 'epan', 'bisq' and 'gauss'.
         Choices may be restricted when derivatives are
         required; e.g. for confidence bands and some bandwidth
         selectors. 
 
    'kt' Kernel type, 'sph' (default); 'prod'. In multivariate
         problems, 'prod' uses a simplified product model which
         speeds up computations. 
 
  'acri' Criterion for adaptive bandwidth selection.
 

  Derivative Estimation.
  Generally I recommend caution when using derivative estimation
  (and especially higher order derivative estimation) -- can you
  really estimate derivatives from noisy data? Any derivative
  estimate is inherently more dependent on an assumed smoothness
  (expressed through the bandwidth) than the data. Warnings aside...

  'deriv' Derivative estimation. 'deriv',1 specifies the first derivative
         (or more correctly, an estimate of the local slope is returned.
         'deriv',[1 1] specifies the second derivative. For bivariate fits
         'deriv',2 specifies the first partial derivative wrt x2.
         'deriv',[1 2] is mixed second-order derivative.
 
  Fitting family.
  'family' is used to specify the local likelihood family.
         Regression-type families are 'gaussian', 'binomial',
           'poisson', 'gamma' and 'geom'. If the family is preceded
           by a q (e.g. 'qgauss', or 'qpois') then quasi-likelihood is
           used; in particular, a dispersion estimate is computed.
           Preceding by an 'r' makes an attempt at robust (outlier-resistant)
           estimation. Combining q and r (e.g. 'family','qrpois') may
           work, if you're lucky.
         Density estimation-type families are 'dens', 'rate' and 'hazard'
           (hazard or failure rate). Note that `dens' scales the output
           to be a statistical density estimate (i.e. scaled to integrate
           to 1). 'rate' estimates the rate or intensity function (events
           per unit time, or events per unit area), which may be called
           density in some fields.
         The default family is 'qgauss' if a response (y argument) has been
         provided, and 'dens' if no response is given.
    'link' Link function for local likelihood fitting. Depending on the
           family, choices may be 'ident', 'log', 'logit',
           'inverse', 'sqrt' and 'arcsin'. 
 
  Evaluation structures.
    By default, locfit chooses a set of points, depending on the data
    and smoothing parameters, to evaluate at. This is controlled by
    the evaluation structure.
      'ev' Specify the evaluation structure. Default is 'tree'.
           Other choices include 'phull' (triangulation), 'grid' (a grid
           of points), 'data' (each data point), 'crossval' (data,
           but use leave-one-out cross validation), 'none' (no evaluation
           points, effectively producing the global parametric fit).
           Alternatively, a vector/matrix of evaluation points may be
           provided. 
           (kd trees not currently supported in mlocfit)
     'll' and 'ur' -- row vectors specifying the upper and lower limits
           for the bounding box used by the evaluation structure.
           They default to the data range. 
     'mg' For the 'grid' evaluation structure, 'mg' specifies the
           number of points on each margin. Default 10. Can be either a
           single number or vector. 
    'cut' Refinement parameter for adaptive partitions. Default 0.8;
           smaller values result in more refined partitions. 
    'maxk' Controls space assignment for evaluation structures. For the
           adaptive evaluation structures, it is impossible to be sure
           in advance how many vertices will be generated. If you get
           warnings about `Insufficient vertex space', Locfit's default
           assigment can be increased by increasing 'maxk'. The default
           is 'maxk','100'. 

    'xlim' For density estimation, Locfit allows the density to be
           supported on a bounded interval (or rectangle, in more than
           one dimension). The format should be [ll;ul] (ie, matrix with
           two rows, d columns) where ll is the lower left corner of
           the rectangle, and ur is the upper right corner.
           One-sided bounds, such as [0,infty), are not supported, but can be
           effectively specified by specifying a very large upper
           bound. 
 
      'what' is used by other functions calling locfit(),
      and should not be used directly.
 
 
  The output of locfit() is a Matlab Cell array:
    fit{1} = data.
    fit{2} = evaluation structure.
    fit{3} = smoothing parameter structure.
    fit{4}{1} = fit points matrix.
    fit{4}{2} = matrix of fitted values etc.
           Note that these are not back-transformed, and may have the
           parametric component removed.
    fit{4}{3} = various details of the evaluation points.
    fit{4}{4} = fit limits.
    fit{4}{5} = family,link.
    fit{5} = parametric component values.

CROSS-REFERENCE INFORMATION ^

This function calls: This function is called by:

SOURCE CODE ^

0001 function fit=locfit(varargin)
0002 
0003 % Smoothing noisy data using Local Regression and Likelihood.
0004 %
0005 % arguments still to add: dc renorm itype mint maxit debug
0006 %
0007 %  Usage: fit = locfit(x,y)   % local regression fit of x and y.
0008 %         fit = locfit(x)     % density estimation of x.
0009 %
0010 %  Smoothing with locfit is a two-step procedure. The locfit()
0011 %  function evaluates the local regression smooth at a set of points
0012 %  (can be specified through an evaluation structure). Then, use
0013 %  the predict() function to interpolate this fit to other points.
0014 %
0015 %  Additional arguments to locfit() are specified as 'name',value pairs, e.g.:
0016 %  locfit( x, 'alpha',[0.7,1.5] , 'family','rate' , 'ev','grid' , 'mg',100 );
0017 %
0018 %
0019 %  Data-related inputs:
0020 %
0021 %    x is a vector or matrix of the independent (or predictor) variables.
0022 %      Rows of x represent subjects, columns represent variables.
0023 %      Generally, local regression would be used with 1-4 independent
0024 %      variables. In higher dimensions, the curse-of-dimensionality,
0025 %      as well as the difficulty of visualizing higher dimensional
0026 %      surfaces, may limit usefulness.
0027 %
0028 %    y is the column vector of the dependent (or response) variable.
0029 %      For density families, 'y' is omitted.
0030 % NOTE: x and y are the first two arguments. All other arguments require
0031 %        the 'name',value notation.
0032 %
0033 %    'weights' Prior weights for observations (reciprocal of variance, or
0034 %           sample size).
0035 %    'cens' Censoring indicators for hazard rate or censored regression.
0036 %           The coding is '1' (or 'TRUE') for a censored observation, and
0037 %           '0' (or 'FALSE') for uncensored observations.
0038 %    'base' Baseline parameter estimate. If a baseline is provided,
0039 %           the local regression model is fitted as
0040 %                        Y_i = b_i + m(x_i) + epsilon_i,
0041 %           with Locfit estimating the m(x) term. For regression models,
0042 %           this effectively subtracts b_i from Y_i. The advantage of the
0043 %           'base' formulation is that it extends to likelihood
0044 %           regression models.
0045 %    'scale' A scale to apply to each variable. This is especially
0046 %           important for multivariate fitting, where variables may be
0047 %           measured in non-comparable units. It is also used to specify
0048 %           the frequency for variables with the 'a' (angular) style.
0049 %     'sty' Character string (length d) of styles for each predictor variable.
0050 %           n denotes `normal'; a denotes angular (or periodic); l and r
0051 %           denotes one-sided left and right; c is conditionally parametric.
0052 %
0053 %
0054 %  Smoothing Parameters and Bandwidths:
0055 %  The bandwidth (or more accurately, half-width) of the smoothing window
0056 %  controls the amount of smoothing. Locfit allows specification of constant
0057 %  (fixed), nearest neighbor, certain locally adaptive variable bandwidths,
0058 %  and combinations of these. Also related to the smoothing parameter
0059 %  are the local polynmial degree and weight function.
0060 %
0061 %    'nn' 'Nearest neighbor' smoothing parameter. Specifying 'nn',0.5
0062 %         means that the width of each smoothing neighborhood is chosen
0063 %         to cover 50% of the data.
0064 %
0065 %     'h' A constant (or fixed) bandwidth parameter. For example, 'h',2
0066 %         means that the smoothing windows have constant half-width
0067 %         (or radius) 2. Note that h is applied after scaling.
0068 %
0069 %   'pen' penalty parameter for adaptive smoothing. Needs to be used
0070 %         with care.
0071 %
0072 %  'alpha' The old way of specifying smoothing parameters, as used in
0073 %         my book. alpha is equivalent to the vector [nn,h,pen].
0074 %         If multiple componenents are non-zero, the largest corresponding
0075 %         bandwidth is used. The default (if none of alpha,nn,h,pen
0076 %         are provided) is [0.7 0 0].
0077 %
0078 %   'deg' Degree of local polynomial. Default: 2 (local quadratic).
0079 %         Degrees 0 to 3 are supported by almost all parts of the
0080 %         Locfit code. Higher degrees may work in some cases.
0081 %
0082 %  'kern' Weight function, default = 'tcub'. Other choices are
0083 %         'rect', 'trwt', 'tria', 'epan', 'bisq' and 'gauss'.
0084 %         Choices may be restricted when derivatives are
0085 %         required; e.g. for confidence bands and some bandwidth
0086 %         selectors.
0087 %
0088 %    'kt' Kernel type, 'sph' (default); 'prod'. In multivariate
0089 %         problems, 'prod' uses a simplified product model which
0090 %         speeds up computations.
0091 %
0092 %  'acri' Criterion for adaptive bandwidth selection.
0093 %
0094 %
0095 %  Derivative Estimation.
0096 %  Generally I recommend caution when using derivative estimation
0097 %  (and especially higher order derivative estimation) -- can you
0098 %  really estimate derivatives from noisy data? Any derivative
0099 %  estimate is inherently more dependent on an assumed smoothness
0100 %  (expressed through the bandwidth) than the data. Warnings aside...
0101 %
0102 %  'deriv' Derivative estimation. 'deriv',1 specifies the first derivative
0103 %         (or more correctly, an estimate of the local slope is returned.
0104 %         'deriv',[1 1] specifies the second derivative. For bivariate fits
0105 %         'deriv',2 specifies the first partial derivative wrt x2.
0106 %         'deriv',[1 2] is mixed second-order derivative.
0107 %
0108 %  Fitting family.
0109 %  'family' is used to specify the local likelihood family.
0110 %         Regression-type families are 'gaussian', 'binomial',
0111 %           'poisson', 'gamma' and 'geom'. If the family is preceded
0112 %           by a q (e.g. 'qgauss', or 'qpois') then quasi-likelihood is
0113 %           used; in particular, a dispersion estimate is computed.
0114 %           Preceding by an 'r' makes an attempt at robust (outlier-resistant)
0115 %           estimation. Combining q and r (e.g. 'family','qrpois') may
0116 %           work, if you're lucky.
0117 %         Density estimation-type families are 'dens', 'rate' and 'hazard'
0118 %           (hazard or failure rate). Note that `dens' scales the output
0119 %           to be a statistical density estimate (i.e. scaled to integrate
0120 %           to 1). 'rate' estimates the rate or intensity function (events
0121 %           per unit time, or events per unit area), which may be called
0122 %           density in some fields.
0123 %         The default family is 'qgauss' if a response (y argument) has been
0124 %         provided, and 'dens' if no response is given.
0125 %    'link' Link function for local likelihood fitting. Depending on the
0126 %           family, choices may be 'ident', 'log', 'logit',
0127 %           'inverse', 'sqrt' and 'arcsin'.
0128 %
0129 %  Evaluation structures.
0130 %    By default, locfit chooses a set of points, depending on the data
0131 %    and smoothing parameters, to evaluate at. This is controlled by
0132 %    the evaluation structure.
0133 %      'ev' Specify the evaluation structure. Default is 'tree'.
0134 %           Other choices include 'phull' (triangulation), 'grid' (a grid
0135 %           of points), 'data' (each data point), 'crossval' (data,
0136 %           but use leave-one-out cross validation), 'none' (no evaluation
0137 %           points, effectively producing the global parametric fit).
0138 %           Alternatively, a vector/matrix of evaluation points may be
0139 %           provided.
0140 %           (kd trees not currently supported in mlocfit)
0141 %     'll' and 'ur' -- row vectors specifying the upper and lower limits
0142 %           for the bounding box used by the evaluation structure.
0143 %           They default to the data range.
0144 %     'mg' For the 'grid' evaluation structure, 'mg' specifies the
0145 %           number of points on each margin. Default 10. Can be either a
0146 %           single number or vector.
0147 %    'cut' Refinement parameter for adaptive partitions. Default 0.8;
0148 %           smaller values result in more refined partitions.
0149 %    'maxk' Controls space assignment for evaluation structures. For the
0150 %           adaptive evaluation structures, it is impossible to be sure
0151 %           in advance how many vertices will be generated. If you get
0152 %           warnings about `Insufficient vertex space', Locfit's default
0153 %           assigment can be increased by increasing 'maxk'. The default
0154 %           is 'maxk','100'.
0155 %
0156 %    'xlim' For density estimation, Locfit allows the density to be
0157 %           supported on a bounded interval (or rectangle, in more than
0158 %           one dimension). The format should be [ll;ul] (ie, matrix with
0159 %           two rows, d columns) where ll is the lower left corner of
0160 %           the rectangle, and ur is the upper right corner.
0161 %           One-sided bounds, such as [0,infty), are not supported, but can be
0162 %           effectively specified by specifying a very large upper
0163 %           bound.
0164 %
0165 %      'what' is used by other functions calling locfit(),
0166 %      and should not be used directly.
0167 %
0168 %
0169 %  The output of locfit() is a Matlab Cell array:
0170 %    fit{1} = data.
0171 %    fit{2} = evaluation structure.
0172 %    fit{3} = smoothing parameter structure.
0173 %    fit{4}{1} = fit points matrix.
0174 %    fit{4}{2} = matrix of fitted values etc.
0175 %           Note that these are not back-transformed, and may have the
0176 %           parametric component removed.
0177 %    fit{4}{3} = various details of the evaluation points.
0178 %    fit{4}{4} = fit limits.
0179 %    fit{4}{5} = family,link.
0180 %    fit{5} = parametric component values.
0181 %
0182 
0183 
0184 
0185 % Minimal input validation
0186 if nargin < 1
0187    error( 'At least one input argument required' );
0188 end
0189 xdata = double(varargin{1});
0190 d = size(xdata,2);
0191 n = size(xdata,1);
0192 if ((nargin>1) && (~ischar(varargin{2})))
0193   ydata = double(varargin{2});
0194   if (any(size(ydata) ~= [n 1])) error('y must be n*1 column vector'); end;
0195   family = 'qgauss';
0196   na = 3;
0197 else
0198   ydata = 0;
0199   family = 'density';
0200   na = 2;
0201 end;
0202 if mod(nargin-na,2)==0
0203   error( 'All arguments other than x, y must be name,value pairs' );
0204 end
0205 
0206 
0207 wdata = ones(n,1);
0208 cdata = 0;
0209 base  = 0;
0210 style = 'n';
0211 scale = 1;
0212 xl = zeros(2,d);
0213 
0214 alpha = [0 0 0];
0215 deg = 2;
0216 link = 'default';
0217 acri = 'none';
0218 kern = 'tcub';
0219 kt = 'sph';
0220 
0221 ev = 'atree';
0222 ll = zeros(1,d);
0223 ur = zeros(1,d);
0224 mg = 10;
0225 maxk = 100;
0226 deriv=0;
0227 cut = 0.8;
0228 mdl = 'std';
0229 
0230 while na < length(varargin)
0231     inc = 0;
0232     if (varargin{na}=='y')
0233         ydata = double(varargin{na+1});
0234         family = 'qgauss';
0235         inc = 2;
0236         if (any(size(ydata) ~= [n 1])) error('y must be n*1 column vector'); end;
0237     end
0238     if (strcmp(varargin{na},'weights'))
0239         wdata = double(varargin{na+1});
0240         inc = 2;
0241         if (any(size(wdata) ~= [n 1])) error('weights must be n*1 column vector'); end;
0242     end
0243     if (strcmp(varargin{na},'cens'))
0244         cdata = double(varargin{na+1});
0245         inc = 2;
0246         if (any(size(cdata) ~= [n 1])) error('cens must be n*1 column vector'); end;
0247     end
0248     if (strcmp(varargin{na},'base')) % numeric vector, n*1 or 1*1.
0249         base = double(varargin{na+1});
0250         if (length(base)==1) base = base*ones(n,1); end;
0251         inc = 2;
0252     end
0253     if (strcmp(varargin{na},'style')) % character string of length d.
0254         style = varargin{na+1};
0255         inc = 2;
0256     end;
0257     if (strcmp(varargin{na},'scale')) % row vector, length 1 or d.
0258         scale = varargin{na+1};
0259         if (scale==0)
0260           scale = zeros(1,d);
0261           for i=1:d
0262             scale(i) = sqrt(var(xdata(:,i)));
0263           end;
0264         end;
0265         inc = 2;
0266     end;
0267     if (strcmp(varargin{na},'xlim')) % 2*d numeric matrix.
0268         xl = varargin{na+1};
0269         inc = 2;
0270     end
0271     if (strcmp(varargin{na},'alpha')) % row vector of length 1, 2 or 3.
0272         alpha = [varargin{na+1} 0 0 0];
0273         alpha = alpha(1:3);
0274         inc = 2;
0275     end
0276     if (strcmp(varargin{na},'nn')) % scalar
0277         alpha(1) = varargin{na+1};
0278         inc = 2;
0279     end
0280     if (strcmp(varargin{na},'h')) % scalar
0281         alpha(2) = varargin{na+1};
0282         inc = 2;
0283     end;
0284     if (strcmp(varargin{na},'pen')) % scalar
0285         alpha(3) = varargin{na+1};
0286         inc = 2;
0287     end;
0288     if (strcmp(varargin{na},'acri')) % string
0289         acri = varargin{na+1};
0290         inc = 2;
0291     end
0292     if (strcmp(varargin{na},'deg')) % positive integer.
0293         deg = varargin{na+1};
0294         inc = 2;
0295     end;
0296     if (strcmp(varargin{na},'family')) % character string.
0297         family = varargin{na+1};
0298         inc = 2;
0299     end;
0300     if (strcmp(varargin{na},'link')) % character string.
0301         link = varargin{na+1};
0302         inc = 2;
0303     end;
0304     if (strcmp(varargin{na},'kern')) % character string.
0305         kern = varargin{na+1};
0306         inc = 2;
0307     end;
0308     if (strcmp(varargin{na},'kt')) % character string.
0309         kt = varargin{na+1};
0310         inc = 2;
0311     end;
0312     if (strcmp(varargin{na},'ev')) % char. string, or matrix with d columns.
0313         ev = varargin{na+1};
0314         if (isnumeric(ev)) ev = ev'; end;
0315         inc = 2;
0316     end;
0317     if (strcmp(varargin{na},'ll')) % row vector of length d.
0318         ll = varargin{na+1};
0319         inc = 2;
0320     end;
0321     if (strcmp(varargin{na},'ur')) % row vector of length d.
0322         ur = varargin{na+1};
0323         inc = 2;
0324     end;
0325     if (strcmp(varargin{na},'mg')) % row vector of length d.
0326         mg = varargin{na+1};
0327         inc = 2;
0328     end;
0329     if (strcmp(varargin{na},'cut')) % positive scalar.
0330         cut = varargin{na+1};
0331         inc = 2;
0332     end;
0333     if (strcmp(varargin{na},'what')) % string.
0334         mdl = varargin{na+1};
0335         inc = 2;
0336     end;
0337     if (strcmp(varargin{na},'maxk')) % positive integer.
0338         maxk = varargin{na+1};
0339         inc = 2;
0340     end;
0341     if (strcmp(varargin{na},'deriv')) % numeric row vector, up to deg elements.
0342         deriv = varargin{na+1};
0343         inc = 2;
0344     end;
0345     if (inc==0)
0346       disp(varargin{na});
0347       error('Unknown Input Argument.');
0348     end;
0349     na=na+2;
0350 end
0351 
0352 data = {xdata ydata wdata cdata base style scale xl};
0353 evs = {ev mdl ll ur mg cut maxk deriv};
0354 if (alpha==0) alpha = [0.7 0 0]; end;
0355 sp  = {alpha acri deg family link kern kt};
0356 [fpc pcomp] = mexlf(data,evs,sp);
0357 fit = {data evs sp fpc pcomp};
0358 
0359 return;

Generated on Sun 13-Aug-2006 11:49:44 by m2html © 2003