Home > chronux_1_0 > locfit > locfit.m

locfit

PURPOSE ^

Smoothing noisy data using Local Regression and Likelihood.

SYNOPSIS ^

function fit=locfit(varargin)

DESCRIPTION ^

 Smoothing noisy data using Local Regression and Likelihood.

 arguments still to add: dc renorm itype mint maxit debug

  Usage: fit = locfit(x,y)   % local regression fit of x and y.
         fit = locfit(x)     % density estimation of x.

  Smoothing with locfit is a two-step procedure. The locfit()
  function evaluates the local regression smooth at a set of points
  (can be specified through an evaluation structure). Then, use
  the predict() function to interpolate this fit to other points.

  Additional arguments to locfit() are specified as 'name',value pairs, e.g.:
  locfit( x, 'alpha',[0.7,1.5] , 'family','rate' , 'ev','grid' , 'mg',100 ); 


  Data-related inputs:

    x is a vector or matrix of the independent (or predictor) variables.
      Rows of x represent subjects, columns represent variables.
      Generally, local regression would be used with 1-4 independent
      variables. In higher dimensions, the curse-of-dimensionality,
      as well as the difficulty of visualizing higher dimensional
      surfaces, may limit usefulness.

    y is the column vector of the dependent (or response) variable.
      For density families, 'y' is omitted. 
 NOTE: x and y are the first two arguments. All other arguments require
        the 'name',value notation.

    'weights' Prior weights for observations (reciprocal of variance, or
           sample size). 
    'cens' Censoring indicators for hazard rate or censored regression.
           The coding is '1' (or 'TRUE') for a censored observation, and
           '0' (or 'FALSE') for uncensored observations. 
    'base' Baseline parameter estimate. If a baseline is provided,
           the local regression model is fitted as
                        Y_i = b_i + m(x_i) + epsilon_i,
           with Locfit estimating the m(x) term. For regression models,
           this effectively subtracts b_i from Y_i. The advantage of the
           'base' formulation is that it extends to likelihood
           regression models. 
    'scale' A scale to apply to each variable. This is especially
           important for multivariate fitting, where variables may be
           measured in non-comparable units. It is also used to specify
           the frequency for variables with the 'a' (angular) style.
     'sty' Character string (length d) of styles for each predictor variable.
           n denotes `normal'; a denotes angular (or periodic); l and r
           denotes one-sided left and right; c is conditionally parametric.
 

  Smoothing Parameters and Bandwidths:
  The bandwidth (or more accurately, half-width) of the smoothing window
  controls the amount of smoothing. Locfit allows specification of constant
  (fixed), nearest neighbor, certain locally adaptive variable bandwidths,
  and combinations of these. Also related to the smoothing parameter
  are the local polynmial degree and weight function.

    'nn' 'Nearest neighbor' smoothing parameter. Specifying 'nn',0.5
         means that the width of each smoothing neighborhood is chosen
         to cover 50% of the data.

     'h' A constant (or fixed) bandwidth parameter. For example, 'h',2
         means that the smoothing windows have constant half-width
         (or radius) 2. Note that h is applied after scaling.

   'pen' penalty parameter for adaptive smoothing. Needs to be used
         with care.

  'alpha' The old way of specifying smoothing parameters, as used in
         my book. alpha is equivalent to the vector [nn,h,pen].
         If multiple componenents are non-zero, the largest corresponding
         bandwidth is used. The default (if none of alpha,nn,h,pen
         are provided) is [0.7 0 0].

   'deg' Degree of local polynomial. Default: 2 (local quadratic).
         Degrees 0 to 3 are supported by almost all parts of the
         Locfit code. Higher degrees may work in some cases. 
 
  'kern' Weight function, default = 'tcub'. Other choices are
         'rect', 'trwt', 'tria', 'epan', 'bisq' and 'gauss'.
         Choices may be restricted when derivatives are
         required; e.g. for confidence bands and some bandwidth
         selectors. 
 
    'kt' Kernel type, 'sph' (default); 'prod'. In multivariate
         problems, 'prod' uses a simplified product model which
         speeds up computations. 
 
  'acri' Criterion for adaptive bandwidth selection.
 

  Derivative Estimation.
  Generally I recommend caution when using derivative estimation
  (and especially higher order derivative estimation) -- can you
  really estimate derivatives from noisy data? Any derivative
  estimate is inherently more dependent on an assumed smoothness
  (expressed through the bandwidth) than the data. Warnings aside...

  'deriv' Derivative estimation. 'deriv',1 specifies the first derivative
         (or more correctly, an estimate of the local slope is returned.
         'deriv',[1 1] specifies the second derivative. For bivariate fits
         'deriv',2 specifies the first partial derivative wrt x2.
         'deriv',[1 2] is mixed second-order derivative.
 
  Fitting family.
  'family' is used to specify the local likelihood family.
         Regression-type families are 'gaussian', 'binomial',
           'poisson', 'gamma' and 'geom'. If the family is preceded
           by a q (e.g. 'qgauss', or 'qpois') then quasi-likelihood is
           used; in particular, a dispersion estimate is computed.
           Preceding by an 'r' makes an attempt at robust (outlier-resistant)
           estimation. Combining q and r (e.g. 'family','qrpois') may
           work, if you're lucky.
         Density estimation-type families are 'dens', 'rate' and 'hazard'
           (hazard or failure rate). Note that `dens' scales the output
           to be a statistical density estimate (i.e. scaled to integrate
           to 1). 'rate' estimates the rate or intensity function (events
           per unit time, or events per unit area), which may be called
           density in some fields.
         The default family is 'qgauss' if a response (y argument) has been
         provided, and 'dens' if no response is given.
    'link' Link function for local likelihood fitting. Depending on the
           family, choices may be 'ident', 'log', 'logit',
           'inverse', 'sqrt' and 'arcsin'. 
 
  Evaluation structures.
    By default, locfit chooses a set of points, depending on the data
    and smoothing parameters, to evaluate at. This is controlled by
    the evaluation structure.
      'ev' Specify the evaluation structure. Default is 'tree'.
           Other choices include 'phull' (triangulation), 'grid' (a grid
           of points), 'data' (each data point), 'crossval' (data,
           but use leave-one-out cross validation), 'none' (no evaluation
           points, effectively producing the global parametric fit).
           Alternatively, a vector/matrix of evaluation points may be
           provided. 
           (kd trees not currently supported in mlocfit)
     'll' and 'ur' -- row vectors specifying the upper and lower limits
           for the bounding box used by the evaluation structure.
           They default to the data range. 
     'mg' For the 'grid' evaluation structure, 'mg' specifies the
           number of points on each margin. Default 10. Can be either a
           single number or vector. 
    'cut' Refinement parameter for adaptive partitions. Default 0.8;
           smaller values result in more refined partitions. 
    'maxk' Controls space assignment for evaluation structures. For the
           adaptive evaluation structures, it is impossible to be sure
           in advance how many vertices will be generated. If you get
           warnings about `Insufficient vertex space', Locfit's default
           assigment can be increased by increasing 'maxk'. The default
           is 'maxk','100'. 

    'xlim' For density estimation, Locfit allows the density to be
           supported on a bounded interval (or rectangle, in more than
           one dimension). The format should be [ll;ul] (ie, matrix with
           two rows, d columns) where ll is the lower left corner of
           the rectangle, and ur is the upper right corner.
           One-sided bounds, such as [0,infty), are not supported, but can be
           effectively specified by specifying a very large upper
           bound. 
 
      'what' is used by other functions calling locfit(),
      and should not be used directly.
 
 
  The output of locfit() is a Matlab Cell array:
    fit{1} = data.
    fit{2} = evaluation structure.
    fit{3} = smoothing parameter structure.
    fit{4}{1} = fit points matrix.
    fit{4}{2} = matrix of fitted values etc.
           Note that these are not back-transformed, and may have the
           parametric component removed.
    fit{4}{3} = various details of the evaluation points.
    fit{4}{4} = fit limits.
    fit{4}{5} = family,link.
    fit{5} = parametric component values.

CROSS-REFERENCE INFORMATION ^

This function calls: This function is called by:

SOURCE CODE ^

0001 function fit=locfit(varargin)
0002 
0003 % Smoothing noisy data using Local Regression and Likelihood.
0004 %
0005 % arguments still to add: dc renorm itype mint maxit debug
0006 %
0007 %  Usage: fit = locfit(x,y)   % local regression fit of x and y.
0008 %         fit = locfit(x)     % density estimation of x.
0009 %
0010 %  Smoothing with locfit is a two-step procedure. The locfit()
0011 %  function evaluates the local regression smooth at a set of points
0012 %  (can be specified through an evaluation structure). Then, use
0013 %  the predict() function to interpolate this fit to other points.
0014 %
0015 %  Additional arguments to locfit() are specified as 'name',value pairs, e.g.:
0016 %  locfit( x, 'alpha',[0.7,1.5] , 'family','rate' , 'ev','grid' , 'mg',100 );
0017 %
0018 %
0019 %  Data-related inputs:
0020 %
0021 %    x is a vector or matrix of the independent (or predictor) variables.
0022 %      Rows of x represent subjects, columns represent variables.
0023 %      Generally, local regression would be used with 1-4 independent
0024 %      variables. In higher dimensions, the curse-of-dimensionality,
0025 %      as well as the difficulty of visualizing higher dimensional
0026 %      surfaces, may limit usefulness.
0027 %
0028 %    y is the column vector of the dependent (or response) variable.
0029 %      For density families, 'y' is omitted.
0030 % NOTE: x and y are the first two arguments. All other arguments require
0031 %        the 'name',value notation.
0032 %
0033 %    'weights' Prior weights for observations (reciprocal of variance, or
0034 %           sample size).
0035 %    'cens' Censoring indicators for hazard rate or censored regression.
0036 %           The coding is '1' (or 'TRUE') for a censored observation, and
0037 %           '0' (or 'FALSE') for uncensored observations.
0038 %    'base' Baseline parameter estimate. If a baseline is provided,
0039 %           the local regression model is fitted as
0040 %                        Y_i = b_i + m(x_i) + epsilon_i,
0041 %           with Locfit estimating the m(x) term. For regression models,
0042 %           this effectively subtracts b_i from Y_i. The advantage of the
0043 %           'base' formulation is that it extends to likelihood
0044 %           regression models.
0045 %    'scale' A scale to apply to each variable. This is especially
0046 %           important for multivariate fitting, where variables may be
0047 %           measured in non-comparable units. It is also used to specify
0048 %           the frequency for variables with the 'a' (angular) style.
0049 %     'sty' Character string (length d) of styles for each predictor variable.
0050 %           n denotes `normal'; a denotes angular (or periodic); l and r
0051 %           denotes one-sided left and right; c is conditionally parametric.
0052 %
0053 %
0054 %  Smoothing Parameters and Bandwidths:
0055 %  The bandwidth (or more accurately, half-width) of the smoothing window
0056 %  controls the amount of smoothing. Locfit allows specification of constant
0057 %  (fixed), nearest neighbor, certain locally adaptive variable bandwidths,
0058 %  and combinations of these. Also related to the smoothing parameter
0059 %  are the local polynmial degree and weight function.
0060 %
0061 %    'nn' 'Nearest neighbor' smoothing parameter. Specifying 'nn',0.5
0062 %         means that the width of each smoothing neighborhood is chosen
0063 %         to cover 50% of the data.
0064 %
0065 %     'h' A constant (or fixed) bandwidth parameter. For example, 'h',2
0066 %         means that the smoothing windows have constant half-width
0067 %         (or radius) 2. Note that h is applied after scaling.
0068 %
0069 %   'pen' penalty parameter for adaptive smoothing. Needs to be used
0070 %         with care.
0071 %
0072 %  'alpha' The old way of specifying smoothing parameters, as used in
0073 %         my book. alpha is equivalent to the vector [nn,h,pen].
0074 %         If multiple componenents are non-zero, the largest corresponding
0075 %         bandwidth is used. The default (if none of alpha,nn,h,pen
0076 %         are provided) is [0.7 0 0].
0077 %
0078 %   'deg' Degree of local polynomial. Default: 2 (local quadratic).
0079 %         Degrees 0 to 3 are supported by almost all parts of the
0080 %         Locfit code. Higher degrees may work in some cases.
0081 %
0082 %  'kern' Weight function, default = 'tcub'. Other choices are
0083 %         'rect', 'trwt', 'tria', 'epan', 'bisq' and 'gauss'.
0084 %         Choices may be restricted when derivatives are
0085 %         required; e.g. for confidence bands and some bandwidth
0086 %         selectors.
0087 %
0088 %    'kt' Kernel type, 'sph' (default); 'prod'. In multivariate
0089 %         problems, 'prod' uses a simplified product model which
0090 %         speeds up computations.
0091 %
0092 %  'acri' Criterion for adaptive bandwidth selection.
0093 %
0094 %
0095 %  Derivative Estimation.
0096 %  Generally I recommend caution when using derivative estimation
0097 %  (and especially higher order derivative estimation) -- can you
0098 %  really estimate derivatives from noisy data? Any derivative
0099 %  estimate is inherently more dependent on an assumed smoothness
0100 %  (expressed through the bandwidth) than the data. Warnings aside...
0101 %
0102 %  'deriv' Derivative estimation. 'deriv',1 specifies the first derivative
0103 %         (or more correctly, an estimate of the local slope is returned.
0104 %         'deriv',[1 1] specifies the second derivative. For bivariate fits
0105 %         'deriv',2 specifies the first partial derivative wrt x2.
0106 %         'deriv',[1 2] is mixed second-order derivative.
0107 %
0108 %  Fitting family.
0109 %  'family' is used to specify the local likelihood family.
0110 %         Regression-type families are 'gaussian', 'binomial',
0111 %           'poisson', 'gamma' and 'geom'. If the family is preceded
0112 %           by a q (e.g. 'qgauss', or 'qpois') then quasi-likelihood is
0113 %           used; in particular, a dispersion estimate is computed.
0114 %           Preceding by an 'r' makes an attempt at robust (outlier-resistant)
0115 %           estimation. Combining q and r (e.g. 'family','qrpois') may
0116 %           work, if you're lucky.
0117 %         Density estimation-type families are 'dens', 'rate' and 'hazard'
0118 %           (hazard or failure rate). Note that `dens' scales the output
0119 %           to be a statistical density estimate (i.e. scaled to integrate
0120 %           to 1). 'rate' estimates the rate or intensity function (events
0121 %           per unit time, or events per unit area), which may be called
0122 %           density in some fields.
0123 %         The default family is 'qgauss' if a response (y argument) has been
0124 %         provided, and 'dens' if no response is given.
0125 %    'link' Link function for local likelihood fitting. Depending on the
0126 %           family, choices may be 'ident', 'log', 'logit',
0127 %           'inverse', 'sqrt' and 'arcsin'.
0128 %
0129 %  Evaluation structures.
0130 %    By default, locfit chooses a set of points, depending on the data
0131 %    and smoothing parameters, to evaluate at. This is controlled by
0132 %    the evaluation structure.
0133 %      'ev' Specify the evaluation structure. Default is 'tree'.
0134 %           Other choices include 'phull' (triangulation), 'grid' (a grid
0135 %           of points), 'data' (each data point), 'crossval' (data,
0136 %           but use leave-one-out cross validation), 'none' (no evaluation
0137 %           points, effectively producing the global parametric fit).
0138 %           Alternatively, a vector/matrix of evaluation points may be
0139 %           provided.
0140 %           (kd trees not currently supported in mlocfit)
0141 %     'll' and 'ur' -- row vectors specifying the upper and lower limits
0142 %           for the bounding box used by the evaluation structure.
0143 %           They default to the data range.
0144 %     'mg' For the 'grid' evaluation structure, 'mg' specifies the
0145 %           number of points on each margin. Default 10. Can be either a
0146 %           single number or vector.
0147 %    'cut' Refinement parameter for adaptive partitions. Default 0.8;
0148 %           smaller values result in more refined partitions.
0149 %    'maxk' Controls space assignment for evaluation structures. For the
0150 %           adaptive evaluation structures, it is impossible to be sure
0151 %           in advance how many vertices will be generated. If you get
0152 %           warnings about `Insufficient vertex space', Locfit's default
0153 %           assigment can be increased by increasing 'maxk'. The default
0154 %           is 'maxk','100'.
0155 %
0156 %    'xlim' For density estimation, Locfit allows the density to be
0157 %           supported on a bounded interval (or rectangle, in more than
0158 %           one dimension). The format should be [ll;ul] (ie, matrix with
0159 %           two rows, d columns) where ll is the lower left corner of
0160 %           the rectangle, and ur is the upper right corner.
0161 %           One-sided bounds, such as [0,infty), are not supported, but can be
0162 %           effectively specified by specifying a very large upper
0163 %           bound.
0164 %
0165 %      'what' is used by other functions calling locfit(),
0166 %      and should not be used directly.
0167 %
0168 %
0169 %  The output of locfit() is a Matlab Cell array:
0170 %    fit{1} = data.
0171 %    fit{2} = evaluation structure.
0172 %    fit{3} = smoothing parameter structure.
0173 %    fit{4}{1} = fit points matrix.
0174 %    fit{4}{2} = matrix of fitted values etc.
0175 %           Note that these are not back-transformed, and may have the
0176 %           parametric component removed.
0177 %    fit{4}{3} = various details of the evaluation points.
0178 %    fit{4}{4} = fit limits.
0179 %    fit{4}{5} = family,link.
0180 %    fit{5} = parametric component values.
0181 %
0182 
0183 
0184 
0185 % Minimal input validation
0186 if nargin < 1
0187    error( 'At least one input argument required' );
0188 end
0189 
0190 xdata = double(varargin{1});
0191 d = size(xdata,2);
0192 n = size(xdata,1);
0193 if ((nargin>1) && (~ischar(varargin{2})))
0194   ydata = double(varargin{2});
0195   if (any(size(ydata) ~= [n 1])) error('y must be n*1 column vector'); end;
0196   family = 'qgauss';
0197   na = 3;
0198 else
0199   ydata = 0;
0200   family = 'density';
0201   na = 2;
0202 end;
0203 if mod(nargin-na,2)==0
0204   error( 'All arguments other than x, y must be name,value pairs' );
0205 end
0206 
0207 
0208 wdata = ones(n,1);
0209 cdata = 0;
0210 base  = 0;
0211 style = 'n';
0212 scale = 1;
0213 xl = zeros(2,d);
0214 
0215 alpha = [0 0 0];
0216 deg = 2;
0217 link = 'default';
0218 acri = 'none';
0219 kern = 'tcub';
0220 kt = 'sph';
0221 
0222 ev = 'atree';
0223 ll = zeros(1,d);
0224 ur = zeros(1,d);
0225 mg = 10;
0226 maxk = 100;
0227 deriv=0;
0228 cut = 0.8;
0229 mdl = 'std';
0230 
0231 while na < length(varargin)
0232     inc = 0;
0233     if (varargin{na}=='y')
0234         ydata = double(varargin{na+1});
0235         family = 'qgauss';
0236         inc = 2;
0237         if (any(size(ydata) ~= [n 1])) error('y must be n*1 column vector'); end;
0238     end
0239     if (strcmp(varargin{na},'weights'))
0240         wdata = double(varargin{na+1});
0241         inc = 2;
0242         if (any(size(wdata) ~= [n 1])) error('weights must be n*1 column vector'); end;
0243     end
0244     if (strcmp(varargin{na},'cens'))
0245         cdata = double(varargin{na+1});
0246         inc = 2;
0247         if (any(size(cdata) ~= [n 1])) error('cens must be n*1 column vector'); end;
0248     end
0249     if (strcmp(varargin{na},'base')) % numeric vector, n*1 or 1*1.
0250         base = double(varargin{na+1});
0251         if (length(base)==1) base = base*ones(n,1); end;
0252         inc = 2;
0253     end
0254     if (strcmp(varargin{na},'style')) % character string of length d.
0255         style = varargin{na+1};
0256         inc = 2;
0257     end;
0258     if (strcmp(varargin{na},'scale')) % row vector, length 1 or d.
0259         scale = varargin{na+1};
0260         if (scale==0)
0261           scale = zeros(1,d);
0262           for i=1:d
0263             scale(i) = sqrt(var(xdata(:,i)));
0264           end;
0265         end;
0266         inc = 2;
0267     end;
0268     if (strcmp(varargin{na},'xlim')) % 2*d numeric matrix.
0269         xl = varargin{na+1};
0270         inc = 2;
0271     end
0272     if (strcmp(varargin{na},'alpha')) % row vector of length 1, 2 or 3.
0273         alpha = [varargin{na+1} 0 0 0];
0274         alpha = alpha(1:3);
0275         inc = 2;
0276     end
0277     if (strcmp(varargin{na},'nn')) % scalar
0278         alpha(1) = varargin{na+1};
0279         inc = 2;
0280     end
0281     if (strcmp(varargin{na},'h')) % scalar
0282         alpha(2) = varargin{na+1};
0283         inc = 2;
0284     end;
0285     if (strcmp(varargin{na},'pen')) % scalar
0286         alpha(3) = varargin{na+1};
0287         inc = 2;
0288     end;
0289     if (strcmp(varargin{na},'acri')) % string
0290         acri = varargin{na+1};
0291         inc = 2;
0292     end
0293     if (strcmp(varargin{na},'deg')) % positive integer.
0294         deg = varargin{na+1};
0295         inc = 2;
0296     end;
0297     if (strcmp(varargin{na},'family')) % character string.
0298         family = varargin{na+1};
0299         inc = 2;
0300     end;
0301     if (strcmp(varargin{na},'link')) % character string.
0302         link = varargin{na+1};
0303         inc = 2;
0304     end;
0305     if (strcmp(varargin{na},'kern')) % character string.
0306         kern = varargin{na+1};
0307         inc = 2;
0308     end;
0309     if (strcmp(varargin{na},'kt')) % character string.
0310         kt = varargin{na+1};
0311         inc = 2;
0312     end;
0313     if (strcmp(varargin{na},'ev')) % char. string, or matrix with d columns.
0314         ev = varargin{na+1};
0315         if (isnumeric(ev)) ev = ev'; end;
0316         inc = 2;
0317     end;
0318     if (strcmp(varargin{na},'ll')) % row vector of length d.
0319         ll = varargin{na+1};
0320         inc = 2;
0321     end;
0322     if (strcmp(varargin{na},'ur')) % row vector of length d.
0323         ur = varargin{na+1};
0324         inc = 2;
0325     end;
0326     if (strcmp(varargin{na},'mg')) % row vector of length d.
0327         mg = varargin{na+1};
0328         inc = 2;
0329     end;
0330     if (strcmp(varargin{na},'cut')) % positive scalar.
0331         cut = varargin{na+1};
0332         inc = 2;
0333     end;
0334     if (strcmp(varargin{na},'what')) % string.
0335         mdl = varargin{na+1};
0336         inc = 2;
0337     end;
0338     if (strcmp(varargin{na},'maxk')) % positive integer.
0339         maxk = varargin{na+1};
0340         inc = 2;
0341     end;
0342     if (strcmp(varargin{na},'deriv')) % numeric row vector, up to deg elements.
0343         deriv = varargin{na+1};
0344         inc = 2;
0345     end;
0346     if (inc==0)
0347       disp(varargin{na});
0348       error('Unknown Input Argument.');
0349     end;
0350     na=na+2;
0351 end
0352 
0353 data = {xdata ydata wdata cdata base style scale xl};
0354 evs = {ev mdl ll ur mg cut maxk deriv};
0355 if (alpha==0) alpha = [0.7 0 0]; end;
0356 sp  = {alpha acri deg family link kern kt};
0357 [fpc pcomp] = mexlf(data,evs,sp);
0358 fit = {data evs sp fpc pcomp};
0359 
0360 return;

Generated on Fri 09-Jun-2006 23:38:05 by m2html © 2003