Regression analysis
From Wikipedia, the free encyclopedia
In statistics, regression analysis is used to model relationships between variables and determine the magnitude of those relationships. The models can be used to make predictions.
Introduction
Regression analysis models the relationship between one or more response variables (also called dependent variables, explained variables, predicted variables, or regressands), usually named <math>Y</math>, and the predictors (also called independent variables, explanatory variables, control variables, or regressors), usually named <math>X_1, ..., X_p</math>. Multivariate regression describes models that have more than one response variable.
Types of regression
Simple and multiple linear regression
Simple linear regression and multiple linear regression are related statistical methods for modeling the relationship between two or more random variables using a linear equation. Simple linear regression refers to a regression on two variables (one response and one predictor), while multiple regression refers to a regression on more than two variables. Linear regression assumes the best estimate of the response is a linear function of the parameters (though not necessarily linear in the predictors).
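As a minimal sketch of the simple linear case, the least-squares line through a set of points has a closed form: the slope is the covariance of the two variables divided by the variance of the predictor. The function name `fit_simple_linear` and the four data points are made up for illustration and do not come from the article.

```python
# Minimal sketch: closed-form least-squares fit of y = intercept + slope * x.
# The function name and the data are illustrative assumptions.

def fit_simple_linear(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # lies exactly on y = 1 + 2x
intercept, slope = fit_simple_linear(xs, ys)
print(intercept, slope)
```

Because the toy points lie exactly on a line, the fit recovers that line; with noisy data the same formulas give the line with the smallest sum of squared residuals.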
Nonlinear regression models
If the relationship between the variables being analyzed is not linear in the parameters, a number of nonlinear regression techniques may be used to obtain a more accurate fit.
Other models
Although simple linear, multiple linear, and nonlinear regression are the most common types, there also exist Poisson regression, supervised learning, and unit-weighted regression.
Linear models
Predictor variables may be quantitative (continuous) or qualitative (categorical). Categorical predictors are sometimes called factors. Although the method of estimating the model is the same in each case, different situations are known by different names for historical reasons:
- If the predictors are all quantitative, one speaks of multiple regression.
- If the predictors are all qualitative, one speaks of analysis of variance.
- If some predictors are quantitative and some qualitative, one speaks of analysis of covariance.
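The point that the estimation method is the same in each case can be sketched as follows. A qualitative predictor is coded as one indicator column per level and passed to the same least-squares routine used for quantitative predictors. The group labels and responses are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical data: a response measured in three groups (one qualitative predictor).
groups = ["a", "a", "b", "b", "c", "c"]
y = np.array([1.0, 3.0, 4.0, 6.0, 10.0, 12.0])

levels = sorted(set(groups))
# One indicator (dummy) column per level, with no separate intercept column.
X = np.array([[1.0 if g == lev else 0.0 for lev in levels] for g in groups])

# Same least-squares machinery as for quantitative predictors.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(levels, theta)))
```

With this coding each fitted coefficient is the mean response of its group, which is the quantity that a one-way analysis of variance compares across groups.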
The linear model usually assumes that the dependent variable is continuous. If least squares estimation is used and the error component is assumed to be normally distributed, the model is fully parametric; if normality of the errors is not assumed, the model is semi-parametric. If the errors are not normally distributed, there are often better approaches to fitting than least squares. In particular, if the data contain outliers, robust regression might be preferred.
If two or more independent variables are strongly correlated, we say that the variables are multicollinear. Under multicollinearity the least-squares parameter estimates remain unbiased and consistent, but their variances can be large, so the individual coefficients are estimated imprecisely.
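One common way to quantify this effect, not named in the text above, is the variance inflation factor. With exactly two predictors it equals 1/(1 - r²), where r is their sample correlation. It measures how much larger the variance of a slope estimate is than it would be with uncorrelated predictors. The sketch below uses made-up values for two nearly collinear predictors, and the helper name `squared_correlation` is an assumption.

```python
# Sketch: variance inflation factor (VIF) for two nearly collinear predictors.
# The predictor values are made up for illustration.

def squared_correlation(u, v):
    """Squared sample correlation coefficient between two sequences."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    suu = sum((a - mean_u) ** 2 for a in u)
    svv = sum((b - mean_v) ** 2 for b in v)
    return suv * suv / (suu * svv)


x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1]  # almost a copy of x1

r2 = squared_correlation(x1, x2)
vif = 1.0 / (1.0 - r2)  # two-predictor case only
print("r^2 =", r2, "VIF =", vif)
```

A VIF well above 1 indicates that the data carry little information for separating the effects of the two predictors.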
If the regression error is not normally distributed but is assumed to come from an exponential family, generalized linear models should be used. For example, if the response variable can take only binary values (such as a Boolean or Yes/No variable), logistic regression is preferred. The outcome of this type of regression is a function that describes how the probability of a given event (e.g., the probability of getting "yes") varies with the predictors.
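For logistic regression, the function described above passes a linear combination of the predictors through the logistic function, which yields a value between 0 and 1. The sketch below only evaluates such a function and does not fit it. The coefficients and the name `predicted_probability` are hypothetical.

```python
import math

# Sketch: how a fitted logistic regression maps predictors to a probability.
# The coefficients below are hypothetical, not estimated from data.

def predicted_probability(coefficients, predictors):
    """P(response = "yes") for coefficients [intercept, slope_1, ..., slope_p]."""
    intercept, *slopes = coefficients
    linear_predictor = intercept + sum(b * x for b, x in zip(slopes, predictors))
    # The logistic function squashes any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-linear_predictor))


coefficients = [-3.0, 0.5]  # hypothetical intercept and slope
for x in (0.0, 6.0, 12.0):
    print(x, predicted_probability(coefficients, [x]))
```

In practice the coefficients would be estimated from data, typically by maximum likelihood.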
Regression and Bayesian statistics
Maximum likelihood is one method of estimating the parameters of a regression model, and it behaves well for large samples. However, for small amounts of data, the estimates can have high variance or bias. Bayesian methods can also be used to estimate regression models. A prior distribution is placed over the parameters, incorporating what is known about them. (For example, if one parameter is known to be non-negative, a distribution supported on the non-negative numbers can be assigned to it.) A posterior distribution is then obtained for the parameter vector. Bayesian methods have the advantage that they use all the available information. They are exact rather than asymptotic, and thus work well for small data sets if some contextual information is available to be used in the prior. Some practitioners use maximum a posteriori (MAP) estimation, a simpler method than full Bayesian analysis, in which the parameters are chosen to maximize the posterior density (that is, the posterior mode is taken as the estimate). MAP methods are related to Occam's Razor: there is a preference for simplicity among a family of regression models (curves) just as there is a preference for simplicity among competing theories.
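As one concrete instance of MAP estimation, assume normally distributed errors and a zero-mean normal prior on the coefficients. The MAP estimate then has a closed form that coincides with ridge regression. The assumptions, the data, and the name `map_estimate` are illustrative and not taken from the text above, and NumPy is assumed to be available.

```python
import numpy as np

# Sketch: MAP estimate for a linear model under two explicit assumptions:
#   1. errors are independent normal with a common variance, and
#   2. the prior on every coefficient is normal with mean zero.
# Under these assumptions the posterior mode is the ridge-regression solution.

def map_estimate(X, y, lam):
    """Posterior mode; lam = error variance / prior variance (assumed known)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


# Made-up data: an intercept column and one predictor.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.1, 2.9, 5.2, 6.8])

theta_ml = map_estimate(X, y, 0.0)   # flat prior: reduces to least squares
theta_map = map_estimate(X, y, 1.0)  # informative prior pulls estimates toward zero
print(theta_ml, theta_map)
```

The prior acts as a penalty on large coefficients, which is one way of reading the preference for simpler curves mentioned above.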
Examples
To illustrate the goals of regression, we give a worked example.
Prediction of future observations
The following data set gives the average heights and weights for American women aged 30-39 (source: The World Almanac and Book of Facts, 1975).
| Height (in) | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 |
| Weight (lbs) | 115 | 117 | 120 | 123 | 126 | 129 | 132 | 135 | 139 | 142 | 146 | 150 | 154 | 159 | 164 |
We would like to see how the weight of these women depends on their height. We are therefore looking for a function <math>\eta</math> such that <math>Y=\eta(X)+\varepsilon</math>, where Y is the weight of the women and X their height. Intuitively, if the women's proportions and density are constant, their weight should depend on the cube of their height. A plot of the data set is consistent with this supposition.
<math>\vec{X}</math> will denote the vector containing all the measured heights (<math>\vec{X}=(58,59,60,\cdots)</math>) and <math>\vec{Y}=(115,117,120,\cdots)</math> the vector containing all the measured weights. We suppose that the errors <math>\varepsilon</math> have zero mean, are uncorrelated with each other, and have constant variance, so that the Gauss-Markov assumptions hold. We can therefore use the least-squares estimator, i.e. we are looking for coefficients <math>\theta^0, \theta^1</math> and <math>\theta^2</math> satisfying as well as possible (in the least-squares sense) the equation:
- <math>\vec{Y}=\theta^0 + \theta^1 \vec{X} + \theta^2 \vec{X}^3+\vec{\varepsilon}</math>
Geometrically, what we will be doing is an orthogonal projection of Y onto the subspace generated by the variables <math>1, X</math> and <math>X^3</math>. The matrix X is constructed simply by putting a first column of 1's (the constant term in the model), a second column with the original values (the X in the model), and a third column with these values cubed (<math>X^3</math>). The realization of this matrix (i.e. for the data at hand) can be written:
| <math>1</math> | <math>x</math> | <math>x^3</math> |
| 1 | 58 | 195112 |
| 1 | 59 | 205379 |
| 1 | 60 | 216000 |
| 1 | 61 | 226981 |
| 1 | 62 | 238328 |
| 1 | 63 | 250047 |
| 1 | 64 | 262144 |
| 1 | 65 | 274625 |
| 1 | 66 | 287496 |
| 1 | 67 | 300763 |
| 1 | 68 | 314432 |
| 1 | 69 | 328509 |
| 1 | 70 | 343000 |
| 1 | 71 | 357911 |
| 1 | 72 | 373248 |
The matrix <math>(\mathbf{X}^t \mathbf{X})^{-1}</math> (sometimes called the "dispersion matrix"; the matrix <math>\mathbf{X}^t \mathbf{X}</math> itself is sometimes called the "information matrix") is:
<math> \left[\begin{matrix} 1.9\cdot10^3&-45&3.5\cdot 10^{-3}\\ -45&1.0&-8.1\cdot 10^{-5}\\ 3.5\cdot 10^{-3}&-8.1\cdot 10^{-5}&6.4\cdot 10^{-9} \end{matrix}\right]</math>
Vector <math>\widehat{\theta}_{LS}</math> is therefore:
<math>\widehat{\theta}_{LS}=(X^tX)^{-1}X^{t}y= (147, -2.0, 4.3\cdot 10^{-4})</math>
hence <math>\eta(X) = 147 - 2.0 X + 4.3\cdot 10^{-4} X^3</math>
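The fit can be sketched numerically as follows, assuming NumPy is available. The design matrix has the three columns described above. The coefficients are obtained with `numpy.linalg.lstsq` rather than by explicitly inverting <math>X^tX</math>, because the columns differ by several orders of magnitude and the explicit inverse can lose accuracy in floating point. The variable names are illustrative.

```python
import numpy as np

# Heights (in) and average weights (lbs) from the table above.
height = np.arange(58, 73, dtype=float)  # 58, 59, ..., 72
weight = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                   139, 142, 146, 150, 154, 159, 164], dtype=float)

# Design matrix with columns 1, x and x^3.
X = np.column_stack([np.ones_like(height), height, height ** 3])

# Least-squares solution of weight = X @ theta + error.
theta, *_ = np.linalg.lstsq(X, weight, rcond=None)

fitted = X @ theta
residuals = weight - fitted
print("theta:", theta)
print("largest absolute residual:", np.abs(residuals).max())
```

The printed coefficients can be compared with the rounded values quoted above.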
A plot of this function shows that it lies quite close to the data set.
The confidence intervals are computed using:
- <math>[\widehat{\theta_j}-\widehat{\sigma}\sqrt{s_j}t_{n-p;1-\frac{\alpha}{2}};\widehat{\theta_j}+\widehat{\sigma}\sqrt{s_j}t_{n-p;1-\frac{\alpha}{2}}]</math>
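where <math>s_j</math> is the j-th diagonal element of <math>(\mathbf{X}^t \mathbf{X})^{-1}</math> (so that <math>s_1, s_2, s_3</math> correspond to <math>\theta^0, \theta^1, \theta^2</math>), <math>n=15</math> is the number of observations, <math>p=3</math> is the number of parameters, and:
- <math>\widehat{\sigma}=0.52</math>
- <math>s_1=1.9\cdot 10^3, s_2=1.0, s_3=6.4\cdot 10^{-9}\;</math>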
- <math>\alpha=5\%</math>
- <math>t_{n-p;1-\frac{\alpha}{2}}=2.2</math>
Therefore, we can say that the 95% confidence intervals are:
- <math>\theta^0\in[112 , 181]</math>
- <math>\theta^1\in[-2.8 , -1.2]</math>
- <math>\theta^2\in[3.6\cdot 10^{-4} , 4.9\cdot 10^{-4}]</math>
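The interval formula can be sketched in code as follows, assuming NumPy is available. The residual standard deviation is estimated with n - p degrees of freedom, and the <math>s_j</math> are read from the diagonal of <math>(X^tX)^{-1}</math>, formed here from the pseudo-inverse of X for numerical stability. The Student quantile 2.2 is the value quoted above, and the variable names are illustrative.

```python
import numpy as np

# Same data and design matrix (columns 1, x, x^3) as in the example above.
height = np.arange(58, 73, dtype=float)
weight = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                   139, 142, 146, 150, 154, 159, 164], dtype=float)
X = np.column_stack([np.ones_like(height), height, height ** 3])
n, p = X.shape

theta, *_ = np.linalg.lstsq(X, weight, rcond=None)
residuals = weight - X @ theta

# Estimate of the error standard deviation with n - p degrees of freedom.
sigma_hat = np.sqrt(residuals @ residuals / (n - p))

# For full-rank X, pinv(X) @ pinv(X).T equals the inverse of X^t X.
X_pinv = np.linalg.pinv(X)
s = np.diag(X_pinv @ X_pinv.T)

t_quantile = 2.2  # quantile quoted in the text for n - p = 12 and alpha = 5%
half_width = sigma_hat * np.sqrt(s) * t_quantile
lower, upper = theta - half_width, theta + half_width

for j in range(p):
    print(f"theta^{j}: [{lower[j]:.3g}, {upper[j]:.3g}]")
```

Each interval is centered on the corresponding least-squares estimate, and its width grows with both the residual spread and the matching diagonal element.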
See also
- Confidence interval
- Extrapolation
- Kriging
- Forecasting
- Prediction interval
- Statistics
- Trend estimation
- Robust regression
- Multivariate normal distribution
- Important publications in regression analysis
Software
- Major statistical software packages, such as SAS System, SPSS, and Stata, perform various types of regression analysis through user-friendly interfaces.
- Simpler regressions can be done in spreadsheets such as MS Excel or OpenOffice.org Calc.
- Experts can run complex types of regression using programming environments such as Mathematica, R, or MATLAB.
- Many smaller software packages specialize in a niche form of regression.
External links
- Simple Regression Page Enter sample data points and this JavaScript easily calculates five types of regression, including a graph of prediction vs. actual.
- SixSigmaFirst - Intro to regression analysis, and linear regression example
- Curvefit - Online ten-point demo
- Curvefit: A complete guide to nonlinear regression - Online textbook
- Exegeses on Linear Models - Some comments on linear regression models by Bill Venables.
- Mazoo's Learning Blog - Example of linear regression. Shows how to find the linear regression equation, variances, standard errors, coefficients of correlation and determination, and confidence interval.
- Regression of Weakly Correlated Data - How linear regression mistakes can appear when Y-range is much smaller than X-range

