Usually, more than one independent variable influences the dependent
variable. You can imagine in the above example that sales are
influenced by advertising as well as other factors, such as the number
of sales representatives and the commission percentage paid to sales
representatives. When one independent variable is used in a regression,
it is called a simple regression; when two or more independent variables are used, it is called a multiple regression.
Regression models can be either linear or nonlinear. A linear model
assumes the relationships between variables are straight-line
relationships, while a nonlinear model assumes the relationships
between variables are represented by curved lines. In business you will
often see the relationship between the return of an individual stock
and the returns of the market modeled as a linear relationship, while
the relationship between the price of an item and the demand for it is
often modeled as a nonlinear relationship.
As you can see, there are several different classes of regression
procedures, each with a different degree of complexity and
explanatory power. The most basic type of regression is simple linear regression.
A simple linear regression uses only one independent variable, and it
describes the relationship between the independent variable and
dependent variable as a straight line. This review will focus on the
basic case of a simple linear regression.
How does regression work to enable prediction? View the following
animation for a brief explanation of the basics of simple linear
regression. The subsequent text will develop ideas mentioned in the
animation.
Scatter Plots
As indicated by the animation, one of the first steps in
regression is to plot your data on a scatter plot. The following
table lists the monthly sales and advertising expenditures for all of
last year by a digital electronics company.
In this case, you would plot last year's data for monthly sales
and advertising expenditures as shown on the scatter plot below.
(Data for independent and dependent variables must be from the same
period of time.)
Scatter plots are effective in visually identifying relationships
between variables. These relationships can be expressed
mathematically in terms of a correlation coefficient, which is
commonly referred to as a correlation. Correlations are indices of
the strength of the relationship between two variables. They can be
any value from –1 to +1. (Correlations are covered in greater detail
in the Covariance and Correlation topic of this section.)
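As a rough illustration of how a correlation is calculated, the sketch below computes the Pearson correlation coefficient for two lists of numbers. The advertising and sales figures are invented for this example (they are not the company's data from the table above), and the function name pearson_correlation is simply a label chosen here.

```python
import math

def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 points)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical monthly figures, invented for illustration only.
advertising = [10, 12, 15, 17, 20, 22]   # thousands of dollars
sales = [1.1, 1.3, 1.4, 1.8, 1.9, 2.3]   # millions of dollars

r = pearson_correlation(advertising, sales)
print(round(r, 2))  # always falls between -1 and +1
```

A value close to +1 or -1 indicates that the points lie close to a straight line; a value near 0 indicates little linear relationship.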
When you use regression to predict future values of the dependent
variable, the correlation between the independent and
dependent variables should ideally be high: in absolute value
terms, somewhere in the range from .5 to .99. Viewing the scatter
plot above, you can see that there appears to be some degree of
correlation between the level of advertising expenditure and monthly
sales. When calculated, this correlation equals .89. This
historical data will enable you to predict the relationship between
the two variables in the future, before any further expense is
incurred. In order to make these predictions, a regression line must
be drawn from the information appearing in the scatter plot.
Regression Line
The figure below is the same as the scatter plot above, with the
addition of a regression line fitted to the historical data.
The regression line is the line that lies closest to the data points
overall: it minimizes the total squared vertical distance between
itself and the data points. As you can see, the
regression line touches some data points, but not others. The
vertical distances of the data points from the regression line are called error terms.
A regression line will always contain error terms because, in
reality, independent variables are never perfect predictors of the
dependent variables. There are many
uncontrollable factors in the business world. The error term exists
because a regression model can never include all possible variables;
some predictive capacity will always be absent, particularly in
simple regression.
The typical procedure
for finding the line of best fit is called the least-squares method.
This calculation is usually performed using computer software. In
this calculation, the best fit is found by taking the difference
between each data point and the line, squaring each difference, and
adding the squared values together. The least-squares method is based upon
the principle that this sum of squared errors should be made as
small as possible so the regression line has the least error.
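The least-squares calculation for one independent variable can be sketched in a few lines. This is a minimal illustration, not the routine any particular software package uses; the data are invented, and fit_line and sum_squared_errors are names chosen for this example. The slope is the sum of cross-products of deviations divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # line passes through (mean_x, mean_y)
    return slope, intercept

def sum_squared_errors(xs, ys, slope, intercept):
    """Add up the squared vertical distances between the points and a line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, invented for illustration only.
advertising = [10, 12, 15, 17, 20, 22]   # thousands of dollars
sales = [1.1, 1.3, 1.4, 1.8, 1.9, 2.3]   # millions of dollars

slope, intercept = fit_line(advertising, sales)
best = sum_squared_errors(advertising, sales, slope, intercept)

# Any other line, such as one with a slightly steeper slope, has a larger sum.
worse = sum_squared_errors(advertising, sales, slope + 0.01, intercept)
print(best < worse)
```

Comparing the fitted line with a slightly different line shows the principle directly: the fitted line is the one with the smallest sum of squared errors.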
Once this line is determined, it can be extended beyond the
historical data to predict future levels of sales, given
a particular level of advertising expenditure.
The extension of the line of regression requires the assumption
that the underlying process causing the relationship between the two
variables is valid beyond the range of the sample data. Regression is
a powerful business tool due to its ability to predict
future relationships between variables such as these.
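One way to keep the extrapolation assumption visible is to flag any prediction made outside the range of the sample data. The sketch below assumes a line has already been fitted; the slope, intercept, and sample range are hypothetical values chosen for illustration, and predict is a name invented here.

```python
def predict(x, slope, intercept, x_min, x_max):
    """Return (predicted value, True if x lies outside the sample range)."""
    estimate = intercept + slope * x
    is_extrapolation = x < x_min or x > x_max
    return estimate, is_extrapolation

# Hypothetical fitted line and sample range, for illustration only.
slope, intercept = 0.09, 0.2
x_min, x_max = 10, 22   # smallest and largest x values in the historical data

inside = predict(15, slope, intercept, x_min, x_max)    # within the sample range
outside = predict(30, slope, intercept, x_min, x_max)   # relies on the assumption
print(inside, outside)
```

A flagged prediction is not necessarily wrong, but it depends on the relationship continuing to hold where no data were observed.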
When you run a regression in Excel or in a statistics program,
the program will provide you with a report. The details of these
reports, and the definition of all the terms included in the report,
are beyond the scope of the course.
Equation of a Regression Line
You may recall the equation of a straight line from your review
of the Linear Functions topic in the Algebra section of this
course.
Variables, constants, and coefficients are represented in the
equation of a line as
f(x) = mx + b
where
x represents the independent variable
f(x) represents the dependent variable
the constant b denotes the y-intercept, which is the value of the dependent variable when
the independent variable is equal to zero
the coefficient m is the slope, which describes the movement in the
dependent variable as a result of a one-unit movement in the independent
variable
In finance, linear regressions are commonly used to describe the
returns of an individual security (dependent variable) compared to
the returns of the market in general (independent variable). The
equation for the simple linear regression used to describe security
movements is also a straight line. It is expressed in a similar format,
with a few differences in notation. The equation below is
a regression equation for a straight line describing the relationship
between the returns of security i and the market in general.
r_{i} = a + b r_{m} + e_{i}
r_{i} represents the return of security i and is
the dependent variable
r_{m} represents the return of the market in
general and is the independent variable
b is the slope of the regression line, and
it describes the level of movement in security i as a result of a
unit of movement in the market in general
a is the y-intercept of the
regression line
e_{i} is an error term that describes the
distance between an actual data point and the corresponding point on
the regression line
The graph below provides a visual depiction of this regression
line. The returns of the market in general are represented in this
graph by the returns of the S&P 500—a common surrogate for market returns.
You may be familiar with discussions in financial circles about
the beta (b) of a security being a measure
of the security's risk. The risk measure of beta is calculated using
regression techniques. Beta, the slope of the regression
line, was described above as the level of movement in the returns of
a given security for each unit of movement in the market in general.
A security with a high beta is considered risky and will experience
big swings in its returns as compared to those of the market. A
security with a low beta is considered less risky and will have
returns that fluctuate less than those of the market. The alpha term
(a) in the regression equation of a
security represents the security's propensity to move independent of
the market. The alpha and beta of a security cannot be observed
directly but are estimated, based on the past performance of a
security, through regression analysis.
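To make the estimation concrete, the sketch below regresses a security's past returns on the market's past returns to obtain alpha (the intercept) and beta (the slope). The return figures are invented for illustration, and estimate_alpha_beta is a name chosen here; in practice this calculation would typically be run in a spreadsheet or statistics program over a much longer history.

```python
def estimate_alpha_beta(market_returns, security_returns):
    """Estimate alpha (intercept) and beta (slope) by least squares."""
    n = len(market_returns)
    mean_m = sum(market_returns) / n
    mean_i = sum(security_returns) / n
    # Co-movement of the security with the market, and the market's own variation.
    cov = sum((m - mean_m) * (r - mean_i)
              for m, r in zip(market_returns, security_returns))
    var_m = sum((m - mean_m) ** 2 for m in market_returns)
    beta = cov / var_m
    alpha = mean_i - beta * mean_m
    return alpha, beta

# Hypothetical monthly returns, invented for illustration only.
market = [0.02, -0.01, 0.03, 0.01, -0.02, 0.04]
security = [0.035, -0.02, 0.05, 0.02, -0.04, 0.065]

alpha, beta = estimate_alpha_beta(market, security)
print(round(alpha, 4), round(beta, 2))
```

With these made-up figures the estimated beta is greater than 1, which corresponds to a security whose returns swing more widely than the market's.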
1. The placement office of a graduate business school would like to
predict the starting salaries of its students. The placement office
administrators are highly confident (based on their collective past
experience) that starting salaries depend on a combination of factors,
including the number of years of previous work experience, the
student's graduate school GPA, and the student's GMAT score. Is it
appropriate for the placement office to use a simple linear regression
to predict the starting salaries of its students? Why or why not?
Solution 1
2. The following two graphs illustrate simple linear regressions. Which has a higher predictive quality and why?
Solution 2
3. Consider the following scatter plot with a regression line of the
advertising dollars spent (in thousands) and sales (in millions). If
the company were to spend $275,000 on advertising, what would you
predict the sales level to be?
Solution 3
