Cross-Sectional Regression Analysis

The simple regression model can be used to study the relationship between two
variables. For reasons we will see, the simple regression model has limitations
as a general tool for empirical analysis. Nevertheless, it is sometimes
appropriate as an empirical tool. Learning how to interpret the simple regression
model is good practice for studying multiple regression, which we’ll do in subsequent
chapters.
2.1 DEFINITION OF THE SIMPLE REGRESSION MODEL
Much of applied econometric analysis begins with the following premise: y and x are
two variables, representing some population, and we are interested in “explaining y in
terms of x,” or in “studying how y varies with changes in x.” We discussed some examples
in Chapter 1, including: y is soybean crop yield and x is amount of fertilizer; y is
hourly wage and x is years of education; y is a community crime rate and x is number
of police officers.
In writing down a model that will “explain y in terms of x,” we must confront three
issues. First, since there is never an exact relationship between two variables, how do
we allow for other factors to affect y? Second, what is the functional relationship
between y and x? And third, how can we be sure we are capturing a ceteris paribus relationship
between y and x (if that is a desired goal)?
We can resolve these ambiguities by writing down an equation relating y to x. A
simple equation is
$$y = \beta_0 + \beta_1 x + u. \qquad (2.1)$$
Equation (2.1), which is assumed to hold in the population of interest, defines the simple
linear regression model. It is also called the two-variable linear regression model
or bivariate linear regression model because it relates the two variables x and y. We now
discuss the meaning of each of the quantities in (2.1). (Incidentally, the term “regression”
has origins that are not especially important for most modern econometric applications,
so we will not explain it here. See Stigler [1986] for an engaging history of
regression analysis.)
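To make the population model concrete, here is a minimal simulation sketch in Python; the parameter values and distributions below are arbitrary choices for illustration, not something the model itself specifies:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population parameters, chosen only for illustration
beta0, beta1 = 1.0, 0.5

n = 1000
x = rng.normal(loc=5.0, scale=2.0, size=n)  # the explanatory variable
u = rng.normal(loc=0.0, scale=1.0, size=n)  # unobserved factors affecting y
y = beta0 + beta1 * x + u                   # the simple regression model (2.1)
```

In an application we would observe only the pairs (x, y); the parameters $\beta_0$ and $\beta_1$ and the errors u are unobserved features of the population that estimation must recover.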
When related by (2.1), the variables y and x have several different names used
interchangeably, as follows. y is called the dependent variable, the explained variable,
the response variable, the predicted variable, or the regressand. x is called
the independent variable, the explanatory variable, the control variable, the predictor
variable, or the regressor. (The term covariate is also used for x.) The terms
“dependent variable” and “independent variable” are frequently used in econometrics.
But be aware that the label “independent” here does not refer to the statistical
notion of independence between random variables (see Appendix B).
The terms “explained” and “explanatory” variables are probably the most descriptive.
“Response” and “control” are used mostly in the experimental sciences, where the
variable x is under the experimenter’s control. We will not use the terms “predicted variable”
and “predictor,” although you sometimes see these. Our terminology for simple
regression is summarized in Table 2.1.
Table 2.1: Terminology for Simple Regression

| y                  | x                    |
|--------------------|----------------------|
| Dependent Variable | Independent Variable |
| Explained Variable | Explanatory Variable |
| Response Variable  | Control Variable     |
| Predicted Variable | Predictor Variable   |
| Regressand         | Regressor            |
The variable u, called the error term or disturbance in the relationship, represents
factors other than x that affect y. A simple regression analysis effectively treats all factors
affecting y other than x as being unobserved. You can usefully think of u as standing
for “unobserved.”
Equation (2.1) also addresses the issue of the functional relationship between y and
x. If the other factors in u are held fixed, so that the change in u is zero, $\Delta u = 0$, then x
has a linear effect on y:

$$\Delta y = \beta_1 \Delta x \quad \text{if } \Delta u = 0. \qquad (2.2)$$
Thus, the change in y is simply $\beta_1$ multiplied by the change in x. This means that $\beta_1$ is
the slope parameter in the relationship between y and x holding the other factors in u
fixed; it is of primary interest in applied economics. The intercept parameter $\beta_0$ also
has its uses, although it is rarely central to an analysis.
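A two-line check of (2.2), using the same hypothetical parameter values as in the sketch above, confirms that holding u fixed makes the effect of x on y exactly linear:

```python
beta0, beta1 = 1.0, 0.5   # hypothetical values, as before

x, u = 4.0, 0.3           # hold the unobserved factors u fixed
dx = 1.0                  # a one-unit increase in x

y_before = beta0 + beta1 * x + u
y_after = beta0 + beta1 * (x + dx) + u

print(y_after - y_before)  # 0.5 = beta1 * dx, exactly as equation (2.2) says
```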
EXAMPLE 2.1 (Soybean Yield and Fertilizer)
Suppose that soybean yield is determined by the model

$$yield = \beta_0 + \beta_1 fertilizer + u, \qquad (2.3)$$

so that y = yield and x = fertilizer. The agricultural researcher is interested in the effect of
fertilizer on yield, holding other factors fixed. This effect is given by $\beta_1$. The error term u
contains factors such as land quality, rainfall, and so on. The coefficient $\beta_1$ measures the
effect of fertilizer on yield, holding other factors fixed: $\Delta yield = \beta_1 \Delta fertilizer$.
EXAMPLE 2.2 (A Simple Wage Equation)
A model relating a person’s wage to observed education and other unobserved factors is
$$wage = \beta_0 + \beta_1 educ + u. \qquad (2.4)$$

If wage is measured in dollars per hour and educ is years of education, then $\beta_1$ measures
the change in hourly wage given another year of education, holding all other factors fixed.
Some of those factors include labor force experience, innate ability, tenure with current
employer, work ethic, and innumerable other things.
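As a purely hypothetical illustration of the magnitudes involved: if $\beta_1$ were 0.50, equation (2.4) would say that each additional year of education raises the hourly wage by 50 cents, holding experience, ability, tenure, and everything else in u fixed. (The value 0.50 is invented here for illustration, not estimated from data.)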
The linearity of (2.1) implies that a one-unit change in x has the same effect on y,
regardless of the initial value of x. This is unrealistic for many economic applications.
For example, in the wage-education example, we might want to allow for increasing
returns: the next year of education has a larger effect on wages than did the previous
year. We will see how to allow for such possibilities in Section 2.4.
The most difficult issue to address is whether model (2.1) really allows us to draw
ceteris paribus conclusions about how x affects y. We just saw in equation (2.2) that $\beta_1$
does measure the effect of x on y, holding all other factors (in u) fixed. Is this the end
of the causality issue? Unfortunately, no. How can we hope to learn in general about
the ceteris paribus effect of x on y, holding other factors fixed, when we are ignoring all
those other factors?
As we will see in Section 2.5, we are only able to get reliable estimators of $\beta_0$ and
$\beta_1$ from a random sample of data when we make an assumption restricting how the
unobservable u is related to the explanatory variable x. Without such a restriction, we
will not be able to estimate the ceteris paribus effect, $\beta_1$. Because u and x are random
variables, we need a concept grounded in probability.
Before we state the key assumption about how x and u are related, there is one assumption
about u that we can always make. As long as the intercept $\beta_0$ is included in the equation,
nothing is lost by assuming that the average value of u in the population is zero.
Mathematically,
$$E(u) = 0. \qquad (2.5)$$
Importantly, assumption (2.5) says nothing about the relationship between u and x but simply
makes a statement about the distribution of the unobservables in the population.
Using the previous examples for illustration, we can see that assumption (2.5) is not very
restrictive. In Example 2.1, we lose nothing by normalizing the unobserved factors affecting
soybean yield, such as land quality, to have an average of zero in the population of
all cultivated plots. The same is true of the unobserved factors in Example 2.2. Without
loss of generality, we can assume that things such as average ability are zero in the population
of all working people. If you are not convinced, you can work through Problem
2.2 to see that we can always redefine the intercept in equation (2.1) to make (2.5) true.
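To see why the normalization is costless, suppose instead that $E(u) = \alpha_0$ for some constant $\alpha_0$ that need not be zero. Equation (2.1) can be rewritten as

$$y = (\beta_0 + \alpha_0) + \beta_1 x + (u - \alpha_0),$$

where the new error term, $u - \alpha_0$, has zero mean by construction. The intercept simply absorbs $\alpha_0$, and the slope $\beta_1$, the quantity of interest, is unchanged. (This is essentially the content of Problem 2.2.)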
We now turn to the crucial assumption regarding how u and x are related. A natural
measure of the association between two random variables is the correlation coefficient.
(See Appendix B for definition and properties.) If u and x are uncorrelated, then, as random
variables, they are not linearly related. Assuming that u and x are uncorrelated goes
a long way toward defining the sense in which u and x should be unrelated in equation
(2.1). But it does not go far enough, because correlation measures only linear dependence
between u and x. Correlation has a somewhat counterintuitive feature: it is possible
for u to be uncorrelated with x while being correlated with functions of x, such as
x2. (See Section B.4 for further discussion.) This possibility is not acceptable for most
regression purposes, as it causes problems for interpreting the model and for deriving
statistical properties. A better assumption involves the expected value of u given x.
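A small constructed example (not from the text) makes this concrete in Python: take x standard normal and set $u = x^2 - 1$, so that u is an exact function of x and yet uncorrelated with it:

```python
import numpy as np

# Constructed illustration: u is an exact function of x, yet u and x are
# uncorrelated, because correlation captures only linear dependence.
rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)   # standard normal, so E(x) = 0 and E(x**3) = 0
u = x**2 - 1.0                   # E(u) = 0, since E(x**2) = 1

print(np.cov(x, u)[0, 1])          # ~0: u is uncorrelated with x
print(np.cov(x**2, u)[0, 1])       # ~2: u is strongly correlated with x**2
print(np.corrcoef(x**2, u)[0, 1])  # exactly 1: u is a linear function of x**2
```

Here the average value of u at a given x is $x^2 - 1$, which clearly depends on x; the stronger assumption introduced next is designed to rule out exactly this kind of dependence.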
Because u and x are random variables, we can define the conditional distribution of
u given any value of x. In particular, for any x, we can obtain the expected (or average)
value of u for that slice of the population described by the value of x. The crucial
assumption is that the average value of u does not depend on the value of x. We can
write this as
$$E(u \mid x) = E(u) = 0, \qquad (2.6)$$
where the second equality follows from (2.5). The first equality in equation (2.6) is the
new assumption, called the zero conditional mean assumption. It says that, for any
given value of x, the average of the unobservables is the same and therefore must equal
the average value of u in the entire population.
Let us see what (2.6) entails in the wage example. To simplify the discussion,
assume that u is the same as innate ability. Then (2.6) requires that the average level of
ability is the same regardless of years of education. For example, if $E(abil \mid 8)$ denotes
the average ability for the group of all people with eight years of education, and
$E(abil \mid 16)$ denotes the average ability among people in the population with 16 years of
education, then (2.6) implies that these must be the same. In fact, the average ability
level must be the same for all education levels. If, for example, we think that average
ability increases with years of education, then (2.6) is false. (This would happen if, on
average, people with more ability choose to become more educated.) As we cannot
observe innate ability, we have no way of knowing whether or not average ability is the
same for all education levels. But this is an issue that we must address before applying
simple regression analysis.
