The LPM, Logit and Probit Models

First, download the Excel file called " 98 exam data .xls " from the "Sample Data" section of the Econ3600 homepage. (This exercise is based on the sample data of Table 11.8 in "Econometric Models and Economic Forecasts" by Pindyck and Rubinfeld, 4/e., 1998, McGraw-Hill.)

Then create a new Workfile and import the data from the Excel file, naming the variables "case, pub12, pub34, pub5, priv, years, school, loginc, ptcon and yesvm".

First of all, we assume that the decision of "voting Yes" (yesvm) depends on many explanatory variables. The regression expression is 

        Yesvm = b0 + b1X1 + b2X2 + ... + bkXk + e

We first regress "yesvm" on all the explanatory variables: "case, loginc, priv, ptcon, pub12, pub34, pub5, school, years". The result is:

  

We observe that only loginc, ptcon and school have statistically significant effects on yesvm, so we keep only these three as explanatory variables. That is

yesvm = b0 + b1loginc + b2ptcon + b3school + e 

Now we are ready to run the LPM, Logit and Probit models separately and compare which model is best for a regression with a dummy dependent variable.

1. Linear Probability Model (LPM):

Step 1. Run OLS regression on yesvm = b0 + b1loginc + b2ptcon + b3school + e, the result is:

The slope value of 0.383648 on SCHOOL (1 if the person has a relation to school, 0 otherwise) means that, on average, having a relation to school increases the probability of voting yes by 0.3836, or about 38.36 percentage points. Note that the estimated "yesvm" from the LPM does not necessarily lie between 0 and 1. Given the values loginc=10, ptcon=7 and school=1 for a particular person, the estimated probability that this person votes yes is

Probability = -1.29927 + 0.412916(10) - 0.3229(7) + 0.3836(1) = 0.9532 or 95.32%
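The arithmetic above can be checked with a few lines of Python. This is only a sketch: the coefficients are copied from the LPM output quoted in the text, and the dictionary and function names are made up for illustration.

```python
# Coefficients copied from the LPM output quoted in the text.
coef = {"const": -1.29927, "loginc": 0.412916, "ptcon": -0.3229, "school": 0.383648}

def lpm_fitted(loginc, ptcon, school):
    """Fitted value of the linear probability model for one person."""
    return (coef["const"]
            + coef["loginc"] * loginc
            + coef["ptcon"] * ptcon
            + coef["school"] * school)

# The person used in the text: loginc=10, ptcon=7, school=1.
p = lpm_fitted(loginc=10, ptcon=7, school=1)
print(round(p, 4))

# Nothing keeps the fitted value inside [0, 1]: a higher loginc pushes it above 1.
p_high = lpm_fitted(loginc=11, ptcon=7, school=1)
print(round(p_high, 4))
```

The second call shows the weakness mentioned above: with loginc=11 and the same ptcon and school, the fitted "probability" is greater than 1.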

We can use the truncation method and the weighted least-squares (WLS) procedure to obtain more efficient estimates.

First, calculate the estimated yesvm and name it "yesvmhat": click "Genr" and type "yesvmhat = yesvm - resid" in the dialog box (EViews stores the residuals of the most recent regression in the series "resid"). This series also lets us address another problem of the LPM, namely that the fitted probabilities are not guaranteed to lie between 0 and 1. The solution is to truncate, that is to drop, every estimated yesvmhat that is greater than 1 or smaller than 0.

Second, generate a dummy variable that flags the fitted values inside the interval: click "Genr" and type "dummy = (yesvmhat > 0)*(yesvmhat < 1)". The dummy equals 1 when yesvmhat lies strictly between 0 and 1, and 0 otherwise. Both conditions must be combined in a single expression so that the dummy removes the fitted values above 1 as well as those below 0.

Third, generate a variable called "yesvmtruncate" by typing "yesvmtruncate = yesvmhat*dummy", then open it, find the observations that equal 0 and type "NA" in place of each 0, as shown below:

(Remark: This method is useful when a large number of fitted values lie outside the 0-1 interval, because you only need to look for the zeros and replace them with "NA"; otherwise you would have to find every value outside 0 and 1 yourself and replace it with "NA".)
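The truncation steps can be illustrated outside EViews with a short Python sketch. The fitted values below are hypothetical and are not taken from the exam data; None plays the role of "NA".

```python
# Hypothetical fitted values from an LPM (not taken from the exam data).
yesvmhat = [0.35, 1.08, 0.72, -0.04, 0.91]

def truncate(fitted):
    """Keep fitted values strictly inside (0, 1); mark the rest as missing."""
    return [v if 0.0 < v < 1.0 else None for v in fitted]

yesvmtruncate = truncate(yesvmhat)
kept = [v for v in yesvmtruncate if v is not None]

print(yesvmtruncate)
print(len(kept), "of", len(yesvmhat), "observations kept")
```

Marking the out-of-range values as missing in one step avoids the manual search for zeros, at the cost of the same loss of observations.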

Fourth, calculate the weights, wi = yesvmtruncate*(1-yesvmtruncate), with "GENR w = yesvmtruncate*(1-yesvmtruncate)", and take the square root with "GENR sqrtw = @sqrt(w)". Then generate the new weighted variables: "GENR overw = 1/sqrtw", "GENR yesvmw = yesvmtruncate/sqrtw", "GENR schoolw = school/sqrtw", "GENR ptconw = ptcon/sqrtw", "GENR logincw = loginc/sqrtw". (Note: The WLS estimation method and the calculation of wi are discussed in "Basic Econometrics" by Gujarati, 3/e., 1995, pp.547-548, equation 16.4.2). Now the WLS regression is

yesvmtruncate/sqrt(wi) = b0(1/sqrt(wi)) + b1loginc/sqrt(wi) + b2ptcon/sqrt(wi)

                                            + b3school/sqrt(wi) + e
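The weighting commands can be mirrored in Python as a rough sketch. The rows below are hypothetical, not taken from the exam data, and the function name is made up; the dictionary keys simply reuse the series names from the GENR commands. Note that wi is positive only when the fitted value lies strictly between 0 and 1, which is why the truncation step has to come first.

```python
import math

# Hypothetical rows, not taken from the exam data:
# (truncated fitted value, loginc, ptcon, school)
rows = [
    (0.35, 9.8, 7.0, 0),
    (0.72, 10.1, 6.9, 1),
    (0.91, 10.4, 6.5, 1),
]

def wls_transform(p_hat, loginc, ptcon, school):
    """Divide every variable by sqrt(w), where w = p_hat * (1 - p_hat)."""
    w = p_hat * (1.0 - p_hat)      # positive only when 0 < p_hat < 1
    sqrtw = math.sqrt(w)
    return {
        "overw": 1.0 / sqrtw,      # takes the place of the constant term
        "yesvmw": p_hat / sqrtw,
        "logincw": loginc / sqrtw,
        "ptconw": ptcon / sqrtw,
        "schoolw": school / sqrtw,
    }

transformed = [wls_transform(*row) for row in rows]
for t in transformed:
    print({name: round(value, 4) for name, value in t.items()})
```

Regressing the weighted dependent variable on overw, logincw, ptconw and schoolw without a separate constant then corresponds to the WLS equation above.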

The WLS result is:

As you can see, this result improves on the earlier LPM result: the adjusted R2 is higher and the t-statistics are larger. We can use it to measure how each explanatory variable affects the probability of a yes decision, the qualitative dependent variable.

The interpretation is fairly straightforward. Other things being constant, having a relation to school (SCHOOL = 1) raises the probability of YESVM by 0.4527, or 45.27 percentage points (a larger effect than in the LPM). Given the values loginc=10, ptcon=7 and school=1 for a certain person, the estimated probability that this person votes yes is

Probability = -0.927906 + 0.452753(1) + 0.421885(10) - 0.390537(7) = 1.0099, which is read as 100%

(Remark: The WLS regression above has an important implication. The weighted LPM gives more efficient estimates, but the number of observations is reduced because every estimated "yesvmhat" outside the 0-1 interval was truncated, and a fitted value can still exceed 1, as the 1.0099 above shows. WLS is also strictly a large-sample procedure, so it need not be efficient in finite samples, and it is sensitive to errors of specification. These are the weaknesses of the LPM and the weighted LPM, which account for their unpopularity.)

2. Logit Model:

To run the logit model for the equation yesvm = b0 + b1loginc + b2ptcon + b3school + e, click "Quick" and open "Equation specification". The following dialogue box will appear; in the method list, choose "BINARY - Binary choice [logit, probit, extreme value]", as shown below:

Then, another dialogue box will appear like below:

In the box, type "yesvm c loginc ptcon school" and then click "OK", the result of logit regression is:

How do we interpret the coefficients of the Logit model? Take the estimated coefficient of school as an example again. It means that for people with a relation to school, on average, the log of the odds ratio (L) is higher by 2.998423. Try to explain the remaining coefficients yourself. Moreover, given the values loginc=10, ptcon=7 and school=1 for a certain person, the estimated logit for that person is

log[Pi/(1-Pi)] = Logit= -8.895 + 2.218(10) -1.856(7) +2.998(1) = 3.2898 

How do we convert this figure into a probability? It can be done as follows:

Anti-log(3.2898) = 26.837 = [Pi/(1-Pi)], then Pi = 26.837/(1 + 26.837) = 0.9641 or 96.41%.
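The conversion from logit to probability can be sketched in Python. The coefficients are the rounded values shown in the equation above, so the logit comes out at about 3.291 rather than 3.2898, presumably because of rounding; the function name is made up for illustration.

```python
import math

def logistic(z):
    """Convert a logit (log of the odds) into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Rounded coefficients from the equation in the text; loginc=10, ptcon=7, school=1.
logit = -8.895 + 2.218 * 10 - 1.856 * 7 + 2.998 * 1

odds = math.exp(logit)   # the "anti-log": Pi / (1 - Pi)
p = logistic(logit)      # the same as odds / (1 + odds)

print(round(logit, 3), round(odds, 2), round(p, 4))
```

Because exp(-z) is always positive, the value returned by logistic() is a valid probability for any logit.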

(Remark: The transformation above, Anti-log(Logit) = Pi/(1-Pi), has an important implication: the probability given by the Logit model can never lie outside 0 and 1. This is the main advantage of the Logit model, and it accounts for its popularity.)

In addition, as you can see, the EViews Logit output provides more summary statistics. Three of them are particularly important:

(1) The greater the "LR statistic", the better. It tests the hypothesis that all the slope coefficients in the Logit equation are jointly zero, so a larger value is stronger evidence that the explanatory variables together affect the probability of the decision captured by the dummy dependent variable.

(2) The  smaller the SEE, the better the model.

(3) The higher the McFadden R-squared, the better the model.
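Both the LR statistic and the McFadden R-squared are built from two log-likelihoods: that of the estimated model and that of a model with a constant only. A minimal Python sketch of the standard definitions follows; the two log-likelihood values are hypothetical and are not taken from the EViews output.

```python
def lr_statistic(loglik_full, loglik_null):
    """LR statistic: twice the gain in log-likelihood over the constant-only model."""
    return 2.0 * (loglik_full - loglik_null)

def mcfadden_r2(loglik_full, loglik_null):
    """McFadden R-squared: 1 minus the ratio of the two log-likelihoods."""
    return 1.0 - loglik_full / loglik_null

# Hypothetical log-likelihoods, for illustration only.
loglik_null = -63.0   # model with a constant only
loglik_full = -55.0   # model with loginc, ptcon and school

lr = lr_statistic(loglik_full, loglik_null)
r2 = mcfadden_r2(loglik_full, loglik_null)
print(lr, round(r2, 4))
```

Both measures grow as the explanatory variables raise the log-likelihood above that of the constant-only model, which is why larger values are preferred.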

3. Probit Model:

To run the probit model for yesvm = b0 + b1loginc + b2ptcon + b3school + e, the steps are similar to those for the logit model, but now you need to select the "probit" option in the dialogue box, that is:

After clicking "OK" : the result is:

In general, the Probit model assumes that the higher the value of the "Probit", the greater the probability of voting yes. How do we interpret the coefficients of the Probit model? Again, take the estimated coefficient of school as an example. It means that for people with a relation to school (school = 1), on average, the Probit is higher by 1.794002. Try to explain the remaining coefficients yourself. Moreover, given the values loginc=10, ptcon=7 and school=1 for one particular person, we can calculate the estimated probability that this person votes yes. That is

Probit = -5.131 + 1.328(10) - 1.139(7) + 1.794(1) = 1.97 

How do we convert this figure into a probability? We can look it up in the Z-table, which gives the area between 0 and 1.97:

Pr(0 < Z < 1.97) = 0.4756

This is not yet the answer: adding 0.5, the area to the left of 0, gives Pr(Z < 1.97) = 0.9756 or 97.56%. So not only is the probability from the Probit model similar to that from the Logit model, but so are the other statistics such as SEE, LR and McFadden R2.

(Remark: The Probit model also gives a probability of voting yes that lies between 0 and 1. However, a weakness of this approach is that the Z-table only lists Z-values to two decimal places, so the probability read from the table is only approximate.)
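The standard normal probability can also be evaluated numerically instead of being read from a Z-table. The Python sketch below uses the error function from the standard library; the coefficients are the rounded values from the Probit equation above, and the function name is made up for illustration.

```python
import math

def norm_cdf(z):
    """Standard normal cumulative probability Pr(Z < z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Rounded coefficients from the equation in the text; loginc=10, ptcon=7, school=1.
probit = -5.131 + 1.328 * 10 - 1.139 * 7 + 1.794 * 1

p = norm_cdf(probit)
print(round(probit, 3), round(p, 4))
```

Here norm_cdf(1.97) already includes the 0.5 that has to be added by hand when using a table of areas between 0 and Z.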

Summary of statistics
(t-statistics in parentheses)

               LPM        Weighted LPM   Logit      Probit
constant      -1.2992    -0.9279        -8.895     -5.131
              (-0.987)   (-0.737)       (-1.323)   (-1.283)
SCHOOL         0.3836     0.4527         2.998      1.7940
              (2.514)    (3.609)        (2.054)    (2.149)
PTCON         -0.3229    -0.3905        -1.856     -1.139
              (-1.854)   (-2.626)       (-1.900)   (-1.940)
LOGINC         0.4129     0.4218         2.2179    -1.3227
              (3.060)    (4.626)        (2.217)    (-3.014)
Adjust R2      0.1084     0.5620
SEE            0.4604     0.9890         0.4597     0.4599
McFadden R2                              0.1248     0.1264
LR                                       15.735     15.941

Conclusion:

The four models demonstrated above, LPM, Weighted LPM, Logit and Probit, all give similar predicted probabilities for the same person with "loginc=10, ptcon=7 and school=1". The estimated probabilities are 95.3%, 100%, 96.41% and 97.56% under the LPM, WLPM, Logit and Probit, respectively. Which model is the better one? From these four results, the Logit and Probit models perform better: they have the smallest SEE, and they also solve the main weakness of the LPM (the non-fulfillment of probabilities between 0 and 1).

The End