# Sample selectivity bias

Module by: Christopher Curran

Summary: This module contains a brief introduction to the econometric problem of sample selectivity bias using Stata.

## Sample Selection Bias

### Introduction

These notes discuss how to handle one of the more common problems that arise in economic analyses—sample selection bias. Essentially, sample selection bias can arise whenever some potential observations cannot be observed. For instance, the students enrolled in an intermediate microeconomics course are not a random sample of all undergraduates: students self-select when they enroll in any class or choose a major. While we do not know all of the reasons for this self-selection, we suspect that students choosing to take advanced economics courses have stronger quantitative skills than students choosing courses in the humanities. Moreover, we can never observe the grades that students who did not enroll in the intermediate microeconomics class would have made had they enrolled. Under certain circumstances the omission of potential members of a sample will cause ordinary least squares (OLS) to give biased estimates of the parameters of a model.

In the 1970s James Heckman developed techniques that correct for sample selection bias. Since then, most econometric computer programs have included a command that automatically applies Heckman's method. However, blind use of these commands can lead to errors that a better understanding of his correction technique would avoid. This module is intended to provide that understanding.

In the first section I discuss the sources of sample selection bias by examining the basic economic model used to understand the problem. In the second section I present the estimation strategy first developed by Heckman. In the third section I discuss how to estimate the Heckman model in Stata. In the final section I examine an extended example of the technique. An exercise is included at the end of the discussion.

### The model

Assume that there is an unobserved latent variable, $y_i^*$, and an unobserved latent index, $d_i^*$, such that:

$$y_i^* = \mathbf{x}'_i \boldsymbol{\beta} + \varepsilon_i \quad \text{where } i = 1, \ldots, N;$$
(1)
$$d_i^* = \mathbf{z}'_i \boldsymbol{\gamma} + \nu_i \quad \text{where } i = 1, \ldots, N;$$
(2)
$$d_i = \begin{cases} 1 & \text{if } d_i^* > 0 \\ 0 & \text{if } d_i^* \le 0 \end{cases}; \quad \text{and}$$
(3)
$$y_i = \begin{cases} y_i^* & \text{if } d_i = 1 \\ 0 & \text{if } d_i = 0. \end{cases}$$
(4)

The matrix notation above means that:

1. $\mathbf{x}'_i \boldsymbol{\beta} = \begin{bmatrix} 1 & x_{1i} & \cdots & x_{Ki} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_K \end{bmatrix} = \beta_0 + \sum_{j=1}^{K} \beta_j x_{ji}$, and

2. $\mathbf{z}'_i \boldsymbol{\gamma} = \begin{bmatrix} 1 & z_{1i} & \cdots & z_{Li} \end{bmatrix} \begin{bmatrix} \gamma_0 \\ \gamma_1 \\ \vdots \\ \gamma_L \end{bmatrix} = \gamma_0 + \sum_{j=1}^{L} \gamma_j z_{ji}$.
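
As a quick numeric check of this notation, the sketch below (in Python rather than Stata, with made-up coefficient and characteristic values) evaluates the index $\gamma_0 + \sum_{j} \gamma_j z_{ji}$ as a dot product:

```python
# Hypothetical selection-equation coefficients and one individual's characteristics
gamma = [-2.5, 0.035, 0.06]   # gamma_0 (constant), gamma_1, gamma_2 (assumed values)
z_i = [1.0, 30.0, 13.0]       # leading 1 multiplies the constant; z_1i = 30, z_2i = 13

index = sum(g * z for g, z in zip(gamma, z_i))   # z_i' * gamma
print(round(index, 2))        # → -0.67
```

The leading 1 in $\mathbf{z}_i$ is what lets the constant term $\gamma_0$ ride along in the same inner product as the slopes.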

Substituting (1), (2) and (3) into (4) gives:

$$y_i = \begin{cases} \mathbf{x}'_i \boldsymbol{\beta} + \varepsilon_i & \text{if } \mathbf{z}'_i \boldsymbol{\gamma} + \nu_i > 0 \\ 0 & \text{if } \mathbf{z}'_i \boldsymbol{\gamma} + \nu_i \le 0. \end{cases}$$
(5)

Note that $N$ is the total sample size and $n$ is the number of observations for which $d_i = 1$.

Assume that the two error terms are jointly normally distributed with mean zero and covariance matrix

$$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_\varepsilon^2 & \sigma_{\varepsilon\nu} \\ \sigma_{\nu\varepsilon} & \sigma_\nu^2 \end{bmatrix}$$
(6)

and that $(\varepsilon_i, \nu_i)$ are independent of $\mathbf{z}_i$. The selectivity bias arises because $\sigma_{\varepsilon\nu} \ne 0$: in effect, the residual $\varepsilon_i$ includes some of the same unobserved characteristics as the residual $\nu_i$, causing the two error terms to be correlated.
OLS estimation of equation (1) would then suffer from an omitted variable—the bias created by the missing observations (for example, wage data are unavailable for women who are not in the labor force). As in other cases of omitted variables, the estimates of the parameters of the model, $\hat{\boldsymbol{\beta}}$, would be biased. Heckman (1979) notes in his seminal article on selectivity bias:

One can also show that the least squares estimator of the population variance is downward biased. Second, a symptom of selection bias is that variables that do not belong in the true structural equation (variables in $\mathbf{z}_i$ not in $\mathbf{x}_i$) may appear to be statistically significant determinants of $y_i$ when regressions are fit on selected samples. Third, the model just outlined contains a variety of previous models as special cases. ...For a more complete development of the relationship between the model developed here and previous models for limited dependent variables, censored samples and truncated samples, see Heckman (1976). Fourth, multivariate extensions of the preceding analysis, while mathematically straightforward, are of considerable substantive interest. One example is offered. Consider migrants choosing among K possible regions of residence. If the self selection rule is to choose to migrate to that region with the highest income, both the self selection rule and the subsample regression functions can be simply characterized by a direct extension of the previous analysis. (Notation has been altered to match the notation used in this module; see Heckman, 1979: 155)
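
The core of the problem—that the selected-sample mean of $\varepsilon_i$ is not zero when $\sigma_{\varepsilon\nu} \ne 0$—can be seen in a short simulation. The sketch below (in Python rather than Stata, with assumed unit variances and an assumed error correlation of 0.5, and a selection index of zero for every observation) draws correlated error pairs, keeps only the "selected" draws, and compares the mean of the selected $\varepsilon$ with its theoretical value $\rho\,\phi(0)/\Phi(0)$:

```python
import math
import random

random.seed(42)

RHO = 0.5            # assumed corr(eps_i, nu_i); both variances set to 1
N = 200_000          # number of potential observations

selected = []
for _ in range(N):
    nu = random.gauss(0.0, 1.0)
    # Build eps correlated with nu: eps = rho*nu + sqrt(1 - rho^2)*e
    eps = RHO * nu + math.sqrt(1.0 - RHO ** 2) * random.gauss(0.0, 1.0)
    if nu > 0:       # d_i = 1: the observation makes it into the sample
        selected.append(eps)

mean_eps = sum(selected) / len(selected)
# Theoretical value when the selection index is 0: rho * phi(0)/Phi(0)
theory = RHO * (1.0 / math.sqrt(2.0 * math.pi)) / 0.5
print(round(mean_eps, 2), round(theory, 2))
```

The nonzero mean of the selected residuals is precisely the omitted term that Heckman's correction measures.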

### Estimation Strategy

Heckman (1979) suggests a two-step estimation strategy. In the first step a probit estimate of equation (2) is used to construct a variable that measures the bias. This variable is known as the “inverse Mills ratio.” Heckman and others demonstrate that

$$E\left[ \varepsilon_i \mid \mathbf{z}_i, d_i = 1 \right] = \frac{\sigma_{\varepsilon\nu}}{\sigma_\nu^2} \left\{ \frac{\phi(\mathbf{z}'_i \boldsymbol{\gamma})}{\Phi(\mathbf{z}'_i \boldsymbol{\gamma})} \right\},$$
(7)

where $\phi(\mathbf{z}'_i \boldsymbol{\gamma})$ and $\Phi(\mathbf{z}'_i \boldsymbol{\gamma})$ are the standard normal probability density function and cumulative distribution function, respectively, evaluated at $\mathbf{z}'_i \boldsymbol{\gamma}$. The ratio in the braces in equation (7) is known as the inverse Mills ratio. We will use an estimate of the inverse Mills ratio in the estimation of equation (5) to measure the sample selectivity bias.

The Heckman two-step estimator is relatively easy to implement. In the first step you use a maximum-likelihood probit regression on the whole sample to calculate $\hat{\boldsymbol{\gamma}}$ from equation (2). You then use $\hat{\boldsymbol{\gamma}}$ to estimate the inverse Mills ratio:

$$\hat{\lambda}_i = \frac{\phi(\mathbf{z}'_i \hat{\boldsymbol{\gamma}})}{\Phi(\mathbf{z}'_i \hat{\boldsymbol{\gamma}})}.$$
(8)

In the second step, we estimate:

$$y_i = \mathbf{x}'_i \boldsymbol{\beta} + \mu \hat{\lambda}_i + \eta_i$$
(9)

using OLS, where $E(\hat{\mu}) = \frac{\sigma_{\varepsilon\nu}}{\sigma_\nu^2}$. Thus, a t-ratio test of the null hypothesis $H_0 : \mu = 0$ is equivalent to testing the null hypothesis $H_0 : \sigma_{\varepsilon\nu} = 0$, and is therefore a test for the existence of sample selectivity bias.
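
The logic of the second step can be sketched end to end in a few lines of Python (an illustration only: the true parameters are assumed, and $\boldsymbol{\gamma}$ is treated as known so the probit step is skipped). A naive OLS fit on the selected sample shows a biased intercept, while adding the inverse Mills ratio as a regressor recovers the true coefficients and returns a coefficient on $\hat{\lambda}$ near $\sigma_{\varepsilon\nu}/\sigma_\nu^2$ (here 0.5):

```python
import math
import random

random.seed(7)

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ols(X, y):
    """OLS via the normal equations (X'X)b = X'y, solved by Gaussian elimination."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] +
         [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):                       # forward elimination with pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(c + 1, k):
            f = A[r][c] / A[c][c]
            for j in range(c, k + 1):
                A[r][j] -= f * A[c][j]
    b = [0.0] * k                            # back substitution
    for i in range(k - 1, -1, -1):
        b[i] = (A[i][k] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    return b

# Assumed model: y* = 1 + 2x + eps; selected when z + nu > 0; corr(eps, nu) = 0.5
RHO, N = 0.5, 200_000
X_naive, X_corr, y = [], [], []
for _ in range(N):
    x = random.gauss(0.0, 1.0)
    z = random.gauss(0.0, 1.0)               # selection-equation variable
    nu = random.gauss(0.0, 1.0)
    eps = RHO * nu + math.sqrt(1.0 - RHO ** 2) * random.gauss(0.0, 1.0)
    if z + nu > 0:                           # d_i = 1: y observed
        lam = phi(z) / Phi(z)                # inverse Mills ratio at index z
        y.append(1.0 + 2.0 * x + eps)
        X_naive.append([1.0, x])             # naive regressors
        X_corr.append([1.0, x, lam])         # second step: add lambda-hat

b_naive = ols(X_naive, y)
b_corr = ols(X_corr, y)
print("naive:    ", [round(v, 2) for v in b_naive])
print("corrected:", [round(v, 2) for v in b_corr])
```

Because $x$ is independent of the selection mechanism here, only the intercept of the naive fit is distorted; with correlated regressors the slope estimates would be biased as well.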

An alternative approach to the sample selectivity problem is to use a maximum likelihood estimator. Heckman (1974) originally suggested estimating the parameters of the model by maximizing the average log likelihood function:

(10)

where $\phi_{\varepsilon\nu}$ is the probability density function of the bivariate normal distribution. Fortunately, Stata offers a single command for calculating either the two-step or the maximum likelihood estimator.

### Estimation in Stata

Estimation of the two versions of the Heckman sample selectivity bias models is straightforward in Stata. The command is:

.heckman depvar [varlist], select(varlist_s) [twostep]

or

.heckman depvar [varlist], select(depvar_s = varlist_s) [twostep]

The syntax for maximum-likelihood estimates is:

.heckman depvar [varlist] [weight] [if exp] [in range], select([depvar_s =] varlist_s [, offset(varname) noconstant]) [ robust cluster(varname) score(newvarlist|stub*) nshazard(newvarname) mills(newvarname) offset(varname) noconstant constraints(numlist) first noskip level(#) iterate(0) nolog maximize_options ]

The predict command has these options, among others:

xb, the default, calculates the linear predictions from the underlying regression equation.

ycond calculates the expected value of the dependent variable conditional on the dependent variable being observed/selected; E(y | y observed).

yexpected calculates the expected value of the dependent variable (y*), where that value is taken to be 0 when it is expected to be unobserved; y* = P(y observed) * E(y | y observed). The assumption of 0 is valid for many cases where nonselection implies non-participation (e.g., unobserved wage levels, insurance claims from those who are uninsured, etc.) but may be inappropriate for some problems (e.g., unobserved disease incidence).
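
The relation between the last two options is simple arithmetic. A minimal check with made-up numbers (a hypothetical observation with a 0.64 chance of being observed and a conditional expectation of 24.50):

```python
p_observed = 0.64       # hypothetical P(y observed), e.g. phat from a probit
y_cond = 24.50          # hypothetical E(y | y observed), the ycond prediction
y_expected = p_observed * y_cond   # yexpected counts y as 0 when unobserved
print(round(y_expected, 2))        # → 15.68
```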

Examples of these two commands are:

. heckman wage educ age, select(married children educ age)

. predict yhat

These two commands use the maximum likelihood estimator to fit equation (1)—wage as a function of education and age—with a selection equation that uses marital status, number of children, education level, and age to explain which individuals participate in the labor force. The help file in Stata provides additional information on the structure of the heckman command and is well worth printing out if you are dealing with a sample selectivity bias problem.

#### Example 1: Example from Stata

We will illustrate various issues of selection bias using the data set available from the Stata site. Retrieve the data set by entering:

. use http://www.stata-press.com/data/imeus/womenwk, clear

This data set has 2,000 observations of 15 variables. We can use the describe command (.describe) to get a brief description of the data set:

**Table 1.** Results of the describe command.

obs: 2,000; vars: 15; size: 142,000 (86.5% of memory free); 9 Nov 2004 20:23

| Variable name | Storage type | Display format |
| --- | --- | --- |
| c1 | double | %10.0g |
| c2 | double | %10.0g |
| u | double | %10.0g |
| v | (7,2) | %10.0g |
| country | float | %9.0g |
| age | int | %8.0g |
| education | int | %8.0g |
| married | byte | %8.0g |
| children | int | %8.0g |
| select | float | %9.0g |
| wageful | float | %9.0g |
| wage | float | %9.0g |
| lw | float | %9.0g |
| work | float | %9.0g |
| lwf | float | %9.0g |

We are interested in only a subset of these data. Table 2 reports the definitions of variables that are relevant for our analysis. We can get further insight into the data set using the summarize command. Table 3 reports the summary statistics for the data set.

**Table 2.** Definitions of the relevant variables.

| Variable name | Definition |
| --- | --- |
| country | County of residence (categorical variable equal to 0, 1, ..., 9) |
| age | Age of the woman |
| education | Number of years of education of the woman |
| married | Dummy variable equal to 1 if the woman is married and 0 otherwise |
| children | Number of children that the woman has in her household |
| wage | Hourly wage rate of the woman |
| lw | Natural logarithm of the hourly wage rate |
| work | Dummy variable equal to 1 if the woman is in the workforce and 0 otherwise |
**Table 3.** Summary statistics.

| Variable | Obs | Mean | Std. Dev. | Min | Max |
| --- | --- | --- | --- | --- | --- |
| age | 2000 | 36.208 | 8.28656 | 20 | 59 |
| education | 2000 | 13.084 | 3.045912 | 10 | 20 |
| married | 2000 | .6705 | .4701492 | 0 | 1 |
| children | 2000 | 1.6445 | 1.398963 | 0 | 5 |
| wage | 1343 | 23.69217 | 6.305374 | 5.88497 | 45.80979 |
| lw | 1343 | 3.126703 | .2865111 | 1.772402 | 3.824498 |
| work | 2000 | .6715 | .4697852 | 0 | 1 |

We are interested in modeling two things: (1) a woman's decision to enter the labor force and (2) the determinants of the female wage rate. It might be reasonable to assume that the decision to enter the labor force is a function of a woman's age, marital status, number of children, and level of education. Also, the wage rate a woman earns should be a function of her age and education.

##### The decision to enter the labor force

We can use a probit regression to model the decision of a woman to enter the labor force. The results of this estimation are reported in Table 4. We can then use the predict command to produce some results that help us confirm that we understand what the regression results mean. In particular, type in the following two commands:

.predict zbhat, xb

.predict phat, p

These two commands will predict (1) the linear prediction (zbhat) and (2) the predicted probability that the woman will be in the workforce (phat). Table 5 reports the values of these two variables for observations 1 through 10.

**Table 4.** Probit estimate of the decision to enter the labor force.

. probit work age education married children

Iteration 0: log likelihood = -1266.2225
Iteration 4: log likelihood = -1027.0616

Probit estimates: Number of obs = 2000; LR chi2(4) = 478.32; Prob > chi2 = 0.0000; Log likelihood = -1027.0616; Pseudo R2 = 0.1889

| work | Coef. | Std. Err. | z | P>\|z\| | [95% Conf. Interval] |
| --- | --- | --- | --- | --- | --- |
| age | .0347211 | .0042293 | 8.21 | 0.000 | .0264318, .0430105 |
| education | .0583645 | .0109742 | 5.32 | 0.000 | .0368555, .0798735 |
| married | .4308575 | .074208 | 5.81 | 0.000 | .2854125, .5763025 |
| children | .4473249 | .0287417 | 15.56 | 0.000 | .3909922, .5036576 |
| _cons | -2.467365 | .1925635 | -12.81 | 0.000 | -2.844782, -2.089948 |
**Table 5.** Predicted index (zbhat) and predicted probability of labor force participation (phat), observations 1–10.

| Observation | zbhat | phat |
| --- | --- | --- |
| 1 | -0.68900 | 0.24541 |
| 2 | -0.20290 | 0.41961 |
| 3 | -0.48067 | 0.31538 |
| 4 | -0.16818 | 0.43322 |
| 5 | 0.34859 | 0.63630 |
| 6 | 0.58758 | 0.72159 |
| 7 | 0.97357 | 0.83486 |
| 8 | 0.45978 | 0.67716 |
| 9 | 0.01799 | 0.50718 |
| 10 | 0.32628 | 0.62790 |

The interpretation of the numbers in Table 5 is straightforward. Consider individual 1. The z-value predicted for this individual is −0.689. Using the standard normal tables reported in Table 11, it is easy to see that:

$$\Phi\left( z \le -0.69 \right) = \Pr\left( \text{Individual 1 is in the labor force} \right)$$
(11)
$$\Pr\left( \text{Individual 1 is in the labor force} \right) = 0.2451.$$
(12)

The difference between this number and the value reported for phat in Table 5 is due to rounding error.
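
You can reproduce both numbers—the standard normal table value and Stata's phat—with any implementation of the standard normal CDF; for example, in Python via the error function (an illustration, not part of the Stata workflow):

```python
import math

def norm_cdf(z):
    """Standard normal CDF, Phi(z), written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(round(norm_cdf(-0.69), 4))     # two-decimal z from the normal table → 0.2451
print(round(norm_cdf(-0.68900), 4))  # the zbhat reported for observation 1 → 0.2454
```

The small gap between the two printed values is exactly the rounding error discussed above.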

A little later we will want to calculate the inverse Mills ratio. As noted in (8), the formula for the inverse Mills ratio is:

$$\hat{\lambda}_i = \frac{\phi(\mathbf{z}'_i \hat{\boldsymbol{\gamma}})}{\Phi(\mathbf{z}'_i \hat{\boldsymbol{\gamma}})}.$$
(13)

The variable phat is equal to $\Phi(\mathbf{z}'_i \hat{\boldsymbol{\gamma}})$. Stata offers an easy way to calculate $\phi(\mathbf{z}'_i \hat{\boldsymbol{\gamma}})$ with the function “normden(zbhat)” as follows:

.generate imratio = normden(zbhat)/phat

Table 6 repeats Table 5 with the estimate of the inverse Mills ratio for the first 10 observations.

**Table 6.** Estimates of the inverse Mills ratio, observations 1–10.

| Observation | zbhat | phat | Inverse Mills ratio |
| --- | --- | --- | --- |
| 1 | -0.6889973 | 0.2454125 | 1.2821240 |
| 2 | -0.2029016 | 0.4196060 | 0.9313837 |
| 3 | -0.4806706 | 0.3153753 | 1.1269680 |
| 4 | -0.1681804 | 0.4332207 | 0.9079438 |
| 5 | 0.3485867 | 0.6363002 | 0.5900134 |
| 6 | 0.5875849 | 0.7215945 | 0.4652062 |
| 7 | 0.9735670 | 0.8348642 | 0.2974918 |
| 8 | 0.4597758 | 0.6771615 | 0.5300468 |
| 9 | 0.0179909 | 0.5071769 | 0.7864666 |
| 10 | 0.3262833 | 0.6278950 | 0.6024283 |
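
To confirm that the normden(zbhat)/phat construction is the same calculation as equation (13), the value for observation 1 can be replicated outside Stata (a Python sketch; the zbhat value is taken from Table 6):

```python
import math

zbhat_1 = -0.6889973                                             # zbhat, observation 1
pdf = math.exp(-0.5 * zbhat_1 ** 2) / math.sqrt(2.0 * math.pi)   # phi(z), i.e. normden
cdf = 0.5 * (1.0 + math.erf(zbhat_1 / math.sqrt(2.0)))           # Phi(z), i.e. phat
imr_1 = pdf / cdf                                                # inverse Mills ratio
print(round(imr_1, 5))                                           # → 1.28212
```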

#### The two Heckman estimates

One of the great advantages of using an econometrics program like Stata is that the authors quite often have created a command that does all of the work for the user. In our case, the commands we need to run to generate the maximum likelihood estimate of the Heckman model are:

. global wage_eqn wage educ age

. global seleqn married children age education

. heckman $wage_eqn, select($seleqn)

Notice that we have used the global command to create a shortcut for referring to each of the two equations in the estimation. The command for the Heckman two-step estimate is:

.heckman $wage_eqn, select($seleqn) twostep

.predict mymills, mills

**Table 7.** Comparison of the maximum likelihood and Heckman two-step estimates (z-statistics in parentheses).

| (1) Explanatory variable | (2) Maximum likelihood estimate | (3) Heckman two-step | (4) Probit estimate of the selection equation |
| --- | --- | --- | --- |
| **Wage equation** | | | |
| Education | 0.9899537 (18.59) | 0.9825259 (18.23) | — |
| Age | 0.2131294 (10.34) | 0.2118695 (9.61) | — |
| Intercept | 0.4857752 (0.45) | 0.7340391 (0.59) | — |
| **Selection equation** | | | |
| Married | 0.4451721 (6.61) | 0.4308575 (5.81) | 0.4308575 (5.81) |
| Children | 0.4387068 (15.79) | 0.4473249 (15.56) | 0.4473249 (15.56) |
| Age | 0.0365098 (8.79) | 0.0347211 (8.21) | 0.0347211 (8.21) |
| Education | 0.0557318 (5.19) | 0.0583645 (5.32) | 0.0583645 (5.32) |
| Intercept | -2.491015 (-13.16) | -2.467365 (-12.81) | -2.467365 (-12.81) |
| $\rho$ | 0.7035061 | 0.67284 | — |
| $\sigma$ | 6.004797 | 5.9473529 | — |
| $\lambda$ (Mills) | 4.224412 | 4.001615 (6.60) | — |
| Observations | 2000 | 2000 | 2000 |
| Number of women not working | 657 | 657 | 657 |
| Number of women working | 1343 | 1343 | 1343 |
| Log likelihood | -5178.304 | — | -1027.0616 |
| Wald χ²(2) [Prob > χ²] | 508.44 [0.0000] | — | — |
| Wald χ²(4) [Prob > χ²] | — | 551.37 [0.0000] | — |
| LR χ²(4) [Prob > χ²] | — | — | 478.32 [0.0000] |
| LR test of independent equations (ρ = 0), χ²(1) [Prob > χ²] | 61.20 [0.0000] | — | — |

The second command retrieves the estimates of the inverse Mills ratio so that we can check our earlier calculations. Table 7 reports the results of these estimations: Column 2 reports the maximum-likelihood estimates; Column 3 reports the Heckman two-step estimates; and Column 4 reports the probit estimate of the selection equation, as reported in Table 4. The estimates from the two methods are very similar. Of course, the probit estimates in Column 4 exactly match the results reported for the selection equation in Column 3. As a final check, Table 8 compares the values of the inverse Mills ratio reported in Table 6 with the values calculated in the Heckman two-step method. The two sets of estimates are identical except for some rounding error.

| Observation | As calculated from the probit estimates | As reported by the Heckman two-step |
|---|---|---|
| 1 | 1.2821240 | 1.2821240 |
| 2 | 0.9313837 | 0.9313837 |
| 3 | 1.1269680 | 1.1269680 |
| 4 | 0.9079438 | 0.9079438 |
| 5 | 0.5900134 | 0.5900134 |
| 6 | 0.4652062 | 0.4652061 |
| 7 | 0.2974918 | 0.2974918 |
| 8 | 0.5300468 | 0.5300469 |
| 9 | 0.7864666 | 0.7864666 |
| 10 | 0.6024283 | 0.6024283 |
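The check in Table 8 is easy to reproduce by hand: given a fitted probit index z for a selected observation, the inverse Mills ratio is λ(z) = φ(z)/Φ(z). The following sketch computes λ for a few hypothetical index values (these are not the index values from Table 4, just illustrations of the formula):

```python
from math import erf, exp, pi, sqrt

def normal_pdf(z):
    """Standard normal density, phi(z)."""
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def normal_cdf(z):
    """Standard normal cumulative distribution, Phi(z), via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def inverse_mills(z):
    """Inverse Mills ratio lambda(z) = phi(z) / Phi(z)."""
    return normal_pdf(z) / normal_cdf(z)

# Hypothetical probit index values for three selected observations.
for z in (-1.0, 0.0, 1.0):
    print(f"z = {z:+.1f}  lambda(z) = {inverse_mills(z):.7f}")
```

Note that λ(z) is larger for observations that are less likely to be selected (more negative index), which is exactly why omitting it biases OLS on the selected subsample.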

### Exercise

##### Exercise 1: The supply of married women to the workforce.

We are interested in understanding the decision of married Portuguese women to enter the labor force. The data set, contained in the Excel file Martins, is a sample from the Portuguese Employment Survey, from the interview year 1991, and has been provided by the Portuguese National Institute of Statistics (INE). The file is organized as follows: there are seven columns, corresponding to seven variables, and 2,339 observations.

a) Estimate the following equation using OLS, using only the observations for women actually working: Wages = f(age, age², education).

b) What is the potential source of selection bias?

c) Estimate a wage equation for the Portuguese data in three ways: (1) using OLS, (2) using the Heckman two-step method, and (3) using the ML method. Report all three sets of estimates in a single table. For consistency, assume that the appropriate explanatory variables for wages are (1) age, (2) the square of age, and (3) years of education. Further, assume that a woman's decision not to enter the labor force depends on (1) the presence of children under the age of 3, (2) the presence of children between 3 and 18, (3) her husband's wage level, (4) her level of education, and (5) her age.
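Before turning to the real data, it can help to see the two-step logic on simulated data, where the true parameters are known. The sketch below is not a substitute for Stata's heckman command; it is a minimal illustration, with invented variable names and parameter values, of why adding the inverse Mills ratio as a regressor removes the selection bias. To keep it short, the sketch computes the ratio from the true selection index rather than from a first-stage probit fit; in practice the first step is the probit estimation of that index.

```python
import numpy as np
from math import erf

def inv_mills(z):
    """Inverse Mills ratio lambda(z) = phi(z)/Phi(z), elementwise."""
    z = np.asarray(z, dtype=float)
    phi = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    Phi = np.array([0.5 * (1.0 + erf(v / np.sqrt(2.0))) for v in z])
    return phi / Phi

rng = np.random.default_rng(42)
n = 20_000

# Errors: u drives selection, e drives wages; corr(u, e) = rho = 0.8.
rho = 0.8
u = rng.standard_normal(n)
e = rho * u + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)

w = rng.standard_normal(n)        # variable in the selection equation only
x = rng.standard_normal(n)        # variable in the wage equation
index = 0.5 + 1.0 * w             # true selection index (stand-in for a probit fit)
selected = index + u > 0          # wages observed only for selected observations

y = 1.0 + 2.0 * x + e             # true wage equation: intercept 1, slope 2

ys, xs = y[selected], x[selected]
lam = inv_mills(index[selected])  # second-step correction term

# Naive OLS on the selected subsample: the intercept absorbs E[e | selected] > 0.
X_naive = np.column_stack([np.ones(ys.size), xs])
b_naive, *_ = np.linalg.lstsq(X_naive, ys, rcond=None)

# Heckman second step: add the inverse Mills ratio as a regressor.
X_corr = np.column_stack([np.ones(ys.size), xs, lam])
b_corr, *_ = np.linalg.lstsq(X_corr, ys, rcond=None)

print("naive OLS:", b_naive)
print("corrected:", b_corr)
```

Under this design the corrected intercept and slope should be close to their true values (1 and 2), and the coefficient on the inverse Mills ratio should be close to ρσ = 0.8, while the naive intercept is biased upward.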

### Appendix A.

Areas under the standard normal curve: each entry gives P(0 ≤ Z ≤ z), where z is the row value plus the column value.

| z | 0.00 | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 0 | 0.004 | 0.008 | 0.012 | 0.016 | 0.0199 | 0.0239 | 0.0279 | 0.0319 | 0.0359 |
| 0.1 | 0.0398 | 0.0438 | 0.0478 | 0.0517 | 0.0557 | 0.0596 | 0.0636 | 0.0675 | 0.0714 | 0.0753 |
| 0.2 | 0.0793 | 0.0832 | 0.0871 | 0.091 | 0.0948 | 0.0987 | 0.1026 | 0.1064 | 0.1103 | 0.1141 |
| 0.3 | 0.1179 | 0.1217 | 0.1255 | 0.1293 | 0.1331 | 0.1368 | 0.1406 | 0.1443 | 0.148 | 0.1517 |
| 0.4 | 0.1554 | 0.1591 | 0.1628 | 0.1664 | 0.17 | 0.1736 | 0.1772 | 0.1808 | 0.1844 | 0.1879 |
| 0.5 | 0.1915 | 0.195 | 0.1985 | 0.2019 | 0.2054 | 0.2088 | 0.2123 | 0.2157 | 0.219 | 0.2224 |
| 0.6 | 0.2257 | 0.2291 | 0.2324 | 0.2357 | 0.2389 | 0.2422 | 0.2454 | 0.2486 | 0.2517 | 0.2549 |
| 0.7 | 0.258 | 0.2611 | 0.2642 | 0.2673 | 0.2704 | 0.2734 | 0.2764 | 0.2794 | 0.2823 | 0.2852 |
| 0.8 | 0.2881 | 0.291 | 0.2939 | 0.2967 | 0.2995 | 0.3023 | 0.3051 | 0.3078 | 0.3106 | 0.3133 |
| 0.9 | 0.3159 | 0.3186 | 0.3212 | 0.3238 | 0.3264 | 0.3289 | 0.3315 | 0.334 | 0.3365 | 0.3389 |
| 1.0 | 0.3413 | 0.3438 | 0.3461 | 0.3485 | 0.3508 | 0.3531 | 0.3554 | 0.3577 | 0.3599 | 0.3621 |
| 1.1 | 0.3643 | 0.3665 | 0.3686 | 0.3708 | 0.3729 | 0.3749 | 0.377 | 0.379 | 0.381 | 0.383 |
| 1.2 | 0.3849 | 0.3869 | 0.3888 | 0.3907 | 0.3925 | 0.3944 | 0.3962 | 0.398 | 0.3997 | 0.4015 |
| 1.3 | 0.4032 | 0.4049 | 0.4066 | 0.4082 | 0.4099 | 0.4115 | 0.4131 | 0.4147 | 0.4162 | 0.4177 |
| 1.4 | 0.4192 | 0.4207 | 0.4222 | 0.4236 | 0.4251 | 0.4265 | 0.4279 | 0.4292 | 0.4306 | 0.4319 |
| 1.5 | 0.4332 | 0.4345 | 0.4357 | 0.437 | 0.4382 | 0.4394 | 0.4406 | 0.4418 | 0.4429 | 0.4441 |
| 1.6 | 0.4452 | 0.4463 | 0.4474 | 0.4484 | 0.4495 | 0.4505 | 0.4515 | 0.4525 | 0.4535 | 0.4545 |
| 1.7 | 0.4554 | 0.4564 | 0.4573 | 0.4582 | 0.4591 | 0.4599 | 0.4608 | 0.4616 | 0.4625 | 0.4633 |
| 1.8 | 0.4641 | 0.4649 | 0.4656 | 0.4664 | 0.4671 | 0.4678 | 0.4686 | 0.4693 | 0.4699 | 0.4706 |
| 1.9 | 0.4713 | 0.4719 | 0.4726 | 0.4732 | 0.4738 | 0.4744 | 0.475 | 0.4756 | 0.4761 | 0.4767 |
| 2.0 | 0.4772 | 0.4778 | 0.4783 | 0.4788 | 0.4793 | 0.4798 | 0.4803 | 0.4808 | 0.4812 | 0.4817 |
| 2.1 | 0.4821 | 0.4826 | 0.483 | 0.4834 | 0.4838 | 0.4842 | 0.4846 | 0.485 | 0.4854 | 0.4857 |
| 2.2 | 0.4861 | 0.4864 | 0.4868 | 0.4871 | 0.4875 | 0.4878 | 0.4881 | 0.4884 | 0.4887 | 0.489 |
| 2.3 | 0.4893 | 0.4896 | 0.4898 | 0.4901 | 0.4904 | 0.4906 | 0.4909 | 0.4911 | 0.4913 | 0.4916 |
| 2.4 | 0.4918 | 0.492 | 0.4922 | 0.4925 | 0.4927 | 0.4929 | 0.4931 | 0.4932 | 0.4934 | 0.4936 |
| 2.5 | 0.4938 | 0.494 | 0.4941 | 0.4943 | 0.4945 | 0.4946 | 0.4948 | 0.4949 | 0.4951 | 0.4952 |
| 2.6 | 0.4953 | 0.4955 | 0.4956 | 0.4957 | 0.4959 | 0.496 | 0.4961 | 0.4962 | 0.4963 | 0.4964 |
| 2.7 | 0.4965 | 0.4966 | 0.4967 | 0.4968 | 0.4969 | 0.497 | 0.4971 | 0.4972 | 0.4973 | 0.4974 |
| 2.8 | 0.4974 | 0.4975 | 0.4976 | 0.4977 | 0.4977 | 0.4978 | 0.4979 | 0.4979 | 0.498 | 0.4981 |
| 2.9 | 0.4981 | 0.4982 | 0.4982 | 0.4983 | 0.4984 | 0.4984 | 0.4985 | 0.4985 | 0.4986 | 0.4986 |
| 3.0 | 0.4987 | 0.4987 | 0.4987 | 0.4988 | 0.4988 | 0.4989 | 0.4989 | 0.4989 | 0.499 | 0.499 |
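The entries in this table can be reproduced numerically. A short check, assuming the table reports P(0 ≤ Z ≤ z) = Φ(z) − 1/2:

```python
from math import erf, sqrt

def table_entry(z):
    """Area under the standard normal curve between 0 and z."""
    Phi = 0.5 * (1.0 + erf(z / sqrt(2.0)))  # standard normal CDF
    return Phi - 0.5

print(round(table_entry(1.0), 4))   # 0.3413, matching row 1.0, column 0.00
print(round(table_entry(1.96), 4))  # 0.475, matching row 1.9, column 0.06
```

This is a convenient way to verify a value of Φ or the inverse Mills ratio without consulting the printed table.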

### References

Bourguignon, François, Martin Fournier, and Marc Gurgand (2007). Selection Bias Corrections Based on the Multinomial Logit Model: Monte Carlo Comparisons. Journal of Economic Surveys 21(1): 174-205.

Chiburis, Richard and Michael Lokshin (2007). Maximum Likelihood and Two-Step Estimation of an Ordered-Probit Selection Model. The Stata Journal 7(2): 167-182.

Dahl, G. B. (2002). Mobility and the Returns to Education: Testing a Roy Model with Multiple Markets. Econometrica 70(6): 2367-2420.

Dubin, Jeffrey A. and Douglas Rivers (1989). Selection Bias in Linear Regression, Logit and Probit Models. Sociological Methods and Research 18(2 & 3): 360-390.

Heckman, James (1974). Shadow Prices, Market Wages and Labor Supply. Econometrica 42(4):679-694.

Heckman, James (1976). The Common Structure of Statistical Models of Truncation, Sample Selection and Limited Dependent Variables and a Simple Estimator for Such Models. The Annals of Economic and Social Measurement 5: 475-492.

Heckman, James (1979). Sample Selection Bias as a Specification Error. Econometrica 47(1): 153-161.

Jimenez, Emanuel and Bernardo Kugler (1987). The Earnings Impact of Training Duration in a Developing Country: An Ordered Probit Model of Colombia's Servicio Nacional de Aprendizaje (SENA). Journal of Human Resources 22(2): 230-233.

Lee, Lung-Fei (1983). Generalized Econometric Models with Selectivity. Econometrica 51(2): 507-512.

McFadden, Daniel L. (1973). Conditional Logit Analysis of Qualitative Choice Behavior. In P. Zarembka (ed.), Frontiers in Econometrics (New York: Academic Press).

Newey, W. K. and Daniel L. McFadden (1994). Large Sample Estimation and Hypothesis Testing. In R. F. Engle and D. L. McFadden (eds.) Handbook of Econometrics (Amsterdam: North Holland).

Schmertmann, Carl P. (1994). Selectivity Bias Correction Methods in Polychotomous Sample Selection Models. Journal of Econometrics 60(1): 101-132.

Vella, Francis (1998). Estimating Models with Sample Selection Bias: A Survey. The Journal of Human Resources 33(1):127-169.

## Footnotes

1. Because the mean and variance of the standard normal distribution are 0 and 1, respectively, its probability density function (pdf) is φ(z) = (1/√(2π)) e^(−z²/2) and its cumulative distribution function is Φ(z) = ∫ from −∞ to z of φ(t) dt.
