Until now all variables have been assumed to be quantitative in nature, which is to say that they have been continuous. However, many interesting variables are expressed in qualitative terms, such as gender, educational level, time periods and seasons, private or public, and so forth. Such qualitative measures have to be transformed into a numerical proxy before they can be represented and used in a regression. Dummy variables are discrete transformations used for this purpose. They are artificial variables that work as proxies for qualitative variables, and since they are discrete we need to be careful when we interpret them. The purpose of this chapter is to describe different techniques for using dummy variables, and categorical variables in general, and to explain how to interpret them.

Gender is a typical example of a qualitative variable that needs to be transformed into numerical form before it can be used in a regression. Since gender can be male or female, it is a categorical variable with two categories. We therefore need to decide which category the dummy should represent and which category should be used as the reference. If the dummy variable should represent men, the discrete variable D takes the value 1 for men and the value 0 for all other observations in the data set. It is therefore important to make sure that all the other observations really represent what you want the reference category to represent. A dummy variable for men can therefore be expressed in this way:

$$D = \begin{cases} 1 & \text{if the individual is a man} \\ 0 & \text{otherwise} \end{cases} \tag{8.1}$$

When running the regression you can treat the dummy variable D like any other variable included in the model. The variable D could take numerical values other than 1 and 0, for instance 9 and 8, without any effect on its coefficient, as long as there is a unit difference between the two values; only the intercept changes. However, the interpretation is easiest when using 1 and 0, which is why we follow the standard structure of (8.1).
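The claim about alternative codings can be checked numerically. The following is a minimal sketch on simulated data; the sample size, the coefficient values and the helper function `ols` are assumptions made for illustration and do not come from the text.

```python
import numpy as np

# Simulated data (all numbers are hypothetical, chosen only for illustration).
rng = np.random.default_rng(0)
n = 200
male = rng.integers(0, 2, size=n)        # 1 = man, 0 = otherwise
x = rng.uniform(8, 18, size=n)           # years of schooling
y = 50 + 20 * male + 5 * x + rng.normal(0, 10, size=n)

def ols(y, *regressors):
    """Return OLS coefficients for y on a constant and the given regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

coef_01 = ols(y, male, x)        # dummy coded 1/0
coef_98 = ols(y, male + 8, x)    # the same dummy coded 9/8

# The dummy coefficient is identical under both codings;
# the intercept absorbs the shift of 8 units in the dummy.
print("coding 1/0:", coef_01)
print("coding 9/8:", coef_98)
```

With the 9/8 coding the intercept no longer describes any group in the data, which is why the 1/0 coding is easier to interpret.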

8.1. Intercept dummy variables

The most basic application of dummy variables is when only the intercept is affected. Using the categorical variable defined by (8.1) we can form the following model with two explanatory variables:

$$Y = B_0 + B_1 D + B_2 X + U \tag{8.2}$$

As can be seen from (8.1), D takes only two values. If we form the conditional expectation for each of the two categories of D we obtain:

$$E[Y \mid D = 1, X] = B_0 + B_1 + B_2 X \tag{8.3}$$

$$E[Y \mid D = 0, X] = B_0 + B_2 X \tag{8.4}$$

The only thing that differs between the two expectations is the coefficient for the dummy variable. When D=1, the conditional expectation in (8.3) contains two constants, B0 and B1, whose sum represents the intercept in that case. When D=0, the conditional expectation is given by (8.4), which contains only one constant, B0. Hence, the model as a whole contains two intercepts: B0 and B0+B1.

If we take the difference between the two conditional expectations we obtain:

$$E[Y \mid D = 1, X] - E[Y \mid D = 0, X] = B_1 \tag{8.5}$$

which equals the coefficient for the dummy variable. Since our binary variable D is discrete, we cannot take the derivative of Y with respect to D: a derivative requires a continuous variable and is therefore undefined here. To find the corresponding marginal effect in this case we have to form the difference given by (8.5) and conclude that when D moves from 0 to 1, the conditional expectation of Y changes by B1 units, which represents the marginal effect for the linear model. In the linear model it makes no difference if we treat the dummy variable as if it were continuous when calculating the marginal effect, since the two results coincide. With other functional forms, however, it does make a difference.
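The difference between the two conditional expectations can be illustrated with a small simulation. This is only a sketch: the data are simulated, and the parameter values and the helper `predict` are assumptions made for illustration.

```python
import numpy as np

# Simulated wage-like data (hypothetical parameter values).
rng = np.random.default_rng(1)
n = 500
d = rng.integers(0, 2, size=n)           # dummy: 1 or 0
x = rng.uniform(8, 18, size=n)           # continuous regressor
y = 40 + 22 * d + 6 * x + rng.normal(0, 12, size=n)

# Estimate Y = B0 + B1*D + B2*X + U by OLS.
X = np.column_stack([np.ones(n), d, x])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

def predict(d_value, x_value):
    """Fitted conditional expectation of Y given D and X."""
    return b0 + b1 * d_value + b2 * x_value

# Difference between the fitted expectations at several values of X.
x_values = np.array([9.0, 12.0, 16.0])
diff = predict(1, x_values) - predict(0, x_values)

print("difference at each X:", diff)
print("dummy coefficient b1:", b1)
```

The difference is the same at every value of X and equals the estimated dummy coefficient, which is what a pure intercept shift means.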

Example 8.1

Assume the following regression result from a model given by (8.2), with Y being the hourly wage rate, D a dummy for men, and X a variable for years of schooling. The dependent variable is expressed in Swedish kronor (SEK). Standard errors are given in parentheses:

Use the regression results to calculate how much higher the average hourly wage rate is for men. First we have to check whether the coefficient for the male dummy is significant. With a t-value equal to 5, the coefficient is significantly different from zero at any conventional significance level. The marginal effect measured with this regression says that men earn on average 21.9 SEK per hour more than women, controlling for years of schooling.

In the empirical human capital literature the functional form most often used is the log-linear, which means that our model would look like this:

$$\ln Y = B_0 + B_1 D + B_2 X + U \tag{8.6}$$

To find the marginal effect here, we have to remember that it is the effect on Y that is of interest, not the effect on ln Y. The first step must therefore be to transform the regression equation using the antilog and form the conditional expectation of Y. Doing that we obtain:

$$E[Y \mid D, X] = e^{B_0 + B_1 D + B_2 X}\, E[e^{U}] = e^{B_0 + B_1 D + B_2 X + \sigma_U^2/2} \tag{8.7}$$

where $\sigma_U^2$ represents the population variance of the error term. Hence, in order to obtain the conditional expectation given by (8.7) we have to assume that U is a normally distributed variable with mean zero and variance $\sigma_U^2$. When that is the case, it is possible to show that $E[e^{U}] = e^{\sigma_U^2/2}$.
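The expectation of the exponentiated error term can be checked by simulation. The sketch below assumes an arbitrary standard deviation of 0.5 for U; that value and the number of draws are illustrative assumptions only.

```python
import numpy as np

# Draw a large sample of normally distributed errors with mean zero.
rng = np.random.default_rng(42)
sigma = 0.5                                  # hypothetical standard deviation of U
u = rng.normal(0.0, sigma, size=1_000_000)

# Sample mean of e^U compared with the closed-form value exp(sigma^2 / 2).
simulated = np.exp(u).mean()
theoretical = np.exp(sigma**2 / 2)

print("simulated   E[e^U]:", simulated)
print("theoretical E[e^U]:", theoretical)
```

Note that the expectation exceeds 1 even though U has mean zero, which is why the antilog of the fitted equation alone understates the conditional expectation of Y.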

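B1 would have represented the relative change in Y from a unit change in D if D had been continuous. Since D is not continuous, we have to calculate the relative change in Y using the conditional expectation given by (8.7) instead. Doing that we obtain:

Relative change:

$$\frac{E[Y \mid D = 1, X] - E[Y \mid D = 0, X]}{E[Y \mid D = 0, X]} = \frac{e^{B_0 + B_1 + B_2 X + \sigma_U^2/2} - e^{B_0 + B_2 X + \sigma_U^2/2}}{e^{B_0 + B_2 X + \sigma_U^2/2}} = e^{B_1} - 1 \tag{8.8}$$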

Hence, to find the relative change in the conditional expectation of Y, we simply take the estimated value b1 and apply the formula given above. To find the corresponding standard error of the relative change we apply a linear approximation to the non-linear expression. Doing so, we end up with the following formula:

$$se\left(e^{b_1} - 1\right) \approx e^{b_1}\, se(b_1) \tag{8.9}$$

Example 8.2

Assume the following regression results from a model given by (8.6), with Y being the hourly wage rate, D a dummy for men, and X a variable for years of schooling. The dependent variable is expressed in Swedish kronor (SEK). Standard errors are given in parentheses:

Since we are interested in the marginal effect of D on Y, we have to calculate it using the regression results. By (8.8) and (8.9) we obtain:

The t-value for the marginal effect equals 8.2, which is well above the critical value at any conventional level of significance. This implies a positive and significant relative change of 19.7 percent. That is, men earn on average 19.7 percent more per hour than women, controlling for education.
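A calculation of this kind can be sketched in a few lines. The regression output itself is not reproduced here, so the inputs below (a dummy coefficient of 0.18 with a standard error of 0.02) are assumptions, chosen only because they give results of roughly the size reported in this example; they are not taken from the text. The standard error uses the usual linear (delta-method) approximation, and the function name `relative_change` is hypothetical.

```python
import math

def relative_change(b1, se_b1):
    """Relative change in E[Y] when a dummy goes from 0 to 1 in a log-linear model.

    Returns the effect exp(b1) - 1, its approximate standard error
    exp(b1) * se(b1) from a first-order linear approximation, and the
    ratio of the two (the t-value).
    """
    effect = math.exp(b1) - 1.0
    se_effect = math.exp(b1) * se_b1
    return effect, se_effect, effect / se_effect

# Hypothetical inputs, NOT taken from the text.
effect, se_effect, t_value = relative_change(0.18, 0.02)

print(f"relative change: {effect:.4f}")
print(f"standard error:  {se_effect:.4f}")
print(f"t-value:         {t_value:.2f}")
```

With these assumed inputs the relative change is a little under 20 percent and the t-value a little above 8.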

Observe that the estimated coefficient is very close to the calculated relative change given by (8.8). It turns out that when the estimated coefficient is lower than 0.3 in absolute terms, the coefficient itself is a very good approximation to the exact value given by (8.8), and it is therefore often used directly as such.

Observe that

$$e^{b_1} - 1 \approx b_1 \quad \text{when } b_1 \text{ is small,}$$

therefore researchers often use b1 directly instead of the calculated value given by (8.8).
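To get a feeling for how quickly the approximation deteriorates, the exact relative change can be tabulated against the coefficient itself. This is a minimal sketch; the helper name `approximation_gap` and the grid of coefficient values are chosen only for illustration.

```python
import math

def approximation_gap(b1):
    """Difference between the exact relative change exp(b1) - 1 and b1 itself."""
    return (math.exp(b1) - 1.0) - b1

# Compare the coefficient with the exact relative change for a few values.
print("   b1    exact      gap")
for b1 in (0.05, 0.10, 0.20, 0.30, 0.50, 1.00):
    exact = math.exp(b1) - 1.0
    print(f"{b1:5.2f}  {exact:7.4f}  {approximation_gap(b1):7.4f}")
```

The gap grows with the size of the coefficient, so for large coefficients the exact formula should be used rather than the coefficient itself.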
