"his friend Mr. Darcy soon drew the attention of the room by his fine, tall person, handsome features, noble mien; and the report which was in general circulation within five minutes after his entrance, of his having ten thousand a year." Jane Austen, Pride and Prejudice.
Taller men are more desirable marriage partners.
One theory is that there is a social norm that husbands should be taller than their wives. A man who is over 6'3" has his pick of socially acceptable partners: almost any woman will fit the "wife shorter than husband" norm. A man who is 5'4" has fewer partners who are socially acceptable, height wise, especially since some of the women in his height range may marry taller men.
If a larger pool of potential partners means that one is more likely to make a good match, and "making a good match" consists of getting married and staying married, we would expect to see a positive relationship between a man's height and the probability of him being married.
Another theory - one based on biological research - is that a man's height is a signal of his health and reproductive fitness. According to this theory, however, height is only valuable up to a point - excessively tall men might find it harder to find partners.
Step 1: Define your terms.
There are four marital status categories in the Canadian Community Health Survey:
1. married 2. common-law 3. widowed/separated/divorced and 4. single/never married.
The hypothesis: "taller men are more likely to be married" could be interpreted in any number of ways:
a. taller men are more likely to be in category 1 (married) as opposed to any of the other three categories.
But in Canada, unlike the US, there is no difference between marriage and cohabitation when it comes to the taxes people pay or for most other practical purposes. There is more of a difference in Quebec - but people in that province mostly opt for cohabitation instead of marriage. If by "marriage" what we really mean is "in a committed long-term relationship" perhaps "taller men are more likely to be married" should be interpreted as:
b. taller men are more likely to be in category 1 (married) or category 2 (cohabiting) as opposed to being single/separated/widowed/divorced.
Ideally I would like to drop widowers from the sample, because I don't know whether they are more like the married people (they would still have a good match if their spouse hadn't died) or the single/divorced people. One way out of this difficulty is just to drop the widowed (and also the separated and divorced, since those categories are lumped together) and examine the hypothesis:
c. taller men are more likely to be in category 1 (married) as opposed to category 4 (single/never married).
Alternatively, the impact of widowers on the sample results can be minimized by focussing on a younger age group, say 25 to 50 year olds. (This also resolves the question of how to deal with people shrinking as they age).
A good applied economist is methodical and careful, and always asks: does it matter?
So I estimated the relationship between height and marital status for Canadian men between 25 and 50.
Without controlling for any of the myriad other factors that influence the probability of being married, these are the relationships I found using a logit regression analysis.
A height of 190 cm is just under 6'3", or just taller than Greg Mankiw. What this graph tells us is that a man who is 190 cm tall has about a 49 percent chance of being married (as opposed to cohabiting/being single/divorced/widowed etc). He has about a 64 percent chance of being married/cohabiting, as opposed to being single/divorced/separated/widowed. And if we take both the cohabiting folks and the single/divorced/separated/widowed folks out of the equation, the pattern doesn't change much.
But the more interesting economic question is: how much does a change in height change the probability of being married? A visual inspection of the three lines here suggests that they all have similar slopes - a 10 cm increase in height increases the predicted probability of being married by about 2 percentage points. This is, in a sense, reassuring - so far it appears that the height-marriage relationship is not extremely sensitive to our definition of marriage. It also tells us that the action is being generated by the married and single people, as opposed to the cohabiting, separated, divorced, or widowed ones.
My own view is that the interesting research question is whether or not people are in committed, long-term relationships, not whether or not they have a ring on the fourth finger of their left hand. So throughout the rest of this analysis, I will focus on the probability of being married/cohabiting as opposed to single/divorced/separated/widowed.
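For concreteness, the dependent variable in the regressions that follow is a zero-one indicator of being married or cohabiting. A minimal sketch of how such a variable might be constructed in Stata - the marital status variable name and its coding here are placeholders, not the actual CCHS codes:

* "maritalstatus" is a placeholder name; suppose 1 = married, 2 = common-law
gen marriedcohabit = inlist(maritalstatus, 1, 2)
replace marriedcohabit = . if missing(maritalstatus)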
Step 2: Decide upon estimation technique.
The first form of regression analysis students are taught is least squares regression:
regress marriedcohabit HWTGHTM if male==1 & age>=25 & age<=50
It works. It runs quickly. It provides easy-to-interpret coefficients: a 1 cm increase in height increases the probability of being married by 0.18 percentage points, significant at p=0.000.
The problem with least squares regression, as applied to probability models, is that there is nothing to stop the model from predicting probabilities greater than one or less than zero. As a result, econometricians typically recommend using either "logit" or "probit" models to explain zero-one data.
A "logistic regression" fits data to a curve that is constrained to fit between zero and one (thank you Wikipedia for the nice picture).
Mathematically, a logistic regression estimates a function of the form:

f(z) = 1/(1 + e^(-z))

where:

z = a + b1x1 + b2x2
It can be estimated in Stata by using the command logit, for example:
logit marriedcohabit HWTGHTM if male==1 & age>=25 & age<=50
Now the most important thing to remember about logit is that it is not WYSIWYG - that is, what you see is not what you get. A logit regression will give you the values for b1, b2, etc. in the equation for z above. In the height/marital status example, they look something like this - and tell you that b1 is 0.75 - whatever that is supposed to mean.
What you are interested in - what you want to get - is the slope of the red line in the picture. That slope tells you the relationship between height and marital status. There's no simple way of figuring it out from the coefficients on your explanatory variables in your regression output.
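One way to see why the raw coefficients don't tell you what you want: for the logit model, the marginal effect of x1 on the predicted probability works out to

dP/dx1 = b1*f(z)*(1 - f(z))

where f(z) is the logistic function from the equation above, typically evaluated at the means of the explanatory variables. Since f(z)*(1 - f(z)) can never be bigger than 0.25, the marginal effect is always smaller than the raw logit coefficient.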
So, except for a brief glance at the p values (are they nice and small? is it worth going on?) and the descriptive stats (everything looking o.k.?) you should mostly ignore the coefficients coming out of your logit regression. The way to figure out the slope of the red line is to calculate the marginal effects - that's the effect of a change in height on the probability of being married for a person of average height. The marginal effects command can only be run after you've done your logistic regression, and sounds like the name of an econometricians' rock band:
mfx
The value of "dy/dx" from the marginal effects command is what you're interested in. At 0.18 it is pretty much exactly the same as the results from the linear regression.
There's another alternative, namely the probit model. Most logit commands also work with probit. One thing that's handy with probit, however, is the command dprobit, for example:
dprobit marriedcohabit HWTGHTM if male==1 & age>=25 & age<=50
This computes the marginal effects directly (again, generating an estimate of 0.18). There is no need to run the mfx command after the regression.
In the model estimated here - the effect of height on marital status - logit, probit and linear models all give exactly the same result, as is shown in the accompanying diagram (this picture is the equivalent of the red curve in the logit diagram above). So which is better? David Giles here argues for the careful and judicious use of logit, although the discussion that follows his blog post refers to Mostly Harmless Econometrics's defense of linear models on pages 103-7.
Personally I tend to use logit or probit - but I try to be careful to always report the marginal effects.
Step 3: Select your sample.
So far, neither the definition of marriage nor the choice of estimation technique has made much difference to the predicted impact of height on marriage. So which modelling choices actually matter?
The first choice that matters is sample selection: who are you analyzing? The graph below shows what happens when I estimate the impact of height on the probability of being married or cohabiting for three groups of people: males 25-50, males 20-65 and males and females 20-65.
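In terms of the commands used above, the three samples amount to something like this (a sketch):

* males aged 25 to 50
logit marriedcohabit HWTGHTM if male==1 & age>=25 & age<=50
* males aged 20 to 65
logit marriedcohabit HWTGHTM if male==1 & age>=20 & age<=65
* males and females aged 20 to 65
logit marriedcohabit HWTGHTM if age>=20 & age<=65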
The positive height/marriage relationship disappears once the population is expanded to include younger and older men, and to include females also.
It's straightforward to understand why age matters. Young people are taller than old people, in part because the average height of Canadians has been increasing over time, and in part because people start to shrink once they hit middle age. People in their early 20s are unlikely to be married, as many are still completing their education and searching for a partner. They're also tall. Throwing unmarried 21-year-olds into the regression without including a control for age leads to a negative height/marriage relationship (especially if the population also includes a lot of shorter, older, married folks).
Gender affects the results also. Recall the hypothesis: taller men are more likely to be married because of the social norm stating that a husband should be taller than his wife. But this social norm predicts a negative relationship between a woman's height and her probability of marriage. This seems to be coming out of our data, but the women's results, like the men's, could be driven by an age effect.
These results show that sample selection matters. So what sample should one choose?
A few considerations are important.
First, who does your theory apply to? Suppose, for example, you are trying to decide: "should I run my regressions for men only, women only, for men and women separately, or pool men and women into one sample?"
If men and women face basically the same trade-offs, and are predicted to react in similar ways in a given situation, then it makes sense to combine men and women into a single regression estimate. For example, if the research question is "how much do people's driving habits change when the price of gasoline increases?" the most interesting answer will be one derived from an analysis of the whole population.
In other situations, however, men and women might be expected to behave differently. Take, for example, the research question "how does providing care for aging parents affect labour force decisions?" If daughters are more likely to provide hands-on care, such as household or nursing care, whereas sons are more likely to provide financial assistance, then population aging would be expected to have a greater impact on women's labour force participation. In this case, it might be a good idea to analyze men and women separately.
But even then separate regressions are not always necessary. By including interaction terms, for example, height*male, it is possible to allow a particular variable to affect men and women differently within the context of a single regression equation.
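A sketch of what that looks like in Stata, using the variables from the regressions above (a pooled sample of men and women aged 25 to 50 is assumed here):

* manual interaction term: allows height to have a different effect for men and women
gen height_male = HWTGHTM*male
logit marriedcohabit HWTGHTM male height_male if age>=25 & age<=50
* equivalent factor-variable syntax in Stata 11 and later
logit marriedcohabit c.HWTGHTM##i.male if age>=25 & age<=50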
A second consideration is: who do I have reliable information about? For example, suppose you are trying to estimate the impact of university education on earnings, and your sample includes immigrants. You don't know whether those immigrants received their university education in Canada or elsewhere. But you do know that foreign educational credentials are not valued in the Canadian labour market, so you want to exclude anyone with a foreign educational credential from your analysis. A quick way to do that is to drop immigrants.
But again this demonstrates how dangerous dropping people can be - a Canadian university degree might be worth more (or less) to immigrants than to the native born, so dropping immigrants could lead to an inaccurate estimate of the value of university education. This leads to a third consideration:
Ultimately, who are you interested in? For example, suppose you are interested in predicting how much labour supply would change in response to a decrease in tax rates. If the aim is to generalize results to the entire Canadian labour force, it is best to have a sample of the entire labour force.
Again, the only good advice is: be careful, be methodical, and check the sensitivity of your results to your assumptions. Provide the reader with information about how much your choice of sample affects the results, and let the reader decide what to believe.
Step 4: Choose your explanatory variables
Height is not the only predictor of marital status. Age matters, too - it takes time to find a partner. Education provides a way to meet people, and there is also some research that suggests people who have the stick-with-it-ness to complete a university degree are less likely to get divorced, hence education is related to marital status. Ethnic and cultural variables are likely important also, as cultures vary on the importance attached to marriage. Also, being a member of a minority group may affect the probability of finding and sticking with a good match.
This is where the economics comes in: coming up with an explanation of what should be expected to matter, and why. Reading other papers in the literature can help - but sometimes other papers get it wrong, so "other people have done it" is not the most convincing justification in the world.
Interestingly, although age, education, language and ethnicity are significant predictors of the probability of being married, including these variables does not affect the estimated impact of height on marriage. (I also re-ran the regressions with an upper age limit of 60. The impact of height in the "no other controls" regression was reduced, but the impact of height in the second model, with controls for age and so on, was similar to that reported below: 0.18).
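The specification with controls looks something like the sketch below; the control variable names (other than HWTGHTM) are placeholders rather than the actual CCHS variable names:

* height plus controls for age, education, language and ethnicity (placeholder names)
logit marriedcohabit HWTGHTM age i.education i.language i.ethnicity if male==1 & age>=25 & age<=50
mfx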
Because I wanted an example of something that might change the estimated impact of height on marriage, I threw income into the regression equation too.
The inclusion of income substantially reduces the estimated impact of height on marital status.
What does that tell us?
It's hard to say. Because people work to provide for their families, the positive relationship between income and marital status might be coming from the fact that marriage leads to higher incomes, rather than the other way around. (The male "marriage premium" is a well-established phenomenon, but whether high incomes cause marriage or marriage increases incomes is a source of debate.)
Including income, therefore, might cause us to under-estimate the impact of height on marriage.
On the other hand, in a previous post I showed results establishing a positive relationship between height and income. If we don't include income, we might think that taller men are more likely to get married - when in fact it's not their height, but their income, that makes them desirable partners.
Sometimes there are ways of dealing with these situations and getting at the underlying issues of causality.
But what if you're an undergraduate writing an honours essay, and you're struggling to work out the basics of logit, probit, and so on, and you feel (rightly) uncomfortable with running complex models that you don't really understand?
The best strategy is to present a variety of results, explain your choices, and let the reader decide. Sometimes it's useful to see how things are related, even if the issue of causality - which came first, the marriage or the income? - can never be fully resolved.
Update: after I published this post, a number of commentators pointed out that the relationship between height and marriage is likely non-linear. I added a fourth regression specification, and found that it is. This fits with the theory that extreme height doesn't signal reproductive fitness as well as above-average height.
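A simple way of capturing that kind of non-linearity is to add a quadratic term in height, along these lines (a sketch):

* quadratic in height: lets the effect of height taper off, or turn negative, at the top of the distribution
gen HWTGHTM2 = HWTGHTM^2
logit marriedcohabit HWTGHTM HWTGHTM2 if male==1 & age>=25 & age<=50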
A few notes on these regression results:
Generating tables from standard Stata output is a hugely time-intensive and cumbersome task. I generated these tables using the command outreg2, for example:
dprobit marriedcohabit HWTGHTM if male==1 & age>=25 & age<=50
outreg2 HWTGHTM using height_table, replace bdec(3) aster(se) excel bracket(se) addstat(Pseudo R-squared, `e(r2_p)', LR Chi2, e(chi2))
Unfortunately this is not a standard Stata command (at least, it didn't come with the version I have on my computer). If you have your own copy of Stata, it is easy to download and install outreg2. But I don't know if it is on the versions of Stata in computing labs.
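Where installation is possible, outreg2 can be downloaded from the SSC archive with a single command:

ssc install outreg2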
Chris Auld has recently put up a post on producing tables with Stata here. Unfortunately I'm not familiar with all of the short-cuts he uses in that post, so I ended up resorting to outreg2. (Update, Chris explains in more detail below.)
There are a few things about this regression output that are a bit sloppy - places where I would take marks off if I was grading myself. For example, instead of dropping everyone with missing observations at the beginning, I let my sample size change as I added more variables. Once income was added, the sample size dropped by 5-10 percent because of missing income responses. This is poor practice. It means that I can't tell whether the changes in the coefficient on height are due to changes in the sample or changes in the explanatory variables.
It is also unnecessary, in this example, to drop observations. Since I have a huge sample size, and I am entering income as a series of dummy variables anyway, I could just create a variable indicating "income missing" and include that as an explanatory variable. (This would necessitate recoding the other income dummies also, unfortunately). This is a good general rule in applied economics: never throw away valuable information unnecessarily. (Update: I added a dummy for "missing income", which improved the fit).
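A sketch of that approach - the income variable name and the category code here are placeholders:

* give respondents with missing income their own category rather than dropping them
* ("incomecat" is a placeholder for the categorical income variable; 99 is an arbitrary code)
gen incomecat2 = incomecat
replace incomecat2 = 99 if missing(incomecat)
* i.incomecat2 now generates a "missing income" dummy alongside the other income dummies
logit marriedcohabit HWTGHTM i.incomecat2 if male==1 & age>=25 & age<=50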
So what is the answer to the question asked in the title of this post: what actually matters?
Sample selection and choice of explanatory variables matter. A lot. Judgement calls have to be made. The best thing to do is be careful, make choices informed by economic theory, and provide enough information to let the reader make up his or her own mind about your results.
Frances: Nice post. I knew we weren't that far apart!
BTW: I used to play in a rock band about 45 years ago. In the unlikely event that we re-form, I'll be pushing for calling us "mfx".
Posted by: David Giles | October 12, 2011 at 11:59 AM
Frances, thanks for these posts. It has been so long since I have done any empirical work that I don't even really know how to start...these help.
Posted by: Linda Welling | October 12, 2011 at 01:55 PM
"It is a social norm: a husband should be taller than his wife."
Really? The last time someone told me that was my grandmother when I was in grade school. I thought that it was old-fashioned when I first heard it. (I am not Canadian, though. :))
Suppose that a woman is 5'4" tall. A man who is 6'1" is taller than she, but so is a man who is 5'6". Yet she may well prefer the taller man. (In fact, tallness in men is generally a favored trait.) Even if our only explanation is in terms of physical characteristics, tall men may be favored in the "marriage market", even without a social norm that says that husbands should be taller than their wives. Maybe the point is that they are taller than other men. Any comparison of men to men misses that distinction.
Posted by: Min | October 12, 2011 at 02:00 PM
Good post Frances. A few thoughts:
- Logit, probit, and LPM are all generally misspecified models. The correct question is not which one is "correct" (we know the answer to that question); the question is which is most useful, or perhaps least misleading, in some specific context.
- In any context in which LPM is clearly the wrong model it is probably also true that probit and logit are also no good. You want to use a semiparametric or other flexible approach if you're really worried about this aspect of the specification. Such estimators are now routine in estimating propensity scores, for example, a context in which LPM is usually no good.
- Using a robust covariance matrix estimator gets around the heteroskedasticity problem introduced by LPM.
- Dealing with interaction terms in nonlinear models is a pain. I think I'll put something on my blog about this issue later today.
- I think the example I put up of how to easily produce nice tables in Stata was a little overly complex. Try removing the looping over estimators, ie, just write in "regress" or whatever you like where I've got `estimator'. A block of code to produce a nice table in .html looks like:
regress y x1
estimates store e1
regress y x1 x2
estimates store e2
esttab *, b(%8.3f) t(%7.2f) html
Posted by: Chris Auld | October 12, 2011 at 02:36 PM
Dave, Linda, Chris - thank you for the comments, and for the code too. It's really valuable to get your reaction.
Min - a couple of students in class today had reactions similar to yours.
Yes, there is a literature suggesting that, within a relatively homogeneous population, height is correlated with health and other good outcomes, presumably via nutrition and early environment. So a preference for height could be just a preference for health etc. This tends to be the way that the biological literature approaches the issue - women prefer taller men (up to a point) because height signals health.
And it is also true that, when you look at women, you don't pick up a strong inverse height/marriage relationship, though the biological literature that I've read finds that very tall women have a lesser chance of "reproductive success" (as the biologists put it), and the women with the maximum fertility are those with slightly below average height, see for example, http://rspb.royalsocietypublishing.org/content/269/1503/1919.short
Posted by: Frances Woolley | October 12, 2011 at 03:46 PM
Great post Frances.
Before running the regression, I like to have a quick look at what the "descriptive data" looks like. In this case, I would probably have created height categories (1.50 to 1.55 m, 1.55 to 1.60 m, etc.) and graphed the actual proportion of married men by height category.
I think it adds to the story to be able to say "Men that are 1.80 m or more are married 10 percentage points more often than men that are 1.60 m or less. What explains this?" Then a Blinder–Oaxaca decomposition would (if this were an OLS model) allow us to say that half of this difference is explained by income or age and the other half is only explainable by the "height difference".
Also, if you're interested in computing the "average marginal effect" instead of the "marginal effect at the mean", you might want to consider the *margeff* Stata function. I'm not sure if it is still up to date as I haven't used Stata in a while. It is presented here:
http://econpapers.repec.org/software/bocbocode/s445001.htm
Posted by: SimonC | October 12, 2011 at 04:00 PM
SimonC, yes, that would be a very good idea. I was thinking of adding height squared to capture non-linearities in the effect of height, but I just got so carried away with those diagrams. Alternatively one could include dummy variables indicating that a person is in either tail of the height distribution.
Posted by: Frances Woolley | October 12, 2011 at 04:10 PM
As someone whose wife thinks she's taller than I am, I found this interesting. Maybe my higher income makes up for it.
Now can you explain what I fairly often observe, the pairing up of overly tall men with relatively short women?
Posted by: Jim Sentance | October 12, 2011 at 06:27 PM
Jim - in response to your second point - perhaps there's a marriage market penalty to being overly tall. When I updated the table to include a height squared term it fit the data much better, suggesting that the height/marital status relationship is indeed non-linear.
Posted by: Frances Woolley | October 12, 2011 at 11:35 PM
Just a nit-pick - In Quebec, cohabitation is most certainly not legally the same as marriage:
http://www.justice.gouv.qc.ca/english/publications/generale/union-a.htm
In much of the rest of Canada cohabitation does, in the limit, approach being 'real' marriage.
But I get what you meant - as a practical matter, people in QC who think of themselves as married are, more often than not, not in fact married.
Posted by: Patrick | October 13, 2011 at 12:50 AM
Patrick -
Yes, you're right, and I should have been clearer on that, and I've changed the post. What I had in my mind is that in terms of economic and other variables, cohabiting couples in Quebec are more like married couples in the rest of Canada. E.g. if you have a model and you have "married" and "cohabiting" as two different explanatory variables, then it's a good idea to include an interaction variable cohabiting*Quebec to capture the fact that cohabiting couples in Quebec are different from cohabiting couples elsewhere.
But they're different in part because people choose cohabitation in Quebec in order to avoid particular property regimes.
Posted by: Frances Woolley | October 13, 2011 at 07:48 AM
Thanks Frances, I wish I had read this on Wednesday when you wrote it! I was working with one of our policy guys who had run a logit regression and he was asking me to help him with the specification and the problem of collinearity amongst his variables (the parallel in this case is the height-income relationship). I have to admit I was getting myself stuck a step before that with fully understanding the coefficients and how to present the results. It's one thing to conclude the model specification is correct but it still needs to be explained in a verbal language!
As you say, who/what does my data apply to - and how is that understood by my audience? Choose the model carefully (grounded in theory) and be able to explain it in a way that makes sense. At the end of the day, he went with a set of scenarios to present as comparators, e.g. Tall & Rich in Ontario vs Short & Rich in Ontario vs Tall & Rich in PEI.
Key point, and worth stressing numerous times: there needs to be theory first (otherwise use OLS and allow calculations of 115% - without theory, why not? The coefficients would even look nice: y = .43 + .02*Height + 0.00005*AnnualIncome + .35*Maritimes - .2*Prairies + .005*Age)
Posted by: Peter | October 14, 2011 at 10:46 PM