A followup to my previous post on university retention and males.
Assume boys and girls are identical, except: there's something in the water at high schools that causes boys to do worse than girls; and there's something in the water at universities that causes boys to do worse than girls.
Suppose you had a data set for all university students, that told you for each student i: that student's performance at university Ui; that student's performance at high school Hi; and that student's sex Si (Si=1 for male, Si=0 for female).
Suppose you ran a multiple regression of Ui on Hi and Si:
Ui = a + bHi + cSi + ei
What would you expect to find?
You would expect to find b to be positive. Girls who did better at high school will probably do better at university than girls who did worse. And boys who did better at high school will probably do better at university than boys who did worse.
What about c?
It could be either positive or negative, or zero.
0. Suppose the water at university was the same as the water at high school, and had exactly the same adverse effect on boys whether at university or at high school. And suppose that Ui and Hi were both perfect proxies for the same ability Ai. So the underlying model is:
Hi = Ai - Si and Ui = Ai - Si, which means: Ui = Hi, so a=0, b=1, and c=0.
That's the benchmark case, in which we should expect to see boys and girls having exactly the same performance at university once we control for performance at high school.
But that does not mean there isn't something in the water at universities that is adversely affecting boys' performance.
1. Let's relax the assumption that university water is the same as high school water and has exactly the same adverse effect on boys.
If the water at university has a smaller adverse effect on boys' performance than the water at high school, we should expect to find c is positive. Boys would do better than girls at university, controlling for high school performance. If the water at university has a bigger adverse effect on boys' performance than the water at high school, we should expect to find c is negative. Boys would do worse than girls at university, controlling for high school performance.
But either way, that does not mean there isn't something in the water at universities that is adversely affecting boys' performance.
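A worked version of case 1 may help, keeping Hi and Ui as perfect proxies for Ai. The water strengths k_H and k_U are notation introduced only for this sketch (the benchmark case sets both equal to 1); nothing here is estimated from data.

```latex
\begin{align*}
H_i &= A_i - k_H S_i, \qquad U_i = A_i - k_U S_i \\
\Rightarrow\; U_i &= H_i + (k_H - k_U)\,S_i \\
\Rightarrow\; a &= 0, \quad b = 1, \quad c = k_H - k_U
\end{align*}
```

So c only picks up the difference between the two waters. With k_H = 1 and k_U = 0.5, for example, c = 0.5 > 0 even though the university water still costs boys half a point.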
2. Let's relax the assumption that Hi and Ui are both perfect proxies for the same underlying ability Ai. Suppose they are imperfect proxies. But go back to the benchmark assumption that the water is the same and has the same effect.
If a random subset of boys and girls went on to university, this would make the estimate for b smaller (errors in variables), but it wouldn't affect the estimate of c, which would still be zero.
But it's not going to be a random subset. Universities select who they admit, based largely on high school performance. Suppose universities have a cut-off, and only admit students for whom Hi > Hbar. And suppose (for simplicity) that all those for whom Hi > Hbar do go to university. That means that fewer boys than girls will go to university. And (I think) that in turn means that c will be estimated to be negative.
To see why this is so, take a simple case where:
Hi = Ai - Si + vi, where vi is some random error, and Ui = Ai - Si, so Ui = Hi - vi.
Even if the mean of vi is zero for the population, those students who go to university will on average have a positive vi. (They were the ones who got lucky at high school.) And it will be even more positive for boys who go to university, because fewer boys will be admitted than girls, so the boys who get admitted will be an even luckier subset of all boys.
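The "luck" argument can be checked with a rough simulation. This is only a sketch under assumptions that are not in the post: Ai and vi are independent standard normals, half the population is male, and the cutoff Hbar is 0. The function name, the seed, and the sample size are hypothetical choices.

```python
import random
from statistics import mean

def simulate_luck(n=200_000, cutoff=0.0, seed=1):
    """Mean high-school error v among admitted girls and admitted boys."""
    rng = random.Random(seed)          # arbitrary seed (assumption)
    v_girls, v_boys = [], []
    for _ in range(n):
        s = 1 if rng.random() < 0.5 else 0   # Si = 1 for male
        a = rng.gauss(0.0, 1.0)              # ability Ai (assumed standard normal)
        v = rng.gauss(0.0, 1.0)              # high-school error vi (assumed standard normal)
        h = a - s + v                        # Hi = Ai - Si + vi
        if h > cutoff:                       # admitted only if Hi > Hbar
            (v_boys if s else v_girls).append(v)
    return mean(v_girls), mean(v_boys), len(v_girls), len(v_boys)

girls_v, boys_v, n_girls, n_boys = simulate_luck()
print(f"admitted girls: {n_girls}, mean v = {girls_v:.3f}")
print(f"admitted boys:  {n_boys}, mean v = {boys_v:.3f}")
```

Under these assumptions fewer boys than girls clear the cutoff, and the admitted boys have the larger average vi, which is the "even luckier subset" described above.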
So if the water is the same at high schools and universities, and has the same adverse effect on boys, but high school performance is an imperfect proxy for university performance, we should expect to find that boys do worse than girls at university, controlling for high school performance.
Conclusion: in a multiple regression of university retention on sex and high school grades, we have to be really careful how we interpret the effect of sex. If boys have worse retention than girls, but the effect disappears when we control for high school grades, that doesn't mean there isn't something in the water at universities.
Background:
1. Frances found me a very good article on university retention by Martin Dooley, Abigail Payne, and Leslie Robb. [Update: ungated earlier version here.] I think they recognise the problem I am talking about here, right at the end of their article, where they say:
"Our key finding of the importance of high school grades in "explaining" success in university, however, leaves a very big question unanswered. What lies behind the variation in high school grades? On this point, it [is] fair to say that our research leaves the policy analyst with a sizable "black box". One possibility is that [high school] grades stand as a proxy, in part, for family level variation in economic resources [in my example: something in the water that adversely affects boys],..." (bits in [.] added by me).
Despite all this, they still find that boys do slightly worse than girls at university, even controlling for high school grades and a lot of other things.
2. And I'm having a running argument with the (also very good) university econometrician about all this.
Nick, excellent post.
If I had access to the university econometrician, and I was trying to figure out if there was something in the water, I would go on a fishing trip, and here's what I'd look for.
I'd begin by looking for multi-section courses, and see if boys who have male instructors in the multi-section courses do better than the ones who have female instructors. I'd compare multi-section courses with/without common exams, and see if instructor gender makes more/less difference with common exams.
I'd look at the % male students in the classes, and see if that predicts success/failure. Interact gender of instructor with % male students in the class.
Can you separate local students from students from elsewhere? Perhaps being in residence and being away from home has more of a negative effect on male performance than on female?
There's lots of fish in the Rideau, along with the odd snapping turtle, so I hope you catch something.
Posted by: Frances Woolley | March 14, 2013 at 10:28 AM
Also see if boys who take a gap year do better. That would be interesting.
Posted by: Frances Woolley | March 14, 2013 at 10:33 AM
Frances: thanks! That sample selection/imperfect proxy stuff is what I was trying (and failing) to get my head around yesterday. I was also trying to get my head around what would happen if students have some information on their own ability at university that is not captured in high-school grades ("Was I slacking off at school and will I work harder at university?"). So we had two sorts of sample-selection effects interacting. But it was too hard for me to think about.
The fishing trips you suggest might be interesting. My hunch would be on number of assignments vs exams though. And yes, lots of different fish may be in the river, and the outside job opportunities is probably a big one. But the fish I would go hunting would only be the fish we could eat. If there is no policy that could reasonably be implemented to do something about it, there's no point in catching it. (Except for scientific curiosity, of course, which is fine, but I've got to think with my policy hat on nowadays.)
Posted by: Nick Rowe | March 14, 2013 at 10:46 AM
Also talk to Louis-Philippe Morin at U of O who has done some work looking at the performance of boys and girls in the double-cohort, and argues that boys do better in more competitive environments.
"My hunch would be on number of assignments vs exams though."
Correlated with whether the course is quantitative or qualitative? Male students may tend to select out of essay courses also, and highly motivated students may select into sections with more assignments (and students who figure they're smart enough to get by without the assignments may select into ones with fewer). Wonder if there's any way of getting exogenous variation in sections? Could always just look at # of assignments v. exams on multi-section required courses, controlling for instructor quality and gender. Course there are some really annoying instructors like the prof for ECON 1000 section A, who don't specify the number of assignments on their course outline.
Posted by: Frances Woolley | March 14, 2013 at 11:08 AM
Ooops!
Posted by: Nick Rowe | March 14, 2013 at 11:11 AM
Another thing to try: single section required courses, ideally ones that are offered only once a year. See if male students do better/worse if the course is offered at 8:30 a.m.
Posted by: Frances Woolley | March 14, 2013 at 01:21 PM
Nick,
"But the fish I would go hunting would only be the fish we could eat."
That doesn't sound like a great strategy. If your dependent variable is entirely explained by fish you can't eat (i.e. there are no edible fish in the pond), then that's a really useful piece of information. Even discovering that you have an 80% R-score on useless variables (at least 80% of the fish are inedible) tells you whether it's still *worthwhile* to keep fishing. Trying to find out exactly what's in the lake is going to be a lot more efficient than wading around for months trying to catch salmon that may not even exist because it's all catfish.
If I were doing your job, I would go after the most likely fish first in order to find out exactly what's in the pond. It'll be faster, and even catfish can be pretty tasty if you are hungry enough and you get a bit creative about cooking them.
Posted by: K | March 14, 2013 at 02:13 PM
Frances: I see via email the university econometrician has been very quick to test that. Looks to me like the "8.30am effect" is roughly the same on both males and females! Or maybe just slightly worse for males.
K: Hmm. Fair point. Maybe.
(Catfish taste very good to me, btw.)
Posted by: Nick Rowe | March 14, 2013 at 03:18 PM
Nick (or the university econometrician) - is the 8:30 a.m. effect positive/negative/statistically significant?
Posted by: Frances Woolley | March 14, 2013 at 03:43 PM
Ignore that last comment, just checked my email.
Posted by: Frances Woolley | March 14, 2013 at 03:48 PM
From the abstract, it does sound like they do not have data on parents' income and education level. That is not relevant for the gender issue, but those two should be the important factors that cause both good grades at high school and low dropout rates at university.
Posted by: hix | March 14, 2013 at 05:58 PM
hix: You can download an ungated earlier version of the paper here
They don't know parents' income and education, but they do know parents' neighbourhood, and they know income and education for each neighbourhood, so they can proxy for parents' income and education.
It isn't directly relevant to the gender issue, but indirectly it is very relevant, because the sort of effect I am talking about here for gender could apply equally well for parents' income and education.
Posted by: Nick Rowe | March 14, 2013 at 06:08 PM
Don't think neighbourhood as a proxy can work, at least not without further corrections (which they might well have done and I just overlooked it). One would expect that a much higher percentage of the age cohort goes to college in the richer neighbourhoods, and the paper even suggests as much. Those who do go to college from the poorer ones could have parents just as well-off or highly educated as the others. Regarding the gender issue, men are heavily underdiagnosed for mental illnesses compared to women. That is directly relevant since, put bluntly, maybe 2 or 3% more men might have graduated if someone had given them antidepressants, and it also hints at a broader problem with gender roles. College can cause lots of anxiety, and the traditional male gender role expectation is to not talk about it or admit it. That does not help.
Posted by: hix | March 15, 2013 at 10:00 AM
hix: I'm not very familiar with the use of neighbourhood proxies, but here's how I think it's supposed to work:
You take average income in a neighbourhood as a proxy for the income of a family living in that neighbourhood. (And average education as a proxy for education.) Obviously (unless all families in a neighbourhood have the same income) the proxy will be imperfect. And this means that the estimated coefficient on income will be smaller than the true coefficient, because of the errors in variables bias. But as long as the proxy is correlated with the true income, the estimated coefficient should still have the right sign. The danger is if the errors in the proxy are correlated with some other variables in the regression, because that would bias the estimated coefficients on those other variables.
Interesting hunch on mental illness/males.
Posted by: Nick Rowe | March 15, 2013 at 10:26 AM
This is somewhat off-topic…but doesn’t the apparently close empirical relationship between high school and university performance thoroughly undermine the idea of university education as a signal? The evidence seems to suggest the same abilities, traits and habits make for success in both high school and university. Why would employers ever ascribe signalling value to (and pay premiums to people possessing) university degrees when they can obtain pretty much the same information about a person’s intrinsic characteristics from their high school record?
Posted by: Giovanni | March 15, 2013 at 03:08 PM
Giovanni: off-topic, but nevertheless a very interesting point. So it's allowed. If you can predict beforehand who will signal, is the signal still a signal? Hmmmm. dunno. I need to think about that one.
Posted by: Nick Rowe | March 15, 2013 at 03:37 PM
The idea that the boys who get into college are "luckier" than the girls is not numerate enough for me. The boys are further into the tail of the distribution, and I don't know what the tail looks like, for either boys or girls. (I suppose that we are assuming that the shape of the distributions is the same, something that I doubt in real life. At the very least, having only one X chromosome should make for more genetic variability among the boys.)
For simplicity, let's suppose that the effect of the water is the same, and that the distributions are triangular. Suppose that 10 boys get into college (1 + 2 + 3 + 4) and 21 girls get in (1 + 2 + . . . + 5 + 6). Let's say that half of those who barely made it don't really belong in college, and drop out. That's 2/10, or 1/5 of the boys, and 3/21, or 1/7 of the girls. So a greater percentage of boys than girls drop out, but that is an artifact.
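The arithmetic can be spelled out in a few lines. This sketch assumes the tail above the cutoff is a triangle whose k-th row from the top holds k students, so the widest row is the one that "barely made it"; the function name is hypothetical.

```python
from fractions import Fraction

def dropout_rate(rows):
    """Return (number admitted, dropout rate) for a triangular tail with `rows` rows."""
    admitted = sum(range(1, rows + 1))   # 1 + 2 + ... + rows students above the cutoff
    marginal = rows                      # the widest row sits right at the cutoff
    dropouts = Fraction(marginal, 2)     # half of the marginal students drop out
    return admitted, dropouts / admitted

for label, rows in (("boys", 4), ("girls", 6)):
    admitted, rate = dropout_rate(rows)
    print(f"{label}: {admitted} admitted, dropout rate {rate}")
# boys: 10 admitted, dropout rate 1/5
# girls: 21 admitted, dropout rate 1/7
```

The dropout rule never looks at sex; the gap comes only from the marginal row being a larger share of the smaller group.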
Posted by: Min | March 16, 2013 at 11:16 AM
Giovanni: "doesn’t the apparently close empirical relationship between high school and university performance thoroughly undermine the idea of university education as a signal?"
Depends upon what is being signaled. In my father's day, and to some extent in mine, there were a lot of highly competent people who did not go to college or did not finish it. In his day a B. A. was a signal of class. In my day scholarships were more available, and college was expected of good students. College was considered as time before entering the work world in which to "find yourself", a kind of luxury for many students. The practice of using the lack of a college degree to eliminate job applicants was decried. Nowadays it is widely accepted, and scholarships are fewer, so the class signal is stronger again.
Posted by: Min | March 16, 2013 at 11:34 AM
Min: if the boys' and girls' distributions of Ai are the same (which is what I was semi-explicitly assuming) then the boys who get into university are (on average) "luckier" than the girls who get into university.
If instead we assume that boys have a higher variance but the same mean as girls (which AFAIK is roughly realistic, but which I wouldn't say if I wanted to be university president), then I'm less sure about that. I *think* it would still be true that boys would perform worse at university, controlling for high school grades, but I would have to do some math to check. I can't quite follow your triangular distribution example.
On the signalling. OK, if high school grades were a perfect predictor of success at university, conditional on going to university, but were an imperfect predictor of who would go to university, then I think you are right: going to university could still be a signal. Good point.
Posted by: Nick Rowe | March 16, 2013 at 12:33 PM
@ Nick
About the triangular distribution example:
I hope this shows up right. ;
Boys Girls
*
**
* ***
** ****
*** *****
**** ******
------------------------------ Admissions cutoff
The triangles are the tails of the boys' and girls' distributions. 10 boys get admitted, 21 girls get admitted. 4 boys just make the cutoff, 6 girls just make the cutoff. If half of those who just make the cutoff drop out, without regard to gender, that's 2 boys and 3 girls, or 1/5 of the boys and 1/7 of the girls. A smaller proportion of girls drop out, even though dropping out is gender blind.
Posted by: Min | March 16, 2013 at 07:41 PM
No, the triangles did not show up right. Let's try again.
Boys . . . . . . . . . . Girls
. . . . . . . . . . . . . . *
. . . . . . . . . . . . . . **
* . . . . . . . . . . . . . ***
** . . . . . . . . . . . . ****
*** . . . . . . . . . . . *****
**** . . . . . . . . . . ******
---------------------- Admissions cutoff
Posted by: Min | March 16, 2013 at 07:43 PM
Thanks Min. Got it. Pictures help!
A uniform distribution, though even less realistic than your triangular distribution, makes it even clearer. If boys had the same mean but a higher variance than girls, and if there were nothing in the water at either schools or universities, and if Hi were an imperfect proxy for Ui, we would expect to see boys having better retention than girls, because there would be a smaller percentage of boys near the cutoff.
But it will depend on the shape of the distribution, and where the cutoff is. For example, if it were normal distributions, and the cutoff were very low, so more than half went to university, we would have more boys nearer the cutoff.
For a unimodal distribution, I think the direction of the effect will depend on whether the cutoff is above or below the mode.
There are then two effects: the one you are talking about, which is about the relative sizes of f(Hbar) and F(Hbar) and how that is affected by variance; and the "something in the water effect", which shifts boys' whole distribution f(U) to the left.
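A rough way to explore the first effect is to compute the density at the cutoff relative to the mass above it for two normal distributions. Everything specific here is an assumption made for illustration: girls are N(0, 1), boys are N(0, 1.5), nothing is in the water, and "near the cutoff" is read as f(Hbar) divided by the share admitted, 1 - F(Hbar). The function name and cutoff values are hypothetical.

```python
from statistics import NormalDist

def share_near_cutoff(dist, cutoff):
    """Density at the cutoff divided by the mass above it: f(Hbar) / (1 - F(Hbar))."""
    return dist.pdf(cutoff) / (1.0 - dist.cdf(cutoff))

girls = NormalDist(mu=0.0, sigma=1.0)   # assumed: same mean, lower variance
boys = NormalDist(mu=0.0, sigma=1.5)    # assumed: same mean, higher variance

for cutoff in (-2.0, 0.0, 1.0):         # illustrative cutoffs only
    g = share_near_cutoff(girls, cutoff)
    b = share_near_cutoff(boys, cutoff)
    print(f"cutoff {cutoff:+.1f}: girls {g:.3f}, boys {b:.3f}")
```

With these numbers the boys have relatively fewer students near a high cutoff and relatively more near a very low one, so the sign does depend on where the cutoff sits. Where exactly it flips depends on the variances chosen.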
Posted by: Nick Rowe | March 16, 2013 at 08:10 PM
Nick, you say: "If a random subset of boys and girls went on to university, this would make the estimate for b smaller (errors in variables), but it wouldn't affect the estimate of c, which would still be zero."
That's not generally true. Consider the case you go on to present with U=A-S and H=A-S+e, and we regress U on a constant, H, and S. It is true that the coefficient on H is biased towards zero, but it isn't generally true that the coefficient c on S is still centered on zero. One way to see this is to consider an extreme case in which we make the variance of e so large that almost all of the variance in H is due to variance in e, so H is almost just noise. Then we are regressing A-S on irrelevant noise and S, so c will be centered on something very close to -1. If on the other hand we let the variance of e approach zero, we'd recover your case where we get b=1 and c=0, since we are regressing A-S on A-S and something else.
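The two extremes can be connected with one line of algebra. This is a sketch that assumes A and e are mean zero, independent of each other and of S, with variances written here as \sigma_A^2 and \sigma_e^2; the symbol \lambda is introduced only for this illustration, and "\approx" denotes the best linear predictor that OLS estimates in large samples.

```latex
\begin{align*}
\lambda &\equiv \frac{\sigma_A^2}{\sigma_A^2 + \sigma_e^2}, \qquad H + S = A + e \\
U = A - S &\;\approx\; \lambda\,(H + S) - S \;=\; \lambda H - (1 - \lambda)\,S \\
\Rightarrow\; b &\to \lambda, \qquad c \to -(1 - \lambda)
\end{align*}
```

As \sigma_e^2 grows without bound, \lambda goes to 0 and c goes to -1; as \sigma_e^2 goes to 0, \lambda goes to 1 and c goes to 0. With equal variances, \lambda = 1/2, so b and c would be about 0.5 and -0.5.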
In the more realistic case in which U and H are both imperfect proxies, the bias in the coefficients depends on the covariance structure of the noise terms.
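Here's a little simulation of the case in which H is noisy and U is not. Draw n=100,000 observations on a random dummy S, a standard normal ability term A, and standard normal noise in high school e. Then let U=A-S and H=A-S+e.
* 100,000 observations
set obs 100000
* s: random male dummy; a: ability; e: high-school noise
gen s=uniform()>0.5
gen a=invnorm(uniform())
gen e=invnorm(uniform())
* u: university performance; h: noisy high-school performance
gen u=a-s
gen h=a-s+e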
And now regress U on H and S:
. reg u h s
      Source |       SS       df       MS              Number of obs =  100000
-------------+------------------------------           F(  2, 99997) =31762.65
       Model |  126072.803     2  63036.4013           Prob > F      =  0.0000
    Residual |  198454.844 99997  1.98460798           R-squared     =  0.3885
-------------+------------------------------           Adj R-squared =  0.3885
       Total |  324527.647 99999  3.24530892           Root MSE      =  1.4088
------------------------------------------------------------------------------
           u |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           h |   .5001726   .0015813   316.31   0.000     .4970734    .5032719
           s |  -.5028246   .0047261  -106.39   0.000    -.5120877   -.4935614
       _cons |   .0049026   .0031496     1.56   0.120    -.0012706    .0110758
------------------------------------------------------------------------------
So in this little model the true effect of S on H and on U is -1, but this (misspecified) regression recovers an estimate of the effect of S on U, conditional on H, of about -0.5.
It turns out in this case that selection into university has little effect on the estimates, although I haven't sat down and figured out why. If we suppose that only kids with H>0 go to university,
gen dropout=h<0
and estimate using only the subsample who attend university, we get very little change in the estimated parameters:
. reg u h s if ~dropout
      Source |       SS       df       MS              Number of obs =   25234
-------------+------------------------------           F(  2, 25231) = 5119.52
       Model |  5125.49736     2  2562.74868           Prob > F      =  0.0000
    Residual |  12630.2238 25231   .50058356           R-squared     =  0.2887
-------------+------------------------------           Adj R-squared =  0.2886
       Total |  17755.7212 25233  .703670636           Root MSE      =  .70752
------------------------------------------------------------------------------
           u |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           h |   .4959091   .0061616    80.48   0.000      .483832    .5079862
           s |  -.4910323   .0099647   -49.28   0.000    -.5105636    -.471501
       _cons |    .008722   .0104105     0.84   0.402    -.0116832    .0291271
------------------------------------------------------------------------------
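For readers without Stata, the same experiment can be sketched in plain Python. This is not a reproduction of the output above: the seed, the helper name ols_two_regressors, and the hand-rolled OLS are all assumptions made for illustration, and A and e are taken to be independent standard normals as in the description.

```python
import random

def ols_two_regressors(y, x, z):
    """OLS of y on a constant, x and z via the normal equations; returns (a, b, c)."""
    n = len(y)
    my, mx, mz = sum(y) / n, sum(x) / n, sum(z) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    szz = sum((zi - mz) ** 2 for zi in z)
    sxz = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z))
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    szy = sum((zi - mz) * (yi - my) for zi, yi in zip(z, y))
    det = sxx * szz - sxz ** 2
    b = (szz * sxy - sxz * szy) / det
    c = (sxx * szy - sxz * sxy) / det
    return my - b * mx - c * mz, b, c

rng = random.Random(2013)                # arbitrary seed (assumption)
u, h, s = [], [], []
for _ in range(100_000):
    si = 1 if rng.random() > 0.5 else 0  # random male dummy
    ai = rng.gauss(0.0, 1.0)             # ability
    ei = rng.gauss(0.0, 1.0)             # high-school noise
    s.append(si)
    u.append(ai - si)                    # U = A - S
    h.append(ai - si + ei)               # H = A - S + e

full = ols_two_regressors(u, h, s)
admitted = [(ui, hi, si) for ui, hi, si in zip(u, h, s) if hi > 0]  # keep H > 0
sel = ols_two_regressors(*zip(*admitted))

print("full sample    a, b, c:", [round(v, 3) for v in full])
print("H > 0 only     a, b, c:", [round(v, 3) for v in sel])
```

One possible reason the selected-sample estimates barely move, offered tentatively: the cutoff depends only on H, which is itself a regressor, and with normal A and e the conditional mean of U given H and S is linear, so truncating on H leaves the slope coefficients unchanged.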
Posted by: Chris Auld | March 19, 2013 at 05:14 PM
Chris: "That's not generally true."
Hmmm. You are right.
Damn. Someone did a blog post on that very topic just a few weeks back. It wasn't you, was it? I was sort of getting the intuition for it from that blog post, but couldn't fully understand it intuitively, and was too shy to ask dumb questions. Let's see:
Suppose you have Y = a + bX + cZ + e
Now suppose we start measuring X with error. That biases b towards zero. OK so far. What about c? Does c get biased away from zero if X and Z are positively correlated? (That sort of feels right to me.) And in my example, S and H *are* correlated.
Am I on the right track?
Posted by: Nick Rowe | March 19, 2013 at 06:15 PM
Nick: wasn't my blog post. I think it's correct to say that measurement error in X but not Z yields biased estimates of both coefficients if X and Z are correlated. But here I phrase the problem the issue differently: high school grades are an imperfect proxy for ability because the water affects boys in high school too, so the estimate on the coefficient on the sex dummy is still biased after even conditioning on high school grades.
Posted by: Chris Auld | March 19, 2013 at 10:58 PM
oops, I meant, "But here I would phrase the problem differently...."
Posted by: Chris Auld | March 19, 2013 at 10:59 PM
Nick,
In the errors-in-variables case you describe OLS will produce an estimate of c such that:
(1) E(c^) = [b - E(b^)]cov(X,Z)/var(Z)
where b^,c^ = OLS estimate of b,c. This comes directly from the sum-of-squares minimization condition for c:
(2) cov(X,Z)b^ + var(Z)c^ = cov(Z,Y).
Suppose X = X* + n, where n = measurement error, and that the true relationship governing Y is:
(3) Y = a + bX* + s
Assuming n, s are independent of Z:
(4) cov(Z,Y) = cov(X*,Z)b = cov(X*+ n,Z)b= cov(X,Z)b
Using (4) in (2):
(5) cov(X,Z)b^ + var(Z)c^ = cov(X,Z)b
which gives (1) after taking expectations and rearranging.
So, yes, if X and Z are positively correlated then c^ will be biased away from zero to an extent proportional to the bias of b^. Intuitively, when X is large because X* is large OLS will tend to see Y, X and Z all large together. But when X is large because the measurement error n is large then OLS will tend to see X large with Y and Z close to their means. OLS will account for these latter instances in part by overestimating the influence of Z on Y.
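As a rough check, formula (1) can be applied to the water example from earlier in the thread, with X = H, Z = S, and U = A - S, so that b = 1 and the true coefficient on S is zero once the error-free high school measure is controlled for. This sketch assumes A and e are independent of S, and takes E(b^) of about 0.5 from the simulated case with equal variances.

```latex
\begin{align*}
\operatorname{cov}(H, S) &= \operatorname{cov}(A - S + e,\; S) = -\operatorname{var}(S) \\
E(\hat c) &= \bigl[b - E(\hat b)\bigr]\frac{\operatorname{cov}(H,S)}{\operatorname{var}(S)} = -\bigl[1 - E(\hat b)\bigr] \\
E(\hat b) \approx 0.5 \;&\Rightarrow\; E(\hat c) \approx -0.5
\end{align*}
```

That lines up with the coefficient of roughly -0.50 on s in the regression output reported above. Here X and Z are negatively correlated, so the bias in c^ is negative.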
Posted by: Giovanni | March 20, 2013 at 10:20 AM