A follow-up to my previous post on university retention and males.
Assume boys and girls are identical, except: there's something in the water at high schools that causes boys to do worse than girls; and there's something in the water at universities that causes boys to do worse than girls.
Suppose you had a data set for all university students, that told you for each student i: that student's performance at university Ui; that student's performance at high school Hi; and that student's sex Si (Si=1 for male, Si=0 for female).
Suppose you ran a multiple regression of Ui on Hi and Si:
Ui = a + bHi + cSi + ei
What would you expect to find?
You would expect to find b to be positive. Girls who did better at high school will probably do better at university than girls who did worse. And boys who did better at high school will probably do better at university than boys who did worse.
What about c?
It could be positive, negative, or zero.
0. Suppose the water at university was the same as the water at high school, and had exactly the same adverse effect on boys whether at university or at high school. And suppose that Ui and Hi were both perfect proxies for the same ability Ai. So the underlying model is:
Hi = Ai - Si and Ui = Ai - Si, which means: Ui = Hi, so a=0, b=1, and c=0.
That's the benchmark case, in which we should expect to see boys and girls having exactly the same performance at university once we control for performance at high school.
But that does not mean there isn't something in the water at universities that is adversely affecting boys' performance.
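A quick simulation of the benchmark case, as a sketch only. The sample size, the seed, the 50/50 sex split, and the standard normal distribution for Ai are all assumptions made for illustration; none of them come from real data.

```python
import numpy as np

rng = np.random.default_rng(0)       # arbitrary seed (assumption)
n = 10_000                           # arbitrary sample size (assumption)

S = rng.integers(0, 2, size=n)       # Si: 1 = male, 0 = female
A = rng.normal(0.0, 1.0, size=n)     # Ai: ability, assumed standard normal

H = A - S                            # high school: the water costs boys 1 point
U = A - S                            # university: same water, same cost

# OLS of Ui on a constant, Hi and Si
X = np.column_stack([np.ones(n), H, S])
a, b, c = np.linalg.lstsq(X, U, rcond=None)[0]
print(f"a={a:.6f} b={b:.6f} c={c:.6f}")
```

The simulated university water knocks a full point off every boy's score, yet the regression puts c at zero (up to rounding), because Hi has already absorbed the same penalty.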
1. Let's relax the assumption that university water is the same as high school water and has exactly the same adverse effect on boys.
If the water at university has a smaller adverse effect on boys' performance than the water at high school, we should expect to find c is positive. Boys would do better than girls at university, controlling for high school performance. If the water at university has a bigger adverse effect on boys' performance than the water at high school, we should expect to find c is negative. Boys would do worse than girls at university, controlling for high school performance.
But either way, that does not mean there isn't something in the water at universities that is adversely affecting boys' performance.
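To make case 1 concrete, write the adverse effect at high school as d_high and at university as d_uni (hypothetical names; the benchmark sets both to 1). Then Hi = Ai - d_high*Si and Ui = Ai - d_uni*Si, so Ui = Hi + (d_high - d_uni)*Si, and c should equal d_high - d_uni. A minimal sketch, with the same illustrative assumptions about the distribution of Ai as before:

```python
import numpy as np

def fit_c(d_high, d_uni, n=10_000, seed=0):
    """Regress Ui on a constant, Hi and Si; return the coefficient on Si."""
    rng = np.random.default_rng(seed)
    S = rng.integers(0, 2, size=n)   # 1 = male, 0 = female
    A = rng.normal(size=n)           # ability, assumed standard normal
    H = A - d_high * S               # water effect at high school
    U = A - d_uni * S                # water effect at university
    X = np.column_stack([np.ones(n), H, S])
    return np.linalg.lstsq(X, U, rcond=None)[0][2]

c_smaller = fit_c(d_high=1.0, d_uni=0.5)   # university water hurts boys less
c_bigger = fit_c(d_high=1.0, d_uni=1.5)    # university water hurts boys more
print(c_smaller, c_bigger)
```

In both calls d_uni is positive, so the university water is hurting boys in both; only the difference d_high - d_uni shows up in c.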
2. Let's relax the assumption that Hi and Ui are both perfect proxies for the same underlying ability Ai. Suppose they are imperfect proxies. But go back to the benchmark assumption that the water is the same and has the same effect.
If a random subset of boys and girls went on to university, this would make the estimate for b smaller (errors in variables), but it wouldn't affect the estimate of c, which would still be zero.
But it's not going to be a random subset. Universities select who they admit, based largely on high school performance. Suppose universities have a cut-off, and only admit students for whom Hi > Hbar. And suppose (for simplicity) that all those for whom Hi > Hbar do go to university. That means that fewer boys than girls will go to university. And (I think) that in turn means that c will be estimated to be negative.
To see why this is so, take a simple case where:
Hi = Ai - Si + vi, where vi is some random error, and Ui = Ai - Si, so Ui = Hi - vi.
Even if the mean vi is zero for the population, those students who go to university will on average have a positive vi. (They were the ones who got lucky at high school.) And it will be even more positive for boys who go to university, because fewer boys will be admitted than girls, so the boys who get admitted will be an even luckier subset of all boys.
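The "luck" claim can be checked with a small simulation. This is only a sketch: the normal distributions for Ai and vi, the standard deviation of 0.5 for vi, and the cut-off H_bar = 0 are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)          # arbitrary seed (assumption)
n = 200_000

S = rng.integers(0, 2, size=n)          # 1 = male, 0 = female
A = rng.normal(size=n)                  # ability, assumed standard normal
v = rng.normal(scale=0.5, size=n)       # high school luck, mean zero overall
H = A - S + v                           # noisy high school performance

H_bar = 0.0                             # assumed admission cut-off
admitted = H > H_bar

share_boys = admitted[S == 1].mean()    # fraction of boys admitted
share_girls = admitted[S == 0].mean()   # fraction of girls admitted
luck_boys = v[admitted & (S == 1)].mean()
luck_girls = v[admitted & (S == 0)].mean()

print(f"admitted: boys {share_boys:.3f}, girls {share_girls:.3f}")
print(f"mean v among admitted: boys {luck_boys:.3f}, girls {luck_girls:.3f}")
```

Under these assumptions fewer boys than girls clear the cut-off, and the average vi among admitted students is positive for both sexes and larger for boys.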
So if the water is the same at high schools and universities, and has the same adverse effect on boys, but high school performance is an imperfect proxy for university performance, we should expect to find that boys do worse than girls at university, controlling for high school performance.
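A sketch of the case 2 regression, run only on the students who clear the cut-off. As before, the distributions, the noise level in vi, and H_bar = 0 are illustrative assumptions, not estimates of anything.

```python
import numpy as np

rng = np.random.default_rng(2)          # arbitrary seed (assumption)
n = 200_000

S = rng.integers(0, 2, size=n)          # 1 = male, 0 = female
A = rng.normal(size=n)                  # ability, assumed standard normal
v = rng.normal(scale=0.5, size=n)       # noise in high school performance

H = A - S + v                           # imperfect proxy at high school
U = A - S                               # same water, same effect at university

H_bar = 0.0                             # assumed admission cut-off
admitted = H > H_bar                    # only these students appear in the data

# OLS of Ui on a constant, Hi and Si, using admitted students only
X = np.column_stack([np.ones(n), H, S])[admitted]
a, b, c = np.linalg.lstsq(X, U[admitted], rcond=None)[0]
print(f"a={a:.3f} b={b:.3f} c={c:.3f}")
```

With these assumed numbers b should come out below 1 and c below zero, even though the water is identical at high school and university. The sketch only reports the sign of c; it makes no attempt to separate how much of it comes from the cut-off and how much from the noise in Hi.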
Conclusion: in a multiple regression of university retention on sex and high school grades, we have to be really careful how we interpret the effect of sex. If boys have worse retention than girls, but the effect disappears when we control for high school grades, that doesn't mean there isn't something in the water at universities.
1. Frances found me a very good article on university retention by Martin Dooley, Abigail Payne, and Leslie Robb. [Update: ungated earlier version here.] I think they recognise the problem I am talking about here, right at the end of their article, where they say:
"Our key finding of the importance of high school grades in "explaining" success in university, however, leaves a very big question unanswered. What lies behind the variation in high school grades? On this point, it [is] fair to say that our research leaves the policy analyst with a sizable "black box". One possibility is that [high school] grades stand as a proxy, in part, for family level variation in economic resources [in my example: something in the water that adversely affects boys],..." (bits in [.] added by me).
Despite all this, they still find that boys do slightly worse than girls at university, even controlling for high school grades and a lot of other things.
2. And I'm having a running argument with the (also very good) university econometrician about all this.