I am feeling too stupid to write a proper blog post (Spring doldrums). So I thought I would take the opportunity to make lemonade and write a short post about my own stupidity.
Every educated person should understand at least the very basics of Darwin. This is something very basic that I did not understand until a year or so back. Maybe some of you don't understand it either?
Let me pose it as simply and clearly as I can, as a multiple choice question:
There is a population of critters whose average (adult) height is 100. The height of any individual is determined partly by genetics and partly by environment, and different individuals have different heights because of different genes and different environments. But the population's genes and environment are not changing over time, so the average height stays at 100.
Suppose you take a (large) subset of that population, and isolate them from the main population in an environment that is the same as the original environment, and that also does not change over time. But you choose individuals that are taller than average, so the average height of your subset is 110. That first generation has children, and that second generation has an average height of 105. That is an illustration of regression toward the mean.
Question: if the second generation has children, what would you expect to be the average height of that third generation?
A. 105
B. 102.5
.
.
.
.
.
.
The correct answer is A: 105. Regression toward the mean stops at the second generation. If it continued for all subsequent generations, so average height eventually approaches 100, Darwin would make no sense, and it wouldn't be possible to breed cows for higher milk yields.
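For anyone who would rather check this than take my word for it, here is a minimal simulation sketch. It rests on assumptions that are not in the question above: height is 100 plus a genetic part plus an environmental part, the two parts are independent normal draws with equal variance, and children inherit the average genetic part of the selected group unchanged (modelled here as each child copying one parent's genes) while drawing a fresh environment. The cutoff of 103, the population size, and the function name `simulate` are illustrative choices only.

```python
import random
import statistics


def simulate(n=100_000, seed=0):
    """Return the average heights of generations 1, 2 and 3 of the selected subset."""
    rng = random.Random(seed)
    sd = 50 ** 0.5  # assumed: genes and environment each contribute variance 50

    # Original population: height = 100 + genetic part + environmental part.
    genes = [rng.gauss(0, sd) for _ in range(n)]
    heights = [100 + g + rng.gauss(0, sd) for g in genes]

    # Generation 1: keep only critters taller than 103 (assumed cutoff,
    # chosen so the selected subset averages roughly 110).
    picked = [(g, h) for g, h in zip(genes, heights) if h > 103]
    gen1 = statistics.fmean(h for _, h in picked)

    # Generations 2 and 3: same genes on average, fresh draw of environment.
    gen2 = statistics.fmean(100 + g + rng.gauss(0, sd) for g, _ in picked)
    gen3 = statistics.fmean(100 + g + rng.gauss(0, sd) for g, _ in picked)
    return gen1, gen2, gen3


gen1, gen2, gen3 = simulate()
print(round(gen1, 1), round(gen2, 1), round(gen3, 1))
```

Under these assumptions the three averages should come out close to 110, 105 and 105: the drop happens once, when the lucky environment is lost, and then stops.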
Somebody else can explain it better than me. Wikipedia is a good start.
This point had been vaguely puzzling me since I was a teenager. I didn't figure it out until I read one of Razib Khan's posts (I can't find the exact post). It cropped up again in this talk by Greg Cochran to a group of economists, and as you will see from the comments, I am not the only economist who was unclear on that point.
I wonder how many other supposedly educated people are like me?
Interesting post Nick. I hadn't heard this question before. Your explanation makes sense (that the regression would stop after a single new generation), but I'd certainly never contemplated the problem... or if I did I don't remember it. I read Dawkins' "The Selfish Gene" a long time ago and loved it, but I don't recall this particular point being made.
Posted by: Tom Brown | June 18, 2014 at 03:17 PM
There is another uneducated person right here. I vaguely knew about the concept but did not pass the test so it did not count at all. That wiki article was quite interesting, especially that Kahneman quote. Thanks for this.
Posted by: J.V. Dubois | June 18, 2014 at 03:51 PM
Thanks Tom and J.V., that makes me feel better. I find that Wiki is generally very good on the more technical stuff. Or maybe that's just the bias: that Wiki, like journalists, always seems good when it's talking about a subject you know little about.
Posted by: Nick Rowe | June 18, 2014 at 04:28 PM
I think the answer isn't 105, but more like 104 or such. There is still some regression to the mean, just not as much as in the first generation.
It seems to me that the answer is 105 only if you specify in the question that the true correlation between one generation and the next is 0.5. (Or, if you specify that the correlation is unknown and random, but with a mean of 0.5.)
If you don't specify either of those, then the observed correlation of r=0.5 is an estimate of the actual correlation between generations. With no information about the distribution of that correlation, it's more likely to be below 0.5 than above it (since it can be negative, but not greater than 1), and so you have to regress the observed r=0.5 towards the mean of r=0 to get an unbiased estimate of the true correlation.
Posted by: Phil | June 18, 2014 at 08:45 PM
Sorry, I missed that you specified that the subset was large. In that case, the observed r=0.5 is probably very close to the true correlation, so you need to regress only slightly further. Maybe the second generation is 104.99, or something, depending on the largeness of the original sample -- but, in any case, you should expect less than 105.
Posted by: Phil | June 18, 2014 at 08:53 PM
Phil: the way I understand it is this:
The first generation will have "taller" genes than average, but will also have a "taller" environment than average, and those two effects (by assumption) sum to +10.
The second generation will, on average, have the exact same "taller" genes as their parents, but will have an average environment. The effect of the "taller" genes (by assumption) is +5. (So we know that the effect of the first generation's "taller" genes must also have been +5.)
And the third generation will, on average, also have the exact same "taller" genes as their parents, and an average environment, so they too will be 105.
(With a small sample, the average height of the third generation could be either less than or greater than 105, but the expectation should be 105.)
(I'm implicitly assuming no selection effects, or none big enough to matter in 3 generations. If critters over 120 banged their heads on low branches and didn't breed, and critters under 80 couldn't reach the food and didn't breed, the subset would very slowly get selected back from 105 down to 100 again. 110, 105, 104.99, 104.98, etc.)
Posted by: Nick Rowe | June 18, 2014 at 10:24 PM
Nick, this post is more important to economists than you might think. In your example, the reason why the regression to the mean stops is because you have biased the sample by selecting tall critters. Some of the tall critters might be tall by chance, but by selecting tall only, your sample will contain an over representation of the genes that produce tallness. The new population will have an upward biased mean.
This is important to economists because for exactly the same reason a very large fraction of economics research (and empirical relationship driven science) is wrong - regardless of the care taken in the analysis. When most economics students or economists run regressions or perform econometrics work, they often set levels of statistical significance at 5%. One might therefore think that 5% of research results would be mistaken. In fact, it is much, much larger – it is biased upward by publication selection just like the mean height for your critters. The Economist magazine had a great article on this last fall.
Suppose that you have 1000 hypotheses that you wish to test and suppose that only 100 of them are true. Further, suppose your test has a false positive rate of 5%, producing 45 false positive results (5% of the 900 false hypotheses). If the power of the test is 0.8, it confirms only 80 of the true hypotheses, producing 20 false negatives. So, the test labels 125 hypotheses of the 1000 as true, of which 45 are in fact false positives. That is, a full 36% of the positive results are wrong, not 5%! Now, like your critter example, if we only publish positive results (which is how the academic world works), we bias the population. The strength of the test is also in its ability to correctly reject false hypotheses, but those results are never published.
This is why we should take empirical economic research findings with a grain of salt - even if the analysis is done correctly. (Like research that says people pay less attention to female-named hurricanes.)
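The arithmetic can be checked with a few lines of Python. This is only a sketch of the calculation in the paragraph above; the function name `false_discovery_share` and its parameter names are made up for illustration.

```python
def false_discovery_share(n_hypotheses, n_true, alpha, power):
    """Return (false positives, true positives, share of positives that are false)."""
    false_positives = alpha * (n_hypotheses - n_true)  # false hypotheses that pass
    true_positives = power * n_true                    # true hypotheses that pass
    positives = false_positives + true_positives
    return false_positives, true_positives, false_positives / positives


fp, tp, share = false_discovery_share(1000, 100, alpha=0.05, power=0.8)
print(round(fp), round(tp), round(100 * share), "percent of positives are false")
```

With the numbers used here that gives 45 false positives, 80 true positives, and 45/125 = 36% of the positive results being wrong.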
Posted by: Avon Barksdale | June 19, 2014 at 01:06 AM
Nick, I don't think your answer is strictly correct. Your newly isolated population contains females who are healthier than average. Some of this is due to genes but some of it is luck; they just happen to have been well nourished or whatever. (If we could "see" their genotype we'd find that they are taller than it predicts.) When they become pregnant they will likely produce healthier-than-average children for that reason. Those months in a healthier-than-average womb make a difference. Over time this nutritional inheritance will fade. I'd guess it fades pretty quickly in fact. Eventually the genotype rules. However your first-generation figure of 105 is the output of an upwardly biased estimator. It's notoriously difficult to distinguish between purely genetic effects and the effects of (lucky) maternal health but obviously gestation is very important.
Posted by: Kevin Donoghue | June 19, 2014 at 04:50 AM
Avon: neat! And if we replicate those 125 positive results using a new data set (which economists often don't have) we should observe regression to the mean, with a smaller number X2 (how many?? It's too early in the morning, and my math is too bad) passing the second generation of tests. But if we repeat on all 125 for a third generation of tests, we should expect to find X3 pass the test, where X3=X2, but it won't be the exact same hypotheses in X3 as in X2. Did I get that right? I think I did. Similarly X4=X3=X2 < X1.
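A rough sketch of that calculation, assuming each replication is an independent test with the same 5% false positive rate and 0.8 power as in Avon's example (the function name `expected_passes` is made up for illustration):

```python
def expected_passes(n_true, n_false, alpha, power):
    """Expected number of hypotheses that pass one fresh round of testing."""
    return power * n_true + alpha * n_false


# The 125 published positives are made up of 80 true and 45 false hypotheses.
x1 = 80 + 45
x2 = expected_passes(80, 45, alpha=0.05, power=0.8)  # second round, all 125 retested
x3 = expected_passes(80, 45, alpha=0.05, power=0.8)  # third round, all 125 retested
print(x1, x2, x3)
```

On those assumptions X2 = 0.8 × 80 + 0.05 × 45 = 66.25, and X3 is the same number, because the mix of true and false hypotheses among the 125 does not change between rounds.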
Kevin: Aha. So when I said that the "environment" on the new island is exactly the same as the old environment, that may not be possible in practice, if we include the mother's womb as part of that environment.
Posted by: Nick Rowe | June 19, 2014 at 08:39 AM
As long as we refuse to publish negative results, this bias will remain. This is why the event and anomaly literature that "tests" the efficient market hypothesis is almost completely wrong. This is why medical results like "eating copious amounts of vitamin D will reduce cancer risk" are ignorable. We see it in the news every day, "Study X shows that if you do Y, good things happen." or "Economists show that A is attributable to B, and so the government should invent a new program." And it goes on and on. Most of this stuff is wrong.
We should not pray at the altar of statistical significance.
Posted by: Avon Barksdale | June 19, 2014 at 09:33 AM
Avon: But IIRC, there was an interesting finance paper that showed that some anomalies partly remained during the short gap between research and publication. In other words: they weren't false positives, but disappeared once people were aware of them and altered their trading strategies.
If we published *all* negative results, it would be very easy to publish. Not sure what the solution is.
Posted by: Nick Rowe | June 19, 2014 at 09:58 AM
Nick,
Agreed. The only difference in what I'm saying is that, for the second generation, you can't be sure genes are 5, and environment is 0, because the "0" is an assumption that the random variable for environment evens out. If the sample is large enough, the assumption is good enough for all practical purposes. But, it's still more likely to be an underestimate than an overestimate, so you have to regress a tiny bit.
Let me think of an example ... OK, try this.
Suppose that all species of critters are different. For some, genes are 0.2 of height. For some, 0.5. For others, 0.6. The distribution is normal, with mean 0.4 and SD, I dunno, say, 0.1.
This is a new critter; you don't know the ratio. You observe 0.5. But that's an observed, random variable, and you don't know the actual parameter.
The expected value of the parameter is NOT 0.5, because the mean is 0.4. That is: the new critter is more likely to be 0.45 with a "lucky" height-enhancing environment, than 0.55 with an "unlucky" height-reducing environment.
That's why you still have to regress to the mean a tiny bit.
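One way to put rough numbers on this is the standard normal-prior, normal-observation update. The prior (mean 0.4, SD 0.1) and the observed 0.5 come from the example above; the standard error of 0.01 on the observed ratio is purely an assumption standing in for "a large sample", and the function name `posterior_mean` is made up for illustration.

```python
def posterior_mean(prior_mean, prior_sd, observed, observed_se):
    """Precision-weighted average of prior mean and observation (normal-normal update)."""
    prior_precision = 1 / prior_sd ** 2
    obs_precision = 1 / observed_se ** 2
    weighted_sum = prior_precision * prior_mean + obs_precision * observed
    return weighted_sum / (prior_precision + obs_precision)


# observed_se=0.01 is a hypothetical value, not something given in the post.
ratio = posterior_mean(prior_mean=0.4, prior_sd=0.1, observed=0.5, observed_se=0.01)
next_generation = 100 + 10 * ratio  # the selected parents were +10 above the mean
print(round(ratio, 3), round(next_generation, 2))
```

With that assumed standard error the estimated ratio shrinks only slightly, to about 0.499, which puts the expected height of the next generation at about 104.99 rather than 105.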
Now, in your example, we don't KNOW that the mean is 0.4. But, I'd still think it's more likely to be below 0.5 than above 0.5. Because, the heritability of most characteristics is below 0.5. (Like, say, enjoyment of slapstick comedy: I'd be surprised if there was a correlation that high between mothers and daughters.) So, 0.5 is still likely to be an overestimate.
This logic does not force the height to regress to 100 over multiple generations. Next generation has an expectation of (say) 104.9, but, after that generation, you have more data, so your estimate of the genetic proportion becomes more stable, and there's less and less regression. I'm thinking that if you worked it out, it would converge very quickly, to, say, 104.89, or something.
----
OK, here's another way to look at it. The first generation regresses to the theoretical mean of THEIR population, which is 100. The second generation regresses to the theoretical mean of THEIR population, which is unknown. It's unknown, but it's likely to be around 105, so there's a lot less regression. And, there's no reason for them to regress back to 100, because 100 is irrelevant to them.
BUT: their unknown mean is more likely to be (a bit) less than 105 than (a bit) more than 105, because high levels of genetic heritability (like 0.55) are rarer than low levels (like 0.45). So when you see 0.5, you're more likely to be observing lucky environment than good genetics, so the next generation is more likely to regress lower than higher.
But not much, because your sample size for the second generation is so large.
If that makes sense.
Posted by: Phil | June 19, 2014 at 11:46 AM
"I'm implicitly assuming no selection effects"
Indeed you are! Otherwise statistical regression would be confounded with causal regression. The trouble is that you are also implicitly assuming that there are selection effects as soon as you say that "the population's genes and environment are not changing over time"; if there were no selection, then genetic drift would cause the "genes" (I think you mean frequency distribution of alleles) to change over time. And if there is no selection effect over three generations then there is none over infinitely many generations.
A pedantic point, you will say. But my experience is that the proper understanding of statistical effects turns on just such points. That is why people say that statistics is boring :-)
Posted by: Phil Koop | June 19, 2014 at 01:34 PM
Phil (not Koop): I'm still not following you. I'm wondering about: errors in variables bias?; some prior info? Not sure.
Phil Koop: OK, if there is no selection in the initial equilibrium (so that 100 is not the only equilibrium), then there could be genetic drift of height in either direction. On average it would stay at 100, but it's a random walk.
Posted by: Nick Rowe | June 19, 2014 at 07:56 PM
"But IIRC, there was an interesting finance paper that showed that some anomolies partly remained during the short gap between research and publication. In other words: they weren't false positives, but disappeared once people were aware of them and altered their trading strategies."
Doubt it. Most of these effects work in sample and then fail out of sample not because people alter their strategies, but because the effect wasn't real in the first place. Markets are far too efficient for this stuff to work.
Posted by: Avon Barksdale | June 19, 2014 at 09:39 PM
Sorry, my explanation isn't great.
Yes, my argument depends on some prior info. I'm assuming that you don't know for sure that the 110s have a genetic factor that makes them 105s. I'm assuming that your prior suggested that less than 105 had a higher prior likelihood than more than 105.
Does that help? Can we agree that if less than 105 had a higher prior likelihood than more than 105, you should expect the next generation to regress lower?
The rest of my argument is why the prior "less than 105 is more likely than more than 105" makes sense.
Posted by: Phil | June 20, 2014 at 02:25 PM
"But you choose individuals that are taller than average, so the average height of your subset is 110. That first generation has children, and that second generation has an average height of 105. That is an illustration of regression toward the mean."
I don't think so, at least not in the original meaning of the term, which was Galton's mistake. Galton thought that the fact that taller fathers had, on average, shorter sons whose height was closer to the mean, was a property of evolution. Note that Galton was looking at fathers who were not segregated from the general population. The mistake is made clear by the fact that taller sons have shorter fathers. Regression works both ways. ;)
This has nothing to do with Phil's point, OC.
Posted by: Min | June 20, 2014 at 05:25 PM
Let me take a whack at this. :)
Let's start, as Nick likes to say, in an equilibrium in which the average height of an organism is 100". The effects of both genetic and environmental variation are normally distributed. Now let us randomly select several individual organisms and segregate them from the rest of the population, while keeping their environment otherwise the same. Let their average height be 110".
Note that for these individuals, neither their genetic variation nor the variation in the environment in which they grew to their mature height are normally distributed. Both will tend to favor height.
To keep it simple, suppose that these organisms reproduce asexually, so that, except for mutations, the offspring have the same genes as the parents. However, the offspring grow up in a different environment from their parents, one which does not favor height. Therefore their height should be shorter, on average, than that of their parents. Say that it is 105". The reason is that they grew up in different environments. Pardon me, but that is not "regression to the mean". It is just the different effects of different environments on height.
Now, what about the next generation? If both their environments and their genes remain the same, we should expect them to have the same average height as their parents. OC, in an evolutionary context, we should not expect their genes to remain the same, but, since their grandparents' genes were adapted to the (constant) environment (as evidenced by the normal distribution of their genes for height), we should expect a gradual adaptation towards a shorter average height.
Posted by: Min | June 20, 2014 at 08:39 PM
Clarification:
I meant to say, "Now let us randomly select several individual organisms whose height is greater than 100"."
:)
Posted by: Min | June 20, 2014 at 08:41 PM
Nick, what does "same as the original environment" mean exactly? It sounds like there was more than one original environment. Do you mean that originally (1st generation) there was a distribution of environments assigned to each individual as it developed, and that new environments are assigned to 2nd generation individuals by randomly drawing from this same distribution?
Posted by: Tom Brown | June 21, 2014 at 12:36 PM
Simple answer: Height is a combination of genes and luck. The individuals of large height might have gotten into that group via either criterion; but their progeny will get the genes while not being any luckier than average.
An analogous situation in economics: Take the set of mutual funds with above-average returns last year. Some of them were smart and some were lucky. The average return from that set next year will be much less than the average return from that set of funds last year. But the average return from that set in the year after next will be roughly the same as the average return from that set next year.
Posted by: Colin Percival | June 29, 2014 at 04:32 AM
Ach, I thought this was going to be about statistical regression to the mean, not the much more complicated biological regression to the mean.
Statistical regression to the mean is an artifact of sampling, basically.
Posted by: Nathanael | June 29, 2014 at 02:53 PM