Canada's 2005 National Graduates Survey asked respondents the following question: "Compared to the rest of your graduating class in your field(s) of study, did you rank academically in the top 10%? Below the top 10% but in the top 25%..." The responses are shown below:
Over a third of respondents (36 percent) placed themselves in the top 10 percent, and a similar number placed themselves between 10 and 25 percent. Only two percent would admit to being below the top half. However, a relatively large number of respondents - 13 percent - reported that they didn't know their academic rank.
Typically researchers code "don't know" responses as missing. However, in this case, throwing away these "don't knows" would mean losing 13 percent of the observations. Moreover, the lost observations would not be a random subset of the sample. My first thought was that "don't know" really means "I don't think I'm in the top 25 percent, but I don't want to admit it."
If I were including this academic rank variable in a regression, I would want to have a series of 0-1 "dummy" or indicator variables: one for membership in the top 10% group, one for 10 to 25 percent, and so on, ending with a 0-1 variable indicating the don't knows. Sure, including five dummy variables uses up a few degrees of freedom, but that's a sacrifice worth making to get a significantly larger and more representative sample.
The problem, though, is that if one downloads the National Graduates Survey public use microdata files, "don't know" is coded as a missing value. The researcher has to recode it from a missing value to a numeric value before creating categorical variables. This isn't easy to do in Stata, because just about every Stata command is programmed to ignore missing values. So how did I do it?
I began by looking at the on-line documentation for the National Graduates Survey. It provided me with this coding information:
Variable PR_Q05B : Compared to the rest of your graduating class in your field(s) of study, did you rank academically
Values | Categories | N
1 | in the top 10%? | 5828
2 | below the top 10% but in the top 25%? | 5738
3 | below the top 25% but in the top half? | 2089
4 | below the top half? | 271
7 | Don't know | 2117
8 | Refused |
Armed with this knowledge, I spent about two hours trying to work out how to recode "7" from a missing value to a numeric value.
It turns out that reading the on-line documentation is only the right strategy 99 percent of the time. The other one percent of the time, other strategies are needed.
I cracked the case by typing, in Stata:
codebook PR_Q05B
(PR_Q05B is the name of the academic rank variable). By comparing the results of the codebook command to the on-line documentation reproduced above, I could work out that the "don't knows" were coded not as "7", but as ".a" - one of Stata's extended missing value codes (".a" through ".z"), which Stata treats as missing just like the plain ".".
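The same check can be made with a tabulation that keeps missing codes in view. This is a minimal sketch, assuming the public use file is already loaded in memory; it only displays what is there and changes nothing:

```stata
* List every value of PR_Q05B, including extended missing codes such as .a
tabulate PR_Q05B, missing

* Count the observations stored as .a (here, the "don't knows")
count if PR_Q05B == .a
```

If the count roughly matches the N reported for "Don't know" in the on-line documentation, that supports reading ".a" as the "don't know" code.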
With that settled, I could then use a somewhat obscure Stata command that recodes missing values into regular numeric ones, mvencode:
mvencode PR_Q05B, mv(.a=5)
This changed the "don't knows" from "missing" to "5".
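For readers who would rather leave the original variable untouched, an equivalent recode can be sketched with generate and replace. The variable name rank5 is hypothetical, and the sketch assumes ".a" is the only code to be converted:

```stata
* Copy the original variable (rank5 is a hypothetical name)
generate rank5 = PR_Q05B

* Turn the extended missing code .a into the ordinary numeric value 5
replace rank5 = 5 if PR_Q05B == .a

* Confirm that the former .a cases now appear as category 5
tabulate rank5, missing
```

Any other missing codes, such as the one used for "refused", stay missing in rank5 and are still dropped from estimation.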
Now that the "don't knows" are just a category like "top 10%" or "below the top 50 percent", incorporating academic rank into a regression is as easy as typing
xi: regress dependent i.PR_Q05B
(In Stata 11 and later, factor-variable notation is built in, so the xi: prefix is unnecessary).
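A sketch of the version without xi:, assuming Stata 11 or later and that PR_Q05B has already been recoded so the "don't knows" are category 5. As in the command above, dependent stands in for the earnings variable; rank_ is a hypothetical stub name:

```stata
* ib1. makes category 1 (top 10 percent) the omitted reference group
regress dependent ib1.PR_Q05B

* Optional: create explicit 0-1 indicators rank_1 ... rank_5,
* one per category of PR_Q05B
tabulate PR_Q05B, generate(rank_)
```

With category 1 as the base, each coefficient reads as the earnings difference relative to the self-reported top 10 percent.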
So what's the bottom line? How do the don't knows do? I did a quick comparison of earnings across the groups. Although the results were only on the edge of statistical significance, the "don't knows" seemed to have the highest earnings of any group - about $1,000 higher than those who reported being in the top 10 percent of their class. They earned significantly more than those who self-identified as mid-ranked. Of course, this regression does not control for age, sex, program of study, or myriad other potentially important factors.
Earnings relative to those who self-identify as being in the top 10 percent. OLS regression, National Graduates Survey, Canada, 2005
Rank | Coefficient | P>t
10 to 25 percent | -1,036 | 0.015
25 to 50 | -2,511 | 0.000
Below 50 | -5,377 | 0.000
Don't know | 1,016 | 0.086
Constant | 42,716 | 0.000
N | 12,875 |
R-squared | 0.0033 |
HTs: to Tiger for finding this variable and to Kevin Milligan for helpful discussion.
And then there's the "refused" group. :-)
Posted by: Dave Giles | October 06, 2012 at 12:06 PM
Dave, this raises the question: is it necessary to make a distinction between "don't know" and "refused"? If there's no need to distinguish the two categories, then one can just do a much easier recode (it works because Stata treats missing values as larger than any number): replace rankingvariable = 5 if PR_Q05B>=5
In this case, though, I thought the difference was interesting (there were relatively few refused - much less than the other categories).
Posted by: Frances Woolley | October 06, 2012 at 12:28 PM
Assuming, given all the disclaimers you offer at the end, that there is a significant relationship between choosing Don't Know and income, we might guess:
1. Some high earners are just enjoying their income and leisure, and are serenely uninterested in past markers of status, such as academic rank. (This would not be true of all high-income types.)
2. There is a link between income and one or a combination of honesty, humility, or high rational meta-awareness (i.e., they know that few people could give an accurate retrospective answer to the question, and can constrain the emotional impulse to rank themselves highly).
3. The don't-knows were academically disinclined students who dropped out and became HVAC technicians, and thus have higher incomes than most university grads.
Posted by: Shangwen | October 06, 2012 at 02:43 PM
Or
4: They are intelligent enough to understand that it is hard to aggregate relative strengths and weaknesses into some ordinal measure of academic excellence.
That 35 percent think that they are in the top 10 percent is not very strange. If we limit ourselves to economics students:
Assume that you have some socialists, libertarians, liberals and conservatives. Assume that some of them think that the proper way of acquiring knowledge about the economic system is to take empirics seriously and others think that the data contain too much noise and that you have to rely on deduction. Assume that some think that a very simplified but consistent GE framework is the way to think about issues and that others would rather rely on more detailed PE models while assuming that everything else stays constant. Assume that some think it is important to ask “what does this really mean” while others think it is more about solving the equations they are given. Assume that some think rational agents are a harmless approximation while others think that they really aren't. Etc. etc. etc.
How many of these people would not think, “Hell, my classmates do not understand the first thing about economics – I'm one of the few who actually understood what we were doing”?
Posted by: nemi | October 07, 2012 at 07:38 AM
I'm not very surprised by this result. There is no clear definition of what it means to be among the best 10%. The respondents can assume the question is about their studies, their final exam or just about success in their career. Or, the more pessimistic explanation would be that in America, it is common to have higher levels of self-confidence, which can produce this type of result. I remember reading about Economy in British Columbia vs. Ontario and it could not bring any serious results. I mean, don't let somebody from Vancouver compare his city with Toronto, and don't let somebody compare himself with his schoolmates...
Posted by: JB | October 08, 2012 at 01:07 PM
Thank you, Professor Woolley!
Posted by: Tiger | October 08, 2012 at 04:53 PM
Tiger - I wanted to do this for my own satisfaction. These missing observations type problems are endemic in the Canadian public use microfiles, but there's very little information anywhere that describes how to cope with them. I'll write up some more detailed notes and post them here. By next week, guaranteed.
Posted by: Frances Woolley | October 08, 2012 at 05:14 PM
Or for your American readers: Lake Wobegon, where "all the women are strong, all the men are good looking, and all the children are above average." For Canadians who don't do NPR, google "Lake Wobegon".
Posted by: Chris J | October 08, 2012 at 06:27 PM
If the survey was of graduate students describing how they ranked in their undergraduate program, it makes some sense. But if this is a survey of graduates of graduate-level programs, then how can 42% of the people surveyed finish in the top 10% of their (graduate-level) class?
It would seem to me that graduates have an inflated sense of their own relative performance.
Posted by: Robillard | October 09, 2012 at 04:30 PM
Professor Woolley, I love reading your posts! Longtime reader, first-time commenter...
I was interested in whether you think this could be cognitive bias.
http://en.m.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect#mw-mf-search
How did the second and third ranked do?
Posted by: Chris OBrien | October 10, 2012 at 02:28 PM
I think that what is going on is that it is pretty easy to remember whether or not you were in the top half of your class but more difficult to place yourself more precisely in the distribution. I think that the "don't knows" would be more evenly distributed if the survey had included a "50 to 75", "75 to 90" and "bottom 10%" so that the amount of information required to answer the question would have been somewhat less dependent on the answer.
Posted by: Allan Pollock | October 11, 2012 at 11:34 AM