Relative grade inflation is when one professor grades easier than another professor. Or when one department grades easier than another department.
Operation Birdhunt was my attempt (OK, my colleague Marcel-Cristian Voia did the actual work) to do an econometric study of relative grade inflation across departments. Birdhunt was a failure, but a noble failure.
The absolute level of grades doesn't matter. It makes no difference whether we grade on a 100-point scale from 0 to 100, or a 10-point scale from 90 to 100. OK, it might matter a bit, if we can't use decimals, so can't give a 95% student a 99.5% grade on the new scale. We lose the power of fine discrimination if every student gets an A-, A, or A+. And that might be a problem at some universities, but isn't a problem at Carleton. Our A's still mean something. So I never worried about absolute grade inflation.
What I did worry about, when it was my job to worry about such things, was relative grade inflation. If one prof grades easier than other profs, or one department grades easier than other departments, then grades become a noisy signal, and decisions based on noisy signals will be bad decisions.
If everybody knew who the bird professors and bird departments were, and everybody knew that everybody knew this...if it were common knowledge, in other words, it wouldn't be a problem. Students, the university, and prospective graduate programs and employers, would all take that information into account, apply the appropriate deflator to the grades, and base all their decisions on inflation-adjusted real grades. Just like monetary neutrality. But we don't all know this. So relative grade inflation is non-neutral.
If nobody knows who the birds are, it's just like random noise. Students get misleading signals on their ability to advance to the next level, and on their comparative advantage between subjects. Grad schools, scholarship awarders, and employers get misleading signals on who the best students are.
If the students know who the birds are, but nobody else does, the students flock towards the bird courses, rather than taking the courses they are good at and enjoy.
Even if the university knows who the birds are, or at least has strong suspicions, it's not likely that universities will make promotion and graduation rules contingent on which particular professor gave the student the grade. It's like sticky prices causing non-neutrality.
What surprises me is not that grade inflation exists; what surprises me is that it isn't so pervasive that grades mean nothing at all. With monetary inflation, there does exist some sort of nominal anchor which can make the overall price level determinate. The central bank fixes the price of gold, or fixes the supply of money, or fixes some sort of price level or inflation target, and that means there is an equilibrium price level, at least in principle.
There is no nominal anchor for grades. Grades mean whatever we choose them to mean, and what they mean depends only on what we all think they mean. Grades are like language. The word "cat" only means cat because everybody uses the word that way. But at least with words we can point to a real cat and say "that's what 'cat' means". Now I could point to a student's economics essay and say "that's what 'A' means", but it might not mean much to a sociology professor.
At best, a common set of grading standards is a convention -- the Nash Equilibrium to a pure coordination game. Just like language, or driving, all the professors want to use the words "cat" and "A" to mean the same thing that other professors mean, and to drive on the right/left if everyone else drives on the right/left. That's at best, if we assume that professors have no incentive to grade easier or harder than other professors.
Coordination games need a focal point to get everyone to the same Nash Equilibrium. For the driving game, the focal point is obvious: just watch which side of the road everyone else is driving on. But for grades, especially for new professors, or visitors, the focal point is not so obvious. It's like me steering a boat up the Grand Union Canal for the first time (fortunately, the barge coming the other way came to a dead stop).
I think it was Dean Allan Maslove (an economist, naturally) who introduced such a focal point into our Faculty. He sent all professors a description of the typical grade distribution for the Faculty. Any professor who submitted grades too far above or below that typical distribution would need to add a short note of explanation before the grades would be approved (by me, in my old job).
In practice, almost anything would count as an "explanation". All it really did was to force all the professors to look at the same focal point, and let me know they knew what the grades meant. And it worked, more or less.
But I wasn't really satisfied. You can't tell, just from looking at the grade distributions, whether one department is grading easier or harder than another department. Maybe department X attracts better students than department Y, and so ought to be giving higher grades? How can you distinguish between a department that has good students and a bird department? (And you can't use their students' high-school grades, because those may have the same problem).
Operation Birdhunt was an attempt to distinguish the two, econometrically. The basic idea was very simple. If some students take courses in both departments, but those students on average get higher grades in X than in Y, then X is birdier than Y.
I am not good at econometrics, so don't trust any of the following.
The basic model is this:
Student i's grade in course j = student i's smarts + course j's birdiness + random error.
I think it's called a "two-way fixed effects model with panel data". We don't observe student i's smarts, so each individual student gets a dummy variable. We don't observe course j's birdiness, so each course gets a dummy variable. That's a very large number of dummy variables for a medium-sized university, even though we did it by department and year-level, rather than down to the level of specific courses.
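In code, the idea looks something like this. (A toy sketch with made-up numbers and just a handful of students and departments; it is not the actual program Marcel ran, just the same model in miniature.)

```python
# Toy sketch of the two-way fixed effects idea (made-up data, not the real thing).
# Each student gets a dummy for smarts, each department gets a dummy for birdiness.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "student": ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4"],
    "dept":    ["econ", "soc", "econ", "soc", "soc", "hist", "econ", "hist"],
    "grade":   [78, 85, 65, 74, 80, 71, 70, 62],
})

# grade_ij = smarts_i + birdiness_j + error_ij, estimated by OLS with dummies.
# One student and one department are dropped as the baseline, so birdiness
# is only identified *relative* to the omitted department.
fit = smf.ols("grade ~ C(student) + C(dept)", data=df).fit()
print(fit.params)  # estimated smarts and relative birdiness
print(fit.bse)     # the standard errors that sank Operation Birdhunt
```

Scale that up to every student and every department in the university, and that's Operation Birdhunt.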
We got the data (stripped of anything that could put a name on an individual student) and Marcel fed it into a supercomputer, which made a loud crunching sound for a long time, simultaneously estimated every student's smarts and every course's birdiness, then spat out the answers for birdiness. That was the success. The model gave me a numerical estimate of each department's birdiness.
But closer inspection revealed that Operation Birdhunt had failed miserably. The standard errors were very large -- larger than the difference between any two departments' estimated birdiness. So I was unable to say, with any confidence whatsoever, that department X was more birdy than department Y.
For a long time I couldn't figure out why Operation Birdhunt had failed. Now I think I know why.
There's a big divide in the University, between the Sciencey half and the Artsy half. And not many students cross that divide. And some specialised departments are a bit like closed shops to outsiders. And the computer was trying to figure out the birdiness of each department relative to every other department in the University. And the data, despite the massive sample size, just couldn't give the computer what it needed to do it.
Some pairs of departments just weren't really comparable, in practice. If there are students taking courses in both X and Y, and students taking courses in both Y and Z, but no students taking courses in both X and Z, the computer can only compare the birdiness of X and Z indirectly, via comparing each to Y. And the uncertainty of any pairwise comparison adds up as the chain gets longer, and the computer was trying to tell us this, by reporting very large standard errors. A failure of transitivity: the computer could compare the birdiness of X and Y with confidence, and Y and Z with confidence, but could not compare X and Z with confidence.
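To see how the uncertainty piles up, suppose the data let the computer estimate the X-Y gap and the Y-Z gap directly, each with its own standard error, but never the X-Z gap. The implied X-Z comparison inherits both errors. A back-of-the-envelope sketch, with invented numbers:

```python
# Back-of-the-envelope sketch (invented numbers): along a chain of indirect
# comparisons the standard errors add in quadrature, assuming the two pairwise
# estimates come from (roughly) independent groups of students.
import math

est_xy, se_xy = 3.0, 2.5   # estimated birdiness gap X - Y, and its standard error
est_yz, se_yz = 2.0, 2.5   # estimated birdiness gap Y - Z, and its standard error

est_xz = est_xy + est_yz                 # implied gap X - Z
se_xz = math.sqrt(se_xy**2 + se_yz**2)   # bigger than either link in the chain

print(f"X - Z = {est_xz:.1f} with standard error {se_xz:.1f}")
# A 5-point gap with a standard error of 3.5 is hard to distinguish from zero.
```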
Or maybe it was a failure of additivity. Students majoring in X have a comparative advantage in X, and do relatively worse in their Y option, just as students majoring in Y do relatively worse in their X option. So the simple additive model can't capture these interactions between students and courses.
Those are my guesses anyway. But again, I'm not an econometrician.
It seems to me that your definition of a bird course is too simple. This is of course anecdotal, but it was my experience in undergrad that bird courses weren't necessarily known for higher average grades. They were commonly known for low fail rates or lighter course loads. It was also my experience that I could get a much higher grade in courses that were known to be difficult and had heavy workloads (e.g. 4th year econometrics or 4th year microeconomics) than in courses commonly known as bird courses. It also seemed to me that courses known to be easy were electives rather than mandatory courses, so your analysis might change if you restricted the sample to only elective courses.
Posted by: JDUB | January 07, 2011 at 11:43 AM
My response as an econometrician is that if the standard errors are too big, you might be able to tighten them up by imposing some more structure and reducing the number of parameters to estimate.
Posted by: Stephen Gordon | January 07, 2011 at 12:07 PM
JDUB: Step one: look at the raw distributions of grades. Step two: do what I tried to do. Step three: look at some of the things you are talking about. Yours are good points, but I never got past step two.
Stephen: That was my view when I decided to do it by department, rather than by course. It's like imposing the restriction that all courses in the same department are equally birdy, so there are a lot fewer parameters to estimate.
But suppose that *no* Science student ever took an Arts course, and vice versa. Then the model would be unable to estimate if Science departments or Arts departments were birdier, but it would still be able to compare the Science departments to each other, and the Arts departments to each other, if you split the model in two. That, in a more extreme version, is why I think the estimates had such large standard errors.
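If I still had the data, a quick way to check that hunch would be to treat students and departments as a graph and look for disconnected (or barely connected) pieces. A rough sketch, with an invented enrolment table and the networkx library doing the graph work:

```python
# Rough sketch (invented enrolment records): which departments can even be
# compared? Link two departments whenever some student took courses in both,
# then look at the connected components of the resulting graph.
from collections import defaultdict
import itertools
import networkx as nx

enrolments = [  # (student, department) pairs, purely illustrative
    ("s1", "econ"), ("s1", "soc"),
    ("s2", "soc"), ("s2", "hist"),
    ("s3", "physics"), ("s3", "chemistry"),  # the Sciencey island
]

depts_by_student = defaultdict(set)
for student, dept in enrolments:
    depts_by_student[student].add(dept)

G = nx.Graph()
for depts in depts_by_student.values():
    G.add_nodes_from(depts)
    G.add_edges_from(itertools.combinations(depts, 2))

for component in nx.connected_components(G):
    print(sorted(component))
# Birdiness is identified only up to a constant *within* each component;
# across components there is nothing to estimate at all.
```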
Posted by: Nick Rowe | January 07, 2011 at 12:28 PM
Hmm. I see. You want to be able to 'triangulate'.
Another approach: introduce 'birdness' as a latent binary variable? Looks like a good MCMC project for a graduate student. Do you still have the data?
Posted by: Stephen Gordon | January 07, 2011 at 12:34 PM
Not any more. I could possibly get new data, but I don't have the "need to know" that I used to have. And it took the university some effort to scrub the identifiers off the data, while giving each student a new "number". This is really just memories of my past life; now I'm just a regular prof.
Posted by: Nick Rowe | January 07, 2011 at 12:56 PM
Oh well. File it along with my other empirical projects that were abandoned after 30 minutes...
Posted by: Stephen Gordon | January 07, 2011 at 01:02 PM
Stephen: About how many empirical projects a year do you usually abandon after 30 minutes? Just curious, as I am a freshly graduated applied economist, and have already abandoned what seems like a huge number of ideas for empirical projects after realizing the data or the methodology just won't work.
Posted by: JDUB | January 07, 2011 at 01:52 PM
Too many to count! Especially with the blog - everything looks interesting, and everything looks too hard.
Posted by: Stephen Gordon | January 07, 2011 at 01:56 PM
This is a huge problem for students, and it's too bad you couldn't squeeze down your errors. I know that this has been a big problem for my wife, who's been unable to get any scholarships in law school, because her undergrad GPA in business is competing against undergrad GPAs in education, which most people wouldn't expect to be directly comparable. The same principle is presumably causing similar inequality at the admissions stage, and in countless other places, though it didn't affect us personally.
It would be really useful if someone could establish a widely-accepted GPA standardization system, which would allow all graduate and professional schools, scholarship bodies and employers to compare candidates who attended different schools or different departments.
Posted by: Neil | January 07, 2011 at 02:06 PM
Neil: I have been on scholarship committees before, and what helps the most is not similar GPA systems, but including indicators on transcripts that show where the student's grade for a course stood in comparison to their classmates'.
Posted by: JDUB | January 07, 2011 at 02:15 PM
Neil: that's why I wanted to try to do something about it, in my old job. But even though my method, in principle, could work to compare different subjects within the same university, I'm not sure how well it could work at comparing different universities. You would need a lot of data from transfer students, who took courses at two universities.
I vaguely remember hearing about one Canadian university that tried to measure relative grade inflation at high schools, by comparing students' university GPAs with their high school GPAs.
In my day, in England, we all sat common exams at the national level, and universities based their admissions on those grades. But I hear suspicions that those (A-level) exams are easier in some subjects than in others, which ought to be testable since each student will take several subjects at A-level.
Posted by: Nick Rowe | January 07, 2011 at 02:29 PM
JDUB: reporting class mean or median grades on the transcript may help a bit. But if some classes attract only the smartest students, a high class average may not mean it's an easy grade. That was the problem I was trying to get around.
Posted by: Nick Rowe | January 07, 2011 at 02:32 PM
Just to add more possible confusion: you are assuming a student's smartness is stable. Note that observed smartness is composed of both intrinsic smartness and effort. Effort is likely to be variable over time. A comment from a Software Engineering prof who was trying to determine the effects of various organizational techniques on software productivity was that those effects were swamped by boyfriend trouble or girlfriend trouble.
Posted by: Jim Rootham | January 07, 2011 at 02:36 PM
Jim: yep. But in my case I don't think that would normally cause a big problem; if the data set is large enough (which it was), those omitted variables ought to be random errors that roughly cancel out on average. (Unless taking a particular course happens to be correlated with boyfriend/girlfriend trouble!)
Posted by: Nick Rowe | January 07, 2011 at 02:46 PM
Nick, I think you have described precisely why American colleges are so attached to SAT, GMAT, LSAT, and GRE tests. Canadian universities seem unusual in having selective admissions without any uniform testing to supplement the "noisy signals" from course grades.
Posted by: Gregory Sokoloff | January 07, 2011 at 03:20 PM
Has the Ontario government ever commissioned a major study on grade inflation at the high school level? I think it is a serious problem based on my experiences as a student. It seems to be a widespread phenomenon but is happening in different ways at different schools. How much can universities actually know about individual high schools?
Posted by: Kevin | January 07, 2011 at 03:41 PM
Nick, all of your guesses seem reasonable, but ... as Stephen noted, there really are too many parameters in the model. And if you are going to try to reduce the number of parameters, wouldn't it be better to focus on i (number of students) than on j (number of courses or departments or whatever)? Surely, no matter how you measure j, i must have been bigger?
The idea then would be to re-write your model to remove the dummy parameters for student intelligence and substitute something observable, such as relative course rank. The model would then be:
Student i's grade in course j = F(student i's rank in j, birdyness of j) + Error
F would be chosen according to taste, such as (rank_i + birdyness_j) or possibly (rank_i * m_birdyness_j + a_birdyness_j).
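In regression terms, something like this (a sketch only, with made-up data and hypothetical column names):

```python
# Sketch only (made-up data, hypothetical column names): swap the student
# dummies for the student's within-course rank, and give each department its
# own intercept (additive birdyness) and slope (multiplicative birdyness).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "grade":    [85, 72, 60, 90, 80, 68, 75, 66, 58],
    "rank_pct": [0.9, 0.5, 0.1, 0.9, 0.5, 0.1, 0.9, 0.5, 0.1],  # percentile rank within the course
    "dept":     ["econ"] * 3 + ["soc"] * 3 + ["hist"] * 3,
})

additive   = smf.ols("grade ~ rank_pct + C(dept)", data=df).fit()  # rank_i + a_birdyness_j
interacted = smf.ols("grade ~ rank_pct * C(dept)", data=df).fit()  # rank_i * m_birdyness_j + a_birdyness_j
print(interacted.params)
```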
Posted by: Phil Koop | January 07, 2011 at 04:06 PM
Kevin: "Has the Ontario government ever commissioned a major study on grade inflation at the high school level?"
I don't know. But if it hasn't been done, it would be a very good idea to do one. Each school sends students to many different universities, and each university accepts students from many different schools. And there are lots more students than schools and universities. So it ought to be possible from that data to get an accurate estimate of each school's birdiness and each university's birdiness.
Damn good policy research proposal. And it should be done at the Provincial government level.
Phil: I'm going to have to think about that. I can't get my head around it yet.
Posted by: Nick Rowe | January 07, 2011 at 04:20 PM
Wikipedia has an interesting entry on grade inflation:
http://en.wikipedia.org/wiki/Grade_inflation#Grade_inflation_in_Canada
Posted by: Patrick | January 07, 2011 at 04:22 PM
Or, maybe the Ontario universities themselves should band together and do it. They all have the data, after all. (But maybe universities, like the schools, would be worried about the results being public!)
Posted by: Nick Rowe | January 07, 2011 at 04:23 PM
The results could be scary for many universities and high schools. At the same time, it could illuminate some cases where great teaching is being done (like my high school). I would imagine the ministry of education and the teachers' union would not like such a study being done though.
If universities were able to factor in the quality of one's high school in admissions decisions it would create a nice incentive (in the form of parental pressure) for high schools to improve.
Unfortunately, even with all this analysis of grades and distributions one thing will always remain true to me. Inputs matter. If you put garbage in, you usually get garbage out, even if the garbage follows a nice bell curve!!!
Posted by: Kevin | January 07, 2011 at 04:47 PM
Interesting wikipedia article. I am primarily interested in the heterogeneity between schools in Ontario, i.e. rural vs. urban, high vs. low income.
Posted by: Kevin | January 07, 2011 at 04:50 PM
I am not surprised. I usually found the difference in professor to be more significant than the difference in subject. Some were better and others worse. Some wanted to teach and others were more intent on shrinking the size of their classes. Some were enthusiastic and supportive and others dogmatic and rigid.
Posted by: Lord | January 07, 2011 at 04:55 PM
Nick: "maybe universities, like the schools, would be worried about the results being public!"
Many of them surely would. So they should be forced to participate as a condition of their provincial funding. And while we are at it, we should track student future earnings and throw that into the model too. Not that unprofitable (for the students) programs/universities should be cut off, but students should at least know their realistic prospects before they decide to drown themselves in student loans. If they still choose to enroll, we can choose as a society whether that program/university has other socially redeeming qualities that make it worth funding.
You're definitely right Nick... a lot of departments/universities definitely don't want this information out there.
Posted by: K | January 07, 2011 at 05:00 PM
Do econometricians use Rasch analysis? It's often used in psychometrics and seems like a good fit for this problem.
Posted by: Brett | January 07, 2011 at 06:14 PM
Never heard of it. But a quick check of Wikipedia says that it's what we call a logit model. And yes, we do use it. My binary latent variable idea was a slightly more generalised version of this approach.
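For anyone who wants to play with the pass/fail version, it maps onto this problem almost directly (toy data, illustrative only):

```python
# Toy illustration of the Rasch / logit idea (made-up data): the probability of
# passing depends on student ability minus course difficulty, both entered as
# dummies, so a low estimated "difficulty" is a hint of birdiness.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "student": ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4", "s5", "s5", "s6", "s6"],
    "dept":    ["econ", "soc"] * 6,
    "passed":  [0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1],
})

# P(pass) = logistic(ability_i - difficulty_j), up to a normalisation
fit = smf.logit("passed ~ C(student) + C(dept)", data=df).fit()
print(fit.params)  # the C(dept) coefficient is the relative easiness of soc vs. econ
```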
Posted by: Stephen Gordon | January 07, 2011 at 06:23 PM
I believe some/all universities do this. The Math Faculty at Waterloo compares high school transcripts to performance in mandatory first year courses such as calculus and classical algebra to detect relative grade inflation between schools then uses this in selecting prospective students. This really is a problem, as some high schools are pumping out students with ~100% averages at highly suspect rates.
They do some neat analytics there. They may be worth contacting.
Posted by: Andrew F | January 07, 2011 at 06:40 PM
I wonder if you could use data from, say, law school applicants (or MBA or medical school applicants) since they typically are required to write a standardized exam to be admitted (be it the LSAT, the GMAT or the MCAT). That would give you a common measure of "smarts" across students (setting aside the debate about the merits of such tests - although presumably any error in LSAT score as a measure of smartness wouldn't be correlated with course birdyness), so you wouldn't have to rely solely on the individual student dummy variables to control for "smarts". You probably couldn't do course-level birdyness, but a large law school like Osgoode or UofT probably gets enough applicants (they typically get a few thousand applicants a year and you might be able to collect a few years' worth of data) that you could do a reasonably robust estimate of department-level birdyness (though I guess you'd need to add a university-specific dummy variable).
Moreover, from a practical perspective, law school applicants might be a useful dataset because (a) the admissions committee would have all the raw data you'd need (indeed, they may have already semi-processed it for the admissions process - i.e., redacted identifying information to avoid allegations of racism, sexism or nepotism in admissions), (b) a law school typically has one or two law and economics guys (or girls) who could be a co-author (and would probably be thrilled to be associated with a "real" economist) and who might make it easier to get the data (since they're "in" the faculty), and (c) the question you are trying to answer could be of real value to the admissions committee. Moreover, the problem of trying to compare "artsy" students with "sciency" students might be mitigated since it's the nature of law students that, whatever their undergrad major, they've probably taken courses that require them to write an essay or two (i.e., Artsy courses). For the same reasons, a med student dataset would also work, since med schools typically have minimum science prerequisites.
Posted by: Bob Smith | January 07, 2011 at 09:53 PM
A maximum likelihood approach supposedly can distinguish the relative strengths of each class in a school. All grades everywhere are compared to all classes everywhere, such that persons taking easy-A courses will get As, but the C in Physics still might be given more relative worth. It certainly can determine the top students, even if they take harder courses. I don't know the process much, but I know it's used in chess to determine the relative strengths of players over 150 years (even though they obviously might not have played each other). Keene and Divinsky's "Warriors of the Mind" explains the use for chessplayers. The process is also used by a lot of sports computer programs to determine the relative ranking of basketball or football teams. All it requires is enough overlap of teams playing each other that relative comparisons can be made. The Sagarin and Pomeroy sites are examples. I'd check some of these.
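The usual machinery behind those ratings is a Bradley-Terry type model; a bare-bones sketch with made-up results (note that A and C never meet directly, just like the departments that share no students):

```python
# Bare-bones Bradley-Terry sketch (made-up results): estimate relative strengths
# from pairwise outcomes by maximum likelihood, the same trick used for chess
# and sports ratings. A and C never meet; they are compared only through B.
import numpy as np
from scipy.optimize import minimize

players = ["A", "B", "C"]
games = [("A", "B"), ("A", "B"), ("B", "A"), ("B", "C"), ("B", "C"), ("C", "B")]  # (winner, loser)
idx = {p: i for i, p in enumerate(players)}

def neg_log_lik(strength):
    strength = strength - strength.mean()  # strengths identified only up to a constant
    ll = 0.0
    for winner, loser in games:
        diff = strength[idx[winner]] - strength[idx[loser]]
        ll += diff - np.log1p(np.exp(diff))  # log P(winner beats loser) = log logistic(diff)
    return -ll

result = minimize(neg_log_lik, np.zeros(len(players)))
ratings = result.x - result.x.mean()
print(dict(zip(players, np.round(ratings, 2))))
```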
Pete
Posted by: Pete Bias | January 07, 2011 at 10:51 PM
The problem, it seems to me, is complicated by the fact that this could very easily be gamed by any department wanting to avoid the bird designation: fit your output (the grade curve) to the expected parameters (the bell curve). As long as some reasonable proxy can be found for grades that correlates with performance elsewhere - and a pop quiz on almost anything would work - Professor Bird would probably never be identified by this method.
As another commenter pointed out, if the bird designation is more about not demanding much of students than about giving higher grades, Birdhunt would be almost impossible to run as a stats model.
The best identification method might be looking for distributions that fit too well, or for profs who never have to justify being outside the curve.
Posted by: GA | January 08, 2011 at 06:14 AM
Nick:
One fundamental source of noise I can see in this model is the fact that most students will have courses they "don't care about". They will in essence have random grades in those courses and will only avoid failing. Now, how do you filter out this kind of 'crap data' from 'good data', i.e. the courses a student has put a lot of energy into versus the courses the student does not care about? The preferences of the students are unobserved as well.
I think, if you want a real result, then there's a much better, partly statistical and partly psychological approach to identifying birds more reliably: calculate the distribution of those students that get the worst grades in general, and look for courses that bad students are (much) less likely to fail. These are typically the students that will search hardest for bird courses.
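In rough terms, that heuristic would look something like this (made-up records, hypothetical pass mark of 50):

```python
# Rough sketch of the "follow the weakest students" heuristic (made-up data):
# take the bottom quarter of students by overall average, then see which
# courses those students almost never fail.
import pandas as pd

records = pd.DataFrame({
    "student": ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4"],
    "course":  ["c1", "c2", "c1", "c2", "c1", "c2", "c1", "c2"],
    "grade":   [42, 68, 45, 71, 80, 75, 85, 78],
})
records["failed"] = records["grade"] < 50  # hypothetical pass mark

averages = records.groupby("student")["grade"].mean()
weakest = averages[averages <= averages.quantile(0.25)].index  # bottom-quartile students

fail_rate = (records[records["student"].isin(weakest)]
             .groupby("course")["failed"].mean()
             .sort_values())
print(fail_rate)  # suspiciously low fail rates among weak students -> bird suspects
```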
And a course can be a 'bird' for many reasons: it might require less effort to learn its material, the test might be artificially easy or predictable, or the test environment might allow cheating ...
Sometimes not even the lecturer is aware of a course's birdiness ...
One thing is sure: badly performing students will map out these courses sometimes with as much investment of effort as it takes to pass an exam :-)
Note that the reverse is not true: well-performing students will not map out "non-bird" courses. Even they will balance their workload and will consciously select known bird courses they deem less important for their future careers.
Posted by: White Rabbit | January 08, 2011 at 08:57 AM
Nick - I think you ought to look at programs that accept students from many backgrounds: say, professional education programs that people take after a Bachelor's degree in a subject area. Check the acceptance criteria for that program: is it a common grade level, or do they accept so many future math teachers, so many future science teachers and so many future English teachers? How do those students do?
What this will do is impose a selection bias on your results, because you will only get students good enough to complete a program well enough to get into another. But if you are trying to find out if a B- is becoming a B+, this may be a good place to look.
Posted by: Chris J | January 08, 2011 at 09:07 AM
... I see Bob got there first with this idea...
Posted by: Chris J | January 08, 2011 at 12:24 PM
Some schools would be much more suitable for this sort of study than others. For instance, I go to a liberal arts college (in the US) where EVERYONE is required to complete a series of courses on the history of Western thought. Here's the really crazy part: Every professor is required to teach those classes as well. There are no professors who teach only those courses. If you were looking for a simple dataset to model, it would be an ideal place to look.
On the anecdotal side, I've wondered why it is that almost all the honors students in the School of Sciences seem to be Psychology majors...
Posted by: Blikktheterrible | January 08, 2011 at 05:43 PM
On further thought, the other advantage of using professional schools is that they might impose a common first-year curriculum (law schools generally do). So, you might be able to use first-year grades in common courses as your "smarts" control variable, since those courses are presumably equally "birdy" (generally, at least at UofT, they had common 100% exams) for all students and should be uncorrelated with undergrad bird courses.
Posted by: Bob Smith | January 10, 2011 at 09:57 AM
Two modest suggestions:
1) You argue that large standard errors are caused in part by the fact that some comparisons are hard, since it is very rare to find students in both courses. Okay... give up on that. From a student's point of view, such comparisons are the least interesting, since very few would consider taking both. This has a lot of potential value-added for departmental administrators and students if you can just get it working at a departmental level. (Given that most programs require all students to take a set of "core" courses, comparisons within a department should be well-observed.)
2) I'm trying to think of other applications that solve the same problem. High-school ranking, sports rankings (e.g. college teams, or professional players of different eras) and publication ranking seem to face similar problems: there's a limited number of observations on each unit, N is very large, and the observations compare a small number of units using an idiosyncratic standard. What do they do?
Posted by: Simon van Norden | January 13, 2011 at 12:03 PM