My latest time-wasting tech toy is Google Ngram Viewer. What's an ngram? A one-word phrase, like "Marx", is a one-gram. "Karl Marx" is a two-gram. "K. Marx" is a three-gram (the period counts as a gram too). In general, a phrase made up of n grams is an ngram.
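To make the counting rule concrete, here is a minimal sketch of splitting a phrase into grams. The `tokenize` rule (runs of letters or digits, with each punctuation mark as its own gram) is an assumption made for illustration; Google's actual tokenizer is more elaborate.

```python
import re

def tokenize(text):
    # Assumed rule: a run of word characters is one gram,
    # and each punctuation mark is a gram of its own.
    return re.findall(r"\w+|[^\w\s]", text)

def ngrams(text, n):
    # Slide a window of n grams across the token list.
    tokens = tokenize(text)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(tokenize("Marx"))       # ['Marx'] -> a one-gram
print(tokenize("Karl Marx"))  # ['Karl', 'Marx'] -> a two-gram
print(tokenize("K. Marx"))    # ['K', '.', 'Marx'] -> a three-gram
```

Under this rule, `ngrams("Karl Marx wrote", 2)` gives the two overlapping two-grams "Karl Marx" and "Marx wrote".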
The Ngram viewer is based on Google Books. Google has scanned millions of books (including pamphlets, treatises etc). It has then counted how often every single word (including names, punctuation marks, etc) appears in those books, year by year. For example, in 1860, the word "international" was used 513 times in 162 books published in Britain that year. Google has also counted how often every two-word phrase was used, and every three-word phrase.
With a research tool like this I'm like a kid in a candy store - I'm just gazing around hardly knowing where to begin.
Because the Ngram viewer is case sensitive, it's a little hard to use. It also doesn't allow the use of logical operators in searches, so, for example, one can't search for J.M. Keynes | John Maynard Keynes (using the logical operator | to represent "or"). Yet even within those limitations, some exploratory analysis is possible.
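Once the yearly figures for each spelling are in hand, the missing "or" can be approximated by adding the series together. The sketch below assumes the figures are held as plain dictionaries keyed by year; the function name and the numbers are made up for illustration.

```python
def combine_variants(series_by_phrase, variants):
    """Sum yearly values across several spellings of the same name."""
    combined = {}
    for phrase in variants:
        for year, value in series_by_phrase.get(phrase, {}).items():
            combined[year] = combined.get(year, 0) + value
    return combined

# Made-up mention counts, for illustration only.
mentions = {
    "Keynes": {1936: 50, 1937: 70},
    "keynes": {1936: 2},
    "KEYNES": {1937: 5},
}
print(combine_variants(mentions, ["Keynes", "keynes", "KEYNES"]))
# {1936: 52, 1937: 75}
```

Adding works for mention counts. It overstates book counts, because a book that uses two spellings would be counted twice.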
For example, one can document the rise and fall of Marxism.
This first chart shows trends for all works published in English. Note that the counts are percentages: for example, Adam Smith accounted for about 0.0003% of the two-word phrases in books published in 1895. So any decline in Adam Smith or Karl Marx references could indicate either a decrease in the absolute number of references, or a decrease in the relative number as more other words are published. To make the graphs easier to read, the numbers are smoothed, so these are three-year average percentages.
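The difference between absolute and relative counts is easy to see with a little arithmetic. In this sketch the function name and all the counts are hypothetical, chosen only so that the first case reproduces the 0.0003% figure.

```python
def relative_frequency_percent(phrase_count, total_count):
    # Share of all two-word phrases in a year, expressed as a percentage.
    return 100.0 * phrase_count / total_count

# Hypothetical counts: mentions double, but the total quadruples.
early = relative_frequency_percent(300, 100_000_000)
later = relative_frequency_percent(600, 400_000_000)
print(early, later)
```

Here the phrase is used twice as often in the later year, yet its share of everything published falls by half.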
It might be objected that "Karl Marx" is not the right term to search - many people simply use the word "Marx" or "Marxism". Unfortunately the single word Marx will also pick up Groucho, Harpo and other Marxes. Yet a search of variants on Marx reveals that the relative use of related terms such as Marxism also peaked in the late 70s/early 80s and has decreased substantially since then.
It's not possible to download the data behind these graphs. However, the raw data files containing the results for all possible one-grams, two-grams, and three-grams are downloadable - it just takes a day of downloading and unzipping, plus an external hard drive to store half a terabyte or so of unzipped data.
But a couple of days ago I had a visit from the Good Techie Fairy, who set me up with scripts to automate the downloading and unzipping process. Because only the "British English" ngram files are complete, and I have no idea what biases could be introduced by the incompleteness of other files, I only downloaded the British English files - the ngrams generated by a search of all books published in Britain, by either British or non-British authors.
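The unzipping half of that job might look something like the sketch below. It is not the Good Techie Fairy's script: the function name and folder layout are assumptions, and the download step is left out because the file addresses aren't given here. It simply extracts every .zip archive found in one folder into another.

```python
import zipfile
from pathlib import Path

def unzip_all(archive_dir, output_dir):
    """Extract every .zip file in archive_dir into output_dir."""
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(archive_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(output)
            extracted.extend(zf.namelist())
    # Names of all files that were unpacked, in archive order.
    return extracted
```

With half a terabyte of output, `output_dir` would point at the external hard drive.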
An inspection of the ngram files revealed that the number of times the words "Karl Marx" or "Adam Smith" are used was much more variable than the number of books containing a reference to Karl Marx or Adam Smith. The publication of a biography, for example, causes a large spike in the number of Smith or Marx mentions. To minimize such effects, I chose to analyze trends in the number of books containing Marx or Smith references, as opposed to the number of mentions.
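A minimal sketch of pulling both numbers out of a raw file follows. It assumes each line holds five tab-separated fields (ngram, year, number of mentions, number of pages, number of books), which is how I understand the raw files to be laid out; check a real file before relying on it. The sample rows are made up.

```python
import csv
import io

def yearly_counts(lines, phrase):
    """Return {year: (mentions, books)} for one exact, case-sensitive phrase."""
    counts = {}
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 5 or row[0] != phrase:
            continue  # skip other phrases and malformed lines
        _, year, mentions, _pages, books = row
        counts[int(year)] = (int(mentions), int(books))
    return counts

# Made-up rows in the assumed layout, for illustration only.
sample = io.StringIO(
    "Karl Marx\t1975\t120\t95\t40\n"
    "Karl Marx\t1976\t480\t300\t42\n"
    "Adam Smith\t1976\t90\t80\t35\n"
)
marx = yearly_counts(sample, "Karl Marx")
print(marx)  # {1975: (120, 40), 1976: (480, 42)}
```

In the made-up rows, mentions quadruple from one year to the next while the number of books barely moves, which is the kind of spike a single biography could produce.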
The next table shows the absolute number of books published in Great Britain containing a reference to either Karl Marx or Adam Smith. To make the data easier to read, the numbers reflect three-year moving averages.
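For anyone wanting to reproduce the smoothing, here is one way to compute a three-year moving average. The exact window isn't spelled out above, so this sketch assumes a centred one (the year itself plus the year on either side), with the first and last years averaged over whatever neighbours exist. The book counts are made up.

```python
def moving_average(series, window=3):
    """Centred moving average over a {year: value} dictionary."""
    half = window // 2
    smoothed = {}
    for year in sorted(series):
        neighbours = [series[y] for y in range(year - half, year + half + 1)
                      if y in series]
        smoothed[year] = sum(neighbours) / len(neighbours)
    return smoothed

# Made-up book counts, for illustration only.
books = {1975: 40, 1976: 42, 1977: 50}
print(moving_average(books))  # {1975: 41.0, 1976: 44.0, 1977: 46.0}
```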
This more careful analysis reveals that, because the absolute number of books published has increased rapidly in recent years, the absolute number of Marx/Smith references is very different from the relative frequency of references. Whether because of a real increase in the number of books published, or because of some artifact of Google's data collection methodology, the absolute number of works containing a reference to Adam Smith or Karl Marx has increased substantially in the last decade or so.
So let's put Smith and Marx in a head-to-head battle - again, this data is a three-year moving average of the number of books published in Britain:
This pattern looks more like the earlier ones, with books referring to Karl Marx peaking in the late 70s/early 80s. Yet it does not show the same pronounced decline - perhaps this is because I have used British English data, and some explorations with the ngram viewer reveal that the relative frequency of Marx is greater in British English than American English.
So what's the point?
This post is, for me, a research exercise - I'm learning how to write scripts, and trying to figure out the strengths and limitations of the ngram database. Karl Marx and Adam Smith are good phrases to play with because they're two terms that search nicely.
As an analysis of trends in academic scholarship, the approach taken here has limitations. It only surveys books, but most academics, especially most academic economists, publish the bulk of their research in academic journals. Hence it is an imperfect measure of recent intellectual trends, but better as a measure of the evolution of thought in, say, the earlier part of the last century, when books were more important publication outlets.
Moreover, there is a certain amount of noise in the data - sometimes publication dates are inaccurate, and every different edition of the Wealth of Nations counts as a different book, which might or might not be appropriate.
What the analysis does show is how closely interest in Karl Marx has shadowed Communism - growing rapidly during the years leading up to the Communist revolution in Russia, and then continuing to rise during the Depression era of the 1930s. After 1980, as the strains on the Soviet system began to grow, references to Karl Marx began to level off and then, it appears, decline in relative terms.
Unfortunately, ngram counts are devoid of content. The books referring to Karl Marx could be volumes on the history of economic thought or ones on the history of Soviet Russia. Or they could be modern-day reprints of Das Kapital.
In some ways, ngrams are more useful for showing how the use of language has evolved over time. One can document, for example, the decline of the semi-colon (in green), the rise of the question mark (in red) and the stability of exclamation marks (in blue).
I suspect that an intelligent use of the ngram database could reveal a great deal about the history of economic thought, especially the relative importance of technical terms and phrases. But I don't have any idea where to begin.
Any suggestions?
I think Marx was a good place to begin. Before going below the fold, I made a guess as to when Marx would peak, and your results matched up fairly well with my priors. It's a way to check if the results make sense.
Suggestions: does e.g. "Keynes" and "Friedman" match up with recessions and inflation?
I wonder about the "half-life" of economic (and other) fads.
oh, "Marginal"! "Utility". Do they both climb rapidly after 1871?
Posted by: Nick Rowe | January 02, 2011 at 01:14 PM
It seems to me that you would first have to justify the assumption that each book carries the same weight. Some books sell 5 million copies, others 1 million, and some barely 500,000 or fewer. One could assume that if a term is in a popular book, more books using that term will be published, so that the number of books is merely a proxy for how popular a particular book/idea was. You will still have a substantial lag then. I'd guess that the lag would decrease over time as the time between writing and publishing has decreased.
Posted by: Martin | January 02, 2011 at 03:03 PM
Here is a minor but interesting example of three more or less interchangeable terms succeeding one another. Whether it also conveys some information about changing credit conditions, is harder to say.
Posted by: Lemuel Pitkin | January 02, 2011 at 04:57 PM
Lemuel, that is very fun, I like it.
Martin, because of multiple editions, some books will have more weight than others but, yes, your point is a good one.
I'm not sure of the best response. Traditional intellectual history has been a study of great thinkers, of great books. Yet this in some sense assumes the diffusion of ideas. Counting mentions - the number of times freedom or justice or choice or utility or marginal value is mentioned - does give a measure of the currency or diffusion of those ideas.
Let me give you an example. The term 'international' is generally attributed to Jeremy Bentham, who coined the term in his writings on international law in the late 18th century. That's traditional intellectual history. What that doesn't answer is how and when use of the term took off. The ngram viewer does that: ngram of 'international'. Now that doesn't tell you the context - is this international law, international society, international relations, etc? But it does chart the growing importance of that concept, and that's something worth knowing.
Nick, here are the Keynes/Friedman results. I understand there's a little town in England called Milton Keynes, though, and that might skew things.
These are the results for Keynes/Friedman with American English only - much less Keynesian influence.
Posted by: Frances Woolley | January 02, 2011 at 05:51 PM
Wow! Milton Keynes was founded in 1967, so that might explain the sudden rise in Keynes around 1970 in British English, which isn't there in American. The little spike in Keynes around 1910 was presumably J.N.Keynes, the father.
Posted by: Nick Rowe | January 02, 2011 at 06:27 PM
Via email from Brett Reynolds:
Your blog said it couldn't accept my comment, so here it is. Perhaps
you could post this in the comments or do something with it.
Tools like the Ngram viewer are often used in corpus linguistics and
natural language processing, which would make those fields natural
places to look for tools and approaches.
Good corpora are representative of a particular domain, and in that
respect, Google's book corpus is outstanding. But size isn't
everything. Google's corpus, as accessed through the n-gram viewer, is
missing much of the meta data that makes deeper investigation possible
(and the meta data for Google books is notoriously sloppy, though they've
made large improvements.) Similarly, its character recognition
technology is very good, but the long s causes quite a problem in the
data between the mid 17th century and the early 18th.
Syntactic parsing and tagging makes searches easier. You might, for
example want 'can' as a lexical verb (e.g. can the tomatoes), not as a
modal verb (e.g., yes, I can) or as a noun. It would also be nice to be
able to group words by lemmas so that when I search for 'can' I also
get 'cans' 'canned' and 'canning'. It would also let you look for
strings like "Marx [verb] the revolution".
Semantic tagging, so that you can search for particular word senses, is
also useful.
Non-contiguous collocation is another useful property to look at, but
not available with the Ngram viewer.
And, finally (well not really, but I'm going to stop with the
criticisms here), you should be able to drill down and see the actual
output in context, which isn't possible here.
A much better (but still free) interface is that offered by Mark Davies
at Brigham Young here: http://corpus.byu.edu/
Unfortunately, Mark's corpora are far smaller and limited to what he
could collect for free (except the BNC, which was assembled elsewhere),
both of which limit their ability to be representative.
Of course, the "culturomics" paper in Science would be a good place to
look at approaches
http://www.sciencemag.org/content/early/2010/12/15/science.1199644
For more on what constitutes a good corpus, see
http://www.ahds.ac.uk/creating/guides/linguistic-corpora/index.htm
For some more thoughts from the folks at Language Log, see Ben Zimmer's
post here: http://languagelog.ldc.upenn.edu/nll/?p=2859 and the posts
by Geoff Nunberg and Mark Liberman that he links to.
Jean Véronis illustrates some of the problems with the corpus and its
interface being proprietary to Google here (though this is from 2005,
this issue is, I think, still relevant):
http://blog.veronis.fr/2005/03/google-snapshot-of-update.html
Adam Kilgarriff and the folks at Lexicography Master Class
http://www.lexmasterclass.com/ make a business of teaching people how
to do this kind of analysis.
Those are my suggestions. I hope they help.
-----------------------
Brett Reynolds
Professor of English for Academic Purposes
English Language Centre
Humber Institute of Technology and Advanced Learning
Posted by: Frances Woolley | January 02, 2011 at 07:25 PM
Brett: sometimes it won't let me comment either. I copy my comment, sign out, sign back in again, then paste, add anything, delete it again, and post.
Posted by: Nick Rowe | January 02, 2011 at 08:00 PM
Capitalism v communism might be a better proxy for comparing Smith's ideas with Marx's
Posted by: Ian Lippert | January 02, 2011 at 08:52 PM
Ian, I don't know. In British English, capitalism peaked in around 1990, whereas in American English, relative appearances of capitalism peaked in the 1930s - perhaps in the context of 'the failure of capitalism'?
Posted by: Frances Woolley | January 02, 2011 at 10:26 PM
Ian, though those do show an earlier decline than Karl Marx, with the tailing off in relative frequency of references starting in the 1960s - but only in the US. In British English the trend is different - then again, capital C Communism in British English shows a trend much more like the American trend. Actually this is a good example of how case sensitive the Ngram viewer is - it makes a big difference whether one does C or c.
Posted by: Frances Woolley | January 02, 2011 at 10:31 PM
Brett (and others to whom this may occur): My apologies. I've not set any a priori controls on comments; I clean up spam ex post. But for some reason, typepad seems to have a will of its own in these matters.
Posted by: Stephen Gordon | January 02, 2011 at 10:32 PM
Small correction: the long 's' issue runs until the early 19th century, not the early 18th as I wrote.
Posted by: Brett | January 03, 2011 at 03:33 AM
One more thing: you can drill down to the actual data to a certain extent by clicking the links at the bottom, but Google decides which links you see first.
Posted by: Brett | January 03, 2011 at 03:34 AM
Only communists talk about "capitalism"; capitalists talk about a market economy.
Only capitalists talk about "communism"; communists talk about a socialist economy.
Gotta be careful how you interpret those trends.
Posted by: Nick Rowe | January 03, 2011 at 05:24 AM
Nick - yes and no, market economy is a term that didn't really take off in a big way until the 1990s: see this ngram for market economy, socialist economy and free market.
Posted by: Frances Woolley | January 03, 2011 at 10:45 AM
I found the comparison of past, present, and future rather interesting around 1800. The future doesn't change but the present and past are now seen as different and worth commenting on.
Posted by: Lord | January 03, 2011 at 11:15 AM
I see it is the long s causing the problem. Including paft and prefent rectifies them.
Posted by: Lord | January 03, 2011 at 11:22 AM
My favourite one is when you cross "ISLM" and "DSGE"
Posted by: citoyen | January 03, 2011 at 07:03 PM
citoyen, yes, that one is excellent, I couldn't see what you meant until I changed the time frame for analysis to 2008, picking up the intersection in the use of the two terms in around 2004.
Posted by: Frances Woolley | January 03, 2011 at 07:49 PM
In re ISLM and DSGE. Yes, but RBC way dominates both of them all the way back to 1940! I wonder what it meant for 1981.
Posted by: marcel | January 09, 2011 at 11:19 AM