My latest time-wasting tech toy is Google Ngram Viewer. What's an ngram? A one-word phrase, like "Marx", is a one-gram. "Karl Marx" is a two-gram. "K. Marx" is a three-gram (the period counts as a gram too). A phrase of any length n is an ngram.
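To make the counting rule concrete, here is a minimal sketch of splitting text into grams, with each punctuation mark treated as its own gram. The `tokenize` and `ngrams` functions are made up for illustration; Google's real tokenizer is surely more elaborate than this one regular expression.

```python
import re

def tokenize(text):
    # Assumption: a run of letters, digits or apostrophes is one gram,
    # and every other non-space character (e.g. a period) is its own gram.
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)

def ngrams(text, n):
    # Slide a window of n grams across the tokenized text.
    tokens = tokenize(text)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(tokenize("K. Marx"))     # ['K', '.', 'Marx'] - three grams
print(ngrams("Karl Marx", 2))  # ['Karl Marx'] - one two-gram
```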
The Ngram viewer is based on Google Books. Google has scanned almost every book (including pamphlets, treatises etc) ever written. It has then counted how often every single word (including names, punctuation marks, etc) appears in every book. For example, in 1860, the word "international" was used 513 times in 162 books published in Britain that year. Google has also counted how often every possible two-word phrase was used, and every possible three-word phrase.
With a research tool like this I'm like a kid in a candy store - I'm just gazing around hardly knowing where to begin.
Because the Ngram viewer is case sensitive, it's a little hard to use. It also doesn't allow the use of logical operators in searches, so, for example, one can't search for J.M. Keynes | John Maynard Keynes (using the logical operator | to represent "or"). Yet even within those limitations, some exploratory analysis is possible.
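With per-year counts in hand for each spelling, the missing "or" can be approximated by adding the counts together. The sketch below assumes the counts have already been collected into one dictionary per variant; the function name and the numbers are invented for illustration. It only makes sense for mention counts: book counts can't simply be added, because one book may use both spellings.

```python
from collections import Counter

def combine_variants(per_variant_counts):
    # Sum per-year mention counts across different spellings of one name.
    combined = Counter()
    for counts in per_variant_counts.values():
        combined.update(counts)  # adds counts year by year
    return dict(combined)

# Invented numbers, purely to show the mechanics.
mentions = {
    "John Maynard Keynes": {1936: 40, 1937: 55},
    "J. M. Keynes": {1936: 25},
}
print(combine_variants(mentions))  # {1936: 65, 1937: 55}
```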
For example, one can document the rise and fall of Marxism.
This first chart shows trends for all works published in English. Note that the counts are percentages: for example, Adam Smith accounted for about 0.0003% of the two-word phrases in books published in 1895. So any decline in Adam Smith or Karl Marx references could indicate either a decrease in the absolute number of references, or a decrease in the relative number, as more other words are published. To make the graphs easier to read, the numbers are smoothed, so these are three-year average percentages.
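The arithmetic behind those two points can be sketched in a few lines. The numbers are made up, and the centered three-year window is my assumption about how the smoothing works, not a description of the viewer's internals. Note how the share falls even though the phrase count never changes, simply because more words are published.

```python
def relative_frequency(phrase_counts, total_counts):
    # Phrase count as a percentage of all phrases published that year.
    return {year: 100.0 * phrase_counts[year] / total_counts[year]
            for year in phrase_counts if total_counts.get(year)}

def three_year_average(series):
    # Assumption: a centered window (previous, current and next year).
    # Years without both neighbours are left out.
    smoothed = {}
    for year in sorted(series):
        window = [series[y] for y in (year - 1, year, year + 1) if y in series]
        if len(window) == 3:
            smoothed[year] = sum(window) / 3
    return smoothed

# Invented numbers: the phrase count is flat while total output doubles.
phrase = {1990: 300, 1991: 300, 1992: 300}
total = {1990: 100_000_000, 1991: 150_000_000, 1992: 200_000_000}

share = relative_frequency(phrase, total)
print(share)                      # falls each year, from about 0.0003% in 1990
print(three_year_average(share))  # only 1991 has both neighbours
```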
It might be objected that "Karl Marx" is not the right term to search for - many people simply use the word "Marx" or "Marxism". Unfortunately the single word Marx will also pick up Groucho, Harpo and other Marxes. Yet a search of variants on Marx reveals that the relative use of related terms such as Marxism also peaked in the late 70s/early 80s and has decreased substantially since then.
It's not possible to download the data behind these graphs. However, the raw data files containing the results for all possible one-grams, two-grams, and three-grams are downloadable - it just takes a day of downloading and unzipping, plus an external hard drive to store half a terabyte or so of unzipped data.
But a couple of days ago I had a visit from the Good Techie Fairy, who set me up with scripts to automate the downloading and unzipping process. Because only the "British English" ngram files are complete, and I have no idea what biases could be introduced by the incompleteness of other files, I only downloaded the British English files - the ngrams generated by a search of all books published in Britain, by either British or non-British authors.
An inspection of the ngram files revealed that the number of times the words "Karl Marx" or "Adam Smith" are used was much more variable than the number of books containing a reference to Karl Marx or Adam Smith. The publication of a biography, for example, causes a large spike in the number of Smith or Marx mentions. To minimize such effects, I chose to analyze trends in the number of books containing Marx or Smith references, as opposed to the number of mentions.
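Pulling both numbers out of the raw files might look something like the sketch below. It assumes each line is tab-separated in the order ngram, year, number of mentions, number of pages, number of books; check that layout against the files you actually download, since it is my assumption here. The function name and the sample rows are invented for illustration.

```python
import io

def counts_by_year(lines, phrase):
    # Return {year: (mentions, books)} for an exact, case-sensitive match.
    result = {}
    for line in lines:
        # Assumed column order: ngram, year, mentions, pages, books.
        ngram, year, mentions, _pages, books = line.rstrip("\n").split("\t")
        if ngram == phrase:
            result[int(year)] = (int(mentions), int(books))
    return result

# Invented rows: mentions jump in 1976 while the book count barely moves.
sample = io.StringIO(
    "Karl Marx\t1975\t900\t700\t120\n"
    "Karl Marx\t1976\t2400\t1500\t125\n"
    "karl marx\t1976\t3\t3\t2\n"
)
print(counts_by_year(sample, "Karl Marx"))
# {1975: (900, 120), 1976: (2400, 125)}
```

The lowercase row is skipped because the match is case sensitive, just as in the viewer.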
The next table shows the absolute number of books published in Great Britain containing a reference to either Karl Marx or Adam Smith. To make the data easier to read, the numbers reflect three-year moving averages.
This more careful analysis reveals that, because the absolute number of books published has increased rapidly in recent years, the absolute number of Marx/Smith references is very different from the relative frequency of references. Whether because of a real increase in the number of books published, or because of some artifact of Google's data collection methodology, the absolute number of works containing a reference to Adam Smith or Karl Marx has increased substantially in the last decade or so.
So let's put Smith and Marx in a head-to-head battle - again, this data is a three-year moving average of the number of books published in Britain:
This pattern looks more like the earlier ones, with books referring to Karl Marx peaking in the late 70s/early 80s. Yet it does not show the same pronounced decline - perhaps this is because I have used British English data, and some explorations with the ngram viewer reveal that the relative frequency of Marx is greater in British English than in American English.
So what's the point?
This post is, for me, a research exercise - I'm learning how to write scripts, and trying to figure out the strengths and limitations of the ngram database. Karl Marx and Adam Smith are good phrases to play with because they're two terms that search nicely.
As an analysis of trends in academic scholarship, the approach taken here has limitations. It only surveys books, but most academics, especially most academic economists, publish the bulk of their research in academic journals. Hence it is an imperfect measure of recent intellectual trends, but better as a measure of the evolution of thought in, say, the earlier part of the last century, when books were more important publication outlets.
Moreover, there is a certain amount of noise in the data - sometimes publication dates are inaccurate, and every different edition of the Wealth of Nations counts as a different book, which might or might not be appropriate.
What the analysis does show is how closely interest in Karl Marx has shadowed Communism - growing rapidly during the years leading up to the Communist revolution in Russia, and then continuing to rise during the Depression era of the 1930s. After 1980, as the strains on the Soviet system began to grow, references to Karl Marx began to level off and then, it appears, decline in relative terms.
Unfortunately, ngram counts are devoid of content. The books referring to Karl Marx could be volumes on the history of economic thought or ones on the history of Soviet Russia. Or they could be modern-day reprints of Das Kapital.
In some ways, ngrams are more useful for showing how the use of language has evolved over time. One can document, for example, the decline of the semi-colon (in green), the rise of the question mark (in red) and the stability of exclamation marks (in blue).
I suspect that an intelligent use of the ngram database could reveal a great deal about the history of economic thought, especially the relative importance of technical terms and phrases. But I don't have any idea where to begin.