Yesterday I posted a denunciation of Google’s new Ngram Viewer as an example of what Marx called “socially unnecessary labor time”–work that takes skill and craft and time but that nobody wants or needs. Lots of people I respect think more of Google’s Ngram Viewer than I do.  Friends who don’t follow the world of digital computing in the humanities thought my post on Ngram Viewer was spot on. Hmmm.

Ngram Viewer represents possibilities–that some day we could have a whole set of interesting and flexible new tools for searching large–really large–bodies of text. So why did they present this possibility in such an absurd way?

different spelling

It occurs to me that one of the reasons Ngram Viewer is so irritating to people like me grows out of the way Google treats text. If you already know what an “ngram” is, you might not want to read any further. If you don’t, you’ll find this both extremely interesting and possibly appalling.

If you enter a search term in Google–let’s say “Bush tax cuts”–Google does not search for “Bush” or “tax” or “cuts.” It searches instead for 1 to 5 character patterns, including the spaces as characters. So it might be searching for “Bus” and “h t” and “ax ” or “sh ta”. These are the “ngrams” of the title.
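
To make the term concrete, here is a minimal sketch of what character n-grams of a phrase look like when spaces count as characters. It only illustrates the idea as the paragraph above uses it; the function name `char_ngrams` is made up for this example, and nothing here is taken from how Google's search is actually implemented.

```python
def char_ngrams(text, n):
    """Return every run of n consecutive characters, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


phrase = "Bush tax cuts"

# List the 1- to 5-character patterns contained in the phrase.
for n in range(1, 6):
    print(n, char_ngrams(phrase, n))
```

The 3-character list includes “Bus”, “h t” and “ax ”, and the 5-character list includes “sh ta”, the fragments mentioned above.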

Google ignores morphology: it ignores the meanings of words themselves when it searches. It’s not searching in English–you are, but it’s not. It’s searching for patterns of characters. Say you search for a good Chinese takeout place: Google doesn’t search for “chinese food”; it searches for all possible “ngrams” which can be made out of the phrase “chinese food.”

Why does it do this? Because “meaning” as we know it gets in the way. You search for Chinese food, but the phrase drags along all its other possible meanings and contexts–what counts as authentic, what counts as food and not culture. It’s easier to just break it up into patterns that humans don’t recognize. That way you expose patterns that people aren’t really aware they use: you bypass the pitfalls of language, its ambiguity, its fuzziness, its tendency to depend on context for meaning. Google gives humans what they want by ignoring what they mean. If you don’t find that slightly troubling, or at least interesting, then you aren’t paying attention.

Ngram Viewer reflects this disinterest in meaning. It disambiguates words, takes them entirely out of context and completely ignores their meaning, which is why Patricia Cohen can think that the increased frequency of the word “women” equals the rise of feminism. It basically tells us how often an Ngram appears, not how often a word appears. The word is just surface, like sheet metal on a car.

In that sense, it shows how the people who worked on Ngram Viewer are completely captured by the technology they work with. They aren’t interested in meaning; they are interested in command of pattern. So they produced something that’s offensive to the practice of history, which depends on the meaning of words in historic context.

But then of course, what is reading but pattern recognition? It might seem alarming to think of the English of Shakespeare rendered into the Ngrams of Google, but look how easy it is to find good Chinese takeout!

[1. In the meantime, Ngram Viewer is giving us this sort of thing. It’s entirely a map of the present, in which the person drawing up the word pairs wants to have what he already knows confirmed. That “latte” passed “lager” in frequency only reinforces what the person knew when he or she asked the questions: it’s set up to avoid learning anything new.]


  • Here’s my theory on the point of this: it’s not the ngram viewer, it’s the data set.

    Google, due to their book scanning project, has an absolutely wonderful data set that many researchers would kill to have access to. Unfortunately, due to copyright issues, Google can’t release much of this. So, presented with this problem, perhaps someone asked the question, “well, what can we release?”

    The n-gram frequency-by-year data set is one answer to this. It’s not clear how useful this database will be, but it’s not terribly hard (for someone with the computing resources of Google) to produce, and so it’s probably worth putting out there to see what people might come up with.

    But to broadly disseminate the knowledge that this database is available, it needs some advertising. A stupidly simple site that simply displays graphs of frequencies by year from the database is cheap and effective. Job done.

    Now it’s up to the real scholars to download the database and find a real use for it, if there is one.
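
For readers wondering what an “n-gram frequency-by-year” table amounts to, here is a rough sketch. The tiny corpus, the function names, and the table layout are all invented for illustration; the format of the real data set is not described in this thread, so treat this as an assumption-laden toy, not a reader for the actual files.

```python
from collections import Counter, defaultdict


def word_ngrams(text, n):
    """Split on whitespace and return each run of n consecutive words."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def counts_by_year(corpus, n):
    """Map each year to a Counter of the n-grams seen in that year's texts."""
    table = defaultdict(Counter)
    for year, text in corpus:
        table[year].update(word_ngrams(text, n))
    return table


# Hypothetical (year, text) pairs standing in for scanned books.
toy_corpus = [
    (1850, "my word what a day"),
    (1850, "upon my word"),
    (1990, "oh my god what a day"),
]

table = counts_by_year(toy_corpus, 2)
print(table[1850]["my word"])  # 2
print(table[1990]["my word"])  # 0
```

A graph like the ones the viewer draws is then just these counts plotted per year; note that the table records that a phrase occurred, not what it meant where it occurred.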

  • Yes, agreed–the really valuable thing is the dataset. And I know this is just a teaser, but it’s a teaser that does actual harm to history as an intellectual activity. They wanted to get attention, and they did, but they should not complain if the attention involves critique.

  • I’m pretty sure your comment about “bus” “h t” is incorrect. I believe these are *word* n-grams, not character n-grams. And that makes sense; if searching for ‘bush tax cuts’ showed us texts containing the word ‘bus’, then the results would be _completely_ meaningless.
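
The distinction the comment above draws can be sketched in a few lines. This is only an illustration of word n-grams in general, with a hypothetical helper named `word_ngrams` and simple whitespace splitting assumed; it is not a description of Google's tokenizer.

```python
def word_ngrams(text, n):
    """Return each run of n consecutive whitespace-separated words as a tuple."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


query = "Bush tax cuts"

print(word_ngrams(query, 1))  # [('bush',), ('tax',), ('cuts',)]
print(word_ngrams(query, 2))  # [('bush', 'tax'), ('tax', 'cuts')]

# With words as the unit, the fragment "bus" never appears as an n-gram.
print(("bus",) in word_ngrams(query, 1))  # False
```

With whole words as the unit, a text about buses is not a match for “bush”, which is the commenter's point.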

  • More to the heart of the point here:

    I think you’ve completely missed the point of both Google’s search engine and also what Google is providing with the ngram viewer.

    The reason that Google has completely destroyed their competition in the search space is mostly because they didn’t listen to criticism of this variety: the pair of ideas that (A) people know exactly what they mean, and (B) computers must divine that meaning in order to be useful. This was very much the prevailing view of internet search when AltaVista and Yahoo were vying for control of the market. At that point, the central conflict between these two products, respectively, was “should we try to understand your search, or should we have humans pre-understand your search and curate a gallery of results?”.

    The way language usually works is that I say some words, then you read the words and thereby interpret the meaning. Neither the air which vibrates with the sound of those words when I say them, nor ink that I might write them with, needs to understand the words in order to effectively convey them.

    One of the interesting insights which Google’s search has provided us is that tools which deal with text in aggregate don’t need to understand its meaning any more than tools that deal with it individually. You project some meaning onto some words, and you can search for them just as you might read them.

    I can see how you might be concerned about users of the ngram tool misinterpreting the results they get, but this is hardly unique to some new technology. People misinterpret Shakespeare all the time, and most modern readers can’t even decipher Chaucer. One might argue that all modern readers *ever* do is misinterpret the Bible :).

    The ngram tool isn’t perfect, but it does provide an interesting tool which, if used with proper care and accounting for user error, may allow for some interesting statistical analysis of the evolution of the zeitgeist. It’s got significant limitations, of course, but what tool doesn’t? Historical research will remain a substantial undertaking for the foreseeable future.

    It’s up to folks like you who know how historical interpretation of language works to provide meaningful searches with multiple possible ngrams to account for shifts in language when one is examining meaning. But what if one is *only* interested in the language, and not the meaning? I’m sure this is a more immediately rich resource for linguists than historians. Let’s not discount them.

  • I actually don’t disagree with you about the way language works–I just think that the way ngrams work made them produce a bad first example of what could be done.
    I’d be delighted to have linguists show me what it can do, but the problem is it’s framed explicitly as a “history” tool, and it’s terrible as a tool for historians, because it’s indifferent to the context the word appears in.

  • It isn’t indifferent to the context, though. You can have up to a 5-gram of words; that’s two words of context on either side.

    You might argue that that isn’t *enough* context, but it’s not *no* context :).

    I do agree that using this to search for individual words, unless they are highly rarified technical words, is probably not too useful. But that’s the point; it’s not searching for individual words, it’s searching for fixed phraseology.

    This is a trivial and possibly silly example, but consider the comparison of these trends:,oh+my+goodness,oh+my+god,oh+shit&year_start=1800&year_end=2000&corpus=0&smoothing=3

    These results did surprise me a little (and surprise is a key indicator that a research tool is doing something useful), and made me think that maybe the phrase ‘oh my word’ is an invention of period fiction rather than something people actually said a lot in Victorian England.

  • Matthew Fitzgibbons wrote:

    Interestingly, your example shows exactly why the ngram viewer is a poor historical tool.

    This shows that while ‘oh my word’ was not commonly used by Victorians, ‘my word’ absolutely was. Using ‘oh’ is really just a very poor proxy for teasing out the context you’re interested in (i.e. the exclamation).

    Looking at the data in this way makes it very easy to draw entirely incorrect conclusions.

  • Well, I’ve no complaints about the critique; I’m just trying to inject a bit of balance here.

    But yes, history is used by a few as an insightful tool of analysis, but the popular use without question is to justify ideological points. Sadly, the same is true of most any academic study that touches on daily life, as far as I can see.
