Who Lusts for Certainty Lusts for Lies

We need to talk about the Google Ngram Viewer n-grams. They are wrong.

Here's the Ngram's idea of the frequency of the word said:


This is a companion discussion topic for the original entry at https://www.etymonline.com/columns/post/who-lusts-for-certainty-lusts-for-lies

If ngrams were compiled as accurately as possible, I still wouldn’t understand the benefit of them very well.

Comparing graphs of annual sales of potatoes and zucchini can be useful to a farmer considering what to grow next year, a vegetable distributor drawing up a budget, an economist analyzing whatever it is economists analyze, and so on.

Do publishers still order many carloads of “is” each year during spring thaw (along with “lawyer”, “politician”, “scandal”, and “dishonest” at twenty thousand pieces each, several dozen "egg"s, etc.), start to run low by about November, and blame it on a lack of creativity among authors? Do they have to reject manuscripts that call for words like “ywis” or “gnarly” because the big suppliers have stopped making those? Can writers improve their writing (or their income) by inserting words that are trending on ngram? Are economists interested in saying anything about writing, and could ngrams help them?

It seems to me that if I wanted to really learn about the development, spread, and evolution of words, I’d need something much more detailed (or granular or whatever the right word is) than ngrams, for it to become useful.

When I enter the query dog, dogs, toast, toasts I get four different plots, so it seems that Ngram Viewer does know the difference between singular and plural forms. It even distinguishes dogs as a noun from dogs as a verb, and likewise toasts, as the query dogs_NOUN,dogs_VERB,toasts_VERB,toasts_NOUN shows.
To avoid the problem with ‘said’ that you attribute to the concentration of scientific works in the corpus, I chose English Fiction (2019). With that corpus, the query said, say, says gives steadily rising curves that look quite reasonable.
The pre-1820 problem is important to etymologists but probably not to most users of Ngram Viewer. It arises from a limitation of optical character recognition, a well-established technology that has been around for at least 50 years; I wouldn’t call it AI in any sense of the word. The classification of the output of the OCR scanners is AI, but garbage in, garbage out, you know.

As you say, and I might have added, they can be useful, if you restrict your search and know the limitations. In newspapers dot com, for instance, I use them often to plot differences in British and American use of a word or phrase in the same decade. There, the data set is more complete, but of course the parameters are tighter: not “this word was used,” but “this word was printed in newspapers.” But it can get you somewhere in understanding.

My dislike is in displaying on the site the crudest-possible n-grams, almost always incorrect. The global internet already prefers a graph to a paragraph, and thinks a fact-shaped answer given by computer calculation must be truth.

I love this description because it’s both cute and dead-on accurate.

I find myself confused by Ngrams about as often as I see them. It’s good to start putting some of that anxiety to rest.

Also, I take ‘eclipse’ to mean ‘intersect’, yet I couldn’t find a view in which “said” and “Jimmy Carter” do so. Perhaps I am misunderstanding.

It’s possible that “eclipse” refers to item B blocking our view of item A.

My take-away:

N-Grams are of some benefit to some researchers for comparing word use over time in a limited dataset. The information from which N-Grams are calculated is limited mostly to older and academic texts, because these works are free to access or in the public domain. Including more modern works would necessitate scraping data in either ethical or unethical ways, which might be expensive or legally dangerous for a research institution.

All of those example n-gram graphs are fooling the eye. The lowest point the line reaches is always the place where Google cuts off the graph, but zero is still far below the line. E.g. the graph for “toast” starts at 2.5 and bottoms out at 1.5; that is actually a 40% drop in usage! That cannot possibly be characterized as “almost disappearing.” But the graph fooled the author into thinking that is what was being shown.
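For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The values 2.5 and 1.5 are the axis readings quoted above, in whatever units the graph uses; the function names are mine, invented for illustration, and have nothing to do with any Google tool. The second function shows why a cropped axis exaggerates: it measures the drop as a share of the drawn height instead of as a share of the starting value.

```python
def relative_drop(start: float, end: float) -> float:
    """Fractional decrease from start to end, measured against start."""
    if start <= 0:
        raise ValueError("start must be positive")
    return (start - end) / start


def apparent_drop(start: float, end: float, axis_floor: float) -> float:
    """Fraction of the drawn height lost when the y-axis begins at axis_floor.

    Hypothetical helper: it models only how the eye reads a cropped graph.
    """
    if start <= axis_floor:
        raise ValueError("start must lie above the axis floor")
    return (start - end) / (start - axis_floor)


# Readings quoted in the post above: the line starts at 2.5 and bottoms out at 1.5.
print(f"real drop: {relative_drop(2.5, 1.5):.0%}")                      # 40%
print(f"drop as drawn, axis from 1.5: {apparent_drop(2.5, 1.5, 1.5):.0%}")  # 100%
print(f"drop as drawn, axis from 0: {apparent_drop(2.5, 1.5, 0.0):.0%}")    # 40%
```

With the axis cut off at the line's lowest point, a 40% decline is drawn as if the word had fallen all the way to the floor.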

Well, there are two dips, then: the graph and the author. But zero or not, the first one isn’t real. And, as you perhaps suggest, that compounds the misleading impression.

In my opinion, graphs that begin at anything other than zero (of the quantity being measured, not the other axis) are inherently misleading.

In the case of ngrams, learning “This graph starts at the year 1940” is useful and unremarkable, but learning “This graph begins at 1.5 usage” is a weird unfixable surprise.

What I generally use Ngrams for is to get a sense of when a word entered the language. This is particularly true if I’m watching or reading something in an historical setting. If a character in the 1920s uses the word copacetic, and I bump on it, I’ll check Ngrams to see if my instinct is good. For this purpose, the graph always starts at zero, which addresses your other observation.

(I once heard of an author who, writing a novel placed in the English Regency, set up a custom spelling dictionary using the text of Jane Austen’s novels. You’d still have to be careful about usage, but I thought that clever.)
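The idea is simple enough to sketch in a few lines of Python. This is only my guess at how such a check could work, not the author's actual setup: the function names are hypothetical, the tokenizer is deliberately crude (lowercase ASCII letters and straight apostrophes only), and a single Austen sentence stands in for the full text of the novels.

```python
import re

# Crude tokenizer: runs of ASCII letters, optionally joined by straight apostrophes.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def build_vocabulary(texts):
    """Collect every distinct lowercase word found in the period texts."""
    vocab = set()
    for text in texts:
        vocab.update(WORD_RE.findall(text.lower()))
    return vocab


def unknown_words(draft, vocab):
    """Return words in the draft that are absent from the vocabulary, in first-seen order."""
    seen = set()
    flagged = []
    for word in WORD_RE.findall(draft.lower()):
        if word not in vocab and word not in seen:
            seen.add(word)
            flagged.append(word)
    return flagged


# Stand-in corpus: one sentence from Pride and Prejudice instead of whole novels.
period_sample = (
    "It is a truth universally acknowledged, that a single man in possession "
    "of a good fortune, must be in want of a wife."
)
vocab = build_vocabulary([period_sample])

print(unknown_words("A single man in want of a gnarly fortune.", vocab))
# ['gnarly']
```

As noted, this only catches words the period texts never used; it says nothing about a word that existed then but meant something different.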

Thank you - that use of ngrams (to see where the graph begins, for a relatively new word) certainly makes sense.

Copacetic: Vinegar that tastes (unaccountably, one hopes) of police officers? :face_with_peeking_eye: Or maybe it’s just a nickname for police officers’ favourite brand to go with their fish and chips? No? That’s OK, the real meaning is copacetic.

Yes!

I tend to find spelling dictionaries more irritating than helpful, partly because I often don’t want the defaults (I can “labour to analyze the colour of my licence”, so neither the US nor the UK defaults are mine), partly because so many applications have their own dictionary systems (re-re-setting the same things many times), and partly because they all seem to operate differently - and not always well. But despite all that, a custom dictionary for period-style writing (where all the writing probably takes place in the same application, or will get into it eventually) is something that could be truly useful when it’s prepared well.

I use it the same way, but I recently had an experience similar to the one Doug had with the misleading date. I wish I could remember which word it was, but the way ngram flopped was cute.

There was an anomalous blip before the word took off, and I was curious about it, so I searched for the year and date using multiple search engines. It turns out a writer had used the word just to describe a year, but Google, bless it, decided it must be a dictionary entry with a citation date, thus placing the word’s entry into the language a couple of centuries early.

This supports my theory that AI is useful for spotting anomalous data, drawing an adult’s attention to it, but really not much else.
