If ngrams were compiled as accurately as possible, I still wouldn't understand the benefit of them very well.
Comparing graphs of annual sales of potatoes and zucchini can be useful to a farmer considering what to grow next year, a vegetable distributor drawing up a budget, an economist analyzing whatever it is economists analyze, and so on.
Do publishers still order many carloads of "is" each year during spring thaw (along with "lawyer", "politician", "scandal", and "dishonest" at twenty thousand pieces each, several dozen "egg"s, etc.), start to run low by about November, and blame it on a lack of creativity among authors? Do they have to reject manuscripts that call for words like "ywis" or "gnarly" because the big suppliers have stopped making those? Can writers improve their writing (or their income) by inserting words that are trending on ngram? Are economists interested in saying anything about writing, and could ngrams help them?
It seems to me that if I wanted to really learn about the development, spread, and evolution of words, I'd need something much more detailed (or granular or whatever the right word is) than ngrams, for it to become useful.
When I enter the query dog, dogs, toast, toasts I get four different plots, so it seems that Ngram Viewer does know the difference between singular and plural forms. It even knows the difference between dogs as a noun versus a verb, and toasts as a noun versus a verb, as the query dogs_NOUN,dogs_VERB,toasts_VERB,toasts_NOUN shows.
To avoid the problem you reported with "said", which you attribute to the concentration of scientific works in the corpus, I chose English Fiction (2019). With that corpus, the query said, say, says gives steadily rising curves. These curves look quite reasonable.
The pre-1820 problem is important to etymologists but probably not to most users of Ngram Viewer. It arises from a limitation of optical character recognition, a well-established technology that has been around for at least 50 years; I wouldn't call it AI in any sense of the word. The classification of the output of the OCR scanners is AI, but garbage in, garbage out, you know.
As you say, and I might have added, they can be useful, if you restrict your search and know the limitations. In newspapers dot com, for instance, I use them often to plot differences in British and American use of a word or phrase in the same decade. There, the data set is more complete, but of course the parameters are tighter: not "this word was used," but "this word was printed in newspapers." But it can get you somewhere in understanding.
My dislike is in displaying on the site the crudest-possible n-grams, almost always incorrect. The global internet already prefers a graph to a paragraph, and thinks a fact-shaped answer given by computer calculation must be truth.
I find myself confused by Ngrams about as often as I see them. It's good to start putting some of that anxiety to rest.
Also, I take "eclipse" to mean "intersect", yet I couldn't find a view in which "said" and "Jimmy Carter" do so. Perhaps I am misunderstanding.
N-Grams are of some benefit to some researchers for comparing word use over time in a limited dataset. The information from which N-Grams are calculated is limited mostly to older and academic texts, because these works are free to access or in the public domain. Including more modern works would necessitate scraping data in either ethical or unethical ways, which might be expensive or legally dangerous for a research institution.
All of those example n-gram graphs are fooling the eye. The lowest the line goes is always the place Google cuts off the graph. But zero is still far below the line. E.g. the graph for "toast" starts at 2.5 and reaches bottom at 1.5; that is actually a 40% drop in usage! That cannot possibly be characterized as "almost disappearing." But the graph fooled the author into thinking that is what was being shown.
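To make that arithmetic concrete, here is a small sketch using only the two "toast" readings quoted above (2.5 falling to 1.5, in whatever units the graph's axis uses). The function names are made up for illustration, and the assumption that the y-axis is cut at the lowest plotted value is taken from the comment, not from any Ngram Viewer documentation.

```python
def relative_drop(start, end):
    """Fractional drop from start to end, measured against a zero baseline."""
    return (start - end) / start


def apparent_drop(start, end, axis_floor):
    """Fraction of the plotted height the line falls when the y-axis begins at axis_floor."""
    return (start - end) / (start - axis_floor)


# The "toast" readings quoted in the comment above (assumed, read off the graph).
start, end = 2.5, 1.5

print(f"actual drop: {relative_drop(start, end):.0%}")
print(f"apparent drop, axis cut at {end}: {apparent_drop(start, end, axis_floor=end):.0%}")
print(f"apparent drop, axis starting at zero: {apparent_drop(start, end, axis_floor=0.0):.0%}")
```

With the axis cut at the minimum, the line falls through the whole height of the plot even though usage fell by only two fifths; with the axis starting at zero, the visual drop and the real drop agree.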
Well, there's two dips, then, the graph and the author. But zero or not, the first one isn't real. And, as you perhaps suggest, compounds the misleading.
In my opinion, graphs that begin at anything other than zero (of the quantity being measured, not the other axis) are inherently misleading.
In the case of ngrams, learning "This graph starts at the year 1940" is useful and unremarkable, but learning "This graph begins at 1.5 usage" is a weird unfixable surprise.
What I generally use Ngrams for is to get a sense of when a word entered the language. This is particularly true if I'm watching or reading something in an historical setting. If a character in the 1920s uses the word copacetic, and I bump on it, I'll check Ngrams to see if my instinct is good. For this purpose, the graph always starts at zero, which addresses your other observation.
(I once heard of an author who, writing a novel placed in the English Regency, set up a custom spelling dictionary using the text of Jane Austen's novels. You'd still have to be careful about usage, but I thought that clever.)
Thank you - that use of ngrams (to see where the graph begins, for a relatively new word) certainly makes sense.
Copacetic: Vinegar that tastes (unaccountably, one hopes) of police officers? Or maybe it's just a nickname for police officers' favourite brand to go with their fish and chips? No? That's OK, the real meaning is copacetic.
I tend to find spelling dictionaries more irritating than helpful, partly because I often don't want the defaults (I can "labour to analyze the colour of my licence", so neither the US nor the UK defaults are mine), partly because so many applications have their own dictionary systems (re-re-setting the same things many times), and partly because they all seem to operate differently - and not always well. But despite all that, a custom dictionary for period-style writing (where all the writing probably takes place in the same application, or will get into it eventually) is something that could be truly useful when it's prepared well.
I use it the same way, but I recently had an experience similar to the one Doug had with the misleading date. I wish I could remember which word it was, but the way ngram flopped was cute.
There was an anomalous blip before the word took off, and I was curious about it, so I searched for the year and date using multiple search engines. It turns out a writer had used the word just to describe a year, but Google, bless it, decided it must be a dictionary entry with a citation date; thus placing the word's entry into the language a couple of centuries early.
This supports my theory that AI is useful for spotting anomalous data, drawing an adult's attention to it, but really not much else.
It has been said that "statistics doesn't prove, statistics suggests. If you want to prove you must use your brain."
Properly used, statistics (and the graphs it inevitably generates) can be a valuable, even irreplaceable tool to fathom so many mysterious aspects of the universe we live in. Unfortunately though, when misused, it can turn into a mighty stupidity enhancer. "Eskimo eat more fish and have fewer children, thus fish is a contraceptive" would make a five-year-old roll on the floor guffawing, but it wouldn't surprise me if some politician proposed such a "statistically proven fact" as a method to control world overpopulation.
Someone (I believe it was Mark Twain) summarized this nefarious tendency in a brilliant "the bed is the most dangerous place because over 90% of the deaths occur there."
That said, I still must acknowledge that those "n-grams" add a certain je ne sais quoi to the pages and look much better than mundane typographic dividers.