Embedded Ngrams graphs are probably unreliable, but could still be improved

I have… four observations regarding the Ngrams graph that’s shown on each etymology entry. Stipulating, up front, that (to quote the caption on each graph) “ngrams are probably unreliable”. Be that as it may, they must have been deemed better than nothing or they wouldn’t be included on the site. However, I think they could be better:

  1. The caption for each graph, shown in low-contrast gray text, reads:

    adapted from books.google.com/ngrams/ with a 7-year moving average; ngrams are probably unreliable.

    Google’s documentation page on Ngrams, OTOH, says:

    Ngram Viewer graphs and data may be freely used for any purpose, although acknowledgement of Google Books Ngram Viewer as the source, and inclusion of a link to https://books.google.com/ngrams, would be appreciated.

    An unlinked “books.google.com/ngrams/” is not the same as a link to that page, and it fails to make any actual “acknowledgement of Google Books Ngram Viewer as the source”, so it doesn’t seem to meet the spirit of their request. Why not link to the URL mentioned, instead?

  2. Better yet, why not link to the Ngram Viewer display of the actual graph being shown, so that users can experiment with modifications to the search terms? If Etymonline is embedding a graph with particular search terms, it shouldn’t be hard at all to generate a https://books.google.com/ngrams/graph? link with those same search terms, and link that in the caption instead of the unlinked URL. The link text could also provide the requested acknowledgement; on the etymonline page for “pule”, something like:

    adapted from Google Books Ngram Viewer results for pule with a 7-year moving average; ngrams are probably unreliable

  3. I notice that all of the graphs on etymonline only show data through 2019. That makes me think that etymonline is showing results from the older 2019 corpus. In July 2024, Google released a new dataset which contains results through 2022. That new corpus is the default for all searches run on the actual Ngram Viewer site, but the older one can still be requested by adding :eng_2019 to a search.

    (Though, it has to be said, neither searches for “pule” nor searches for “pule:eng_2019” produce data that matches what’s shown on etymonline, no matter what smoothing value I use. That lack of reproducibility, coupled with the lack of a link to the actual Ngrams results being displayed, makes it difficult to put much faith in the etymonline ngrams graphs, even relative to the questionable reliability of Google’s own graphs.)

  4. While we’re on the topic of the Google results, though, the Ngram Viewer has another feature that could be extremely useful when embedding Ngrams graphs on etymonline in particular.

    I chose “pule” as my example etymonline entry very deliberately, because it’s a verb that, for most of its history (though not always!), was much, much more commonly encountered in its adjective form: “puling”. Etymonline doesn’t maintain separate pages for the various forms of each entry, so the graph is seemingly generated only for the “primary” form? (Though as I noted in the previous item, I can’t really be sure where the data comes from.) Assuming that’s correct, though, primary-only searches can throw off the data for a term whose other forms are far more common. But Google has a solution for that: the inflection search, aka _INF.

    The results of a search for “pule” with otherwise-default values:

    The results of a search for “pule_INF”, instead:

    Showing inflection-search results on the etymonline pages, instead of uninflected primary-form results, could make the graphs more informative for many entries.
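
Items 2 through 4 all come down to generating a Ngram Viewer URL from the search term, which should be straightforward. A sketch in Python (the query parameters mirror the ones visible in the Ngram Viewer’s own URLs; the default values here are illustrative guesses, not etymonline’s actual settings):

```python
# Sketch: build a Google Books Ngram Viewer link for a graph caption.
# Parameter names (content, year_start, year_end, corpus, smoothing)
# are the ones the Ngram Viewer site itself puts in its URLs; the
# defaults below are assumptions, not etymonline's settings.
from urllib.parse import urlencode

def ngram_viewer_url(term, year_start=1500, year_end=2019,
                     corpus="en-2019", smoothing=3):
    params = urlencode({
        "content": term,
        "year_start": year_start,
        "year_end": year_end,
        "corpus": corpus,
        "smoothing": smoothing,  # Google averages over 2n+1 years
    })
    return "https://books.google.com/ngrams/graph?" + params

print(ngram_viewer_url("pule"))           # plain search, per item 2
print(ngram_viewer_url("pule:eng_2019"))  # pin the older corpus, per item 3
print(ngram_viewer_url("pule_INF"))       # inflection search, per item 4
```

The same string, used as the href of a link whose text reads “Google Books Ngram Viewer”, would satisfy both halves of Google’s attribution request at once.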

Perhaps etymonline is generating its own graphs from a downloaded Google Books Ngram Viewer dataset, in which case the latest data available is apparently the “version 3” set from 2020; no newer corpus is available for download, which would make item 3 above moot.

If so, that also complicates the question of things like inflection searches and parts-of-speech tagging.
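
If the graphs are generated locally, then the caption’s “7-year moving average” is presumably also computed etymonline-side. A minimal sketch of what that might look like, assuming “7-year” means a centered window that shrinks at the edges of the series (Google’s own smoothing=3 likewise averages each year with up to three neighbors on either side):

```python
# Sketch: centered moving average over a yearly frequency series.
# Assumes "7-year moving average" means a symmetric 7-wide window
# that shrinks near the ends of the series.
def moving_average(values, window=7):
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)                # window start, clamped
        hi = min(len(values), i + half + 1)  # window end, clamped
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

print(moving_average([0, 0, 0, 7, 0, 0, 0]))
```

Differences in exactly how the edges are handled (shrinking window vs. padding vs. truncation) could account for some of the irreproducibility noted in item 3.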

I downloaded the data file that would encompass “pule” and its other forms (the 1-gram file starting from “order.351_NUM”), and a quick scan shows that it incorporates data both with and without tagging.

$ grep -E '^(pules?|puling(s|ly)?|puled)(_|\s)' 1-00021-of-00024 | cut -f1-5
puling_X	1958,2,2	1966,2,2	1967,2,2	1970,2,1
pules_X	1754,1,1	1790,1,1	1800,3,3	1803,2,2
pules_ADJ	1820,1,1	1824,4,2	1825,1,1	1832,5,5
pule_ADJ	1773,1,1	1797,1,1	1798,2,2	1799,1,1
pules	1608,2,2	1654,1,1	1683,1,1	1689,1,1
pules_VERB	1608,2,2	1752,1,1	1774,1,1	1800,1,1
puled	1609,2,2	1656,1,1	1674,1,1	1681,1,1
puled_ADJ	1724,1,1	1755,1,1	1790,1,1	1793,1,1
puling_VERB	1588,2,2	1593,3,2	1596,1,1	1598,3,3
pulings_NOUN	1674,1,1	1730,1,1	1740,1,1	1765,1,1
pules_NOUN	1654,1,1	1683,1,1	1689,1,1	1705,2,1
puling_ADJ	1667,1,1	1715,1,1	1718,1,1	1750,2,2
puling	1588,6,3	1593,5,2	1596,2,1	1598,4,3
pulings	1674,1,1	1730,1,1	1740,1,1	1765,1,1
pule_VERB	1650,1,1	1672,1,1	1686,1,1	1700,1,1
pule_X	1651,1,1	1710,1,1	1725,1,1	1738,1,1
pulingly_ADV	1710,1,1	1750,2,2	1800,2,2	1811,1,1
pulingly	1710,1,1	1750,2,2	1800,2,2	1811,1,1
puled_VERB	1609,2,2	1656,1,1	1674,1,1	1681,1,1
pule	1561,2,1	1585,1,1	1608,2,2	1617,1,1
pule_NOUN	1561,2,1	1585,1,1	1608,2,2	1617,1,1
puling_NOUN	1588,4,3	1593,2,2	1596,1,1	1598,1,1

I’m not sure whether those are meant to be combined, or if e.g. the “pule_NOUN” data is a subset of the overall “pule” data. (Though, the fact that the visible points for both “pulings” and “pulings_NOUN” are identical makes me think the latter is true. It would be a weird coincidence if “pulings” was used exactly the same number of times, in four separate years, both in a form identified as a noun and in a form that couldn’t be identified.)
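
That subset hypothesis is easy to check mechanically. A sketch that parses rows in the format shown above (word, then tab-separated “year,match_count,volume_count” triples; the sample rows are copied from the grep output):

```python
# Sketch: parse ngram-dataset rows and check whether a tagged row's
# match counts could plausibly be a subset of the untagged row's.
def parse_row(line):
    word, *cells = line.rstrip("\n").split("\t")
    counts = {}
    for cell in cells:
        year, matches, volumes = map(int, cell.split(","))
        counts[year] = (matches, volumes)
    return word, counts

rows = dict(parse_row(line) for line in [
    "pulings\t1674,1,1\t1730,1,1\t1740,1,1\t1765,1,1",
    "pulings_NOUN\t1674,1,1\t1730,1,1\t1740,1,1\t1765,1,1",
])

# If the tagged data were a disjoint slice rather than a subset, the
# untagged counts would have to exceed the tagged counts somewhere;
# here they are identical in every year.
is_consistent_subset = all(
    rows["pulings"].get(year, (0, 0))[0] >= matches
    for year, (matches, _) in rows["pulings_NOUN"].items()
)
print(is_consistent_subset)
```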

I also don’t know what those “_X” entries are, though Google’s part-of-speech tagging includes an X tag as a catch-all for “other” categories, so that may be what’s being marked here.

Selecting the parts of speech to include in the graph would also then be left up to the data consumer. There doesn’t seem to be any data relating “pule” to “pules”, “puling”, “puled”, “pulingly”, etc… though etymonline already has at least some of that information in its own dataset.
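
If etymonline’s related-forms data were used, an _INF-style total could even be computed locally by summing the untagged rows for each form. A sketch (the form list stands in for whatever related-forms information etymonline actually has; the counts are abridged from the untagged rows above):

```python
# Sketch: emulate an _INF-style inflection search locally by summing
# per-year match counts across a word's forms. The forms and counts
# here are abridged from the untagged dataset rows shown earlier.
from collections import Counter

counts_by_form = {
    "pule":   {1561: 2, 1585: 1, 1608: 2, 1617: 1},
    "pules":  {1608: 2, 1654: 1, 1683: 1, 1689: 1},
    "puled":  {1609: 2, 1656: 1, 1674: 1, 1681: 1},
    "puling": {1588: 6, 1593: 5, 1596: 2, 1598: 4},
}

combined = Counter()
for yearly in counts_by_form.values():
    combined.update(yearly)  # sums the match counts per year

print(dict(sorted(combined.items())))
```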