Embedded Ngrams graphs are probably unreliable, but could still be improved

I have… four observations regarding the Ngrams graph that’s shown on each etymology entry. Stipulating, up front, that (to quote the caption on each graph) “ngrams are probably unreliable”. Be that as it may, they must have been deemed better than nothing or they wouldn’t be included on the site. However, I think they could be better:

  1. The caption for each graph, shown in low-contrast gray text, reads:

    adapted from books.google.com/ngrams/ with a 7-year moving average; ngrams are probably unreliable.

    Google’s documentation page on Ngrams, OTOH, says:

    Ngram Viewer graphs and data may be freely used for any purpose, although acknowledgement of Google Books Ngram Viewer as the source, and inclusion of a link to https://books.google.com/ngrams, would be appreciated.

    An unlinked “books.google.com/ngrams/” is not the same as a link to that page, and it fails to make any actual “acknowledgement of Google Books Ngram Viewer as the source”, so it doesn’t seem to meet the spirit of their request. Why not link to the URL mentioned, instead?

  2. Better yet, why not link to the Ngram Viewer display of the actual graph being shown, so that users can experiment with modifications to the search terms? If Etymonline is embedding a graph with particular search terms, it shouldn’t be hard at all to generate a https://books.google.com/ngrams/graph? link with those same search terms, and link that in the caption instead of the unlinked URL. The link text could also provide the requested acknowledgement; on the etymonline page for “pule”, something like:

    adapted from Google Books Ngram Viewer results for pule with a 7-year moving average; ngrams are probably unreliable

  3. I notice that all of the graphs on etymonline only show data through 2019. That makes me think that etymonline is showing results from the older 2019 corpus. In July 2024, Google released a new dataset which contains results through 2022. That new corpus is the default for all searches run on the actual Ngram Viewer site, but the older one can still be requested by adding :eng_2019 to a search.

    (Though, it has to be said, neither searches for “pule” nor searches for “pule:eng_2019” produce data that matches what’s shown on etymonline, no matter what smoothing value I use. That lack of reproducibility, coupled with the lack of a link to the actual Ngrams results being displayed, makes it difficult to put much faith in the etymonline ngrams graphs, even relative to the questionable reliability of Google’s own graphs.)

  4. While we’re on the topic of the Google results, though, the Ngram Viewer has another feature that could be extremely useful when embedding Ngrams graphs on etymonline in particular.

    I chose “pule” as my example etymonline entry very deliberately, because it’s a verb that, for most of its history (though not always!), was much, much more commonly encountered in its adjective form: “puling”. Etymonline doesn’t maintain separate pages for the various forms of each entry, so the graph is seemingly generated only for the “primary” form? (Though as I noted in the previous item, I can’t really be sure where the data comes from.) Assuming that’s correct, though, primary-only searches can throw off the data for a term whose other forms are far more common. But Google has a solution for that: the inflection search, aka _INF.

    The results of a search for “pule” with otherwise-default values:

    The results of a search for “pule_INF”, instead:

    Showing inflection-search results on the etymonline pages, instead of uninflected primary-form results, could make the graphs more informative for many entries.
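
Items 2 through 4 all come down to generating a Ngram Viewer URL from the search term, which should be straightforward. A sketch in Python (the query parameters mirror the ones visible in the Ngram Viewer’s own URLs; the default values here are illustrative guesses, not etymonline’s actual settings):

```python
# Sketch: build a Google Books Ngram Viewer link for a graph caption.
# Parameter names (content, year_start, year_end, corpus, smoothing)
# are the ones the Ngram Viewer site itself puts in its URLs; the
# defaults below are assumptions, not etymonline's settings.
from urllib.parse import urlencode

def ngram_viewer_url(term, year_start=1500, year_end=2019,
                     corpus="en-2019", smoothing=3):
    params = urlencode({
        "content": term,
        "year_start": year_start,
        "year_end": year_end,
        "corpus": corpus,
        "smoothing": smoothing,  # Google averages over 2n+1 years
    })
    return "https://books.google.com/ngrams/graph?" + params

print(ngram_viewer_url("pule"))           # plain search, per item 2
print(ngram_viewer_url("pule:eng_2019"))  # pin the older corpus, per item 3
print(ngram_viewer_url("pule_INF"))       # inflection search, per item 4
```

The same string, used as the href of a link whose text reads “Google Books Ngram Viewer”, would satisfy both halves of Google’s attribution request at once.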

Perhaps etymonline is generating its own graphs from a downloaded Google Books Ngram Viewer dataset, in which case the latest data available is apparently the “version 3” set from 2020; no newer corpus is available for download, which would make item 3 above moot.

If so, that also complicates the question of things like inflection searches and parts-of-speech tagging.
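
If the graphs are generated locally, then the caption’s “7-year moving average” is presumably also computed etymonline-side. A minimal sketch of what that might look like, assuming “7-year” means a centered window that shrinks at the edges of the series (Google’s own smoothing=3 likewise averages each year with up to three neighbors on either side):

```python
# Sketch: centered moving average over a yearly frequency series.
# Assumes "7-year moving average" means a symmetric 7-wide window
# that shrinks near the ends of the series.
def moving_average(values, window=7):
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)                # window start, clamped
        hi = min(len(values), i + half + 1)  # window end, clamped
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

print(moving_average([0, 0, 0, 7, 0, 0, 0]))
```

Differences in exactly how the edges are handled (shrinking window vs. padding vs. truncation) could account for some of the irreproducibility noted in item 3.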

I downloaded the data file that would encompass “pule” and its other forms (the 1-gram file starting from “order.351_NUM”), and a quick scan shows that it incorporates data both with and without tagging.

$ grep -E '^(pules?|puling(s|ly)?|puled)(_|\s)' 1-00021-of-00024 | cut -f1-5
puling_X	1958,2,2	1966,2,2	1967,2,2	1970,2,1
pules_X	1754,1,1	1790,1,1	1800,3,3	1803,2,2
pules_ADJ	1820,1,1	1824,4,2	1825,1,1	1832,5,5
pule_ADJ	1773,1,1	1797,1,1	1798,2,2	1799,1,1
pules	1608,2,2	1654,1,1	1683,1,1	1689,1,1
pules_VERB	1608,2,2	1752,1,1	1774,1,1	1800,1,1
puled	1609,2,2	1656,1,1	1674,1,1	1681,1,1
puled_ADJ	1724,1,1	1755,1,1	1790,1,1	1793,1,1
puling_VERB	1588,2,2	1593,3,2	1596,1,1	1598,3,3
pulings_NOUN	1674,1,1	1730,1,1	1740,1,1	1765,1,1
pules_NOUN	1654,1,1	1683,1,1	1689,1,1	1705,2,1
puling_ADJ	1667,1,1	1715,1,1	1718,1,1	1750,2,2
puling	1588,6,3	1593,5,2	1596,2,1	1598,4,3
pulings	1674,1,1	1730,1,1	1740,1,1	1765,1,1
pule_VERB	1650,1,1	1672,1,1	1686,1,1	1700,1,1
pule_X	1651,1,1	1710,1,1	1725,1,1	1738,1,1
pulingly_ADV	1710,1,1	1750,2,2	1800,2,2	1811,1,1
pulingly	1710,1,1	1750,2,2	1800,2,2	1811,1,1
puled_VERB	1609,2,2	1656,1,1	1674,1,1	1681,1,1
pule	1561,2,1	1585,1,1	1608,2,2	1617,1,1
pule_NOUN	1561,2,1	1585,1,1	1608,2,2	1617,1,1
puling_NOUN	1588,4,3	1593,2,2	1596,1,1	1598,1,1

I’m not sure whether those are meant to be combined, or if e.g. the “pule_NOUN” data is a subset of the overall “pule” data. (Though, the fact that the visible points for both “pulings” and “pulings_NOUN” are identical makes me think the latter is true. It would be a weird coincidence if “pulings” was used exactly the same number of times, in four separate years, both in a form identified as a noun and in a form that couldn’t be identified.)
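
That subset hypothesis is easy to check mechanically. A sketch that parses rows in the format shown above (word, then tab-separated “year,match_count,volume_count” triples; the sample rows are copied from the grep output):

```python
# Sketch: parse ngram-dataset rows and check whether a tagged row's
# match counts could plausibly be a subset of the untagged row's.
def parse_row(line):
    word, *cells = line.rstrip("\n").split("\t")
    counts = {}
    for cell in cells:
        year, matches, volumes = map(int, cell.split(","))
        counts[year] = (matches, volumes)
    return word, counts

rows = dict(parse_row(line) for line in [
    "pulings\t1674,1,1\t1730,1,1\t1740,1,1\t1765,1,1",
    "pulings_NOUN\t1674,1,1\t1730,1,1\t1740,1,1\t1765,1,1",
])

# If the tagged data were a disjoint slice rather than a subset, the
# untagged counts would have to exceed the tagged counts somewhere;
# here they are identical in every year.
is_consistent_subset = all(
    rows["pulings"].get(year, (0, 0))[0] >= matches
    for year, (matches, _) in rows["pulings_NOUN"].items()
)
print(is_consistent_subset)
```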

I also don’t know what those “_X” entries are, though Google’s part-of-speech tagging includes an X tag as a catch-all for “other” categories, so that may be what’s being marked here.

Selecting the parts of speech to include in the graph would also then be left up to the data consumer. There doesn’t seem to be any data relating “pule” to “pules”, “puling”, “puled”, “pulingly”, etc… though etymonline already has at least some of that information in its own dataset.
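
If etymonline’s related-forms data were used, an _INF-style total could even be computed locally by summing the untagged rows for each form. A sketch (the form list stands in for whatever related-forms information etymonline actually has; the counts are abridged from the untagged rows above):

```python
# Sketch: emulate an _INF-style inflection search locally by summing
# per-year match counts across a word's forms. The forms and counts
# here are abridged from the untagged dataset rows shown earlier.
from collections import Counter

counts_by_form = {
    "pule":   {1561: 2, 1585: 1, 1608: 2, 1617: 1},
    "pules":  {1608: 2, 1654: 1, 1683: 1, 1689: 1},
    "puled":  {1609: 2, 1656: 1, 1674: 1, 1681: 1},
    "puling": {1588: 6, 1593: 5, 1596: 2, 1598: 4},
}

combined = Counter()
for yearly in counts_by_form.values():
    combined.update(yearly)  # sums the match counts per year

print(dict(sorted(combined.items())))
```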