Eggcorn Forum

Discussions about eggcorns and related topics


#1 2014-06-05 05:01:55

From: Victoria, BC
Registered: 2007-08-28
Posts: 2103

Uncharted and ngrams

Uncharted came across my desk this week. It’s Erez Aiden and Jean-Baptiste Michel’s new book explaining how they came to develop the Google Ngram Viewer and reporting some of the culturomic experiments they tackled with the new Viewer. A funny, insightful, well-written volume.

While reading the book it occurred to me that there was a way we could use their work on ngrams to explore the penetration of certain eggcorns into English vocabulary. The issue of eggcorn standardization is a topic that has challenged and confused several of us on this Forum. The main tools we have used for this are Google and Bing, but web indexers, we have found, do not give us reliable figures about the number of distinct hits for a given word/expression. When Google shows us samples of expressions, it limits us to a thousand of the ones it deems (via the complex PageRank algorithm) to be the most significant. What we don’t know about the data provided by the search engines makes any numerical analysis a matter of guesses upon guesses.

The Google Ngram Viewer analyzes data from a huge database of published materials. Google, which is attempting to digitize every item that has ever been published, estimates that there may be 130 million books and journals archived in our libraries. So far Google has acquired about a quarter of these materials in digital format. Aiden and Michel, through contacts at Google, were allowed to play around with this database. They began by reducing the size of the database to about four (now six) percent of the world’s publications, selecting from Google’s larger database only volumes that could be pegged to a specific year and that were in one of their eight target languages. They then compiled a list of all the words (1-grams) and phrases (2-grams, 3-grams, etc.) in the five hundred billion words in their sources.

But are their 1-grams the same as the words we find in a dictionary? Many terms that appear in print are not considered valid lexical entries by dictionary writers. Some printed words are variant spellings, misspellings, regionalisms, etc. Aiden and Michel’s solution to this phantom word problem was to index only the terms that appeared, on average, at least once in every billion 1-grams of published text. This meant that an English word had to occur about fifty times in their published sources before it was included in the Ngram database. Words that occur this often are usually standard terms for large segments of the English-speaking world. One would think, then, that a 1-gram index of Aiden and Michel’s English database would contain about the same number of words as the OED, the most exhaustive collection of English words. The OED has half a million words. The Ngram database, even after being purged of the 1-grams typically excluded by dictionary compilers (words with numbers in them, inflections, compounds, etc.), still contains over a million terms. Aiden and Michel had inadvertently uncovered hard evidence of what they call “lexical dark matter”: real words that do not appear in dictionaries.

This lexical dark matter, though it is at least as large as dictionary-visible matter, is dark because its ngrams are extremely low-frequency words. George Zipf discovered this relationship between commonality and cohort size in the 1930s by doing a mathematical analysis of word frequencies in popular books. The patterns he found, known today as Zipf’s Power Law, have been applied with fidelity to a number of human experiences. When applied to texts and word frequencies, the Power Law says that if a million words occur more than fifty times in a body of text, a much smaller number, perhaps only a hundred thousand, would occur more than five hundred times. Nine hundred thousand words, 90% of the total, would therefore be little-used words. Even though lexical dark matter is about half of English, it’s a half that, because of its rarity, has little effect on the experience of English speakers.
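For anyone who wants to play with the power-law arithmetic, here is a minimal Python sketch. It assumes an idealized Zipf law in which a word's frequency is proportional to 1 / rank**exponent, with an exponent of 1; real corpora only approximate this, and the function name and parameters are made up for illustration.

```python
def words_above(threshold, known_threshold, known_count, exponent=1.0):
    """Estimate how many distinct words occur more than `threshold` times,
    given that `known_count` words occur more than `known_threshold` times.

    Assumes an idealized Zipf law f(rank) = C / rank**exponent, under which
    the number of words above a frequency t scales as t**(-1 / exponent).
    """
    return known_count * (known_threshold / threshold) ** (1.0 / exponent)


# Round numbers from the discussion: a million words occur more than 50 times.
total = 1_000_000
frequent = words_above(500, known_threshold=50, known_count=total)
rare = total - frequent

print(f"{frequent:,.0f} words occur more than 500 times")
print(f"{rare:,.0f} words ({rare / total:.0%}) fall between 50 and 500 occurrences")
```

Changing `exponent` shows how sensitive the split is to the assumed shape of the curve: a steeper law pushes even more of the vocabulary into the little-used band.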

Eggcorns tend to belong to lexical dark matter. Dictionary compilers have an intrinsic bias against including eggcorns in their lists. Eggcorns are first cousins to the slips, alternate spellings, and malaprops that lexicographers work hard to screen out of dictionaries. A dictionary, after all, is not just a descriptive list of words that people actually use. It is also a prescriptive tool to tell people which are the correct words to use. To include palpable mistakes as dictionary headwords would be to encourage error. Only the most widely propagated errors command enough attention to break through the lexicographer’s screen and appear as dictionary headwords. Eggcorns are, by their very nature, crepuscular critters.

Aiden and Michel’s mechanical definition of a lexical item, however, changes the rules that decide what is and is not a word. A number of popular eggcorns, one would imagine, cross Aiden and Michel’s line of significance and become ngrams. When they appear in the ngram database, we have access to hard facts that the web search engines will not give us—real, reliable numbers about how widespread these eggcorns actually are.

But how many eggcorns rise to wordhood under the new rules? To put this question to the test, I took the sixty-some eggcorns that we selected from our 2013 candidates as the best of the yearly crop and submitted them to the Google Ngram Viewer. Not all eggcorns can be tested, of course. They have to be words and phrases that would not otherwise exist. Hidden eggcorns are, and always will be, part of the dark matter. Even certain non-hidden eggcorns remain concealed. The Google Ngram Viewer recognizes, for example, “mongrel hordes,” which we have discussed as an eggcorn for “Mongol hordes.” We would need a context, however, for each example—something that the Ngram Viewer does not supply—to differentiate the eggcorn from the casual phrase “mongrel (i.e., mixed) hordes.” Still, the great majority of eggcorns are at least testable, and the Google Ngram Viewer provides a powerful search syntax (controlling for parts of speech, for example) that lets us rule out some of the ambiguities.

Of the sixty-some eggcorns in our 2013 favorites list, I found only three, a mere one out of twenty, that were frequent enough to appear in the ngram database. They are the English eggcorns “hectacre” and “divine inflatus” and the Spanish “vagamundo.” Here is an Ngram Viewer graph of the two English words. And here is the Ngram Viewer graph of the Spanish word.

The cutoff frequency for appearing in the database (the one-in-every-billion-words level) would be 0.0000001% on the Viewer chart. These three eggcorns, you will note, tend to be about fifty times this frequency, on average, though in certain decades they rise to more than 500 times this base frequency.
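The conversion between "once per billion words" and a percentage on a chart axis is easy to slip on, so here is a small Python sketch of it. The helper name and the ten-decimal format are my own choices for illustration, and the fifty-billion-word figure for the English corpus is an assumption taken from the rough size mentioned in this thread.

```python
CUTOFF_FRACTION = 1 / 1_000_000_000  # one occurrence per billion 1-grams


def as_percent_label(fraction):
    """Format a relative frequency as a fixed-point percentage string."""
    return f"{fraction * 100:.10f}%"


# The cutoff itself, then the 50x and 500x levels mentioned above.
for multiple in (1, 50, 500):
    print(f"{multiple:>3} x cutoff = {as_percent_label(multiple * CUTOFF_FRACTION)}")

# Assumed corpus size: roughly fifty billion words of English text.
ENGLISH_CORPUS_WORDS = 50_000_000_000
min_occurrences = round(CUTOFF_FRACTION * ENGLISH_CORPUS_WORDS)
print(f"minimum occurrences at the cutoff: {min_occurrences}")
```

With that assumed corpus size, the cutoff works out to the "about fifty times" threshold described earlier.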

The one out of twenty ratio from the 2013 favorites is probably about right. On the downside, the ratio applies to a list that we judged to be the best of the year, and “best” often includes the idea of “common.” On the upside, this is the 2013 list and the 2013 list is rather rarified, as such lists go, since we collect in these yearly summaries only previously unfound eggcorns. Most of the really popular ones are de facto excluded from such a recent list. My guess is that the positive and negative balance out, and that 5% of all the eggcorns discussed in our Eggcorn Database and Eggcorn Forum could be found, if we looked, in the ngram database. This means that one or two hundred eggcorns which do not appear in any dictionary would have some basic claim to lexical inclusion when found by a mechanical approach to defining words.

Using the ngram database to find raw occurrence frequencies and argue for wordhood, however, is only part of what we can do with a quantitative tool. What we would really like to do is to compare how well eggcorns perform against each other and against their respective acorns. We need to compute what might be called “eggcorn strength”: how likely the eggcorn is to replace its associated acorn, the word/phrase that launches the eggcorn. We can use the Ngram Viewer to compute the occurrence ratios of the two terms, the acorn and eggcorn, by applying a couple of tricks. Take the hectacre/hectare pair as an example. We type the two words into the viewer, separated by a comma. This shows us, however, only the line for “hectare.” The eggcorn “hectacre” is so infrequent that it is invisible along the base of the graph. We then put in a ridiculously large smoothing factor, say 40, to turn the acorn and (invisible) eggcorn frequencies into relatively smooth lines. Now we apply a multiplier to the eggcorn. In the search box we type in “(hectacre*1000),hectare,” and voila, the hidden line appears. Finally, we raise the multiplier amount until the graph lines track together. If the lines don’t have the same slope (they do for “hectare” and “hectacre”), we can pick a run of years where we want the lines to cross. It turns out in our example that a multiplier of “4000” will lay one line on the other. So the eggcorn “hectacre,” we now know, occurs once for every four thousand “hectares” and has done this consistently for two centuries. Here’s the resulting Ngram Viewer graph.
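The eyeball-the-multiplier step can also be done numerically once you have the two yearly frequency series in hand. The Python below is only a sketch under stated assumptions: the function names are hypothetical, the centered moving average is a rough stand-in for the Viewer's smoothing control, and the numbers are made up (the toy eggcorn series is built to be exactly 1/4000 of the toy acorn series so the answer is known in advance). It does not fetch anything from Google.

```python
def moving_average(values, smoothing):
    """Average each point with up to `smoothing` neighbors on either side.

    This approximates the Viewer's smoothing setting (an assumption);
    smoothing=0 returns the series unchanged.
    """
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - smoothing)
        hi = min(len(values), i + smoothing + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


def best_multiplier(acorn, eggcorn, smoothing=0):
    """Least-squares k that minimizes sum((k * eggcorn - acorn) ** 2)."""
    a = moving_average(acorn, smoothing)
    e = moving_average(eggcorn, smoothing)
    denom = sum(x * x for x in e)
    if denom == 0:
        raise ValueError("eggcorn series is all zeros; no multiplier can be fitted")
    return sum(x * y for x, y in zip(e, a)) / denom


# Made-up yearly relative frequencies, for illustration only.
acorn = [2.0e-6, 2.2e-6, 2.1e-6, 2.4e-6, 2.3e-6, 2.5e-6]
eggcorn = [f / 4000 for f in acorn]  # toy eggcorn at exactly 1/4000 of the acorn

k = best_multiplier(acorn, eggcorn, smoothing=2)
print(f"one eggcorn for roughly every {round(k)} acorns")
```

Restricting both lists to a chosen run of years before calling `best_multiplier` corresponds to picking where you want the two lines to cross when their slopes differ.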

A brief glance through our forum turns up some other ratios. By looking at the volumes published between 1965 and 1975 and using a smaller smoothing (3), we find that “once beaten twice shy” was employed once for each ten “once bitten twice shy” 4-grams. This is an eggcorn well on its way to becoming a standard. In contrast, during the nineteenth century, “airogant” was written 1/5000th as often as “arrogant” (smoothing of 50). “Trumpeted up” for “trumped up” lies between these two other ratios. Between 1960 and the present, “trumpeted up” was 1/100th as common as the “trumped up” it replaced (smoothing of 30).

As time allows, I’ll look in the ngram database for other eggcorn ratios. Perhaps others on this forum would like to help. It is disappointing, however, to see how many eggcorns do not appear in the ngram database. As a tool, Google’s Ngram Viewer with its fifty billion plus words of English text is better than BYU’s Corpus of Global Web-Based English and its two billion words. But not hugely better: the ngram database cutoff at one word per billion keeps us from doing anything with the vast majority of eggcorns. In my fantasies, a Google Fairy drops into this Forum and says “how would you like to be able to play around with an ngram database of 1-, 2-, 3- and 4-grams that occur five times or more in a selected and dated corpus of five hundred billion words?” With such a database, we might be able to add reliable standardization/penetration information about most of the eggcorns we discuss here.

Last edited by kem (2014-06-12 03:11:23)



#2 2014-06-06 02:10:31

From: Mexico
Registered: 2007-10-12
Posts: 1751

Re: Uncharted and ngrams

Awesome. Quite simply awesome.

Only the most widely propagated errors command enough attention to break through the lexicographer’s screen and appear as dictionary headwords. Eggcorns are, by their very nature, crepuscular critters.

Yes. But another factor is that when they creep (or charge) out of the shadowlands, we immediately say “Oh, that’s a folk etymology” and tend strongly not to include them among our listed eggcorns, though we sometimes discuss them on this forum. Eggcorns are, by the very criteria we use as collectors, crepuscular critters.

*If the human mind were simple enough for us to understand,
we would be too simple-minded to understand it.*

(Possible Corollary: it is, and we are.)



#3 2014-06-07 02:42:32

From: Victoria, BC
Registered: 2007-08-28
Posts: 2103

Re: Uncharted and ngrams

One other factoid from Uncharted that may be relevant to this Forum: we know that new words are entering English and that old ones are fading away, but is the overall wordhord growing? And if so, how fast?

It’s hard for lexicographers to answer these questions. Our estimation of the volume of new words is imprecise, but it’s banging boffo compared with what we know about the rate at which words die. But if we accept the definition of a word in Uncharted (a 1-gram that occurs more than once per billion words in the Google database of published works), we can actually answer these questions with some precision. English, Aiden and Michel tell us, has been growing its vocabulary for the last century. Since 1950 growth has been particularly strong. Today the vocabulary of English increases at the rate of about 8400 1-grams a year, about two dozen per day.

One wonders how many of these new words have an eggcornish dimension. Fifty a year? A hundred? Whatever it is, Pat’s Eggcorn Omega is clearly a receding goal. We can only hope that we are gaining more ground than we are losing.



Individual posters retain the copyright to their posts.