Discussions about eggcorns and related topics
Something that’s emotionally wrenching can in fact make your stomach hurt. I think “gut-retching” (243 ughits) works pretty well – there’s a lot of conscious spelling change from “wrenching” to “retching.” But it’s got to be said that “heart-retching” (249 ughits) – a less fortuitous combo – is slightly more common (though see below).
I liked that ending most of all, at first I thought it was heading the obvious route but nope we get a gut-retching twist at the end.
He’s also responsible for, IMO, hands down, the most depressing, gut retching song EVER written, bar none. Lou Reed’s “The Kids” from the
Easy to say that he would not have made the gut retching decision to use force if Hussein did not comply with the United Nations inspections.
http://www.huffingtonpost.com/users/pro … ort=newest
In some cases, “gut-retching” isn’t eggcornish but seems instead like an emphatic way of talking about “retching”:
I think its very rude to shake someones hand and at the same time hold your stomach as if to fight back the gut-retching vomit.
http://blogs.wsj.com/law/2008/08/22/sai … r-manners/
“Aversion” therapy often included painful electric shock or apomorphine injections, which caused severe gut-retching nausea.
http://www.boxturtlebulletin.com/catego … ve-therapy
If Google numbers can be believed (and maybe they can’t), “gut-retching” does seem to do better against its standard form than “heart-retching” does. Today, at least, Google claims that “heart-wrenching” is about a third more frequent than “gut-wrenching”:
heart-retching 249 ughits
heart-wrenching 1400k rghits
gut-retching 243 ughits
gut-wrenching 1100k rghits
“Gut-retching.” That’s gotta go on the list.
There are about 30 examples out there of “gut-drenching” for “gut wrenching.” A funny transformation, but it’s hard to see the eggcorn factor. Unless I’m missing something. Alimentary canal jokes are not my forte.
The ratio between ughits of one spelling and rghits of another is not a relevant comparison, is it? The 243 ughits for “gut-retching” can help us conclude that about 24% of the hits (243/1000) are to web sites with substantially different content. So if there are 1500 rghits for “gut retching,” as Google reports, we could reasonably guess that about 360 of them are unique pages. For “gut wrenching” the number of ughits is 710, so of the projected 1100K rghits about 780K (710/1000 × 1100K) are, we might guess, relatively unique. The ughit/rghit ratio for “heart-retching” is 247/1020, so Google seems to find about 250 unique pages with this expression (247/1000 × 1020). For “heart-wrenching” the numbers are 870/1000 × 1400K, for a projected 1220K unique pages. Comparing the projected ughits for the gut and heart expressions suggests that “gut retching” occurs roughly once for every 2200 instances of “gut wrenching,” while “heart-retching” occurs only once for every 4800 examples of “heart-wrenching.” The doubled frequency for the “gut” forms might be due to the eggcorn factor.
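For anyone who wants to check the arithmetic, it can be redone in a few lines of Python. This is my own re-derivation of the estimate, assuming the interpretation above: that ughits measure unique pages per 1000 sampled, so scaling the raw-hit totals by that fraction projects the unique-page counts.

```python
def projected_unique(ughits, rghits):
    # Treat ughits as "unique pages per 1000 sampled" and scale the raw total.
    return ughits / 1000 * rghits

# Figures are the ones quoted in the post above.
gut_retch    = projected_unique(243, 1_500)        # about 365 unique pages
gut_wrench   = projected_unique(710, 1_100_000)    # about 781,000
heart_retch  = projected_unique(247, 1_020)        # about 252
heart_wrench = projected_unique(870, 1_400_000)    # about 1,218,000

print(gut_wrench / gut_retch)      # roughly 2100: one "gut retching" per ~2100 "gut wrenching"
print(heart_wrench / heart_retch)  # roughly 4800: one "heart-retching" per ~4800 "heart-wrenching"
```

The exact quotients depend on how much you round along the way, but the gut ratio comes out about half the heart ratio either way.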
Then there’s “heart-rendering” at about 63,000 ghits. (I always think of lard being rendered.) (It’s been reported previously, I see.)
*If the human mind were simple enough for us to understand,
we would be too simple-minded to understand it* .
Kem—If I understand your theory—and I may not—I think it’s brilliant. But if I do understand it, then there are a lot of things that make no sense to me.
Here’s the problem. For years I’ve noticed that things that get raw hits scores in the millions almost always have unique hits of 700 or higher. But things that have raw hits scores of 5000 to 10000 tend to have scores in the range of roughly 300 to 700, with far fewer scoring over 700. If the unique hits score is, as you say, a proportion of unique hits per 1000 pages surveyed, then the pattern I’ve noticed seems to make no sense at all. Shouldn’t pages getting hits in the millions vary quite widely in the unique hits counts if the latter is a proportional measure of duplication? Why then are virtually all of them polling 70% or higher? And wouldn’t you expect the percentage of interlinkage or duplication to increase logarithmically as the raw hits get bigger? So then shouldn’t the things with really high raw hits scores tend to have lower unique hits scores—in fact, I would think much lower—than the things in the 5000 to 10000 raw hits category?
[Edit: I changed “interlinkage/duplication ratio” to “percentage of interlinkage or duplication” to avoid (I hope) confusion.]
Last edited by patschwieterman (2008-10-27 15:48:58)
I don’t know how useful it is – and I’m aware that you numerate blokes are busy swapping number-notions – but I can’t resist a change of tense which brings into play the notion of wretchedness.
‘Gut retched’ has 9 ughits
‘Gut wretched’ 99 and
‘Gut wrenched’ 402
Pat – We are wading through murky waters here, largely because we do not know how Google does its computations for total hits and how it decides which pages to present. A few comments, though.
First, your observation that the ratio of ughits/rghits is lower for the words and phrases that occur less frequently is correct. I have not seen this observation contradicted.
Second, we can and should assume that the Google database from which the thousand rghits are drawn has an inherent duplicate page factor. If we were to scan the billion or so pages that Google claims to index, look for pages that had relatively the same content, and throw out all of the duplicates except for a single exemplar, a certain percentage of the database would disappear. My guess is that the Google database would be reduced by more than 70%. I base this guess on the fact that the typical low range for ughits in a sample size of 1000 pages is about 30%. For reasons I will mention below, I think the selection process for the rghit pages that are presented to the user tends to eliminate duplicates, so the real number is probably as low as or lower than the lowest ranges. I find the number of duplicates in the Google database a little shocking. There is a lot of content duplication on the web. I suppose I should reserve judgment, though, since I don’t really know how the Google cowbots corral the cattle they contribute to the database. (We should keep in mind that the most distinctive web pages, dynamic pages that are computed from user input, are never rounded up, since the bots grab only static pages.)
Third, you make the assumption in your last post that you would expect fewer duplicate pages from smaller sample sizes taken from the database. This is not an easy assumption to prove. The math needed to prove it is formidable. However, there is a useful analogy: the famous birthday paradox. This is the puzzle about how many people would need to be in a room before the odds are over 50% that two of them have the same birthday. Contrary to what most people believe, only 23 people are needed, even though there are 366 possible birth dates. As a step in computing the answer to the birthday paradox, we first calculate a complementary problem: the odds that when a certain number of people are in a room they all have different birthdays. If you think about it, this step in solving the birthday paradox is not unlike calculating the odds that a group of web pages contains no duplicates when the pages are drawn from a larger bin of web pages that has a certain number of duplicates. If you look at graphs showing the relationship between the number of items selected at random and the probability that no duplicates are drawn (some graphs are at http://en.wikipedia.org/wiki/Birthday_paradox), you will see that smaller samples “err” (i.e., deviate from a straight line) in the direction of a higher probability of containing no duplicates. So your assumption may have some basis in mathematics. If you trust the analogy.
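For the curious, the complementary calculation is short enough to check in a few lines (using 365 equally likely birthdays for simplicity, rather than all 366 possible dates):

```python
def prob_all_distinct(n, days=365):
    """Probability that n randomly chosen birthdays are all different."""
    p = 1.0
    for k in range(n):
        p *= (days - k) / days
    return p

# The crossover the paradox is famous for: at 23 people, a shared
# birthday becomes more likely than not.
print(prob_all_distinct(22))  # about 0.524: all-distinct still likelier
print(prob_all_distinct(23))  # about 0.493: a collision is now likelier
```

The same product, with “days” reinterpreted as the number of distinct pages in the bin, is what makes small samples so likely to be duplicate-free.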
Fourth, this assumption you make, the one about smaller sample sizes having fewer duplicate pages, has another (alas, fatal) assumption behind it, and that is that the web page samples are drawn at random from a larger bin of web pages with duplicates. This is not a safe assumption. In fact, we know it is not true. Google prides itself on the fact that in its thousand rghits it captures the page you are looking for when you submit a search term. It is usually able to present this page to you on the first page of rghits. If it really drew the rghit sample at random from millions of pages, the odds of getting the page you wanted would be vanishingly small. To get the crucial page Google applies its patented PageRank algorithm to the process of drawing the sample. Since this distorts the randomness of the sample, we should not be surprised if it affects the number of duplicate pages it vacuums up when it grabs the thousand it presents to the user.
How does using the PageRank algorithm to draw the sample of a thousand pages skew the number of duplicate pages in the result? My guess would be that when the bin from which the sample is drawn is very large, on the order of millions of pages, duplicates tend to be eliminated. The reason is simple. The PageRank algorithm uses a voting system. The more people link to a given web site, the more votes it gets, and the higher its rank. If two pages have roughly the same content, chances are that web site authors will tend to link to one rather than the other because one of the pages represents the source site for the other page. The nonsource page, getting fewer votes, tends to drop below the top thousand in the PageRanked list of ghits and disappear. In contrast, a search term that yields only a thousand or fewer results will not have the duplicates eliminated by the PageRank algorithm. This is why, I suspect, the ughit/rghit ratio is so much lower for the terms that occur on fewer web pages. And why your assumption, as sound as it seems to be from a mathematical point of view, cannot be applied to the problem at hand.
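The voting idea can be illustrated with a toy power-iteration ranker (my own simplification for the sake of the argument, not Google’s actual implementation): two blogs link to a source page, nobody links to its duplicate, and the duplicate sinks in the ranking.

```python
def pagerank(links, damping=0.85, iters=50):
    """Toy PageRank: pages split their 'vote' among the pages they link to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:  # dangling page: spread its rank evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

# Hypothetical link graph: the source page and its duplicate have the
# same content, but authors link only to the source.
links = {
    "source": [],
    "duplicate": [],
    "blogA": ["source"],
    "blogB": ["source"],
}
r = pagerank(links)
print(r["source"] > r["duplicate"])  # True: the source outranks its duplicate
```

In a bin of millions, the lower-ranked twin falls out of the top thousand and is never shown, which is the filtering effect described above.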
Last edited by kem (2008-10-28 12:56:19)
Kem—Wow. We have utterly failed to communicate. In your second paragraph, you seem to be agreeing with me on something—but the observation you attribute to me is not one that I agree with or believe is likely. And then in your fourth and fifth paragraphs you argue energetically and at length with an idea I never stated and certainly never intended to imply. (I’m actually arguing against the sampling idea—at least in the lower range of numbers; I don’t have a clue what’s going on with ugh numbers based on rgh numbers much over 10000.) I think the problem may be that you underestimate how fundamentally and completely I disagree with you.
But maybe I’m wrong. Let’s start over. Usually when people talk about the relation of unique hits to raw hits, they see the unique hits as simply what’s left after the dupes, etc. are removed from the earlier raw hits reading. Your interpretation is completely different—it’s one that Google seems to give no hint of and one that I’ve never seen anyone else mention. Did you get the idea that the ugh number is really a proportion of unique hits per thousand from Google or some other source, or is it your own idea?
In an earlier thread (http://eggcorns.lascribe.net/forum/view … hp?id=3291) I tried to explain the retrieval process, at least insofar as Google supplies information about how its search mechanism works. That’s where I mentioned the thousand pages that Google fetches. I was using that post as a starting point. Which I totally failed to mention, so it must seem like I’m pulling rabbits out of my hat. Sorry about that.
OK, starting over. Raw hits and unique pages presented to the user have no known relationship to each other. The raw hit number given in the upper right corner of a search page is based on an undisclosed algorithm that Google uses to estimate how many web pages have the search term. Google does not scan its database to count these (if it did, you’d have time for a cup of herbal tea before getting the results from a query). Nor does it store this information with the search term (otherwise why use an algorithm?). Google needs this computed number of raw hits, it says, in order to refine its search activity. But it also shows this number to the user, which has led to endless confusion, since users tend to think that the number has some absolute meaning. At best the number of raw hits has relative meaning (i.e., we can compare the raw hits total for different words to find out which is found on more web pages). But even this relative meaning is suspect, since we don’t know the algorithm.
Google’s index servers supply to its document servers a list of pages containing the search term(s) and where these pages are in Google’s database. Google’s document servers take the top thousand of these pages (as determined by their PageRank), look these pages up in the database, grab a short snippet of context for each, eliminate perceived duplicates, and ladle them out to Google’s html servers in bunches of ten. If you try to walk through the snippets supplied by the document servers you will never see the full thousand – the duplicate removal process reduces the number to some fraction of a thousand.
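That pipeline can be sketched in a few stand-in functions. All the names and the duplicate test here are mine; Google’s real machinery is of course not public. The point is just the order of operations: rank-truncate first, deduplicate second, paginate last.

```python
def top_thousand(ranked_urls):
    """Index servers hand over hits ordered by rank; keep only the top 1000."""
    return ranked_urls[:1000]

def drop_perceived_duplicates(urls, snippet_of):
    """Document servers drop pages whose context snippet looks identical."""
    seen, kept = set(), []
    for url in urls:
        snippet = snippet_of(url)
        if snippet not in seen:
            seen.add(snippet)
            kept.append(url)
    return kept

def batches_of_ten(urls):
    """The html servers receive the survivors ten at a time."""
    return [urls[i:i + 10] for i in range(0, len(urls), 10)]

# Toy run: 1500 ranked hits, where every three consecutive pages share
# a snippet (a crude stand-in for near-duplicate content).
ranked = [f"page{i}" for i in range(1500)]
unique = drop_perceived_duplicates(top_thousand(ranked), lambda u: int(u[4:]) // 3)
print(len(unique))  # 334: well under the full thousand, as the thread observes
```

Walking the result pages of a real search never reaches slot 1000 for the same reason: the truncation happened before the dedup, so the dedup can only shrink the visible set.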
Now we come to what I think is your question. I hope. I’m notorious for addressing questions people don’t have. You wondered why search terms that have a low frequency in the database seem to have more duplicates than high frequency terms. Google does not tell us why this is so, at least not in any place that I can find. But I can make a good guess why this is so. That guess is in the last paragraph of my previous post. I speculate there that ranking pages filters out duplicates. To see the real effect of Google’s duplicate page elimination algorithm we have to look at searches for terms that occur on a thousand or fewer pages. The page ranking process on these limited-result searches does not affect the number of duplicates in the thousand pages that Google is willing to show you.
Kem—You didn’t need to rehearse all this; I understood it the first time. And I bookmarked your “hoof in mouth” post the second I first saw it because I was so surprised by it.
But though you make all of this sound quite official, you didn’t answer my question: Did you get the idea that the ugh number is really a proportion of unique hits per thousand from Google or some other source, or is it your own idea? If the former, could you provide a link? Thanks!
(Edit: To avoid further misunderstanding, let me clarify one thing here. In your calculations in 2, you’re using the ugh # as a measure of unique hits per 1000 raw hits. And that’s the ratio I’m concerned about. For numbers under about 600-700 unique hits, I have no problem with the ugh # as a measure of unique hits out of a possible total of 1000 unique hits.)
Last edited by patschwieterman (2008-10-31 14:45:46)
The reference to the thousand is in the link on that earlier post, the link to Google’s Search Protocol Reference. In the section on filtering (http://code.google.com/apis/searchappli … ilter_auto). Here is the text:
“When the Google Search Appliance filters results, the top 1000 most relevant URLs are found before the filters are applied. A URL that is beyond the top 1000 most relevant results is not affected if you change the filter settings.”
One bit of terminology that may be confusing us (well, me) is the phrase “raw hits.” This is perhaps a meaningful term when it designates the computed number that Google puts on the top right of its search page (I say “perhaps” because the number isn’t really “raw” if it is the result of an algorithm). But when we use the term “raw hits” for the 1000 unfiltered results that the document servers are willing to release to the html servers, the term “raw” can be misleading. The thousand URLs are only raw in the cases where fewer than a thousand web pages contain the search term. Applying the PageRank values to find the thousand, as Google does for terms that are found on more than a thousand pages, effectively prefilters the results that the document servers later filter for duplications.
Last edited by kem (2008-11-02 01:04:48)
Okay, it looks like we’re converging a bit. I don’t think what you’ve quoted justifies using the ugh # as a proportion for calculating unique hits based on raw hits as you’ve done, but it does suggest why olive and orange and lemon all have impossibly high unique hit percentages (impossible, that is, if the ugh # is truly a proportion on average of non-duplicate results per 1000 unfiltered) in the 900s. The key word, as I think you’re also implying, is “relevant.” The 1000 most “relevant” hits in Google’s eyes for queries where millions of URLs are potentially relevant may already have a far higher likelihood of being unique than the 1000 most “relevant” hits of an estimated, say, 5000 total. (I tried to find their definition of “relevance,” but couldn’t, though I think your assumption about page rank has to be a big part of it.) But aren’t page rank values being applied all the time? That seems to be a critical question here. If my guess is right, then I don’t see why the 243 hits in your #2 above can’t be the total of all unique hits.
A quick note on something I don’t have time to explore further at the moment. Like the people at LL and elsewhere (I got my approach from a series of posts they made in late 2004 and 2005—generally by Zimmer and Lieberman), I’ve always seen the ugh as the total of unique hits for queries returning relatively small results, and since the unique hit rate seems to me to be about 10-20% most of the time, that has always seemed to make good sense for things in the usual eggcorn ballpark. (I’m aware of the potential logical circularity there, but anyhow….) But sometime in 2006 I noticed that just where the ughits number no longer made as much sense (around, oh say, 8000-10000 rgh for many searches), the form of the results changed. As the numbers get higher, the message you get after doing “&start=950” changes. Say you do a search that returns 4000 rgh. Then you filter with &start=950, and you get 393 ugh—you’re shown “393 of 393” hits. But if you do that with a search that results in say 100,000 rgh, you’re instead shown, say, “733 of 100,000” hits, rather than “733 of 733.” What accounts for that change? I’ve never seen an answer.
Last edited by patschwieterman (2008-11-02 18:40:57)
Is anyone still reading this thread but you and me, Pat? We may have crossed the border into Arcania when we weren’t looking.
I don’t think we disagree about the use of ughits/1000 to derive a ratio that we can apply to the computed raw hits. The comparison I did for the “heart-” and “gut-” alternatives in the second post wasn’t meant as a general rule. I think it works in this comparison because the overall numbers for “heart-retching” and “gut-retching” and for “heart-wrenching” and “gut-wrenching” are fairly comparable. But our inability to know the details behind the raw hits computation and our ignorance of the extent of prefiltering during PageRank selections turns even our best comparisons into expressions of taste and preference. Frustrating.
But aren’t page rank values being applied all the time?
They are. But when Google finds a thousand or fewer pages with a search term, the only effect of applying PageRank, so far as I can see, is to reorder the pages so that the ones with the highest PageRank are displayed first. Ranking the pages doesn’t hide any pages from you. But when the search finds more than a thousand pages, picking the pages to show you by their PageRank makes the ones that fall out of the top thousand invisible to you. And none of those clever tricks on the Google Search Protocol pages shows you how to make them visible.
Then you filter with &start=950, and you get 393 ugh—you’re shown “393 of 393” hits. But if you do that with a search that results in say 100,000 rgh, you’re instead shown, say, “733 of 100,000” hits, rather than “733 of 733.” What accounts for that change? I’ve never seen an answer.
I guess I don’t really know what you mean. Suppose I type into the location bar on my browser
At the top it says “Results 911 – 920 of about 86,700,000.” Now I type in
At the top it says “Results 551 – 558 of about 3,580.” Finally, I type in
and the top line says “Results 11 – 20 of 20.”
In every case the pattern is “Results X – Y of Z.” Y is the number of ughits in the thousand (or fewer) selected for display. Z is the (usually computed) number of raw hits. For search terms found on fewer than 1000 pages, Y=Z and Z is no longer a computed number. (Notice that the word “about” is dropped in this case.)
Last edited by kem (2008-11-03 02:25:52)