Why does excluding terms from Google searches increase the no. of hits?
September 11, 2016 5:09 PM

For example, Googling this:
"building works"
...produces "about 405,000 results". Googling this:
"building works" -program
...produces "about 485,000 results". Googling this:
"building works" -"program"
...produces "about 484,000 results". I realize the short answer is "Google is no longer a search company, it's an advertising company, if you're not paying you're the product, silicon valley greed micro$erf etc. etc. etc." but setting aside the snark/bitterness, what's actually going on that excluding terms increases the number of hits?
posted by Bugbread to Computers & Internet (12 answers total) 10 users marked this as a favorite
Not sure what is going on, but interestingly, I tried this and noticed that if you put a space before the word "program" you get only 294K results. e.g. "building works" - "program" or "building works" - program
posted by oxisos at 5:26 PM on September 11, 2016

And to show more variability, I got 439K, 430K, and 430K results for your three queries.
posted by mmascolino at 5:28 PM on September 11, 2016

Pretty sure putting a space between the dash and the word makes it not work the same way. -program means "exclude results that contain program" but "foo - program" means "foo and program" because it just ignores the lone dash

(not sure why excluding increases the search size here though)
posted by RustyBrooks at 5:29 PM on September 11, 2016 [2 favorites]
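
The operator behavior described in the comment above can be sketched roughly as follows. This only illustrates the commenter's description and is not Google's actual query parser: the function name `parse_query` is hypothetical, and Python's `shlex` stands in for whatever tokenizer is really used.

```python
import shlex


def parse_query(query):
    """Split a query into included and excluded terms.

    Assumption (from the comment above, not from Google documentation):
    a dash attached to a term excludes that term, while a lone dash
    surrounded by spaces is simply ignored.
    """
    include, exclude = [], []
    # shlex.split keeps a quoted phrase together as a single token
    for token in shlex.split(query):
        if token == "-":
            continue  # lone dash: dropped, so the next term is included
        if token.startswith("-"):
            exclude.append(token[1:])  # attached dash: exclude the term
        else:
            include.append(token)
    return include, exclude


attached = parse_query('"building works" -program')
spaced = parse_query('"building works" - program')
```

Under these assumptions, `attached` excludes "program" while `spaced` treats it as one more required term, which would fit the smaller count reported for the spaced version.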

I got 617k for "building works" and 438k for "building works" -program
posted by RustyBrooks at 5:30 PM on September 11, 2016 [1 favorite]

Best answer: I'd have to dig for the name of it, but they are giving a very fast statistical estimate. It's not exactly correct, but it's (usually) order-of-magnitude correct. If you clicked through all the pages of results you could find the real number.
posted by TheAdamist at 5:30 PM on September 11, 2016 [6 favorites]

Actually it'll only show you 1000 results so you can't even page to the "end" to see. And yeah, I suspect that it's giving you some estimate.
posted by RustyBrooks at 5:32 PM on September 11, 2016 [2 favorites]

Response by poster: Ah, cool. I was concerned that it wasn't actually excluding the term, hence the number of results increasing, but given y'all's answers, and looking through a few pages of results, it looks like the exclusion part of the process is actually working (that is, pages with "program" in them are getting excluded from the results), and it's just the number-of-results-reporting part of the process that is wonky. I can totally live with that.
posted by Bugbread at 5:41 PM on September 11, 2016 [2 favorites]

XKCD wrote up something about this several years back. He also links to a tool to get the actual number of results, though I have no idea if it's still functional.
posted by yuwtze at 5:43 PM on September 11, 2016 [4 favorites]

For Google to serve an accurate count of hits, it would have to produce a ranked list with hundreds of thousands of elements. I know it looks as though that's what it's doing, but it isn't the case. Their algorithm has no direct way of knowing how many hits there are.

I started using Google relatively early and as far as I know it has never reported the number of hits with even the smallest degree of accuracy. Like, it used to be that it would say there were eight pages of hits, but you could page through them and there would be four. I suspect that the algorithm they use is something like: take simple estimates of how common the terms you're searching for are; multiply the probabilities together if it's an exclusionary search (x AND y), add them if it's an inclusionary search (x OR y); multiply by the number of pages indexed. Depending on how strongly the terms are associated, this can give wildly different answers. And different Google servers apparently have different indices, so the answer it gives can depend on your location and the time of day.
posted by Joe in Australia at 5:45 PM on September 11, 2016 [8 favorites]
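
The guess in the comment above can be written out as a small sketch. It only illustrates the commenter's speculation and is not Google's actual method; the index size and per-term probabilities are made-up numbers, and the function names are hypothetical.

```python
from math import prod


def estimate_and(index_size, term_probs):
    """Estimate hits for 'x AND y' by multiplying per-term probabilities."""
    return round(index_size * prod(term_probs))


def estimate_or(index_size, term_probs):
    """Estimate hits for 'x OR y' by adding per-term probabilities.

    The sum is capped at 1.0 so the estimate never exceeds the index size.
    """
    return round(index_size * min(1.0, sum(term_probs)))


# Hypothetical figures, chosen only for illustration
INDEX_SIZE = 1_000_000  # pages in the index
P_BUILDING = 0.02       # fraction of pages containing "building"
P_WORKS = 0.05          # fraction of pages containing "works"

and_estimate = estimate_and(INDEX_SIZE, [P_BUILDING, P_WORKS])
or_estimate = estimate_or(INDEX_SIZE, [P_BUILDING, P_WORKS])
```

Multiplying treats the terms as independent. If every page containing "building" also contained "works", the true AND count in this toy index would be 20,000 pages rather than the 1,000 estimated here, which is the sort of gap the comment attributes to how strongly the terms are associated.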

Best answer: Much better answer than mine.
posted by Joe in Australia at 5:55 PM on September 11, 2016 [5 favorites]

Response by poster: Joe in Australia: "Much better answer than mine."

Oh, man, that contains the answer to the other great Google mystery that's been vexing me, "why is it that even when I put a phrase in quotes, sometimes the pages in the search results don't even contain the search term?" (the answer being, namely, that "it also includes as results pages that don't necessarily include the matching words at all, but where those words are used in hyperlinks pointing to those pages")
posted by Bugbread at 6:05 PM on September 11, 2016 [3 favorites]

From a friend that works for them:
A: "ghit rates are essentially noise."

B: "But the numbers aren't *completely* made up. Basically, the index of the entire internet does not, unsurprisingly, fit on one machine's memory or disk. When you do a search, you hit some subset of all machines, and those know what they know, not everything there is to know. So, from that incomplete information, an estimate of the total number of hits is made. That estimate is *really* low quality. Like, it might be within two orders of magnitude, but it might not.

Adding search terms, even negative ones, adds to the number of machines hit. That can improve the estimates, possibly upward."
posted by Confess, Fletch at 6:28 PM on September 11, 2016 [6 favorites]
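
A toy version of the extrapolation described in the comment above, assuming the total is estimated by scaling the average hit count of the machines actually queried up to the full number of machines. The per-machine counts and the function name `extrapolate_total` are hypothetical; the real estimator is not described in this thread.

```python
def extrapolate_total(sampled_counts, total_shards):
    """Scale the mean hit count of the sampled shards to all shards."""
    if not sampled_counts:
        raise ValueError("need at least one sampled shard")
    mean_hits = sum(sampled_counts) / len(sampled_counts)
    return round(mean_hits * total_shards)


# Hypothetical matching-page counts on each of ten index machines
SHARD_HITS = [40, 55, 38, 900, 47, 52, 61, 45, 700, 50]

# A query that touches only the first three machines
few = extrapolate_total(SHARD_HITS[:3], len(SHARD_HITS))

# A query that touches the first five machines
more = extrapolate_total(SHARD_HITS[:5], len(SHARD_HITS))

# What a full count across every machine would give
actual = sum(SHARD_HITS)
```

With these made-up numbers the three-machine estimate is 443, the five-machine estimate is 2,160, and the full count is 1,988. Touching more machines moves the estimate upward and closer to the real figure, which matches the explanation that adding terms, even negative ones, can raise the reported number.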
