

[Online Search Filter] A more prescient Google?
November 12, 2007 12:45 PM

Is there a search engine with which I can search among the source code of websites?

Perhaps I should explain the use case, since a coworker I asked replied: "G-O-O-G-L-E dot com."

When I right-click and choose View Page Source, I'd like to be able to query the material I see, across the entire Internet (or at least the entire public Internet). Viewing the source for the page on which I'm typing this question, I see snippets like:

font-size:10pt;

and

<li><a href="http://projects.metafilter.com/">Projects</a></li>

Yet googling either of them results in this and this - I believe neither search would lead me to this page. This is consistent with other experiments I've tried.

I know that Google has a number of search operators that are relevant here, such as link: and site:, but I'd still like the ability to search with plain text among the source code of websites.

Thoughts?

(As always, thanks to all of you who make up this community, this resource.)
posted by gbinal to Technology (5 answers total)
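
For a handful of known pages, the kind of query described above can be approximated locally. This is only a minimal sketch, not an Internet-wide search engine: `fetch_source` and `search_sources` are hypothetical helper names, the sample markup is hand-written for illustration, and matching is a plain literal substring test against the raw source.

```python
from urllib.request import urlopen


def fetch_source(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its raw markup, roughly what View Page Source shows."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def search_sources(pages: dict[str, str], needle: str) -> list[str]:
    """Return the URLs whose raw source contains `needle` as a literal substring."""
    return [url for url, source in pages.items() if needle in source]


# Hypothetical, hand-written sources standing in for fetched pages.
pages = {
    "http://ask.metafilter.com/": '<li><a href="http://projects.metafilter.com/">Projects</a></li>',
    "http://example.com/": "<p>Nothing to see here.</p>",
}
matches = search_sources(pages, 'href="http://projects.metafilter.com/"')
```

To run it against live pages, build the dictionary with something like `{url: fetch_source(url) for url in urls}`. It only covers pages you already know about, which is exactly the limitation the question is trying to get around.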
 
Google, yes. But have you tried this?
posted by Soup at 12:54 PM on November 12, 2007 [1 favorite]


AFAIK, Google and other search engines do index all the source code. There's just no reason Metafilter would be ranked particularly high for "font-size" compared to sites/pages about web development that repeat 10pt a few times and are possibly linked to from other sites discussing 10pt fonts. As an example, this thread may wind up being a fairly high result for "10pt" in a few days.

Because no site is trying to optimize for their code (as opposed to their content), any sites that do come up in those queries will either be web-dev related or simply luck of the draw. That said, koders.com and Google Code Search might be what you want.
posted by yerfatma at 2:09 PM on November 12, 2007


Google probably does not index the source code; they're more likely to get meaningful results if they only search against the displayed content, and filter the source out. (As a quick test: metafilter's source code includes some relatively unusual class names, like "flagdrop" or "mefimessages" which should show up in the google results if the source code were indexed, but don't.)

In any case, I doubt anyone would bother to build a search engine that does index source code directly, because all websites use basically the same tags (they're all written in the same language, after all)... a list of all the sites that use 10 point fonts, or that contain list items, would be pretty meaningless. And if you're searching for information about particular tag usage, there's plenty of content out there describing those tags which would be more relevant than raw source code searches anyway.

(However -- from your examples -- if you wanted to, say, search for all the sites which link to projects.metafilter.com, you can get that from google's advanced search.)
posted by ook at 2:27 PM on November 12, 2007
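
The distinction between a page's source and its displayed content can be illustrated with Python's standard `html.parser`. This is a rough sketch of the idea only and makes no claim about how Google actually processes pages: the `VisibleText` class and the sample markup are made up for illustration, and only the class name "flagdrop" comes from the thread.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes only, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(source: str) -> str:
    """Return the text a reader would see, with whitespace collapsed."""
    parser = VisibleText()
    parser.feed(source)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


# Hypothetical fragment; the markup around "flagdrop" is invented.
source = '<div class="flagdrop"><a href="/flag">Flag this post</a></div>'
text = visible_text(source)
in_source = "flagdrop" in source  # the class name is present in the markup
in_text = "flagdrop" in text      # but absent from the displayed text
```

An index built from `text` alone would never match "flagdrop", which is consistent with the quick test described above.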


Piggyback:

Odd, I was trying to do this today as well, because I couldn't get google to not drop the @ character.

I'm trying to find email addresses for presidential candidates. Most have info@mycampaign.com, but some don't. Searching for @mycampaign.com doesn't work, the + doesn't help, and mailto:***@mycampaign.com fails as well (which is why an answer to the OP would help me).

Any thoughts?
posted by zazerr at 3:00 PM on November 12, 2007
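
If the pages in question are already known, the mailto: lookup can be done directly against their source instead of through a search engine. A minimal sketch using the standard `html.parser` follows; `MailtoCollector` and `find_mailto` are hypothetical names, and the sample markup (including the `elsewhere.example` address) is invented for illustration.

```python
from html.parser import HTMLParser


class MailtoCollector(HTMLParser):
    """Collect addresses from <a href="mailto:..."> links, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.addresses: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().startswith("mailto:"):
                # Drop the scheme and any ?subject=... query part.
                address = value[len("mailto:"):].split("?", 1)[0]
                if address and address not in self.addresses:
                    self.addresses.append(address)


def find_mailto(source: str, domain: str) -> list[str]:
    """Return mailto: addresses in `source` that end with @domain."""
    collector = MailtoCollector()
    collector.feed(source)
    collector.close()
    suffix = "@" + domain.lower()
    return [a for a in collector.addresses if a.lower().endswith(suffix)]


# Hypothetical page fragment for illustration.
source = (
    '<p>Contact: <a href="mailto:info@mycampaign.com?subject=Hello">email us</a> '
    'or <a href="mailto:press@elsewhere.example">press</a></p>'
)
found = find_mailto(source, "mycampaign.com")
```

This only finds addresses that appear in mailto: links; addresses written as plain text or obfuscated with scripts would need a different approach.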


Hi folks. Thank you for replying so far.

I will look at the non-Google sites, but I feel fairly confident that I've long since delved through all the public Google functionality (I have read guides specifically on it, follow a few Google blogs, etc.).

I agree that my font-size example wasn't helpful. But some page source is fairly unique or, when combined with a second search filter, would be worth searching.

While Google certainly gobbles up all the page source and even makes some of it useful, most of it is not part of the text you can search.

Both Koders (new to me) and Google Code Search are fine products, but they don't cover the page source that currently makes up sites; they cover open source code that someone has published, which doesn't necessarily make up any live page right now.

Any help?
posted by gbinal at 4:31 PM on November 12, 2007

