"Scraping" a website for a particular email address?
May 12, 2015 3:07 PM

This is for work, honest. We've had instructions to delete a generic email address as all enquiries sent to it (about one every three months) are going to be redirected elsewhere.

I've been given the task of "going through the website" to find instances of this particular email address. I've been asked to do this because I "know about computers" and therefore apparently know how to do this.

I have sent multiple requests to our web services team only to be told "we don't keep a record of what appears on what page" which I know is a bullshit response but frankly I can't be bothered arguing it with them. I don't want a record, I just want them to do some kind of backend search shit and tell me where the email address appears.

And so here I am. Is there a tool available (preferably free and web-based, or freeware/open source) that will scan through a website (e.g. http://www.websitename.org) and look for a specific address (e.g. email.address@websitename.org) on all pages? Or I guess just a "global" website searching thing that will let me search for any given thing?

I don't want to harvest all email addresses on the website. I already know what the search string is. I'm just stupid and don't know how to do it.
posted by turbid dahlia to Computers & Internet (14 answers total)
Response by poster: Oh god I bet there's a way to do this with Google I just realised. Hnng.
posted by turbid dahlia at 3:08 PM on May 12, 2015 [2 favorites]

Search for this on Google:

site:www.websitename.org "email.address@websitename.org"

Use quotes around the email address.

This should work.
posted by ethidda at 3:15 PM on May 12, 2015 [4 favorites]

email.address@websitename.org site:websitename.org

However, this will only get the pages Google can see. If there are pages set up specifically not to show up in search engines, or pages you have to log in to see, it won't find those.
posted by yohko at 3:17 PM on May 12, 2015 [1 favorite]

Response by poster: Thanks ethidda. I was trying that but even with the quotes it's just giving scattershot results. Some with the "email.address" bit (whether with a full stop or not) and others with the "websitename.org" bit (many more of those, obviously). Nothing that's the full phrase. Which I guess could very well mean that the full phrase doesn't actually appear anywhere?
posted by turbid dahlia at 3:18 PM on May 12, 2015 [1 favorite]

Could you have hidden mailto links? Like text saying "Click here to let us know what you need" or "Email us to ask for a quote"?
posted by jaguar at 3:36 PM on May 12, 2015 [1 favorite]
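
A rough way to check for those hidden links, assuming the pages have already been saved to a local folder: search the raw HTML for `mailto:` instead of the visible link text. This is only a sketch. `list_mailto_targets` is a made-up helper name, and it will miss links that are assembled by JavaScript.

```shell
# list_mailto_targets DIR  (hypothetical helper, not part of any tool)
# Prints "file:mailto:address" for each mailto: link in the files under DIR.
list_mailto_targets() {
  # -r: search subfolders, -i: ignore case, -o: print only the matched text,
  # -E: extended regex. The match stops at a quote, "?", ">" or space,
  # so "?subject=..." suffixes are dropped.
  grep -rioE "mailto:[^\"'?> ]+" -- "$1"
}
```

Running `list_mailto_targets ./downloaded-site` (folder name is a placeholder) should show which file hides which address behind an "Email us" style link.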

Is the website generated by a CMS or is it just a bunch of HTML files in a directory? Because searching those files, if they exist, is the simplest approach.
posted by zachlipton at 3:39 PM on May 12, 2015 [1 favorite]

Do you happen to know if the site uses an SQL database? Because you can do full-text searches of all content within that database from the MySQL interface.
posted by zarq at 3:46 PM on May 12, 2015

Response by poster: jaguar, last I manually checked a couple of instances a while back, it was a combo of hidden "email us" type links, and the email address proper.

zachlipton, I'm afraid my knowledge of the mechanics of the website doesn't go much further than knowing that it's a website with a website address. That said, I did "Inspect Element" and under "Sources" there appear to be a bunch of assets folders under the master heading for the website.

zarq, even though I don't know what I'm looking for exactly, again under "Inspect Element - Resources" there is a bunch of stuff (Local Storage and Session Storage etc.) with stuff in it, and "WebSQL" is completely blank.

Sorry guys, I guess this is a dumb noob question. But I do know about computers, honest! Ask me about Loom!
posted by turbid dahlia at 3:53 PM on May 12, 2015

"Inspect element" in your web browser is not going to get you very far. You need somebody with access to the server. Is there an IT department / sysadmin distinct from the unhelpful "web services team"?

(This is a two-minute job for somebody who has access to the back end and knows what she's doing.)
posted by Shmuel510 at 4:00 PM on May 12, 2015 [4 favorites]

One thing you could do is 1) pull down every single web page in the site then 2) use grep on the files to search for the email address.

You could do 1 by using the `wget` command on Mac or Linux. I've used this successfully in the past:

wget -m -e robots=off http://yoursite.com

This site has a much more comprehensive command you can try instead if you like. `wget` is not on OS X by default, but you can install it by installing Homebrew, then running `brew install wget`.

Then, to do 2), you can use `grep` like so:

$ cd directory-containing-downloaded-files
# -r searches subfolders; -F treats the address as plain text, so "." is not a wildcard
$ grep -rF 'email@domain.com' .

(I'm sure you can get someone much better with grep to get you a more optimized command, but I think that'll work.) It'll print every matching line, prefixed with the name of the file that contains it.

If you happen to be using ag, then you can just search with:
ag email@domain.com

That explanation might be on the terse side. If you need elaboration or clarification on anything, just let me know!
posted by ignignokt at 4:08 PM on May 12, 2015 [6 favorites]
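
A slightly more defensive take on step 2, offered as a sketch rather than a definitive command. It assumes the site has already been downloaded to a folder, and that the address might also appear with the `@` written as `&#64;` or `%40` (a common way of hiding addresses from spammers; whether this particular site does that is a guess). `find_address` is a hypothetical helper name.

```shell
# find_address DIR ADDRESS  (hypothetical helper)
# Lists each file under DIR that contains ADDRESS, ignoring case.
find_address() {
  dir=$1
  addr=$2
  local_part=${addr%@*}   # text before the "@"
  domain=${addr#*@}       # text after the "@"
  # -r: search subfolders, -i: ignore case, -l: print file names only,
  # -F: treat every pattern as literal text, so "." matches only a dot.
  # The extra -e patterns cover the assumed "&#64;" and "%40" spellings.
  grep -rilF \
    -e "$addr" \
    -e "${local_part}&#64;${domain}" \
    -e "${local_part}%40${domain}" \
    -- "$dir"
}
```

Something like `find_address ./downloaded-site email.address@websitename.org` prints one file name per line; no output means no hits in the downloaded copy.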

None of these client-side options will help you find, for example, a contact form that submits to a server-side script that sends the info to the email address in question. These are pretty common.

So if this email address is going to start bouncing, you will need support from whoever maintains the website infrastructure.

On the other hand, if it's going to be redirected indefinitely, all you really care about are the public-side ones that a visitor can see, where the wget-style approach works.
posted by xiw at 4:52 PM on May 12, 2015 [1 favorite]
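
To get a list of forms worth asking the infrastructure people about, one option is to pull the `action` targets out of the downloaded pages. This is a sketch under the assumption that each `<form ...>` tag sits on a single line. It can't show where a server-side script sends its mail, only which scripts the forms submit to. `list_form_actions` is a hypothetical name.

```shell
# list_form_actions DIR  (hypothetical helper)
# Prints "file:<form ... action="target"" for each form tag found under DIR.
list_form_actions() {
  # -r: search subfolders, -i: ignore case, -o: print only the matched text,
  # -E: extended regex. Matches from "<form" up to the end of its action value.
  grep -rioE "<form[^>]*action=[\"'][^\"']*[\"']" -- "$1"
}
```

Each line of output names a page and the script its form posts to, which is the list to hand over when asking whether any of those scripts still mail the old address.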

Response by poster: Thanks very much everybody. A very generous soul offered their time and resources via memail and has performed the work of downloading the pages and grep'ing them for the search string - zero hits! I had already suspected as much but was very relieved to have it properly confirmed by actual scientists. Everyone here has offered excellent advice and while I won't flag any "Best Answers", rest assured that you are all winners in my eyes.

Thanks again!
posted by turbid dahlia at 5:44 PM on May 12, 2015 [5 favorites]

For future web searchers, know that I use the OS X program SiteSucker to do this when I am utterly desperate. It downloads all the files, you search them, your itch is scratched, then you have a beer (not included).
posted by Mo Nickels at 7:19 PM on May 12, 2015 [1 favorite]

Glad the problem's solved. Your "web services" "team" doesn't know how to use grep or sed? I think it's more likely they just don't want to help, and that would be the kind of thing I might want to mention to my friend the IT director as an "oh by the way" sometime.
posted by ctmf at 9:34 PM on May 12, 2015 [1 favorite]
