Web scraping for dummies
August 6, 2008 3:06 PM

How does web scraping work with PHP/mySQL? What best practices are there?

I'm curious about how price comparison services perform and manage web scraping, i.e. finding information in unstructured HTML files across many different sites and presenting it on their own sites. Ultimately, I would like to learn enough about web scraping to create a functional site that, for example, displays a list of dishes linked to various recipe sites.

Stuff that I wonder about:
1. In general terms, how would you code the project using PHP/mySQL? Any code libraries that can be used for scraping?

2. I understand that you can regexp data from the scraped html files, but aren't there more intelligent ways of extracting the data? I'm thinking about XSLT and such.

3. How do you handle form-generated pages? For example, recipe sites that let you search for recipes using check boxes, pull-down menus, etc.? Again, are there any smart code libraries out there that simplify this?

4. Are there any best practices regarding managing scraping, storage, data manipulation, performance, ethics, etc, that I should be aware of?
posted by Foci for Analysis to Computers & Internet (15 answers total) 9 users marked this as a favorite
 
IANA web junkie, but I did a couple of simple web scraping projects. I wrote code specifically for each site's page format to extract the data I wanted. I have the impression that web scraping projects are often done this way, unless the site you're interested in has a web service to get the info you want. I don't think XSLT will be useful unless the pages you scrape use XHTML.
posted by DarkForest at 3:29 PM on August 6, 2008


Regex is the best way.

If you need to work around forms and such, use CURL.
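
For example, submitting a hypothetical recipe search form via POST might look like this (the URL and field names are invented; check the target form's HTML for the real action and input names):

<?php
// Hypothetical recipe search form -- inspect the real form's
// action URL and input names before using anything like this.
$ch = curl_init('http://www.example-recipes.com/search.php');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the result as a string
curl_setopt($ch, CURLOPT_POST, true);            // submit as an HTTP POST
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
    'dish'       => 'lasagne',
    'vegetarian' => '1',       // value of a checked checkbox
    'cuisine'    => 'italian', // value of a selected pull-down option
)));
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
curl_close($ch);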
posted by Perpetual Seeker at 3:32 PM on August 6, 2008


Not all price matching sites scrape. Some have agreements with the sites they list to get a structured data feed, and most large ecommerce sites have ways to generate one. From a copyright perspective you may need an agreement to use a site's data anyway, and once you have that relationship it's easy to ask for a structured data feed as well.

You will need to obey robots.txt if you want to scrape or crawl sites, and even if you do, sites may still block you by IP or user agent if you're using up too much bandwidth.

As for XSLT, you can't be guaranteed of getting valid XHTML, or even anything close to valid HTML, so you'll need to be ready with a lot of error handling.

For stuff behind forms, this is a hard problem. Most spiders only crawl content that can be obtained via HTTP GET. Crawling things via POST is risky business, and you had better know what you're doing, as POST requests can have side effects. You don't want to accidentally create or delete recipes instead of looking them up.
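
If a search form submits via GET (check its method attribute), you can often just build the query string yourself, which is the safer kind of request to automate. A rough sketch with an invented URL and parameter names, plus a user agent that identifies your bot:

<?php
// Hypothetical GET-based search -- the URL and parameter names are made up.
$query = http_build_query(array('q' => 'lasagne', 'page' => 1));
$ch = curl_init('http://www.example-recipes.com/search?' . $query);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Identify your bot so site owners can contact (or whitelist) you.
curl_setopt($ch, CURLOPT_USERAGENT, 'MyRecipeScraper/0.1 (+http://example.com/bot.html)');
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
curl_close($ch);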
posted by GuyZero at 3:35 PM on August 6, 2008


2. I understand that you can regexp data from the scraped html files, but aren't there more intelligent ways of extracting the data? I'm thinking about XSLT and such.

often people parse the html rather than use regular expressions. i have not done this in php but if you are open to working with other languages i have had good luck with both beautifulsoup and the html agility pack.
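
in php the closest built-in equivalent i know of is the DOM extension, which uses libxml's forgiving html parser. a rough sketch (the xpath query is only an example and will need adjusting for the pages you scrape):

<?php
// $html is the raw (possibly malformed) HTML fetched earlier.
libxml_use_internal_errors(true);   // don't spew warnings about bad markup
$doc = new DOMDocument();
$doc->loadHTML($html);              // libxml repairs the tag soup as it parses
$xpath = new DOMXPath($doc);

// example query: every link inside elements with class="recipe"
foreach ($xpath->query('//div[@class="recipe"]//a') as $link) {
    echo $link->getAttribute('href') . ': ' . trim($link->textContent) . "\n";
}
libxml_clear_errors();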
posted by phil at 3:38 PM on August 6, 2008


The PHP curl libraries are great (http://us3.php.net/curl).

You'll want to use regular expressions (preg_match_all, preg_match) to scrape the data. I find it easier than trying to use an HTML parser.
posted by wongcorgi at 3:43 PM on August 6, 2008


Best answer: Avoid regexes like the plague... they're the most primitive and brittle way of handling HTML. They appear simple, small and elegant, but then you try to do something simple like pulling out a table from a string that looks like "<table><tr><td> <table><tr><td> </td></tr></table> </td></tr></table>" and you end up with a malformed mess of "<table><tr><td> <table><tr><td> </td></tr></table>". And when you've dealt with that edge case, HTML will throw you another dozen scenarios, like <!-- </table> >, that aren't part of the document tree, and your regex will end up as an unmaintainable, unreadable mess. Only ever use regexes when you control the whole system.

Instead, for parsing HTML always use an HTML parser. Structure your code like this:

Malformed HTML String → beautifulsoup / html agility (as suggested by Phil).

OR

Malformed HTML String → HTML Tidy to XHTML → XML selection.

where the XML selection could be SimpleXML, PHP DOM, XSLT, XPath/E4X, etc...
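
In PHP that second pipeline might look roughly like this, assuming the tidy extension is available (note that Tidy's XHTML output is namespaced, so the XPath queries need a prefix; the queries themselves are only examples):

<?php
// Malformed HTML string -> HTML Tidy -> XHTML -> XPath selection.
$xhtml = tidy_repair_string($html, array(
    'output-xhtml'     => true,
    'numeric-entities' => true,
), 'utf8');
if ($xhtml === false) {
    die('Tidy could not repair this page');
}

$doc = new DOMDocument();
$doc->loadXML($xhtml);

$xpath = new DOMXPath($doc);
$xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml');

// Pull out, say, every table row in the document.
foreach ($xpath->query('//x:table//x:tr') as $row) {
    echo trim($row->textContent) . "\n";
}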

Trust me on this one. I'm a weathered man at the ripe old age of 28 who has seen more HTML than you've had cooked breakfasts. Don't use regexes unless you want grief.
posted by holloway at 4:05 PM on August 6, 2008 [5 favorites]


The worst part about web scraping is that your site can break at any time, whenever a source site changes its markup. That tends to be too much of a commitment for someone doing it as a hobby, so such sites are perpetually half broken.
posted by smackfu at 4:34 PM on August 6, 2008


Are there any best practices regarding managing scraping, storage, data manipulation, performance, ethics, etc, that I should be aware of?

One big thing would be to pay attention to the cache/expiration capabilities of HTTP so that you're not fetching redundant pages again and again. Read up on the "If-Modified-Since" and "If-None-Match" HTTP request headers, which will allow you to say to the server you're accessing, "Has this page changed since the last time I fetched it?" If it hasn't, you'll get an HTTP 304, "Not Modified", and you can skip getting the page again.

That plus avoiding hitting a server for lots of requests in a short amount of time is the main part of good robot/scraper behavior, I think.
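
With PHP's cURL bindings that might look something like this, assuming you stored the Last-Modified and ETag headers from your previous fetch of the page:

<?php
// $lastModified and $etag come from the previous response's
// Last-Modified and ETag headers (stored in your database, say).
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    'If-Modified-Since: ' . $lastModified,
    'If-None-Match: ' . $etag,
));
$body = curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($status == 304) {
    // Not modified -- reuse the copy you already have.
} else {
    // The page changed (or the server ignores conditional requests): re-scrape $body.
}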
posted by letourneau at 4:36 PM on August 6, 2008


Best answer: i think holloway offers a valid point of view. but if you just want a little nugget of info, php, curl and a regexp will get the job done.


// example: get number of hits for a search term on google
$url = 'http://www.google.ca/search?q=metafilter';
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_HEADER, 0);          // body only -- no need for response headers
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);  // return the page as a string instead of printing it
curl_setopt($curl, CURLOPT_TIMEOUT, 10);
$html = curl_exec($curl);
curl_close($curl);

// The results page contains a line like:
// Results 1 - 10 of about 3,530,000 for xxx.
if ($html !== false && preg_match('#Results 1 - \d+ of about (.+?) for#', $html, $matches))
{
    print 'number of results: ' . $matches[1];
}


i'm sure cleaning up the HTML and making it into a valid XML structure could be more elegant in some cases, but i've found the code for parsing and making sense of XML to be messier than regexps. In both cases, the original problem remains: if the sources change their HTML format, your scraper breaks and you need to spend time fixing it.
posted by kamelhoecker at 4:38 PM on August 6, 2008


Seconding beautifulsoup - it's python rather than PHP so it might not be what you're looking for, but it really does make this kind of thing easy.
posted by xchmp at 4:44 PM on August 6, 2008


Response by poster: Everyone, thank you for your thoughts and suggestions.

I like holloway's approach because it seems more structured and robust. Also, I try to avoid writing regexps if possible because the damn things get Cthulhuian pretty fast.

kamelhoecker, thanks for the example, very illustrative.


I realize that the scraping can and will break often, so error detection seems pretty important. I think that detecting changes in the pages' structures and sending myself an email will do. Any ideas?
posted by Foci for Analysis at 5:00 PM on August 6, 2008


Best answer: I realize that the scraping can and will break often, so error detection seems pretty important. I think that detecting changes in the pages' structures and sending myself an email will do. Any ideas?

Well, if you're going to be automating your web scraping, you could write your scripts to generate no direct output (i.e. just write to files, not the screen) when they find what they're looking for, and have them complain to the screen when they get stumped. Then, if you're running them via cron, you'll get e-mails from the cron daemon whenever your scripts write anything to stdout/stderr, i.e. whenever they fail to find what they're looking for.

A liberal sprinkling of assert to make sure the structure of your scraped page matches what you're expecting could help you do this.
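
A skeletal sketch of that pattern (the XPath query and the $url variable are placeholders):

<?php
// Scrape quietly on success; complain to stderr so cron mails you on failure.
$rows = $xpath->query('//div[@class="recipe"]');  // placeholder query

if ($rows->length === 0) {
    fwrite(STDERR, "Page structure changed: no recipes found at $url\n");
    exit(1);  // a non-zero exit code also makes the failure easy to spot
}

foreach ($rows as $row) {
    // ...write the extracted data to your database or a file, not to the screen.
}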
posted by letourneau at 5:14 PM on August 6, 2008


Yes, as letourneau says, make assertions about the document just as you would in any programming language that takes user input. If something should be a number, then assert that. The extent to which you should assert things varies from document to document (judging how brittle to make your code will take some thought). If you're snapping off small parts of the document, you can assert nodes as integers, strings, etc. If you want to assert large, complex documents, then it's really about validating XML data structures, so use the tools for that: RELAX NG.

As well as validation, be sure to sanitise the data carefully, preferably with a whitelist. For example, if you're selecting a <span> node that later changes to <span onclick="exploit goes here">, then obviously your code would be more robust if it didn't copy any attributes or elements it wasn't expecting... a whitelist-style XSLT is a good way of doing this.
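
A tiny example of that "take only what you expect" mindset, with invented node names and patterns: pull out just the text, assert its format, and never copy attributes or child markup wholesale.

<?php
// Grab only the price text -- not the node's attributes or children.
$node = $xpath->query('//span[@class="price"]')->item(0);
if ($node === null) {
    throw new Exception('Expected a price <span> and found none');
}

$price = trim($node->textContent);

// Assert the format before storing it; anything unexpected is rejected.
if (!preg_match('/^\$\d+(\.\d{2})?$/', $price)) {
    throw new Exception("Unexpected price format: $price");
}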

(heh... brittle seems to be my word of the day)
posted by holloway at 5:40 PM on August 6, 2008 [1 favorite]


Here are a few approaches using Ruby and Perl.
posted by PueExMachina at 10:12 PM on August 6, 2008


I try to avoid writing regexps if possible because the damn things get Cthulhuian pretty fast

I don't have enough HTML parsing experience to speak to the issue of using regexps for that, but I do regularly use them for all kinds of other things, and use this piecewise technique to limit their tendency toward being crazily write-only.
posted by flabdablet at 4:40 PM on August 7, 2008

