Web scraping for dummies
August 6, 2008 3:06 PM
How does web scraping work with PHP/MySQL, and what are the best practices?
I'm curious about how price comparison services handle web scraping, i.e. finding information in unstructured HTML pages across many different sites and presenting it on their own sites. Ultimately, I would like to learn enough about web scraping to build a functional site that, for example, displays a list of dishes linked to various recipe sites.
Stuff that I wonder about:
1. In general terms, how would you code the project using PHP/MySQL? Are there any code libraries that can be used for scraping?
2. I understand that you can extract data from the scraped HTML files with regular expressions, but aren't there more intelligent ways of extracting the data? I'm thinking of XSLT and the like.
3. How do you handle form-generated pages, for example recipe sites that let you search for recipes using check boxes, pull-down menus, etc.? Again, are there any smart code libraries out there that simplify this?
4. Are there any best practices for managing scraping, storage, data manipulation, performance, ethics, etc. that I should be aware of?
posted by Foci for Analysis to computers & internet (15 comments total)
posted by DarkForest at 3:29 PM on August 6, 2008