application for scraping info from a web page for a database
February 7, 2005 1:22 PM
Suppose you wanted to scrape information from a web page and dump it into a CSV file for use in a database - specifically a large table of contact information coded in HTML. Is there an application (I work in OS X, but have access to Windows machines) that will do this? Would it be possible to write up an AppleScript or something similar to do the job?
I doubt an application already exists for this, as the data analyzed and output format needs would be different for each person. This sort of thing is quite easily handled with a perl script. After all, PERL is the Practical Extraction and Report Language. It's already installed on your Mac OS X box.
Can you be more specific about how the HTML is laid out and what parts of it you want to extract into CSV? An link to an HTML page would be most helpful.
Depending on your time window, someone might be able to whip you up a quick script to do this job.
posted by pmbuko at 1:30 PM on February 7, 2005
Can you be more specific about how the HTML is laid out and what parts of it you want to extract into CSV? An link to an HTML page would be most helpful.
Depending on your time window, someone might be able to whip you up a quick script to do this job.
posted by pmbuko at 1:30 PM on February 7, 2005
Right now I'm thinking of it as a one time job. Full fledged automation isn't necessary, but it'd be nice if it could be easily done again in 6 months or so.
I've just realized that spammers probably have tools that do exactly this. I'm not trying to harvest e-mail addresses for malicious purposes, but the same principle applies.
posted by aladfar at 1:31 PM on February 7, 2005
I've just realized that spammers probably have tools that do exactly this. I'm not trying to harvest e-mail addresses for malicious purposes, but the same principle applies.
posted by aladfar at 1:31 PM on February 7, 2005
This could probably be done in a few lines of Perl using HTML::TableExtract and Text::CSV.
(n.b.: Perl is not "PERL" and is not an acronym.)
posted by Zed_Lopez at 1:42 PM on February 7, 2005
(n.b.: Perl is not "PERL" and is not an acronym.)
posted by Zed_Lopez at 1:42 PM on February 7, 2005
It's called Excel. Specifically, Excel Web Query. This is exactly the purpose that tool was written for.
In Excel 2003, go to Data -> Import External Data -> New Web Query. You specify the web page address, which is shown in Excel's own window, and specify which tables you want to copy data out of. Excel does the rest of the work, and you don't need to figure out the exact data layout, the way you would for a perl or python script. Nifty stuff.
Not sure if this functionality is in latest version of Excel for mac, though.
posted by blindcarboncopy at 1:44 PM on February 7, 2005
In Excel 2003, go to Data -> Import External Data -> New Web Query. You specify the web page address, which is shown in Excel's own window, and specify which tables you want to copy data out of. Excel does the rest of the work, and you don't need to figure out the exact data layout, the way you would for a perl or python script. Nifty stuff.
Not sure if this functionality is in latest version of Excel for mac, though.
posted by blindcarboncopy at 1:44 PM on February 7, 2005
Not sure if this functionality is in latest version of Excel for mac, though.
Nope.
posted by AlexReynolds at 2:01 PM on February 7, 2005
Nope.
posted by AlexReynolds at 2:01 PM on February 7, 2005
If you load up the page in IE, highlight the table and copy it, then open a new worksheet in Excel and paste it the results should be one excel cell for every cell in the HTML table. If that doesn't work for your versions and you get all the text on a row in a single cell try the Data -> Text to Columns functionality.
Or, yeah, get someone to write a script for you!
posted by babar at 2:17 PM on February 7, 2005
Or, yeah, get someone to write a script for you!
posted by babar at 2:17 PM on February 7, 2005
It's also possible that you could massage it with GREP. Not quite as fancy as Perl, but a little more accessible (it's built into BBEdit, if you have that, and of course is in the shell). It might take a few patterns to completely strip it down.
posted by adamrice at 2:23 PM on February 7, 2005
posted by adamrice at 2:23 PM on February 7, 2005
I use Python's Beautiful Soup, which is meant to easily process poorly-designed HTML.
posted by five fresh fish at 4:25 PM on February 7, 2005
posted by five fresh fish at 4:25 PM on February 7, 2005
Whenever I've written screen-scraping scripts (say that 10 times fast!) the HTML always changes on and then I am left to debug again. :( Caveat scriptor.
posted by grouse at 4:33 PM on February 7, 2005
posted by grouse at 4:33 PM on February 7, 2005
blindcarboncopy: Thanks for the Excel tip! Perfect for something I need to do right now and the feature was under my nose (well actually under the "data" menu but still)
posted by donovan at 4:36 PM on February 7, 2005
posted by donovan at 4:36 PM on February 7, 2005
in theory you could use xsl (you might pre-process the html with tidy to guarantee valid xhtml).
not that i would, probably.
well, maybe i would, but only because i'm stupid like that.
posted by andrew cooke at 4:43 PM on February 7, 2005
not that i would, probably.
well, maybe i would, but only because i'm stupid like that.
posted by andrew cooke at 4:43 PM on February 7, 2005
Sometimes this sort of thing is easiest if you FIRST copy the HTML table into Word, then go from Word to Excel. I did that recently with a tide table where the date occured only once for each of the 4 tidal peaks. Word would let me copy collums without taking entire rows, IE would not.
Hmm, this Perl thing sounds useful. Must investigate further.
posted by Goofyy at 10:31 PM on February 7, 2005
Hmm, this Perl thing sounds useful. Must investigate further.
posted by Goofyy at 10:31 PM on February 7, 2005
Thanks for the Beautiful Soup tip for Python, fff. I scrape baseball stats from errr...umm...various sites, which I swear are puposely malformed, for my fantasy baseball league manager app and have been stuck doing regex or tidy+XSLT depending on which way the wind blows. This could be gravy.
posted by Fezboy! at 8:12 AM on February 8, 2005
posted by Fezboy! at 8:12 AM on February 8, 2005
It was fairly gravyish for scraping new MLS listings, too. Not a good idea for commercial purposes, though: they have sued and won against scrapers.
posted by five fresh fish at 10:22 AM on February 8, 2005
posted by five fresh fish at 10:22 AM on February 8, 2005
Aye, s'not commercial yet and if/when it does go that route, I'll probably be sharewaring it to cover my costs to subscribe to a service.
posted by Fezboy! at 1:04 PM on February 8, 2005
posted by Fezboy! at 1:04 PM on February 8, 2005
Sorry, my sloppy writing is malinformative: I mean that the MLS.COM guys do not allow scraping. I believe it was a Century 21 realtor (or office) that got rapped knuckles over scraping listings for his/their own website.
I'm sure the BeautifulSoup guy isn't a lawyer. And that his software, like almost all pythonic software, is a "free to do whatever you want" offering.
Er, the baseball-stats-site-which-shall-not-be-named might take an attitude similar to that of MLS.COM. Be careful there.
posted by five fresh fish at 3:26 PM on February 8, 2005
I'm sure the BeautifulSoup guy isn't a lawyer. And that his software, like almost all pythonic software, is a "free to do whatever you want" offering.
Er, the baseball-stats-site-which-shall-not-be-named might take an attitude similar to that of MLS.COM. Be careful there.
posted by five fresh fish at 3:26 PM on February 8, 2005
This thread is closed to new comments.
posted by alkupe at 1:25 PM on February 7, 2005