Best way to convert 900 HTML table files to CSV?
August 7, 2014 4:56 PM

Does anyone know of a friendly program for batch-converting a folder full of hundreds of HTML tables to CSV format that will load in Google Spreadsheets? I know I could just load one in Excel but I need to automate it to handle tons of these files.
posted by johngoren to Computers & Internet (9 answers total) 3 users marked this as a favorite
If you have any Python experience, BeautifulSoup is a library that will help you parse HTML tables.
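A minimal sketch of that approach, assuming one simple table per file; the folder names input_html/ and output_csv/ are placeholders for your own layout, and rowspan/colspan are not handled:

```python
import csv
import glob
import os

from bs4 import BeautifulSoup  # pip install beautifulsoup4


def table_rows(html):
    """Extract the first <table> in an HTML string as a list of rows."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]


if __name__ == "__main__":
    # Placeholder folder names -- adjust to wherever your 900 files live.
    os.makedirs("output_csv", exist_ok=True)
    for path in glob.glob("input_html/*.html"):
        with open(path, encoding="utf-8") as f:
            rows = table_rows(f.read())
        out = os.path.join(
            "output_csv", os.path.splitext(os.path.basename(path))[0] + ".csv"
        )
        with open(out, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(rows)
```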
posted by aleatorictelevision at 4:59 PM on August 7, 2014 [1 favorite]

Well, if you have Ruby lying around, this script seems interesting. Looking around the Internet, there are a variety of solutions that might work, including Perl and one using lynx!

If you want a bit of an interesting challenge, you could try to invert this bash/awk script. Since it does the CSV-to-HTML conversion, you could probably reverse it pretty easily. (And most likely by tomorrow I'll be so obsessed by the thought that I'll try to do it.)
posted by Fortran at 5:15 PM on August 7, 2014 [1 favorite]

It should be possible to do this with a few simple grep patterns, I think. BBEdit and TextWrangler (both Mac programs from the same company) can both perform grep operations on multiple files simultaneously. If there are any rowspans or colspans, it'll go straight to hell though.
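A Python version of that grep-style idea, for the simple case; it is deliberately naive and will misbehave on nested tables, rowspans/colspans, or cells containing commas:

```python
import re


def naive_table_to_csv(html):
    """Grep-style extraction: find <tr>...</tr> blocks, then <td>/<th> cells
    inside each. Breaks on nested tables, rowspan/colspan, and cells that
    themselves contain commas -- only safe for simple, regular tables."""
    rows = []
    for tr in re.findall(r"<tr[^>]*>(.*?)</tr>", html, re.S | re.I):
        cells = re.findall(r"<t[dh][^>]*>(.*?)</t[dh]>", tr, re.S | re.I)
        # Strip any leftover inline tags and surrounding whitespace.
        cells = [re.sub(r"<[^>]+>", "", c).strip() for c in cells]
        rows.append(cells)
    return "\n".join(",".join(cells) for cells in rows)
```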
posted by adamrice at 7:50 PM on August 7, 2014

This is probably an hour (or maybe an afternoon if there are weird bits) for someone with some scripting skills. BeautifulSoup, as mentioned above, is a handy library to do exactly this sort of thing.

As my suggestion has already been made, I will instead ask a few followup questions that may help in your quest for an ultimate solution:

1) By "friendly program" I am assuming that you don't have the chops to write this one-off script yourself; is this correct?

2) I am pretty sure nothing like what you want is easily/cheaply available (the scriptability of this project and its one-off nature mean there will never be a market), so I'm wondering if this question might be better posed as looking for guidelines and cost estimates for having someone do it for you?

3) Do you know if your tables are satisfactorily convertible to CSV? CSV cannot contain formatting or formulas, has trouble with dates and timestamps, and is subject to other data-formatting gotchas.

4) Do your HTML files contain one table per file? Are you sure? What if they don't? Would a file with more than one table need to have each table's data concatenated into one CSV file or separated into different tabs in the final workbook? CSV doesn't do tabs/pages.

5) Is your source data all ASCII characters? Does it contain any Unicode at all? Many tools are not Unicode-aware and/or have poor Unicode support.

6) What was the original source? 900+ HTML files, if entered by hand, will probably (no, scratch that, they _will_) contain inconsistencies, typos and the like. If they were machine-generated, is there any chance you can get more suitable output from the originating source?
posted by mce at 7:53 PM on August 7, 2014

mce has a number of legitimate questions that you should consider.

Regarding a solution, I'd add that if you have tried converting a single HTML file to CSV in Excel, the simplest route might be just to use an Excel macro. If you're new to macros, this is the Office help on macros. Someone on StackOverflow asked about a macro to convert all XLSX files in a folder to XLS files. You can take that script and make the following changes:

1) Change the "Pathname = ..." line to include your desired folder location
2) Change the "Filename = ..." line - replace "*.xls" with "*.html"
3) Change the "saveFilename = Replace(" - replace ".xlsx" with ".html" and ".xls" with ".csv"
4) Change the "FileFormat:=xlExcel8" to "FileFormat:=xlCSV"

Run that and I believe that should do the trick.
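Pulling those four changes together, the modified macro would look roughly like this; the folder path is a placeholder, and the exact variable names in the StackOverflow original may differ:

```vb
Sub ConvertHtmlToCsv()
    Dim Pathname As String, Filename As String, saveFilename As String
    Dim wb As Workbook

    Pathname = "C:\tables\"              ' change 1: your folder
    Filename = Dir(Pathname & "*.html")  ' change 2: match the HTML files
    Do While Filename <> ""
        Set wb = Workbooks.Open(Pathname & Filename)
        saveFilename = Replace(Filename, ".html", ".csv")  ' change 3
        wb.SaveAs Filename:=Pathname & saveFilename, _
                  FileFormat:=xlCSV                        ' change 4
        wb.Close SaveChanges:=False
        Filename = Dir
    Loop
End Sub
```

Note that this relies on Excel's own HTML import, so how well it works depends on how cleanly Excel opens your particular files.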
posted by comradechu at 8:21 PM on August 7, 2014

You might be better off writing a little script that ignores your source data and just creates 900 Google spreadsheets, each one containing an ImportHTML function, and let Google Docs itself do the importing.

That's also an approach you can test quickly by hand on one or two of your existing table files before making a big time investment in writing scripts.
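For that hand test, something like this in cell A1 of a blank sheet would pull in one table (the URL is hypothetical; IMPORTHTML needs the files to be reachable over the web, not sitting on your disk):

```
=IMPORTHTML("http://example.com/tables/file001.html", "table", 1)
```

The second argument is "table" or "list", and the third is the 1-based index of the table within the page.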
posted by flabdablet at 9:16 PM on August 7, 2014

There's an app that can do this for Mac/Windows/Linux. It seems like your actual question is how to do it in bulk. Maybe contacting the developer to see if wildcards are supported, or if they can build that functionality into a future version, might do the trick?

FWIW, that was one of the top results of a Google search for "html tables to csv".
posted by softlord at 8:12 AM on August 8, 2014

I will just script it out of hurt pride at the questioning, above, of my "chops." To tell you the truth I just didn't ever bother to figure out what CSV was because it is so dull. But now I see how easy a format it is to just do in PHP.

Falling asleep just thinking of it. I guess I had imagined some Mac app similar to video converters, where you can just drag a bunch of files into it.

BTW the program mentioned in the previous post is not that good, or at least not that well documented, and is mostly for ripping a single URL.

posted by johngoren at 12:49 PM on August 8, 2014

p.s. but thanks for letting me know about Beautiful Soup, looks useful.
posted by johngoren at 12:54 PM on August 8, 2014

This thread is closed to new comments.