A Tool To Convert HTML to something importable into WordPress?
November 13, 2008 7:16 AM
Need a tool to help me convert html files to something I can import into WordPress.
I am converting a client's site to be run by WordPress. She has hundreds of articles, blog entries, and podcasts that have to be imported. But the site was manually managed and updated, so there is no current database. Am I looking at hours and hours of soul-crushing tedium? Or is there a handy tool that can parse through the pages, grab the content, and convert it to a WordPress-friendly import file?
I'm on a Mac, and have BBEdit; using Find/Replace hasn't proved helpful, because each page is slightly different.
I've been looking for the same thing for more than a year -- there doesn't seem to be any easy way to do it (which seems incredible).
I basically gave up after a small database company offered to do it for free (they are fans of my site), but then backed out when they got a sense of how big a job it was. I wanted to ask them how they had planned to do it in the first place, but they stopped answering my emails. Now I just have a link on my WP to the old site.
posted by words1 at 7:41 AM on November 13, 2008
It's not just a matter of creating a CSV file; it will need to be a CSV file that matches up to the MySQL table structure needed by WP. That does sound like a rather complicated job and I'm not surprised it's not automated easily. Do you have the budget to pay via Mechanical Turk for somebody to create the posts and cut and paste the content once you get WP set up?
posted by COD at 8:06 AM on November 13, 2008
Best answer: Step 1.
Remove all non-article HTML that surrounds each post.
The key to this is to use BBEdit find & replace with GREP to find the patterns. For example, every article might start with something consistent (like a div with a certain ID or a headline that is wrapped in an H1). If there really is no structure at all, then there probably won't be a way to automate it. Keep in mind that you are going to be running these searches across many files. Make sure you duplicate the files before every step so you can roll back, since there is no way to undo across multiple files.
http://www.anybrowser.org/bbedit/grep.shtml
Step 2.
Gather all the posts.
Once you have just the content stripped out you still have a ton of text files. You then need to get all those text files in one spot. In the past I have written AppleScripts to do this. There may be other file merge utilities that can do it too.
Step 3.
Convert this massive text file into something with structure.
You can either make a CSV (again with BBEdit/GREP) or try cramming your data into one or many SQL insert statements. These statements need to match up to the WordPress DB structure. I would start by opening the DB and exporting one post and formatting your data to match.
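For anyone who would rather script the three steps than run them by hand in BBEdit, here is a minimal sketch in Python. It assumes every page wraps the article in `<div id="content">` followed by an `<!-- end content -->` comment and puts the headline in an `<h1>`; those markers are hypothetical, so swap in whatever the real pages share. The column names are borrowed from the `post_title` and `post_content` columns of WordPress's posts table, but the CSV as a whole is only an intermediate file, not something WordPress imports directly.

```python
import csv
import re
from pathlib import Path

# Hypothetical markers: adjust these two patterns to match the real pages.
BODY_RE = re.compile(
    r'<div id="content">(.*?)</div>\s*<!-- end content -->', re.S | re.I
)
TITLE_RE = re.compile(r"<h1[^>]*>(.*?)</h1>", re.S | re.I)


def extract_post(html):
    """Step 1: return (title, body) for one page, or None if a marker is missing."""
    body = BODY_RE.search(html)
    title = TITLE_RE.search(html)
    if not body or not title:
        return None
    return title.group(1).strip(), body.group(1).strip()


def pages_to_csv(source_dir, csv_path):
    """Steps 2 and 3: gather every *.html file and write one CSV row per post.

    Returns the names of the files that did not match, so they can be
    handled by hand. The source files are only read, never modified.
    """
    skipped = []
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["source_file", "post_title", "post_content"])
        for page in sorted(Path(source_dir).glob("*.html")):
            html = page.read_text(encoding="utf-8", errors="replace")
            post = extract_post(html)
            if post is None:
                skipped.append(page.name)
                continue
            writer.writerow([page.name, *post])
    return skipped
```

Calling `pages_to_csv("old_site", "posts.csv")` (both names made up) leaves the originals alone and hands back a list of the pages that need manual attention, which is where the "each page is slightly different" problem will show up.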
posted by rdurbin at 9:17 AM on November 13, 2008
I don't know if this is helpful, but I remember a question about the format of the wp-import file on the WordPress forums a while back. I've successfully used the format that doodlebee recommends in that thread. As far as getting the entries into that suggested format? I'd do what rdurbin recommends.
Alternately, in the thread I referenced above, the last comment is from someone who ran into issues and wrote scripts to extract the data from sql dumps, and said to contact her if anyone needed the scripts.
Good luck!
posted by 8dot3 at 9:53 AM on November 13, 2008
WordPress supports a variety of import formats. None of them are going to do exactly what you need, but they'll be easier to work with than trying to import a CSV directly into the database.
You'll want to do a little research into all the options, but my gut feeling is that either RSS or WordPress's XML archive format is going to be your best bet. RSS is probably simplest.
To start, look at some RSS. I was just looking at Twitter's RSS feeds and they are extremely simple, so try using that as your model, but put the HTML you want to import between the description tags. Cut and paste together a short feed with a few articles and try importing it into a test WordPress instance using their import tool (under Manage: Import, I think).
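Once a hand-made feed imports cleanly, the same shape can be generated by a script. This is a rough sketch using only Python's standard library; the `build_rss` function and the keys of the post dictionaries are made-up names for illustration. It produces a bare RSS 2.0 feed with each article's HTML inside `<description>`. Whether the importer picks up every field is something to confirm on a test install.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_rss(site_title, site_url, posts):
    """Return an RSS 2.0 document (as a string) with one <item> per post.

    Each post is a dict with "title", "link", "date" (a timezone-aware
    datetime) and "html" keys; these key names are illustrative only.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = "Imported articles"
    for post in posts:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = post["title"]
        ET.SubElement(item, "link").text = post["link"]
        # RSS dates use the RFC 822 style that email headers use.
        ET.SubElement(item, "pubDate").text = format_datetime(post["date"])
        # ElementTree escapes the markup (<p> becomes &lt;p&gt;), which is
        # the usual way to carry HTML inside <description>.
        ET.SubElement(item, "description").text = post["html"]
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    sample = [{
        "title": "First article",
        "link": "http://example.com/first.html",
        "date": datetime(2008, 11, 13, 7, 16, tzinfo=timezone.utc),
        "html": "<p>Hello &amp; welcome.</p>",
    }]
    print(build_rss("Client site", "http://example.com/", sample))
```

Letting the XML library do the escaping avoids the most common way a hand-edited feed breaks, which is a stray `&` or `<` in the article body.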
Once you get the hang of it, you'll need to come up with a way to automate the process. I'm guessing there are scripts to turn a collection of files into RSS. In fact, I think the Blosxom blog engine works in just that manner.
Automating the task of extracting the HTML of the content of interest from the existing pages may be harder. It depends on how uniformly the pages have been coded. If they all use a common template, it should be pretty easy to do with grep patterns. Otherwise your best bet is to look at some of the tools, like Beautiful Soup, for screen-scraping HTML. A few years ago, I found some that could look at a pile of pages and create extraction templates by learning the different sections that varied from page to page; then you could extract various sections from each page.
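To give a feel for the parser route, here is a small sketch that pulls the inner HTML of one element out of a page by its `id`. It uses Python's built-in `html.parser` rather than Beautiful Soup so that it runs without installing anything; the class and function names, and the `content` id in the usage note, are assumptions for illustration. It only counts nesting for tags with the same name as the target, so unclosed `<p>` or `<br>` tags in hand-written pages don't throw it off.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collect the inner HTML of the first element carrying a given id."""

    def __init__(self, target_id):
        super().__init__(convert_charrefs=False)  # keep &amp; etc. as written
        self.target_id = target_id
        self.tag = None      # tag name of the target element, once found
        self.depth = 0       # nesting of same-named tags; 0 means "outside"
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            if dict(attrs).get("id") == self.target_id:
                self.tag, self.depth = tag, 1
            return
        self.parts.append(self.get_starttag_text())
        if tag == self.tag:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> never change the nesting depth.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0:
            return
        if tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True   # closing tag of the target itself
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth:
            self.parts.append(f"&#{name};")


def extract_by_id(html, target_id):
    """Return the inner HTML of the element with this id, or None if absent."""
    parser = ContentExtractor(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip() if parser.tag else None
```

Something like `extract_by_id(page_html, "content")` would then return just the article markup. HTML comments inside the element are dropped, and pages with no matching id come back as `None`, which makes the odd ones easy to list and fix by hand.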
There may be some easy-to-use scraping applications, but the stuff I'm familiar with required at the very least some fussing at the command line, if not some basic scripting.
Even after you get over this hurdle, you may have issues with the quality of the extracted HTML, and how well it coexists inside the generally up-to-date markup in WordPress.
Oh, if they already have real podcasts, with real RSS feeds, you can probably import them more or less directly, though you may have to find/replace to fix the URLs for the audio files.
posted by Good Brain at 10:15 AM on November 13, 2008
If you're going to dip into programming anyway (and you probably will in order to process out the HTML in these files), you should consider WordPress' API, which has a way to programmatically create new posts.
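The native way to do this is PHP running inside WordPress. As a different route to the same idea, and only a sketch, the example below creates posts from outside WordPress over its XML-RPC interface using the `metaWeblog.newPost` method. It assumes remote publishing over XML-RPC is enabled on the site; the endpoint URL, the credentials and the `build_post` and `publish` helpers are placeholders, not anything taken from this thread.

```python
import xmlrpc.client


def build_post(title, html, tags=(), categories=()):
    """Build the struct that metaWeblog.newPost takes for one post."""
    return {
        "title": title,
        "description": html,             # the post body
        "mt_keywords": ", ".join(tags),  # tags, as a comma-separated string
        "categories": list(categories),  # category names
    }


def publish(endpoint, username, password, post, blog_id=1):
    """Send one post to the site and return the ID of the new post.

    endpoint is the site's xmlrpc.php URL, for example
    "http://example.com/xmlrpc.php" (a placeholder).
    """
    server = xmlrpc.client.ServerProxy(endpoint)
    # The final True asks for the post to be published rather than saved as a draft.
    return server.metaWeblog.newPost(blog_id, username, password, post, True)


if __name__ == "__main__":
    post = build_post("First article", "<p>Hello.</p>", tags=["podcast"])
    # Print the request body instead of sending it, so this runs offline.
    print(xmlrpc.client.dumps((1, "user", "secret", post, True),
                              methodname="metaWeblog.newPost"))
```

Looping `publish` over the extracted articles against a test install first is the safe way to find out how the site handles dates, tags and categories before touching the real one.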
IMO, CSV or XML would add an unnecessary layer that would be difficult to debug. Using the simple API also means WordPress can calculate the correct defaults for you and gives you more control over the content. (There's a lot of administrative cruft in WordPress' database table layout. For example, tagging posts correctly is nearly impossible to do without the API's help. wp_insert_post also should detect podcast links/embeds seamlessly.)
posted by shadytrees at 12:31 PM on November 13, 2008
This thread is closed to new comments.
One more thing: it doesn't have to be Markdown in particular. I've just used it before, so it came to mind. Any sort of format-preserving text "schema" would work, or what Wikipedia calls lightweight markup languages.
Good luck!
posted by hatta at 7:34 AM on November 13, 2008