Move text from WordPress site to spreadsheet
May 4, 2023 10:27 AM

I inherited a work project that involves a highly outdated and mostly defunct WordPress site with around 600 text entries describing scientific findings. I would like to figure out how to export the text content of the WordPress site into a simple, straightforward spreadsheet. I have zero WordPress experience. Can this be done?

My understanding is that on the backend of this site, these text entries and the information architecture organizing them are...a convoluted hot garbage mess. It takes 5+ clicks to get to the thing you're on the site to see, making it nearly impossible for anyone to find anything.

Even if I have to do this methodically and slowly I'm hoping that at least I can export some "levels" of the information as some sort of CSV type file that at least saves some time.

I'm at the point right now that I think just painstakingly copying and pasting the text fields one at a time into a spreadsheet is the only option I know how to do.

If using something like Python would be easy enough to learn to help automate I'd be happy to learn! I know that there are lots of tools out there for web scraping, so I'd also be interested in trying some of those out.

Basically what I'll need is:
- Study citation information (for some reason they didn't include links or DOIs so I'll have to add that information in somehow)
- 150-word text summary
- Title
- Date
- Taxonomic information (what topic area of the site was this filed under)
- Any tags (e.g., what country or journal was this study from)
posted by forkisbetter to Technology (7 answers total)
 
Do you have access to the backing database? It may be pretty easy to pull the data directly out of MySQL instead of mucking around with the application itself.
posted by rockindata at 10:30 AM on May 4, 2023 [4 favorites]


Yes, if you have access to the database administration, you can use phpMyAdmin to find the table where the data are and dump them with SQL into a text file. Otherwise, 14 years ago, I used a Python library called Beautiful Soup to do the same kind of thing (extracting 600 scientific datasheets from a dead website).
posted by elgilito at 10:38 AM on May 4, 2023


If you don't have access to the database, but do have access to the WordPress backend, you might be able to export the data and convert from XML to CSV.
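
For what it's worth, here is a minimal sketch of the XML-to-CSV step using only Python's standard library. It assumes a standard WXR file from the built-in export screen, where each entry is an `<item>` with `<title>`, `<pubDate>`, `<content:encoded>`, and `<category>` elements marked `domain="category"` or `domain="post_tag"`. The sample data, function names, and CSV column names are made up for illustration, and a site with custom taxonomies or custom fields would need more than this.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace URI that WXR uses for the full post body element.
NS = {"content": "http://purl.org/rss/1.0/modules/content/"}

# Tiny made-up stand-in for a real export file.
SAMPLE = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<item>
<title>Example study</title>
<pubDate>Mon, 01 Jan 2018 00:00:00 +0000</pubDate>
<category domain="category" nicename="ecology"><![CDATA[Ecology]]></category>
<category domain="post_tag" nicename="kenya"><![CDATA[Kenya]]></category>
<content:encoded><![CDATA[Summary text.]]></content:encoded>
</item>
</channel>
</rss>"""


def wxr_to_rows(xml_text):
    """Return one dict per <item> found in the WXR text."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):
        cats = item.findall("category")
        rows.append({
            "title": item.findtext("title", default=""),
            "date": item.findtext("pubDate", default=""),
            "categories": "; ".join(
                c.text or "" for c in cats if c.get("domain") == "category"),
            "tags": "; ".join(
                c.text or "" for c in cats if c.get("domain") == "post_tag"),
            "content": item.findtext(
                "content:encoded", default="", namespaces=NS),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["title", "date", "categories", "tags", "content"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


print(rows_to_csv(wxr_to_rows(SAMPLE)))
```

For a real file you would read the export from disk instead of `SAMPLE` and write the result to a `.csv` file. A real export also tends to include pages, attachments, and menu items as `<item>` elements; they can be told apart by a `wp:post_type` element, whose namespace URI includes the WXR version, so check the top of your own file before filtering on it.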

Also probably not too terrible to scrape the data from the website frontend if you don't have any access.

This is something I do for work sometimes, happy to help if I can, if you want to memail me.
posted by gregr at 11:04 AM on May 4, 2023


There are plenty of plugins that will export WP entries to a CSV file.
posted by DarlingBri at 12:04 PM on May 4, 2023


I was going to make both of those recommendations. If the underlying MySQL / SQLite / whatever database is still there, that's going to be the easiest way to get the raw data out. But if you have to parse the site itself, you can download the pages and hack something together with Beautiful Soup.
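
A rough sketch of what the page-parsing step might look like. Beautiful Soup is a separate install, so this uses Python's built-in `html.parser` as a dependency-free stand-in for the same idea. The tag and class names (`h1.entry-title`, `div.entry-content`) are assumptions about the theme's markup, not something known about this site; you would need to view the page source and adjust them.

```python
from html.parser import HTMLParser


class EntryParser(HTMLParser):
    """Collect the text inside the elements that hold the title and summary."""

    # Assumed markup: which (tag, class) pair holds which field.
    TARGETS = {
        ("h1", "entry-title"): "title",
        ("div", "entry-content"): "summary",
    }

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "summary": ""}
        self._field = None  # field currently being captured, if any
        self._tag = None    # tag name that opened the capture
        self._depth = 0     # nesting depth of that same tag name

    def handle_starttag(self, tag, attrs):
        if self._field is None:
            classes = (dict(attrs).get("class") or "").split()
            for (target_tag, cls), field in self.TARGETS.items():
                if tag == target_tag and cls in classes:
                    self._field, self._tag, self._depth = field, tag, 1
                    break
        elif tag == self._tag:
            # Nested element with the same tag name: track it so the
            # capture ends at the matching close, not the first one.
            self._depth += 1

    def handle_endtag(self, tag):
        if self._field is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self.fields[self._field] += data


def extract_entry(html):
    """Return the captured fields with runs of whitespace collapsed."""
    parser = EntryParser()
    parser.feed(html)
    parser.close()
    return {name: " ".join(text.split()) for name, text in parser.fields.items()}


# Made-up page standing in for one downloaded entry.
SAMPLE_PAGE = """
<html><body>
<h1 class="entry-title">Example study</h1>
<div class="entry-content"><p>First sentence.</p> <div><p>Second sentence.</p></div></div>
<div class="sidebar">Ignore me</div>
</body></html>
"""

print(extract_entry(SAMPLE_PAGE))
```

Run over each downloaded page, this gives one dictionary per entry that can be written out with `csv.DictWriter`. Citation, date, and tags would each need their own selector once you know where the theme puts them.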
posted by wnissen at 1:09 PM on May 4, 2023


Best answer: Of the three proposed approaches—A) get the data directly from the database; B) do an export from WordPress, either a standard WXR (RSS/XML) export or with a CSV export plugin, then import it to a spreadsheet; or C) scrape the published pages—I’m guessing B is going to be the least painful path.

A) Content and metadata for a given post in a WordPress database are stored across multiple tables in a way that requires some non-trivial SQL joins and understanding of the data model to tease out coherently.
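
To give a sense of what those joins look like, here is a sketch that builds a cut-down copy of the relevant tables in an in-memory SQLite database so it can run anywhere. The table and column names follow the default WordPress schema with the default `wp_` prefix, but the sample rows are invented, a real site runs on MySQL or MariaDB and may use a different prefix, and custom fields (which live in `wp_postmeta`) are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified versions of the tables involved, with one made-up post.
conn.executescript("""
CREATE TABLE wp_posts (ID INTEGER PRIMARY KEY, post_title TEXT, post_date TEXT,
                       post_content TEXT, post_status TEXT, post_type TEXT);
CREATE TABLE wp_terms (term_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE wp_term_taxonomy (term_taxonomy_id INTEGER PRIMARY KEY,
                               term_id INTEGER, taxonomy TEXT);
CREATE TABLE wp_term_relationships (object_id INTEGER, term_taxonomy_id INTEGER);

INSERT INTO wp_posts VALUES (1, 'Example study', '2018-01-01 00:00:00',
                             'Summary text.', 'publish', 'post');
INSERT INTO wp_terms VALUES (10, 'Ecology'), (11, 'Kenya');
INSERT INTO wp_term_taxonomy VALUES (100, 10, 'category'), (101, 11, 'post_tag');
INSERT INTO wp_term_relationships VALUES (1, 100), (1, 101);
""")

# A post links to its categories and tags through two intermediate tables.
QUERY = """
SELECT p.ID, p.post_title, p.post_date, tt.taxonomy, t.name
FROM wp_posts AS p
LEFT JOIN wp_term_relationships AS tr ON tr.object_id = p.ID
LEFT JOIN wp_term_taxonomy AS tt ON tt.term_taxonomy_id = tr.term_taxonomy_id
LEFT JOIN wp_terms AS t ON t.term_id = tt.term_id
WHERE p.post_status = 'publish' AND p.post_type = 'post'
ORDER BY p.ID, tt.taxonomy, t.name
"""
rows = conn.execute(QUERY).fetchall()

# The join yields one row per (post, term) pair, so fold them back
# into one record per post.
posts = {}
for post_id, title, date, taxonomy, term in rows:
    post = posts.setdefault(
        post_id, {"title": title, "date": date, "category": [], "post_tag": []})
    if taxonomy in ("category", "post_tag"):
        post[taxonomy].append(term)

for post in posts.values():
    print(post)
```

Even in this toy version, getting one spreadsheet row per entry takes three joins plus a regrouping step, which is the extra work being described.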

C) You’re basically starting with a blank slate and need to develop not only the logic to extract data from a given page, but also a way to enumerate all the pages you want to target and loop through them (the site may already publish a sitemap that can provide this).
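
If the site does publish a sitemap, the enumeration part can be fairly short. A sketch, assuming a standard sitemaps.org `<urlset>` file; the URLs and the function name are placeholders, and the sample is inlined so nothing is fetched from the network.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sitemap standing in for the one downloaded from the site.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/findings/study-one/</loc></url>
  <url><loc>https://example.org/findings/study-two/</loc></url>
</urlset>"""


def urls_from_sitemap(xml_text):
    """Return every <loc> URL in the sitemap, in document order."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


for url in urls_from_sitemap(SAMPLE_SITEMAP):
    print(url)  # in a real run: download and parse each page here
```

Where the sitemap lives depends on the setup. WordPress core has served one at `/wp-sitemap.xml` since version 5.5, but an outdated site may predate that and rely on an SEO plugin instead, or have none at all. Larger sites often publish a sitemap index whose `<loc>` entries point at further sitemap files rather than at pages, so one extra level of looping may be needed.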

With option B you’re dealing with a tool that already understands the structure of the data and is designed to output an importable file.
posted by staggernation at 1:42 PM on May 4, 2023


Best answer: WP All Export (the free version should be fine) is a good option to get everything as a CSV. Assuming all of the information you want is associated with one post per item, it should be trivially easy to get this into a usable spreadsheet from the WordPress backend.

Definitely do not try to get anything from the database directly, unless you have no other option.
posted by ssg at 3:34 PM on May 4, 2023

