please help me create a spreadsheet from a list of links.
November 20, 2012 2:43 PM

I want to create a spreadsheet from the hyperlinks and words in a word list on Wiktionary. Please take me through the steps. Thanks!

The Wiktionary list is the most frequent 5,000 words of a non-English language. Most of the words in the list are hyperlinks, leading to individual pages that contain information about each word (part of speech, etymology, English gloss, etc.). I'd like to scrape the data from each link so that I'd end up with a CSV file that has columns with all the available information for each of the 5,000 words (and I'll fill in the rest manually). How do I go about this? Thanks again!
posted by iamkimiam to Computers & Internet (8 answers total) 2 users marked this as a favorite
 
Like this list?

Seems pretty simple to get the words. Copy and paste, then do Text to Columns, splitting on spaces.
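Or, if copy-and-paste gets messy, a few lines of Python could pull the linked words out of the list page for you. A rough sketch (the URL is just a placeholder for whichever list page you're using, and the crude regex will also catch some navigation links you'd have to prune):

import csv
import re
import urllib.request

# Placeholder URL -- substitute the actual frequency-list page.
url = "https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists"
html = urllib.request.urlopen(url).read().decode("utf-8")

# Grab every /wiki/... link that isn't in a namespace (no colon in the target).
links = re.findall(r'<a href="(/wiki/[^"#:]+)"[^>]*>([^<]+)</a>', html)

with open("words.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["word", "url"])
    for href, word in links:
        writer.writerow([word, "https://en.wiktionary.org" + href])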

As far as scraping the dictionary definitions, you're going to want a programmer for that. It's non-trivial unless there's an API, and it doesn't look like there is one.
posted by empath at 2:51 PM on November 20, 2012




It doesn't sound like you have any programming experience. Wiktionary (like all Wikimedia sites) provides downloadable XML versions of its databases at http://dumps.wikimedia.org/. There is no API. I'd be happy to email you a spreadsheet from the most recent dump with the information you need.
posted by GDWJRG at 2:54 PM on November 20, 2012


There's also, more simply, the export page on the site itself, which accepts a list of page titles and returns an XML file. You normally get to it from the Special Pages link in the left-hand toolbox.
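For example (untested, but as far as I know Special:Export takes a newline-separated list of titles in a "pages" parameter; the sample titles below are just stand-ins):

import urllib.parse
import urllib.request

# Stand-in titles -- in practice this would be the whole 5,000-word list.
titles = ["perro", "gato", "casa"]

data = urllib.parse.urlencode({
    "pages": "\n".join(titles),
    "curonly": "1",   # current revision only
}).encode("utf-8")

# Special:Export returns one XML document containing all the requested pages.
req = urllib.request.Request("https://en.wiktionary.org/wiki/Special:Export", data)
with urllib.request.urlopen(req) as resp, open("export.xml", "wb") as out:
    out.write(resp.read())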

I'm not sure what you guys mean by saying there isn't an API, unless you're talking about some particular kind of API. There are many client programming libraries in various languages that interface with the MediaWiki server software (which is what Wikipedia, Wiktionary, Wikibooks, etc. run on), and all sorts of automated tools for normal people built on them that might be useful. (Well, for people who are more normal than someone who would write their own application from scratch, at least.) There's also the Toolserver, which hosts various web-based utilities you could poke around in to see if there's anything handy.
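For instance, the stock api.php endpoint will hand you the raw wikitext of any page. A sketch (I haven't tested this against Wiktionary specifically, and there's no error handling; "water" is just a stand-in title):

import json
import urllib.parse
import urllib.request

def get_wikitext(title):
    """Fetch the raw wikitext of one Wiktionary page via the MediaWiki API."""
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    })
    # Wikimedia asks that automated clients send a descriptive User-Agent.
    req = urllib.request.Request(
        "https://en.wiktionary.org/w/api.php?" + params,
        headers={"User-Agent": "word-list-scraper/0.1 (personal project)"})
    data = json.loads(urllib.request.urlopen(req).read().decode("utf-8"))
    page = next(iter(data["query"]["pages"].values()))
    return page["revisions"][0]["*"]

print(get_wikitext("water")[:500])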
posted by XMLicious at 3:12 PM on November 20, 2012


Here are a few command line tools for handling the XML export files.

But it might be easiest to just try opening or importing the XML file in whatever spreadsheet program you have and see what it does with it; you'll quite possibly get something you can massage into the layout you want with a few formulas and autofilled columns, and then save it as CSV.

Alternatively, if you're handy with the search-and-replace tool in a text editor, you can often use that to convert files like this into a comma-separated format, so that you can just open it in a spreadsheet program and delete a few columns to get it the way you want.
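Or, if you'd rather script it than fight with search-and-replace, the export XML is simple enough to walk with Python's standard library. A sketch (the tag names are from the MediaWiki export format, so double-check them against your actual file; it assumes the export was saved as export.xml):

import csv
import xml.etree.ElementTree as ET

def local(tag):
    """Strip the XML namespace so we can match on bare tag names."""
    return tag.rpartition("}")[2]

rows = []
# export.xml is whatever Special:Export (or the dump) gave you.
for _, elem in ET.iterparse("export.xml", events=("end",)):
    if local(elem.tag) == "page":
        title = text = ""
        for child in elem.iter():
            if local(child.tag) == "title":
                title = child.text or ""
            elif local(child.tag) == "text":
                text = child.text or ""
        rows.append([title, text])
        elem.clear()   # keep memory down on big files

with open("pages.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "wikitext"])
    writer.writerows(rows)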
posted by XMLicious at 3:30 PM on November 20, 2012


When I said there was no API, I meant that there was no way of taking the page text from Wiktionary and saying "give me the definition field". It just isn't formatted that way. The XML files provided similarly have no structure beyond page name -> page text, with no further division into structured data.
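So anything like a part-of-speech or definition column has to be carved out of the raw wikitext yourself, e.g. by splitting on the section headings. A very rough sketch (the heading layout varies a lot between entries, and "Spanish" is just a stand-in for whichever language section you want):

import re

def sections(wikitext, language="Spanish"):
    """Pull the ===Heading=== sections out of one language's entry."""
    # Isolate the block for the language we care about (up to the next ==Language== heading).
    m = re.search(r"==\s*%s\s*==(.*?)(?:\n==[^=]|\Z)" % re.escape(language),
                  wikitext, re.S)
    if not m:
        return {}
    block = m.group(1)
    # Split that block on level-3 headings (Etymology, Noun, Verb, ...).
    parts = re.split(r"\n===\s*([^=]+?)\s*===\n", block)
    return dict(zip(parts[1::2], parts[2::2]))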
posted by GDWJRG at 3:42 PM on November 20, 2012


Oh, I see what you mean. Yeah, that is pretty annoying. I was surprised that they went with completely unstructured content, particularly when there are a bunch of MediaWiki plugins that handle structured data, albeit not as well as content management systems of more recent vintage.

On the other hand, it might have made sense in that it let the page format evolve instead of trying to plan everything out from the beginning, leaving a more sophisticated solution as a future project.
posted by XMLicious at 7:56 PM on November 20, 2012


If the XML isn't easy to work with, you could write a Python* script that downloads each word's page as HTML and then uses regular expressions to scrape the content for whatever sections the page has.

*Doesn't have to be Python, but it's fairly easy to set up and has regex and HTTP functionality built-in.
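Something along these lines, say (completely untested; the pattern keys off the heading markup as I remember it, so it would need adjusting to whatever those pages actually contain, and the sample words are stand-ins):

import re
import urllib.parse
import urllib.request

def scrape_headings(word):
    """Fetch one entry's HTML and pull out its section heading ids (rough sketch)."""
    url = "https://en.wiktionary.org/wiki/" + urllib.parse.quote(word)
    html = urllib.request.urlopen(url).read().decode("utf-8")
    # Rendered headings carry ids like "Etymology", "Noun", "Verb"; adjust the
    # pattern to the fields you actually want once you've looked at the HTML.
    return re.findall(r'class="mw-headline" id="([^"]+)"', html)

for word in ["water", "fire", "earth"]:   # stand-in words
    print(word, scrape_headings(word))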
posted by clorox at 1:52 AM on November 21, 2012


This thread is closed to new comments.