How to convert a dictionary in XML format to a text file for use in a flashcard program?
January 7, 2011 7:15 AM

I use Anki as a flashcard program to learn languages. I have an XML file of an open source English-Swedish dictionary. I'd like to turn this file into a text file that Anki can import. I know nothing about XML files (I don't even quite know what to open it with. OSX tries to use Adobe Illustrator, but surely that can't be right?). Is there any way to do this more or less easily?

Below is an example of the entry for the word "ord", meaning "word", as displayed on the website.

The current format that I use for my cards includes fields for the word in Swedish, the definition, the inflections, and examples. It would already be fantastic to be able to extract this data from the XML file and produce a comma, semicolon or tab-separated text file.

Even better would be to have a way to extract all the idioms, and separate the ones from the same entry every time there is a comma.

Even more amazing (but now entering the realm of language learning OCD) would be to automatically capitalize the example sentences and add a period at the end of the sentence (which is indicated by a space followed by an opening bracket) if there's no other punctuation mark.

Being able to do this would save me a considerable amount of time. I'm therefore ready to commit a reasonable amount of effort to making it work...

Can anyone recommend the best tools for this (preferably on OSX, but I can find a computer that runs Windows if necessary)? If you can't/don't want to walk me through how to do this step by step, what are some good tutorials that might help me figure it out?

---------

ord noun, word

Pronunciation: [o:r_d]

See Saldo: associations inflections

Inflections: ordet, ord, orden

Synonyms: glosa, glosor

Explanation: minsta självständiga språkliga enhet

Example: fula ord (foul language, swearwords),
säg inte ett ord till någon! (don't say a word to anyone!)

Idiom: med andra ord ("annorlunda uttryckt") (in other words ("put in another way")),
ord för ord ("ordagrant") (word for word ("literally verbatim")),
ha ord om sig ("vara känd för") att vara snål (be known to be mean),
innan man vet ordet av ("mycket snabbt") (before I knew where I was ("very quickly")),
ta till orda ("börja tala") (begin to speak),
hålla sitt ord ("hålla vad man lovat") (keep one's word ("do what one has promised")),
begära el. ha ordet ("vilja hålla el. hålla ett anförande") (ask to speak (ask for the floor) or have the floor ("want to address, or address, a gathering")),
ordet är fritt ("vem som helst får yttra sig") (the debate is open ("anyone may speak")),
ta någon på orden ("tro på vad någon säger") (take sby at their word ("believe what sby says")),
ha sista ordet ("vara den som bestämmer") (have the last word ("be the one to decide"))

Compounds: glåpord (taunt, jeer),
ord|följd (word order),
ord|lista (word list, glossary)
posted by snoogles to Computers & Internet (12 answers total) 7 users marked this as a favorite
 
Maybe I can help out, but the link to the xml file is dead. Can you maybe mirror it somewhere?
posted by tempythethird at 7:58 AM on January 7, 2011


Link is just *slow*.
You want Perl (of course), one of the simple XML modules should suffice, the rest is a bit of text processing that shouldn't be that hard. And deciding on output format suitable for Anki. I've done a bunch of stuff like this for Japanese (EDICT) that's probably similar.
posted by zengargoyle at 9:42 AM on January 7, 2011


That file doesn't seem to have what you think it has in it.


[word value="abandonee" lang="en" class="nn"]
[translation value="förvärvare" comment="juridik"/] [translation value="person som äganderätt övergår till"/]
[/word]
[word value="about" lang="en" class="pp"]
[translation value="omkring i" comment="i rumsbetydelse"/]
[translation value="runt i"/]
[translation value="runt på"/]
[translation value="runtomkring"/]
[translation value="om"/]
[example value="gå runt på stan'"]
[translation value="walk about the town"/]
[/example]
[example value="här någonstans"]
[translation value="somewhere about here"/]
[/example]
[/word]

[word value="admire" lang="en" class="vb"]
[translation value="beundra"/]
[grammar value="transitivt"/]
[explanation value="be impressed by, respect"]
[translation value="be impressed by, respect"/]
[/explanation]
[/word]


There are no lines not matching 'lang="en"'. No sort of pronunciation information that I can see. Looks to be just a dictionary. They probably have other data files for other bits of information. You would also have to find the meaning of the 'class' attributes (noun, verb, etc.).

Typically one of the simple Perl modules would give you a hash or array of words, some structure you can just loop over and print out what you want.
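For anyone who wants to see the shape of that loop without installing Perl modules, here is a minimal sketch in Python (not Perl, and not the script discussed later in this thread) using only the standard library. It assumes the real file uses ordinary angle-bracket tags where the excerpt above shows square brackets, and that the root element is `dictionary`; the function name `word_rows` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Two entries adapted from the excerpt above, with the square brackets
# written as ordinary XML angle brackets (an assumption about the real file).
SAMPLE = """<dictionary source-language="en" target-language="sv">
<word value="abandonee" lang="en" class="nn">
<translation value="förvärvare" comment="juridik"/>
<translation value="person som äganderätt övergår till"/>
</word>
<word value="about" lang="en" class="pp">
<translation value="runt i"/>
<example value="här någonstans">
<translation value="somewhere about here"/>
</example>
</word>
</dictionary>"""


def word_rows(xml_text):
    """Return 'swedish,english' lines, one per direct translation of each word."""
    rows = []
    root = ET.fromstring(xml_text)
    for word in root.iter("word"):
        english = word.get("value")
        # findall("translation") only looks at direct children, so the
        # translations nested inside <example> are not picked up here.
        for tr in word.findall("translation"):
            swedish = tr.get("value")
            comment = tr.get("comment")
            if comment:
                swedish = f"{swedish} ({comment})"
            rows.append(f"{swedish},{english}")
    return rows


for row in word_rows(SAMPLE):
    print(row)
```

For the full file you would read it from disk instead (for example with `ET.parse(path).getroot()`); the loop itself stays the same.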
posted by zengargoyle at 10:00 AM on January 7, 2011


Response by poster: Okay, as far as I can tell then, the dictionary is sorted by English word (word value), not Swedish word (translation value). Ah, it even says so at the top of the file (source-language="en" target-language="sv").

So what I would want to do, to use your example, zengargoyle, is to have a file that looks like this:

förvärvare (juridik),abandonee
person som äganderätt övergår till,abandonee

If there's an example, then

runt på,about,gå runt på stan
omkring i,about,gå runt på stan

... and so on for all the stuff between [word value] [/word]. Does that make sense? And then I can go through and manually delete the examples that correspond to the specific Swedish word/expression.

How does one go about making Perl do this for me? Assume I know nothing...
posted by snoogles at 12:14 PM on January 7, 2011


To make this work, post formally:

1) one or two input records
2) one or two equivalent output records.

This stuff is easy (for programmers!) to do in any of the common langs, like Python, Ruby, Perl, which if you have OSX are builtin :) If you are on Windows, it is tougher!
posted by gregglind at 1:34 PM on January 7, 2011


Best answer: Here is a quick example. The 'W/E:' are there for my quick scanning of the output. Quotes may be needed around some of the long examples that have commas in them (don't know Anki formats). And I'm not really sure how the examples go given your example. If you look at the code you'll see that it's pretty easy. Just a little setup and then finding children of tags and fetching attributes. Not sure how hard (or what you need) for setting up on a Mac. It uses XML::Twig module which might have a macports version or can be installed from CPAN (I'm sure it uses LibXML or something underneath.)

Really the rest depends on the specifics of what you need to do for Anki. Maybe open a couple of files each for Word Mappings and Examples, or blank lines between entries, or quotes around fields, etc.

This is a snippet of the output from running the script in the same directory as the XML file:

W: lössläppt,abandoned
W: otyglad,abandoned
W: utsvävande,abandoned
W: fördärvad,abandoned
E: Otyglat beteende.,abandoned,Abandoned behaviour.
W: nödställd,abandoned
E: Vi var helt nödställda när vårt hus hade brunnit ned.,abandoned,We were completely abandoned after our house had burned down.
W: övergiven,abandoned
E: Ett övergivet hus.,abandoned,An abandoned house.
W: förvärvare (juridik),abandonee
W: person som äganderätt övergår till,abandonee
W: otvungenhet,abandon
W: nonchalans,abandon
W: frigjordhet,abandon
E: I glad uppsluppenhet.,abandon,In gay abandon.
W: ge upp,abandon
W: avstå från,abandon
W: frångå,abandon
E: Jag har gett upp min tidigare plan.,abandon,I have abandoned my previous plan.
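
On the point about quotes around fields that contain commas: a small Python sketch of one way to handle it, using the standard `csv` module, which adds quotes only where a field needs them. The second row is invented to show the effect, and `to_delimited` is a hypothetical helper name. Whether Anki's importer accepts quoted fields is something to check before relying on this.

```python
import csv
import io


def to_delimited(rows, delimiter=","):
    """Serialise rows with the csv module, which quotes fields only when needed."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


rows = [
    ["lössläppt", "abandoned"],
    # A made-up example whose fields contain commas, to show the quoting.
    ["Ja, det är sant.", "true", "Yes, that is true."],
]

print(to_delimited(rows), end="")
# With a semicolon as the separator, the commas no longer need quoting.
print(to_delimited(rows, delimiter=";"), end="")
```

Switching the separator to a semicolon or a tab sidesteps most quoting, as long as the fields themselves never contain that character.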

posted by zengargoyle at 1:53 PM on January 7, 2011


I should add that once looking deeper into the file instead of just random peeking it does have some more information available.

The tags used:
      1 dictionary
     42 see
    187 related
    198 definition
   1833 variant
   3595 paradigm
   4507 idiom
   6057 inflection
   7701 explanation
  11728 example
  15755 grammar
  46762 word
 110038 translation
The classes of words:
      2 ie
      2 latin
      2 suffix
     25 prefix
     31 ro
     33 article
     41 rg
     45 hjälpverb
     49 in
     49 pm
     52 pc
     71 kn
    150 abbrev
    158 pn
    269 pp
   1553 ab
   7227 jj
  13197 vb
  23896 nn
They all could be used the same way, via $X->{att}{class} eq "kn" or with children()/first_child() or other functions. I won't post the 3000+ grammar specifications, but they look something like:
$ perl -ne '@q=();while(/grammar value="([^"]+)"/g){push@q,$1}; @q && do {;@f=split/, /,join", ",@q;print join"\n",@f,""};' < folkets_public.xml | sort | uniq -c | sort -n

...
     22 predikativt
     25 attributivt
     26 ingen komparation
     27 intransitivt och transitivt
     30 ofta i plural
     35 alltid med bestämd artikel och i singular
     35 står i plural
     44 vanligen i passiv form
     47 alltid med bestämd artikel
     50 ej i progressiv form
     72 transitivt och intransitivt
     74 ofta i passiv form
     77 vanligen i plural
     77 vanligen i singular
    152 alltid i plural
   3639 intransitivt
   7760 transitivt
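
Counts like the tag and class tallies above can also be reproduced without Perl. A rough Python sketch with the standard library, run here on a tiny stand-in document rather than the real file (the sample content and the function name `count_tags_and_classes` are illustrative assumptions, as is the use of angle-bracket tags):

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny stand-in for the real dictionary file.
SAMPLE = """<dictionary>
<word value="admire" lang="en" class="vb">
<translation value="beundra"/>
<grammar value="transitivt"/>
</word>
<word value="love" lang="en" class="nn">
<translation value="kärlek"/>
<translation value="lust"/>
</word>
</dictionary>"""


def count_tags_and_classes(xml_text):
    """Tally every element name, and the class attribute of every <word>."""
    root = ET.fromstring(xml_text)
    tags = Counter(el.tag for el in root.iter())  # iter() includes the root
    classes = Counter(w.get("class") for w in root.iter("word"))
    return tags, classes


tags, classes = count_tags_and_classes(SAMPLE)
# Print in ascending order of count, like the listings above.
for name, n in sorted(tags.items(), key=lambda kv: kv[1]):
    print(f"{n:7d} {name}")
for name, n in sorted(classes.items(), key=lambda kv: kv[1]):
    print(f"{n:7d} {name}")
```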

posted by zengargoyle at 2:07 PM on January 7, 2011


Response by poster: First, let me say how much I appreciate how helpful you are all being, and zengargoyle in particular.

I've tried a few things, and Anki will deal with commas well as long as the separation value in the text file uses semi-colons.
Let me try to write out what I would like my final output to look like.

File 1 for idioms:
Input for idioms:

[word value="admittance" lang="en" class="nn"] [translation value="tillträde"/] [example value="vi lyckades inte utverka tillträde till specialsamlingarna"] [translation value="we were unable to gain admittance to the rare collections"/] [/example] [idiom value="tillträde förbjudet"] [translation value="no admittance"/] [/idiom] [/word]

Output for idioms:


tillträde förbjudet;no admittance


(i.e. idiom value; translation value of the idiom)

File 2 for everything else:

Input:

[word value="love" lang="en" class="nn"] [translation value="kärlek"/] [translation value="förälskelse"/] [translation value="tillgivenhet"/] [translation value="lust"/] [translation value="förtjusning"/] [translation value="känsla att man tycker mycket om någon"/] [example value="kärlek till naturen"] [translation value="love of nature"/] [/example] [/word]

Output:


nn;kärlek;love;kärlek till naturen
nn;förälskelse;love;kärlek till naturen
nn;tillgivenhet;love;kärlek till naturen
nn;lust;love;kärlek till naturen
nn;förtjusning;love;kärlek till naturen
nn;känsla att man tycker mycket om någon;love;Kärlek till naturen.


(i.e. for each translation value, give class; translation value; word value; translation value)
This results in the examples not quite matching the Swedish, but I think it would be easiest to just eliminate those manually, as the form of the word in the example may not match the "translation value" perfectly.
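
To make the two layouts concrete, here is a minimal Python sketch (standard library only, not the script from this thread). It reads the last field of the second file as the example value, which is what the sample output shows, and it assumes angle-bracket tags in the real file. The function names are made up, and a word with no example produces no row in this version, which may or may not be what is wanted.

```python
import xml.etree.ElementTree as ET

# The "admittance" entry from above, with angle brackets (assumed real syntax).
SAMPLE = """<dictionary>
<word value="admittance" lang="en" class="nn">
<translation value="tillträde"/>
<example value="vi lyckades inte utverka tillträde till specialsamlingarna">
<translation value="we were unable to gain admittance to the rare collections"/>
</example>
<idiom value="tillträde förbjudet">
<translation value="no admittance"/>
</idiom>
</word>
</dictionary>"""


def idiom_rows(root):
    """File 1: idiom value;translation value of the idiom."""
    rows = []
    for idiom in root.iter("idiom"):
        for tr in idiom.findall("translation"):
            rows.append(";".join([idiom.get("value"), tr.get("value")]))
    return rows


def example_rows(root):
    """File 2: class;translation value;word value;example value."""
    rows = []
    for word in root.iter("word"):
        # Only direct children: translations of examples/idioms are skipped.
        for tr in word.findall("translation"):
            for ex in word.findall("example"):
                fields = [word.get("class", ""), tr.get("value"),
                          word.get("value"), ex.get("value")]
                rows.append(";".join(fields))
    return rows


root = ET.fromstring(SAMPLE)
print("\n".join(idiom_rows(root)))
print("\n".join(example_rows(root)))
```

A value that itself contains a semicolon would break this naive join, so it is worth checking the data for that before importing.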
posted by snoogles at 2:20 AM on January 8, 2011


I hope you mean:

class; translation value; word value; example value

I tried some fuzzy matching on the examples × translations lines. You might want to reconsider manually removing awkward combinations. With basic String::Approx::amatch (using the default approximate match of the translation value against the example value), the number of lines output goes from 23,714 down to 8,801 (and it still includes some bad matches). It's a tossup.
# pruned via amatch
vb;abdikera;abdicate;drottningen abdikerade
vb;abdikera;abdicate;abdikera
vb;avsäga sig;abdicate;avsäga sig ansvaret för någonting

# no pruning
vb;abdikera;abdicate;drottningen abdikerade
vb;avgå;abdicate;drottningen abdikerade
vb;abdikera;abdicate;abdikera
vb;avgå;abdicate;abdikera
vb;avsäga sig;abdicate;avsäga sig ansvaret för någonting
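
The pruning idea can be sketched in Python as well. This is not String::Approx and will not give the same results on the full file; it swaps in `difflib.SequenceMatcher` from the standard library and slides a window the length of the translation across the example. The helper name `roughly_contains` and the 0.8 threshold are arbitrary assumptions.

```python
from difflib import SequenceMatcher


def roughly_contains(needle, haystack, threshold=0.8):
    """True if some stretch of haystack, len(needle) characters long, resembles needle."""
    needle, haystack = needle.lower(), haystack.lower()
    if not needle:
        return False
    n = len(needle)
    last_start = max(0, len(haystack) - n)
    for start in range(last_start + 1):
        window = haystack[start:start + n]
        if SequenceMatcher(None, needle, window).ratio() >= threshold:
            return True
    return False


# The five unpruned rows shown above: (class, translation, word, example).
rows = [
    ("vb", "abdikera", "abdicate", "drottningen abdikerade"),
    ("vb", "avgå", "abdicate", "drottningen abdikerade"),
    ("vb", "abdikera", "abdicate", "abdikera"),
    ("vb", "avgå", "abdicate", "abdikera"),
    ("vb", "avsäga sig", "abdicate", "avsäga sig ansvaret för någonting"),
]

# Keep a row only if its translation roughly appears in its example.
kept = [r for r in rows if roughly_contains(r[1], r[3])]
for r in kept:
    print(";".join(r))
```

On these five rows it keeps the same three lines as the pruned sample above. Inflected forms that differ a lot from the dictionary form would still be dropped, so the threshold is a trade-off rather than a fix.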
Proper sentence formatting is in the script too if you enable it. It does seem annoying to me when a single word gets turned into, say, 'Abdikera.', but I'm probably just used to Japanese no-case, no-space.
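
For reference, the sentence formatting asked for in the question (capitalize the first letter, add a period unless the example already ends in punctuation) is only a few lines. A sketch in Python, with `sentence_ize` as a made-up name and `.`, `!`, `?` assumed to be the only end marks that matter:

```python
def sentence_ize(text):
    """Capitalize the first character and add a period if no end mark is present."""
    text = text.strip()
    if not text:
        return text
    text = text[0].upper() + text[1:]
    if text[-1] not in ".!?":
        text += "."
    return text


print(sentence_ize("kärlek till naturen"))
print(sentence_ize("säg inte ett ord till någon!"))
print(sentence_ize("abdikera"))
```

The last call shows the single-word case mentioned above, which comes out as 'Abdikera.'; skipping examples with no space in them would be one way to avoid that.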

Code is updated in the Gist. You have to uncomment a couple of places to choose between emitting idioms or examples and whether or not to prune and/or sentence-ize examples. Other than that...

# change commented parts
./folkets.pl > idioms.txt
# change commented parts
./folkets.pl > examples.txt


There's a bunch of options you could try to prune the examples a bit smarter, but I don't know Swedish. :P You could try stemming and pluralizers and inflectors (Oh my!) but that's a full blown project.
posted by zengargoyle at 10:55 AM on January 8, 2011


Response by poster: zengargoyle, I cannot express enough gratitude to you for taking the time to play around with this for me. The Swedish language learning community will be forever thankful :).

If I can bother you for one last thing: how do I run the script?

I followed the instructions that I found here. Meaning that I put the .pl file that I downloaded from the Gist in my home folder along with the XML dictionary file, removed the # in front of do_idioms(\@idioms); in the .pl file and saved, and then typed run folkets.pl in the terminal window and pressed Enter.

Then I got this:


Emilie-Cotes-MacBook-Pro:~ emiliecote$ perl run.pl
Can't locate String/Approx.pm in @INC (@INC contains: /Library/Perl/Updates/5.10.0 /System/Library/Perl/5.10.0/darwin-thread-multi-2level /System/Library/Perl/5.10.0 /Library/Perl/5.10.0/darwin-thread-multi-2level /Library/Perl/5.10.0 /Network/Library/Perl/5.10.0/darwin-thread-multi-2level /Network/Library/Perl/5.10.0 /Network/Library/Perl /System/Library/Perl/Extras/5.10.0/darwin-thread-multi-2level /System/Library/Perl/Extras/5.10.0 .) at run.pl line 4.
BEGIN failed--compilation aborted at run.pl line 4.


Any idea why?
posted by snoogles at 11:33 AM on January 8, 2011


Best answer: Ok, I don't really know Mac OS X that well but 95% of the time it's just another UNIX... This just means that your Perl doesn't have String::Approx installed. (if you don't want to use my best guess approximate prune then you can comment out that line and the 'use String::Approx' line at the top).

#use String::Approx qw(amatch);
...down in do_example...
# uncomment following line to prune by approximate match
#next unless amatch( $t->{att}{value}, $e->{att}{value} );

But I fear you may not have XML::Twig installed either... Easiest way to check is:

perl -MXML::Twig -e 1

and see if it errors out the same way. It probably will. This is not that uncommon: with any scripting language worth using, there are modules that you pick and install to do the hard parts for you. You just need to install them. It can be a pain the first time you do this because you have to install some Mac stuff and configure CPAN (the Perl module installer), but once you do, it's usually just a simple one-line command to install modules. Some links I found for Mac:

see the Installing Perl Modules section
another page that basically says the same thing.
yet another example

Basically you need the Mac OS X Xcode developer tools package installed. It's on your OS DVD or you can download it from Apple Developer Support. It's free and a standard Mac install thing so you shouldn't have any problems with it. (Even Ubuntu/Linux users usually have to install a build-essential package that has compiler/make/headers and other tools for installing modules.)

Once you have Xcode installed you can use CPAN to install Perl modules. At its simplest it's just:

sudo cpan Module

or

sudo perl -MCPAN -e 'install "Module"'

But the first time you run it you need to configure it. Mostly this just involves picking a mirror to download modules from. And you can accept the defaults for everything else. This sounds rough but it's just:

# install Xcode
sudo cpan XML::Twig #boring config stuff
sudo cpan String::Approx

And if all goes well it will work. If you see any 'XXX:YYY not found should I add it to install [yes]' type messages just hit return to say yes. 90% of the time it's no problem. :)

If this is majorly too complicated or problem prone, I can generate the files and upload them. I just didn't because it's slow on my DSL and don't know if you prefer the approximate match filtering or not. Or zip vs gz (maybe for Windows Anki users). It's "Distributed under the Creative Commons Attribution-Share Alike 2.5 Generic license" so no problems there.
posted by zengargoyle at 12:50 PM on January 8, 2011


Response by poster: It worked perfectly. Thank you so much! I don't know if you use Anki (I guess you're learning Japanese?), but as soon as I clean up the deck I'll share it through Anki under "Snoogle's Swedish Vocabulary", if you want to take a look at the result for all your hard work. You've inspired me to learn some Perl, if only to be able to do this myself next time!
posted by snoogles at 3:37 PM on January 8, 2011

