How to convert a dictionary in XML format to a text file for use in a flashcard program?
January 7, 2011 7:15 AM
I use Anki as a flashcard program to learn languages. I have an XML file of an open source English-Swedish dictionary. I'd like to turn this file into a text file that Anki can import.
I know nothing about XML files (I don't even quite know what to open it with. OSX tries to use Adobe Illustrator, but surely that can't be right?). Is there any way to do this more or less easily?
Below is an example of the entry for the word "ord", meaning "word", as displayed on the website.
The current format that I use for my cards includes fields for the word in Swedish, the definition, the inflections, and examples. It would already be fantastic to be able to extract this data from the XML file and produce a comma, semicolon or tab-separated text file.
Even better would be to have a way to extract all the idioms, and separate the ones from the same entry every time there is a comma.
Even more amazing (but now entering the realm of language learning OCD) would be to automatically capitalize the example sentences and add a period at the end of the sentence (which is indicated by a space followed by an opening bracket) if there's no other punctuation mark.
Being able to do this would save me a considerable amount of time. I'm therefore ready to commit a reasonable amount of effort to making it work...
Can anyone recommend the best tools for this (preferably on OSX, but I can find a computer that runs Windows if necessary)? If you can't/don't want to walk me through how to do this step by step, what are some good tutorials that might help me figure it out?
---------
ord noun, word
Pronunciation: [o:r_d]
See Saldo: associations inflections
Inflections: ordet, ord, orden
Synonyms: glosa, glosor
Explanation: minsta självständiga språkliga enhet
Example: fula ord (foul language, swearwords),
säg inte ett ord till någon! (don't say a word to anyone!)
Idiom: med andra ord ("annorlunda uttryckt") (in other words ("put in another way")),
ord för ord ("ordagrant") (word for word ("literally verbatim")),
ha ord om sig ("vara känd för") att vara snål (be known to be mean),
innan man vet ordet av ("mycket snabbt") (before I knew where I was ("very quickly")),
ta till orda ("börja tala") (begin to speak),
hålla sitt ord ("hålla vad man lovat") (keep one's word ("do what one has promised")),
begära el. ha ordet ("vilja hålla el. hålla ett anförande") (ask to speak (ask for the floor) or have the floor ("want to address, or address, a gathering")),
ordet är fritt ("vem som helst får yttra sig") (the debate is open ("anyone may speak")),
ta någon på orden ("tro på vad någon säger") (take sby at their word ("believe what sby says")),
ha sista ordet ("vara den som bestämmer") (have the last word ("be the one to decide"))
Compounds: glåpord (taunt, jeer),
ord|följd (word order),
ord|lista (word list, glossary)
Link is just *slow*.
You want Perl (of course). One of the simple XML modules should suffice; the rest is a bit of text processing that shouldn't be that hard, plus deciding on an output format suitable for Anki. I've done a bunch of stuff like this for Japanese (EDICT) that's probably similar.
posted by zengargoyle at 9:42 AM on January 7, 2011
That file doesn't seem to have what you think it has in it.
[word value="abandonee" lang="en" class="nn"]
[translation value="förvärvare" comment="juridik"/] [translation value="person som äganderätt övergår till"/]
[/word]
[word value="about" lang="en" class="pp"]
[translation value="omkring i" comment="i rumsbetydelse"/]
[translation value="runt i"/]
[translation value="runt på"/]
[translation value="runtomkring"/]
[translation value="om"/]
[example value="gå runt på stan'"]
[translation value="walk about the town"/]
[/example]
[example value="här någonstans"]
[translation value="somewhere about here"/]
[/example]
[/word]
[word value="admire" lang="en" class="vb"]
[translation value="beundra"/]
[grammar value="transitivt"/]
[explanation value="be impressed by, respect"]
[translation value="be impressed by, respect"/]
[/explanation]
[/word]
There are no lines not matching 'lang="en"'. No sort of pronunciation information that I can see. Looks to be just a dictionary. They probably have other data files for other bits of information. You would also have to find the meaning of the 'class' attributes (noun, verb, etc.).
Typically one of the simple Perl modules would give you a hash or array of words, some structure you can just loop over and print out what you want.
posted by zengargoyle at 10:00 AM on January 7, 2011
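A minimal sketch of the "structure you can just loop over" idea, written with Python's standard library (xml.etree.ElementTree) rather than the Perl modules discussed in the thread. It assumes the real file uses ordinary angle-bracket tags where the excerpt above shows square brackets, and that entries sit directly under a dictionary root element. The names load_words and the dict keys are illustrative only.

```python
import xml.etree.ElementTree as ET

# Sample entries from the thread (shortened), with the square brackets
# written back as angle brackets; the real file is assumed to look like this.
SAMPLE = """<dictionary source-language="en" target-language="sv">
<word value="abandonee" lang="en" class="nn">
  <translation value="förvärvare" comment="juridik"/>
  <translation value="person som äganderätt övergår till"/>
</word>
<word value="about" lang="en" class="pp">
  <translation value="omkring i" comment="i rumsbetydelse"/>
  <translation value="runt i"/>
  <example value="gå runt på stan'">
    <translation value="walk about the town"/>
  </example>
</word>
</dictionary>"""


def load_words(xml_text):
    """Return one dict per <word> with its direct translations and examples."""
    root = ET.fromstring(xml_text)
    words = []
    for w in root.findall("word"):
        words.append({
            "word": w.get("value"),
            "class": w.get("class"),
            # findall("translation") only matches direct children, so the
            # English translations nested inside <example> are not included.
            "translations": [t.get("value") for t in w.findall("translation")],
            "examples": [e.get("value") for e in w.findall("example")],
        })
    return words


words = load_words(SAMPLE)
for entry in words:
    print(entry["word"], entry["class"], entry["translations"], entry["examples"])
```

For the full file, ET.parse("folkets_public.xml").getroot() would replace ET.fromstring(SAMPLE), assuming the file name used elsewhere in the thread.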
Response by poster: Okay, as far as I can tell then, the dictionary is sorted by English word (word value), not Swedish word (translation value). Ah, it even says so at the top of the file (source-language="en" target-language="sv").
So what I would want to do, to use your example, zengargoyle, is to have a file that looks like this:
förvärvare (juridik),abandonee
person som äganderätt övergår till,abandonee
If there's an example, then
runt på,about,gå runt på stan
omkring i,about,gå runt på stan
... and so on for all the stuff between [word value] [/word]. Does that make sense? And then I can go through and manually delete the examples that correspond to the specific Swedish word/expression.
How does one go about making Perl do this for me? Assume I know nothing...
posted by snoogles at 12:14 PM on January 7, 2011
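The output described above could be produced roughly as follows. This is a sketch in Python's standard library, not the Perl approach the thread goes on to use. It assumes angle-bracket tags in the real file, and it appends the comment attribute in parentheses for every translation that has one, as in the first sample line. The helper names rows_for_word and to_csv are made up for illustration. Using csv.writer means any field that contains a comma gets quoted automatically.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened sample based on the entries quoted in the thread.
SAMPLE = """<dictionary>
<word value="abandonee" lang="en" class="nn">
  <translation value="förvärvare" comment="juridik"/>
  <translation value="person som äganderätt övergår till"/>
</word>
<word value="about" lang="en" class="pp">
  <translation value="runt på"/>
  <translation value="omkring i" comment="i rumsbetydelse"/>
  <example value="gå runt på stan">
    <translation value="walk about the town"/>
  </example>
</word>
</dictionary>"""


def rows_for_word(word):
    """Yield [swedish, english] or [swedish, english, example] rows for one <word>."""
    english = word.get("value")
    examples = [e.get("value") for e in word.findall("example")]
    for t in word.findall("translation"):
        swedish = t.get("value")
        comment = t.get("comment")
        if comment:
            swedish = f"{swedish} ({comment})"
        if examples:
            # Pair every translation with every example of the same entry.
            for ex in examples:
                yield [swedish, english, ex]
        else:
            yield [swedish, english]


def to_csv(xml_text):
    """Return the comma-separated text for all <word> entries."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for word in ET.fromstring(xml_text).findall("word"):
        writer.writerows(rows_for_word(word))
    return out.getvalue()


csv_text = to_csv(SAMPLE)
print(csv_text, end="")
```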
To make this work, post formally:
1) one or two input records
2) one or two equivalent output records.
This stuff is easy (for programmers!) to do in any of the common langs, like Python, Ruby, Perl, which if you have OSX are builtin :) If you are on Windows, it is tougher!
posted by gregglind at 1:34 PM on January 7, 2011
Best answer: Here is a quick example. The 'W/E:' are there for my quick scanning of the output. Quotes may be needed around some of the long examples that have commas in them (don't know Anki formats). And I'm not really sure how the examples go given your example. If you look at the code you'll see that it's pretty easy. Just a little setup and then finding children of tags and fetching attributes. Not sure how hard (or what you need) for setting up on a Mac. It uses XML::Twig module which might have a macports version or can be installed from CPAN (I'm sure it uses LibXML or something underneath.)
Really the rest depends on the specifics of what you need to do for Anki. Maybe open a couple of files each for Word Mappings and Examples, or blank lines between entries, or quotes around fields, etc.
This is a snippet of the output from running the script in the same directory as the XML file:
W: lössläppt,abandoned
W: otyglad,abandoned
W: utsvävande,abandoned
W: fördärvad,abandoned
E: Otyglat beteende.,abandoned,Abandoned behaviour.
W: nödställd,abandoned
E: Vi var helt nödställda när vårt hus hade brunnit ned.,abandoned,We were completely abandoned after our house had burned down.
W: övergiven,abandoned
E: Ett övergivet hus.,abandoned,An abandoned house.
W: förvärvare (juridik),abandonee
W: person som äganderätt övergår till,abandonee
W: otvungenhet,abandon
W: nonchalans,abandon
W: frigjordhet,abandon
E: I glad uppsluppenhet.,abandon,In gay abandon.
W: ge upp,abandon
W: avstå från,abandon
W: frångå,abandon
E: Jag har gett upp min tidigare plan.,abandon,I have abandoned my previous plan.
posted by zengargoyle at 1:53 PM on January 7, 2011
I should add that once looking deeper into the file instead of just random peeking it does have some more information available.
The tags used:
      1 dictionary
     42 see
    187 related
    198 definition
   1833 variant
   3595 paradigm
   4507 idiom
   6057 inflection
   7701 explanation
  11728 example
  15755 grammar
  46762 word
 110038 translation
The classes of words:
      2 ie
      2 latin
      2 suffix
     25 prefix
     31 ro
     33 article
     41 rg
     45 hjälpverb
     49 in
     49 pm
     52 pc
     71 kn
    150 abbrev
    158 pn
    269 pp
   1553 ab
   7227 jj
  13197 vb
  23896 nn
They all could be used the same way, via
$X->{att}{class} eq "kn"
or with children()/first_child()
or other functions. I won't post the 3000+ grammar specifications but they look something like:
$ perl -ne '@q=();while(/grammar value="([^"]+)"/g){push@q,$1}; @q && do {;@f=split/, /,join", ",@q;print join"\n",@f,""};' < folkets_public.xml | sort | uniq -c | sort -n
...
     22 predikativt
     25 attributivt
     26 ingen komparation
     27 intransitivt och transitivt
     30 ofta i plural
     35 alltid med bestämd artikel och i singular
     35 står i plural
     44 vanligen i passiv form
     47 alltid med bestämd artikel
     50 ej i progressiv form
     72 transitivt och intransitivt
     74 ofta i passiv form
     77 vanligen i plural
     77 vanligen i singular
    152 alltid i plural
   3639 intransitivt
   7760 transitivt
posted by zengargoyle at 2:07 PM on January 7, 2011
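Counts like the ones above can also be gathered without Perl. The sketch below uses Python's collections.Counter and xml.etree.ElementTree. It is an illustration on a tiny invented sample, not the command that produced the numbers in this thread: the second grammar value ("transitivt, ofta i passiv form") is made up to show the split on ", " and is not copied from the dictionary.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Tiny sample; the comma-joined grammar value is hypothetical.
SAMPLE = """<dictionary>
<word value="admire" lang="en" class="vb">
  <translation value="beundra"/>
  <grammar value="transitivt"/>
</word>
<word value="abandon" lang="en" class="vb">
  <translation value="ge upp"/>
  <grammar value="transitivt, ofta i passiv form"/>
</word>
<word value="abandonee" lang="en" class="nn">
  <translation value="förvärvare" comment="juridik"/>
</word>
</dictionary>"""

root = ET.fromstring(SAMPLE)

# How often each tag occurs anywhere in the document (root included).
tag_counts = Counter(el.tag for el in root.iter())

# How often each word class occurs.
class_counts = Counter(w.get("class") for w in root.iter("word"))

# Grammar notes, split on ", " so that combined notes are counted separately.
grammar_counts = Counter(
    part
    for g in root.iter("grammar")
    for part in g.get("value").split(", ")
)

# Print least common first, in the style of `sort | uniq -c | sort -n`.
for name, n in sorted(tag_counts.items(), key=lambda kv: kv[1]):
    print(f"{n:7d} {name}")
```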
Response by poster: First, let me say how much I appreciate how helpful you are all being, and zengargoyle in particular.
I've tried a few things, and Anki will deal with commas well as long as the separation value in the text file uses semi-colons.
Let me try to write out what I would like my final output to look like.
File 1 for idioms:
Input for idioms:
[word value="admittance" lang="en" class="nn"]
[translation value="tillträde"/]
[example value="vi lyckades inte utverka tillträde till specialsamlingarna"]
[translation value="we were unable to gain admittance to the rare collections"/]
[/example]
[idiom value="tillträde förbjudet"]
[translation value="no admittance"/]
[/idiom]
[/word]
Output for idioms:
tillträde förbjudet;no admittance
(i.e. idiom value; translation value of the idiom)
File 2 for everything else:
Input:
[word value="love" lang="en" class="nn"]
[translation value="kärlek"/]
[translation value="förälskelse"/]
[translation value="tillgivenhet"/]
[translation value="lust"/]
[translation value="förtjusning"/]
[translation value="känsla att man tycker mycket om någon"/]
[example value="kärlek till naturen"]
[translation value="love of nature"/]
[/example]
[/word]
Output:
nn;kärlek;love;kärlek till naturen
nn;förälskelse;love;kärlek till naturen
nn;tillgivenhet;love;kärlek till naturen
nn;lust;love;kärlek till naturen
nn;förtjusning;love;kärlek till naturen
nn;känsla att man tycker mycket om någon;love;Kärlek till naturen.
(i.e. for each translation value, give class; translation value; word value; translation value)
This results in the examples not quite matching the Swedish, but I think it would be easiest to just eliminate those manually, as the form of the word in the example may not match the "translation value" perfectly.
posted by snoogles at 2:20 AM on January 8, 2011
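The two files described above could be generated along these lines. This is a Python standard-library sketch under a few assumptions: the real file uses angle-bracket tags, the fourth field of the second file is the example value (as in the sample output), and entries without any example still get a line with an empty last field. The function names idiom_lines and example_lines are invented for illustration. Fields that themselves contain a semicolon are not handled.

```python
import xml.etree.ElementTree as ET

# The "admittance" entry from the thread, with angle brackets restored.
SAMPLE = """<dictionary>
<word value="admittance" lang="en" class="nn">
  <translation value="tillträde"/>
  <example value="vi lyckades inte utverka tillträde till specialsamlingarna">
    <translation value="we were unable to gain admittance to the rare collections"/>
  </example>
  <idiom value="tillträde förbjudet">
    <translation value="no admittance"/>
  </idiom>
</word>
</dictionary>"""


def idiom_lines(root):
    """File 1: idiom value;translation value of the idiom."""
    lines = []
    for idiom in root.iter("idiom"):
        t = idiom.find("translation")
        english = t.get("value") if t is not None else ""
        lines.append(f"{idiom.get('value')};{english}")
    return lines


def example_lines(root):
    """File 2: class;translation value;word value;example value."""
    lines = []
    for word in root.iter("word"):
        # Assumption: keep entries that have no example, with an empty field.
        examples = [e.get("value") for e in word.findall("example")] or [""]
        for t in word.findall("translation"):
            for ex in examples:
                fields = [word.get("class"), t.get("value"), word.get("value"), ex]
                lines.append(";".join(fields))
    return lines


root = ET.fromstring(SAMPLE)
print("\n".join(idiom_lines(root)))
print("\n".join(example_lines(root)))
```

To write real files, each list could be joined with newlines and written with open("idioms.txt", "w", encoding="utf-8"); the file name here is only an example.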
I hope you mean:
class; translation value; word value; example value
I tried some fuzzy matching on the examples X translations lines. You might want to reconsider manually removing awkward combinations. With basic String::Approx::amatch (using the default approximate match of the translation value against the example value), the number of lines output goes from 23,714 down to 8,801 (and it still includes some bad matches). It's a toss-up.
# pruned via amatch
vb;abdikera;abdicate;drottningen abdikerade
vb;abdikera;abdicate;abdikera
vb;avsäga sig;abdicate;avsäga sig ansvaret för någonting
# no pruning
vb;abdikera;abdicate;drottningen abdikerade
vb;avgå;abdicate;drottningen abdikerade
vb;abdikera;abdicate;abdikera
vb;avgå;abdicate;abdikera
vb;avsäga sig;abdicate;avsäga sig ansvaret för någonting
Proper sentence formatting is in there if you enable it. It does seem to me to be annoying when a single word gets made into say 'Abdikera.' but I'm probably just used to Japanese no-case, no-space.
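For reference, sentence formatting of the kind requested in the question (capitalize the example, add a period unless it already ends in punctuation) can be sketched in a few lines of Python. This is not the code from the Gist. The function name sentence_ize is invented, and treating only ".", "!" and "?" as closing punctuation is an assumption.

```python
def sentence_ize(text):
    """Capitalize the first character and add a period if no end punctuation."""
    text = text.strip()
    if not text:
        return text
    # Upper-case only the first character; the rest is left untouched.
    text = text[0].upper() + text[1:]
    if text[-1] not in ".!?":
        text += "."
    return text


for example in ["kärlek till naturen", "säg inte ett ord till någon!", "abdikera"]:
    print(sentence_ize(example))
```

As noted above, single-word examples also get the treatment ("abdikera" becomes "Abdikera."), so a caller may want to skip examples that contain no space.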
Code is updated in the Gist. You have to uncomment a couple of places to choose between emitting idioms or examples and whether or not to prune and/or sentence-ize examples. Other than that...
# change commented parts
./folkets.pl > idioms.txt
# change commented parts
./folkets.pl > examples.txt
There's a bunch of options you could try to prune the examples a bit smarter, but I don't know Swedish. :P You could try stemming and pluralizers and inflectors (Oh my!) but that's a full blown project.
posted by zengargoyle at 10:55 AM on January 8, 2011
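The pruning idea (keep a translation/example pair only if the translation roughly occurs in the example) can be approximated in Python with difflib.SequenceMatcher from the standard library. This is a stand-in for String::Approx::amatch, not an equivalent: the matching rule, the word-window approach, the 0.8 threshold and the name roughly_contains are all assumptions chosen for illustration, and results on the full dictionary will differ from the counts quoted above.

```python
from difflib import SequenceMatcher


def roughly_contains(needle, haystack, threshold=0.8):
    """True if some run of words in haystack is similar enough to needle."""
    needle = needle.lower()
    words = haystack.lower().split()
    n = len(needle.split())
    if n == 0:
        return False
    # Slide a window of n words over the example and compare each window.
    for i in range(max(1, len(words) - n + 1)):
        window = " ".join(words[i:i + n])
        if SequenceMatcher(None, needle, window).ratio() >= threshold:
            return True
    return False


# The unpruned rows shown in the thread: class, translation, word, example.
ROWS = [
    ("vb", "abdikera", "abdicate", "drottningen abdikerade"),
    ("vb", "avgå", "abdicate", "drottningen abdikerade"),
    ("vb", "abdikera", "abdicate", "abdikera"),
    ("vb", "avgå", "abdicate", "abdikera"),
    ("vb", "avsäga sig", "abdicate", "avsäga sig ansvaret för någonting"),
]

kept = [row for row in ROWS if roughly_contains(row[1], row[3])]
for row in kept:
    print(";".join(row))
```

On these five rows the sketch drops the two "avgå" lines, since "avgå" resembles nothing in either example. A lower threshold keeps more inflected forms at the cost of more false matches.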
Response by poster: zengargoyle, I cannot express enough gratitude to you for taking the time to play around with this for me. The Swedish language learning community will be forever thankful :).
If I can bother you for one last thing: how do I run the script?
I followed the instructions that I found here. Meaning that I put the .pl file that I downloaded from the Gist in my home folder along with the XML dictionary file, removed the # in front of do_idioms(\@idioms); in the .pl file and saved, and then typed run folkets.pl in the terminal window and pressed Enter.
Then I got this:
Emilie-Cotes-MacBook-Pro:~ emiliecote$ perl run.pl
Can't locate String/Approx.pm in @INC (@INC contains: /Library/Perl/Updates/5.10.0 /System/Library/Perl/5.10.0/darwin-thread-multi-2level /System/Library/Perl/5.10.0 /Library/Perl/5.10.0/darwin-thread-multi-2level /Library/Perl/5.10.0 /Network/Library/Perl/5.10.0/darwin-thread-multi-2level /Network/Library/Perl/5.10.0 /Network/Library/Perl /System/Library/Perl/Extras/5.10.0/darwin-thread-multi-2level /System/Library/Perl/Extras/5.10.0 .) at run.pl line 4.
BEGIN failed--compilation aborted at run.pl line 4.
Any idea why?
posted by snoogles at 11:33 AM on January 8, 2011
Best answer: Ok, I don't really know Mac OS X that well but 95% of the time it's just another UNIX... This just means that your Perl doesn't have String::Approx installed. (if you don't want to use my best guess approximate prune then you can comment out that line and the 'use String::Approx' line at the top).
#use String::Approx qw(amatch);
...down in do_example...
# uncomment following line to prune by approximate match
#next unless amatch( $t->{att}{value}, $e->{att}{value} );
But I fear you may not have XML::Twig installed either... Easiest way to check is:
perl -MXML::Twig -e 1
and see if it errors out the same way. It probably will. This is not that uncommon, with any scripting language worth using there are modules that you pick and install to do the hard parts for you. You just need to install them. It can be a pain the first time you do this because you have to install some Mac stuff and configure CPAN (the Perl module installer) but once you do it's usually just a simple one-line command to install modules. Some links I found for Mac:
see the Installing Perl Modules section
another page that basically says the same thing.
yet another example
Basically you need the Mac OS X Xcode developer tools package installed. It's on your OS DVD or you can download it from Apple Developer Support. It's free and a standard Mac install thing so you shouldn't have any problems with it. (Even Ubuntu/Linux users usually have to install a build-essential package that has compiler/make/headers and other tools for installing modules.)
Once you have the Xcode tools installed you can use CPAN to install Perl modules. At its simplest it's just:
sudo cpan Module
or
sudo perl -MCPAN -e 'install "Module"'
But the first time you run it you need to configure it. Mostly this just involves picking a mirror to download modules from. And you can accept the defaults for everything else. This sounds rough but it's just:
# install the Xcode tools
sudo cpan XML::Twig #boring config stuff
sudo cpan String::Approx
And if all goes well it will work. If you see any 'XXX:YYY not found should I add it to install [yes]' type messages just hit return to say yes. 90% of the time it's no problem. :)
If this is majorly too complicated or problem prone, I can generate the files and upload them. I just didn't because it's slow on my DSL and don't know if you prefer the approximate match filtering or not. Or zip vs gz (maybe for Windows Anki users). It's "Distributed under the Creative Commons Attribution-Share Alike 2.5 Generic license" so no problems there.
posted by zengargoyle at 12:50 PM on January 8, 2011
Response by poster: It worked perfectly. Thank you so much! I don't know if you use Anki (I guess you're learning Japanese?), but as soon as I clean up the deck I'll share it through Anki under "Snoogle's Swedish Vocabulary", if you want to take a look at the result for all your hard work. You've inspired me to learn some Perl, if only to be able to do this myself next time!
posted by snoogles at 3:37 PM on January 8, 2011
posted by tempythethird at 7:58 AM on January 7, 2011