Best utility/language to convert text files
October 19, 2007 5:49 AM

What is the best programming language to learn or (open) software to use for converting between text formats?

I am trying to educate myself on the wily ways of moving between formats that I've used for writing (mostly journals) in the past. Some of what I've written is in Word files, some in proprietary software (LifeJournal, TheJournal...stored in a database of some kind with IDX, BLB, and DAT files but with an export function), some in a wiki (which can export to HTML), some in other notetaking software that exports to XML.

I'm pretty sure I can export/import everything from either RTF-like or XML-like documents. My question is: what is the best way to "clean them up" so that one fits into the other? Is there some common utility to do this, or does it require learning a language? If so, which one? (I'm currently learning Python--fingers crossed on that one).

Extra bonus question: where is page data stored in a wiki like MediaWiki? I know they render as HTML, but where is the data coming from? A MySQL table?
posted by mjklin to Computers & Internet (13 answers total) 2 users marked this as a favorite
 
perl is the most commonly used for this purpose.
ruby is a little bit less powerful here, but is much easier to learn (and read when you have to go back in and change your program).

i never really got into python, but i'm sure you could make it work.
posted by ArgentCorvid at 6:07 AM on October 19, 2007


You should get pretty far with Python.

I've used Python for all sorts of file conversion. But you might want to google around - it's pretty rare for a particular file format to not have a Python module already written.
posted by schwa at 6:07 AM on October 19, 2007


You should be able to do all this in Python.
posted by mmascolino at 6:16 AM on October 19, 2007


Best answer: Use any one of the suggestions above - all of them (+ any number as yet unmentioned) are fine. It mostly depends on your degree of familiarity with the language rather than finding the (subjective) best for the job.

You haven't mentioned what you're going to convert them into though. Might I suggest that you devise some sort of templated output format if at all possible and get your data to fit into that? It seems tempting to just read something in one format in a script and print out data/markup in a unified file. However, it's usually more maintainable to write a generic one-size-fits-all template file first and then massage your input data to fit that template.
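As a rough illustration of the template-first idea, here is a minimal Python sketch using the standard library's string.Template. The output layout and the field names (title, date, body) are assumptions made up for this example, not anything the original formats require; each source format would get its own small function that produces the same dict.

```python
from string import Template

# Hypothetical unified output format for one journal entry.
# The three fields are assumptions for this sketch.
ENTRY_TEMPLATE = Template(
    "Title: $title\n"
    "Date: $date\n"
    "\n"
    "$body\n"
)


def normalize_entry(raw):
    """Massage one parsed entry (a dict from any source) into the template's fields."""
    return {
        "title": raw.get("title", "Untitled").strip(),
        "date": raw.get("date", "unknown").strip(),
        "body": raw.get("body", "").strip(),
    }


def render_entry(raw):
    """Fill the shared template with the normalized fields."""
    return ENTRY_TEMPLATE.substitute(normalize_entry(raw))


entry = {"title": "  First day ", "date": "2007-10-19", "body": "Wrote some notes.\n"}
print(render_entry(entry))
```

The point of the split is that only normalize_entry (or a per-format variant of it) changes when a new input format shows up; the template stays put.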

I remember writing lots of throwaway scripts to convert from various formats into SQL when I started out. A bit of foresight and planning would have helped me make things more reusable and maintainable.

Also, be wary of control characters and escaping for your input and output formats. For example, a plain text format wouldn't care about angle brackets (i.e., < and >), but if your target format is HTML/XML, it will certainly expect some degree of escaping. Most conversion modules available (in Python, Perl, etc.) will do this for you automatically, but it's useful to check. (One example of templating: HTML::Template in Perl. Despite the name, you can use it to template pretty much anything into anything else.)
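To make the escaping point concrete, here is a small Python sketch using only standard-library calls (html.escape and xml.sax.saxutils.escape). The helper clean_for_xml is a hypothetical name invented for this example; it drops the control characters that XML 1.0 does not allow (everything below 0x20 except tab, newline, and carriage return) before escaping.

```python
import re
from html import escape as html_escape
from xml.sax.saxutils import escape as xml_escape

# Control characters not permitted in XML 1.0 text
# (tab, newline, and carriage return are allowed and kept).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_for_xml(text):
    """Hypothetical helper: strip disallowed control characters, then escape &, <, >."""
    return xml_escape(_CONTROL_CHARS.sub("", text))


plain = 'if a < b && b > c then say "done"'

# html.escape also escapes quotes by default; xml_escape leaves them alone.
print(html_escape(plain))
print(xml_escape(plain))
print(clean_for_xml("a\x00<b\x0c>\n"))
```

Which of the two escape functions fits depends on where the text ends up: inside an attribute value you generally want the quotes escaped as well.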

Yes, MediaWiki page data is stored in database tables (MySQL in a typical install); the HTML you see is rendered from the wikitext kept there.
posted by geminus at 7:09 AM on October 19, 2007


Best answer: Perl, Python, or Ruby would all do the trick, though you'll probably have the best luck finding off-the-shelf converters with Perl. There's a large library of Perl modules available on CPAN (http://cpan.org/), including many format converters.
posted by dws at 7:24 AM on October 19, 2007


Yeah, python will do it, and has xml parsing libraries. Word is a bit more tricky.
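For the XML side, a minimal sketch with the standard library's xml.etree.ElementTree might look like the following. The element and attribute names (notes, note, date, title, body) are assumptions for illustration; a real export from notetaking software will have its own schema.

```python
import xml.etree.ElementTree as ET

# Made-up sample export; real software will use different tag names.
SAMPLE = """<notes>
  <note date="2007-10-19">
    <title>First day</title>
    <body>Wrote some notes.</body>
  </note>
  <note date="2007-10-20">
    <title>Second day</title>
    <body>Tried &lt;b&gt; tags.</body>
  </note>
</notes>"""


def read_notes(xml_text):
    """Parse the XML text and return one plain dict per <note> element."""
    root = ET.fromstring(xml_text)
    entries = []
    for note in root.findall("note"):
        entries.append({
            "date": note.get("date", ""),
            "title": note.findtext("title", default=""),
            "body": note.findtext("body", default=""),
        })
    return entries


for e in read_notes(SAMPLE):
    print(e["date"], e["title"])
```

Note that the parser hands back unescaped text, so anything written out again as HTML or XML needs to be re-escaped.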
posted by KirkJobSluder at 7:31 AM on October 19, 2007


I've moved away from Perl for the most part except in cases such as this. It has by far the greatest range of modules available for parsing docs.

If you're learning python for other reasons though you may want to stick with that.
posted by bitdamaged at 8:55 AM on October 19, 2007


Best answer: Check out Text Processing in Python
posted by miniape at 9:36 AM on October 19, 2007


Perl was built to do this, and even though I agree Python and Ruby are often better for many projects, I find myself going back to Perl nearly every time I have to do string slinging or format conversion.
posted by weston at 10:36 AM on October 19, 2007


Perl's main advantage in tasks like this comes in the form of CPAN. While other languages have good centralized repositories of their own, CPAN is frigging enormous, thanks to a ten-year (give or take) head start.

While Ruby, PHP and Python have all taken the driver's seat in terms of mindshare, Perl's still a sturdy and remarkably easy to use language.
posted by boo_radley at 10:49 AM on October 19, 2007


Perl is so fast and so flexible to write in that even if it doesn't make technical sense to write something in it, people that know it often will. Yes, it's ugly, yes, it has issues, but it's literally putting exactly what you're thinking into the interpreter and having it have a go at it.

Python will definitely work, though, and if you're learning it, it's probably the route to go.
posted by devilsbrigade at 11:47 AM on October 19, 2007


Are you familiar with Babel?
posted by Mr. Gunn at 4:32 PM on October 19, 2007


Response by poster: Y'all have given me plenty of reason to start learning Perl, but also some encouragement with Python. Thanks again to the hive mind.
posted by mjklin at 11:56 AM on October 23, 2007

