How to write an .rtf to XML/HTML conversion tool
February 4, 2020 8:00 AM

I would like to create a tool that converts an .rtf file to an XML or HTML document. I have familiarity with markup languages, but very limited knowledge of programming/scripting languages. What would be the best language for me to learn to try to create this tool?

While this tool would serve an actual purpose, the exercise is as much a personal learning "stretch" goal as anything else. It also doesn't need to happen in any particular time frame. So if your first thought is, "I would use X but I think you're biting off more than you can chew," that's fine! I do better at learning new skills when I have a legit goal to apply them to.

I've left this open to both XML and HTML because I have two different projects I could apply this to, and either would be useful.
posted by pinwheel spark to Technology (8 answers total)
Best answer: Python. In fact there’s probably already a Python library that does this, although that obviously defeats the purpose.
posted by Tell Me No Lies at 8:05 AM on February 4, 2020 [1 favorite]

Best answer: If you are interested in learning programming, just pick a language you want to learn, or pick one that is popular (Python). I could probably accomplish that task in C++, Perl, MS Access Visual Basic, JavaScript, MS-DOS, and probably a few others, depending on how complex the RTF is. Once you learn one programming language and understand the concepts, the others are pretty much all the same.

That's also actually a pretty good starting project, because reading input and turning it into another format is like 50% of programming.
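To make the "read input, turn it into another format" idea concrete, here is a minimal sketch in Python. It is not a real RTF parser: it assumes only a tiny subset of RTF (`\b`, `\i`, `\par`, and escaped braces) and skips every other control word. The function name `rtf_to_html` and the sample input are made up for illustration.

```python
import html
import re

# One token per match: escaped literal, control word, brace, or plain text.
TOKEN = re.compile(
    r"\\([\\{}])"              # escaped literal: \\  \{  \}
    r"|\\([a-z]+)(-?\d+)? ?"   # control word, optional number, optional space
    r"|([{}])"                 # group braces (ignored in this sketch)
    r"|([^\\{}]+)"             # plain text run
)

def rtf_to_html(rtf: str) -> str:
    """Convert a tiny, assumed subset of RTF to HTML paragraphs."""
    bold = italic = False
    runs, paragraphs = [], []

    def emit(text):
        # Line breaks in RTF source are not significant, so drop them.
        text = html.escape(text.replace("\r", "").replace("\n", ""))
        if not text:
            return
        if italic:
            text = f"<i>{text}</i>"
        if bold:
            text = f"<b>{text}</b>"
        runs.append(text)

    def flush():
        # Close the current paragraph, if it has any content.
        if runs:
            paragraphs.append("<p>" + "".join(runs) + "</p>")
            runs.clear()

    for literal, word, number, brace, text in TOKEN.findall(rtf):
        if literal:
            emit(literal)
        elif word == "b":
            bold = number != "0"    # \b turns bold on, \b0 turns it off
        elif word == "i":
            italic = number != "0"  # \i turns italic on, \i0 turns it off
        elif word == "par":
            flush()
        elif text:
            emit(text)
        # Any other control word, and every brace, is skipped.
    flush()
    return "\n".join(paragraphs)

sample = r"{\rtf1 Hello \b world\b0 !\par Second \i line\i0\par}"
print(rtf_to_html(sample))
# <p>Hello <b>world</b>!</p>
# <p>Second <i>line</i></p>
```

Real RTF files also carry font tables, color tables, Unicode escapes, and nested groups, so a sketch like this would need a lot of growing before it could handle a document saved from a word processor.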
posted by The_Vegetables at 8:10 AM on February 4, 2020

Best answer: Pandoc does this already, I believe, and it's written in Haskell.

Python is probably a good choice, given its many beginner docs and its large, friendly community.
posted by jzb at 8:11 AM on February 4, 2020 [1 favorite]

Best answer: I would suggest Python for this as a first project. If I were doing this, I'd probably reach for Perl or some abominable combination of Unix shell utilities/built-ins first because I'm more familiar with them.
posted by jquinby at 8:19 AM on February 4, 2020

Best answer: I'd agree with everyone so far - this is an excellent project for Python!
posted by Umami Dearest at 8:43 AM on February 4, 2020

Best answer: Agree with others: Python is a good tool for this. But you could do it in basically any server-side scripting language: Node.js, Ruby, Perl, PHP, Go, etc. The options are basically limitless. Python is a good first language though.

That said -- I question your task's premise a little bit. RTF to HTML is "easy" as they are both markup languages that represent the layout of content on a screen in a defined way and I would bet there are already a dozen tools that do this. Creating another one is a valuable learning exercise. RTF to XML doesn't really make sense though without more info - what purpose does having this data in XML serve? Are you just looking for an alternative way to represent the data? (In that case, I'd look at JSON instead of XML... all modern languages speak JSON these days, XML is more legacy than not). Or is there a specific XML spec you're planning to convert the RTF to?
posted by cgg at 9:09 AM on February 4, 2020

Response by poster: Sounds like Python is the place to start, which is nice because now that I think about it, I already know some folks who are very enthusiastic about it. Thanks everyone!

Re: why XML; yes, it would be a specific spec, one that we use for publishing (think user manuals -- a lot of structured text with many headings, numbered paragraphs, etc.). It's definitely a legacy model, but moving away from it is happening at the very slow pace of institutional change. The more I think on it, though, I realize it's probably not the best place for me to spend my time and energy.
posted by pinwheel spark at 12:05 PM on February 4, 2020

Honestly, the differences between formatting for JSON vs XML for a fixed data set are pretty minor. Do it in XML first and then in JSON. There are libraries for both.
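As a rough illustration of how close the two are, here is a sketch that writes the same small data set out both ways using only Python's standard library (`json` and `xml.etree.ElementTree`). The `manual` structure and the element names (`manual`, `section`, `heading`, `para`) are invented for the example; a real publishing spec would dictate its own.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical structured document: a title plus numbered sections.
manual = {
    "title": "Widget User Manual",
    "sections": [
        {"heading": "Safety", "paragraphs": ["Unplug before cleaning."]},
        {"heading": "Setup", "paragraphs": ["Attach the base.", "Plug it in."]},
    ],
}

def to_json(doc: dict) -> str:
    """Serialize the document as indented JSON."""
    return json.dumps(doc, indent=2)

def to_xml(doc: dict) -> str:
    """Serialize the same document as XML with made-up element names."""
    root = ET.Element("manual")
    ET.SubElement(root, "title").text = doc["title"]
    for number, section in enumerate(doc["sections"], start=1):
        sec = ET.SubElement(root, "section", number=str(number))
        ET.SubElement(sec, "heading").text = section["heading"]
        for para in section["paragraphs"]:
            ET.SubElement(sec, "para").text = para
    return ET.tostring(root, encoding="unicode")

print(to_json(manual))
print(to_xml(manual))
```

The JSON version falls straight out of the data, while the XML version needs a few decisions (which values become elements, which become attributes). For a fixed spec those decisions are already made for you.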
posted by The_Vegetables at 1:41 PM on February 4, 2020
