Best solution for ad-hoc parsing and reformatting of text?
February 22, 2016 2:33 PM

I handle a lot of screenplays, recording scripts, game dialogue text tables, localization tables, and other text-related files. And I have a need to output to varying formats with varying goals--sometimes to reformat a screenplay to something more standard or to put text in a format that is ready to use by a game or other software. For simplicity's sake, assume that all input and output formats would be plain text. What is a good software tool to help with these format conversions?

I'm looking for a swiss army knife. Something quick and easy, but powerful, and helps me avoid making mistakes. I'm willing to spend money, but ideally, the software would cost less than $200. I'm a programmer, but feel that there should be a mostly non-code-oriented solution out there. I picture something where you set up rules on processing the input file, and the tool quickly generates previews of output.

Let me explain the non-structured aspect of the input files. Take as an example, a screenplay, which follows a format like:

BOB
(nervously)
Do we have to do this now?

JANE
I'm afraid so!

I'd like to be able to take something like that as an input and convert it to some structured output like:

'Bob', 'nervously', 'Do we have to do this now?'
'Jane', '', 'I'm afraid so!'

The variations on this conversion are endless. I'll be given screenplays in slightly different formats, some of them very janky. And the output will need to be specialized per project for different clients. Hence the desire for a strong ad hoc tool instead of just investing time in coding up a converter utility or some regex-based shell script.
posted by ErikH2000 to Computers & Internet (12 answers total) 2 users marked this as a favorite
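
For reference, the conversion described in the question can be sketched in a few lines of Python. This is only a sketch under stated assumptions: each block of the input is taken to be a speaker name on its own line, an optional parenthetical on the next line, then the dialogue, with blank lines between blocks. The function names `parse_dialogue` and `to_csv` and the `SAMPLE` text are made up for illustration, and the output uses standard CSV double quotes rather than the single quotes shown in the question.

```python
import csv
import io
import re


def parse_dialogue(text):
    """Turn blank-line-separated blocks into (speaker, direction, line) rows."""
    rows = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) < 2:
            continue  # not a dialogue block under the assumed layout
        speaker, rest = lines[0].title(), lines[1:]
        direction = ""
        # An optional "(parenthetical)" line directly under the speaker name.
        if rest[0].startswith("(") and rest[0].endswith(")"):
            direction, rest = rest[0][1:-1], rest[1:]
        rows.append((speaker, direction, " ".join(rest)))
    return rows


def to_csv(rows):
    """Render rows as fully quoted CSV text."""
    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Assumed sample input, not taken from a real script.
SAMPLE = """BOB
(nervously)
Do we have to do this now?

JANE
I'm afraid so!
"""

if __name__ == "__main__":
    print(to_csv(parse_dialogue(SAMPLE)), end="")
```

Every janky input variant would need its own tweak to `parse_dialogue`, which is exactly the maintenance cost the question is trying to avoid.
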
Best answer: the closest i know, and what i would likely use for similar tasks, is emacs' keyboard macros. but that's horrendously old-school, and you need to be familiar with a wide range of emacs commands for it to be useful.

xml style sheets were intended to be an answer to the second two thirds of this problem (a standard representation format and an arbitrary formatting), but no, you don't want to go there.

not really what you asked, but i wonder whether markdown would be a good intermediate target. if the documents are text they are already likely close to markdown. once you edit them so they "really are" then you can use pandoc to convert to other formats. that's how i support multiple formats for documentation at work.

i suspect the more general tool you're asking for is ai complete. but i am also hoping there's some cool app that someone will post that gets much of the way there...

edit: there's some confusion in both my answer and your question about the meaning of "format" - whether it's the arrangement of text, or the kind of file (text, word, pdf, etc).
posted by andrewcooke at 2:46 PM on February 22, 2016 [2 favorites]

Unless the inputs can be very carefully defined, this is what Perl is for. The really hard part isn't the coding, but the validation of the output. There will always be one script where there's a blank line missing, or the text isn't really at the start of the line, or …
posted by scruss at 3:39 PM on February 22, 2016 [3 favorites]
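
The validation problem scruss describes can be attacked with a few cheap sanity checks on the parsed rows. The sketch below is in Python rather than Perl and rests on assumptions: rows are `(speaker, direction, line)` tuples, a cast list is available, and speaker names appear in upper case in the source. The name `validate_rows` and the specific checks are illustrative, not a complete validator.

```python
def validate_rows(rows, known_speakers):
    """Return human-readable warnings for rows that look suspicious."""
    warnings = []
    upper_names = {name.upper() for name in known_speakers}
    for i, (speaker, _direction, line) in enumerate(rows, start=1):
        if speaker not in known_speakers:
            warnings.append(f"row {i}: unknown speaker {speaker!r}")
        if not line.strip():
            warnings.append(f"row {i}: empty dialogue")
        # A missing blank line tends to glue the next speaker's name
        # into the middle of the previous speaker's dialogue.
        if upper_names & set(line.split()):
            warnings.append(f"row {i}: dialogue contains a speaker name, possible merged block")
    return warnings


if __name__ == "__main__":
    rows = [
        ("Bob", "nervously", "Do we have to do this now? JANE I'm afraid so!"),
        ("Jnae", "", "Fine."),
    ]
    for warning in validate_rows(rows, {"Bob", "Jane"}):
        print(warning)
```

Checks like these will not catch everything, but they flag the rows worth reading by eye.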

Do plugins for SublimeText get you part of the way there? [ref]
posted by gregglind at 3:46 PM on February 22, 2016

Yes, Perl is great for this. You might have to invest in writing the scripts initially, but then you should be able to reuse them.
posted by Ender's Friend at 7:38 PM on February 22, 2016 [1 favorite]

sed and awk might also be to your liking as a more restrained (and admittedly old-school) alternative to Perl.

Sadly, programming is probably your best bet.

It's not even difficult programming. But tools like Perl have an unfortunate learning curve, because they do lots of other stuff.
posted by schmod at 7:59 PM on February 22, 2016 [1 favorite]

I'm not sure about this, as I've only read about it in the ProfHacker column rather than used it myself, but Pandoc might be the kind of thing you're looking for. Here's an overview by one of the ProfHacker columnists:
posted by davemack at 3:16 AM on February 23, 2016

Notepad++ can do things like this.
posted by soelo at 8:04 AM on February 23, 2016

after schmod's comment, awk is even better for this than Perl, as it's a tool to reformat text that just happens to also be a Turing-complete language. It might not be so good on Unicode text as Perl, though.
posted by scruss at 10:27 AM on February 23, 2016 [1 favorite]

You might want to take a look at fountain, which is a plain-text markup language for screenplays, similar to Markdown. It converts the plain text to a tagged format that might be overkill for you.
posted by adamrice at 1:09 PM on February 23, 2016 [1 favorite]

Response by poster: I think this conversation has helped me clarify the problem a bit. If I take out the simpler or less-frequently-occurring tasks, I'm left with... wanting a way to get widely varying screenplay formats into a consistent, structured format. That format could be Fountain, FinalDraft's XML-based format, or even a relational database. (For my purposes, probably a database.) Once structured, then it can be output in various ways with code that needs little maintenance.

The input side is where all the ad-hockery is unavoidable. I was hoping somebody would mention a tool that is faster than writing a bunch of parsing code. But it seems there's nothing like that. And it may be a wish for oversimplification of the task on my part.
posted by ErikH2000 at 10:03 PM on February 23, 2016
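
The "structure first, output later" idea in the follow-up above could look roughly like this with Python's built-in `sqlite3` module. The table name `dialogue`, its columns, and the helper names `load_rows` and `lines_for` are all assumptions made for the sketch; a real project would likely need scene, take, or locale columns as well.

```python
import sqlite3


def load_rows(rows):
    """Store (speaker, direction, line) rows in an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dialogue ("
        " seq INTEGER PRIMARY KEY,"  # preserves script order
        " speaker TEXT NOT NULL,"
        " direction TEXT NOT NULL DEFAULT '',"
        " line TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO dialogue (speaker, direction, line) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return conn


def lines_for(conn, speaker):
    """One possible output: every line for a single speaker, in script order."""
    cur = conn.execute(
        "SELECT seq, line FROM dialogue WHERE speaker = ? ORDER BY seq", (speaker,)
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = load_rows([
        ("Bob", "nervously", "Do we have to do this now?"),
        ("Jane", "", "I'm afraid so!"),
    ])
    print(lines_for(conn, "Jane"))
```

Once the rows are in a table, each client-specific output becomes a query plus a small formatter, and none of it needs to know how messy the original input was.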

This is how I restructure that type of text in Word.

Ctrl+F > select Replace from the drop-down menu. Select Advanced. Select Special. Select Paragraph mark (^p), then do the same again so you have ^p^p. Then in the "Replace with" box write "abcxyz" (or something that is definitely not in the actual text) with a space before and after. Then replace. This will find all the consecutive paragraph marks (such as the two between "now?"^p

Now do the same thing but just find all the single paragraph marks and replace with a space. That will bring all your text onto one line.
Then find all the "abcxyz" and replace those with a ^p paragraph mark.
Works for me.
posted by guy72277 at 2:29 AM on February 24, 2016
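
The same three-step placeholder trick works outside Word. Here is a sketch in Python, assuming paragraph marks are plain `\n` characters and that blocks are separated by exactly one blank line; the function name `unwrap_paragraphs` is made up for the example.

```python
PLACEHOLDER = "abcxyz"  # must not occur anywhere in the real text


def unwrap_paragraphs(text, placeholder=PLACEHOLDER):
    """Join each blank-line-separated block onto a single line."""
    if placeholder in text:
        raise ValueError("placeholder already occurs in the text")
    # Step 1: protect the double line breaks that separate blocks.
    text = text.replace("\n\n", f" {placeholder} ")
    # Step 2: turn every remaining single line break into a space.
    text = text.replace("\n", " ")
    # Step 3: restore one line break wherever a block boundary was.
    return text.replace(f" {placeholder} ", "\n")


if __name__ == "__main__":
    sample = "BOB\n(nervously)\nDo we have to do this now?\n\nJANE\nI'm afraid so!"
    print(unwrap_paragraphs(sample))
```

Unlike the manual version, the function refuses to run if the placeholder already appears in the text. Runs of three or more line breaks leave a stray space behind, so normalizing blank lines first is a good idea.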

Best answer: You've gotten some great tips above for Notepad++ and Word. I'm on a Mac, and I love TextWrangler for this sort of thing.

You can create little regex macros to look for things like guy72277 described for Word (or even [full line] return [full line] return [full line] return return) and then save and re-use them if they're likely to be useful again.

Something quick and easy, but powerful, and helps me avoid making mistakes. I'm willing to spend money, but ideally, the software would cost less than $200.

In my experience, TextWrangler is very quick, reasonably easy to use for such a variable task, and extremely powerful. It may not help you avoid making mistakes, but its Undo is excellent.

And it's free.
posted by kristi at 8:40 AM on February 25, 2016
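
The idea of saved, reusable regex macros can also be expressed as a small rule list in code. This is a sketch in Python, not anything TextWrangler provides: the `RULES` entries and the `apply_rules` name are assumptions, and the per-rule match counts are a crude stand-in for the preview the question asks about.

```python
import re

# Each rule: (description, compiled pattern, replacement). Order matters.
RULES = [
    ("strip trailing whitespace", re.compile(r"[ \t]+$", re.MULTILINE), ""),
    ("collapse runs of blank lines", re.compile(r"\n{3,}"), "\n\n"),
]


def apply_rules(text, rules=RULES):
    """Apply each rule in order; return the new text and a per-rule match count."""
    report = []
    for name, pattern, replacement in rules:
        text, count = pattern.subn(replacement, text)
        report.append((name, count))
    return text, report


if __name__ == "__main__":
    messy = "BOB  \nHello\t\n \n\n\nJANE\nHi"
    cleaned, report = apply_rules(messy)
    print(cleaned)
    for name, count in report:
        print(f"{name}: {count} replacement(s)")
```

Keeping the rules as data means a per-client rule list can be saved alongside the project and rerun when a revised script arrives.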
