I'd like to find, or build, a tool for stripping differently encoded text from bilingual (Chinese/English) text documents. Bonus - since this is for subtitles, I need to be able to train it to recognize timecodes and unique characters like returns and hyphens. Double bonus - I know jack about programming, but since the only way I have to do this right now is by hand, which costs me hours and hours of tedium and is putting me at risk for carpal tunnel, I'm definitely willing to put in the time to learn.
I'm a subtitle translator, and because of the nature of movie production, where they rarely have a final cut even after I'm done translating and timing the movie, I usually need to make a "master" subtitle file for editing purposes. That is, a subtitle file with both languages, one of which then usually needs to be removed so I can burn a review DVD for the technically inept director or whoever's in charge of making final decisions on the subtitles.
That means that, at the moment, I have to go line by line and delete all the English or Chinese from the file. For a typical movie, with 1000+ lines of dialogue, that's two hours if I'm feeling self-destructive, and my wrists HURT afterward. Not to mention it just zones me out, and I'm usually exhausted after that kind of repetition. It's the kind of work that can destroy a day's productivity, and I usually have to do that 4-5 times per movie. (I know you'll suggest just doing it once and then editing the English, but these people sometimes add/change/remove 2-300 lines of dialogue per edit. I need to be able to look at the source language as I do it, and they, being movie people, provide zero documentation; gotta work from the master sub file or I waste even more time.)
There has got to be a better way, and I'm pretty sure that given the fact that I'm working with two differently encoded languages, it shouldn't be too terribly impossible, right? I'm imagining a tool where I can tell it "take anything that looks like GBK/BIG5 and remove it." Or the same with ASCII. Now, the trick is, it would need to be trainable to recognize and ignore the particular ASCII patterns that subtitle files use for timecodes, returns, font details, and other metadata in the subs.
If it doesn't exist, as God is my witness, I will build it! It seems simple enough. I know absolutely nothing about programming except general principles, but I know that I'd be building a tool that can work with text in .txt documents: some encoding interfaces with the character databases in whatever OS you're using, some wildcard fields for patterns to ignore, and profile memory for different subtitle formats. This is the thing I need.
If this tool doesn't exist yet, what do I need to know to build it? Is there a programming language I should focus on? How can I keep this lightweight and distribute it freely once I've got it? Does SourceForge have some sort of development platform for stuff like this?
posted by saysthis to computers & internet (8 comments total)
What you're probably going to look for if you end up coding it yourself is 'regular expressions'.
posted by jangie at 2:27 PM on November 5, 2009
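To make the regular-expression suggestion a bit more concrete, here is a rough Python sketch, not a finished tool. It rests on assumptions the question doesn't confirm: that the master file uses SRT-style timecodes (`00:01:02,345 --> 00:01:04,000`), that each dialogue line is entirely in one language, and that the text has already been decoded to Unicode. The names `strip_language`, `TIMECODE`, and `CJK` are made up for illustration.

```python
import re

# Assumed SRT-style timecode line, e.g. "00:01:02,345 --> 00:01:04,000".
TIMECODE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")
# CJK Unified Ideographs block; one match is enough to call a line Chinese.
CJK = re.compile(r"[\u4e00-\u9fff]")


def strip_language(text, remove):
    """Drop dialogue lines in one language; `remove` is 'chinese' or 'english'."""
    if remove not in ("chinese", "english"):
        raise ValueError("remove must be 'chinese' or 'english'")
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # Blank lines, cue numbers, and timecodes pass through untouched.
        if not stripped or stripped.isdigit() or TIMECODE.match(stripped):
            kept.append(line)
            continue
        is_chinese = bool(CJK.search(stripped))
        # Skip the line when its language is the one being removed.
        if (remove == "chinese") == is_chinese:
            continue
        kept.append(line)
    return "\n".join(kept)


sample = "1\n00:00:01,000 --> 00:00:03,000\n你好\nHello\n"
english_only = strip_language(sample, "chinese")
chinese_only = strip_language(sample, "english")
print(english_only)
```

For a real file, something like `open(path, encoding="gbk")` (or `"big5"`) would decode the text before it goes through the function. Other subtitle formats would need their own timecode pattern, and font tags or other metadata lines would need extra patterns in the pass-through check.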