Get that Chinese out of there!
November 5, 2009 2:10 PM   Subscribe

I'd like to find, or build, a tool for stripping differently encoded text from bilingual (Chinese/English) text documents. Bonus - since this is for subtitles, I need to be able to train it to recognize timecodes and unique characters like returns and hyphens. Double bonus - I know jack about programming, but since the only way I have to do this right now is by hand, which costs me hours and hours of tedium and is putting me at risk for carpal tunnel, I'm definitely willing to put in the time to learn.

I'm a subtitle translator, and because of the nature of movie production, where they rarely have a final cut even after I'm done translating and timing the movie, I usually need to make a "master" subtitle file for editing purposes. That is, a subtitle file with both languages, one of which then usually needs to be removed so I can burn a review DVD for the technically inept director or whoever's in charge of making final decisions on the subtitles.

That means that, at the moment, I have to go line by line and delete all the English or Chinese from the file. For a typical movie, with 1000+ lines of dialogue, that's two hours if I'm feeling self-destructive, and my wrists HURT afterward. Not to mention it just zones me out, and I'm usually exhausted after that kind of repetition. It's the kind of work that can destroy a day's productivity, and I usually have to do that 4-5 times per movie. (I know you'll suggest just doing it once and then editing the English, but these people sometimes add/change/remove 2-300 lines of dialogue per edit. I need to be able to look at the source language as I do it, and they, being movie people, provide zero documentation; gotta work from the master sub file or I waste even more time.)

There has got to be a better way, and I'm pretty sure that given the fact that I'm working with two differently encoded languages, it shouldn't be too terribly impossible, right? I'm imagining a tool where I can tell it "take anything that looks like GBK/BIG5 and remove it." Or the same with ASCII. Now, the trick is, it would need to be trainable to recognize and ignore the particular ASCII patterns that subtitle files use for timecodes, returns, font details, and other metadata in the subs.

If it doesn't exist, as God is my witness, I will build it! It seems simple enough. I know absolutely nothing about programming except general principles, but I know a tool like this just needs to work with text in .txt documents: some encoding interface with the character databases in whatever OS you're using, some wildcard fields for patterns to ignore, and profile memory for different subtitle formats. This is the thing I need.

If this tool doesn't exist yet, what do I need to know to build it? Is there a software language I should focus on? How can I keep this lightweight and distribute it freely once I've got it? Does sourceforge have some sort of development platform for stuff like this?
posted by saysthis to Computers & Internet (8 answers total) 3 users marked this as a favorite
What does the format of the file look like?

What you're probably going to look for if you end up coding it yourself is 'regular expressions'.
posted by jangie at 2:27 PM on November 5, 2009

Response by poster: Here's a sample of an Adobe Encore format subtitle file:

00:01:54:08 00:01:56:10 这就是你家老头的问题
That was his problem. He went through life
00:01:56:10 00:01:59:08 总是白命清高
on a god-damned high horse.

Which is probably the easiest format to create patterns for. Other subtitle formats get lots more complicated with formatting information after/before the timecode.
posted by saysthis at 3:00 PM on November 5, 2009

Best answer: Just to clarify, you're looking for something that would take the above example, let you click "remove all the Chinese" and just output a new text file like this:

00:01:54:08 00:01:56:10 That was his problem. He went through life
00:01:56:10 00:01:59:08 on a god-damned high horse.

Or, alternately, click "Remove all the English" and just spit back a new text file like this:

00:01:54:08 00:01:56:10 这就是你家老头的问题
00:01:56:10 00:01:59:08 总是白命清高

Correct? But you want it flexible enough to 'learn' what Russian or (insert other language name here) looks like, in order to do the same thing?
posted by bhance at 3:07 PM on November 5, 2009

Best answer: I don't know if such a tool exists. But I agree with you that it doesn't sound like a massive challenge to create.

jangie is right: you'll probably end up using regular expressions to filter out the special character sets you mention. Regular expressions are a 'language', or lexicon of sorts, for finding patterns in text, and they can do pretty sophisticated text parsing.

The beauty of them is that there are a large number of programming languages that have regex libraries, so you'll have many options for which language/environment you want to use to achieve your application.

The other thing you're going to want familiarity with is Unicode. You're not exactly on the mark when you say that you're working with two differently encoded languages; if you're keeping them in the same file, they're probably encoded the same. But their character sets will be distinct within that encoding, certainly if you're working with a Unicode document, e.g. one that's saved as UTF-8 or UTF-16. You should be able to separate the two languages pretty easily by knowing each one's code point ranges within Unicode.
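To make the code-point idea concrete, here is a minimal sketch in Python (purely illustrative; the thread itself leans toward Perl, which works the same way). The range U+4E00-U+9FFF is the main "CJK Unified Ideographs" block; real files can also use extension blocks, so treat the range as an assumption rather than a complete definition of "Chinese":

```python
# Classify characters by Unicode code point. U+4E00-U+9FFF is the
# main "CJK Unified Ideographs" block (an assumption: extension
# blocks and CJK punctuation are not covered here).
def is_cjk(ch: str) -> bool:
    return 0x4E00 <= ord(ch) <= 0x9FFF

line = "00:01:54:08 00:01:56:10 这就是你家老头的问题"
chinese_only = "".join(ch for ch in line if is_cjk(ch))
latin_only = "".join(ch for ch in line if not is_cjk(ch)).strip()
```

Running this against the sample line leaves the timecodes in `latin_only` and the Chinese dialogue in `chinese_only`, which is exactly the split the poster is after.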

Of course, from those basic ideas you could add any amount of complexity to your application just through bells and whistles, which could quickly get a non-programmer into the weeds. For what you're trying to achieve, I'd recommend a language that's well-suited to working from the command line, like Perl. It should be relatively straightforward to pick up, is cross-platform, has very nice built-in regular expression support, and should be able to do everything you're looking for.
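As a sketch of how little code the core task needs, here is the regex approach applied to the Encore sample upthread, written in Python for illustration (Perl would look nearly identical). The timecode pattern is inferred from that single sample and is an assumption; other subtitle formats would need their own patterns:

```python
import re

# Timecode pair at the start of a line, as in the Encore sample:
# "00:01:54:08 00:01:56:10 <Chinese dialogue>"
# (pattern inferred from one sample; other formats will differ)
TIMECODE = re.compile(r'^(\d{2}:\d{2}:\d{2}:\d{2} \d{2}:\d{2}:\d{2}:\d{2})')

def keep_english(text: str) -> str:
    """Drop the Chinese and pull each English line up next to its timecode."""
    out, lines, i = [], text.splitlines(), 0
    while i < len(lines):
        m = TIMECODE.match(lines[i])
        if m and i + 1 < len(lines):
            out.append(f"{m.group(1)} {lines[i + 1].strip()}")
            i += 2
        else:
            i += 1
    return "\n".join(out)

def keep_chinese(text: str) -> str:
    """Keep only the timecoded (Chinese) lines, dropping the English."""
    return "\n".join(l for l in text.splitlines() if TIMECODE.match(l))
```

Fed the four-line sample above, `keep_english` produces bhance's first example output and `keep_chinese` the second.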

Good luck!
posted by Brak at 3:08 PM on November 5, 2009

Response by poster: bhance - exactly! I should have included an example in there. Thank you for being more awake/astute than I am.

Brak - not quite smart enough to learn "Russian", but it would need to be able to ask me and create filters on the fly for things it's not certain about. I imagine learning another language would be...well...there are probably character sets somewhere in the computer that I could just load in for that. The learning capacity I'm talking about is where it asks me about anything not defined in said character set, and then gives me options like the Find & Replace function: ignore once/all, delete once/all, etc. That way I would only have to click through, at worst, 50-60 different hiccups. And then I'd want it to be able to save those settings, because most of the things I'd want it to ignore are parts of a given subtitle format. Take the file sample above: I'd want it to remember that "numbernumber:numbernumber:numbernumber:numbernumber[space]" is timing information, and not delete that! That way, the next time I have an Adobe Encore file to fiddle with, I can reload that profile and not have to create the rule again.
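The "remember what timing information looks like" idea maps directly onto a saved regular expression, so a format profile could be little more than a stored pattern. A sketch of that, assuming the Encore layout above (the profile structure here is entirely hypothetical):

```python
import re

# A "profile" as described above: a named set of patterns the tool
# should protect from deletion. Hypothetical structure, not a real API.
PROFILES = {
    "Adobe Encore": {
        # "numbernumber:numbernumber:numbernumber:numbernumber[space]",
        # twice in a row at the start of the line:
        "protect": re.compile(r'^\d{2}:\d{2}:\d{2}:\d{2} \d{2}:\d{2}:\d{2}:\d{2} ?'),
    },
}

line = "00:01:54:08 00:01:56:10 这就是你家老头的问题"
m = PROFILES["Adobe Encore"]["protect"].match(line)
protected = m.group(0) if m else ""  # the timing info to leave untouched
```

Reloading a profile then just means picking a different key; anything the "protect" pattern matches survives the strip, and everything else is fair game for the language filter.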
posted by saysthis at 3:40 PM on November 5, 2009

Response by poster: I just answered bhance's question addressed to Brak.

Do you see what this process does to me??????

Brak, thanks for the info, that looks like it'll be enough to at least point me in the right direction. Expect more clueless questions from me about Perl in the future, I guess. :)
posted by saysthis at 3:44 PM on November 5, 2009

Ok, got it. I think there are probably a half dozen mefites out there who could help you crank something close to this out in short order if you wanted some assistance and could provide a couple more file examples. Memail me if you want; I can probably send you a quick PHP example just for the Chinese ...

Perfect for a jobs post, btw
posted by bhance at 4:21 PM on November 5, 2009

A friend who also does a lot of Chinese/English subtitle work solves this by doing it all in Excel. I'm not sure of the exact mechanics, but I suspect that if you had one column of time signatures, one of Chinese, and another of English, it would be possible to generate the needed data minus the unnecessary column fairly trivially. I could be entirely wrong about that, I admit; never done it that way myself.
posted by Abiezer at 4:29 PM on November 5, 2009
