We are not the masters of subtitling yet
December 16, 2007 6:39 PM

I (well, my boss) am/is inputting Chinese-language SubStation Alpha subtitles into VirtualDub for DVDs, and what comes out is, surprise surprise, piles of Unicode. The citizens of the PRC are not known to be fluent in Unicode, so what could be going wrong, and how do I, his designated software monkey, fix it?

Chinese language packs are installed, for the record.
posted by saysthis to Computers & Internet (13 answers total) 1 user marked this as a favorite
 
Response by poster: Also, the subtitling plugin used is called subtitler.vdf.
posted by saysthis at 6:50 PM on December 16, 2007


Someone has to ask -- what do you mean by "piles of unicode"?

Obviously what you've got isn't Chinese characters, it's something else?
posted by AmbroseChapel at 6:53 PM on December 16, 2007


Response by poster: Specifically, I mean the subtitles are gibberish onscreen. In the SSA they're Chinese (at least on my screen). After rendering, once you turn on the Chinese subtitles, they're just...code.
posted by saysthis at 7:01 PM on December 16, 2007


Aren't subtitles on DVDs pictures? By pictures I mean not text. The last time I tried to rip subtitles from a DVD, I had to use OCR software because there is no kind of text representation of the subtitles. Isn't this the standard method for ripping subtitles?

If this is so, then it seems like the characters are not being converted to "pictures".

By piles of Unicode, he means Unicode representations of the characters. For each Chinese character you would have five ASCII characters. Unicode-enabled systems translate these via a lookup table to Chinese characters.

I know that doesn't help too much, but maybe it makes the question clearer.
posted by strangeguitars at 7:32 PM on December 16, 2007


Subtitles are just very complicated text files; this is why you can choose between French and Spanish on the same DVD. If they were just pictures, it would mean an additional copy of the movie for each translation. Same with audio. You have a video stream, multiple audio streams and multiple subtitle streams. I'm surprised the OP has a problem, because DVD stuff is pretty much an international standard (except for region codes, which you can get around...).

Probably something like pairing a Big5 font (Big5 being an old-school encoding) with a UTF-8 subtitle file. But I'll put off further speculation in the hopes that somebody else knows more.
posted by zengargoyle at 7:46 PM on December 16, 2007


Zengargoyle, that's not correct.

The subtitles are a separate video stream, which is two bits deep, encoding four colors. One of the colors is "transparent" and you get to pick the other three. That's why you could see "follow the white rabbit" on the Matrix DVD.

Subtitling packages are handed text and timecodes, and render the text into 2-bitplane graphic images, which are then encoded in the VOB file as a separate video stream.
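
To make "two bits deep, four colors" concrete, here is a minimal sketch of packing color indices 0 to 3 four to a byte. It is only an illustration of 2-bit pixels: the function names are made up for this example, and the real DVD subpicture stream uses its own run-length coding, which this does not attempt to reproduce.

```python
# Illustrative only: plain 2-bits-per-pixel packing, NOT the actual DVD
# subpicture format. pack_2bpp / unpack_2bpp are hypothetical names.

def pack_2bpp(pixels):
    """Pack color indices 0..3 four to a byte, leftmost pixel in the high bits."""
    if any(not 0 <= p <= 3 for p in pixels):
        raise ValueError("each pixel must be a color index from 0 to 3")
    # Pad with index 0 (assumed here to be the transparent color) to a multiple of 4.
    padded = list(pixels) + [0] * (-len(pixels) % 4)
    out = bytearray()
    for i in range(0, len(padded), 4):
        a, b, c, d = padded[i:i + 4]
        out.append(a << 6 | b << 4 | c << 2 | d)
    return bytes(out)


def unpack_2bpp(data, count):
    """Recover the first `count` color indices from packed bytes."""
    pixels = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            pixels.append((byte >> shift) & 0b11)
    return pixels[:count]
```

Four pixels fit in each byte, which is why a full-screen overlay in four colors stays small compared with the main video.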

The OP's problem is that rendering step. Instead of taking the text and interpreting it two bytes at a time to produce hanzi, it's taking it one byte at a time and producing garble. It is then taking the garble, rendering it to 2-bitplane images, and adding that to the VOB file as a separate video stream -- which is displaying the garble when the DVD is played.
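
The one-byte-versus-two-bytes mix-up can be reproduced in a few lines of Python 3. This is a sketch under assumptions: GB2312 as the subtitle file's encoding and Latin-1 as the renderer's single-byte interpretation are both guesses, since the thread does not establish which encodings are actually involved.

```python
# Assumed encodings for illustration: GB2312 for the text, Latin-1 for the
# renderer's (wrong) byte-at-a-time reading.
text = "国家标准码"

# GB2312 stores each of these hanzi as two bytes.
raw = text.encode("gb2312")

# Correct reading: two bytes are consumed per character.
two_at_a_time = raw.decode("gb2312")

# Wrong reading: every byte becomes its own Western character.
one_at_a_time = raw.decode("latin-1")

print(len(text), len(raw))        # 5 10
print(raw.hex())                  # b9fabcd2b1ead7bcc2eb
print(two_at_a_time == text)      # True
print(ascii(one_at_a_time))       # ten accented letters and symbols, escaped
```

The second decode never raises an error, because any byte is a valid Latin-1 character. That is why this kind of failure shows up as garble on screen rather than as a warning.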
posted by Steven C. Den Beste at 8:06 PM on December 16, 2007 [1 favorite]


No, DVD format subtitles (.vob/.sub format) are in fact pictures. They are pictures of text, which are then overlaid on top of the movie. There are two parts (.vob and .sub): one is the text pictures, and one is a text file containing the exact times to overlay the pictures on the video.

That said, I'm not familiar enough with VirtualDub to be of much help. Maybe try some video forums; doom9 has a subtitle forum.
posted by p3t3 at 8:06 PM on December 16, 2007


Should have previewed. SCDB said it better ;)
posted by p3t3 at 8:07 PM on December 16, 2007


On VirtualDub's page: "Why does my foreign-language text appear garbled?"
posted by Steven C. Den Beste at 9:16 PM on December 16, 2007


sweet, I enjoy being wrong... :) I guess that in the *few* 'DVD' rips of things I've downloaded the subbers did OCR on the images or something... because they're .mkv and have a video, 3 audio and 2 or 3 .srt streams. Somebody is doing a lot of work... so bad reverse engineering guesswork on me.
posted by zengargoyle at 9:31 PM on December 16, 2007


Zengargoyle, I have a program somewhere that does exactly that. It relies on the fact that the character renderer is extremely regular, and the fact that the characters are not continuous. So it physically parses the subtitle graphics into characters, and every time it runs into one it doesn't know, it displays it on the screen and asks you to enter the appropriate character. Once you've done that, it will handle that character properly for the rest of the run.
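
A rough sketch of that approach, not the actual program: split the subtitle bitmap into glyphs wherever a column is blank, then look each glyph up in a table that grows as the user answers prompts. The names `split_glyphs`, `recognize`, and `ask` are invented for this example, and word spacing is ignored to keep it short.

```python
def split_glyphs(bitmap):
    """Split a bitmap (list of equal-length rows of 0/1) at blank columns."""
    width = len(bitmap[0])
    glyphs, start = [], None
    # Run one column past the end so a glyph touching the right edge is closed.
    for x in range(width + 1):
        inked = x < width and any(row[x] for row in bitmap)
        if inked and start is None:
            start = x
        elif not inked and start is not None:
            glyphs.append(tuple(tuple(row[start:x]) for row in bitmap))
            start = None
    return glyphs


def recognize(bitmap, known, ask):
    """Return the text for a bitmap, calling ask(glyph) once per unseen shape."""
    chars = []
    for glyph in split_glyphs(bitmap):
        if glyph not in known:
            known[glyph] = ask(glyph)  # e.g. show the glyph and read a keypress
        chars.append(known[glyph])
    return "".join(chars)
```

Because `known` persists across calls, each distinct glyph shape is asked about only once, which matches the "prompted a lot early" behavior described above.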

It's actually not very annoying to use; you get prompted a lot early, but 25 or 30 prompts is usually all you'll deal with, just because of how character frequencies in English work.

Some alternate player programs support a different way of doing subtitles, in which the timecodes and the subtitle texts are placed in a text file which has the same basename as the video file they're associated with, but a different file extension. That may have been the source of your confusion.
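
For illustration, the same-basename lookup might be sketched like this. The extension list is an assumption, not taken from any particular player, and `find_sidecar_subtitles` is a hypothetical name.

```python
from pathlib import Path

# Assumed set of text subtitle extensions; real players differ in what they check.
SUBTITLE_EXTENSIONS = (".srt", ".ssa", ".ass", ".sub")


def find_sidecar_subtitles(video_path):
    """Return subtitle files that share the video's basename and directory."""
    video = Path(video_path)
    candidates = (video.with_suffix(ext) for ext in SUBTITLE_EXTENSIONS)
    return [path for path in candidates if path.is_file()]
```

Given `movie.avi`, this would pick up `movie.srt` sitting next to it but not `other.srt`.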
posted by Steven C. Den Beste at 9:54 PM on December 16, 2007


You're not seeing Unicode. There's no such visible thing. You're (perhaps) seeing an "encoding" of Unicode values. The problem is that it's the wrong encoding.

The question to ask is "what encoding does my renderer expect?" Then, if the text you have really is Unicode (a big "if"), you can re-encode it using something like Python. Assuming it has to be encoded to GB2312-80:

$ python
>>> "国家标准码".decode("utf8").encode("GB2312-80") # if you have utf8
'\xb9\xfa\xbc\xd2\xb1\xea\xd7\xbc\xc2\xeb'
>>> u'\u56fd\u5bb6\u6807\u51c6\u7801'.encode("GB2312-80") # if you have unicode
'\xb9\xfa\xbc\xd2\xb1\xea\xd7\xbc\xc2\xeb'
>>> #
>>> # Smarter to read from a file
>>> import codecs
>>> f = codecs.open('foo', encoding='bar') # where "foo" is your filename, and "bar" is its encoding
>>> for line in f:
...     print repr(line)
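
The session above is Python 2. A Python 3 sketch of the same idea, converting a whole subtitle file in one go, might look like the following. The default encodings are assumptions (UTF-8 in, GB2312 out), and `transcode` is a name made up for this example; what the renderer really expects still has to be checked.

```python
def transcode(src, dst, src_encoding="utf-8", dst_encoding="gb2312"):
    """Rewrite the text file `src` as `dst` in a different encoding."""
    # newline="" keeps the original line endings (subtitle files often use CRLF).
    with open(src, "r", encoding=src_encoding, newline="") as fin:
        text = fin.read()
    # The default strict error handling raises UnicodeEncodeError if a
    # character does not exist in the target encoding, rather than
    # silently writing garbage.
    with open(dst, "w", encoding=dst_encoding, newline="") as fout:
        fout.write(text)
```

A call such as `transcode("subs.ssa", "subs-gb.ssa")` (hypothetical filenames) leaves the original untouched, so a wrong guess about the encoding costs nothing.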
posted by cmiller at 5:24 AM on December 17, 2007


The web ate some significant whitespace before that last line, "space space print...".
posted by cmiller at 5:26 AM on December 17, 2007

