Do I have to turn このファイル.dat into 00001.dat?
June 24, 2008 5:51 PM
Is there some way to bulk rename files from kanji/kana to romaji? Or any other solution to keep some of the Japanese file name while only using ASCII characters?
I have about 3,000 files that contain at least 1 kanji or kana character. I need to use them with software that refuses to load any file with non-ASCII characters in its name. There is no alternative to this software and tech support's answer is "sorry, we'll put this on the wish list".
It would be terribly useful to retain some of the meaning of the original names of the files, but renaming them by hand would take more time than I have. As an absolutely last resort I'll just bulk rename them with arbitrary numbers, but only as a last resort. I don't need English translation (though that would be okay) but something like romaji readings of the Japanese.
Any ideas, cleverness or scripting tricks?
Windows preferred but Mac answers welcome. Need not be free.
If you insist on readable results, kana to romaji is doable, but kanji to romaji probably isn't -- written Japanese is hard to segment and sometimes simply ambiguous. If you don't care about readability, I'd suggest converting all the file names to UTF-8 (or UTF-7 if you're really restricted to 7-bit ASCII). You could then losslessly convert them back afterwards.
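A minimal sketch of the lossless, unreadable route in Python. Note that it swaps in percent-encoding of the UTF-8 bytes rather than UTF-7 itself, because UTF-7's base64 alphabet includes "/", which can't appear in a file name. The function names are made up for illustration.

```python
from urllib.parse import quote, unquote

def to_ascii_name(name):
    """Percent-encode every non-ASCII character (as UTF-8 bytes)."""
    # safe="" means nothing extra is left alone; letters, digits
    # and "_.-~" are never encoded, so the extension survives.
    return quote(name, safe="")

def from_ascii_name(name):
    """Reverse to_ascii_name() and recover the original name."""
    return unquote(name)

encoded = to_ascii_name("このファイル.dat")
print(encoded)
print(from_ascii_name(encoded))
```

Each kana or kanji becomes nine ASCII characters this way, so long names may run into path-length limits.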
posted by The Tensor at 7:30 PM on June 24, 2008
The problem is that most kanji have multiple pronunciations. If the kanji is part of a word, you can't translate it into romaji with a simple script. It would take a dictionary lookup.
For example, 米
It can mean "rice", or "metre", or "USA". In terms of pronunciations, it's all of: メエトル (meetoru), ベイ (bei), マイ (mai), こめ (kome), よね (yone).
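To make the dictionary-lookup point concrete, here is a rough sketch of a greedy longest-match lookup. The word list is a tiny hand-made assumption just for this example; a real converter would need a full dictionary and would still guess wrong sometimes.

```python
# Hypothetical, hand-made word list for illustration only.
READINGS = {
    "米": "kome",       # on its own: rice
    "米国": "beikoku",  # USA
    "白米": "hakumai",  # white rice
}

def romanize(text):
    """Greedy longest-match lookup; unknown characters pass through."""
    longest = max(len(word) for word in READINGS)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so 米国 wins over 米.
        for n in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in READINGS:
                out.append(READINGS[chunk])
                i += n
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

print(romanize("米.dat"))
print(romanize("米国.dat"))
print(romanize("白米.dat"))
```

The same character comes out as kome, bei or mai depending on the word around it, which is why a character-by-character table isn't enough for kanji.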
(I've been told that there's a cottage industry in Japan of experts in obscure phonetic readings of kanji. They act as consultants for new parents who are trying to come up with names for their babies, because it's considered scandalously uncultured for someone to give their child a normal name with an easily-read spelling. As a result, it's often next to impossible to tell how to pronounce a non-famous person's name without furigana. That's what I was told. The same person, a Ph.D. in linguistics who grew up bilingual in Japanese and English, also told me that the name "Yamada Jirou" is just about the most dull, boring name it's possible for a man to have. As he told me, that's because it can be written without using any kanji beyond the first grade level.)
posted by Class Goat at 8:18 PM on June 24, 2008 [1 favorite]
I would suggest separating the translation issue from the re-naming issue.
Collect the names of the files into a machine-readable format, such as a spreadsheet. Translate the names, either mechanically or manually, so that you now have a cross-reference of old name and new name. (If you use a mechanical translation, you might want to give the cross-reference a once-over to fix any obviously bad names.)
A trivial Python script can be used to rename the files from the cross-reference.
The cross-reference is useful in case you need to go back the other way and need to know the original name of the file. Also, if you ever need to translate the files again (due to a revision in the original source, for example), then you have the cross-reference as a starting point for your next translation/renaming effort.
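Something along these lines would do the renaming step. It's only a sketch, and it assumes the cross-reference has been saved as a two-column UTF-8 CSV (old name, new name); the function name and file layout are my own assumptions.

```python
import csv
import os

def rename_from_csv(directory, xref_path):
    """Rename files in `directory` using (old_name, new_name) CSV rows.

    Returns the list of (old, new) pairs that were actually renamed.
    """
    renamed = []
    # utf-8-sig reads UTF-8 with or without the BOM some spreadsheets add.
    with open(xref_path, encoding="utf-8-sig", newline="") as f:
        for row in csv.reader(f):
            if len(row) < 2:
                continue  # skip blank or malformed rows
            old, new = row[0], row[1]
            src = os.path.join(directory, old)
            dst = os.path.join(directory, new)
            # Skip missing sources and never overwrite an existing file.
            if not os.path.exists(src) or os.path.exists(dst):
                continue
            os.rename(src, dst)
            renamed.append((old, new))
    return renamed

# Usage (paths are placeholders):
# rename_from_csv("dir_with_files_to_convert", "xref.csv")
```

Swapping the two columns in the CSV gives you the reverse mapping if you ever need the original names back.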
posted by SPrintF at 8:39 PM on June 24, 2008
Oh, another thought: before you execute your re-name process, sort the cross-reference and resolve duplicate destination names, to ensure you don't overwrite one file with another.
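A quick way to spot those collisions before renaming anything, again only as a sketch. It assumes the cross-reference is already loaded as (old, new) pairs, and it compares names case-insensitively since Windows treats "Hashi.dat" and "hashi.dat" as the same file. The sample names are illustrative homophones.

```python
from collections import defaultdict

def find_collisions(pairs):
    """Map each clashing destination name to the old names that share it."""
    by_target = defaultdict(list)
    for old, new in pairs:
        # Lower-case the key because Windows file names ignore case.
        by_target[new.lower()].append(old)
    return {t: olds for t, olds in by_target.items() if len(olds) > 1}

pairs = [
    ("橋.dat", "hashi.dat"),  # bridge
    ("箸.dat", "Hashi.dat"),  # chopsticks, same reading
    ("米.dat", "kome.dat"),
]
print(find_collisions(pairs))
```

Homophones make this more likely than you'd expect with romaji, so anything reported here needs a suffix or a hand-picked name.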
posted by SPrintF at 9:02 PM on June 24, 2008
Best answer: Well, I fiddled around with that python romaji conversion script and came up with another script which may give you okay results, depending on how well your filenames play with the conversion script. It sounds like you're okay with some degradation. I tried this on a file named "このファイル.dat" and got back "konoFUXAIRU.dat".
Knowing very little about this subject, I can't really evaluate it, but this web parser says it ought to be "kono fairu.dat"
Close enough? Then:
1. Save Ed Halley's conversion script as romaji.py
2. Save my filename batch-conversion script as filename-to-romaji.py
3. Back up your files
4. Run
python filename-to-romaji.py dir_with_files_to_convert
Corrections and improvements to my admittedly cargo-culted code are welcome. Use at your own risk and YMMV.
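For what it's worth, the "FUXAIRU" result looks like the small ァ being converted on its own instead of together with the フ before it. Here is a rough sketch of how a two-character lookup handles that. This is not Ed Halley's script or the batch script above, and the table is deliberately tiny: it covers only the kana in this one file name.

```python
# Partial table for illustration; a real one needs every kana and digraph.
KANA = {
    "こ": "ko", "の": "no",
    "フ": "fu", "イ": "i", "ル": "ru",
    "ファ": "fa",  # フ followed by small ァ is a single sound
}

def kana_to_romaji(text):
    """Convert kana to romaji, trying two-character digraphs first."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in KANA:
            out.append(KANA[pair])
            i += 2
        elif text[i] in KANA:
            out.append(KANA[text[i]])
            i += 1
        else:
            out.append(text[i])  # leave ASCII and unknown characters alone
            i += 1
    return "".join(out)

print(kana_to_romaji("このファイル.dat"))
```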
posted by wam at 9:30 PM on June 24, 2008
Response by poster: Thanks guys! I suppose I should have made it more clear that I knew I was going to bastardize some of the names, but slightly readable was better than unreadable (such as UTF). (Though as poor as my Japanese reading skill is, reading romaji is painful.) My goal was a) getting these files into the destination application, and b) having some idea of what's in the file before I open it. A one way translation is fine.
I'm going to run them through both wam's nice hack and the kinda scary looking (but possibly more robust) Russian app in the first reply, and see which gives me better results. (And yes, it should be "konofairu.dat" :)
Class Goat: I've never heard of people looking for obscure names; however, I know many people (mostly girls and women) who have given names that are 100% kana, which any schoolchild can read. Though I have noticed in the last year or two a fad to improve the average person's kanji knowledge, including pronunciation, meaning and strokes. Perhaps it's a result or cause of that.
Can't think of a curse much worse than giving my kid a name that no one can write or spell. Then again I knew a man whose name was 4 syllables long, but the kanji took over 60 strokes to write.
posted by Ookseer at 10:42 PM on June 24, 2008
You want Kakasi. It's not perfect, and it's a pain to use (but there's a Perl module front-end that can help). It does kanji/kana -> romaji conversion using a basic dictionary. You'll have to write a script of some sort and do a lot of cleanup work afterward.
posted by zengargoyle at 2:55 AM on June 25, 2008
This thread is closed to new comments.
If you don't like that one, here's another kana-romaji python script that you might be able to use.
posted by wam at 6:21 PM on June 24, 2008