What's going on with Instagram's Japanese -> English translation?
February 12, 2021 6:25 AM
I follow a bunch of Japanese Instagram accounts, and the auto-translate frequently does this weird thing where a word is repeated many times in the translation when it pretty obviously isn't in the original text. Examples inside.
E.g., here's the original text on a recent post:
"#pizzatoru #pizza #ぴざとる #ピザトル #bushbaby#ブッシュベイビー #ショウガラゴ#galago #monkey #猿 #🐵"
Here's the translation:
"Yummy #🐵 #🐵 #🐵 #🐵 #🐵 #🐵 #🐵 #🐵 #🐵 #🐵 "
(That's a monkey emoji, in case it's not showing up.)
In other cases it does the same thing with normal text (i.e. not just with a collection of hashtags like the above). It'll come out something like "Today I went out for a walk, and then a, and then a, and then a, and then a, and then a"
I know the state of machine translation from Japanese is not great, but this one seems particularly bizarre. Anyone have an idea why this might happen?
What is the actual translation of those Japanese hashtags?
The Japanese hashtags are nearly the same as the English ones: pizzatoru (twice, two different character sets, and translating it is beyond me), bushbaby, shougalago (Senegal bushbaby or lesser galago, says Wikipedia) and monkey. So in total, you've got:
#pizzatoru #pizza #pizzatoru #pizzatoru #bushbaby #bushbaby #lesser galago #galago #monkey #monkey #🐵
posted by ManyLeggedCreature at 7:23 AM on February 12, 2021
Best answer: It's a known phenomenon in the field of natural language processing in machine learning called text degeneration, and happens with many language models, including GPT-2.
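To make the looping concrete, here is a toy sketch of how always picking the single most likely next token can trap a generator in a cycle. The `NEXT` table and its probabilities are invented for illustration; this is not Instagram's model or GPT-2, only a hand-built stand-in for one.

```python
# Hypothetical next-token probabilities, invented for this example.
NEXT = {
    "walk": {",": 0.9, "</s>": 0.1},
    ",": {"and": 1.0},
    "and": {"then": 1.0},
    "then": {"a": 1.0},
    # After "a", the comma is only slightly more likely than the
    # alternatives, but greedy decoding picks it every single time.
    "a": {",": 0.40, "nap": 0.35, "</s>": 0.25},
    "nap": {"</s>": 1.0},
}


def greedy_decode(prompt, max_new_tokens=8):
    """Repeatedly append the most probable next token (greedy decoding)."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = NEXT.get(tokens[-1])
        if dist is None:  # no known continuation for this token
            break
        best = max(dist, key=dist.get)
        if best == "</s>":  # end-of-sentence marker
            break
        tokens.append(best)
    return tokens


prompt = ["Today", "I", "went", "out", "for", "a", "walk"]
print(" ".join(greedy_decode(prompt)))
```

Once the toy model reaches "a", the comma narrowly wins, which leads back to "a" again, so the output never escapes the "and then a" loop until the token limit cuts it off. Real systems are far more complex, but picking high-probability continuations over and over can fail in a similar way.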
posted by typify at 7:41 AM on February 12, 2021 [4 favorites]
Basically what typify said. This is a weird artifact that sometimes happens when using neural networks for text generation.
posted by mekily at 8:05 AM on February 12, 2021
Response by poster: Thanks, typify! That's super interesting.
posted by Jobst at 11:29 AM on February 12, 2021
In modern machine translation engines, the model must frequently deal with novel, out-of-vocabulary words. The way it handles this is by looking at neighboring words and finding something, anything, that might be a related word that could fit in that slot.
It's possible they're treating each and every hashtag as a new word rather than stripping the pound sign. Even stripping doesn't always work, as in the case of something like #WearAMask, where words are concatenated (I don't know if the Japanese internet does that). This could result in a sentence with one or two translatable words at either end and, in between, a sequence of tokens that the system simply doesn't know and needs to find alternatives for.
The other possibility is that they rolled out a new model that is just completely 100% broken :-)
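As a rough illustration of the hashtag point, here is a minimal sketch of stripping the pound sign and splitting a concatenated tag on capital letters. The `split_hashtag` helper is hypothetical, and nothing here is known about how Instagram actually preprocesses hashtags.

```python
import re


def split_hashtag(tag):
    """Strip the leading '#' and split the rest on case boundaries.

    Hypothetical helper for illustration only.
    """
    body = tag.lstrip("#")
    # Alternatives, in order: an acronym or single capital not followed by
    # a lowercase letter, a (possibly capitalized) lowercase word, a run of
    # digits, or a run of anything else (e.g. Japanese script or emoji).
    pattern = r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+|[^A-Za-z\d]+"
    return re.findall(pattern, body)


print(split_hashtag("#WearAMask"))
print(split_hashtag("#pizza"))
print(split_hashtag("#ブッシュベイビー"))
```

The English tag splits into separate words, but the Japanese tag comes back as one unbroken chunk, because Japanese has neither capital letters nor spaces to split on. A system relying on this kind of heuristic would still see such tags as single unknown tokens.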
posted by scolbath at 6:32 AM on February 12, 2021