What's going on with Instagram's Japanese -> English translation?
February 12, 2021 6:25 AM

I follow a bunch of Japanese Instagram accounts, and the auto-translate frequently does this weird thing where a word is repeated over and over in the translation when it pretty obviously isn't in the original text. Examples inside.

E.g., here's the original text on a recent post:
"#pizzatoru #pizza #ぴざとる #ピザトル #bushbaby#ブッシュベイビー #ショウガラゴ#galago #monkey #猿 #🐵"

Here's the translation:
"Yummy #🐵 #🐵 #🐵 #🐵 #🐵 #🐵 #🐵 #🐵 #🐵 #🐵 "

(That's a monkey emoji, in case it's not showing up.)

In other cases it does the same thing with normal text (i.e. not just with a collection of hashtags like the above). It'll come out something like "Today I went out for a walk, and then a, and then a, and then a, and then a, and then a"

I know the state of machine translation from Japanese is not great, but this one seems particularly bizarre. Anyone have an idea why this might happen?
posted by Jobst to Computers & Internet (5 answers total)
What is the actual translation of those Japanese hashtags? And to be clear, it did not simply pass through the English hashtags (e.g. #pizza)?

In modern machine translation engines, the model must frequently deal with novel, out-of-vocabulary words. The way it handles this is by looking at neighboring words and finding something, anything, that might be a related word that could fit in that slot.

It's possible they're treating each and every hashtag as a new word rather than stripping the pound sign. Even stripping doesn't always work, as with something like #WearAMask, where several words are concatenated (I don't know if the Japanese internet does that). This could result in a sentence with one or two translatable words at either end and, in between, a sequence of tokens that the system simply doesn't know and needs to find alternatives for.
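
For illustration only, here is a minimal Python sketch of the kind of preprocessing described above: strip the pound sign, then try to split a CamelCase tag into words. The function name `split_hashtag` and the regex are hypothetical and say nothing about what Instagram's pipeline actually does.

```python
import re

# Hypothetical helper: NOT Instagram's code, just an illustration.
# Matches a capitalized word, a run of capitals (acronym or single letter),
# a lowercase word, or a run of digits.
_WORD = re.compile(r"[A-Z][a-z]+|[A-Z]+(?![a-z])|[a-z]+|\d+")

def split_hashtag(tag: str) -> list[str]:
    """Strip the leading '#' and split CamelCase tags into words."""
    body = tag.lstrip("#")
    parts = _WORD.findall(body)
    # If the pattern did not account for every character (e.g. Japanese
    # text or emoji), give up and return the tag body as a single token.
    if "".join(parts) != body:
        return [body]
    return parts

print(split_hashtag("#WearAMask"))  # ['Wear', 'A', 'Mask']
print(split_hashtag("#ピザトル"))    # ['ピザトル']
```

Japanese has no letter case, so a tag like #ピザトル comes back as one unsplit token, which is exactly the sort of thing a model might not have in its vocabulary.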

The other possibility is that they rolled out a new model that is just completely 100% broken :-)
posted by scolbath at 6:32 AM on February 12, 2021

What is the actual translation of those Japanese hashtags?

The Japanese hashtags are nearly the same as the English ones: pizzatoru (twice, two different character sets, and translating it is beyond me), bushbaby, shougalago (Senegal bushbaby or lesser galago, says Wikipedia) and monkey. So in total, you've got:

#pizzatoru #pizza #pizzatoru #pizzatoru #bushbaby #bushbaby #lesser galago #galago #monkey #monkey #🐵
posted by ManyLeggedCreature at 7:23 AM on February 12, 2021

Best answer: It's a known phenomenon in natural language processing called text degeneration, and it happens with many neural language models, including GPT-2.
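
A toy sketch of how such a loop can arise, assuming a made-up next-word table (`NEXT`) rather than any real model: if the decoder always picks the single most likely next word and the most likely path contains a cycle, the output repeats until it hits the length limit. The second function shows one common style of mitigation, refusing to produce the same word pair twice. All names and probabilities here are invented for illustration.

```python
# Invented next-word probabilities; "<s>" starts and "</s>" ends a sentence.
NEXT = {
    "<s>":    {"today": 1.0},
    "today":  {"i": 1.0},
    "i":      {"walked": 1.0},
    "walked": {"and": 0.6, "</s>": 0.4},
    "and":    {"then": 1.0},
    "then":   {"a": 0.6, "i": 0.4},
    "a":      {"and": 0.7, "</s>": 0.3},
}

def greedy_decode(max_len=12):
    """Always take the most likely next word; can get stuck in a cycle."""
    tokens, current = [], "<s>"
    while len(tokens) < max_len:
        current = max(NEXT[current], key=NEXT[current].get)
        if current == "</s>":
            break
        tokens.append(current)
    return tokens

def decode_no_repeat_bigram(max_len=12):
    """Like greedy decoding, but never emit the same word pair twice."""
    tokens, current, seen = [], "<s>", set()
    while len(tokens) < max_len:
        ranked = sorted(NEXT[current], key=NEXT[current].get, reverse=True)
        choice = next((t for t in ranked if (current, t) not in seen), None)
        if choice is None or choice == "</s>":
            break
        seen.add((current, choice))
        tokens.append(choice)
        current = choice
    return tokens

print(" ".join(greedy_decode()))
# today i walked and then a and then a and then a
print(" ".join(decode_no_repeat_bigram()))
# today i walked and then a and
```

The blocked version stops the loop but is not suddenly fluent; it only shows that the repetition comes from the decoding strategy interacting with the model's probabilities, not from anything in the source text.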
posted by typify at 7:41 AM on February 12, 2021 [4 favorites]

Basically what typify said. This is a weird artifact that sometimes happens when using neural networks for text generation.
posted by mekily at 8:05 AM on February 12, 2021

Response by poster: Thanks, typify! That's super interesting.
posted by Jobst at 11:29 AM on February 12, 2021
