What word has the most špeĉiäl chârãçtérs?
June 3, 2017 5:58 PM

Many languages add diacritics or special characters to the Latin alphabet. What word—a real word, not a made-up one—contains the most such characters? For example, if there was a Lake Ƶúṗė̃ṝčâⱡìᶂɍāğíļîs̊tɨçe̋x̊p̆ȉäłīðỗç̇īǿüs̥ in Lithuania, that would qualify.
posted by dontjumplarry to Writing & Language (6 answers total) 4 users marked this as a favorite
 
Vietnamese has a lot of diacritics because some mark tone. "Sự-phản-đối-việc-tách-nhà-thờ-ra-khỏi-nhà-nước" (antidisestablishmentarianism), for example, has a dozen at least.
posted by Paragon at 6:19 PM on June 3, 2017 [4 favorites]


Are you looking purely for a-z Latin letters with diacritics? I think 'đ' doesn't actually decompose into a 'd' plus a diacritic (at least as far as Unicode is concerned).

If anybody has dictionary/spelling files from suitable languages, this is an interesting little bit of programming.
posted by zengargoyle at 10:03 PM on June 3, 2017


Best answer: A bit of Perl 6 in need of source material.

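# Note: the leading "S" of "Sự" is missing from this string, so the length is 45 rather than 46.
my $s = "ự-phản-đối-việc-tách-nhà-thờ-ra-khỏi-nhà-nước";
say "length {$s.chars}";
# Keep the decomposed codepoints whose Grapheme_Cluster_Break property is 'Other', i.e. drop the combining marks.
my $s1 = $s.NFD.list.grep({ .uniprop('GCB') eq 'Other' });
say "decomposed codepoints {$s.NFD.elems}";
say "not-diacritic {$s1.elems}";
say "diacritics {$s.NFD.elems - $s1.elems}";
# Non-ASCII base codepoints such as 'đ' have no decomposition.
say "non-decomposable {$s1.grep(* > 127).elems}";
say "diacritic and non-decomposable {$s.NFD.elems - $s1.elems + $s1.grep({$_ > 127}).elems}"


Output:
length 45
decomposed codepoints 61
not-diacritic 45
diacritics 16
non-decomposable 1
diacritic and non-decomposable 17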

posted by zengargoyle at 10:22 PM on June 3, 2017 [2 favorites]
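
For readers who want to try zengargoyle's count on a word list without Perl 6, here is a minimal sketch in Python (a swapped-in language, not the code from the thread). It assumes a combining mark is any code point whose Unicode general category starts with "M", which approximates the Grapheme_Cluster_Break test rather than reproducing it, and it treats any remaining non-ASCII code point as a non-decomposable special character. The names `count_special` and `rank_words` and the sample word list are hypothetical.

```python
import unicodedata


def count_special(word):
    """Return (marks, non_decomposable) for one word.

    marks: combining marks left after NFD decomposition.
    non_decomposable: non-ASCII base characters such as 'đ'.
    Assumption: "combining mark" means general category M*.
    """
    decomposed = unicodedata.normalize("NFD", word)
    marks = 0
    non_decomposable = 0
    for ch in decomposed:
        if unicodedata.category(ch).startswith("M"):
            marks += 1
        elif ord(ch) > 127:
            non_decomposable += 1
    return marks, non_decomposable


def rank_words(words, top=5):
    """Sort words by total special characters, highest first."""
    scored = [(sum(count_special(w)), w) for w in words]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top]


# The string from the Perl 6 post (without the leading "S").
phrase = "ự-phản-đối-việc-tách-nhà-thờ-ra-khỏi-nhà-nước"
print(count_special(phrase))  # (16, 1), the same counts as the Perl 6 output

# Hypothetical in-memory word list; a real run would read a dictionary file.
sample = ["nước", "reconstruction", "hétérogénéité", "việc"]
for score, word in rank_words(sample):
    print(score, word)
```

To scan a real list, pass in the lines of a file, for example `rank_words(line.strip() for line in open("words.txt", encoding="utf-8"))`, where `words.txt` is a placeholder name. Proper dictionary files will still give cleaner results than subtitle corpora, as noted further down the thread.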


Response by poster: Amazing, zengargoyle! Looks like hermitdave over at GitHub has just the corpus for that. I'm now scrolling through a .txt file with 100,000 Polish words extracted from movie subtitles. I'll report back, though it might take a while (as I'm coding-illiterate).
posted by dontjumplarry at 12:03 AM on June 4, 2017


Yeah, I think there's a need for a real dictionary of some sort. I found a 22k list of Vietnamese words/phrases, and nothing in it came close to Paragon's "Sự-phản-đối-việc-tách-nhà-thờ-ra-khỏi-nhà-nước" (antidisestablishmentarianism). The subtitles corpus is no better and has what appears to be Chinese and Russian and things like "wooooooaaaahhh!" mixed in.

Hopefully the hive mind can find a good actual dictionary or spell-checker sort of list-o-words to play around with. I don't even really know which languages are Latin with funny marks. :)
posted by zengargoyle at 12:54 AM on June 4, 2017


Best answer: From A Collection of Word Oddities and Trivia:
The Hungarian words újjáépítéséről ("about its reconstruction") and újjáválaszthatóságáról ("about his/her re-electability") have seven accent marks. Also in Hungarian, alelölülő means "deputy chairperson" (lit.: "deputy fore-sitter"), although this is a made-up word that is not in use. Some words with five accent or diacritical marks are hétérogénéité (French for "heterogeneity") and Héréhérétué (an atoll in the Pacific Ocean near Tahiti). In Hungarian, a word which is widely used to test whether the diacritical marks remain intact (e.g. in sending an e-mail) is árvíztűrő tükörfúrógép ("flood-proof mirror drilling machine"). This is probably the shortest text which contains all the possible accented letters in Hungarian [Ádám Szegi, Tamas Lepesfalvi, Stuart Kidd].
posted by Johnny Assay at 5:44 AM on June 4, 2017 [9 favorites]
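
The counts quoted above are easy to check mechanically. The following Python sketch (an illustration, not part of the quoted source) counts combining marks after NFD decomposition and checks that the Hungarian test phrase covers the nine accented vowels of the Hungarian alphabet. The function name `accent_marks` and the hard-coded vowel list are assumptions made for the example.

```python
import unicodedata


def accent_marks(text):
    """Count combining marks (general category M*) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return sum(1 for ch in decomposed if unicodedata.category(ch).startswith("M"))


words = ["újjáépítéséről", "újjáválaszthatóságáról", "hétérogénéité", "Héréhérétué"]
for word in words:
    print(word, accent_marks(word))  # 7, 7, 5 and 5 respectively

# Assumed list of the accented vowels of the Hungarian alphabet.
# NFC normalization keeps each letter as one precomposed code point.
hungarian_accented = set(unicodedata.normalize("NFC", "áéíóöőúüű"))
phrase = unicodedata.normalize("NFC", "árvíztűrő tükörfúrógép")

# True if every accented vowel occurs somewhere in the phrase.
print(hungarian_accented <= set(phrase))
```

Each accented vowel appears exactly once in the phrase, so `accent_marks(phrase)` gives 9.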

