What word has the most špeĉiäl chârãçtérs?
June 3, 2017 5:58 PM
Many languages add diacritics or special characters to the Latin alphabet. What word—a real word, not a made-up one—contains the most such characters? For example, if there was a Lake Ƶúṗė̃ṝčâⱡìᶂɍāğíļîs̊tɨçe̋x̊p̆ȉäłīðỗç̇īǿüs̥ in Lithuania, that would qualify.
Are you looking purely for a-z Latin letters with diacritics? I think 'đ' doesn't actually decompose into a 'd' plus a diacritic (at least as far as Unicode is concerned).
If anybody has dictionary/spelling files from suitable languages, this is an interesting little bit of programming.
posted by zengargoyle at 10:03 PM on June 3, 2017
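The point about 'đ' is easy to check. Here is a minimal sketch in Python (chosen only on the assumption that most readers have it installed), using the standard `unicodedata` module. The helper name `decompose` is made up for illustration.

```python
import unicodedata

def decompose(ch):
    """Return the NFD (canonical decomposition) code points of a character."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]

# 'ự' splits into a plain 'u' plus two combining marks (horn, dot below).
print(decompose("ự"))  # ['U+0075', 'U+031B', 'U+0323']

# 'đ' has no canonical decomposition, so it stays a single code point.
print(decompose("đ"))  # ['U+0111']
```

So a counter that only looks for combining marks would miss letters like 'đ' or 'ł'. They need a separate rule, for instance "any non-ASCII letter left over after decomposition".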
Best answer: A bit of Perl 6 in need of source material.
my $s = "ự-phản-đối-việc-tách-nhà-thờ-ra-khỏi-nhà-nước";
say "length {$s.chars}";
# keep only the decomposed code points that are not combining marks
my $s1 = $s.NFD.list.grep({ .uniprop('GCB') eq 'Other' });
say "decomposed codepoints {$s.NFD.elems}";
say "not-diacritic {$s1.elems}";
say "diactrics {$s.NFD.elems - $s1.elems}";
# non-ASCII base characters such as đ, which have no decomposition
say "non-decomposible {$s1.grep(* > 127).elems}";
say "diacritic and non-decomposable {$s.NFD.elems - $s1.elems + $s1.grep({$_ > 127}).elems}"
Output:
length 45
decomposed codepoints 61
not-diacritic 45
diactrics 16
non-decomposible 1
diacritic and non-decomposable 17
posted by zengargoyle at 10:22 PM on June 3, 2017 [2 favorites]
Response by poster: Amazing, zengargoyle! Looks like hermitdave over at GitHub has just the corpus for that. I'm now scrolling through a .txt file with 100,000 Polish words extracted from movie subtitles. I'll report back, though it might take a while (as I'm coding-illiterate).
posted by dontjumplarry at 12:03 AM on June 4, 2017
Yeah, I think there's a need for a real dictionary of some sort. I found a 22k list of Vietnamese words/phrases, and nothing in it came close to Paragon's "Sự-phản-đối-việc-tách-nhà-thờ-ra-khỏi-nhà-nước" (antidisestablishmentarianism). The subtitles corpus is no better: it has what appears to be Chinese and Russian, and things like "wooooooaaaahhh!" mixed in.
Hopefully the hive mind can find a good actual dictionary or spell-checker sort of list-o-words to play around with. I don't even really know which languages are Latin with funny marks. :)
posted by zengargoyle at 12:54 AM on June 4, 2017
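For anyone who wants to try this on a noisy word list, here is a rough Python sketch of the filter-and-rank idea. Everything in it is an assumption for illustration: the helper names, the "three identical characters in a row" junk rule, and the tiny sample list standing in for a real corpus.

```python
import re
import unicodedata

def special_count(word):
    """Combining marks after NFD, plus non-ASCII letters that do not decompose."""
    total = 0
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.category(ch).startswith("M"):
            total += 1                      # a diacritic such as acute or horn
        elif ord(ch) > 127 and ch.isalpha():
            total += 1                      # e.g. đ or ł, which have no decomposition
    return total

def looks_like_latin_word(word):
    """Heuristic: every letter is Latin and no character repeats 3+ times in a row."""
    letters = [ch for ch in unicodedata.normalize("NFD", word) if ch.isalpha()]
    if not letters:
        return False
    if any(not unicodedata.name(ch, "").startswith("LATIN") for ch in letters):
        return False
    return re.search(r"(.)\1\1", word) is None

def rank(words, top=3):
    """Return the `top` plausible words with the most special characters."""
    candidates = [w for w in words if looks_like_latin_word(w)]
    return sorted(candidates, key=special_count, reverse=True)[:top]

# Made-up sample; a real run would read one word per line from a UTF-8 file.
sample = ["wooooooaaaahhh", "привет", "żółć", "nước", "hello"]
for w in rank(sample):
    print(special_count(w), w)
```

On the sample this drops the Cyrillic word and the drawn-out "wooooooaaaahhh", then ranks the rest. The repeat rule is crude and would also throw out any legitimate word with a tripled letter, so treat it as a starting point.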
Best answer: From A Collection of Word Oddities and Trivia:
The Hungarian words újjáépítéséről ("about its reconstruction") and újjáválaszthatóságáról ("about his/her re-electability") have seven accent marks. Also in Hungarian, alelölülő means "deputy chairperson" (lit.: "deputy fore-sitter"), although this is a made-up word that is not in use. Some words with five accent or diacritical marks are hétérogénéité (French for "heterogeneity") and Héréhérétué (an atoll in the Pacific Ocean near Tahiti). In Hungarian, a word which is widely used to test whether the diacritical marks remain intact (e.g. in sending an e-mail) is árvíztűrő tükörfúrógép ("flood-proof mirror drilling machine"). This is probably the shortest text which contains all the possible accented letters in Hungarian [Ádám Szegi, Tamas Lepesfalvi, Stuart Kidd].
posted by Johnny Assay at 5:44 AM on June 4, 2017 [9 favorites]
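The counts quoted above can be double-checked mechanically. This is a small Python sketch, assuming "accent mark" means one combining mark after canonical decomposition, so ő counts once for its double acute. The function name `accent_marks` is made up for illustration.

```python
import unicodedata

def accent_marks(word):
    """Count nonspacing combining marks left after NFD decomposition."""
    return sum(1 for ch in unicodedata.normalize("NFD", word)
               if unicodedata.category(ch) == "Mn")

for word in ["újjáépítéséről", "újjáválaszthatóságáról",
             "hétérogénéité", "Héréhérétué"]:
    print(accent_marks(word), word)

# The test phrase should cover all nine accented Hungarian vowels.
phrase = unicodedata.normalize("NFC", "árvíztűrő tükörfúrógép")
accented = {ch for ch in phrase if ord(ch) > 127}
print(accented == set(unicodedata.normalize("NFC", "áéíóöőúüű")))
```

Under that assumption the two Hungarian words come out at seven marks each and the two French-derived ones at five, matching the quoted figures.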
This thread is closed to new comments.
posted by Paragon at 6:19 PM on June 3, 2017 [4 favorites]