I have a ridiculous alphabetical / mathematical question
June 9, 2023 11:27 PM Subscribe
Three words: CHEF CHELSEA CHERRY (P.S. I'm not shouting) ...Alphabetically, which word is CHELSEA closer to: CHEF or CHERRY? And, more importantly* why? Can you help me come up with a mathematical model for this?
*this is not important at all.
I assume the way to do this is by assigning a value for each letter, A to Z being 1 to 26.
All three words begin with CHE, thus we move on to the 4th letter for each: cheF cheLsea cheRry.
F and R are equal distance away from L (6 letters away). It's still a tie.
CHEF has no 5th letter, so we can assign its 5th letter a value of 0.
chelSea's 5th letter is S, so we can assign it a value of 19.
cherRy's 5th letter is R, so we can assign it a value of 18.
But it's like Pac-Man or Space Invaders, where you fly off one side and end up on the other, right?
To get from chelSea to chef_, you'd need to go backward from S to A (19 letters in movement) plus all 26 letters from Z through A to nothing for chef since chef has no fifth letter. That's 45 letters, in distance. Right?
To get from chelSea to cherRy, you'd need to go from S to Z (7 letters) plus the 18 letters to go from A through R. That's 25 letters.
But what do you do with the remaining EA in chelsEA and the remaining Y in cherrY?
And, most importantly*, how do I create a mathematical model for this, or something, because yo I need to at least explain to myself the exact blah blah about why, for... reasons. In my head, this is a math problem, and I need to understand the numbers, because I'm weird like that.
I really hate that I got this stupid thought in my head tonight. It's so ridiculous, and I bet the answer will be facepalmingly obvious. The worst part is, there's no point to it other than the fact that I struggled to explain TO MYSELF** why one word was closer to the other, alphabetically, when the words don't have equal length.
**ok, I was shouting that time, mostly at myself tho.
I assume the way to do this is by assigning a value for each letter, A to Z being 1 to 26.
All three words begin with CHE, thus we move on to the 4th letter for each: cheF cheLsea cheRry.
F and R are equal distance away from L (6 letters away). It's still a tie.
CHEF has no 5th letter, so we can assign its 5th letter a value of 0.
chelSea's 5th letter is S, so we can assign it a value of 19.
cherRy's 5th letter is R, so we can assign it a value of 18.
But it's like Pac-Man or Space Invaders, where you fly off one side and end up on the other, right?
To get from chelSea to chef_, you'd need to go backward from S to A (19 letters in movement) plus all 26 letters from Z through A to nothing for chef since chef has no fifth letter. That's 45 letters, in distance. Right?
To get from chelSea to cherRy, you'd need to go from S to Z (7 letters) plus the 18 letters to go from A through R. That's 25 letters.
But what do you do with the remaining EA in chelsEA and the remaining Y in cherrY?
And, most importantly*, how do I create a mathematical model for this, or something, because yo I need to at least explain to myself the exact blah blah about why, for... reasons. In my head, this is a math problem, and I need to understand the numbers, because I'm weird like that.
I really hate that I got this stupid thought in my head tonight. It's so ridiculous, and I bet the answer will be facepalmingly obvious. The worst part is, there's no point to it other than the fact that I struggled to explain TO MYSELF** why one word was closer to the other, alphabetically, when the words don't have equal length.
**ok, I was shouting that time, mostly at myself tho.
I was going to say the standard way of measuring this is Levenshtein distance, but meijusa's link offers other options.
posted by adamrice at 12:22 AM on June 10, 2023 [2 favorites]
posted by adamrice at 12:22 AM on June 10, 2023 [2 favorites]
To get from chelSea to chef_, you'd need to go backward from S to A (19 letters in movement) plus all 26 letters from Z through A to nothing for chef since chef has no fifth letter. That's 45 letters, in distance. Right?
Why not just go forward from S to Z and then one more to nothing, one more step takes you back to A. You have a mod 27 ring consisting of the 26 alphabetic characters and the one space/null/whatever. It's a 27 digit clock. Any distance between letters is a maximum of 13 and it's max 14 to include going to the space/null.
I totally don't get the "backward from S to A ... plus all 26 letters from Z through A to nothing" formulation.
posted by zengargoyle at 12:30 AM on June 10, 2023
Why not just go forward from S to Z and then one more to nothing, one more step takes you back to A. You have a mod 27 ring consisting of the 26 alphabetic characters and the one space/null/whatever. It's a 27 digit clock. Any distance between letters is a maximum of 13 and it's max 14 to include going to the space/null.
I totally don't get the "backward from S to A ... plus all 26 letters from Z through A to nothing" formulation.
posted by zengargoyle at 12:30 AM on June 10, 2023
Response by poster: > Why not just go forward from S to Z and then one more to nothing, one more step takes you back to A
But that takes you forward, toward the next alphabetical word (in this case, cherry).
Picture all three words on a single line. The goal is to calculate the position of each word on that line based on the mathematical value of each word's letters, and then figure out which distance is shorter.
I think meijusa has me on the right track.
posted by 2oh1 at 12:39 AM on June 10, 2023
But that takes you forward, toward the next alphabetical word (in this case, cherry).
Picture all three words on a single line. The goal is to calculate the position of each word on that line based on the mathematical value of each word's letters, and then figure out which distance is shorter.
I think meijusa has me on the right track.
posted by 2oh1 at 12:39 AM on June 10, 2023
I think they are equidistant. As you say, there is no next letter in “Chef”. If you were filing them, they would be the same distance apart. In mathematics, some things are equal and that’s an acceptable answer. Truncating numbers (or rounding them) so they are the same number of decimal places is a thing. So if you put a decimal after the last letter of Chef (because it is the end of the number) and put a decimal in the same location for the other numbers you could either truncate or judge based off the decimal values. (I don’t think that would give you my pat answer of them being equidistant.) But those are the two methods I would use to approach it.
posted by Bottlecap at 1:13 AM on June 10, 2023 [1 favorite]
posted by Bottlecap at 1:13 AM on June 10, 2023 [1 favorite]
Note that from what I read, most of the edit functions treat "modification" (e.g. substitutions, etc.) as having a cost of one, whereas you are trying to actually take into account the distance between letters, so they won't be directly applicable.
In general, I think what you have to do is simply be precise about what you mean and define things carefully before you can calculate which is closer. For example, the following is extremely unclear what it _precisely_ means:
"To get from chelSea to chef_, you'd need to go backward from S to A (19 letters in movement) plus all 26 letters from Z through A to nothing for chef since chef has no fifth letter. That's 45 letters, in distance. Right?"
it sounds like you're trying to find the distance between s -> A (for chelsea) and _ -> A (for chef_), but that's doesn't match the other situations that you do. Why aren't you doing (for the fourth letter) in that case l -> A and f -> A? If you are trying to compare distances between letters--and you have defined _ to be the 27th/0th letter--then you should simply count the distance to be 27 - 19 = 8.
But that's just my interpretation. I think if you want to actually calculate this, you need to be precise about what you want to calculate.
posted by vernondalhart at 1:14 AM on June 10, 2023
In general, I think what you have to do is simply be precise about what you mean and define things carefully before you can calculate which is closer. For example, the following is extremely unclear what it _precisely_ means:
"To get from chelSea to chef_, you'd need to go backward from S to A (19 letters in movement) plus all 26 letters from Z through A to nothing for chef since chef has no fifth letter. That's 45 letters, in distance. Right?"
it sounds like you're trying to find the distance between s -> A (for chelsea) and _ -> A (for chef_), but that's doesn't match the other situations that you do. Why aren't you doing (for the fourth letter) in that case l -> A and f -> A? If you are trying to compare distances between letters--and you have defined _ to be the 27th/0th letter--then you should simply count the distance to be 27 - 19 = 8.
But that's just my interpretation. I think if you want to actually calculate this, you need to be precise about what you want to calculate.
posted by vernondalhart at 1:14 AM on June 10, 2023
Well in that case it's simple. You just embed all words into a vector space of the highest dimension needed and do a distance from origin calculation for each word and get three numbers out of it followed by some subtractions. Remember the triangle thing 'a^2 + b^2 = c^2'? Just take them all up into 7 dimensional vectors.
posted by zengargoyle at 1:14 AM on June 10, 2023 [7 favorites]
posted by zengargoyle at 1:14 AM on June 10, 2023 [7 favorites]
Oh, I like Bottlecap's line of thought. But put the decimal in front of the word and treat it as a fraction. Each position forward being along 1/26, 1/26^2, 1/26^3 just like the 1/10, 1/100, 1/1000 (just base 26 vs base 10). Then you could get a value and do something (maybe log) to bring the curve into a line and get a 'distance'.
My first thought was the Levenshtein distance from computer science that's mostly about how many edits are needed. But the question sorta sounded like a combination lock sort of thing "to defuse the bomb turn dials from X to Y or Z but you only have limited time". The sort of linear distance is tricky.
posted by zengargoyle at 1:28 AM on June 10, 2023 [1 favorite]
My first thought was the Levenshtein distance from computer science that's mostly about how many edits are needed. But the question sorta sounded like a combination lock sort of thing "to defuse the bomb turn dials from X to Y or Z but you only have limited time". The sort of linear distance is tricky.
posted by zengargoyle at 1:28 AM on June 10, 2023 [1 favorite]
With the number line idea, CHEF would have value 2/26 + 7/26*26 + 4/26*26*26 + 5/26*26*26*26. Each numerator is the letter's 0-based index.
If you are writing a program to calculate it, you can have a loop over the letters and multiply the current denominator by 26 at each iteration.
Then distance is just the difference in values.
posted by Horselover Fat at 7:29 AM on June 10, 2023
If you are writing a program to calculate it, you can have a loop over the letters and multiply the current denominator by 26 at each iteration.
Then distance is just the difference in values.
posted by Horselover Fat at 7:29 AM on June 10, 2023
Or maybe it needs to be 1-based, otherwise A and AA are .0 and .00. But I think it's close to what you described. Because it sounds like the usual idea of edit distance doesn't capture it at all. If I understand right, you think BEAR and YEAR are very far apart, while MAKE and MULE are much closer. Right?
posted by Horselover Fat at 7:37 AM on June 10, 2023
posted by Horselover Fat at 7:37 AM on June 10, 2023
Edit distance or similarity doesn’t feel quite right for what you’re trying to determine, because you want to weight the letters by their position in the word; THIEF has a greater similarity to/lower distance from CHEF than does CHELSEA, but I don’t think you’d consider it “closer” for your purpose.
(An interesting non-mathematical way to approach the question might be “how many dictionary words fall between these two words?” Of course that’s wholly dependent on what language and dictionary you’re using, but maybe it makes intuitive sense that a definition of proximity should be context-sensitive in this way.)
posted by staggernation at 7:59 AM on June 10, 2023 [2 favorites]
(An interesting non-mathematical way to approach the question might be “how many dictionary words fall between these two words?” Of course that’s wholly dependent on what language and dictionary you’re using, but maybe it makes intuitive sense that a definition of proximity should be context-sensitive in this way.)
posted by staggernation at 7:59 AM on June 10, 2023 [2 favorites]
If you had three numbers instead, you could easily subtract to find out which pair was closer.
So, let's turn these words into base-26 numbers.
To fit intuition about how the letters in each word should correspond to each other, we'll pad their right hand side with zeros until they're the same length; to fit intuition about which letters have a stronger effect in the distance calculation, we'll put larger powers of 26 on the left, the same way we do with ordinary base-10 numbers. (What I'm doing is similar to the approaches taken by zengargoyle and Horselover Fat, by the way, but I'm avoiding fractions.)
So, assign each letter to their corresponding number from 1 to 26:
So ultimately we encode them as:
posted by What is E. T. short for? at 9:44 AM on June 10, 2023 [2 favorites]
So, let's turn these words into base-26 numbers.
To fit intuition about how the letters in each word should correspond to each other, we'll pad their right hand side with zeros until they're the same length; to fit intuition about which letters have a stronger effect in the distance calculation, we'll put larger powers of 26 on the left, the same way we do with ordinary base-10 numbers. (What I'm doing is similar to the approaches taken by zengargoyle and Horselover Fat, by the way, but I'm avoiding fractions.)
So, assign each letter to their corresponding number from 1 to 26:
C H E F - - - 3 8 5 6 0 0 0 C H E L S E A 3 8 5 12 19 5 1 C H E R R Y - 3 8 5 18 18 25 0Then give them their powers of 26 - the highest will be 266. For example:
CHELSEA = 3*266 + 8*265 + 5*264 + 12*263 + 19*262 + 5*261 + 1*260Since in this case they all start with 3,8,5, those will cancel out and we can remove them to save typing.
So ultimately we encode them as:
CHEF = 6*263 + 0 + 0 + 0 CHELSEA = 12*263 + 19*262 + 5*261 + 1 CHERRY = 18*263 + 18*262 + 25*261 + 0Multiply it out ...
CHEF = 105456 CHELSEA = 223887 CHERRY = 329186And subtract.
223887 - 105456 = 118431 329186 - 223887 = 105299So, CHELSEA is closer to CHERRY than to CHEF.
posted by What is E. T. short for? at 9:44 AM on June 10, 2023 [2 favorites]
If you use base 26 like that, you’ll end up with multiple words that share the same value, for example:
An alternate way to calculate would be to treat each word as a base-27 number with a decimal point at the front, and then subtract.
posted by mbrubeck at 10:09 AM on June 10, 2023 [5 favorites]
B = 2×26¹ + 0 = 52 AZ = 1×26¹ + 26 = 52You can fix this by using base 27 instead, since you have 26 letters plus the empty 0 value.
An alternate way to calculate would be to treat each word as a base-27 number with a decimal point at the front, and then subtract.
0.CHERRY − 0.CHEF = 0.000LRY
0.CHELSEA − 0.CHEF = 0.000FSEA0.000FSEA is smaller than 0.000LRY, so CHELSEA and CHEF are closer than CHERRY and CHEF.
posted by mbrubeck at 10:09 AM on June 10, 2023 [5 favorites]
I worked in bookstores before personal computing did inventory. I've done So Much Alphabetizing. The extra letters have no effect and should not be counted. In a program, once you get an answer, you stop looking for more data (or you should). They are equidistant, at least from my functional standpoint.
posted by theora55 at 10:26 AM on June 10, 2023 [1 favorite]
posted by theora55 at 10:26 AM on June 10, 2023 [1 favorite]
cherry is closer to chelsea than chef because:
posted by scruss at 10:26 AM on June 10, 2023
>LIST 10 DIM W$(10),V(10) 20 GOTO 80 30 LET X=0 40 FOR I=1 TO LEN(A$) 50 LET X=X+(ASC(MID$(A$,I,1))-ASC("A")+1)/I 60 NEXT I 70 RETURN 80 LET N=1 90 READ A$ 100 IF A$="STOP" GOTO 160 110 LET W$(N)=A$ 120 GOSUB 30 130 LET V(N)=X 140 LET N=N+1 150 GOTO 90 160 FOR J=2 TO N-1 170 PRINT W$(J);"-";W$(J-1);"=";V(J)-V(J-1) 180 NEXT J 190 END 200 DATA "CHEF","CHELSEA","CHERRY","STOP" >RUN CHELSEA-CHEF=6.27619048 CHERRY-CHELSEA=4.49047619
posted by scruss at 10:26 AM on June 10, 2023
You're right that it should be base-27, mbrubeck.
And another problem with using whole numbers like I suggested originally is that it only works when you know in advance the maximum length of potential words; if we add in CHEAPSKATE we have to recalculate all the exponents.
However I do want to point out that your first comparison should be CHERRY-CHELSEA, not CHERRY-CHEF:
posted by What is E. T. short for? at 10:39 AM on June 10, 2023 [1 favorite]
And another problem with using whole numbers like I suggested originally is that it only works when you know in advance the maximum length of potential words; if we add in CHEAPSKATE we have to recalculate all the exponents.
However I do want to point out that your first comparison should be CHERRY-CHELSEA, not CHERRY-CHEF:
0.CHERRY - 0.CHELSEA = 0.000EZSZAnd 0.000EZSZ is smaller than 0.000FSEA.
posted by What is E. T. short for? at 10:39 AM on June 10, 2023 [1 favorite]
I like the base-27 suggestions, but let me try to more narrowly address the specific point of confusion in the original post, which has to do with CHEF lacking a "tiebreaker" letter in the fifth position.
Suppose we considered CHEFAAA instead of CHEF. In that case, 2oh1's own thoughts about the problem seem to place CHELSEA unambiguously closer to CHERRY than to CHEFAAA. But CHEF precedes CHEFAAA in any alphabetizing system I've ever seen. Thus, CHELSEA is a fortiori closer to CHERRY than to CHEF. The base-27 model affirms this result because it captures the logic behind it. (This is not to say one couldn't use a different model, but that this model accords better than others with what the OP already had in mind.)
posted by aws17576 at 11:03 AM on June 10, 2023 [2 favorites]
Suppose we considered CHEFAAA instead of CHEF. In that case, 2oh1's own thoughts about the problem seem to place CHELSEA unambiguously closer to CHERRY than to CHEFAAA. But CHEF precedes CHEFAAA in any alphabetizing system I've ever seen. Thus, CHELSEA is a fortiori closer to CHERRY than to CHEF. The base-27 model affirms this result because it captures the logic behind it. (This is not to say one couldn't use a different model, but that this model accords better than others with what the OP already had in mind.)
posted by aws17576 at 11:03 AM on June 10, 2023 [2 favorites]
This is absolutely a math problem. The branch of math that looks at various ways to calculate metrics on words is a sub-branch of group theory. The edit distances from meijusa's link are the most common metrics, especially the Hamming and Levenshtein distances, to the best of my knowledge. But you have a variety of options depending on what "distance" features are most important to you.
posted by eviemath at 11:13 AM on June 10, 2023
posted by eviemath at 11:13 AM on June 10, 2023
→
If you'd done it the right way (that is, in BASIC on a BBC Micro*) you wouldn't have to recalculate anything:
*: for greater clarity, also known as “the right way”.
posted by scruss at 11:13 AM on June 10, 2023 [1 favorite]
if we add in CHEAPSKATE we have to recalculate all the exponents.
If you'd done it the right way (that is, in BASIC on a BBC Micro*) you wouldn't have to recalculate anything:
* CHEF= 10.1666667 * CHELSEA= 16.4428571 * CHEAPSKATE= 19.7019841 * CHERRY= 20.9333333---
*: for greater clarity, also known as “the right way”.
posted by scruss at 11:13 AM on June 10, 2023 [1 favorite]
scruss, I'm here for the BASIC nostalgia, but if CHEAPSKATE is assigned a larger value than CHELSEA, what you're doing mathematically may not quite fit the alphabetization paradigm...
posted by aws17576 at 11:26 AM on June 10, 2023 [2 favorites]
posted by aws17576 at 11:26 AM on June 10, 2023 [2 favorites]
Response by poster: I don't know why this didn't occur to me sooner. The answer is simple. Compare them one letter at a time, assigning 1 to 26 for A to Z, and assigning a zero to any blank letters if one word is longer than the other.
CHEF CHELSEA CHERRY
1st letter: 03 03 03 means it's a tie.
2nd letter: 08 08 08 means it's a tie.
3rd letter: 05 05 05 means it's a tie.
4th letter: 06 12 18 means it's still a tie.
5th letter: 00 19 18 means CHERRY is closer to CHELSEA.
Since this is a comparison of a chosen word against two others, it still works if the chosen word is shorter than the words to compare it with, and there can only be ties if all three words are the same length.
Perhaps, this needs another step? Adding the values of each letter after each step?
1st letter: 03 03 03 means it's a tie.
2nd letter: 08+03=11, 08+03=11, 08+03=11 means it's a tie.
3rd letter: 11+05=16, 11+05=16, 11+05=16 means it's a tie.
4th letter: 16+06=22, 16+12=28, 16+18=34 means it's still a tie.
5th letter: 22+00=22, 19+28=47, 18+34=52 means CHERRY is closer to CHELSEA.
6th letter: 22+00=22, 47+05=52, 52+25=77 means CHERRY is still closer to CHELSEA.
...but does approach fall apart - or is it proven - if CHERRY was actually CHERRYBLOSSOM? That would put CHELSEA closer to CHEF, due to the extra weight or 'distance' of CHERRYBLOSSOM. I think that actually proves the method.
posted by 2oh1 at 12:57 PM on June 10, 2023
CHEF CHELSEA CHERRY
1st letter: 03 03 03 means it's a tie.
2nd letter: 08 08 08 means it's a tie.
3rd letter: 05 05 05 means it's a tie.
4th letter: 06 12 18 means it's still a tie.
5th letter: 00 19 18 means CHERRY is closer to CHELSEA.
Since this is a comparison of a chosen word against two others, it still works if the chosen word is shorter than the words to compare it with, and there can only be ties if all three words are the same length.
Perhaps, this needs another step? Adding the values of each letter after each step?
1st letter: 03 03 03 means it's a tie.
2nd letter: 08+03=11, 08+03=11, 08+03=11 means it's a tie.
3rd letter: 11+05=16, 11+05=16, 11+05=16 means it's a tie.
4th letter: 16+06=22, 16+12=28, 16+18=34 means it's still a tie.
5th letter: 22+00=22, 19+28=47, 18+34=52 means CHERRY is closer to CHELSEA.
6th letter: 22+00=22, 47+05=52, 52+25=77 means CHERRY is still closer to CHELSEA.
...but does approach fall apart - or is it proven - if CHERRY was actually CHERRYBLOSSOM? That would put CHELSEA closer to CHEF, due to the extra weight or 'distance' of CHERRYBLOSSOM. I think that actually proves the method.
posted by 2oh1 at 12:57 PM on June 10, 2023
OK, just to geek out about this a tiny bit more: I actually don't think the base-27 model is perfect, because it places P equidistant from OZ and PA. This feels like a poor result given that there are many possible words between OZ and P (e.g. OZEMPIC, OZONE, OZYMANDIAS, OZZFEST) but none between P and PA. "End of word" isn't like other characters, because it can't be followed by additional characters.
Here we start running into built-in deficiencies of mapping alphabetic order onto the number line. If there's nothing between P and PA, then they should be assigned the same number; if they aren't, we run into some version of the problem above. But assigning P and PA the same number goes against usual alphabetization, where P definitely precedes PA. I'll stop here before this turns into an argument about whether 0.999... angels can dance on the head of a pin and whether or not this is the same as one angel.
posted by aws17576 at 12:59 PM on June 10, 2023 [2 favorites]
Here we start running into built-in deficiencies of mapping alphabetic order onto the number line. If there's nothing between P and PA, then they should be assigned the same number; if they aren't, we run into some version of the problem above. But assigning P and PA the same number goes against usual alphabetization, where P definitely precedes PA. I'll stop here before this turns into an argument about whether 0.999... angels can dance on the head of a pin and whether or not this is the same as one angel.
posted by aws17576 at 12:59 PM on June 10, 2023 [2 favorites]
I think when we alphabetize, we are actually building a tree:
A
- AA
-AAA
-AAB
- AB
- AC
B
- BA
and so on.
In that conception, A and B are the same distance apart as AZIMUTH and B are (1 unit), while AARDVARK and ABSOLUTE are 0.1 units apart, and CHEF and CHELSEA are 0.001 units apart (as are CHELSEA and CHERRY).
For convenience, we flatten this tree when we put together a dictionary, but if we want to measure distance in any kind of logical way, we have to look at the tree, not the flattened list. Otherwise, we end up with all the weird edge cases that have come up in this thread.
posted by ssg at 1:54 PM on June 10, 2023
A
- AA
-AAA
-AAB
- AB
- AC
B
- BA
and so on.
In that conception, A and B are the same distance apart as AZIMUTH and B are (1 unit), while AARDVARK and ABSOLUTE are 0.1 units apart, and CHEF and CHELSEA are 0.001 units apart (as are CHELSEA and CHERRY).
For convenience, we flatten this tree when we put together a dictionary, but if we want to measure distance in any kind of logical way, we have to look at the tree, not the flattened list. Otherwise, we end up with all the weird edge cases that have come up in this thread.
posted by ssg at 1:54 PM on June 10, 2023
Response by poster: Well, I'm wrong again. I thought I solved it, but my previous answer definitely doesn't work. In fact, it automatically assumes the longest word is last, alphabetically, which should have been obvious to me at the time.
posted by 2oh1 at 2:51 PM on June 10, 2023
posted by 2oh1 at 2:51 PM on June 10, 2023
Yeah, distance metrics on words generally either restricted to a finite maximum length and “pad” longer words with blanks in some way, or assume infinite-length words (again, padded with blanks in some way) and weight the letters by position (often geometrically). Per standard alphabetical ordering that places “a” before “aa”, a blank would be treated as if it were a letter preceding the letter a.
posted by eviemath at 3:15 PM on June 10, 2023 [2 favorites]
posted by eviemath at 3:15 PM on June 10, 2023 [2 favorites]
Best answer: If you did restrict to a finite maximum length n, you could just list out all of the possible strings of letters of that length, allowing blanks at the end, in alphabetical order, and just count how many items are between any two words in the list. That would be equivalent to treating each string as a base-27 number (ending in zeros, where blank = 0, a = 1, b = 2, etc.) and just subtracting the word that comes first alphabetically from the word that comes later.
posted by eviemath at 3:21 PM on June 10, 2023 [2 favorites]
posted by eviemath at 3:21 PM on June 10, 2023 [2 favorites]
Response by poster: > If you did restrict to a finite maximum length n
And we can define that by using the length of the longest word in a comparison.
> you could just list out all of the possible strings of letters of that length, allowing blanks at the end, in alphabetical order, and just count how many items are between any two words in the list.
The shortest string is, indeed, closer alphabetically.
I think you solved it!
posted by 2oh1 at 4:45 PM on June 10, 2023
And we can define that by using the length of the longest word in a comparison.
> you could just list out all of the possible strings of letters of that length, allowing blanks at the end, in alphabetical order, and just count how many items are between any two words in the list.
The shortest string is, indeed, closer alphabetically.
I think you solved it!
posted by 2oh1 at 4:45 PM on June 10, 2023
Response by poster: Ugh. Still not there, but close.
Using these three words, the string-count model fails:
bob bow boys
By counting possible strings between them, the answer is a tie, both at 21, even though bow is only 2 letters away from boy.
I believe the solution is to set the maximum length to the character length of the middle word, increasing by 1 only if it's a tie, continuing until it is no longer a tie, at which point the shorter string wins (meaning, that word is alphabetically closer).
posted by 2oh1 at 5:59 PM on June 10, 2023
Using these three words, the string-count model fails:
bob bow boys
By counting possible strings between them, the answer is a tie, both at 21, even though bow is only 2 letters away from boy.
I believe the solution is to set the maximum length to the character length of the middle word, increasing by 1 only if it's a tie, continuing until it is no longer a tie, at which point the shorter string wins (meaning, that word is alphabetically closer).
posted by 2oh1 at 5:59 PM on June 10, 2023
how do I create a mathematical model for this, or something, because yo I need to at least explain to myself the exact blah blah about why, for... reasons. In my head, this is a math problem, and I need to understand the numbers, because I'm weird like that.
You create a mathematical model for this, or for anything at all, in any way that (a) pleases you and (b) has a formally defensible internal consistency.
Personally I'd just define a distance between words by pretending that the letters were digits with place values A=0, B=1, ... , Z=25 and sticking an implied decimal point to the left of the word's leftmost letter. That gives you the same ordering for words as conventional alphabetic order would, and distance measurements between arbitrary letter strings that are straightforward to calculate numerically using base 26 arithmetic, but if you don't like P ending up equidistant between OZ and PA it might not fulfil criterion (a) for you.
Distance, in particular, is a mathematical relationship that has been defined in all kinds of ways for all kinds of purposes. And, mathematics being what it is, there's a formalism for what kind of relationship counts as a distance. You can define any process you like for munging a given pair of words into a single real number, and as long as the results satisfy the axioms that define a distance as a distance, then what you've defined is a distance.
Note in particular that this gets rid of any notion of the distance. As long as you've coherently defined a process for calculating a distance, then your results don't need to agree with those that fall out of anybody else's process to be correct. Your process doesn't even need to involve handling the words a letter at a time: see semantic similarity.
You might also enjoy this recent Veritasium video that makes a pretty good fist of discussing a notion of distance that on first blush just looks weird.
posted by flabdablet at 6:06 PM on June 10, 2023 [2 favorites]
You create a mathematical model for this, or for anything at all, in any way that (a) pleases you and (b) has a formally defensible internal consistency.
Personally I'd just define a distance between words by pretending that the letters were digits with place values A=0, B=1, ... , Z=25 and sticking an implied decimal point to the left of the word's leftmost letter. That gives you the same ordering for words as conventional alphabetic order would, and distance measurements between arbitrary letter strings that are straightforward to calculate numerically using base 26 arithmetic, but if you don't like P ending up equidistant between OZ and PA it might not fulfil criterion (a) for you.
Distance, in particular, is a mathematical relationship that has been defined in all kinds of ways for all kinds of purposes. And, mathematics being what it is, there's a formalism for what kind of relationship counts as a distance. You can define any process you like for munging a given pair of words into a single real number, and as long as the results satisfy the axioms that define a distance as a distance, then what you've defined is a distance.
Note in particular that this gets rid of any notion of the distance. As long as you've coherently defined a process for calculating a distance, then your results don't need to agree with those that fall out of anybody else's process to be correct. Your process doesn't even need to involve handling the words a letter at a time: see semantic similarity.
You might also enjoy this recent Veritasium video that makes a pretty good fist of discussing a notion of distance that on first blush just looks weird.
posted by flabdablet at 6:06 PM on June 10, 2023 [2 favorites]
Personally I'd just define a distance between words by pretending that the letters were digits with place values A=0, B=1, ... , Z=25 and sticking an implied decimal point to the left of the word's leftmost letter.
To illustrate, using your "bob bow boys" example:
.bob = 1 * 26-1 + 14 * 26-2 + 1 * 26-3 = 0.0592284934
.bow = 1 * 26-1 + 14 * 26-2 + 22 * 26-3 = 0.0604233045061
.boys = 1 * 26-1 + 14 * 26-2 + 24 * 26-3 + 18 * 26-4 = 0.0605764854172
So the distance between "bob" and "bow" is 0.0604233045061 - 0.0592284934 = 0.00119481110605, and the distance between "bow" and "boys" is 0.0605764854172 - 0.0604233045061 = 0.000153180911033, which makes "bob" and "bow" 7.8 times as far apart as "bow" and "boys", which looks plausible to me.
posted by flabdablet at 6:20 PM on June 10, 2023
To illustrate, using your "bob bow boys" example:
.bob = 1 * 26-1 + 14 * 26-2 + 1 * 26-3 = 0.0592284934
.bow = 1 * 26-1 + 14 * 26-2 + 22 * 26-3 = 0.0604233045061
.boys = 1 * 26-1 + 14 * 26-2 + 24 * 26-3 + 18 * 26-4 = 0.0605764854172
So the distance between "bob" and "bow" is 0.0604233045061 - 0.0592284934 = 0.00119481110605, and the distance between "bow" and "boys" is 0.0605764854172 - 0.0604233045061 = 0.000153180911033, which makes "bob" and "bow" 7.8 times as far apart as "bow" and "boys", which looks plausible to me.
posted by flabdablet at 6:20 PM on June 10, 2023
If you'd prefer to have the distance between e.g. A and B come out to 1 rather than 1 * 26-1, just stick the implied radix point after the leftmost letter instead of before it.
That would have the effect of multiplying all the numbers in the above example by 26, except for the ratio between bow-bob and boys-bow, which would remain unchanged at 7.8.
posted by flabdablet at 6:25 PM on June 10, 2023
That would have the effect of multiplying all the numbers in the above example by 26, except for the ratio between bow-bob and boys-bow, which would remain unchanged at 7.8.
posted by flabdablet at 6:25 PM on June 10, 2023
→
Well, excuse me for not taking something tagged pointlesstimesuck seriously. Would it help if I gave you the same algorithm in a serious data science language?
As someone who's been professionally responsible for sorting dictionaries, the idea of words being a “distance” from one another is only valid with a fixed word list, and that distance being the number of entries apart. A fixed word list is something that we precisely don't have in English. Edit distances are not for collation: “ait” and “zit” have an edit distance of one, but are at bloody opposite ends of the dictionary.
When you have a list of three words, it's meaningless to say they have a distance apart of anything other than one. Dictionaries aren't sorted by creating a numeric index: at the very simplest, it's a case-insensitive strcmp() kind of a deal. It quickly gets wiggy when you're dealing with Celtic languages, and it would be really nice if languages that use Latin characters could agree in the order they go in.
Using anything other than letter comparison will get you non-intuitive results. Using my weighted sum, chelsea is exactly the same “distance” from chef as abides, bailed, bob, die and fee. Similarly, cherry scores the same as earthen and haunch.
posted by scruss at 6:41 PM on June 10, 2023
what you're doing mathematically may not quite fit the alphabetization paradigm...
Well, excuse me for not taking something tagged pointlesstimesuck seriously. Would it help if I gave you the same algorithm in a serious data science language?
def wordval(s): v = 0 for i, c in enumerate(s.upper()): v = v + (ord(c) - ord("A") + 1) / (i + 1) return vIt gives you a weighted sum of all the letters: w[1]/1 + w[2]/2 + w[3]/3 + ... + w[n]/n.
As someone who's been professionally responsible for sorting dictionaries, the idea of words being a “distance” from one another is only valid with a fixed word list, and that distance being the number of entries apart. A fixed word list is something that we precisely don't have in English. Edit distances are not for collation: “ait” and “zit” have an edit distance of one, but are at bloody opposite ends of the dictionary.
When you have a list of three words, it's meaningless to say they have a distance apart of anything other than one. Dictionaries aren't sorted by creating a numeric index: at the very simplest, it's a case-insensitive strcmp() kind of a deal. It quickly gets wiggy when you're dealing with Celtic languages, and it would be really nice if languages that use Latin characters could agree in the order they go in.
Using anything other than letter comparison will get you non-intuitive results. Using my weighted sum, chelsea is exactly the same “distance” from chef as abides, bailed, bob, die and fee. Similarly, cherry scores the same as earthen and haunch.
posted by scruss at 6:41 PM on June 10, 2023
→
Note that this would require using bigfloats to allow long words to sort correctly. Using regular double precision, wrongheadedly and wrongheadedness collide. This would be an enormous faff, terribly slow and use a load more memory than strcmp()-style collation. Which, incidentally, never ever collides.
posted by scruss at 7:18 PM on June 10, 2023
pretending that the letters were digits with place values A=0, B=1, ... , Z=25 and sticking an implied decimal point to the left of the word's leftmost letter
Note that this would require using bigfloats to allow long words to sort correctly. Using regular double precision, wrongheadedly and wrongheadedness collide. This would be an enormous faff, terribly slow and use a load more memory than strcmp()-style collation. Which, incidentally, never ever collides.
posted by scruss at 7:18 PM on June 10, 2023
Sure, if I were going to make these distances computable in practice then that would be a concern, but all I was trying to come up with is a mathematical distance function.
Using strcmp for collation doesn't actually address the underlying distance representation issue, which is a natural consequence of needing a consistent representation for both the distance between "aah" and "zzz" and that between "pneumoultramicroscopicsilicovolcanoconiosis" and "pneumoultramicroscopicsilicovolcanoconioses": the numbers being subtracted just have lots of bits.
Mathematically there's no problem, but as you say, something like bigfloats or bigint-pair rationals would be needed to represent these distances exactly in a computer. Or you could invent your own numeric format specifically for this kind of thing by implementing fixed-point arithmetic on top of strings and computing the distance digits code point by code point. But that would be a faff.
posted by flabdablet at 9:35 PM on June 10, 2023
Using strcmp for collation doesn't actually address the underlying distance representation issue, which is a natural consequence of needing a consistent representation for both the distance between "aah" and "zzz" and that between "pneumoultramicroscopicsilicovolcanoconiosis" and "pneumoultramicroscopicsilicovolcanoconioses": the numbers being subtracted just have lots of bits.
Mathematically there's no problem, but as you say, something like bigfloats or bigint-pair rationals would be needed to represent these distances exactly in a computer. Or you could invent your own numeric format specifically for this kind of thing by implementing fixed-point arithmetic on top of strings and computing the distance digits code point by code point. But that would be a faff.
posted by flabdablet at 9:35 PM on June 10, 2023
flabdablet, you need base 27, not base 26, with 0 as the empty letter and 1 = a. Unless you only allow words/strings of exactly a given fixed length.
2oh1, you’re not going to “solve it” because, as multiple folks have mentioned, there are multiple potential answers.
In my fixed maximum string length example with max length of 4, in your example you have:
bob = (2,15,2,0)
bow = (2,15,23,0)
boys = (2,15,25,19)
Then:
boys - bow = (0,0,2,19)
bow - bob = (0,0,15,0)
Convert from base 27 to base 10:
(0,0,2,19) = 2*27 + 19 = 54+19 = 73
(0,0,15,0) = 15*27 = 200+27+30+35 = 227+65 = 292
Conclusion: boys and bow are closer than bow and bob in this metric.
In the standard topology on infinite words from an alphabet of 27 letters (including the empty character):
bob = 2*27^{-1} + 15*27^{-2} + 2*27^{-3}
bow = 2*27^{-1} + 15*27^{-2} + 23*27^{-3}
boys = 2*27^{-1} + 15*27^{-2} + 25*27^{-3} + 19*27^{-4}
And then you can subtract the decimals to get a distance between any pair, eg.
boys - bow = (25 - 23)*27^{-3} + (19 - 0)*27^{-4}
There is, as I noted, quite an extensive branch of math behind the many various ways to do this. It’s useful in cryptography, among many other applications. You can also come up with your own metric that satisfies the required axioms if various previously defined options don’t capture the type of closeness or distance that you’re interested in.
Computationally, by the way, these tend to be represented as vectors of integers, not as floating point numbers, and to make use of the positional representation structure to do the subtraction. You’ll find such implementations already coded into any computer algebra system, but Sage/CoCalc or group theory specific software packages provide the best implementations.
posted by eviemath at 9:41 PM on June 10, 2023 [3 favorites]
2oh1, you’re not going to “solve it” because, as multiple folks have mentioned, there are multiple potential answers.
In my fixed maximum string length example with max length of 4, in your example you have:
bob = (2,15,2,0)
bow = (2,15,23,0)
boys = (2,15,25,19)
Then:
boys - bow = (0,0,2,19)
bow - bob = (0,0,15,0)
Convert from base 27 to base 10:
(0,0,2,19) = 2*27 + 19 = 54+19 = 73
(0,0,15,0) = 15*27 = 200+27+30+35 = 227+65 = 292
Conclusion: boys and bow are closer than bow and bob in this metric.
In the standard topology on infinite words from an alphabet of 27 letters (including the empty character):
bob = 2*27^{-1} + 15*27^{-2} + 2*27^{-3}
bow = 2*27^{-1} + 15*27^{-2} + 23*27^{-3}
boys = 2*27^{-1} + 15*27^{-2} + 25*27^{-3} + 19*27^{-4}
And then you can subtract the decimals to get a distance between any pair, eg.
boys - bow = (25 - 23)*27^{-3} + (19 - 0)*27^{-4}
There is, as I noted, quite an extensive branch of math behind the many various ways to do this. It’s useful in cryptography, among many other applications. You can also come up with your own metric that satisfies the required axioms if various previously defined options don’t capture the type of closeness or distance that you’re interested in.
Computationally, by the way, these tend to be represented as vectors of integers, not as floating point numbers, and to make use of the positional representation structure to do the subtraction. You’ll find such implementations already coded into any computer algebra system, but Sage/CoCalc or group theory specific software packages provide the best implementations.
posted by eviemath at 9:41 PM on June 10, 2023 [3 favorites]
the idea of words being a “distance” from one another is only valid with a fixed word list, and that distance being the number of entries apart
Number of entries apart in a sorted list, where inclusion in that specific list is what defines a string of letters as a word, is a completely valid distance function for words. But it's by no means the only one. As long as the distance function you define displays the mathematical properties that make a function a distance function, you can define distances in whatever way is useful to you.
you need base 27, not base 26, with 0 as the empty letter and 1 = a. Unless you only allow words/strings of exactly a given fixed length.
Fair point. The way I did it, there's no distance between any word and the same word suffixed by an arbitrary number of As.
posted by flabdablet at 9:56 PM on June 10, 2023 [1 favorite]
Number of entries apart in a sorted list, where inclusion in that specific list is what defines a string of letters as a word, is a completely valid distance function for words. But it's by no means the only one. As long as the distance function you define displays the mathematical properties that make a function a distance function, you can define distances in whatever way is useful to you.
you need base 27, not base 26, with 0 as the empty letter and 1 = a. Unless you only allow words/strings of exactly a given fixed length.
Fair point. The way I did it, there's no distance between any word and the same word suffixed by an arbitrary number of As.
posted by flabdablet at 9:56 PM on June 10, 2023 [1 favorite]
But it's like Pac-Man or Space Invaders, where you fly off one side and end up on the other, right?
You could define a distance function that works that way, if you want.
For example: define a single-letter distance as follows: the distance from any letter to any other is the smallest number of steps you could take to go between them alphabetically, including being allowed to step over a zero that glues the end of the alphabet back to the start. So for example, A - D = D - A = 3, because you can get from A to D (or the other way around) in three steps; A - A = zero, because the distance from any letter to itself is zero; A - zero = zero - A = 1, because there's only one step between A and the glue; likewise Z - zero = zero - Z = 1, and A - Z = Z - A = 2.
The maximum possible result of any such subtraction is therefore 13, because if you ever had to take more than 13 steps to get from one letter to another you could do it in less by going the other way around past the glue. Going from P to B, for example, takes 14 steps if you go backwards, but only 13 if you go Q, R, S, T, U , V, W, Z, Y, Z, zero, A, B.
Next, lay out your words as if they were a subtraction problem, using zeroes as padding, and calculate single-letter distances columnwise:
posted by flabdablet at 11:30 PM on June 10, 2023 [1 favorite]
You could define a distance function that works that way, if you want.
For example: define a single-letter distance as follows: the distance from any letter to any other is the smallest number of steps you could take to go between them alphabetically, including being allowed to step over a zero that glues the end of the alphabet back to the start. So for example, A - D = D - A = 3, because you can get from A to D (or the other way around) in three steps; A - A = zero, because the distance from any letter to itself is zero; A - zero = zero - A = 1, because there's only one step between A and the glue; likewise Z - zero = zero - Z = 1, and A - Z = Z - A = 2.
The maximum possible result of any such subtraction is therefore 13, because if you ever had to take more than 13 steps to get from one letter to another you could do it in less by going the other way around past the glue. Going from P to B, for example, takes 14 steps if you go backwards, but only 13 if you go Q, R, S, T, U , V, W, Z, Y, Z, zero, A, B.
Next, lay out your words as if they were a subtraction problem, using zeroes as padding, and calculate single-letter distances columnwise:
C . H E L S E A -C . H E F 0 0 0 -------------------- 0 . 0 0 6 8 5 1Likewise:
C . H E L S E A -C . H E R R Y 0 -------------------- 0 . 0 0 6 1 7 1Because the maximum value in any given result column is 13 and the minimum is 0, we can distil the result down to a single number without having the columns blur into each other by treating it as base 14. You could go through the exercise of converting that to decimal if you wanted, but even without doing that it's clear that by this metric, CHELSEA is further from CHEF than from CHERRY.
posted by flabdablet at 11:30 PM on June 10, 2023 [1 favorite]
LOL, I'm so glad I needed to get some sleep. Might think more a bit later after coffee.
posted by zengargoyle at 1:22 AM on June 11, 2023
posted by zengargoyle at 1:22 AM on June 11, 2023
Another circular-alphabet distance function, this one not based on sorting order or anything like a number line:
Imagine you have one of those combination bike locks with the numbered discs that all need to be rotated to the correct spot before the lock will open. Now make it bigger, so that instead of digits 0 - 9 each disc has the letters A - Z and a blank. Now make it longer, so it has as many discs as some arbitrary word you've set as the combination; say, seven discs for CHELSEA. Now close the lock and twiddle all the discs until the lock shows CHEF and three blanks.
Apply the same single-letter distance function I described above to corresponding letters from CHEF and CHELSEA, then just add up the results from all seven columns without applying any kind of positional scaling.
What you have now is a distance function that tells you the minimum number of disc steps it would take to get the lock to open.
posted by flabdablet at 2:54 PM on June 11, 2023
Imagine you have one of those combination bike locks with the numbered discs that all need to be rotated to the correct spot before the lock will open. Now make it bigger, so that instead of digits 0 - 9 each disc has the letters A - Z and a blank. Now make it longer, so it has as many discs as some arbitrary word you've set as the combination; say, seven discs for CHELSEA. Now close the lock and twiddle all the discs until the lock shows CHEF and three blanks.
Apply the same single-letter distance function I described above to corresponding letters from CHEF and CHELSEA, then just add up the results from all seven columns without applying any kind of positional scaling.
What you have now is a distance function that tells you the minimum number of disc steps it would take to get the lock to open.
posted by flabdablet at 2:54 PM on June 11, 2023
Plot twist: I think the word Chelsea is more similar to the word Cherry than is is to the word Chef, and no other metric matters except sound.
Chef and Chelsea actually don't have the same first sound - because of the difference between the soft CH and hard CH.
Say it out loud and your ear will know: SHef CHelsea doesn't sound like matching words. CHElseY CHErrY does, especially with the matching 'ee' sound at the end.
posted by nouvelle-personne at 3:15 PM on June 11, 2023
Chef and Chelsea actually don't have the same first sound - because of the difference between the soft CH and hard CH.
Say it out loud and your ear will know: SHef CHelsea doesn't sound like matching words. CHElseY CHErrY does, especially with the matching 'ee' sound at the end.
posted by nouvelle-personne at 3:15 PM on June 11, 2023
Ooo, so then you’re talking about encoding phonemes rather than letters, and putting some sort of distance metric on that. I like it!
posted by eviemath at 5:34 PM on June 11, 2023
posted by eviemath at 5:34 PM on June 11, 2023
Soundex codes are the oldest implementation of that idea that I'm aware of.
posted by flabdablet at 6:12 PM on June 11, 2023
posted by flabdablet at 6:12 PM on June 11, 2023
« Older Orange Holiday SIM for Paris Travel - How to Do 2... | How to get access to an old YouTube account Newer »
This thread is closed to new comments.
posted by meijusa at 11:55 PM on June 9, 2023 [11 favorites]