December 12, 2011 12:15 PM Subscribe

MathFilter: Help me calculate the location of points with lots and lots of distances.

Say I have a list of points A, B, and C. A is x away from B, B is y away from C, and A is z away from C. This is relatively straightforward to visualize as a triangle and compute with free tools online.

However, I have 12 points, and a list of the distances between each point and all of the others (in the form of real numbers, so A to B is .96, B to C is .85, etc. multiplied 144 times). I've been trying to find something online that can make this into a visualization but perhaps it's way too complicated?

I'd really like some sort of representation of the points in space that I can label. (It's for a cool project on language distance I did, but it's been years since I've done complex math and I'm struggling on this part).
posted by petiteviolette to Science & Nature (31 answers total)

Say I have a list of points A, B, and C. A is x away from B, B is y away from C, and A is z away from C. This is relatively straightforward to visualize as a triangle and compute with free tools online.

However, I have 12 points, and a list of the distances between each point and all of the others (in the form of real numbers, so A to B is .96, B to C is .85, etc. multiplied 144 times). I've been trying to find something online that can make this into a visualization but perhaps it's way too complicated?

I'd really like some sort of representation of the points in space that I can label. (It's for a cool project on language distance I did, but it's been years since I've done complex math and I'm struggling on this part).

Oh. I missed one part. If you want to stress the connection between the points, maybe distribute them on a plane in some fashion and use the thickness/color of the line to represent the strength of the connection. Graphviz can do stuff like this.

posted by smidgen at 12:25 PM on December 12, 2011

posted by smidgen at 12:25 PM on December 12, 2011

Well, my problem is that I don't actually HAVE points, just relative distances between them (each point is a language). How can I compute the centroid with only distances?

posted by petiteviolette at 12:26 PM on December 12, 2011

posted by petiteviolette at 12:26 PM on December 12, 2011

If this is in 2 dimensions, the distance between (a, b) and (c, d) = sqrt((a-c)^2 + (b-d)^2))

Set one of your points on the origin, then you can set up a big set of equations and unknowns.

Solving for those unknowns should give you points you can plug in

posted by chndrcks at 12:32 PM on December 12, 2011

Set one of your points on the origin, then you can set up a big set of equations and unknowns.

Solving for those unknowns should give you points you can plug in

posted by chndrcks at 12:32 PM on December 12, 2011

Given two points (x_{a}, y_{a}) and (x_{b}, y_{b}) and a distance between them d you have an equation:

sqrt( (x_{a} - x_{b})^{2} + (y_{a} - y_{b})^{2} ) = d

Since you have 12 points you have 24 unknowns and 144 equations, so you should be able to solve it as a system of equations, assuming the distances work out. It's possible they can't be embedded in the plane, so you might have to bump it up to higher dimensions (e.g. three dimensional space, line color, thickness, etc). Eventually you'll hit a high enough dimensional space that you can embed the points, even if you have to get to where you have more unknowns than equations, and thus an infinite number of possible solutions (i.e. more than 12 dimensions).

posted by jedicus at 12:32 PM on December 12, 2011

sqrt( (x

Since you have 12 points you have 24 unknowns and 144 equations, so you should be able to solve it as a system of equations, assuming the distances work out. It's possible they can't be embedded in the plane, so you might have to bump it up to higher dimensions (e.g. three dimensional space, line color, thickness, etc). Eventually you'll hit a high enough dimensional space that you can embed the points, even if you have to get to where you have more unknowns than equations, and thus an infinite number of possible solutions (i.e. more than 12 dimensions).

posted by jedicus at 12:32 PM on December 12, 2011

I'm not sure if this is what you're looking for, but you could place one point at the origin, and another some distance away on the x-axis. Then your third point is determined by how far it is from the first two, (well, up to a mirror image, so assume it's above the x-axis). Then every other point can be placed by its distance from the first two points, and its distance to the the third point tells you whether it's above or below the x-axis.

posted by monkeymadness at 12:33 PM on December 12, 2011 [1 favorite]

posted by monkeymadness at 12:33 PM on December 12, 2011 [1 favorite]

Also, you shouldn't have to deal with 144 distances, since a-b is the same as b-a. You should have C(12,2)=66 distances.

posted by monkeymadness at 12:36 PM on December 12, 2011

posted by monkeymadness at 12:36 PM on December 12, 2011

The keyword you want is multidimensional scaling I believe. The algorithm will find 12 vectors in N-dimensional space (usually N=2 or 3 for visualization) that best approximates your inter-point distance matrix.

MDS isn't a solved problem as far as I know. Because you only have 12 points, the suggestions to just directly solve for the positions exactly might be reasonable. However, since each point is a language (and may not actually exist in a 2D space), there may not be an*exact* solution to your problem. MDS algorithms will find some approximation that is reasonable.

A paper I read a while back is here, with some figures showing the accuracy of the model if error is present.

posted by bessel functions seem unnecessarily complicated at 12:40 PM on December 12, 2011 [2 favorites]

MDS isn't a solved problem as far as I know. Because you only have 12 points, the suggestions to just directly solve for the positions exactly might be reasonable. However, since each point is a language (and may not actually exist in a 2D space), there may not be an

A paper I read a while back is here, with some figures showing the accuracy of the model if error is present.

posted by bessel functions seem unnecessarily complicated at 12:40 PM on December 12, 2011 [2 favorites]

Well, that's assuming the data obeys the commutative property, but if it doesn't then the poster has bigger problems.

In any case, the system of equations approach still works with 66 distances, and it means the worst case number of dimensions is a mere 6.

posted by jedicus at 12:40 PM on December 12, 2011

I did the language to language by hand (including each language to itself) and got 91.

These distances were computed with vectors, so I could easily use the original vectors to plot the points, but that seemed more complicated. Would it be easier?

What makes most sense to me is chndrcks's suggestion, but does it matter which point is the origin? I also absolutely don't want to do this by hand. I know a little Scala, but that's about it as far as automatizing.

Here's a spreadsheet of what I'm working with, if that makes it clearer.

posted by petiteviolette at 1:14 PM on December 12, 2011

These distances were computed with vectors, so I could easily use the original vectors to plot the points, but that seemed more complicated. Would it be easier?

What makes most sense to me is chndrcks's suggestion, but does it matter which point is the origin? I also absolutely don't want to do this by hand. I know a little Scala, but that's about it as far as automatizing.

Here's a spreadsheet of what I'm working with, if that makes it clearer.

posted by petiteviolette at 1:14 PM on December 12, 2011

D'oh, 91 distances, not just "91." And that's including each language to itself.

posted by petiteviolette at 1:14 PM on December 12, 2011

posted by petiteviolette at 1:14 PM on December 12, 2011

I see. I'm not familiar with language distance, but if bessel is right there's a chance these points can't be plotted in 2D. Consider four points, each exactly 1 unit distance from the other three. You need 3D to plot them properly, (the vertices of a tetrahedron). Still, I'd bet you could approximate something in 2D. Please let us know when you've done it. This sounds interesting.

posted by monkeymadness at 1:15 PM on December 12, 2011

posted by monkeymadness at 1:15 PM on December 12, 2011

Yeah, what **monkeymadness** said. If this data actually comes from a real set of points that were embedded in the plane, then his procedure would work fine: just pick three points as your reference points, and use the distances to all three to figure out the locations of the other points.

However, do keep in mind that not all sets of pairwise distances correspond to possible sets of points on a plane, even a higher-dimensional one or a curved one. Consider, for example, the set of distances dist(A,B) = 1, dist(B,C) = 1, dist(A,C) = 10.

posted by Johnny Assay at 1:17 PM on December 12, 2011

However, do keep in mind that not all sets of pairwise distances correspond to possible sets of points on a plane, even a higher-dimensional one or a curved one. Consider, for example, the set of distances dist(A,B) = 1, dist(B,C) = 1, dist(A,C) = 10.

posted by Johnny Assay at 1:17 PM on December 12, 2011

What's the dimensionality of the vectors you used to compute the distances in the first place? Those vectors *are* essentially points in some higher-dimensional Euclidean space, so you could plot them precisely (with the appropriate distances) in that kind of space. However, bringing it down to a lower-dimensional space with any kind of fidelity is a much harder problem, as indicated by Dr. Bessel above.

posted by Johnny Assay at 1:26 PM on December 12, 2011

posted by Johnny Assay at 1:26 PM on December 12, 2011

Each vector has 136 dimensions. Talk about multidimensional.

posted by petiteviolette at 1:38 PM on December 12, 2011

posted by petiteviolette at 1:38 PM on December 12, 2011

Mapping points into a lower-dimensional space without distorting their pairwise distances very much is related to the Johnson–Lindenstrauss lemma. I only know of this work from the theoretical side; I don't know that anybody has done the work to make it practical.

posted by stebulus at 2:33 PM on December 12, 2011

posted by stebulus at 2:33 PM on December 12, 2011

Well, any two points can be contained in a line, any three in a plane, any four in space-- and I don't immediately see why that wouldn't continue, constructively by induction, so I'd guess you'd need a maximum of 11 dimensions for your set of 12 points.

posted by jamjam at 3:15 PM on December 12, 2011

posted by jamjam at 3:15 PM on December 12, 2011

But the question is, is that graphable in a way that a non-mathematician can do though some sort of GUI?

posted by petiteviolette at 3:29 PM on December 12, 2011

posted by petiteviolette at 3:29 PM on December 12, 2011

That's a violation of the triangle inequality, i.e., dist(A,B) + dist(B,C) ≤ dist(A,C). If the distances actually constitute a norm, it should be possible to visualize them in some space.

posted by Nomyte at 3:41 PM on December 12, 2011

Yeah, unfortunately, depending on what the distances are, it might not even be possible to draw them. That is, the set of distances may not be legal, under the mathematical laws of "distance" (technically, this forms what is called a

This won't work, because -- for example -- 3 points in a plane can't just have any arbitrary distances from each other (cf the triangle inequality above).

posted by Frobenius Twist at 3:50 PM on December 12, 2011

Hmm. Let's see if I can break this down:

These 12 languages have a total of 136 phonemes (units of sound) between them.

For each language, one space in the matrix corresponds to each phoneme. If the phoneme exists in that language, it gets a "0," if it doesn't exist, it gets a value from 0-1 (which is based on markedness, I won't bore you with that part)

So the distances are based on a shared data set and seeing if each one is a "match" for each slot in the matrix. Does this sound like it would have legal distances?

posted by petiteviolette at 3:54 PM on December 12, 2011

These 12 languages have a total of 136 phonemes (units of sound) between them.

For each language, one space in the matrix corresponds to each phoneme. If the phoneme exists in that language, it gets a "0," if it doesn't exist, it gets a value from 0-1 (which is based on markedness, I won't bore you with that part)

So the distances are based on a shared data set and seeing if each one is a "match" for each slot in the matrix. Does this sound like it would have legal distances?

posted by petiteviolette at 3:54 PM on December 12, 2011

Hmm . . . I don't know. However, looking at your answers above, it seems as though you've already plotted these points in 12-dimensional space? If so, it's certainly possible, since you've already done it. If your question is whether or not you'd be able to visualize these in two- or three-dimensional space, though, the answer is almost certainly no, since an arbitrary set of points in 12-dimensional space can't be mapped to two- or three-dimensional space without distorting distances (as Johnny Assay mentioned above).

It's possible that you could come up with a more clever way of visualizing it than just drawing a bunch of points. Eg, you could use color-coding to represent distances, and so on. (I think you actually might want to work with the vectors rather than the distances between them, since the set of vectors carries a lot more geometric information than just the distances.)

posted by Frobenius Twist at 4:22 PM on December 12, 2011

It's possible that you could come up with a more clever way of visualizing it than just drawing a bunch of points. Eg, you could use color-coding to represent distances, and so on. (I think you actually might want to work with the vectors rather than the distances between them, since the set of vectors carries a lot more geometric information than just the distances.)

posted by Frobenius Twist at 4:22 PM on December 12, 2011

In fact, here is an explicit technique for finding what dimension of space you *could* embed this in. You already have a matrix. Plug it into Maple or some other program and ask it to compute the rank of the matrix. (You don't have to know what the definition of "rank" is as long as you can just get a computer program to do it for you). If the rank is n, the best you'll be able to do is n-dimensional space. (If you're really lucky, the rank will be 2 or 3, but I would be shocked if that happened.)

posted by Frobenius Twist at 4:27 PM on December 12, 2011 [1 favorite]

posted by Frobenius Twist at 4:27 PM on December 12, 2011 [1 favorite]

The rank only gives the dimension for an exact embedding; lower dimensions can usually be achieved at the cost of some distortion.

posted by stebulus at 4:55 PM on December 12, 2011

posted by stebulus at 4:55 PM on December 12, 2011

If it helps, what you have measured is called the Hamming distance.

As others have said, you will almost certainly not be able to get an exact representation of all of these distances in 2 or 3 dimensions. You have a set of 12 points in a 136-dimensional space. You can always project them to a 2 or 3 dimension space, but it is unlikely you can do so without some loss of information.

However, some form of clustering analysis might yield some useful information, grouping "nearby" languages automatically. Hierarchical Clustering, for example, might apply well here. This Q&A might give a bit of insight into how that could work. You should be able to do it by hand, for the amount of data you have.

posted by whatnotever at 5:43 PM on December 12, 2011

As others have said, you will almost certainly not be able to get an exact representation of all of these distances in 2 or 3 dimensions. You have a set of 12 points in a 136-dimensional space. You can always project them to a 2 or 3 dimension space, but it is unlikely you can do so without some loss of information.

However, some form of clustering analysis might yield some useful information, grouping "nearby" languages automatically. Hierarchical Clustering, for example, might apply well here. This Q&A might give a bit of insight into how that could work. You should be able to do it by hand, for the amount of data you have.

posted by whatnotever at 5:43 PM on December 12, 2011

Sorry to come late to this. The easiest off-the-shelf thing to do, since you already have a 12x12 matrix of similarity measurements, is indeed multidimensional scaling.

However: you need to know that there are tons of people who do exactly this kind of thing -- visualization of data sets, usually much bigger and scarier ones than this, using some measure of similarity between objects. If you're in a university, look for people who are interested in data science -- you will probably find them in computer science, electrical engineering, statistics, or applied math. Ask one of them to help you. We LOVE when people actually need our techniques. And this is seriously under an hour's work for one of those guys.

And more importantly: if you do this on your own, the information you get out is probably not going to be very useful. A specialist can tell you whether the distance measure you're using is a good one -- how do distinguish the results you get from random noise -- and etc. etc.

posted by escabeche at 7:15 PM on December 12, 2011

However: you need to know that there are tons of people who do exactly this kind of thing -- visualization of data sets, usually much bigger and scarier ones than this, using some measure of similarity between objects. If you're in a university, look for people who are interested in data science -- you will probably find them in computer science, electrical engineering, statistics, or applied math. Ask one of them to help you. We LOVE when people actually need our techniques. And this is seriously under an hour's work for one of those guys.

And more importantly: if you do this on your own, the information you get out is probably not going to be very useful. A specialist can tell you whether the distance measure you're using is a good one -- how do distinguish the results you get from random noise -- and etc. etc.

posted by escabeche at 7:15 PM on December 12, 2011

E.G. here's a paper where MDS is used to make a map of languages using their perceived similarity to non-speakers.

posted by escabeche at 7:18 PM on December 12, 2011

posted by escabeche at 7:18 PM on December 12, 2011

Hooray data! Unfortunately, your dataset is small enough and your question interesting enough that I'm going to talk too much....

Unfortunately, I have some bad news for you: the data in your spreadsheet*cannot* be exactly represented by 12 points in a 2 or 3 dimensional space. As jamjam mentions, this can represented in an 11-D space, but that's terrible for visualization.

Clearly the "cosine distance" in your spreadsheet isn't the distance between points: all of the elements (i,i) are "cosine distance"=1 away from themselves (i.e. C_{ii}=1). The "Euclidean distance" doesn't suffer from this limitation (i.e. E_{ii}=0), so I only looked at that. The Euclidean distance between points has an awesome little aspect that I only learned about recently:

The matrix of squared distances d_{ij}^2 between N points in R^d is less than or equal to d+2.

Here are two papers (randomly selected from the top hits in Google, both pdfs) that mention this fact (and differs slightly from what Frobenius Twist mentioned). It's the worlds easiest way to check to see if you can exactly solve for the position of your points in d-dimensional space. And unfortunately, your matrix of distances has rank 13: your exact points are in an 11-dimensional space.

So the unfortunate situation is that you can't solve this problem by exactly solving for the position of 12 points in 2 or 3D space. Even more unfortunately, your matrix doesn't have well-separated eigenvalues*, so a 2- or 3-d approximation won't be all that great either. If your matrix had, for example, 5 large eigenvalues and 8 close to zero, you'd be in better shape.

Sorry to be the bearer of bad news, but your data is embedded in a high-dimensional space. There might be different methods to measure the closeness between the languages that allow for a low-dimensional representation, or there might be an approximation to your distance matrix that is low rank (but probably not, due to the eigenvalues issue*), but there's probably no "graphable in a way that a non-mathematician can do though some sort of GUI" way to do this.

On preview: escabeche is correct, talk to some CS / applied math guys. They'll love this problem.

*The eigenvalues of your distance matrix are

{43.3858, -6.96507, -5.06002, -4.5838, -4.55059, -4.45514, -3.70672, -3.3123,

-3.05796, -3.053, -2.0027, -1.62308, -1.01542}

A system that could be well-approximated by d-dimensional system would have d+2 non-negligible eigenvalues (with "negligible" poorly defined).

posted by bessel functions seem unnecessarily complicated at 7:36 PM on December 12, 2011 [1 favorite]

Unfortunately, I have some bad news for you: the data in your spreadsheet

Clearly the "cosine distance" in your spreadsheet isn't the distance between points: all of the elements (i,i) are "cosine distance"=1 away from themselves (i.e. C_{ii}=1). The "Euclidean distance" doesn't suffer from this limitation (i.e. E_{ii}=0), so I only looked at that. The Euclidean distance between points has an awesome little aspect that I only learned about recently:

The matrix of squared distances d_{ij}^2 between N points in R^d is less than or equal to d+2.

Here are two papers (randomly selected from the top hits in Google, both pdfs) that mention this fact (and differs slightly from what Frobenius Twist mentioned). It's the worlds easiest way to check to see if you can exactly solve for the position of your points in d-dimensional space. And unfortunately, your matrix of distances has rank 13: your exact points are in an 11-dimensional space.

So the unfortunate situation is that you can't solve this problem by exactly solving for the position of 12 points in 2 or 3D space. Even more unfortunately, your matrix doesn't have well-separated eigenvalues*, so a 2- or 3-d approximation won't be all that great either. If your matrix had, for example, 5 large eigenvalues and 8 close to zero, you'd be in better shape.

Sorry to be the bearer of bad news, but your data is embedded in a high-dimensional space. There might be different methods to measure the closeness between the languages that allow for a low-dimensional representation, or there might be an approximation to your distance matrix that is low rank (but probably not, due to the eigenvalues issue*), but there's probably no "graphable in a way that a non-mathematician can do though some sort of GUI" way to do this.

On preview: escabeche is correct, talk to some CS / applied math guys. They'll love this problem.

*The eigenvalues of your distance matrix are

{43.3858, -6.96507, -5.06002, -4.5838, -4.55059, -4.45514, -3.70672, -3.3123,

-3.05796, -3.053, -2.0027, -1.62308, -1.01542}

A system that could be well-approximated by d-dimensional system would have d+2 non-negligible eigenvalues (with "negligible" poorly defined).

posted by bessel functions seem unnecessarily complicated at 7:36 PM on December 12, 2011 [1 favorite]

I hope you don't mind me taking the data from your spreadsheet, but I put together a little thing in Flash which draws the languages as points and just moves them as if they're connected by springs.

It's super rough (on average it gets the distances to within about 20% of what they should be), but I thought it might still be fun to see.

It starts with the points in random positions, and you can try to get a new configuration by clicking the "randomise" button. The lines between languages represent how*incorrect* the distance between them is — if it's too short the line will be blue, and it will turn red if they're too far apart. If the distance is exactly right, the line will disappear. You can try to manually shift languages around by dragging them, and the "output" button will copy a space separated list of the languages with their x and y positions to your clipboard ([name] [x] [y]).

Also, I'm not sure if I missed something, but aren't there actually thirteen languages there?

posted by lucidium at 7:12 AM on December 13, 2011

It's super rough (on average it gets the distances to within about 20% of what they should be), but I thought it might still be fun to see.

It starts with the points in random positions, and you can try to get a new configuration by clicking the "randomise" button. The lines between languages represent how

Also, I'm not sure if I missed something, but aren't there actually thirteen languages there?

posted by lucidium at 7:12 AM on December 13, 2011

lucidum, that is so freaking cool. Thank you for sharing that!

I'm working with a friend who is doing her master's in math right now as well, but as she's busy with, you know, her own work, things will probably go pretty slowly.

Thanks everyone for your input. I would rather understand the problem on my own than just shunt it off to someone in the CS department (and this project was for a grad class in the CS dept, so I'm technically already there), so I'll be working through some of these papers and links in the future.

posted by petiteviolette at 9:23 AM on December 13, 2011

I'm working with a friend who is doing her master's in math right now as well, but as she's busy with, you know, her own work, things will probably go pretty slowly.

Thanks everyone for your input. I would rather understand the problem on my own than just shunt it off to someone in the CS department (and this project was for a grad class in the CS dept, so I'm technically already there), so I'll be working through some of these papers and links in the future.

posted by petiteviolette at 9:23 AM on December 13, 2011

And there are thirteen languages, by the way. My original project had 12, but I added one along the way and forgot about it (probably because the output of my code was ranking 12 languages *compared* to the native language input and I internalized that number so often).

posted by petiteviolette at 9:27 AM on December 13, 2011

posted by petiteviolette at 9:27 AM on December 13, 2011

This thread is closed to new comments.

posted by smidgen at 12:22 PM on December 12, 2011