How to not calculate a matrix twice
September 9, 2018 5:21 AM   Subscribe

ProgrammingFilter. The short version: I've realized I'll be doing twice the calculation work I need to do in an operation, and I could use a hand cutting it down. Details inside.

So I've got this pet project I've been working with on and off for a few years now, and it's actually getting past the concept stage into something that might be usable by actual humans.

In doing some of the planning, though, I realized I've hit an optimization problem. Part of what I'm doing is calculating how similar two users are based on their ratings of items (books, in this case). The actual details of the calculation aren't important, but what is is that the similarity is a symmetrical relationship. That is, User 1 being, say, 85% similar to User 2 implies that User 2 is also 85% similar to User 1.

Now, the naive version of this I did originally went something like "For each user, calculate their similarity with every other user and store it". And that works, but it scales terribly performance-wise and worse, repeats itself--i.e., when you get to user 2 in the loop, you've already calculated their similarity with user 1, so you're just duplicating work.

What I'm needing is some method of only calculating the values I haven't already done. Just off the top of my head, I'm thinking maybe some sort of "has been updated" flag column in the table I'm storing all of these relationships in, but that feels a bit...hacky. If there's a more elegant way to do it, I'm completely open to suggestions. Any ideas?
posted by Mr. Bad Example to Computers & Internet (14 answers total) 2 users marked this as a favorite
You can generate all pairs once each by only comparing each user to the users "greater than" them by some ordering. E.g., for a given user, only compare them to users with a greater userid (assuming unique IDs exist). And when you calculate one result, store it for both users.
posted by whatnotever at 5:36 AM on September 9, 2018 [5 favorites]

You just need a triangular structure to your for loop.
So that you compute R(1,2)...R(1,N), and then R(2,3)...R(2,N), and then
R(3,4)...R(3,N)... etc.

This assumes you have an enumerated list of users.
posted by SaltySalticid at 5:39 AM on September 9, 2018 [2 favorites]

Probably the easiest thing is something like this (in pseudocode):

# initialize similarity matrix to an identity matrix,
# since the correlation of user to themselves is 1.
mat = identity(nUsers, nUsers)

# loop over matrix
for i = 0; i < nUsers; i++:
  for j =0; j < i; j++:
    sim = get_sim(User[i],User[j])
    mat[i,j] = sim
    mat[j,i] = sim

Depending on what your similarity metric and on what language you're using, it might be possible to optimize further.
posted by tickingclock at 5:42 AM on September 9, 2018 [1 favorite]

I just saw Saltysalticid's answer. What they are suggesting is similar to what I posted, but is more like this:

for i = 0; i < nUsers; i++:
  for j = i; j < nUsers; j++: ### difference is here
    sim = get_sim(User[i],User[j])
    mat[i,j] = sim
    mat[j,i] = sim

Either will work.
posted by tickingclock at 5:46 AM on September 9, 2018

Depending on what your similarity metric and on what language you're using, it might be possible to optimize further.

... I should explain further. I'm used to scientific programming, which means that for the languages I am familiar with (e.g., Matlab), there is usually a way to vectorize the code. If this doesn't make sense to you, that's okay -- the nested for-loop solution is fine.
posted by tickingclock at 5:49 AM on September 9, 2018 [1 favorite]

Have you looked at how different recommender system algorithms approach optimization? Also, unsure if you’re already doing this, but I wonder if you convert the matrices into vectors and use the concept of cosine similarity, making on-the-fly comparisons would be less costly.
posted by elephantsvanish at 7:58 AM on September 9, 2018

This is a problem that comes up a ton in machine learning, so there are some fast ways of doing it.

What programming language is this in? It makes a big difference. If it's python, you should strongly consider using something like the pdist function from scipy. That will use a bunch of highly-optimized C code. I'm sure R has something similar.

Try googling for "pairwise distance" and whatever language you're using.

There is also a vectorization trick, as previously mentioned. Essentially, this would require you to put all users as rows in a big matrix. Then the cost is just a single matrix-matrix multiply and some scalar transformations on the result.

The only advantage of these techniques (vectorization, or just calling out to a library function) is it would let you take advantage of code that's been optimized to death for 20+ years, and which may also be in a faster language.

You should probably just try the nested-loop version first though, it might be fast enough.
posted by vogon_poet at 9:03 AM on September 9, 2018 [1 favorite]

I think you need a separate entity for all the similarities, call it something like cliques, you will need to scan the data twice, once to figure out what the cliques should be and perhaps how many there should be, and then a second scan to assign every user to the relevant clique.
Then when you add additional users all you need to calculate is which clique they should join rather than comparing them to every existing user.
posted by Lanark at 9:34 AM on September 9, 2018

another option is to think of the similarities as undirected edges of a graph (where the nodes are the users). Then just iterate over the edges.
Same concept in matrix form is discussed here:
posted by forforf at 9:59 AM on September 9, 2018

Thanks, everybody. I think I was groping towards something like the triangular structure stuff above, but I haven't done anything with similarity/distance matrices since my (terrible, terrible) thesis, so I couldn't get my head around it.

I realized after I posted that I might be overcomplicating things for myself by calculating every user's similarity against every other user all at once. After all, every user isn't going to be changing their ratings that drastically that often. What I may end up doing is a one-to-many recalculation after some threshold number of changes by an individual user...or something along those lines.

For those of you who're curious, this is going to be Python, mainly because I really like Python and I've been playing around with both Django and Flask lately. I'm offloading as much as I can into native code, though--I'm using a similarity function from SciPy for now, for instance. It should be fast enough, although I'm also thinking about handing off the calculation to something like Celery as a background task if it starts taking too long.
posted by Mr. Bad Example at 11:41 AM on September 9, 2018

Depending on how you're using the results, you may only need to fully calculate a fraction of them. If you only care about users very similar to one another, perhaps you could use a hueristic to discard pairs you know (or strongly suspect, in a probabilistic sense) will be far apart. Can you guess which factors will contribute most to distance, evaluate those first, and prune the search early if they exceed a threshold?
posted by sourcequench at 2:17 PM on September 9, 2018 [1 favorite]

Algorithms for nearest neighbors or (if dimensionality for each user is high) approximate nearest neighbors techniques might be useful to do that and could be a fun algorithmic rabbit hole to go down.
posted by vogon_poet at 6:32 PM on September 9, 2018

If you don't want to think too hard, assuming your operation is commutative, and you can come with an ordering for your inputs, wouldn't memoization do the job?
posted by rdr at 4:19 AM on September 10, 2018

It's going to be pretty simple, at least at first--I'm just going to take the ratings where two users have rated the same items and then generate either a cosine similarity or Pearson correlation coefficient from that. (I've played around with those and a few other similarity algorithms, and those seem to get me the best results so far.)

Then after I get everything working, I get to start having fun with the sparse user problem, but one obstacle at a time. I might fall back on a default "People who liked what you liked also really liked these other things" search if I don't find similar enough users.
posted by Mr. Bad Example at 8:04 AM on September 10, 2018

« Older Voicemail service question   |   Whole House Generator: How does it work Newer »

You are not logged in, either login or create an account to post comments