# Help me track myself, or: 23.2% chance of the East Village

April 15, 2012 11:57 PM Subscribe

How can I write an algorithm that predicts my future location?

I'm working on a project that involves tracking my location with OpenPaths, and creating a website displaying a map of probable locations that I would be at the current point in time, based on two factors: the time of day and day of week.

So, if you accessed the website at 4:52pm on a Tuesday, the website would calculate according to historical location data, the probability that I'm at X, Y, and Z locations at 4:52pm on Tuesday. This information would be shown on a map with vague fuzzy locations, so the point is not so much that the user knows exactly where I am, but that he/she gets a vague sense of where I might be. This is a screenshot of a prototype website, built with a mix of php and python, in which gradients are overlaid onto a Google Maps-driven site.

I have everything working except for the actual algorithm. Currently the website just grabs data from a similar timeframe, and privileges the data if it happens to lie on the same day of the week. So if I've been somewhere at 5:10pm on a previous Tuesday, it'll display that moment with a higher opacity than a 4:52pm location point on a previous Saturday, for example. But this algorithm leads to certain inaccuracies -- for example, my schedule depends on the specific day of the week, but I can't just look at weekday data, since I also follow cycles of day/night that are similar across all spectrums.

I'm trying to implement a better, 'real' algorithm. Would anyone happen to have any ideas for an effective and relatively accurate algorithm to thus predict my location based on historical data? I'm not interested in hyper-accuracy, since I'm trying to obfuscate my specific location anyways, just a kind of proof-of-concept workability.

Location-prediction algorithms seem to result in a slew of massively complex algorithms. Some simple machine learning/markov model/decision tree would be interesting, I think, but I'm not sure how to apply location (gps coords), day of week, and time as two different variables and one output. I've looked at the Google Prediction API, which is perfect, except that it can only deal with one result, not two (lat & long coordinates). Something even more simple would suffice. Any ideas? Thanks!

I'm working on a project that involves tracking my location with OpenPaths, and creating a website displaying a map of probable locations that I would be at the current point in time, based on two factors: the time of day and day of week.

So, if you accessed the website at 4:52pm on a Tuesday, the website would calculate according to historical location data, the probability that I'm at X, Y, and Z locations at 4:52pm on Tuesday. This information would be shown on a map with vague fuzzy locations, so the point is not so much that the user knows exactly where I am, but that he/she gets a vague sense of where I might be. This is a screenshot of a prototype website, built with a mix of php and python, in which gradients are overlaid onto a Google Maps-driven site.

I have everything working except for the actual algorithm. Currently the website just grabs data from a similar timeframe, and privileges the data if it happens to lie on the same day of the week. So if I've been somewhere at 5:10pm on a previous Tuesday, it'll display that moment with a higher opacity than a 4:52pm location point on a previous Saturday, for example. But this algorithm leads to certain inaccuracies -- for example, my schedule depends on the specific day of the week, but I can't just look at weekday data, since I also follow cycles of day/night that are similar across all spectrums.

I'm trying to implement a better, 'real' algorithm. Would anyone happen to have any ideas for an effective and relatively accurate algorithm to thus predict my location based on historical data? I'm not interested in hyper-accuracy, since I'm trying to obfuscate my specific location anyways, just a kind of proof-of-concept workability.

Location-prediction algorithms seem to result in a slew of massively complex algorithms. Some simple machine learning/markov model/decision tree would be interesting, I think, but I'm not sure how to apply location (gps coords), day of week, and time as two different variables and one output. I've looked at the Google Prediction API, which is perfect, except that it can only deal with one result, not two (lat & long coordinates). Something even more simple would suffice. Any ideas? Thanks!

Let me ignore time for a moment, and come back for it later. Let's suppose you have a large number of xy point observations of locations (call them "positions") you have been. One approach (almost certainly not the most elegant) is to consider a grid of some sort. Then, for each point on the grid, consider how far you would have to go to visit each of the positions. Pythagoras is probably fine. Points that score low are close to many of the positions; points that score highly are far away. You could transform that distance, say, take the log of it. That steepens the slope at the short end, so that being very close to a position is more important, and being further away is less important. This isn't maybe the best way of dealing with the location part, but it's a start.

The key thing is that once you're happy with assessing how far away a point is from the positions (and therefore how light the indicator should be), you just then apply it to all four dimensions. The question goes from being "how far is Fort Greene from the coffee shop I hang out in in Park Slope?", which is two dimensions, to adding in "how far is 7:23 PM from 7:45 PM?", and "how far is Thursday from Saturday"? In all cases, the answer is pretty clear; maybe 2 miles, 22 minutes, and 2 days. So all you need is just a scale, which is the sort of thing you can play around with. 2.5 miles per hour is "typical" walking speed, but my guess is that will privilege distance too much; I'd start with maybe 0.5 mile per hour. It's probably easier to start by ignoring the fourth, day-of-the-week dimension, and working with time. It may be more interesting to see how the 8 PM map looks on all of the days of the week.

posted by Homeboy Trouble at 12:57 AM on April 16, 2012 [1 favorite]

The key thing is that once you're happy with assessing how far away a point is from the positions (and therefore how light the indicator should be), you just then apply it to all four dimensions. The question goes from being "how far is Fort Greene from the coffee shop I hang out in in Park Slope?", which is two dimensions, to adding in "how far is 7:23 PM from 7:45 PM?", and "how far is Thursday from Saturday"? In all cases, the answer is pretty clear; maybe 2 miles, 22 minutes, and 2 days. So all you need is just a scale, which is the sort of thing you can play around with. 2.5 miles per hour is "typical" walking speed, but my guess is that will privilege distance too much; I'd start with maybe 0.5 mile per hour. It's probably easier to start by ignoring the fourth, day-of-the-week dimension, and working with time. It may be more interesting to see how the 8 PM map looks on all of the days of the week.

posted by Homeboy Trouble at 12:57 AM on April 16, 2012 [1 favorite]

To simplify your problem, just to start, compute "what are the odds you would be at a given place on a given day." (where places are named, discrete entities like home and the corner store, instead of lat/long).

So you have a set of all the places you've visited, and you have observations of on what days you visited each place.

You can then compute a few things:

P(place) - the probability that you'd be at a particular place, instead of all the other places. You can compute this as the number of times you visited a particular place, divided by all the times you visited any place.

P(place|day) - the probability that you'd be at a particular place on a given day. This is your place observation data divided by day. So here you count the number of times you visited a particular place on a certain day, divided by all the times you visited any place on that day.

Now you have your first, simple model of where you will be at a given time on your map. You can refine this further by chopping your data into finer time slices (every hour instead of every day), and also by computing the conditional probability that you visit Place Y given that you're at Place X, P(Y|X).

You can then add more constraints or smarts (it's impossible to get from Place 1 to Place 2 in only an hour, and it's unlikely that you'd visit three supermarkets in one day).

posted by zippy at 2:29 AM on April 16, 2012 [1 favorite]

So you have a set of all the places you've visited, and you have observations of on what days you visited each place.

You can then compute a few things:

P(place) - the probability that you'd be at a particular place, instead of all the other places. You can compute this as the number of times you visited a particular place, divided by all the times you visited any place.

P(place|day) - the probability that you'd be at a particular place on a given day. This is your place observation data divided by day. So here you count the number of times you visited a particular place on a certain day, divided by all the times you visited any place on that day.

Now you have your first, simple model of where you will be at a given time on your map. You can refine this further by chopping your data into finer time slices (every hour instead of every day), and also by computing the conditional probability that you visit Place Y given that you're at Place X, P(Y|X).

You can then add more constraints or smarts (it's impossible to get from Place 1 to Place 2 in only an hour, and it's unlikely that you'd visit three supermarkets in one day).

posted by zippy at 2:29 AM on April 16, 2012 [1 favorite]

Here's a very rough Markov model.

Start-Of-Day is followed by Location A 85% of the time.

Location A is followed by Location B 80% of the time.

Location A is followed by Location C 20% of the time.

Location C is followed by Location D 100% of the time.

Location C is followed by Location D 100% of the time.

...

Location Z is followed by End-Of-Day 95% of the time.

I'm saying build a database of journeys, not a database of locations. For extra credit...

Location A is followed by Location B 80% of the time on Weekdays.

Location A is followed by Location B 1% of the time on Weekends.

Location A is followed by Location C 15% of the time on Tuesdays.

posted by Leon at 2:35 AM on April 16, 2012 [1 favorite]

Start-Of-Day is followed by Location A 85% of the time.

Location A is followed by Location B 80% of the time.

Location A is followed by Location C 20% of the time.

Location C is followed by Location D 100% of the time.

Location C is followed by Location D 100% of the time.

...

Location Z is followed by End-Of-Day 95% of the time.

I'm saying build a database of journeys, not a database of locations. For extra credit...

Location A is followed by Location B 80% of the time on Weekdays.

Location A is followed by Location B 1% of the time on Weekends.

Location A is followed by Location C 15% of the time on Tuesdays.

posted by Leon at 2:35 AM on April 16, 2012 [1 favorite]

Thinking about it, I wonder if a first- or second-order model would be a better predictor. And you might simplify the problem by reducing lat/log (messy, noisy) to, say, foursquare checkins (atomic, discrete).

posted by Leon at 3:19 AM on April 16, 2012

posted by Leon at 3:19 AM on April 16, 2012

A problem with the simple markov model I outlined is that locations you visit more than once (eg you start and end your day at home) will kinda short-circuit it. Hmm. Timebox the locations, maybe?

posted by Leon at 3:55 AM on April 16, 2012

posted by Leon at 3:55 AM on April 16, 2012

If Google Prediction looks good to you, but it only returns one result, why not run it twice?

posted by flabdablet at 5:25 AM on April 16, 2012

posted by flabdablet at 5:25 AM on April 16, 2012

It sounds like your interest is not so much in predicting your location, but building a cool algorithm, otherwise this is easy - the chances that I will be at my house between 10:00pm and 6:00am is like 99%, and obviously, because you're trying to predict your own behavior, you can cheat to make your results more accurate.

Any model trying to predict the location of a creature with its own agency is going to be inherently complex because to do it accurately, you have to model the decision model of that creature accurately, and that doesn't even take into account external factors. how is your algorithm to know if your co-workers invite you out for a drink after work? It doesn't, so it will miss and guess that you went home instead of to the bar on those days.

Think of the same algorithm applied to some other animal besides people - try to predict where a hawk will be at any given time, you'll probably find that a component in your equations requires you to be able to model mouse populations.

posted by tylerkaraszewski at 7:51 AM on April 16, 2012

Any model trying to predict the location of a creature with its own agency is going to be inherently complex because to do it accurately, you have to model the decision model of that creature accurately, and that doesn't even take into account external factors. how is your algorithm to know if your co-workers invite you out for a drink after work? It doesn't, so it will miss and guess that you went home instead of to the bar on those days.

Think of the same algorithm applied to some other animal besides people - try to predict where a hawk will be at any given time, you'll probably find that a component in your equations requires you to be able to model mouse populations.

posted by tylerkaraszewski at 7:51 AM on April 16, 2012

The passing mention of Markov interests me, though it might be tough to apply. Markov chains work upon a countable number of states, where one state tends to follow another. Position is a continuous value, rather than discrete states, so to apply Markov chains to the problem you have to solve the sub-problem of converting a continuous value into a discrete value. There are clustering algorithms that could help with that, but accidental gerrymandering could ruin your results even if the Markov chains would have given good results if you had divided locations differently.

Flabdablet: The possible risk of just running Google Prediction twice, once for Long and once for Lat, is that, say one time it grabs the 40 from 40, 90 and another time it grabs the 100 from 50, 100, and so predicts you are at 40, 100, even if that's not a location you've ever been.

However, maybe the same idea of clustering locations could be applied to Google Prediction. You create a list of clusters, and Google Prediction predicts the likelihood of each cluster, which you turn back into locations to show on the map.

posted by RobotHero at 10:58 AM on April 16, 2012

Flabdablet: The possible risk of just running Google Prediction twice, once for Long and once for Lat, is that, say one time it grabs the 40 from 40, 90 and another time it grabs the 100 from 50, 100, and so predicts you are at 40, 100, even if that's not a location you've ever been.

However, maybe the same idea of clustering locations could be applied to Google Prediction. You create a list of clusters, and Google Prediction predicts the likelihood of each cluster, which you turn back into locations to show on the map.

posted by RobotHero at 10:58 AM on April 16, 2012

This thread is closed to new comments.

posted by suedehead at 11:58 PM on April 15, 2012