Tracing you by the hole you make in the world
July 9, 2011 3:24 PM
How many people would have to sign up for facebook before facebook can infer the existence of the rest?
This is something that I've been wondering about lately, and I suspect there's math that could tell me: how many people have to sign up for facebook before facebook could effectively profile the remainder of the people who haven't joined its network?
That is to say, it's known that sites with like buttons have cookies that track IP addresses, and that when facebook collects email addresses from people who join, it cross-checks and retains the addresses of people who haven't joined. Both practices exist so that, if you sign up later, they can link you to this info, hooking you up with friends who've already joined, sites you might like, and so forth. Also, they have facial recognition software built into their photo app, and people can tag you in a photo even if you don't have an account; I understand that these photos can then be picked up as part of your profile if you join at a later date.
So I'm thinking, even if I don't have a facebook account, at this point, given how many of my friends and family have joined, it would probably be possible to infer my existence from my email address turning up in their address books and photos of me on their profiles. In effect I have been captured by the network even though I don't have an account.
What I'm wondering is, what percentage of people would have to join facebook before FB could effectively capture the remainder in this way? Granted, there are exceptions: people who aren't linked in to society in the ways that would make capture likely (infants, homeless people, the elderly, etc.). And obviously much of this depends on how connected to the internet one's real-life society is as a whole; a country with a big chunk of its population living in remote villages is not going to be well captured (or not for some time). So absolutely 100% of people is an impossible goal.
But for, say, the English-speaking and/or developed world, I'm wondering what the tipping point would be. Is there anyone who's aware of any branch of mathematics/social science/computer science that's looked at this type of network effect?
I think you are mistaken in your assumption that Facebook stores photo-tagging data in a way that allows it to retro-actively be linked to new users, and it's also news to me if they do anything with your non-facebook'd email contacts beyond gently and incessantly prodding you to invite them to facebook (I remember there being quite the brouhaha when they first started requesting this information, and their assurances that it wasn't used for anything beyond that). I could very well be mistaken about this, and I would be eager to see evidence to the contrary.
That unhelpful contrarianism out of the way, this is definitely something that mathematicians have been addressing since long before a Social Network was a thing you could visit in your web browser, though it's not a topic I know much about. I would start with the wikipedia article on social networks.
posted by brightghost at 3:54 PM on July 9, 2011
One major complication, I think, would be discerning actual humans from identities. Let's say your co-worker on facebook has your work email, your parent on facebook has your old yahoo account, most of your facebook friends have your gmail account, but some have your ISP's email. And your classmates have your school address. That is five identities for just one person, which is possible to figure out, but what incentive does facebook have to consolidate such identities and track such people down?
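To make the consolidation step concrete, here is a minimal sketch of one way it could be done, not a description of anything facebook is known to run. It assumes a hypothetical input format in which each uploaded contact entry lists one or more addresses for the same person, and it treats any two addresses that share an entry as the same human. The names `UnionFind`, `consolidate`, and the example addresses are made up for illustration.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set structure used to merge addresses believed to be one person."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def consolidate(contact_entries):
    """Group addresses that co-occur in any single contact entry.

    contact_entries: iterable of lists of addresses (assumed format).
    Returns a list of sets, one set per inferred person.
    """
    uf = UnionFind()
    for emails in contact_entries:
        for e in emails:
            uf.find(e)  # register every address, including singletons
        for other in emails[1:]:
            uf.union(emails[0], other)
    groups = defaultdict(set)
    for e in list(uf.parent):
        groups[uf.find(e)].add(e)
    return list(groups.values())


entries = [
    ["leo@work.example", "leo@gmail.example"],  # a coworker's entry
    ["leo@gmail.example", "leo@isp.example"],   # a friend's entry
    ["leo@school.example"],                     # a classmate's entry, nothing links it
]
people = consolidate(entries)
print(people)
```

The school address stays a separate identity because no entry ties it to the others, which is the difficulty described above: merging only works where some piece of overlapping evidence exists.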
It is an interesting problem: at what point would the overlapping data allow an autonomous system to accurately capture the true social structure of a digital society (which automatically excludes the infants and elderly and incarcerated and that one cranky neighbor who hates computers)? Most stats folks wouldn't bother to figure everyone out census style; they would take a representative sample and extrapolate from the data. I don't know, but I would hazard a guess that epidemiologists would be the place to look: they are interested in the connections between actual human populations with a particular difference. I would guess that they would have the best tools to figure out those who were healthy from a network of those who were sick (facebookers in this case).
posted by zenon at 7:42 PM on July 9, 2011
Response by poster: I think you are mistaken in your assumption that Facebook stores photo-tagging data in a way that allows it to retro-actively be linked to new users, and it's also news to me if they do anything with your non-facebook'd email contacts beyond gently and incessantly prodding you to invite them to facebook (I remember there being quite the brouhaha when they first started requesting this information, and their assurances that it wasn't used for anything beyond that). I could very well be mistaken about this, and I would be eager to see evidence to the contrary.
It's possible I haven't properly understood how the photo feature currently works. But once you're in and they have a profile pic of you which they know is you, and facial recognition in place to pick you out of photos others suggest, it seems to me entirely possible to apply such recognition technology to photos uploaded by others prior to your joining the site. And such photos can be manually tagged with your name as it is.
In re the emails: from the user-experience perspective, they don't do much with them besides occasionally prompting you to invite people. But when a new person joins and creates an account using an email that has previously been found in someone else's address book, it sends notifications to all those someone elses to say "hey, your friend just joined up, come say hello." So it's clear that they're storing them. Even if I, leo.tolstoy@gmail.ru, have not joined facebook, it would be entirely possible for them to search through the data they have and figure out who most of my friends and family are. That's what I mean by infer. Though I myself may not have elected to join facebook, because Vronsky and Anna Karenina have, the fact that I, Leo, exist and am friends with them can be inferred.
It's not that I necessarily think they're actively sussing out such inferences now. I mean, god knows what they're doing over there. My curiosity is about the potential: it seems clear to me that there will come a point where, regardless of my personal decision to join or not join it, they could have so much data on me through what my friends and family have given up that they could make useful conclusions about me for advertising or other purposes. There comes a point when the virtual network is so information rich you could make a highly accurate model of the real world network from which it springs. So what is that point? That's what I'm interested in.
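To show what I mean by "infer", here is a minimal sketch of the kind of lookup I have in mind. It is purely illustrative: the data layout (a dict of member address to uploaded contacts), the function name `infer_non_members`, and the example.org addresses are all assumptions of mine, not anything known about facebook's actual storage.

```python
from collections import defaultdict


def infer_non_members(address_books, members):
    """Map each non-member address to the set of members who listed it.

    address_books: dict of member address -> iterable of contact addresses
                   (a hypothetical upload format, not a real schema).
    members: set of addresses that have accounts.
    """
    shadow = defaultdict(set)
    for owner, contacts in address_books.items():
        for contact in contacts:
            if contact not in members:
                shadow[contact].add(owner)
    return dict(shadow)


members = {"anna@example.org", "vronsky@example.org"}
address_books = {
    "anna@example.org": ["leo.tolstoy@gmail.ru", "vronsky@example.org"],
    "vronsky@example.org": ["leo.tolstoy@gmail.ru", "anna@example.org"],
}
shadow = infer_non_members(address_books, members)
print(shadow)
```

Leo never signed up, yet the result lists him together with the two members who know him. The more members upload address books, the more complete each non-member's inferred friend list becomes.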
posted by Diablevert at 8:00 PM on July 9, 2011
Depends on how you want to model Facebook's discovery process. I say "want to" because I have no idea how it should be modeled, but I do know multiple ways which are all plausible-yet-contradictory.
Way 1: Facebook users are random people chosen out of the social network. Some random subset of those people will use the "import my email contacts" button. What percentage of the vertices of a social graph using that "import" button will end up covering x% of the graph?
Way 2: Facebook users mostly join through invitation. This means that Facebook is like a socially-transmitted disease. Some people are immune, some people have it, and some are unexposed (small category, that last one). SIR models are classic, and SIR models on social networks have also been studied. How long until either everybody is infected, or everybody has an infected friend?
Way 3: Maybe you want to discover a minimum dominating set of the social graph (a close relative of vertex cover, which covers edges rather than people) - every person is either in the network, or has a friend in the network. This is a well-known hard problem (NP-complete, actually!) but has some approximations which you might be able to try out. Basic answer: I suspect that, if the social network truly is a small-world network, then it should be a remarkably small percentage of the population. Under 10%. But they would have to be the right people.
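For anyone who wants to play with Way 1 versus Way 3, here is a rough sketch using only the Python standard library. The property "everyone is a member or has a member friend" is what graph theory calls a dominating set, so the sketch uses the standard greedy heuristic for that and compares it with the same number of members picked at random. The Watts-Strogatz-style generator and every parameter (`n`, `k`, `p`) are assumptions chosen for illustration, so the numbers it prints say nothing about any real population.

```python
import random


def small_world(n, k, p, rng):
    """Watts-Strogatz-style graph: ring lattice of degree k, edges rewired with prob p."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for j in range(1, k // 2 + 1):
            u = (v + j) % n
            if rng.random() < p:
                # Rewire to a random node that is not v and not already a neighbour.
                choices = [w for w in range(n) if w != v and w not in adj[v]]
                u = rng.choice(choices)
            adj[v].add(u)
            adj[u].add(v)
    return adj


def covered_fraction(adj, members):
    """Fraction of people who are members or have at least one member friend."""
    covered = set(members)
    for m in members:
        covered |= adj[m]
    return len(covered) / len(adj)


def greedy_dominating_set(adj):
    """Greedy heuristic: repeatedly add whoever newly covers the most people."""
    uncovered = set(adj)
    chosen = set()
    while uncovered:
        best = max(adj, key=lambda v: len((adj[v] | {v}) & uncovered))
        chosen.add(best)
        uncovered -= adj[best] | {best}
    return chosen


rng = random.Random(0)
graph = small_world(n=500, k=6, p=0.1, rng=rng)

greedy = greedy_dominating_set(graph)                          # "the right people"
random_members = set(rng.sample(sorted(graph), len(greedy)))   # same count, random

print("share of population chosen:", len(greedy) / len(graph))
print("coverage, greedy choice:   ", covered_fraction(graph, greedy))
print("coverage, random choice:   ", covered_fraction(graph, random_members))
```

The share needed depends heavily on how many friends each person has: with an average degree of 6, as assumed here, no set smaller than roughly one person in seven can cover everyone, whereas real social graphs have far higher degree.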
Note that all of these presume that the social network of the population whose coverage you are concerned about is well enough understood to be used in one of these models. I am unsure that this is the case (we have lots of theories, but it turns out that dealing with measurement error is somewhere between hard and impossible), so it might be true that we actually know so little that this question is unanswerable.
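Way 2 can be sketched in the same spirit. The following is a simplified discrete-time simulation under loudly stated assumptions: a made-up random friendship graph, an arbitrary share of "immune" people who never join, and a per-friend join probability `beta`. There is no recovery step, so strictly speaking it is an SI model with an immune class rather than a full SIR model. All names and parameters are hypothetical.

```python
import random


def random_friend_graph(n, m, rng):
    """Each person befriends m random others (a stand-in for a real social graph)."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for u in rng.sample([w for w in range(n) if w != v], m):
            adj[v].add(u)
            adj[u].add(v)
    return adj


def simulate_invites(adj, seeds, immune, beta, rng, max_steps=1000):
    """Spread membership by invitation; return per-step (member share, exposed share).

    'Exposed' means a member or a friend of a member. Immune people never join
    but can still be exposed, which is the quantity the question asks about.
    """
    members = set(seeds) - immune
    history = []
    for _ in range(max_steps):
        exposed = set(members)
        for person in members:
            exposed |= adj[person]
        history.append((len(members) / len(adj), len(exposed) / len(adj)))
        candidates = exposed - members - immune
        if not candidates:
            break
        # Each member friend independently converts a candidate with probability beta.
        joined = {
            v for v in sorted(candidates)
            if rng.random() < 1 - (1 - beta) ** len(adj[v] & members)
        }
        members |= joined
    return history


rng = random.Random(1)
graph = random_friend_graph(n=300, m=3, rng=rng)
immune = set(rng.sample(sorted(graph), 60))              # 20% never join (arbitrary)
seeds = set(rng.sample(sorted(set(graph) - immune), 3))  # three initial members
history = simulate_invites(graph, seeds, immune, beta=0.2, rng=rng)

for step in range(0, len(history), 5):
    member_share, exposed_share = history[step]
    print(step, round(member_share, 2), round(exposed_share, 2))
```

The interesting output is the gap between the two columns: the exposed share runs ahead of the member share at every step, and it can keep growing past the ceiling that the immune group puts on membership.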
posted by pmb at 8:44 PM on July 9, 2011 [1 favorite]
The question is theoretical, not about what facebook actually does. So "Facebook doesn't do that" doesn't answer the question.
The answer, I think, is actually zero. It doesn't matter how many people use facebook. All that matters is how many people visit websites with little facebook buttons. Each of those buttons gives FB the opportunity to cross-check your IP and profile you based on what sites you visit. Doubleclick and other internet advertising companies also do this, and they didn't need anyone to sign up for anything (except for webmasters).
As far as the email addresses, it depends. There would be a huge difference between how many they need to get 90% vs. how many they need to get 99%. They would never be able to get 100% because not everyone has an email address in anyone else's address book. And of course not everyone uploads their address book to facebook.
posted by delmoi at 5:17 AM on July 12, 2011
This thread is closed to new comments.
posted by ewiar at 3:49 PM on July 9, 2011