What information can a website owner track about visitors?
May 15, 2010 10:33 AM

This is going to be a super-broad question: What kinds of information can the owner of a website glean about its visitors? For instance, can they see where you were just prior, or where you go when you leave? Under what circumstances would you be individually identifiable? Etc.
posted by HotToddy to Computers & Internet (7 answers total) 12 users marked this as a favorite
 
You can use a site like What's my IP to see what someone can learn about you when you visit their website. As you can see, they can tell how you got to that page [referral link], some information about your general location [from ISP details], and your browser configuration. I recall a site that would put all this information together and let you know how "unique" your footprint was, and it seemed like many people were unique because of the list of fonts they had on their system.

The difference, however, between being unique and being personally identifiable is a fairly large one. In most cases people couldn't figure out who you were if you were some stranger, but if you were a person I knew and I wanted to know if you'd hit my site, I could look for what I knew about your footprint. So I could, for example, somewhat sneakily, send you an email with a link to something on my website and once you looked at it [say it was a link to an image that no one else had the web address to] I'd know the footprint for the computer/ISP that you were using.
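
A rough Python sketch of the unique-link idea described above, under stated assumptions: the base address, the `/img/` path, and the log line format are all hypothetical, and a real server log will look different. The point is only that a URL nobody else knows acts as a marker when it shows up in the access log.

```python
import secrets

def make_tracking_url(base):
    """Build a URL containing a random token that nobody else could guess."""
    token = secrets.token_urlsafe(16)
    return token, f"{base}/img/{token}.png"

def hits_for_token(log_lines, token):
    """Return the log lines that mention the token, i.e. requests for the unique URL."""
    return [line for line in log_lines if token in line]

token, url = make_tracking_url("http://example.com")  # hypothetical site
# Hypothetical access-log lines; the first one is a request for the unique image.
log = [
    f'192.0.2.10 "GET /img/{token}.png HTTP/1.1" "Mozilla/5.0 (example)"',
    '192.0.2.77 "GET /index.html HTTP/1.1" "Mozilla/5.0 (other)"',
]
for line in hits_for_token(log, token):
    print(line)  # the IP address and User-Agent on this line are the "footprint"
```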

There are also ways to avoid this, such as using proxies. I'm not sure exactly what you want to know specifically, but generally, as far as being personally identifiable, the answer is "maybe." If someone were TRYING to do this and you didn't know how to keep them from doing it, the answer is more like "yes."
posted by jessamyn at 10:40 AM on May 15, 2010 [1 favorite]


The Electronic Frontier Foundation has produced a tool to demonstrate how all the peripheral information provided can actually uniquely identify you. They use basic HTTP header info (all the data in the What's my IP link jessamyn gave) and then gather more unusual information, like what fonts Flash has available (and in what order they're reported), HTTP cookies, and what plugins you have installed. The FAQ has a lot more documentation on what was and wasn't implemented, but the takeaway is that 85 percent of visitors were uniquely identifiable. Browserspy has a much more comprehensive listing of crap website owners can snarf.
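
To make the "peripheral information adds up" point concrete, here is a minimal Python sketch. It is not the EFF tool's actual method; the attribute names and values are made up for illustration. It combines a handful of browser attributes into one hash, and measures how identifying a single trait is in bits (a trait shared by 1 visitor in 1024 carries 10 bits).

```python
import hashlib
import math

def fingerprint(attributes):
    """Hash a dict of browser attributes into one stable identifier."""
    # Sort the keys so the same attributes always produce the same hash.
    canonical = "\n".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def surprisal_bits(share_of_visitors):
    """Bits of identifying information in a trait seen in this share of visitors."""
    return -math.log2(share_of_visitors)

# Hypothetical attribute values, for illustration only.
visitor = {
    "user_agent": "Mozilla/5.0 (example)",
    "fonts": "Arial,Comic Sans MS,Garamond",
    "plugins": "Flash 10.0,Java 1.6",
    "screen": "1280x1024x24",
}
print(fingerprint(visitor)[:16])
print(surprisal_bits(1 / 1024))  # 10.0
```

Changing any one value, such as the font list or its order, yields a different hash, which is why an unusual set of fonts can be enough to stand out.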

And if you've ever seen or tried Google Analytics, you know that website owners can and do collect a lot of information. We know what versions of Flash and Java our users have installed. We know browser sizes and browser/OS versions. We know how people click away and how they got there. We know what search queries drove traffic to our site. We have a good idea of what city our users live in, but that metric isn't as reliable. All this information is collected via JavaScript.
posted by pwnguin at 11:04 AM on May 15, 2010 [3 favorites]


The site owner can generally see where you were immediately prior to arriving at the site. This would include knowing what Google Search term you used to find the site.
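
As a small illustration of how a search term can ride along in the referrer, here is a Python sketch. It assumes the search engine puts the query in a `q` parameter of the results URL; the example URL is hypothetical.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit

def search_terms_from_referer(referer: str) -> Optional[str]:
    """Return the value of the `q` query parameter in a Referer URL, if present."""
    query = urlsplit(referer).query
    values = parse_qs(query).get("q")
    return values[0] if values else None

# Hypothetical Referer header value sent along with a click from a results page.
referer = "http://www.google.com/search?q=my+unique+name&hl=en"
print(search_terms_from_referer(referer))  # my unique name
```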

They can also tell what operating system you are on and what your screen resolution is, and if they are a large ad network like DoubleClick, for example, they can tell what other sites in the network you have visited.

They can tell, with a moderate amount of accuracy, what country and city you are coming from. They can probably tell what languages you speak. They may be able to tell whether you are a technological early adopter.

Without access to your computer or your ISP, they probably cannot tell who you are (name and address). But they may be able to make a good guess, if they wanted to or if you are unlucky enough to have identifying information come through in the referrer (like, the search that led you to the site was "MY UNIQUE NAME").
posted by zippy at 11:06 AM on May 15, 2010


There are ways to counter most of these things, BTW. An add-on like RefControl lets you configure the browser to forge the Referer header, which prevents the site from seeing how you got there. (The 'forge' part means that a fake referer consisting of the website's root address is sent instead of a blank referer, so the site always thinks you got there from its main page.) The information regarding your operating system and installed software usually comes from the User-Agent string, which is another thing that is completely under your control. Things like screen resolution require scripting to determine, so using something like NoScript and only allowing scripting where necessary will result in much less leakage. Finally, blocking ads means that advertisers like DoubleClick can't build a cross-site profile of your browsing behavior.
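
The "forge" behavior described above amounts to reducing the address you are requesting to its root. A minimal Python sketch of that idea follows; this is not RefControl's actual code, and the function name is made up.

```python
from urllib.parse import urlsplit

def forged_referer(target_url: str) -> str:
    """Return the root address of the site being requested, to send as the Referer."""
    parts = urlsplit(target_url)
    return f"{parts.scheme}://{parts.netloc}/"

# Whatever page you really came from, the site would only see its own front page.
print(forged_referer("http://example.com/articles/2010/05/page.html?id=7"))
# http://example.com/
```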
posted by Rhomboid at 11:18 AM on May 15, 2010


Oh, and your general geographic location is not something you can hide or modify, but you can use a proxy which means accessing the site through a remote server so that the access appears to come from there instead of from where you really are.
posted by Rhomboid at 11:21 AM on May 15, 2010


(Disclaimer: I am not a privacy researcher.)

Identification.

There are really three different types of identity. I might know exactly who you are, e.g., Todd Hotts, SS#024-98-2425. I might know that you're the same as a previous person, e.g., "Prince" or the internal leaker who posted the last three leaks about a company. I might know that you're part of a group of some size, e.g., people in the US, or people within 20 miles of Atlanta, GA. We might call these three types individual, pseudonymous and group identity.

Advertisers want to target any of these levels of identity, so tools have been developed to make it possible to advertise to all three types. For example, on Facebook, you can more or less target an ad at a single person. On the other hand, if you were advertising for a Chick-fil-A in Atlanta, you would probably be happy with some targeting based on the group of people searching for related terms in a geographical area.

A website owner can establish individual identity in various ways. The user might directly enter their personal information. The user might use a single sign-on system, like Facebook Connect (although I don't know offhand what details Facebook Connect gives to website owners). Lastly, the user might give enough information about themselves that they are uniquely identified when it is combined with other data. For example, in the US, the combination of date of birth, five-digit ZIP code, and gender is sufficient to uniquely identify the majority of individuals given census data (see work by Latanya Sweeney).
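
The date of birth, ZIP code, and gender point can be checked on any table of records by counting how many people are the only one with their particular combination. A toy Python sketch, using invented records rather than census data:

```python
from collections import Counter

def unique_fraction(records):
    """Fraction of records whose (birth date, ZIP, gender) combination appears only once."""
    counts = Counter(records)
    unique = sum(1 for record in records if counts[record] == 1)
    return unique / len(records)

# Invented records for illustration: (date of birth, five-digit ZIP, gender).
toy_records = [
    ("1975-03-02", "30301", "F"),
    ("1975-03-02", "30301", "F"),  # shares all three values with the record above
    ("1981-11-17", "30301", "M"),
    ("1990-06-30", "30309", "F"),
]
print(unique_fraction(toy_records))  # 0.5
```

With real data there are so many possible combinations that most people end up alone in theirs, which is what makes the three fields together identifying.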

Pseudonymous identity is similar, though we might also be happy with information like browser headers, IP address, and so on to uniquely identify the user, rather than needing personal details about who they actually are. As others have posted above, browser details, even independent of cookies, are often sufficient to identify someone in a pseudonymous sense (as long as they're using the same computer/browser, which probably isn't a terrible assumption).

Group identity is what you get when the tactics used for pseudonymous identification could point to a number of people rather than to just one person whose name you don't know.

Information.

People probably run their web browsers in a fairly unmodified form.

A web browser sends the following information to a website you visit:
  • IP address (which together with a database will provide your approximate geographic location, and if you have a static IP address, will at least pseudonymously identify you)
  • Referer headers (the previous URL you were at, if you clicked a link to get to this URL)
  • Some browser information, like what character sets your browser supports
If the website runs some code of its own, your browser will, by design, also send the following (ignoring security problems):
  • A cookie which may uniquely pseudonymously identify you to the website (also, possibly "super cookies" in Flash or MSIE which will not show up in your cookie dialogs)
  • Which link you click to leave the website
  • Any interaction you have with the web page, like typing, mouse movement, when you leave the page, how long you're idle, so long as it's not interaction with a plugin like Flash, and maybe even if it is (see, e.g., Crazy Egg)
  • Browser details, like names of browser plugins, fonts, screen size, many of which will be mostly pseudonymously identifying
If the website colludes with other websites, or is a major website like Google running an analytics business, they can do analysis to learn:
  • Many of the web pages you visit
  • The order in which you visit those websites
(Note that most analytics products will provide at least browser details about visitors, inlinks to reach a website, outlinks the user followed, and their search query to reach a web site with little to no effort on the part of the web site owner.)
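
The cookie item in the list above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular site does it: the cookie name `vid` and the helper function are hypothetical. The server hands out a random ID on the first visit and recognizes it on later ones, which is pseudonymous identity in the sense used earlier.

```python
import uuid
from http.cookies import SimpleCookie

def visitor_id(cookie_header):
    """Return (visitor id, Set-Cookie value to send, or None if already known)."""
    jar = SimpleCookie()
    jar.load(cookie_header)  # parse the Cookie header the browser sent
    if "vid" in jar:
        return jar["vid"].value, None  # returning visitor: same pseudonym as before
    new_id = uuid.uuid4().hex  # first visit: invent a random pseudonym
    out = SimpleCookie()
    out["vid"] = new_id
    out["vid"]["path"] = "/"
    return new_id, out["vid"].OutputString()

first_id, set_cookie = visitor_id("")              # no cookie yet
again_id, nothing = visitor_id(f"vid={first_id}")  # browser sends the cookie back
print(first_id == again_id, nothing)  # True None
```

The ID says nothing about who you are, only that you are the same browser as last time.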

Other.

Note that this discussion just looks at automated ways to determine identity and details about regular web visitors. Privacy laws may apply in certain jurisdictions to prevent certain data from being collected, or kept beyond a particular period of time. Also, if you have a whole mob of people, often an individual can be identified from very little information (see, e.g., the AOL query log dataset incident or human flesh search engines).
posted by pbh at 1:23 PM on May 15, 2010 [1 favorite]


There's a design flaw in most browsers that allows a malicious website to read the complete contents of your browsing history. A page's stylesheet can specify different styles for links depending on whether you've previously visited the target of the link, and JavaScript code can detect the resulting change in appearance in a bunch of different ways. If someone has a big list of candidate pages and can get you to visit a page under their control, they can test thousands of possibilities per second. Here's a demo.

Some recent versions of Chrome and Firefox are making attempts to mitigate this attack, but only in developer preview versions as far as I know.
posted by teraflop at 1:33 PM on May 15, 2010 [1 favorite]

