How many working hyperlinks does the Web have now?
May 30, 2009 3:30 PM

How many working hyperlinks does the Web have now?

"Working hyperlink" defined as "will bring up something other than a 404 when clicked". It doesn't matter if that something is a URL-squatting advertiser or a never-to-be-followed "This is my first blog entry" blog entry from 1996.

Back-of-napkin math welcome.
posted by Joe Beese to Technology (32 answers total) 7 users marked this as a favorite
 
Since Google says the number of pages out there is infinite, and therefore can't come up with a solid number, I'd say the same goes for links, too.
So how many unique pages does the web really contain? We don't know; we don't have time to look at them all! :-) Strictly speaking, the number of pages out there is infinite -- for example, web calendars may have a "next day" link, and we could follow that link forever, each time finding a "new" page. We're not doing that, obviously, since there would be little benefit to you. But this example shows that the size of the web really depends on your definition of what's a useful page, and there is no exact answer.
posted by nitsuj at 3:40 PM on May 30, 2009 [1 favorite]
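
For the curious, here is roughly what such a calendar page looks like under the hood: one tiny script that, for any date, emits a page containing a link to the next date. This is a sketch in Python with a made-up /calendar/ URL scheme, not anything from nitsuj's post.

    # A rough illustration of the "next day" calendar trap: the server stores
    # almost nothing, but every page it serves links to yet another page.
    from datetime import date, timedelta

    def calendar_page(day_str: str) -> str:
        """Render a minimal HTML page for one day, linking to the next day."""
        day = date.fromisoformat(day_str)        # e.g. "2009-05-30"
        next_day = day + timedelta(days=1)       # Python's date type itself stops at year 9999
        return (
            f"<h1>Events for {day.isoformat()}</h1>"
            f'<a href="/calendar/{next_day.isoformat()}">next day</a>'
        )

    print(calendar_page("2009-05-30"))

A crawler following that link chain never runs out of new URLs, which is why "how many pages" depends entirely on what you count as a page.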


What nitsuj quoted. Also, a nontrivial portion of the web at this point is dynamically generated, including hyperlinks, or consists of just one "page" with an effectively infinite amount of content.
posted by Tomorrowful at 3:49 PM on May 30, 2009


Look. I'm creating a hyperlink RIGHT NOW! My vote is for infinite as well.
posted by zerokey at 4:02 PM on May 30, 2009


Response by poster: I was guessing the debate would be between trillions and quadrillions. But from what you're saying, at most it would be between "effectively infinite" and "mathematically infinite"?
posted by Joe Beese at 4:03 PM on May 30, 2009


The problem is that the number is changing constantly, and there's no practical way to take an instantaneous snapshot of the entire web. (Especially since a nontrivial portion of it is locked up behind passwords.)

Which is another way of saying that no one knows and no one will ever be able to find out. And even if they did, their knowledge would soon be out of date.
posted by Chocolate Pickle at 4:08 PM on May 30, 2009 [1 favorite]


Mathematically infinite: on one of my websites alone, there's a calendar where you can always click to the next day.
posted by Mick at 4:16 PM on May 30, 2009 [1 favorite]


Yeah, I think the boring answer is that it depends on your definitions of 'link' and 'page.'

If you want a rough estimate, just pick a number and then add a bunch of zeroes to it.
posted by box at 4:38 PM on May 30, 2009


"mathematically infinite on one of my websites alone, there's a calendar where you can always click to the next day."

So I could click up to the year 100,000,000,000 if I had the time?

Whatever the calendar's limit is, it's a large number of clicks for sure, but it's not even close to infinite.
posted by santaliqueur at 4:45 PM on May 30, 2009


Response by poster: santaliqueur: "Whatever the calendar's limit is, it's a large number of clicks for sure, but it's not even close to infinite."

Perhaps I'm misunderstanding how dynamic links work. But isn't there an upper limit imposed by the amount of hosting space that exists in the world?

That would seem to make it "effectively infinite" rather than "mathematically infinite".
posted by Joe Beese at 4:47 PM on May 30, 2009


Imagine a page that, every time you click a link, the background changes to either black or white. Is that an infinity of pages, or two, or one? Man, this is a hard question to answer.
posted by box at 5:02 PM on May 30, 2009 [1 favorite]


I'm not sure exactly what you mean by 'mathematically infinite' (erm, uncountably? countably?), but I think maybe you need to rephrase the question. Do you really want to know how many links? Or how many pages? And how do you handle dynamically generated pages? Surely I could build a calendar app that would show you a different page for any given day into perpetuity; such an app could be said to have (countably) infinite pages. Estimating the number of pages actually stored on disks (versus pages generated for some input, like the calendar app) is a whole other kettle of fish.

I believe Google claims to index billions of pages (without putting too fine a point on the distinction between types of pages). If you really don't need a very accurate guess, well, there you go: at least a couple of billion pages.
posted by axiom at 5:43 PM on May 30, 2009


Response by poster: axiom: "I'm not sure exactly what you mean by 'mathematically infinite'"

The sequence of positive integers is what I would call - however clumsily - "mathematically infinite". By definition, for any integer you could specify, someone else could specify that integer plus one. And all those integers already have the same "existence".

On the other hand, while a calendar app could create a succeeding-day web page every time you clicked the correct link, without limit, those web pages wouldn't have the same existence as this web page [as bits stored on a server somewhere] until the links were actually clicked. And even if every atom in the universe could store a bit of information, eventually you would run out of storage space. So in that sense, I would describe the number of hyperlinks as "effectively infinite".

Unless I'm misunderstanding the nature of dynamic links. (Or the sequence of positive integers.)
posted by Joe Beese at 6:19 PM on May 30, 2009


An interesting question, and good answers. I have to point out, however, that as long as we keep answering this, the number changes...
posted by HuronBob at 6:38 PM on May 30, 2009


On the other hand, each page of a calendar app with a 'next' link only exists after you click through from the previous page. You could argue that while the page can be created at any time, it doesn't currently "exist".

So while there are an infinite number of pages that can be created, there is not an infinite number in existence at any one point in time.

The real question here is the amount of information on each page. A new calendar page has zero information, because all the data on the page is based on the link. You just have one increasing day counter and that's it. You already know what the next 'page' will have on it, even though it has a different URL.

The real question isn't "how many unique pages" but "how many pages with nonzero information values" (i.e. pages with actual 'stuff' on them).

That number must be finite.
posted by delmoi at 7:14 PM on May 30, 2009


If we have 100 million domain names, and each has a day calendar using a signed 32-bit integer to calculate Unix time, it can display the days from Dec 13, 1901 to January 19, 2038, or 49,711 days. So we get 4 trillion, 971 billion, 100 million links.

Which is, in essence, a completely random number.

How about we look at how many unique links are possible. That way we get an upper bound, at least.

In Internet Explorer, the maximum length of a URL is 2,083 characters. Since at least 7 of them are taken up by "http://", that leaves 2,076. Valid URLs can be made from the characters A-Z, a-z, 0-9, and .:;@&?=%+, or 71 characters. Which gives us something like 3.3 x 10^235 combinations for an upper bound. (Though in theory it's a little smaller.)

Or another way to calculate it would be to sample a random assortment of pages of content, figure out what percentage of that content is links (say 0.8% of an average Internet document), and multiply that by the estimated amount of data on the internet, say 500 terabytes. That gives 4 TB. Figure out how long the average link is, divide through, and you'll get an answer.
posted by Ookseer at 7:34 PM on May 30, 2009 [2 favorites]
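
For anyone who wants to poke at the calendar figure above, the arithmetic is easy to spell out. A sketch in Python; the 100 million domain count is Ookseer's assumption, and the day count follows from the signed 32-bit Unix time range.

    # Back-of-napkin check of the "a calendar on every domain" figure.
    SECONDS_PER_DAY = 86_400
    days_in_32bit_range = 2**32 // SECONDS_PER_DAY   # 49,710 days (the post rounds to 49,711)
    domains = 100_000_000                            # assumed, as in the post

    calendar_links = domains * days_in_32bit_range
    print(f"{days_in_32bit_range:,} days x {domains:,} domains = {calendar_links:,} links")
    # 49,710 days x 100,000,000 domains = 4,971,000,000,000 links -- about 5 trillion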


Response by poster: Ookseer: "How about we look at how many unique links are possible. ... something like 3.3 x 10^235 combinations for an upper bound."

If there is an upper limit of 3.3 x 10^235 possible URLs, would that make an upper limit of (3.3 x 10^235)! possible unique hyperlinks?
posted by Joe Beese at 7:43 PM on May 30, 2009


Alright, how about this: How many publicly accessible documents ending in ".htm" or ".html" are stored on a server somewhere?

I realize that cuts away all the dynamically generated pages, but that's my point.
posted by argybarg at 8:11 PM on May 30, 2009


I realize that cuts away all the dynamically generated pages, but that's my point.

It would also cut out all of metafilter, almost all blogs, etc.
posted by delmoi at 9:15 PM on May 30, 2009


I think what we would actually want to count would be 1) HTML pages, and 2) records in databases that are used to fill in text for HTML pages, plus 3) other types of text storage (like huge-ass XML files, non-relational databases, text files, emails, etc.). The third group wouldn't be very big compared to the second one.
posted by delmoi at 9:17 PM on May 30, 2009


If there is an upper limit of 3.3 x 10^235 possible URLs, would that make an upper limit of (3.3 x 10^235)! possible unique hyperlinks?

More or less. Actually less. And more. And after thinking about it for a few hours, less.

The less: Not all arrangements of characters are potentially valid URLs; they need to start with a domain name or IP address, etc. But my math and regular-expression ability is too weak to figure it out. Also, some links will be identical. For example, 209.85.171.100 is identical to google.com.

The more: The 2,083 limit is just for Internet Explorer. Other browsers can accept more. However, Apache, a common web server, accepts URLs of up to 8,192 characters. So if we go with that, we get 7 x 10^277 links. Only 42 orders of magnitude more.

And the less: It would take 8.5 x 10^261 terabytes to store just those links. (Without compression, of course.) If there's an estimated 500 exabytes of information on the net, and if it were all links of, say, 500 characters, then there would be 10^18 (a million trillion) links as an upper bound. But few pages are all links, images, sounds and videos aren't links at all, and most links are shorter than 500 characters. So play around with those numbers until you find one you like.

(Note I dropped a few orders of magnitude in my earlier post: not 500 TB, 500 EB.)
posted by Ookseer at 1:17 AM on May 31, 2009 [1 favorite]
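
Spelling out the last of those estimates as a sketch (the 500-exabyte total and the 500-character average link are the post's guesses, not measurements):

    # Upper bound if the estimated 500 exabytes on the net were nothing but links.
    EXABYTE = 10**18                     # bytes, decimal (SI) units
    data_on_net = 500 * EXABYTE          # the post's guess at total data on the net
    avg_link_bytes = 500                 # the post's guess at an average link's length

    upper_bound = data_on_net // avg_link_bytes
    print(f"{upper_bound:.0e} links")    # 1e+18 -- a million trillion, as above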


(Looks like MetaFilter eats IP addresses as URLs. But trust me, 209.85.171.100 is one of the many IP addresses that bring you Google search.)
posted by Ookseer at 1:22 AM on May 31, 2009


Though not links as such, the Google blog came up with a figure just under a year ago for the number of URLs, while Netcraft does a monthly survey of the number of sites, which seems too low.
posted by TheRaven at 4:27 AM on May 31, 2009 [1 favorite]


The sequence of positive integers is what I would call - however clumsily - "mathematically infinite". By definition, for any integer you could specify, someone else could specify that integer plus one. And all those integers already have the same "existence".

Joe Beese, that's not how mathematicians talk about infinity. Basically - and someone please correct me if needed - there are two types of infinity, countable and uncountable. A countably infinite set can, basically, be listed (and on to infinity) - so, you can list the positive integers, or all integers (0, 1, -1, 2, -2, ...), or all fractions (0, 1, 2, 1/2, 3, 1/3, 2/3, ...), etc. All countably infinite sets are considered to be the same "size". The other kind of infinity is uncountable. Sets like all real numbers, or all positive real numbers, are uncountably infinite, because they cannot be listed; any such set is strictly bigger than any countable one. (Strictly speaking, uncountable sets come in different sizes themselves, but all of them are bigger than the countable ones.)

So, there is a real difference between the idea that the internet has countably infinite pages and the idea that it has uncountably infinite pages. Though both are infinite, they're very different sizes of infinity.
posted by insectosaurus at 8:50 AM on May 31, 2009
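
For a concrete version of "can be listed": the integer listing in that comment (0, 1, -1, 2, -2, ...) amounts to an explicit pairing with the natural numbers. A small sketch, not from the comment itself:

    # Pair each natural number n with an integer; every integer shows up exactly
    # once, which is what makes the integers countably infinite.
    def nth_integer(n: int) -> int:
        return (n + 1) // 2 if n % 2 else -(n // 2)

    print([nth_integer(n) for n in range(9)])   # [0, 1, -1, 2, -2, 3, -3, 4, -4]

No such pairing exists for the real numbers, which is the precise sense in which they form a "bigger" infinity.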


Napkin math. That Google blog says they know about 1 trillion URLs. But there's a lot of dynamic garbage. Three or four years ago people were excited about search engines crawling 1 billion pages. I'm gonna take a stab and say there are 100 billion interesting web pages today. That may be 10x too many. My other stab is that there are roughly 10 interesting links on the average web page. (Why? I made it up.) So 10 links * 100 billion pages = 1 trillion "interesting" hyperlinks on "interesting" pages.
posted by Nelson at 9:24 AM on May 31, 2009 [1 favorite]


Response by poster: insectosaurus: "Joe Beese, that's not how mathematicians talk about infinity."

This does not surprise me. :-) Thanks for the clarification.
posted by Joe Beese at 9:33 AM on May 31, 2009


Joe Beese: The sequence of positive integers is what I would call - however clumsily - "mathematically infinite". By definition, for any integer you could specify, someone else could specify that integer plus one. And all those integers already have the same "existence".

That's what is meant by countably infinite, as insectosaurus ably points out. Essentially, if you can take a set and define a one-to-one correspondence between its members and the natural numbers (1, 2, 3, ...) then it's countably infinite. Some sets are uncountably infinite, like the real numbers, and are in some sense "bigger infinities."

Ookseer: If we have 100 million domain names, and each has a day calendar using a signed 32-bit integer to calculate Unix time, it can display the days from Dec 13, 1901 to January 19, 2038, or 49,711 days. So we get 4 trillion, 971 billion, 100 million links.

Which is, in essence, a completely random number.

Why use signed 32-bit integers? With a little extra effort we can allow arbitrary-precision integers as input and get a much larger output range. I think that at least theoretically there are no limits on the size of the data posted to a website, so we could (again, theoretically) get to countable infinity with a calendar app alone.
posted by axiom at 12:31 PM on May 31, 2009


Alot.
posted by history is a weapon at 1:08 PM on May 31, 2009


Best answer: Effectively infinite for all practical purposes, but not actually infinite. The web sits on top of a finite number of computers with finite capacity, which means there is a practical (but very large) limit to how many working hyperlinks can exist.
posted by qxntpqbbbqxl at 1:52 PM on May 31, 2009 [1 favorite]


Why use signed 32-bit integers? With a little extra effort we can allow arbitrary-precision integers as input and get a much larger output range.

Because Unix time is traditionally represented as a signed 32-bit integer. And since this was obviously a dead end for determining the number of links, it didn't seem worth mentioning stuff like servers using 64-bit integers, or custom date calculations that don't rely on Unix time, which would make an already arbitrary number even worse.

I still think the best bet for estimating is:
Number of bytes of data on the 'net x fraction of that data that is links ÷ average character length of a link = number of links on the 'net.
The tricky part is figuring out the second number, since 20% of a web page might be links but an image or video is 0%, and I haven't found a 'representative sample' of internet content to test.

Or, if you just want to cheat, Google returns 5,670,000,000 results when searching for "http://"
posted by Ookseer at 7:48 PM on May 31, 2009
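
That formula, written out as a sketch. Every input is a placeholder: the 500-exabyte total and the 0.8% link share are guesses from earlier in the thread, and the 80-character average link length is an assumption made up here.

    # Ookseer's estimate: the bytes on the net that are link text, divided by
    # the average length of a link. All three inputs are guesses.
    def estimate_link_count(bytes_on_net: float, link_fraction: float,
                            avg_link_chars: float) -> float:
        return bytes_on_net * link_fraction / avg_link_chars

    print(f"{estimate_link_count(500e18, 0.008, 80):.1e} links")   # 5.0e+16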


There is, unfortunately, no meaningful answer to this question, for a host of reasons both technical (URLs aren't the easily quantifiable things you're imagining them to be) and practical (even if they were, how would you find the answer?).
posted by ixohoxi at 9:15 AM on June 1, 2009


...it's really depressing what comes up when you google "www.". That search returns 48,840,000,000 (48 billion) results.
posted by Night_owl at 4:16 PM on June 2, 2009


You can't rely on Google estimates to be particularly meaningful, particularly when they are a large number.
posted by Nelson at 7:31 PM on June 2, 2009

