Internet indexes - do they still exist and if not why not?
June 17, 2020 12:22 PM Subscribe
I remember that there used to be internet indexers, but don’t remember much beyond their sheer existence. Google is doing a version of this (All, Images, Videos, News, Shopping, etc), but I think it is insufficient and at times self-serving. I think the internet should be indexed. Do internet indexes still exist? If yes, what are they? If no, why not?
The internet right now seems like a huge mess, a hodgepodge of everything with no easy way to browse in a systematic fashion. Google is useful-ish as far as it goes, but feels insufficient as a systematic classification of the internet. Other search engines seem to have even less. Additionally, this feels like a project that should be more disinterested and global than any single corporation, search engine, or country – I think it should be done by an international committee that doesn’t depend on any one stakeholder or stakeholder type.
I vaguely remember that something like this existed – a sort of collection of internet taxonomies with associated site listings. Maybe more than one. I don’t remember much beyond this, certainly not anything about the mechanics (such as who built the taxonomies, who decided what belonged where, etc).
A random example of what I mean: I’d like to be able to browse for something like ‘gardening’ but to distinguish between, let's say, ‘publications’ (something someone takes responsibility for), and within that ‘books’ (i.e. every page that is specifically about a book), and at the next level ‘publishers’, ‘reviews’, ‘authors’, etc., with additional choices such as ‘experience’, ‘education’, etc.; vs ‘gardening’ in ‘shopping’ (with various associated categories such as company size, directly from producer, barter, etc.); vs ‘gardening’ in ‘opinion’, where I’d like to choose between blogs, fora, etc. To emphasize: this is a made-up example, not actually a classification I want.
If the above is too confusing, I mean the difference between something like:
Give me all pages on the internet that contain reviews of gardening books
Vs
Give me all pages on the internet that relate to shopping for your garden, specifically (say) heritage vegetable seeds directly from growers
Vs
Give me all pages on the internet where people discuss their personal experience with gardening.
I realize some of these exist in Google, but they are limited in number and scope, are profit-driven rather than developed for the good of humanity, and don’t easily allow for a multi-level faceted approach beyond choosing a category and then searching within it.
Does anyone do this? Do any of the old ones survive? If nothing like this exists, what are the obstacles, given the rise of automated classification, which should make matters easier, at least in theory?
DMOZ (archive here) was one of these things, IIRC.
posted by mandolin conspiracy at 12:27 PM on June 17, 2020 [4 favorites]
Those indices worked—barely—when the Internet was young. The problem is that they were human-compiled, and the size of the task is just too big. I'm not aware of automatic classification being applied to this problem in a public-facing way.
You can get an approximation of what you're looking for through bookmarking sites like Pinboard.
An archive of the old DMOZ site is still up.
posted by adamrice at 12:28 PM on June 17, 2020 [4 favorites]
Yahoo worked this way at first. New sites were submitted to Yahoo indexers with recommended search areas. But, as adamrice said, the volume of new and changed websites grew far too large for that system to be maintained.
posted by tmdonahue at 12:35 PM on June 17, 2020 [9 favorites]
Directories were a thing before search engines, because search engines were all terrible and the internet was small.
Google PageRank was the breakthrough that let Google search the web without human intervention. This is much, much cheaper than manually cataloguing the web, and much easier than automatically cataloguing the web. Sure, web searches return bad results, but we all more or less forgive that. A wonky automatically-generated directory would be much less useful.
So, there's no more money in it to maintain a web directory. Google has eaten all the ad revenue that might have sustained such an endeavor.
posted by BungaDunga at 12:46 PM on June 17, 2020
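A minimal sketch of the idea behind PageRank, in Python, assuming an invented three-page web (the real algorithm also handles dangling pages, spam, and billions of nodes):

    # Toy PageRank by power iteration: each page's rank is split among
    # the pages it links to, with a damping factor for random jumps.
    # The link graph here is made up for illustration.
    links = {
        "home": ["about", "blog"],
        "about": ["home"],
        "blog": ["home", "about"],
    }
    damping = 0.85
    rank = {page: 1.0 / len(links) for page in links}

    for _ in range(50):  # iterate until the ranks settle
        new_rank = {page: (1 - damping) / len(links) for page in links}
        for page, outlinks in links.items():
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank

    print(sorted(rank.items(), key=lambda kv: -kv[1]))  # 'home' ranks highest

The point is that no human ever categorizes anything: the ranking falls out of the link structure alone.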
The closest thing to this in recent times was the idea of the Semantic Web. It didn't come to much, because it relied on programmers actually encoding the semantic information into websites by hand.
posted by BungaDunga at 12:52 PM on June 17, 2020 [1 favorite]
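As a sketch of what that hand-encoding looks like, here is schema.org-style structured data for a hypothetical gardening book review, built in Python (the schema.org vocabulary is real; the page and all values are invented):

    # JSON-LD markup declaring a page to be a review of a specific book.
    # Embedded in a <script type="application/ld+json"> tag, it would let
    # a crawler answer queries like "reviews of gardening books" directly.
    import json

    markup = {
        "@context": "https://schema.org",
        "@type": "Review",
        "itemReviewed": {
            "@type": "Book",
            "name": "A Hypothetical Gardening Book",
            "author": {"@type": "Person", "name": "Jane Example"},
        },
        "reviewBody": "A thorough guide, though light on heritage seeds.",
    }

    print(json.dumps(markup, indent=2))

This only works if every page author does this labor, which is why it never came to much.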
> But, as adamrice said, the volume of new and changed websites grew far too large for that system to be maintained.
One estimate is that currently about six new Web sites are published every second.
posted by davcoo at 1:25 PM on June 17, 2020
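(For scale, if that estimate holds: six per second works out to over 500,000 new sites a day, or on the order of 190 million a year, before counting changes to existing sites.)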
I think the reason they don't exist is that search is much better than browsing for finding specific information. Relatively speaking, categorisation is hard and finding answers from categorised information is hard, but if you can search through the entire contents of every-ish page, then finding the information you need is not so hard. It's also less work to adjust as you're hunting: if you can't find what you need, you can try a different combination of words rather than having to hunt through multiple categorisation lists.
Added to that, search is monetised much more easily. Advertisers definitely want to put their products in front of the eyes of people who are literally searching for that product (and it extends from there).
There are other things that browsing through categories is useful for (I sit and browse TV Tropes through its indexes sometimes), but they're too niche/non-capitalist to sustain the effort it would take to index the modern web.
posted by plonkee at 1:25 PM on June 17, 2020
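A minimal sketch of the data structure that makes that full-text search cheap, with invented pages: an inverted index is built mechanically, with no human classification step at all:

    # Map each word to the set of pages containing it; answering a query
    # is then just a set intersection, no taxonomy required.
    from collections import defaultdict

    pages = {
        "page1": "personal experience growing heritage tomatoes",
        "page2": "review of a gardening book",
        "page3": "buy heritage vegetable seeds from growers",
    }

    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)

    def search(*words):
        """Return pages containing every query word."""
        hits = [index.get(w, set()) for w in words]
        return set.intersection(*hits) if hits else set()

    print(search("heritage", "seeds"))  # {'page3'}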
I looked into this pretty deeply last year for work. DMOZ and OpenDNS were the two big crowdsourced ones; crowdsourcing is pretty much the only way to offer this service for free (but, as mentioned above, it quickly runs into problems of scale).
There are new solutions that rely on AI classification, but they require a lot of computing and infrastructure, and the best ones include a layer of human review, which means they are expensive to produce. Our vendor uses a neural net approach rather than cheaper solutions based on picking out keywords, but even then, there are noticeable gaps and errors in their classification.
You asked about obstacles. Here are several that we ran into when we were thinking about building an in-house solution:
- the sheer volume of webpages
- the rapidity with which those pages change
- the cost of scraping and analyzing all that text
- global language coverage
- pages can talk about multiple things
- just generating a comprehensive taxonomy for webpages with the right level of granularity is challenging. A taxonomy cannot have infinite depth; in your example, if you go down to the level of "shopping for heritage seeds directly from growers," your directory would have to include millions and millions of categories and would be harder to use than search! If you look at the old directories, they weren't that specific either.
- this is also a technical limitation: classifications are kept pretty high-level (ours has ~400 unique categories) because the finer the cut, the harder it is to model. It's much easier to distinguish 'gardening' from 'sports' than it is to separate out 'gardening from growers' and 'gardening from retailers' (see the sketch after this comment). Additionally, subject matter is easier to differentiate than the type of text -- 'reviews' will use a lot of language similar to 'personal experiences'.
None of this is to say it's impossible; there are many models that do this kind of thing, but the error rate is high. Our prototype was about 70-80% accurate -- decent for a model, useless for the user!
posted by ohkay at 1:31 PM on June 17, 2020 [1 favorite]
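As a rough illustration of the cheaper keyword-style approach mentioned above (not the vendor's neural net), here is a toy topic classifier using scikit-learn; the training data is invented, and real systems need vastly more examples per category:

    # TF-IDF features plus logistic regression: fine for coarse topics,
    # much weaker at separating text *types* (reviews vs. personal
    # experiences) that share vocabulary.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = [
        "how to grow tomatoes in raised beds",
        "pruning roses for beginners",
        "final score from last night's match",
        "transfer rumours ahead of the new season",
    ]
    labels = ["gardening", "gardening", "sports", "sports"]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(texts, labels)

    print(model.predict(["tips for pruning tomatoes"]))  # ['gardening']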
Response by poster: Hi, and thanks a lot for the answers so far.
I'd like to request answerers to ignore the 'why not' question. I am quite familiar with the act of ... I'll call it 'grouping' (weird terminology for privacy reasons - I don't want this post easily found/want to preserve my privacy), and the drawbacks described here are completely inaccurate in my experience. My experience is fairly insular though, hence the question. Even if you disagree with the premise of my ask (aka accessing information on the internet should not be left to a few capitalist behemoths; information should be systematized by a disinterested international body, NOT reliant on advertising - WTF, that is the opposite of what I asked; tech developments re. statistical analysis as well as NLP should be co-opted; etc.), please believe me that my experience is not at all in agreement with the assessments re. viability of the tech.
Additionally, I didn't ask just to entertain myself - search is much more complicated to use than it appears to people who are thoroughly familiar with it; outside of work, I don't know a single person in my environment who knows how to use it beyond simple searches. That's how/why most people I know don't access the internet - they access Facebook. That's it. Not everybody is a US techie.
It appears that this is a question with no answer; thank you very much for taking your time, and if anyone has secret knowledge up their sleeve that is not widely known, I would be interested to hear it.
posted by doggod at 2:07 PM on June 17, 2020
web search (as I've said here before) is the onscreen equivalent of a shitty 90s Phoenix strip mall.
dewey decimal-style page metadata could work, I guess. someone has to assign it, though.
self-generated metadata tags are so much bullshit.
posted by j_curiouser at 2:15 PM on June 17, 2020 [1 favorite]
I guess there's a self-described successor to DMOZ at Curlie.org. Unclear who or what is behind it or if it's actually updated.
posted by BungaDunga at 2:59 PM on June 17, 2020 [2 favorites]
Yahoo's official backronym is "Yet Another Hierarchical Officious Oracle". So I think they were trying to be hierarchical and categorical about it? Kind of before my time.
Also, don't forget about Bing? People at Bing also work hard at indexing the internet.
It's hard to do as a private person because it takes a lot of computing power to keep up with the size of the internet, and is tricky when you get into the details.
These private companies also spend a lot of effort in understanding natural language queries, so fully using a search engine doesn't have to mean knowing how to use all the operators anymore.
posted by batter_my_heart at 3:03 PM on June 17, 2020
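For illustration, contrast an old-style power-user query with a plain-language one (the quote and minus operators are real, long-standing search syntax; the queries themselves are invented):

    "heritage vegetable seeds" growers -shop -store
    where can I buy heritage vegetable seeds directly from growers?

Modern engines try hard to make the second work about as well as the first.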
As a librarian whose job it is to organize information, I have to agree with the others: indexing an appreciable portion of the web would require a vast number of subject experts working at a breakneck pace to catalog the millions of sites that exist and are constantly being created, plus a second vast team of subject experts to continually review the existing index, because websites are constantly changing, updating, going in new directions, or being overtaken by malware and pharma spam. And the sites that are never updated fall behind continuously changing subject areas and become outdated after a couple of years, which again requires subject specialists to review whether they're still useful.
You could do this on a significantly smaller scale for a limited range of subjects, and some people/organizations do. It's the library, where you can contact a search specialist, ask them for what you want, and get a curated list of results. You can also check their websites for subject guides, where librarians have spent time curating resources and keeping them updated, although they tend to spend their time doing those for subjects they are more frequently asked about. If you need a subject specialist, you can call or email an academic library: most of them actually serve the public as well as their university community.
By the way, when I drop the exact query "Give me all pages on the internet where people discuss their personal experience with gardening" into Google, the top three results include a page listing ten internet gardening forums and two pages listing 50 and 100 gardening blogs respectively, which seems to me a good place to start.
posted by telophase at 3:10 PM on June 17, 2020 [4 favorites]
There's another consideration I learned years ago when I had an archivist for a roommate. If you classify entries in advance, your searches are limited to the key terms used. A brute-force search, which is sort of what we have now, is challenging to tune to the searcher's interest, but a search over pre-assigned keywords is limited to the subjects the original classifiers thought would be relevant at the time of indexing. More work for searchers under brute force, but many more possibilities.
posted by tmdonahue at 6:27 PM on June 17, 2020 [2 favorites]
Along with the enormous scale and speed of the problem, it would have to deal with cheating - lots of shopping sites want to come up in lists of discussions, for instance. And there is money in it for the cheaters.
posted by clew at 7:30 PM on June 17, 2020 [2 favorites]
That's how/why most people I know don't access the internet - they access Facebook. That's it. Not everybody is a US techie.
I'm also a librarian, and we have a saying: "Librarians like to search, everyone else likes to find." My experience has been that a lot of people still use search engines, they just don't use them particularly well. Which is not their fault! The search engine's general MO is to find you something, anything, that seems relevant, quickly, as opposed to going to the library, where you might be able to find the perfect thing but it might take a while. And for an average "just curious" query, that speed is usually fine for people. The same with Facebook: it's fine for people, even if librarians and others who know what's going on there are horrified.
But what we know about information-seeking behavior is: it's always been this way. Study after study has shown that when people have an information need, the first thing they're likely to do is "ask a friend," even if the friend may not be the right person to answer the question. I call it the "light is better over here" phenomenon. And so a better index isn't likely to become a thing average people will even want to use, according to research.
Add to this that people looking things up usually tend toward one of two groups:
- egghead researchers who already don't mind using the existing tools and can dig deep on their own to find what they want. When I was in college we would use KWIC (keyword in context) directories which were print versions of... keywords you could look up with a few words around them to show how they were used and then you could tell if it was worth digging up the actual article. Nowadays this sort of thing is what "snippets" do in Google
- in-a-hurry people who just want to learn about a thing, in which case whatever few articles about bats show up first on Google are probably okay with them; they don't need me-the-librarian to find them the best bat article (and honestly a lot of what they need is on Wikipedia anyhow, which, while not perfect, is pretty good and better than asking your friend in most cases)
I used to classify stuff for DMOZ and it was a slog because most of the actual submissions were spammers and you'd have to fight with people ALL THE TIME about what you decided to include or not include. And I had friends who did similar stuff for Yahoo and it was worse.
However! There are a lot of pretty big library catalogs which use what is called "faceted classification" to identify key aspects of a book or other media item, so you can more easily limit searches, find stuff like other stuff, etc. So if you look at a search like this one on Open Library, you can see that you can limit by author, people, places, times, publisher, and other "facets." Using a system like this along with what librarians call "controlled vocabulary" (so there is one canonical term for an author, say, or a publisher that has had three name changes, and you can find all their books grouped together; or subjects get classified so you don't find different books under 'handgun legislation' than you do under 'gun control') can really get you a lot of the way there (a sketch follows this comment). Combine this with keyword indexing and you've got a pretty robust system, BUT it mostly only works well on print materials, and only works with books because someone front-loaded the work of assigning subject headings, picking canonical author names, and so on.
So the whole issue is tricky. It's tough for people who are information experts at some level to realize that not only are other people unaware of how these systems work, but in many cases they don't care much and it doesn't solve a problem for them. In cases where it's deadly important--law, medicine, science--there are actually capable indexers doing a lot of really good work; they're just not in generalist fields and they're rarely free.
posted by jessamyn at 8:42 PM on June 17, 2020 [4 favorites]
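A minimal sketch of faceted classification with a controlled vocabulary, using invented records and headings (real catalogs use standardized subject headings and far richer facets):

    # Each record carries independent facets; a controlled vocabulary
    # maps variant subject terms to one canonical heading before matching.
    canonical = {
        "gun control": "firearms -- law and legislation",
        "handgun legislation": "firearms -- law and legislation",
    }

    records = [
        {"title": "Gardening Through the Year", "author": "Smith, Jane",
         "subject": "gardening", "place": "England", "year": 1998},
        {"title": "Debating the Second Amendment", "author": "Doe, John",
         "subject": "firearms -- law and legislation",
         "place": "United States", "year": 2004},
    ]

    def facet_search(**facets):
        """Return records matching every requested facet value."""
        if "subject" in facets:
            facets["subject"] = canonical.get(facets["subject"],
                                              facets["subject"])
        return [r for r in records
                if all(r.get(k) == v for k, v in facets.items())]

    # Either variant subject term finds the same record:
    print(facet_search(subject="handgun legislation")[0]["title"])

The front-loaded human work lives in the canonical mapping and the facet values, which is exactly what doesn't scale to the open web.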
you might be interested in what the folks at golden.com are doing...
posted by wowenthusiast at 9:06 PM on June 17, 2020
Here’s a forum for people interested in creating internet directories. Here are some examples of what they’re talking about. I came across it via this post on the always interesting Kicks Condor blog.
posted by fabius at 12:50 AM on June 18, 2020
This thread is closed to new comments.