Where do they get all those wonderful data?
December 18, 2006 4:07 PM

WhereTheHellDoTheyGetAllThatData Filter: When someone wants to start a site that requires huge amounts of data like IMDB or Pandora or Last.FM, where do they get the data?

I can't really see that they take the time to find all the data sources and compile and link it all together into a database format. There has to be some place where they get the data. Another example is a mapping site like Mapquest or Google Maps. I know Google has more money than God or whatever, but where does such a huge amount of data come from?

Then, the side note, is how the hell do they go about processing it or relating it all together to produce a product like Pandora?

I am interested in this partly from a pure curiosity standpoint and partly from an IfIEverWantedToDoSomethingSimilar standpoint.
posted by jxpx777 to Technology (18 answers total) 5 users marked this as a favorite
 
From the Wikipedia article on IMDB:

"Information is largely provided by a cadre of volunteer contributors, with only 17 members of the staff dedicated to monitoring the data received"
posted by saraswati at 4:14 PM on December 18, 2006


IMDB just grew. It started in 1989 and has grown since then in terms of personnel, finance and data.
posted by fire&wings at 4:14 PM on December 18, 2006


I think IMDB started as a completely user-submitted site. I don't know what they do now, though.

Any mapping site is probably licensing data from NavTeq, MapLink/TeleAtlas, InfoUSA and the like.

On the IfIEverWantedToDoSomethingSimilar front, it's a pain in the ass. I'm interested in starting a site that would require a lot of business listing data, and the companies that license that sort of thing tend not to even return my email asking for price quotes.
posted by hades at 4:16 PM on December 18, 2006


Good to know about IMDB.

I just printed Google Maps directions to a friend's new apartment for tomorrow and noticed the directions say, "Map data (c) 2006 NAVTEQ(tm)"

Still curious about Pandora and the rest of the music sites. I know a lot of that data is user submitted as well...

@hades: It's also good to know what I would be up against if I ever did try to get something going. I don't have any definite plans right now, but I always have ambitions. :D
posted by jxpx777 at 4:25 PM on December 18, 2006


Reading the stories of CDDB and FreeDB is educational in this regard. Many people felt burned that after contributing their volunteer effort to CDDB (which began as a public, open-source project), the database was relicensed for private commercial use. This may or may not have been wise, but there's no arguing that thousands of people felt burned by the move.
posted by ardgedee at 4:30 PM on December 18, 2006


Pandora uses the Music Genome Project, which, at least at the start, they populated themselves. [1][2]
posted by juv3nal at 4:32 PM on December 18, 2006


Converting the data from X datasource to Y database isn't nearly as difficult as obtaining the data source. Merging the data into your spec can be tricky, as you'll no doubt find places where the rubber hits the road in an inconvenient manner. (CDDB, as an example, has more than one entry for many albums, which would be problematic if you wanted only one entry per album.)
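To make the one-entry-per-album problem concrete, here is a minimal sketch. The record fields (`artist`, `title`, `tracks`) and the sample data are made up for illustration and are not CDDB's real schema, and the rule for picking a winner (keep the entry with the most tracks listed) is just one possible policy.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(text.lower().split())


def one_entry_per_album(entries):
    """Group entries by normalized (artist, title) and keep one per group.

    The tie-break rule (keep the entry with the most tracks listed) is an
    assumption for illustration; a real merge would need its own policy.
    """
    groups = defaultdict(list)
    for entry in entries:
        key = (normalize(entry["artist"]), normalize(entry["title"]))
        groups[key].append(entry)
    return [max(group, key=lambda e: len(e["tracks"])) for group in groups.values()]


# Hypothetical sample data: two submissions for the same album.
entries = [
    {"artist": "Miles Davis", "title": "Kind of Blue", "tracks": ["So What"]},
    {"artist": "miles davis", "title": "Kind  Of Blue",
     "tracks": ["So What", "Freddie Freeloader"]},
    {"artist": "John Coltrane", "title": "Giant Steps", "tracks": ["Giant Steps"]},
]
merged = one_entry_per_album(entries)
```

This only catches differences in case and spacing. Genuine misspellings or alternate titles would need fuzzier matching, and that is where the tricky part really starts.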

A lot of the newer stuff collects data from users (Wikipedia, IMDB, Last.FM, Amazon's recommendations), though I'm sure there's some value-add tagging going on in some of these projects.

Looks like Pandora is doing the value-add bit, maybe with a bit of slick algorithm bumping it.

Those map apps often get their data from companies who make data acquisition of that sort their business -- I work for such a company myself, though the data is much different. The map folks, I believe, do a lot of GIS driving these days to get their stuff going and have their roots back in the old paper maps -- they've been doing this stuff for a while. GIS has been heating up over the past ten years, this mapping thing being only the public tip of a rather massive field. I'm sure county surveys and the like end up in some of these aggregators' systems these days.

I think any big public data project starts with a few folks entering data until it reaches the tipping point in terms of draw and other people are inspired to contribute their own stuff. This MeFi thing here draws from that to no small degree. IMDB -- as an example -- started from an emailed list and just grew and grew and grew.

Otherwise it's all about the cash and putting data entry people in seats. Back when I started here (like 12 years ago) we used to actually FedEx stuff off to the Philippines where it went to some data entry sweatshop to be typed in. Thence we went to scanning, now mostly page scrapes.
posted by Ogre Lawless at 4:32 PM on December 18, 2006


I'd guess that at least in last.fm's case there's a similar user-submitted operation at work, except that they have programs that live on people's computers and compile the individual statistics on what people are listening to. I think this is the case because you can find misspellings and other variations of songs on their artist "most played" charts, but the correct spelling/attributions will float to the top of the chart.
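The "correct spelling floats to the top" effect can be sketched roughly like this. This is only a guess at the general idea, not how last.fm actually does it; the matching key and the sample submissions are invented for illustration.

```python
from collections import Counter, defaultdict


def loose_key(name):
    """Crude matching key: lowercase letters and digits only."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def canonical_spellings(submissions):
    """Map each loose key to the spelling that was submitted most often."""
    counts = defaultdict(Counter)
    for name in submissions:
        counts[loose_key(name)][name] += 1
    return {key: counter.most_common(1)[0][0] for key, counter in counts.items()}


# Hypothetical play submissions from listeners' plugins.
plays = [
    "Radiohead", "radiohead", "Radiohead", "Radio Head", "Radiohead",
    "Portishead", "portishead", "Portishead",
]
chart = canonical_spellings(plays)
```

A key this crude only folds together case, spacing and punctuation variants. A real misspelling would land under its own key and simply rank low on the chart, which matches what you see on the site.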

As for Pandora, the only information I can find is here.
posted by ofthestrait at 4:34 PM on December 18, 2006


IMDB does get official writing credits provided by the Writers Guild. Of course, the Writers Guild has a vested interest in only having the official credits easily available.
posted by smackfu at 4:34 PM on December 18, 2006


Here's an interesting article on Google Maps maps.
posted by muddgirl at 4:36 PM on December 18, 2006


Last.fm's database started as a project called Audioscrobbler - which uses members of the public submitting their own playlist activity via iTunes jukebox etc and special plugins that connect to the Audioscrobbler / Last.fm database.
posted by skylar at 4:40 PM on December 18, 2006


Expanding on what smackfu said, there are a lot of people who have a vested interest in seeing the right info get on IMDB: studios, filmmakers, actors, their representation, etc.

For very small films at least, the filmmakers tend to submit the info themselves. There is a "submit" form, although it is incredibly buried.
posted by drjimmy11 at 4:53 PM on December 18, 2006


For Last.fm's album listings and other metadata, some is gleaned from MusicBrainz, a collaborative music metadata site. It works something like a wiki; anyone can edit and add new information, but edits are voted on. It's far from perfect, but it's the most comprehensive system of its sort.
posted by RobotAdam at 5:40 PM on December 18, 2006


When I was helping build the Borders.com site in 1998, we got most of our book data from a company called Bowker that has a database called "Books in Print" that they sell to companies like Borders, Barnes & Noble, Amazon, etc. At the time the Bowker BIP database was in incredibly bad shape from a data integrity perspective, and Borders.com had a staff of several people that did nothing but check data, clean it up, respond to customer service issues where authors were complaining about incorrect data, etc. I remember that Amazon also had a pretty large team of people doing data integrity and data clean-up at the time.

Similarly, a database of music can be bought from Muze, which also had data integrity issues some years ago, but I am certain that these available-for-purchase databases have improved greatly since online shopping and ecommerce became popular in the late 1990s.

On a more recent project, I launched Confabb, a database of conferences around the world. We took a 2-step approach in acquiring our data. The first step was to buy any available databases. One database available was from The Ultimate Trade Show Network (TSNN). The second step was to scrape a number of identified sources with a web-scraping tool like Nutch. Identified sources could be things like the Las Vegas Convention Center Calendar and IEEE Conference Database.

Lastly, we allow anyone to add a conference to the Confabb database and we simply verify that the information is correct (and not spam) before it is added.
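For anyone curious how those three kinds of sources might come together, here is a rough sketch. It is not Confabb's actual code: the record fields, the order of preference (purchased, then scraped, then user-submitted) and the `looks_legitimate` check are all assumptions made up for the example.

```python
def looks_legitimate(record):
    """Stand-in for manual verification (hypothetical rule: no links in names)."""
    return "http" not in record["name"].lower()


def merge_sources(purchased, scraped, submitted, is_verified=looks_legitimate):
    """Combine conference records, keeping the first record seen per conference.

    Sources are visited in an assumed order of trust: purchased data first,
    then scraped data, then user submissions that pass verification.
    """
    merged = {}
    verified = [record for record in submitted if is_verified(record)]
    for source in (purchased, scraped, verified):
        for record in source:
            key = (record["name"].strip().lower(), record["year"])
            merged.setdefault(key, record)
    return list(merged.values())


# Hypothetical sample records.
purchased = [{"name": "Example Expo", "year": 2007, "source": "purchased"}]
scraped = [
    {"name": "example expo ", "year": 2007, "source": "scraped"},
    {"name": "Sample Summit", "year": 2007, "source": "scraped"},
]
submitted = [
    {"name": "Local Meetup", "year": 2007, "source": "user"},
    {"name": "Cheap pills http://spam.example", "year": 2007, "source": "user"},
]
conferences = merge_sources(purchased, scraped, submitted)
```

The scraped duplicate of "Example Expo" is dropped in favor of the purchased record, and the spam submission never makes it in.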
posted by camworld at 6:47 PM on December 18, 2006 [1 favorite]


AMG / AllMusic.com licenses their data. MP3.com is one example in the real world.
posted by smackfu at 7:57 PM on December 18, 2006


IMDb originated on USENET as "The List". From the rec.arts.movies FAQ, 1990:
Another project is called, simply, "The List." It is currently maintained by Andrew Kreig (k...@jupiter.med.ge.com), and is a long list of female actors and the films they have been in. This list has been described as "Actresses I'd most like to pork." although Andrew Krieg's reply to that is: "I wouldn't say the list is a collection of 'Actresses I'd most like to pork.' It's more of a 'What movies can I rent if I want to see Miss XXXX.' True, most of the women on this list are lookers, but hey, that's why they're in the movies. We've all had our secret crushes on movie stars, and 'THE LIST' is a way to locate the films they are in."
Later other people started maintaining lists of male actors, directors, etc.
posted by russilwvong at 10:28 PM on December 18, 2006


I have a small hobby site that aggregates information about certain collectibles. There are about 12,500 items now and 50,000 photos. Fully half of those have been added, one at a time, by a single individual who enjoys the challenge of finding them. I didn't know him before I put the site up and have never paid him (or anyone, including myself).

I guess what I'm suggesting is that for many things there are people so enthusiastic about the process that the data will accumulate naturally over time.

(This obviously doesn't necessarily apply to a commercial endeavor or one with complex technical requirements.)
posted by maxwelton at 10:41 PM on December 18, 2006


Thanks to all for these responses. Curiosity satisfied. :D
posted by jxpx777 at 6:49 AM on December 19, 2006

