<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:admin="http://webns.net/mvcb/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
	<channel> 

	<title>Comments on: Where do they get all those wonderful data?</title>
	<link>http://ask.metafilter.com/53440/Where-do-they-get-all-those-wonderful-data/</link>
	<description>Comments on Ask MetaFilter post Where do they get all those wonderful data?</description>
	<pubDate>Mon, 18 Dec 2006 16:14:29 -0800</pubDate>
	<lastBuildDate>Mon, 18 Dec 2006 16:14:29 -0800</lastBuildDate>
	<language>en-us</language>
	<docs>http://blogs.law.harvard.edu/tech/rss</docs>
	<ttl>60</ttl>

	<item>
		<title>Question: Where do they get all those wonderful data?</title>
		<link>http://ask.metafilter.com/53440/Where-do-they-get-all-those-wonderful-data</link>	
		<description>WhereTheHellDoTheyGetAllThatData Filter: When someone wants to start a site that requires huge amounts of data like IMDB or Pandora or Last.FM, where do they get the data? &lt;br /&gt;&lt;br /&gt; I can&apos;t really see that they take the time to find all the data sources and compile and link it all together into a database format. There has to be some place where they get the data. Another example is a mapping site like Mapquest or Google Maps. I know Google has more money than God or whatever, but where does such a huge amount of data come from?&lt;br&gt;
&lt;br&gt;
Then, the side note, is how the hell do they go about processing it or relating it all together to produce a product like Pandora?&lt;br&gt;
&lt;br&gt;
I am interested in this partly from a pure curiosity standpoint and partly from an IfIEverWantedToDoSomethingSimilar standpoint.</description>
		<guid isPermaLink="false">post:ask.metafilter.com,2006:site.53440</guid>
		<pubDate>Mon, 18 Dec 2006 16:07:16 -0800</pubDate>
		<dc:creator>jxpx777</dc:creator>
		
			<category>data</category>
		
			<category>database</category>
		
			<category>pandora</category>
		
			<category>lastfm</category>
		
			<category>imdb</category>
		
			<category>map</category>
		
			<category>maps</category>
		
			<category>mapquest</category>
		
			<category>google</category>
		
			<category>googlemaps</category>
		
	</item> <item>
		<title>By: saraswati</title>
		<link>http://ask.metafilter.com/53440/Where-do-they-get-all-those-wonderful-data#805265</link>	
		<description>From the &lt;a href=&quot;http://en.wikipedia.org/wiki/Imdb&quot;&gt;Wikipedia article on IMDB&lt;/a&gt;:&lt;br&gt;
&lt;br&gt;
&quot;Information is largely provided by a cadre of volunteer contributors, with only 17 members of the staff dedicated to monitoring the data received&quot;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.53440-805265</guid>
		<pubDate>Mon, 18 Dec 2006 16:14:29 -0800</pubDate>
		<dc:creator>saraswati</dc:creator>
	</item><item>
		<title>By: fire&amp;wings</title>
		<link>http://ask.metafilter.com/53440/Where-do-they-get-all-those-wonderful-data#805266</link>	
		<description>IMDB just grew. It started in 1989 and has grown since then in terms of personnel, finance and data.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.53440-805266</guid>
		<pubDate>Mon, 18 Dec 2006 16:14:43 -0800</pubDate>
		<dc:creator>fire&amp;wings</dc:creator>
	</item><item>
		<title>By: hades</title>
		<link>http://ask.metafilter.com/53440/Where-do-they-get-all-those-wonderful-data#805268</link>	
		<description>I think IMDB started as a completely user-submitted site. I don&apos;t know what they do now, though.&lt;br&gt;
&lt;br&gt;
Any mapping site is probably licensing data from NavTeq,  MapLink/TeleAtlas, InfoUSA and the like.&lt;br&gt;
&lt;br&gt;
On the IfIEverWantedToDoSomethingSimilar front, it&apos;s a pain in the ass. I&apos;m interested in starting a site that would require a lot of business listing data, and the companies that license that sort of thing tend not to even return my email asking for price quotes.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.53440-805268</guid>
		<pubDate>Mon, 18 Dec 2006 16:16:54 -0800</pubDate>
		<dc:creator>hades</dc:creator>
	</item><item>
		<title>By: jxpx777</title>
		<link>http://ask.metafilter.com/53440/Where-do-they-get-all-those-wonderful-data#805275</link>	
		<description>Good to know about IMDB.&lt;br&gt;
&lt;br&gt;
I just printed Google Maps directions to a friend&apos;s new apartment for tomorrow and noticed the directions say, &quot;Map data (c) 2006 NAVTEQ(tm)&quot;&lt;br&gt;
&lt;br&gt;
Still curious about Pandora and the rest of the music sites. I know a lot of that data is user submitted as well...&lt;br&gt;
&lt;br&gt;
@hades: It&apos;s also good to know what I would be up against if I ever did try to get something going. I don&apos;t have any definite plans right now, but I always have ambitions. :D</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.53440-805275</guid>
		<pubDate>Mon, 18 Dec 2006 16:25:35 -0800</pubDate>
		<dc:creator>jxpx777</dc:creator>
	</item><item>
		<title>By: ardgedee</title>
		<link>http://ask.metafilter.com/53440/Where-do-they-get-all-those-wonderful-data#805278</link>	
		<description>Reading the stories of &lt;a href=&quot;http://en.wikipedia.org/wiki/Cddb&quot;&gt;CDDB&lt;/a&gt; and &lt;a href=&quot;http://en.wikipedia.org/wiki/Freedb&quot;&gt;FreeDB&lt;/a&gt; is educational in this regard. Many people felt burned that after contributing their volunteer effort to CDDB (which began as a public, open-source project), the database was relicensed for private commercial use. This may or may not have been wise, but there&apos;s no arguing that thousands of people felt burned by the move.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.53440-805278</guid>
		<pubDate>Mon, 18 Dec 2006 16:30:16 -0800</pubDate>
		<dc:creator>ardgedee</dc:creator>
	</item><item>
		<title>By: juv3nal</title>
		<link>http://ask.metafilter.com/53440/Where-do-they-get-all-those-wonderful-data#805280</link>	
		<description>Pandora uses the Music Genome Project, which, at least at the start, populated its database itself. &lt;a href=&quot;http://en.wikipedia.org/wiki/Music_Genome_Project&quot;&gt;[1]&lt;/a&gt;&lt;a href=&quot;http://www.pandora.com/mgp.shtml&quot;&gt;[2]&lt;/a&gt;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.53440-805280</guid>
		<pubDate>Mon, 18 Dec 2006 16:32:07 -0800</pubDate>
		<dc:creator>juv3nal</dc:creator>
	</item><item>
		<title>By: Ogre Lawless</title>
		<link>http://ask.metafilter.com/53440/Where-do-they-get-all-those-wonderful-data#805281</link>	
		<description>Converting the data from X data source to Y database isn&apos;t nearly as difficult as obtaining the data source.  Merging the data into your spec can be tricky, as you&apos;ll no doubt find places where the rubber hits the road in an inconvenient manner.  (CDDB, as an example, has more than one entry for many albums, which would be problematic if you wanted only one entry per album.)&lt;br&gt;
&lt;br&gt;
A lot of the newer stuff collects data from users (Wikipedia, IMDB, Last.FM, Amazon&apos;s Recommendation) though there&apos;s some value-add tagging going on in some of these projects I&apos;m sure.&lt;br&gt;
&lt;br&gt;
&lt;a href=&quot;http://www.pandora.com/mgp.shtml&quot;&gt;Looks like Pandora is doing the value-add bit&lt;/a&gt; maybe with a bit of slick algorithm &lt;a href=&quot;http://en.wikipedia.org/wiki/Music_Genome_Project&quot;&gt;bumping it&lt;/a&gt;.&lt;br&gt;
&lt;br&gt;
The data in those map apps often comes from companies that make data acquisition of that sort their business -- I work for such a company myself, though the data is much different.  The map folks, I believe, do a lot of GIS driving these days to get their stuff going and have their roots back in the old paper maps -- they&apos;ve been doing this stuff for a while.  GIS has been heating up over the past ten years, this mapping thing being only the public tip of a rather massive field.  I&apos;m sure county surveys and the like end up in some of these aggregators&apos; systems these days.&lt;br&gt;
&lt;br&gt;
I think any big public data project starts with a few folks entering data until it reaches the tipping point in terms of draw and other people are inspired to contribute their own stuff.  This MeFi thing here draws from that to no small degree.  IMDB -- as an example -- started from an emailed list and just grew and grew and grew.&lt;br&gt;
&lt;br&gt;
Otherwise it&apos;s all about the cash and putting data entry people in seats.  Back when I started here (like 12 years ago) we used to actually FedEx stuff off to the Philippines where it went to some data entry sweatshop to be typed in.  Thence we went to scanning, now mostly page scrapes.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.53440-805281</guid>
		<pubDate>Mon, 18 Dec 2006 16:32:48 -0800</pubDate>
		<dc:creator>Ogre Lawless</dc:creator>
	</item><item>
		<title>By: ofthestrait</title>
		<link>http://ask.metafilter.com/53440/Where-do-they-get-all-those-wonderful-data#805283</link>	
		<description>I&apos;d guess that at least in last.fm&apos;s case there&apos;s a similar user-submitted operation at work, except that they have programs that live on people&apos;s computers and compile the individual statistics on what people are listening to. I think this is the case because you can find misspellings and other variations of songs on their artist &quot;most played&quot; charts, but the correct spelling/attributions will float to the top of the chart.&lt;br&gt;
&lt;br&gt;
As for pandora, the only information I can find is &lt;a href=&quot;http://pandora.com/mgp.shtml&quot;&gt;here.&lt;/a&gt;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.53440-805283</guid>
		<pubDate>Mon, 18 Dec 2006 16:34:23 -0800</pubDate>
		<dc:creator>ofthestrait</dc:creator>
	</item><item>
		<title>By: smackfu</title>
		<link>http://ask.metafilter.com/53440/Where-do-they-get-all-those-wonderful-data#805284</link>	
		<description>IMDB does get &lt;a href=&quot;http://www.imdb.com/wga&quot;&gt;official writing credits&lt;/a&gt; provided by the Writers Guild.  Of course, the Writers Guild has a vested interest in only having the official credits easily available.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.53440-805284</guid>
		<pubDate>Mon, 18 Dec 2006 16:34:51 -0800</pubDate>
		<dc:creator>smackfu</dc:creator>
	</item><item>
		<title>By: muddgirl</title>
		<link>http://ask.metafilter.com/53440/Where-do-they-get-all-those-wonderful-data#805286</link>	
		<description>&lt;a href=&quot;http://radar.oreilly.com/archives/2005/10/google_maps_and_their_data_pro_1.html&quot;&gt;Here&apos;s an interesting article&lt;/a&gt; on Google Maps maps.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.53440-805286</guid>
		<pubDate>Mon, 18 Dec 2006 16:36:11 -0800</pubDate>
		<dc:creator>muddgirl</dc:creator>
	</item><item>
		<title>By: skylar</title>
		<link>http://ask.metafilter.com/53440/Where-do-they-get-all-those-wonderful-data#805287</link>	
		<description>Last.fm&apos;s database started as a project called Audioscrobbler - which uses members of the public submitting their own playlist activity via iTunes jukebox etc and special plugins that connect to the Audioscrobbler / Last.fm database.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.53440-805287</guid>
		<pubDate>Mon, 18 Dec 2006 16:40:04 -0800</pubDate>
		<dc:creator>skylar</dc:creator>
	</item><item>
		<title>By: drjimmy11</title>
		<link>http://ask.metafilter.com/53440/Where-do-they-get-all-those-wonderful-data#805295</link>	
		<description>Expanding on what smackfu said, there are a lot of people who have a vested interest in seeing the right info get on IMDB: studios, filmmakers, actors, and their representation, etc.&lt;br&gt;
&lt;br&gt;
For very small films at least, the filmmakers tend to submit the info themselves. There is a &quot;submit&quot; form, although it is incredibly buried.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.53440-805295</guid>
		<pubDate>Mon, 18 Dec 2006 16:53:05 -0800</pubDate>
		<dc:creator>drjimmy11</dc:creator>
	</item><item>
		<title>By: RobotAdam</title>
		<link>http://ask.metafilter.com/53440/Where-do-they-get-all-those-wonderful-data#805322</link>	
		<description>For Last.fm&apos;s album listings and other metadata, some is gleaned from &lt;a href=&quot;http://musicbrainz.org&quot;&gt;MusicBrainz&lt;/a&gt;, a collaborative music metadata site. It works something like a wiki; anyone can edit and add new information, but edits are voted on. It&apos;s far from perfect, but it&apos;s the most comprehensive system of its sort.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.53440-805322</guid>
		<pubDate>Mon, 18 Dec 2006 17:40:58 -0800</pubDate>
		<dc:creator>RobotAdam</dc:creator>
	</item><item>
		<title>By: camworld</title>
		<link>http://ask.metafilter.com/53440/Where-do-they-get-all-those-wonderful-data#805360</link>	
		<description>When I was helping build the Borders.com site in 1998, we got most of our book data from a company called &lt;a href=&quot;http://www.bowker.com/&quot;&gt;Bowker&lt;/a&gt; that has a database called &quot;Books in Print&quot; that they sell to companies like Borders, Barnes &amp;amp; Noble, Amazon, etc. At the time the Bowker BIP database was in incredibly bad shape from a data integrity perspective, and Borders.com had a staff of several people that did nothing but check data, clean it up, respond to customer service issues where authors were complaining about incorrect data, etc. I remember that at the time Amazon also had a pretty large team of people doing data integrity and data clean-up.&lt;br&gt;
&lt;br&gt;
Similarly, a database of music can be bought from &lt;a href=&quot;http://www.bowker.com/&quot;&gt;Muze&lt;/a&gt;, which also had data integrity issues some years ago, but I am certain that these available-for-purchase databases have improved greatly since online shopping and ecommerce became popular in the late 1990s.&lt;br&gt;
&lt;br&gt;
More recently, I launched &lt;a href=&quot;http://confabb.com&quot;&gt;Confabb&lt;/a&gt;, a database of conferences around the world. We took a 2-step approach in acquiring our data. The first step was to buy any available databases. One database available was from The &lt;a href=&quot;http://tsnn.com/&quot;&gt;Ultimate Trade Show Network&lt;/a&gt; (TSNN). The second step was to &lt;a href=&quot;http://en.wikipedia.org/wiki/Screen_scraping&quot;&gt;scrape&lt;/a&gt; a number of identified sources with a web-scraping tool like &lt;a href=&quot;http://lucene.apache.org/nutch/&quot;&gt;Nutch&lt;/a&gt;. Identified sources could be things like the &lt;a href=&quot;http://www.lvcva.com/meetings/convention-calendar.jsp&quot;&gt;Las Vegas Convention Center Calendar&lt;/a&gt; and &lt;a href=&quot;http://www.ieee.org/conferencesearch/&quot;&gt;IEEE Conference Database&lt;/a&gt;.&lt;br&gt;
&lt;br&gt;
Lastly, we allow anyone to add a conference to the Confabb database, and we simply verify that the information is correct (and not spam) before it is added.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.53440-805360</guid>
		<pubDate>Mon, 18 Dec 2006 18:47:29 -0800</pubDate>
		<dc:creator>camworld</dc:creator>
	</item><item>
		<title>By: smackfu</title>
		<link>http://ask.metafilter.com/53440/Where-do-they-get-all-those-wonderful-data#805429</link>	
		<description>&lt;a href=&quot;http://www.allmediaguide.com/data.html&quot;&gt;AMG / AllMusic.com licenses their data&lt;/a&gt;.  MP3.com is one example in the real world.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.53440-805429</guid>
		<pubDate>Mon, 18 Dec 2006 19:57:47 -0800</pubDate>
		<dc:creator>smackfu</dc:creator>
	</item><item>
		<title>By: russilwvong</title>
		<link>http://ask.metafilter.com/53440/Where-do-they-get-all-those-wonderful-data#805517</link>	
		<description>IMDb originated on USENET as &quot;The List&quot;. From the &lt;a href=&quot;http://groups.google.ca/group/rec.arts.movies/msg/36771f5c33541d2a&quot;&gt;rec.arts.movies FAQ&lt;/a&gt;, 1990: &lt;blockquote&gt;Another project is called, simply, &quot;The List.&quot;  It is currently maintained by Andrew Kreig (k...@jupiter.med.ge.com), and is a long list of female actors and the films they have been in.  This list has been described as &quot;Actresses I&apos;d most like to pork.&quot; although Andrew Krieg&apos;s reply to that is: &quot;I wouldn&apos;t say the list is a collection of &apos;Actresses I&apos;d most like to pork.&apos; It&apos;s more of a &apos;What movies can I rent if I want to see Miss XXXX.&apos;  True, most of the women on this list are lookers, but hey, that&apos;s why they&apos;re in the movies.  We&apos;ve all had our secret crushes on movie stars, and &apos;THE LIST&apos; is a way to locate the films they are in.&quot;&lt;/blockquote&gt; Later &lt;a href=&quot;http://groups.google.ca/group/rec.arts.movies.current-films/msg/e0333b5fee5936f6&quot;&gt;other people&lt;/a&gt; started maintaining lists of male actors, directors, etc.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.53440-805517</guid>
		<pubDate>Mon, 18 Dec 2006 22:28:40 -0800</pubDate>
		<dc:creator>russilwvong</dc:creator>
	</item><item>
		<title>By: maxwelton</title>
		<link>http://ask.metafilter.com/53440/Where-do-they-get-all-those-wonderful-data#805522</link>	
		<description>I have a small hobby site that aggregates information about certain collectibles. There are about 12,500 items now and 50,000 photos. Fully half of those have been added, one at a time, by a single individual who enjoys the challenge of finding them. I didn&apos;t know him before I put the site up and have never paid him (or anyone, including myself).&lt;br&gt;
&lt;br&gt;
I guess what I&apos;m suggesting is that for many things there are people so enthusiastic about the process that the data will accumulate naturally over time.&lt;br&gt;
&lt;br&gt;
(This obviously doesn&apos;t necessarily apply to a commercial endeavor or one with complex technical requirements.)</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.53440-805522</guid>
		<pubDate>Mon, 18 Dec 2006 22:41:12 -0800</pubDate>
		<dc:creator>maxwelton</dc:creator>
	</item><item>
		<title>By: jxpx777</title>
		<link>http://ask.metafilter.com/53440/Where-do-they-get-all-those-wonderful-data#805698</link>	
		<description>Thanks to all for these responses. Curiosity satisfied. :D</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2006:site.53440-805698</guid>
		<pubDate>Tue, 19 Dec 2006 06:49:53 -0800</pubDate>
		<dc:creator>jxpx777</dc:creator>
	</item>
	</channel>
</rss>
