<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:admin="http://webns.net/mvcb/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
	<channel>
	  <title>Ask MetaFilter questions tagged with scraping</title>
      <link>http://ask.metafilter.com/tags/scraping</link>
      <description>Questions tagged with 'scraping' at Ask MetaFilter.</description>
	  <pubDate>Thu, 19 Nov 2009 15:13:41 -0800</pubDate>
	  <lastBuildDate>Thu, 19 Nov 2009 15:13:41 -0800</lastBuildDate>

      <language>en-us</language>
	  <docs>http://blogs.law.harvard.edu/tech/rss</docs>
	  <ttl>60</ttl>	  
	<item>
	<title>Scraping &amp; saving?</title>
	<link>http://ask.metafilter.com/138587/Scraping%2Dand%2Dsaving</link>	
	<description>I have perhaps a thousand delicious links (to documents in the SEC database).
All of these could be broken at any time if the SEC changes the way it displays them.
How do I automate the process of copying the contents of those documents so I can save them in a database? I have checked previous questions and web scraping software, but the web scraping/crawling/spidering software out there requires what looks a little too much like programming to me.&lt;br&gt;
&lt;br&gt;
Is there an easy way to collect the documents? I am hoping to feed something the list of links and be done. Fair warning: if I can figure this out, then I will ask how best to save the documents in a database. I have considered using Mechanical Turk, or something, but I think this really ought to be a job for a machine. Free software solutions preferred, but willing to pay to make it easy for me to do...&lt;br&gt;
&lt;br&gt;
Sample document: &lt;br&gt;
&lt;a href=&quot;http://www.sec.gov/Archives/edgar/data/66740/000110465909047028/a09-17166_2ex99d1.htm&quot;&gt;http://www.sec.gov/Archives/edgar/data/66740/000110465909047028/a09-17166_2ex99d1.htm&lt;/a&gt;</description>
	<guid isPermaLink="false">tag:ask.metafilter.com,2009:site.138587</guid>
	<pubDate>Thu, 19 Nov 2009 15:13:41 -0800</pubDate>
	<category>scraping</category>
	<dc:creator>extropy</dc:creator>
	</item>
	<item>
	<title>Not-so-sof(f)t shoes</title>
	<link>http://ask.metafilter.com/133681/Notsosofft%2Dshoes</link>	
	<description>Shoe problems.  How do I keep a new-ish pair of shoes from scraping my heels/ankles? I recently bought a pair of kitten-heeled Sofft pumps at a thrift store.  They are adorable and fit me perfectly, except for one problem: The back of the shoe scrapes against my heel when I walk.  I assumed this would stop once I&apos;d broken them in, but after two full days of wear they still scraped against my heels.&lt;br&gt;
&lt;br&gt;
I recently had to purchase a few new pairs of shoes, and had similar problems during the break-in period.  The blisters from those are still present, and breaking in YET ANOTHER new pair of shoes may be exacerbating an existing problem.  Would stretching the heels help?  (When I bought them, the heels were a bit narrow, as though the owner kicked them off by stepping out of one with the opposite foot.)</description>
	<guid isPermaLink="false">tag:ask.metafilter.com,2009:site.133681</guid>
	<pubDate>Thu, 24 Sep 2009 06:37:01 -0800</pubDate>
	<category>breaking-in</category>
	<category>scraping</category>
	<category>shoes</category>
	<category>sofft</category>
	<dc:creator>pxe2000</dc:creator>
	</item>
	<item>
	<title>I don&apos;t work for the government, promise.</title>
	<link>http://ask.metafilter.com/124574/I%2Ddont%2Dwork%2Dfor%2Dthe%2Dgovernment%2Dpromise</link>	
	<description>I&apos;m trying to map protests in the United States, but I&apos;m grappling with data sources (and will eventually tangle with data management).  Any ideas? I&apos;d like to map out protests, riots, bombings, and other cheerful social outings - ideally in the United States, where I have the most contextual knowledge, but that&apos;s not a necessity.&lt;br&gt;
&lt;br&gt;
My original plan was to scrape AP&apos;s US news RSS feed, store everything in some sort of XML database, and then query that for what I need.  I just checked their RSS format, and it unfortunately doesn&apos;t include the full article.  Nor does it include a separate tag for the location, which would make geocoding a bit/much nastier.  NYT&apos;s feeds are basically the same story.  I don&apos;t really know where to go from here.&lt;br&gt;
&lt;br&gt;
There are basically five steps, and I would love advice on any:&lt;br&gt;
1. Scrape database of news articles.&lt;br&gt;
2. Store in a format that would allow querying by date or location.  I&apos;d like to keep all the articles, too, because... really, that would be an awesome dataset.&lt;br&gt;
3. Tag protests (method: NLP, Mech Turk, or caffeinated McB).&lt;br&gt;
4. Tag with date and location.&lt;br&gt;
5. Make pretty maps.&lt;br&gt;
&lt;br&gt;
Step 6 is going crazy with spatial stats, but I&apos;ve got that part covered.  I&apos;ve been letting this project fester for too long, and it is now certifiably &lt;a href=&quot;http://www.zefrank.com/theshow/archives/2006/07/071106.html&quot;&gt;brain crack&lt;/a&gt;.  Any advice on 1-5 would be greatly appreciated.&lt;br&gt;
&lt;br&gt;
Aside: I really have thought about the ethical consequences of this.  If you&apos;re concerned, MeFiMail me and I&apos;ll do my best to assuage your doubts.</description>
	<guid isPermaLink="false">tag:ask.metafilter.com,2009:site.124574</guid>
	<pubDate>Thu, 11 Jun 2009 18:32:32 -0800</pubDate>
	<category>database</category>
	<category>geocoding</category>
	<category>gis</category>
	<category>maps</category>
	<category>news</category>
	<category>protests</category>
	<category>rss</category>
	<category>scraping</category>
	<category>spatial</category>
	<dc:creator>McBearclaw</dc:creator>
	</item>
	<item>
	<title>Scraping paint.</title>
	<link>http://ask.metafilter.com/121543/Scraping%2Dpaint</link>	
	<description>How will I know when I&apos;m done scraping the paint off of this window molding?  What are the final steps before painting? I&apos;ve been using a combination of a stripper and scraping to remove paint from a reasonably ornate window molding.  Some areas are bare wood, some are close, and some still have several layers of paint on them.  Do I need to go to bare wood all over, or is it OK to leave some paint as long as there aren&apos;t abrupt level transitions between areas?&lt;br&gt;
&lt;br&gt;
My assumption has been that the final step will be to use a coarse grit paper to get rid of any leftover paint nubbins.  What is the best way to clean off all the paint dust after that?  Should I be doing something else?&lt;br&gt;
&lt;br&gt;
The whole will be repainted with latex primer and enamel.</description>
	<guid isPermaLink="false">tag:ask.metafilter.com,2009:site.121543</guid>
	<pubDate>Thu, 07 May 2009 16:17:43 -0800</pubDate>
	<category>diy</category>
	<category>painting</category>
	<category>paintscraping</category>
	<category>resolved</category>
	<category>scraping</category>
	<dc:creator>OmieWise</dc:creator>
	</item>
	<item>
	<title>Newspaper Clippings 2.0?</title>
	<link>http://ask.metafilter.com/121540/Newspaper%2DClippings%2D20</link>	
	<description>Can I automate archiving/saving news articles on a certain topic that I pull from Google News&apos;s RSS feed? For a while, I was manually copying and saving every Google News article on a certain topic that showed up in my RSS feed. But it became cumbersome, so I stopped. Now I&apos;ve looked back at those articles from a few years ago, and I wish I had kept up. Is there a way to automate something like that? A modern-day newspaper clipping collection, only automated? I don&apos;t want to save just the URL, but the actual text of the article, where it is from, the date, and possibly pictures.&lt;br&gt;
&lt;br&gt;
This is for my own personal use, so I doubt it would fall under any copyright issues (I would assume).&lt;br&gt;
&lt;br&gt;
I did a search, but my Google-fu is failing me. I keep coming across the Google News archive, but that&apos;s not really what I&apos;m looking for. I want my own personal copies. I don&apos;t know how the Google News archive works, but I know that some articles I originally got from Google News are not in their archive (I just checked).</description>
	<guid isPermaLink="false">tag:ask.metafilter.com,2009:site.121540</guid>
	<pubDate>Thu, 07 May 2009 15:36:38 -0800</pubDate>
	<category>archiving</category>
	<category>google</category>
	<category>mining</category>
	<category>news</category>
	<category>scraping</category>
	<dc:creator>[insert clever name here]</dc:creator>
	</item>
	<item>
	<title>Running into Ruby Restriction</title>
	<link>http://ask.metafilter.com/116197/Running%2Dinto%2DRuby%2DRestriction</link>	
	<description>In Ruby, how can you get around the ~65,500 character limit when grabbing a web page? I&apos;m new to Ruby, but have started to use it to scrape information from websites.  I have been using either the Hpricot package or Net::HTTP.  Unfortunately, when I&apos;ve tried using these to scrape larger web pages, the streams cut off after 65,500 characters.  I haven&apos;t found any information online about this.  Is there a way to get around this limit? Can you split the stream across two arrays or strings? Or will I have to manage the stream myself with new code?</description>
	<guid isPermaLink="false">tag:ask.metafilter.com,2009:site.116197</guid>
	<pubDate>Mon, 09 Mar 2009 04:06:27 -0800</pubDate>
	<category>html</category>
	<category>limit</category>
	<category>ruby</category>
	<category>scraping</category>
	<dc:creator>FuManchu</dc:creator>
	</item>
	<item>
	<title>Web scraping for dummies</title>
	<link>http://ask.metafilter.com/98518/Web%2Dscraping%2Dfor%2Ddummies</link>	
	<description>How does web scraping work with PHP/mySQL? What best practices are there? I&apos;m curious about how price comparison services do and manage web scraping, i.e. finding information in unstructured HTML files over many different sites and presenting the information on their own sites. Ultimately, I would like to learn enough about web scraping so that I can create a functional site that, for example, displays a list of dishes that are linked to various recipe sites.&lt;br&gt;
&lt;br&gt;
Stuff that I wonder about:&lt;br&gt;
1. In general terms, how would you code the project using PHP/mySQL? Any code libraries that can be used for scraping?&lt;br&gt;
&lt;br&gt;
2. I understand that you can regexp data from the scraped html files, but aren&apos;t there more intelligent ways of extracting the data? I&apos;m thinking about XSLT and such.&lt;br&gt;
&lt;br&gt;
3. How do you handle form-generated pages? For example, recipe sites that let you search for recipes using check boxes, pull-down menus, etc.? Again, are there any smart code libraries out there that simplify this?&lt;br&gt;
&lt;br&gt;
4. Are there any best practices regarding managing scraping, storage, data manipulation, performance, ethics, etc, that I should be aware of?</description>
	<guid isPermaLink="false">tag:ask.metafilter.com,2008:site.98518</guid>
	<pubDate>Wed, 06 Aug 2008 15:06:05 -0800</pubDate>
	<category>coding</category>
	<category>mysql</category>
	<category>php</category>
	<category>programming</category>
	<category>scraping</category>
	<dc:creator>Foci for Analysis</dc:creator>
	</item>
	<item>
	<title>Gathering ebay infos.</title>
	<link>http://ask.metafilter.com/97355/Gathering%2Debay%2Dinfos</link>	
	<description>cp /mnt/com/ebay/completed_auctions/ps3 ~/ps3.txt  ?? I have the following pieces of information:&lt;br&gt;
&lt;br&gt;
a. Search criteria&lt;br&gt;
b. Date range.&lt;br&gt;
&lt;br&gt;
From this, I would like an Excel spreadsheet that has the following columns:&lt;br&gt;
&lt;br&gt;
end price, end date, description, item #, seller, etc&lt;br&gt;
&lt;br&gt;
Is there any way to do such a thing in an automated fashion without screen scraping myself?</description>
	<guid isPermaLink="false">tag:ask.metafilter.com,2008:site.97355</guid>
	<pubDate>Wed, 23 Jul 2008 15:48:29 -0800</pubDate>
	<category>auctions</category>
	<category>completed</category>
	<category>database</category>
	<category>ebay</category>
	<category>excel</category>
	<category>scraping</category>
	<category>screen</category>
	<dc:creator>SirStan</dc:creator>
	</item>
	<item>
	<title>Datamining the public web</title>
	<link>http://ask.metafilter.com/68120/Datamining%2Dthe%2Dpublic%2Dweb</link>	
	<description>How do I build a data warehouse that scrapes data from public websites for my own use? Tools? Tips? Hi. I would like to track apartments on a classifieds site and use the data to analyze the impact of different factors on price. What I need is a tool or scripting language that would make it easy for me to spider the website and put the data in a database. Preferably this would be an open source solution. &lt;br&gt;
&lt;br&gt;
I am also looking for good tools for extracting information from longer pieces of text. For example, on the site I want to mine, users can comment on every object. I would like to be able to decide whether a comment is positive, negative, or neither. I have seen this done on an online art site whose name I can&apos;t remember right now. The artist used blog posts and inferred the mood of the writer from the words used.</description>
	<guid isPermaLink="false">tag:ask.metafilter.com,2007:site.68120</guid>
	<pubDate>Mon, 30 Jul 2007 01:52:51 -0800</pubDate>
	<category>datamining</category>
	<category>datawarehouse</category>
	<category>scraping</category>
	<category>spidering</category>
	<dc:creator>ilike</dc:creator>
	</item>
	<item>
	<title>What is that noise?</title>
	<link>http://ask.metafilter.com/64352/What%2Dis%2Dthat%2Dnoise</link>	
	<description>Bicyclefilter: A pronounced grinding/scraping from my rear wheel. Is this potentially unsafe, or simply embarrassing, to ride? When I say pronounced, I mean passersby chuckle. The wheel sounds really old, really sad and really grumpy. When I spin the wheel freely, the gears wobble a bit, leading me to think the threads on the gears or the axle are close to stripped (?? I can still pedal just fine, so I don&apos;t know).&lt;br&gt;
&lt;br&gt;
What&apos;s the worst that can happen, realistically? The wheel falls apart? Or I just suddenly find I&apos;m pedalling but not moving?</description>
	<guid isPermaLink="false">tag:ask.metafilter.com,2007:site.64352</guid>
	<pubDate>Fri, 08 Jun 2007 09:52:00 -0800</pubDate>
	<category>bicycle</category>
	<category>bike</category>
	<category>grinding</category>
	<category>noise</category>
	<category>rear</category>
	<category>scraping</category>
	<category>wheel</category>
	<dc:creator>poweredbybeard</dc:creator>
	</item>
	<item>
	<title>Web scraping online banking sites?</title>
	<link>http://ask.metafilter.com/46126/Web%2Dscraping%2Donling%2Dbanking%2Dsites</link>	
	<description>Is there a tool/tutorial for web scraping secure sites like online banking, or other sites that make sure that it&apos;s you who is logging in? I&apos;ve been teaching myself web scraping and have found it extremely useful for building proxies for my Nokia tablet and for automation, using Perl &amp;amp; LWP.&lt;br&gt;
&lt;br&gt;
I want to write something that will grab my bank balance, but those online banking sites seem to jump through lots of hoops to make sure you are the one sitting at your computer trying to log in.&lt;br&gt;
&lt;br&gt;
I&apos;ve tried deciphering all the tactics they use (JavaScript, cookies), but it seems like there are more tricks in use that I don&apos;t know about.&lt;br&gt;
&lt;br&gt;
Is there some utility that lets you analyze the actual behind-the-scenes events, headers, cookies, etc., so I can just repeat them?&lt;br&gt;
&lt;br&gt;
A low level tool would be great, but I&apos;d be happy with a web recorder/automation program that would log me in (based on keystrokes, etc) and save the html so I could get my balance.</description>
	<guid isPermaLink="false">tag:ask.metafilter.com,2006:site.46126</guid>
	<pubDate>Thu, 07 Sep 2006 16:28:36 -0800</pubDate>
	<category>data</category>
	<category>mining</category>
	<category>perl</category>
	<category>scraping</category>
	<category>web</category>
	<dc:creator>mphuie</dc:creator>
	</item>
	<item>
	<title>Image search engine scraper/downloader</title>
	<link>http://ask.metafilter.com/38328/Image%2Dsearch%2Dengine%2Dscraperdownloader</link>	
	<description>I need a web scraper script/proggie to download and thumbnailize images for a long list of search terms.&lt;br&gt;
&lt;br&gt;
Input: Long list of search terms.&lt;br&gt;
&lt;br&gt;
Output: corresponding images and thumbnails on my hard drive, arranged/accessed/organized by search term.&lt;br&gt;
&lt;br&gt;
My already feeble google-fough (&amp;lt;-- note ignorance of even the spelling) is failing me.</description>
	<guid isPermaLink="false">tag:ask.metafilter.com,2006:site.38328</guid>
	<pubDate>Tue, 16 May 2006 22:08:34 -0800</pubDate>
	<category>image</category>
	<category>robots</category>
	<category>scraping</category>
	<category>scripts</category>
	<category>searchengines</category>
	<category>thumbnails</category>
	<category>web</category>
	<dc:creator>Moistener</dc:creator>
	</item>
	<item>
	<title>Web Scraping for dummies?</title>
	<link>http://ask.metafilter.com/21574/Web%2DScraping%2Dfor%2Ddummies</link>	
	<description>A project at work has come up, and I would save a lot of time and hassle if I could somehow get my hands on a free (cheap is acceptable, as long as I can try it first), easy to use web-scraping program. The URL from which I will be scraping is static, unencrypted, and otherwise extremely vanilla. Suggestions?</description>
	<guid isPermaLink="false">tag:ask.metafilter.com,2005:site.21574</guid>
	<pubDate>Fri, 22 Jul 2005 11:37:48 -0800</pubDate>
	<category>scraper</category>
	<category>scraping</category>
	<category>web</category>
	<category>web-scraping</category>
	<dc:creator>Kwantsar</dc:creator>
	</item>
	<item>
	<title>real estate hacks: scraping together a house</title>
	<link>http://ask.metafilter.com/21147/real%2Destate%2Dhacks%2Dscraping%2Dtogether%2Da%2Dhouse</link>	
	<description>I&apos;m looking for a home closer to my work, and I&apos;m trying to be systematic about it. My plan is to scrape online real estate databases and put them into a spreadsheet, then start ranking and comparing. So, I guess I have two questions:&lt;br&gt;
&lt;br&gt;
1. What would be the best tool (text wrangler?) to extract fields from an archived html page? (I am using OS X Tiger and am loosely familiar with regex)&lt;br&gt;
&lt;br&gt;
2. Would I probably be better off just driving around key neighborhoods and jotting stuff down by hand?&lt;br&gt;
&lt;br&gt;
(Sample database: &lt;a href=&quot;http://www.utahrealestate.com&quot;&gt;Utah real estate&lt;/a&gt;)</description>
	<guid isPermaLink="false">tag:ask.metafilter.com,2005:site.21147</guid>
	<pubDate>Wed, 13 Jul 2005 08:11:38 -0800</pubDate>
	<category>archive</category>
	<category>datamunging</category>
	<category>realestate</category>
	<category>regex</category>
	<category>scraping</category>
	<dc:creator>craniac</dc:creator>
	</item>
	<item>
	<title>Mining news sites for data.</title>
	<link>http://ask.metafilter.com/4865/Mining%2Dnews%2Dsites%2Dfor%2Ddata</link>	
	<description>Is there a way, without constant human intervention, to (1) mine either Google News, Yahoo News, or the AP for new obituaries and (2) drop the name, age, blurb, and URL into a database? I&apos;ve pondered this for a while. A really crude way would be to search headlines for &quot;, [0-9][0-9], &quot; and &quot; dies at [0-9][0-9].&quot; But I&apos;m not sure this would pick up everything. For example, if I search Google News for &lt;a href=&quot;http://news.google.com/news?hl=en&amp;edition=us&amp;q=kangaroo&amp;btnG=Search+News&quot;&gt;&quot;kangaroo&quot;&lt;/a&gt; I get only two links out of about 20 that identify Bob Keeshan&apos;s name, the reason for his fame, and his age. Most say simply &quot;Captain Kangaroo Dies&quot;. And only the &lt;a href=&quot;http://www.nytimes.com/2004/01/23/obituaries/23CND-KEESHA.html?ex=1075525200&amp;en=50c2381303be3953&amp;ei=5062&amp;partner=GOOGLE&quot;&gt;NYT headline&lt;/a&gt; has all the data elements separated by commas (and is likely not consistent on that point with each obit.)&lt;br&gt;
&lt;br&gt;
Any cleaner ideas?</description>
	<guid isPermaLink="false">tag:ask.metafilter.com,2004:site.4865</guid>
	<pubDate>Fri, 23 Jan 2004 17:02:47 -0800</pubDate>
	<category>computers</category>
	<category>database</category>
	<category>internet</category>
	<category>mining</category>
	<category>obituary</category>
	<category>scraping</category>
	<dc:creator>PrinceValium</dc:creator>
	</item>
	
	</channel>
</rss>

