Scraping real estate sites for private statistical use
December 17, 2014 6:06 PM
I would like to grab as many "sold" listings as possible from a site like realestate.com.au as XML or some other structured format that I can reformat into a CSV-type file for running some statistical analyses. I know JavaScript and some Python, and once I have the data in some sort of structured form, I'm comfortable reformatting it and then using R for the stats. It's just the scraping stage I need help with.
This is just for private use, and will be a one-off, so I don't think I need to worry about getting banned or anything. I would just like tips on the best way to do it.
I'm happy to use a different Australian real estate site if there's a better one for my purposes, but I don't think any of them have a public API.
posted by lollusc at 6:08 PM on December 17, 2014
Best answer: I'm on my phone so I don't have links, but since you mentioned Python, it has an amazing scraping library called BeautifulSoup. Take a look at that.
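For instance, a minimal sketch of fetching one results page and pulling out prices might look like this (Python 2, to match the urllib2 era of this thread; the tag and class names are placeholders, so inspect the real page source for the actual markup):

```python
# Minimal BeautifulSoup sketch. "article.property" and "span.price"
# are placeholder selectors -- the real page's markup will differ.
import urllib2
from bs4 import BeautifulSoup

url = ("http://www.realestate.com.au/sold/property-house-townhouse-villa-"
       "with-2-bedrooms-in-parramatta%2c+nsw+2150%3b/list-1?maxBeds=3")
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

for listing in soup.find_all("article", class_="property"):
    price = listing.find("span", class_="price")
    if price is not None:
        print(price.get_text(strip=True))
```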
posted by cgg at 6:17 PM on December 17, 2014
I'm with cgg--I've loved working with Beautiful Soup. Here's an article on what to do if the site relies on JS for rendering.
Since you're good with JavaScript and only know some Python, you could probably even just use PhantomJS and then write the extraction in JavaScript (maybe with jQuery to make it easier) to get the data you want.
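Alternatively, if you'd rather stay in Python than script PhantomJS directly in JavaScript, selenium could drive PhantomJS for you. A sketch, assuming the phantomjs binary is installed and on your PATH (selenium shipped a PhantomJS driver at the time):

```python
# Sketch: render a JavaScript-heavy page with PhantomJS via selenium,
# then hand the resulting HTML to BeautifulSoup for parsing.
from selenium import webdriver
from bs4 import BeautifulSoup

url = ("http://www.realestate.com.au/sold/property-house-townhouse-villa-"
       "with-2-bedrooms-in-parramatta%2c+nsw+2150%3b/list-1?maxBeds=3")

driver = webdriver.PhantomJS()  # requires the phantomjs binary
driver.get(url)
html = driver.page_source       # the DOM after scripts have run
driver.quit()

soup = BeautifulSoup(html, "html.parser")
```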
posted by foxfirefey at 6:24 PM on December 17, 2014
BeautifulSoup (bs4) is definitely worth looking at. I had to complete a one-off web-crawling task earlier this month, and I found it pretty easy to use after a quick read through its docs and one question on Stack Overflow for a part that stumped me.
Given that the site passes all of the property information and pagination in the URL, it should be easy enough to do with just bs4. However, if there are problems I haven't foreseen with what you're attempting, you might also find the mechanize and cookielib modules helpful for traversing the property criteria forms or storing cookies. Mechanize is a third-party module that behaves like a browser, and it's great for handling forms. Again, the docs are pretty straightforward, but if you don't use Python much and you're on Windows, you might find the 'pip install' stuff a major headache like I used to...
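A sketch of the form-handling side with mechanize (Python 2; the form index and field name here are hypothetical, so list the page's forms to find the real ones):

```python
# Sketch: filling in a search form with mechanize. The form index and
# field name are hypothetical -- print list(br.forms()) to see the
# real ones on the page.
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)           # mechanize obeys robots.txt by default
br.open("http://www.realestate.com.au/")
br.select_form(nr=0)                  # hypothetical: first form on the page
br["where"] = "Parramatta, NSW 2150"  # hypothetical field name
response = br.submit()
html = response.read()
```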
posted by man down under at 8:02 PM on December 17, 2014
You might find Allhomes easier, because they seem to have much more sales data generally available. Depending on how much data you want, you might even be able to run some fairly general searches on there and just grab that data.
posted by girlgenius at 8:36 PM on December 17, 2014
Best answer: The Scrapy library is designed just for situations like this.
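A minimal spider is just a class with a name, some start URLs, and a parse method. A sketch (the CSS selectors are placeholders for whatever the real markup uses):

```python
# Minimal Scrapy spider sketch. Selectors are placeholders -- inspect
# the real page to fill them in.
import scrapy

class SoldSpider(scrapy.Spider):
    name = "sold"
    start_urls = [
        "http://www.realestate.com.au/sold/property-house-townhouse-villa-"
        "with-2-bedrooms-in-parramatta%2c+nsw+2150%3b/list-1?maxBeds=3",
    ]

    def parse(self, response):
        for listing in response.css("article.property"):  # placeholder
            yield {
                "price": listing.css("span.price::text").extract_first(),
                "address": listing.css("a.address::text").extract_first(),
            }
```

Running it with `scrapy runspider sold_spider.py -o sold.csv` writes the yielded items straight to CSV, which is the format you want to end up with anyway.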
posted by PartOfThisCompleteBreakfast at 8:37 PM on December 17, 2014 [1 favorite]
I've used scrapy with good results on some moderately complex websites.
posted by jedicus at 8:43 PM on December 17, 2014 [1 favorite]
Yes, look at Beautiful Soup and don't be afraid to use Stack Overflow a bunch. In R, I scrape using getURL from the RCurl package and readHTMLTable from the XML package, and I find them very helpful. Good luck!
posted by z11s at 9:22 PM on December 17, 2014
Response by poster: Thanks for the pointers, everyone! I have now tried out BeautifulSoup a little, but it looks like it's mainly about what to do with the HTML once I have it, and I'm stuck on a stage even earlier than that. If a URL like "http://www.realestate.com.au/sold/property-house-townhouse-villa-with-2-bedrooms-in-parramatta%2c+nsw+2150%3b/list-1?maxBeds=3" returns thousands of results, only the first 10 or so show up on the first page, and there are lots of pages to scroll through. I want to figure out how to automate capturing the HTML that results from the whole search. (I realise a quick and dirty hack would be to change the number of results shown per page, but realestate.com.au doesn't let you do that.)
Tutorials I can find about Beautiful Soup all refer to urllib2 (now 3) for this part, but I can't find a tutorial for using that package.
I'm going to look into scrapy now instead, I think.
Unfortunately AllHomes doesn't have a large proportion of the Sydney listings on it, or didn't last time I checked.
By the way, man down under, I too used to struggle with pip install until I discovered pip-Win, a package manager which has made it a breeze.
posted by lollusc at 9:40 PM on December 17, 2014
Response by poster: Scrapy seems like it will be perfect, and the tutorial is very clear. Thanks a lot!
posted by lollusc at 9:50 PM on December 17, 2014
Best answer: Well, if it helps at all, the page number is in the URL. For example, page 2 is this: http://www.realestate.com.au/sold/property-house-townhouse-villa-with-2-bedrooms-in-parramatta,+nsw+2150/list-2?maxBeds=3
Notice the "list-2?" portion. Change it to list-3, list-4, etc. for the other pages. Using Python you can dynamically build the URLs, incrementing the page number each time until you get no results in the HTML. Don't bother with urllib -- take a look at the requests library instead to get the raw HTML.
Again, I'm on a phone so I can't whip it up now, but getting the HTML is pretty straightforward. If you're still stuck tomorrow evening, MeFi mail me and I'll see if I can throw something up on pastebin.
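A rough sketch of the loop cgg describes (the empty-page test is a placeholder; request a page past the end once by hand to see what it really returns):

```python
# Sketch: fetch list-1, list-2, ... until a page stops returning
# listings. The "no results" check is a placeholder condition.
import requests

base = ("http://www.realestate.com.au/sold/property-house-townhouse-villa-"
        "with-2-bedrooms-in-parramatta,+nsw+2150/list-{0}?maxBeds=3")

pages = []
page = 1
while True:
    response = requests.get(base.format(page))
    if response.status_code != 200 or "no results" in response.text.lower():
        break  # placeholder check for running past the last page
    pages.append(response.text)
    page += 1
```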
posted by cgg at 9:51 PM on December 17, 2014 [2 favorites]
Best answer: I've written a few scrapers in Python. For actually getting the HTML, I use the Requests library.
I just looked at my github account and found a script from when I was first learning to do this, in case that's helpful at all.
posted by daisyk at 11:08 PM on December 17, 2014
You probably already know this, but the data on realestate.com.au comes from RP Data. Not sure if it's accessible enough for your needs, though.
posted by dg at 2:50 PM on December 18, 2014
Response by poster: Success! I ended up going back to Beautiful Soup. I don't recall exactly why, but once I understood requests, it was fine.
My analysis, in case anyone cares, is that location (towards Sydney centre) has the biggest effect on price, followed by type (house/townhouse/apartment, etc), followed by month in which the property sold (summer = higher/ winter = lower), followed by number of bedrooms, followed by number of bathrooms, followed by various interactions of the above variables.
I would not be surprised if I did something wrong, though, as I'm not too hot on regression analysis. And it seems unlikely that summer/winter sale has more of an effect than an extra bedroom, which is what my results seem to be trying to tell me.
posted by lollusc at 8:59 PM on December 20, 2014 [1 favorite]
Response by poster: Oh no, it turns out my analysis was right, but I was bad at reading the interaction terms. Bedrooms and bathrooms are more important than time of year, because you have to consider the coefficients for their interactions with each other and with house type, as well as the coefficient for each as an independent variable.
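In symbols: with interaction terms in the model, the marginal effect of an extra bedroom is the bedrooms coefficient plus each interaction coefficient scaled by the variable it interacts with, along these lines (illustrative notation, not the fitted model itself):

```latex
\frac{\partial\,\mathrm{price}}{\partial\,\mathrm{beds}}
  = \beta_{\mathrm{beds}}
  + \beta_{\mathrm{beds \times baths}} \cdot \mathrm{baths}
  + \beta_{\mathrm{beds \times type}} \cdot \mathrm{type}
```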
posted by lollusc at 9:46 PM on December 20, 2014
I can second scrapy. I recently used scrapy to scrape college courses from university websites. Much easier than starting from scratch.
posted by degreefox at 12:46 PM on April 6, 2015
This thread is closed to new comments.