Scraping real estate sites for private statistical use
December 17, 2014 6:06 PM

I would like to grab as many "sold" listings as possible from a site like realestate.com.au as xml or some sort of structured list that I can reformat into a csv type file for running some statistical analyses. I know javascript and some python, and once I have the data in some sort of structured form, I'm comfortable reformatting and then using R for the stats. It's just the scraping stage I need help with.

This is just for private use, and will be a one-off, so I don't think I need to worry about getting banned or anything. I would just like tips on the best way to do it.

I'm happy to use a different Australian real estate site if there's a better one for my purposes, but I don't think any of them have a public API.
posted by lollusc to Computers & Internet (16 answers total) 12 users marked this as a favorite
 
I just realised I didn't actually ask a question. In case you can't guess, my question is where do I start? What sort of script should I be writing? Are there existing scripts or programs that do this for other sites that I could adapt? Or even a link to a relevant tutorial would be really helpful.
posted by lollusc at 6:08 PM on December 17, 2014


I'm on my phone so I don't have links, but since you mentioned python, it has an amazing scraping library called BeautifulSoup. Take a look at that.
posted by cgg at 6:17 PM on December 17, 2014


I'm with cgg--I've loved working with Beautiful Soup. Here's an article on what to do if the site relies on JS for rendering.

Since you are good with JavaScript and only know some Python, you could probably even just use PhantomJS and then use JavaScript (maybe with jQuery to make it easier?) to get the data you want.
posted by foxfirefey at 6:24 PM on December 17, 2014


BeautifulSoup (bs4) is definitely worth looking at. I had to complete a one-off web crawling task earlier this month and I found it pretty easy to use after a quick read through their docs and one question on Stack Overflow for a part that stumped me.

Given that the site passes all of the property information and pagination into the URL, it should be easy enough to do with just bs4. However, if there are some problems I haven't foreseen with what you're attempting, you might also find the mechanize and cookielib modules helpful for traversing the property criteria forms or storing cookies. mechanize is a third-party module that behaves like a browser and it's great for handling forms. Again, the docs are pretty straightforward, but if you don't use Python much and you are on Windows you might find the 'pip install' stuff a major headache, like I used to...
posted by man down under at 8:02 PM on December 17, 2014
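
A rough sketch of the bs4 step described above, for anyone following along: it pulls a couple of fields out of each listing and writes them as CSV. The CSS selectors (`div.listing`, `.address`, `.price`) and the field names are hypothetical placeholders, not the site's real markup; inspect the actual page source and substitute whatever it uses.

```python
import csv
import io


def parse_listings(html):
    """Extract one dict per listing from a results page.

    The selectors below are assumptions for illustration only.
    """
    # Third-party: pip install beautifulsoup4. Imported here so the CSV
    # helper below can be used even where bs4 is not installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select("div.listing"):      # hypothetical selector
        address = card.select_one(".address")    # hypothetical selector
        price = card.select_one(".price")        # hypothetical selector
        rows.append({
            "address": address.get_text(strip=True) if address else "",
            "price": price.get_text(strip=True) if price else "",
        })
    return rows


def rows_to_csv(rows, fieldnames=("address", "price")):
    """Serialise the parsed rows as CSV text, ready to read into R."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling `rows_to_csv(parse_listings(html))` on each downloaded page and appending the output to a file should give something R can read with `read.csv`.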


You might find Allhomes easier, because they seem to have much more sales data generally available. Depending on how much data you want, you might even be able to run some fairly general searches on there and just grab that data.
posted by girlgenius at 8:36 PM on December 17, 2014


The Scrapy library is designed just for situations like this.
posted by PartOfThisCompleteBreakfast at 8:37 PM on December 17, 2014 [1 favorite]


I've used scrapy with good results on some moderately complex websites.
posted by jedicus at 8:43 PM on December 17, 2014 [1 favorite]


Yes, look at Beautiful Soup and don't be afraid to use Stack Overflow a bunch. I scrape in R using readHTMLTable from the XML package, together with the RCurl library. I find both very helpful. Good luck!
posted by z11s at 9:22 PM on December 17, 2014


Thanks for the pointers, everyone! I have now tried out BeautifulSoup a little, but it looks like it is mainly about what to do with the HTML once I get it, and I'm stuck on a stage even earlier than that. If a URL like "http://www.realestate.com.au/sold/property-house-townhouse-villa-with-2-bedrooms-in-parramatta%2c+nsw+2150%3b/list-1?maxBeds=3" is providing thousands of results, only the first 10 or so show up on the first page, and there are lots of pages to scroll through. I want to figure out how to automate capturing the HTML that results from the whole search. (I realise a quick and dirty hack would be to change the number of results shown on the page, but realestate.com.au doesn't let you do that.)

Tutorials I can find about Beautiful Soup all refer to urllib2 (urllib.request in Python 3) for this part, but I can't find a tutorial for using that package.

I'm going to look into scrapy now instead, I think.

Unfortunately AllHomes doesn't have a large proportion of the Sydney listings on it, or didn't last time I checked.

By the way, man down under, I too used to struggle with pip install until I discovered pip-Win, a package manager which has made it a breeze.
posted by lollusc at 9:40 PM on December 17, 2014


Scrapy seems like it will be perfect, and the tutorial is very clear. Thanks a lot!
posted by lollusc at 9:50 PM on December 17, 2014


Well, if it helps at all, the page number is in the url. For example, page 2 is this: http://www.realestate.com.au/sold/property-house-townhouse-villa-with-2-bedrooms-in-parramatta,+nsw+2150/list-2?maxBeds=3

Notice the "list-2?" portion. Change it to list-3, list-4, etc. for the other pages. Using Python you can build the URLs dynamically by incrementing the page number each time until you get no results in the HTML. Don't bother with urllib; take a look at the requests library instead to get the raw HTML.

Again, I'm on a phone so I can't whip it up now, but getting the HTML is pretty straightforward. If you're still stuck tomorrow evening, MeFi Mail me and I'll see if I can throw something up on Pastebin.
posted by cgg at 9:51 PM on December 17, 2014 [2 favorites]
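
A minimal sketch of the page-number approach cgg describes, assuming the URL pattern quoted above still holds. The `RESULT_MARKER` string used to detect an empty page is a hypothetical placeholder, as are the `max_pages` and `delay` defaults; this is not anyone's actual script from the thread.

```python
import time

BASE = "http://www.realestate.com.au/sold/"
# Search slug copied from the page-2 URL quoted in the thread.
SEARCH_SLUG = "property-house-townhouse-villa-with-2-bedrooms-in-parramatta,+nsw+2150"
# Hypothetical: some string that appears only when a page contains listings.
RESULT_MARKER = 'class="listing"'


def page_url(page, slug=SEARCH_SLUG, max_beds=3):
    """Build the URL for one results page by filling in the list-N part."""
    return f"{BASE}{slug}/list-{page}?maxBeds={max_beds}"


def fetch_html(url):
    """Download one page with requests (third-party: pip install requests)."""
    import requests  # imported here so the URL helper works without it

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def has_results(html):
    """Assumed emptiness check; adjust to whatever the real pages contain."""
    return RESULT_MARKER in html


def scrape_all(fetch=fetch_html, max_pages=500, delay=1.0):
    """Fetch list-1, list-2, ... until a page comes back with no results."""
    pages = []
    for page in range(1, max_pages + 1):
        html = fetch(page_url(page))
        if not has_results(html):
            break
        pages.append(html)
        time.sleep(delay)  # pause between requests to go easy on the server
    return pages
```

`scrape_all()` returns a list of raw HTML strings, one per results page, which can then be handed to Beautiful Soup. Passing a different `fetch` function makes it possible to try the loop offline against saved pages.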


I've written a few scrapers in Python. For actually getting the html, I use the Requests library.

I just looked at my github account and found a script from when I was first learning to do this, in case that's helpful at all.
posted by daisyk at 11:08 PM on December 17, 2014


You probably already know this, but the data on realestate.com.au comes from RP Data. Not sure if it's accessible enough for your needs, though.
posted by dg at 2:50 PM on December 18, 2014


Success! I ended up going back to Beautiful Soup. I don't recall why, exactly, but yes, once I understood requests, then it was fine.

My analysis, in case anyone cares, is that location (towards Sydney centre) has the biggest effect on price, followed by type (house/townhouse/apartment, etc.), followed by month in which the property sold (summer = higher, winter = lower), followed by number of bedrooms, followed by number of bathrooms, followed by various interactions of the above variables.

I would not be surprised if I did something wrong, though, as I'm not too hot on regression analysis. And it seems unlikely that summer/winter sale has more of an effect than an extra bedroom, which is what my results seem to be trying to tell me.
posted by lollusc at 8:59 PM on December 20, 2014 [1 favorite]


Oh no, it turns out my answer is right, but I'm bad at reading the interactions bits. Bedrooms and bathrooms are more important than time of year, because you have to consider the coefficient for their interactions with each other and with house type, as well as the coefficient for each as an independent variable.
posted by lollusc at 9:46 PM on December 20, 2014
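
To illustrate the point about interactions, here is a simplified sketch, assuming a linear model with only bedrooms, bathrooms and their interaction. The model actually fitted in the thread also included location, type and month, and the symbols below are illustrative, not taken from it.

```latex
\begin{aligned}
\text{price} &= \beta_0 + \beta_1\,\text{beds} + \beta_2\,\text{baths}
              + \beta_3\,(\text{beds}\times\text{baths}) + \varepsilon \\
\frac{\partial\,\text{price}}{\partial\,\text{beds}} &= \beta_1 + \beta_3\,\text{baths}
\end{aligned}
```

Under this assumption the effect of one extra bedroom is $\beta_1 + \beta_3\,\text{baths}$, not $\beta_1$ alone, so comparing only the main-effect coefficients can understate how much bedrooms matter.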


I can second scrapy. I recently used scrapy to scrape college courses from university websites. Much easier than starting from scratch.
posted by degreefox at 12:46 PM on April 6, 2015

