What are some things I can do with a metric ton of housing data?
December 11, 2010 4:20 PM   Subscribe

I'm trying to figure out interesting ways to slice/combine/aggregate a ton of data into useful or interesting statistics. Where do I begin?

I know this is a pretty general question, but I'm looking for resources (books, papers, websites) on what people do with large amounts of data, in any field.

I want to go through a large, diverse amount of housing price data from the last few years and come up with interesting conclusions based on trends, fluctuations, and inconsistencies. But I don't even know what's possible, so I figured examining data analysis literature would help me brainstorm.
posted by zain to Technology (10 answers total) 12 users marked this as a favorite
You should look at work by Edward Tufte.
posted by knile at 4:29 PM on December 11, 2010 [1 favorite]

Sorry, that might not be AS helpful as you intended. What I know of his work is about how to SHOW the data, not necessarily analyze/describe it.
posted by knile at 4:30 PM on December 11, 2010

Response by poster: Thanks! The Visual Display of Quantitative Information is one of my bibles, but I am indeed more interested in what data I could show than how to show it.
posted by zain at 4:43 PM on December 11, 2010

How much exposure do you have to statistics (the discipline, not the generic term for "lots of data")? It sounds like you'd benefit from a lot of the statistical tools on offer in stats, finance, and experimental science courses: correlation, linear regression, moments, variance and covariance, principal component analysis, autocorrelation, etc., etc., etc. If you've already had the requisite background (calculus through Green's theorem or so, and at least a semester of linear algebra), head for OpenCourseWare and dig in! If not, I would definitely recommend starting with calc and linear.
posted by AkzidenzGrotesk at 5:15 PM on December 11, 2010 [1 favorite]

Best answer: How's your statistics background? I'm not sure that the housing market qualifies any longer as "tons" of data; large data these days is typically biological data, physical simulation data, academic and industrial stuff like that. What you can easily find publicly available on housing (such as from the government) is definitely not "tons". The kinds of analysis you apply to biological data (I can't speak to the other topics) don't have a lot to do with how you'd analyse the housing market, so the books/websites/stuff to recommend to you would vary a lot depending on what kind of data, and how strong your statistics background is already. If what I have to guess at recommending is below your level, maybe someone else will like it.

You can get a cheap Dover edition of The Statistical Analysis of Experimental Data. People love to hate on these, and it's not as if this is the best book, but there are a lot of basic issues having to do with "What is data?" that are covered here. This is the kind of approach I'd take to housing (not knowing anything about economics).

Read Information Theory, Inference, and Learning Algorithms for free. This book is fantastic - unfortunately it only vaguely relates to what you're asking. The topics around "inference" are great for what you want, though. Also, arguably, the topics around encoding/decoding/error correction. Housing market data is 'noisy' - affected by a lot of different confounding factors. How do you isolate them? Is there an underlying signal trying to come through? This is the book for that.
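As a toy illustration of pulling a signal out of noise: the sketch below (plain Python, all numbers hypothetical) buries a steady price trend under random noise, then recovers a rough version of it with a moving average. Real denoising gets much fancier, but this is the basic idea:

```python
import random
import statistics

random.seed(0)

# Hypothetical monthly price index: a steady upward trend plus noise.
trend = [200_000 + 1_500 * month for month in range(36)]
observed = [t + random.gauss(0, 10_000) for t in trend]

def moving_average(xs, window):
    """Average each run of `window` consecutive values."""
    return [statistics.fmean(xs[i:i + window])
            for i in range(len(xs) - window + 1)]

# Noise in each smoothed point is damped by roughly sqrt(window).
smoothed = moving_average(observed, 6)
```

Averaging over 6 months cuts the noise by about a factor of 2.4 while mostly preserving the slow trend; the trade-off is that you lose sharp short-term features.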

So, for biological data (and a lot of really "large" data), people use a variety of algorithms for analysis, but you can't go wrong starting with HMMs. This is for identifying recurring sequences of data and variations on them, which people say you can apply to "time series data" which is what the housing market is, but really I think they mean by that "linear data where the next value has some direct relationship to the previous value or at least the context". I'm not sure anything economic qualifies -- heh, actually, googling around to find stuff for you, I found this.
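In case HMMs sound more exotic than they are, here's a toy decode in pure Python: two invented hidden market "regimes" (hot/cold) emitting monthly up/down price moves. Every probability here is made up for illustration - this is just the Viterbi algorithm, which recovers the most likely hidden state sequence behind a series of observations:

```python
import math

states = ("hot", "cold")
start = {"hot": 0.5, "cold": 0.5}
trans = {"hot": {"hot": 0.8, "cold": 0.2},    # regimes tend to persist
         "cold": {"hot": 0.3, "cold": 0.7}}
emit = {"hot": {"up": 0.7, "down": 0.3},      # hot markets mostly go up
        "cold": {"up": 0.2, "down": 0.8}}

def viterbi(obs):
    # v[s]: log-probability of the best state path ending in state s
    v = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    backpointers = []
    for o in obs[1:]:
        bp, nv = {}, {}
        for s in states:
            prev = max(states, key=lambda p: v[p] + math.log(trans[p][s]))
            bp[s] = prev
            nv[s] = v[prev] + math.log(trans[prev][s]) + math.log(emit[s][o])
        backpointers.append(bp)
        v = nv
    # Trace the best path backwards from the best final state.
    path = [max(states, key=v.get)]
    for bp in reversed(backpointers):
        path.append(bp[path[-1]])
    return path[::-1]
```

Feeding it `["up", "up", "down", "down", "down"]` decodes to two "hot" months followed by three "cold" ones, which is the kind of regime-switching story people try to tell about markets.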

HMMs fall into a larger bucket of "combinatorial algorithms" which are a big topic; the two major categories in there for me are Linear Programming and graph algorithms. I can't remember what I was reading when I had my major LP epiphany or I'd love to recommend it. This is not something I'd recommend trying to learn from wikipedia, but just to show that there's relevance to the question, there's LP inside techniques such as Bayesian analysis.

Graph algorithms are comparatively more widely known, better explained and more interesting. A fun book that goes into graph algorithms a bit, but is a good book all around, is The Algorithm Design Manual. It has a great empirical attitude towards problem solving that almost anything else on these topics totally misses. I may be departing somewhat from your question, but hey, I'm not sure what you want. If you wanted to tackle the Netflix Prize or something, this would be where to go.

Using these algorithms, btw, absolutely does not have to entail writing complex code; plenty of smart people have done that already. You just have to understand what they're doing well enough to map your problem into it, feed it to a pre-existing tool, and interpret the output.
posted by doteatop at 5:20 PM on December 11, 2010 [4 favorites]

Apologies for the ramble, btw, I'm totally jetlagged and waiting for a flight.
posted by doteatop at 5:27 PM on December 11, 2010

Use the term "exploratory data analysis" to refine your searches, and look around to see what others have done with similar data. Together with some subject-specific keywords, that should help.
posted by stonepharisee at 11:43 PM on December 11, 2010

Unless it truly is a ton of data, I'd throw it into R and start doing regressions on it. Honestly the kind of thing you're looking for doesn't seem that complex for this domain. R has more than enough power to get you started. The book Regression Analysis By Example is probably a bit too math-oriented, but it will give you the terms you need to find the right things to do in R.

Maybe you could quantify 'metric ton'? Under 100 MB? Under 1 GB? Under 4 GB (what you could comfortably work with in RAM)? Under 100 GB (what you could comfortably work with on one computer)?

As you start to go over 100 GB, the rules change, but very few people are there.
posted by devilsbrigade at 11:56 PM on December 11, 2010

I want to go through a large, diverse amount of housing price data from the last few years and come up with interesting conclusions based on trends, fluctuations, and inconsistencies.

Statistics don't work that well at that level. Unless you're planning to do some really hard thinking about the first-stage results you get from just randomly identifying regularities, and doing a bunch of secondary testing on the basis of that thinking, you're likely to find the housing market equivalent of presidential elections depending on the World Series or on the results of a specific Redskins game. They really do; there really are pretty firm statistical associations between elections and sporting events; it's just that they're utterly meaningless coincidences.
posted by ROU_Xenophobe at 8:11 AM on December 12, 2010

Experiment with PivotTables in Excel. It's very easy to get started and you can do some pretty amazing stuff without needing much technical expertise.
posted by lunchbox at 9:00 AM on December 12, 2010

This thread is closed to new comments.