Statistics: Regression
April 1, 2005 7:51 AM

Statisticsfilter: I need help figuring out how to pound a regression model into submission.

I am up for a job teaching sociology at a university. Part of the process for proving my worthiness is to come up with something interesting using a data set on the local real estate market.

My idea was to build a regression model that you could use to predict the price of a house: a number of features (age of the house, size of the house, etc., my independent variables) would work together to predict the price (the dependent variable).

The data set needed some "smoothing" (i.e. I dropped a number of the outliers where it was appropriate), but the model is still really volatile and inaccurate (i.e. a house that is worth around 100 000 is predicted to be around 150 000). It seems to me that I have tried everything, but it just won't work!


Do you have any tricks that work for your regression models?

Is there something else I could do with a bunch of real estate numbers (almost all of them interval) using SPSS that would blow minds?

Any help is appreciated!
posted by Quartermass to Grab Bag (9 answers total)
 
Are you limited in the kind of regression you can use? If I were you I would do a random effects model taking location into account.

What software do you use?

Also, can you tell us the full list of available variables?

Geez, I've never heard of anyone having to do something like that.
posted by duck at 7:58 AM on April 1, 2005


oh...I see it says SPSS...also, I was going to say email me if you like for additional random-effects info, if the idea sounds workable to you.
posted by duck at 7:59 AM on April 1, 2005


Typically these kinds of cross-sectional housing models are called 'hedonic' property value models. Starting in the 1970s or so, this method has probably been used most often to arrive at the price people are willing to pay for environmental goods like clean water, clean air, and freedom from airport noise. Doing this kind of stuff usually requires cross-city data, since that's the only way to get variation in these environmental variables.

If you can find some interesting characteristic that has a lot of variation in the data - like if the areas are funneled into different schools that have vastly different test scores or something - that's what I'd build the paper around (as you could probably guess, it's been done, but it's always interesting to see it done at your local level - how much more will people pay to go to Public School X than Public School Y?).

Having grown up in a neighborhood with a slightly draconian homeowners association, one of the more interesting papers I've read used this method to estimate the property value effects of a homeowners association.

Beyond the features of a house, a hedonic model also usually includes a lot of the neighborhood and location characteristics, like school quality, crime rates, property taxes, and proximity to services like shopping and employment centers. Obviously, the more data you have about the area(s) you are studying, the better (notice how many variables the HOA paper above used: don't be afraid of using what seems like an excessive number of explanatory variables if you have mountains of data). If you have a contact in the geography department, see if you can use some of their GIS expertise to draw out this kind of location data.

Also, the functional form of the model in the homeowners association paper is log-linear, which is a trick I've used sometimes.
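For anyone who hasn't seen the log-linear trick, here is a minimal sketch in Python rather than SPSS. It assumes "log-linear" means regressing the log of price on the untransformed house characteristics, and it uses a single predictor and made-up numbers purely for illustration. The data, `fit_line`, and `predict_price` are all hypothetical names and values, not anything from the actual data set or the HOA paper.

```python
import math

# Made-up illustrative data: (square feet, sale price). Not real sales.
houses = [(900, 95_000), (1200, 118_000), (1500, 150_000),
          (1800, 185_000), (2400, 270_000), (3000, 390_000)]

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

sqft = [s for s, _ in houses]
log_price = [math.log(p) for _, p in houses]

# Log-linear form: log(price) = a + b * sqft
a, b = fit_line(sqft, log_price)

def predict_price(square_feet):
    """Back-transform the fitted log price to dollars."""
    return math.exp(a + b * square_feet)

# In this form the slope describes a proportional change, not a dollar change.
pct_per_100_sqft = (math.exp(100 * b) - 1) * 100
print(f"about {pct_per_100_sqft:.1f}% more per extra 100 sq ft")
print(f"predicted price at 2000 sq ft: {predict_price(2000):,.0f}")
```

The appeal is that errors become proportional, so a big miss on an expensive house doesn't dominate the fit. One caveat: exponentiating the fitted log price gives something closer to a median than a mean prediction, so treat the back-transformed dollar figure as rough.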

I haven't done this kind of econometrics in a while, but if you think I could help, go ahead and shoot me an email.
posted by milkrate at 9:07 AM on April 1, 2005


Well, I'll dissent a bit...

i.e. a house that is worth around 100 000 is predicted to be around 150 000

So? I fail to see why this is a problem, unless there's something systematic in your residuals. You expect error, and some large errors.

The object is not to fit the data. The object is to test your theory, to confront the observable implications of your theory with the real world.

Use the functional form and variables that are consistent with your theory, whatever it is. Don't use other ones.

Unfortunately, it looks like they're giving you data you don't care about and telling you to do something whizbang with them. A random-effects or multilevel model might work; then you'd have characteristics of individual houses that partially determine price and characteristics of neighborhoods that houses are nested within that partially determine price, and you could introduce cross-level interactions, and so on. I don't know how you'd do that with SPSS but if you're willing to use R you can do it with the nlme package, or in Stata with the gllamm package (this is sssllllooooowwww).

If it's the sort of place where you don't need to do something terribly whizbang to seem whizbang, you might look into an event history model to predict length of time till sold, or you could look into event count models if you have data on how many houses each realtor sold to combine with realtor and house characteristics. Or, if you have good spatial data, do something with spatial autocorrelation though I know beans about that.

But, if you can, I'd think hard about coming up with a good theory of something in your dataset, with appropriate genuflecting to the relevant literature, and then testing the theory with whatever methods are most appropriate, concentrating really heavily on the inferences you're making, why you're making them, how they link to your theory, and so on. Show that you understand how you're regressing something, and why you're running that kind of regression, and what that kind of regression is telling you, and how you've dealt with problems that are common with that kind of regression. If that makes any sense at all.
posted by ROU_Xenophobe at 9:38 AM on April 1, 2005


Sounds like you have data that is clustered by neighborhood -- that is, your predictors might be working differently in different neighborhoods. For example, an old house in an expensive neighborhood might cost more than a new construction, whereas newer houses might be relatively more valuable in less expensive neighborhoods (that hypothesis might be wrong, but you get the idea). I think this is similar to what milkrate said.

You could try a multilevel analysis (hierarchical linear modeling), looking at houses nested within neighborhoods (maybe using census tract as the definition of neighborhood). I think this can be done in newer versions of SPSS, but I'm generally wary of using SPSS for anything beyond very basic analyses. Can you do the analyses in SAS or Stata?

Alternately, you might run separate models for different types of neighborhoods (maybe defined by the relative SES of people who live in the neighborhoods). Or at least include some measure of type of neighborhood as a predictor (as milkrate suggested).
posted by nixxon at 9:39 AM on April 1, 2005


I think if you were writing a paper you would want to pull in neighbourhood characteristics and do a real multi-level analysis allowing the independent variables to have different effects in different neighbourhoods.

But it seems like that's a bit much for your purposes here. If you use the random-effects model you get a separate constant for each neighbourhood (though it would only report the mean constant).

This would effectively control for neighbourhood characteristics (school quality, public transportation access, etc. etc.) though it wouldn't tell you what the effects of those things are -- you would just know what the effects of different house properties are, controlling for neighbourhood, essentially.

I think this is probably more accurate anyway: X dollars for living in this neighbourhood, + $50/sq foot (or whatever) sounds more accurate to me than X dollars for living in this neighbourhood + $50/sq foot in neighbourhood A or $60/sq foot in neighbourhood B, or whatever.
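To make the "X dollars for the neighbourhood + $50/sq foot" idea concrete, here is a rough sketch in Python with invented numbers. Note that it swaps in the simpler fixed-effects ("within") estimator: it demeans size and price inside each neighbourhood, fits one shared slope, and then backs out a constant per neighbourhood. A true random-effects model would instead shrink those constants toward the overall mean, so take this as an illustration of the idea and not the model itself. The rows and all names are hypothetical.

```python
from collections import defaultdict

# Made-up illustrative rows: (neighbourhood, square feet, price).
rows = [
    ("A", 1000, 150_000), ("A", 1500, 175_000), ("A", 2000, 200_000),
    ("B", 1000, 250_000), ("B", 1500, 275_000), ("B", 2000, 300_000),
]

def group_means(rows):
    """Mean size and mean price for each neighbourhood."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for hood, sqft, price in rows:
        acc = sums[hood]
        acc[0] += sqft
        acc[1] += price
        acc[2] += 1
    return {h: (sx / n, sy / n) for h, (sx, sy, n) in sums.items()}

means = group_means(rows)

# Demean within each neighbourhood, then fit a single pooled slope.
sxx = 0.0
sxy = 0.0
for hood, sqft, price in rows:
    mean_sqft, mean_price = means[hood]
    sxx += (sqft - mean_sqft) ** 2
    sxy += (sqft - mean_sqft) * (price - mean_price)
slope = sxy / sxx

# One constant per neighbourhood, given the shared slope.
intercepts = {h: my - slope * mx for h, (mx, my) in means.items()}

print(f"shared slope: {slope:.2f} dollars per sq ft")
for hood, const in sorted(intercepts.items()):
    print(f"neighbourhood {hood}: constant {const:,.0f}")
```

With these toy rows the two neighbourhoods share the same price per square foot and differ only in their constant, which is exactly the structure this kind of model assumes.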
posted by duck at 9:56 AM on April 1, 2005


I think this can be done in newer versions of SPSS, but I'm generally wary of using SPSS for anything beyond very basic analyses.

Slightly off topic, but important if you are doing a multilevel model in SPSS: SPSS has recently implemented multilevel models in the package. I am currently equating the results I obtain from SPSS with SAS, and have found one huge quirk. When specifying a model with categorical predictors in SPSS, it is better to dummy code the categorical variable and treat the dummy variables as covariates. I have found that not treating them as covariates results in estimates inconsistent with other programs.
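In case "dummy code" is unfamiliar, here is a small language-neutral sketch, written in Python and not in SPSS syntax. It turns one categorical variable into 0/1 columns, leaving one level out as the reference category so the columns aren't redundant with the constant. The function name `dummy_code`, the `is_` column prefix, and the example values are all made up for illustration.

```python
def dummy_code(values, reference=None):
    """Return 0/1 indicator columns for every level except the reference.

    If no reference level is given, the alphabetically first level is used.
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]
    if reference not in levels:
        raise ValueError(f"reference level {reference!r} not found in data")
    kept = [level for level in levels if level != reference]
    return {
        f"is_{level}": [1 if v == level else 0 for v in values]
        for level in kept
    }

# Hypothetical categorical predictor: exterior material of each house.
exterior = ["brick", "wood", "brick", "stone"]
columns = dummy_code(exterior)  # "brick" becomes the reference category

for name, column in columns.items():
    print(name, column)
```

Each resulting column would then go into the model as an ordinary numeric covariate, and its coefficient reads as the difference from the reference level.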

Also, good luck, Quartermass!
posted by naturesgreatestmiracle at 12:22 PM on April 1, 2005


(i.e. a house that is worth around 100 000 is predicted to be around 150 000)

Do you have a confidence interval on this estimate?
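To show what I mean, here is a rough Python sketch of an interval around a single predicted price, using simulated data and a one-predictor model. Every number is invented, and it uses a normal approximation in place of the t quantile, which is only reasonable with a fairly large sample. The names `prediction_interval`, `sqft`, and `price` are hypothetical.

```python
import math
import random
from statistics import NormalDist

random.seed(0)  # simulated data only, seeded so the sketch is repeatable

# Simulated houses: price = 20,000 + 80 * sqft + noise (all numbers invented).
sqft = [random.uniform(800, 3000) for _ in range(200)]
price = [20_000 + 80 * x + random.gauss(0, 25_000) for x in sqft]

n = len(sqft)
mean_x = sum(sqft) / n
mean_y = sum(price) / n
sxx = sum((x - mean_x) ** 2 for x in sqft)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(sqft, price)) / sxx
intercept = mean_y - slope * mean_x

# Residual standard error with n - 2 degrees of freedom.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(sqft, price))
s = math.sqrt(sse / (n - 2))

def prediction_interval(x0, level=0.95):
    """Approximate interval for the price of one new house of size x0."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # normal approximation to the t quantile
    se = s * math.sqrt(1 + 1 / n + (x0 - mean_x) ** 2 / sxx)
    centre = intercept + slope * x0
    return centre - z * se, centre, centre + z * se

low, mid, high = prediction_interval(1500)
print(f"predicted {mid:,.0f}, 95% interval {low:,.0f} to {high:,.0f}")
```

If the interval for an individual house turns out to be tens of thousands wide, a 100 000 house predicted at 150 000 may be well within what the model says it can't distinguish.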
posted by tss at 9:55 PM on April 1, 2005


Brokers know that only three things matter with respect to real estate values: location, location, location. Age and size of the house are nearly irrelevant.

I know a lot about regression models. One of the first things I learned about them is that they're no help in extracting relations that aren't there to begin with.
posted by ikkyu2 at 11:08 AM on April 2, 2005

