Textbooks on data mining techniques / statistical analysis on large data sets? <br /><br /> I come from a computer science background and basically want to run statistical analyses on very large data sets, looking for interesting trends and the like. I am looking for resources/textbooks on:<br>
-Finding said interesting trends<br>
-Computational techniques to work on said data sets efficiently<br>
-Statistical tests to help find structure in the data (for example: auto-correlation, testing whether the data does or does not come from a given statistical distribution, etc.)<br>
-Anything you think might be good to know for someone who wants to extract meaning and work with super large data sets<br>
I am fine with math and CS; I just need to up my exposure to the stats side of it (although I have taken stats in the past, I just haven't taken it with this in mind).
Empirical Methods for Artificial Intelligence by Paul Cohen. It is much more about statistics than AI, so don't let the title fool you.
I would like to learn some of this stuff myself. When I get around to it, I think I'd like to read The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Hastie, Tibshirani, & Friedman, which is available for free online. I've heard good things about it from other people, but I have not read any of it myself.
A friend suggests Information Theory, Inference, and Learning Algorithms by David MacKay, which is also free online.
Computational linguists tend to do a lot of interesting, large-scale statistical analyses. One good book in this field is Manning and Schütze's "Foundations of Statistical Natural Language Processing."
"Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations" by Witten and Frank.
I don't believe the book is open source, but the accompanying program (Weka) is, which you might appreciate.