Python and Data Mining
September 18, 2008 9:01 AM
Is Python a logical choice to learn if I want to create data mining programs?
I have never programmed before, but I'm interested in creating long-term data management and mining programs for my research. I've heard from many people that Python is an easy enough programming language to learn, but I don't want to learn a language that is easy but useless for my purposes. I am looking to create a simple data interface for users to input data, with a strong analytical back end. Think of the computer screens that Starbucks and McDonald's use, but with a few other bells and whistles. Is Python capable of doing this? (I'm mostly concerned about the analytical part.)
Python has plenty of tools to do data mining, especially if you use NumPy. For the front end, you can choose from many fine GUI libraries, including wxWidgets (through wxPython) and Qt (through PyQt).
posted by demiurge at 9:17 AM on September 18, 2008
Python is a good language for a beginner to learn, and it's certainly powerful enough to do what you want. When I think of data mining, I start thinking about databases and SQL. I'd probably choose Python with a Postgres backend. Depending on how heavy your analytical part is, you can write stored procedures in R on the backend (through the PL/R extension).
posted by sbutler at 9:19 AM on September 18, 2008
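[Illustration, not part of sbutler's comment.] To give a feel for the "Python plus a SQL database" approach, here is a minimal sketch. It makes several assumptions: it uses the standard library's sqlite3 module with an in-memory database as a stand-in for Postgres, and the orders table, its columns, and the sample rows are all hypothetical.

```python
import sqlite3

# An in-memory SQLite database stands in for the Postgres backend
# suggested above (an assumption made so the sketch runs anywhere).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (item TEXT, size TEXT, price REAL)")

# Hypothetical rows, as a point-of-sale style front end might record them.
rows = [
    ("latte", "large", 4.50),
    ("latte", "small", 3.00),
    ("mocha", "large", 5.00),
    ("latte", "large", 4.50),
]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)


def sales_by_item(connection):
    """Return (item, order count, total revenue) per item, best seller first."""
    query = """
        SELECT item, COUNT(*), SUM(price)
        FROM orders
        GROUP BY item
        ORDER BY COUNT(*) DESC, item
    """
    return connection.execute(query).fetchall()


summary = sales_by_item(conn)
for item, count, revenue in summary:
    print(item, count, revenue)
```

The point of the sketch is the division of labor: the front end only inserts rows, and the analysis is a query that can grow more elaborate without touching the data-entry code.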
Best answer: Python is very good at string manipulation and at natural-language analysis, which (I believe) is what you're talking about: it would, for example, be very good at doing more and more complex parsing of written text for more and more complex kinds of information. If you're looking to mine data from text, then a really great way to learn Python, in fact, is through the Natural Language Toolkit (NLTK), which is both a suite of language-analysis programs and a tutorial for beginners learning Python.
If that's not what you're talking about (I don't know exactly what you mean by 'data mining' if the data is going to be entered by users) and you're talking more about mathematical data, I still think Python will be good for this purpose. Python is good with math, and has modules which handle math as well as, if not better than, almost any other language; its creator, Guido van Rossum, studied mathematics himself.
As far as use, well, Python is useful. The idea behind Python is to offer power that is intuitive so that it can be implemented and maintained quickly; it is, in a word, as easy as a strong programming language can be, and as powerful as anything out there for most purposes. You will find it useful. This is a good description of the principles:
The Zen of Python
by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
Python is also easily extensible and is the object of a lot of attention and development right now; so there are more modules being written all the time to overcome obstacles that come up.
posted by koeselitz at 9:23 AM on September 18, 2008 [4 favorites]
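[Illustration, not part of koeselitz's comment.] As a rough idea of what mining simple information from text can look like, here is a minimal sketch using only the standard library rather than NLTK. The stop-word list, the function name top_words, and the sample sentence are all made up for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real project would use a fuller one.
STOP_WORDS = {"the", "a", "is", "and", "of", "to", "was"}


def top_words(text, n=3):
    """Return the n most common non-stop-words in text as (word, count) pairs."""
    # Lowercase the text and pull out runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


sample = "The coffee was hot. The coffee was strong, and the service was fast."
result = top_words(sample, 1)
print(result)
```

Word counting is about the simplest form of text analysis there is; a toolkit like NLTK adds tokenizers, taggers, and parsers on top of the same basic idea.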
Best answer: Also, if I may disagree with shothotbot (I think): if you are set on designing a data front-end (which it sounds like you want to do) then you will have to learn to program. If you really want to learn to program (keep in mind, this is a serious undertaking, and you should read Peter Norvig's awesome Teach Yourself Programming in Ten Years to get an idea of what it means) then I believe that Python is the very best way to go. Python's a great first language - it's intuitive, it's stubbornly sensible, and it's rational. It might not be as commercially viable a first language as, say, Java (maybe), but its object-orientation is clear and pure enough to give you good programmer habits for life, much more so than Java. C++ makes so little sense in important ways (compiler compatibility problems? Seriously?) that it would be a terrible place to start, though it has its place. Ruby is making some strides lately, it seems like, and is similar to Python in its intuitive nature, but my sense (I don't know Ruby much, maybe somebody can correct me) is that it's not quite as powerful as Python. There's a whole universe of languages out there that might be good first languages, but Python is sensible enough and has enough community around it to be ideal.
But, again, you shouldn't expect to just learn a programming language over a few weekends and get it out of the way so that you can do what you really want and make this terminal you're talking about. If you're mostly just looking to get the language learned and get to it, well, you're probably better off doing exactly what shothotbot recommends and going with Excel and a sprinkling of VB.
Don't take that as discouragement, though. Just keep in mind that learning programming takes some time and some effort.
posted by koeselitz at 9:36 AM on September 18, 2008
Best answer: Python is a decent language to use for this. It's one of the easier languages to learn, and has a good community.
The strongest point in its favour for you, though, is the book Programming Collective Intelligence - which walks through examples written in Python of data mining and data analysis. It goes through various clustering algorithms, Bayesian classification, genetic programming, and all the sorts of things you'll need; and it's well written and clear. It's perfect for what you've described. Buy it, read it, and whether you use Python or not, the algorithms will help - but if you do use Python, then you can reuse the code from the book to get started.
posted by siskin at 10:05 AM on September 18, 2008 [1 favorite]
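[Illustration, not part of siskin's comment and not code from the book.] For a sense of the kind of algorithm being recommended, here is a minimal sketch of Bayesian classification: a tiny naive Bayes text classifier with add-one smoothing. The class name, the labels, and the training sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """A tiny multinomial naive Bayes text classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()                       # every word seen in training

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior for the label, plus the log likelihood of each word.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in words:
                count = self.word_counts[label][w] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data.
nb = NaiveBayes()
nb.train("cheap pills buy now", "spam")
nb.train("buy cheap watches", "spam")
nb.train("meeting notes attached", "ham")
nb.train("lunch meeting tomorrow", "ham")

print(nb.classify("buy cheap pills"))
print(nb.classify("meeting tomorrow"))
```

Working in log probabilities avoids multiplying many small numbers together, which would otherwise underflow on longer documents.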
Using NLTK and Beautiful Soup, I've managed to create an engine that reads a webpage and returns a list of the most relevant YouTube videos.
Using NumPy/MPI/LAPACK/a few other things, I've built a data miner that acts analogously to Amazon's 'other users bought this', scalable to a hundred thousand users without me tinkering, and to a million with some changes (it seems; I'm not positive on that one).
So yes. Python will work just fine for this.
posted by Lemurrhea at 10:40 AM on September 18, 2008
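[Illustration, not Lemurrhea's code.] The "other users bought this" idea can be sketched in a few lines by counting which items show up together in the same purchase. This is a deliberately naive, pure-Python version (no NumPy, MPI, or LAPACK, so it says nothing about the scale described above), and the function names and sample baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(baskets):
    """Count how often each pair of items appears in the same basket."""
    pairs = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair is stored once, in a fixed order.
        for a, b in combinations(sorted(set(basket)), 2):
            pairs[(a, b)] += 1
    return pairs


def also_bought(item, pairs, n=2):
    """Return up to n items most often bought together with item."""
    scores = Counter()
    for (a, b), count in pairs.items():
        if a == item:
            scores[b] += count
        elif b == item:
            scores[a] += count
    # Sort by count (descending), then by name, so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:n]]


# Hypothetical purchase history.
baskets = [
    ["coffee", "muffin"],
    ["coffee", "muffin", "juice"],
    ["coffee", "bagel"],
    ["tea", "bagel"],
]
pairs = build_cooccurrence(baskets)
print(also_bought("coffee", pairs))
```

A production system would typically store these counts in a matrix and normalize for item popularity, which is where numerical libraries start to matter.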
Python is both a decent language for this, and a decent choice for a first language. If I were doing it myself, I'd probably reach for Ruby and scrubyt.
posted by Zed_Lopez at 11:19 AM on September 18, 2008
I think Python is a good first language too. My issue is that figuring out 1) how to program, 2) how SQL works, 3) how a statistical package works, and 4) how to write a robust GUI for other users is a tall order. Enjoyable and rewarding, of course, but a tall order. This is part of a research project, not an end in itself. If what you really want is to analyze a big pile of data, my advice would be: don't write the whole thing yourself.
posted by shothotbot at 1:15 PM on September 18, 2008
Seconding the Programming Collective Intelligence book. On the other hand, I find it more convenient to use Perl and R for my data mining stuff.
posted by singingfish at 2:15 PM on September 18, 2008
koeselitz's comment is good. I can see almost no reason to ever choose Matlab over Python. Matlab is an old scripting engine where additions like object orientation really look and feel like additions. As a maths machine, Matlab is fine, but it's not a good programming language.
Let me recommend the Enthought Python Distribution. It's tailored for scientific programming and comes with all the math and science packages you can think of, as well as some useful tools for experimenting with and visualizing data.
posted by springload at 2:17 PM on September 18, 2008
Best answer: After you've finished reading Programming Collective Intelligence, you might want to read Text Processing in Python.
Python is ideal for what you want, as it: a) is easy to learn; b) has a ton of libraries for pretty much anything you want to do (like parsing HTML, accessing the Flickr API, etc.); and c) is multi-paradigm, so you can start with purely procedural programming and, as you learn, progress to an object-oriented or functional paradigm.
posted by signal at 4:59 PM on September 18, 2008
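[Illustration, not part of signal's comment.] To make the multi-paradigm point concrete, here is a minimal sketch of the same small task (totalling some prices) written three ways. The function names, the Order class, and the prices are hypothetical.

```python
from functools import reduce

prices = [4.50, 3.00, 5.00]  # hypothetical order prices


# Procedural: an explicit loop that updates a running total.
def total_procedural(values):
    total = 0.0
    for v in values:
        total += v
    return total


# Functional: fold the list into one value, with no mutable state in sight.
def total_functional(values):
    return reduce(lambda acc, v: acc + v, values, 0.0)


# Object-oriented: the data and the operations on it live together.
class Order:
    def __init__(self):
        self.items = []

    def add(self, price):
        self.items.append(price)

    def total(self):
        return sum(self.items)


order = Order()
for p in prices:
    order.add(p)

print(total_procedural(prices), total_functional(prices), order.total())
```

A beginner can stay with the first style for a long time and pick up the other two as programs grow.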
It's hard to give good advice without more information, but generally, if you will be dealing with thousands of records, use Excel and Visual Basic. If you will be dealing with tens of thousands of records and know matrix algebra, use Matlab.
posted by shothotbot at 9:14 AM on September 18, 2008