Learning as much as possible about data analytics in a month?
December 18, 2015 10:02 PM
I have an opportunity to assist on a research project, but my background isn't very data-y.
I got accepted for a computer science internship that involves assisting on a project to analyze texts and data computationally. When I was hired, they basically said "we'll give you whatever you can handle." Right now I know... Calculus. That's about as far as my "analysis" background goes.
I've never taken a statistics class. I was a high school mathlete and almost got a math minor in college, so I have some independent learning experience and solid math skills. (I focused mostly on number theory and abstract algebra.)
Let's say I wanted to cram as much useful data analytics and research skills as possible this winter break, not to ~impress~ anyone but so I can get as much out of this internship as possible. Where should I start? What is realistic to try to understand in about a month?
They know my background isn't in data, so this is just for me. I'm curious about this stuff anyway, so it will be a good opportunity to see where it goes.
Also, when I was noodling around in that space (text analysis) for fun while learning Python, the Python Natural Language Toolkit (NLTK) was one of the standard tools. It looks like they have a nice book that starts with an intro to Python and works through the basics of textual analysis. Seeing how far you can get through that would probably get you a long way.
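As a rough sketch of the kind of first exercise a text-analysis introduction tends to walk through, the snippet below counts word frequencies in a short string. It deliberately uses only the Python standard library rather than NLTK itself, so it runs without installing anything; the sample sentence and the function name `word_frequencies` are made up for illustration.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Lowercase the text, split it into word tokens, and count them."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Hypothetical sample text, not taken from any real corpus.
sample = "The cat sat on the mat. The mat was flat."
freqs = word_frequencies(sample)

# The three most common tokens with their counts.
print(freqs.most_common(3))
```

A dedicated toolkit adds better tokenization, stemming, and tagging on top of this, but the basic loop of "tokenize, count, look at the top of the list" stays the same.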
posted by rockindata at 11:13 PM on December 18, 2015 [1 favorite]
"iTunes U" will have entire semesters worth of statistics classes where you can just watch the lectures, if you want to try that. I guess Statistics 101 or Intro to Stats seems like a good start, but I know nothing about stats.
posted by AppleTurnover at 11:29 PM on December 18, 2015
Learn some basics about Python scripting and the third-party pandas library that greatly extends Python's ability to handle data.
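To give a feel for what that looks like in practice, here is a minimal sketch, assuming pandas is installed. The column names and numbers are invented toy data, not anything from a real project.

```python
import pandas as pd

# Hypothetical toy data: word counts for a handful of documents.
df = pd.DataFrame(
    {
        "author": ["austen", "austen", "melville", "melville", "melville"],
        "word_count": [1200, 1500, 3000, 2800, 3200],
    }
)

# Summary statistics (count, mean, std, quartiles) for the numeric column.
print(df["word_count"].describe())

# Mean word count per author.
mean_by_author = df.groupby("author")["word_count"].mean()
print(mean_by_author)
```

Most day-to-day work is some variation on this: load a table, summarize a column, then group by a category and compare.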
posted by a lungful of dragon at 1:34 AM on December 19, 2015
and you should understand numpy (another python lib) too. but maybe check with them what languages they use first?
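a minimal sketch of the core numpy idea, assuming numpy is installed: you operate on whole arrays at once instead of writing loops. the numbers are made up for illustration.

```python
import numpy as np

# Hypothetical measurements, made up for illustration.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()
std = values.std()  # population standard deviation (ddof=0)

# Standardize every element at once, with no explicit loop.
z_scores = (values - mean) / std

print(mean, std)
print(z_scores)
```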
posted by andrewcooke at 4:07 AM on December 19, 2015
Contact the people you'll be working with and ask them. I'm sure they'll be thrilled to get an email from you asking about how to make the most of your internship.
posted by sciencegeek at 4:20 AM on December 19, 2015 [3 favorites]
Find a data set you care about and start playing with it. Generate some descriptive statistics, and make pretty figures. Figure out what questions you want to ask.
posted by yarntheory at 5:32 AM on December 19, 2015
I second shooting them an email asking, well, what to google. There's a good chance you won't be doing much of anything you would have learned if you'd taken statistics, so that's a plus.
Depending on the answer, you might want to look at Mining of Massive Datasets and The Elements of Statistical Learning or Introduction to Statistical Learning (I think that's the more introductory book), all of which are available online for free (legally).
You're not going to learn everything (or perhaps anything) in a month, but you could get a broad idea of the concepts.
Another possibility is the Kaggle Titanic tutorial. Don't just follow the tutorial; do the noodling around yarntheory suggests. (Use pandas for the noodling around. The Python numpy/scipy/pandas/sklearn ecosystem plays well together and things are consistently named. Plotting is still something of a pain. I actually prefer R for exploration, but it's super-frustrating when you first start and they're almost certainly not using R for production.)
posted by hoyland at 5:45 AM on December 19, 2015 [1 favorite]
You should write Cathy O'Neil and ask her; given that she's a number theorist-turned-data scientist who cares a ton about social justice and data's relation to same, you two would have a lot to talk about. She also has an intro book on data science -- have a look to see if it's at the level you need.
posted by escabeche at 6:47 AM on December 19, 2015 [2 favorites]
Understand conditional probability, normal distribution, linear and logistic regression, and why penalization and feature selection are important. Know how to use SVM and what a kernel is. Since you have a math background, brush up on linear algebra if you need to, then read about SVD and PCA. If you want to get fancy, read about why LDA and Word2vec work and how to use them.
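To make the SVD/PCA connection concrete, here is a minimal sketch, assuming numpy is installed. It computes PCA by hand from the SVD of mean-centered data; the five points are invented and chosen to lie close to the line y = 2x.

```python
import numpy as np

# Hypothetical 2-D data lying close to the line y = 2x.
X = np.array(
    [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.0]]
)

# Center each column; PCA is defined on mean-centered data.
Xc = X - X.mean(axis=0)

# Thin SVD: Xc = U @ diag(S) @ Vt. Rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Fraction of total variance captured by each component.
explained = S**2 / np.sum(S**2)

# Coordinates of each point along the principal directions.
scores = Xc @ Vt.T

print(Vt[0])      # first principal direction (its sign is arbitrary)
print(explained)  # the first entry should dominate for this data
```

In practice you would probably reach for a library implementation, but working through it once by hand shows that PCA is mostly linear algebra you already know.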
Then figure out how to do all this in the Python data analysis stack: Python 2.7 (usually), numpy, scipy, sklearn, sometimes statsmodels, and sometimes pandas to hold it all together (I like pandas). Wes McKinney's book is very good, even for a Python semi-newbie (I was when I read it). Gensim is a good Python module for LDA, word2vec, and similar models.
If your text analysis research group is anything like mine, they'll have some tools that anyone even without a programming background can pick up easily (though they'll probably still be Python modules or command-line interfaces).
posted by supercres at 8:08 AM on December 19, 2015 [2 favorites]
Free book: Data mining for the masses was recommended to me by a friend.
posted by idb at 12:05 PM on December 19, 2015 [2 favorites]
Learning the mechanics of statistics is a great, edifying thing to do, but what will really add value in this type of role (unless they just want a robot stats generator) is being able to interpret the data being produced. While you're practicing the mechanics, even if it's with dummy data that doesn't have any real meaning, try to ascribe what the meaning would be if it were relevant. If you run a regression or even just pump out some descriptive stats, take it out of the math terms and say its meaning and relevance.
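As a small illustration of that habit, here is a sketch, assuming numpy is installed, that fits a straight line to dummy data and then states the result in plain language. The "hours" and "scores" values are invented and have no real meaning.

```python
import numpy as np

# Hypothetical dummy data: hours of practice vs. test score.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([52.0, 54.0, 56.0, 58.0, 60.0])

# Fit a degree-1 polynomial; polyfit returns [slope, intercept].
slope, intercept = np.polyfit(hours, scores, 1)

# Say what the numbers mean instead of just reporting them.
print(
    f"In this sample, each extra hour of practice goes with about "
    f"{slope:.1f} more points, starting from a baseline of about "
    f"{intercept:.1f} points at zero hours."
)
```

The sentence is the part that matters; the fit is just the input to it.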
posted by toomanycurls at 4:02 PM on December 19, 2015
This thread is closed to new comments.
posted by rockindata at 11:06 PM on December 18, 2015 [1 favorite]