Help me be the best data guy I can be
June 18, 2012 10:04 AM

Part of my current job involves manipulating a large amount of data. I need to learn how to do this more efficiently and on a larger scale. What should I use? What should I learn? What is the best (free) way to teach myself?

Currently I use Excel--countifs, sumifs, vlookup, pivot tables, all that fun stuff--which gets me pretty much everything I need. For now.

The problem is that the amount of data I'm working with is growing rapidly and will soon surpass the 1,048,576-row limit of what Excel can handle.

At which point I'm screwed unless I have something else I can use to do the same (or similar) things that I'm currently using Excel for.

posted by dersins to Computers & Internet (26 answers total) 32 users marked this as a favorite
Microsoft Access. Or another SQL software. That's what they're there for.
posted by sarahnicolesays at 10:12 AM on June 18, 2012 [1 favorite]

I'm interested to see what others say here because I have faced similar questions but am certainly not an authority.

MS Access is probably the easiest thing to transition to from Excel if you're on Windows.

If this were more about data analysis than data management, then R would be an option - it's free, open source stats software. But it's got a bigger learning curve (in some ways, although the relational database aspect to Access will also be new to you) and I think it's really more about analyzing static datasets than managing dynamic ones.

If you're on OSX then my understanding is that the closest equivalent to Access is SQLite, with some sort of GUI manager on top of it.
posted by yarrow at 10:23 AM on June 18, 2012

For manipulating large data--and especially for counting, summarizing, doing regressions, or more complex statistical work--you should really look into using SAS for this. There is a good reason it is the gold standard for this kind of work. There are a number of very good, free SAS examples and tutorials to get you started.
Depending on the kind of work you are doing, you may also find that perl or python can do what you are looking for.
posted by machinecraig at 10:25 AM on June 18, 2012 [2 favorites]

Where is your data coming from? Flat files, a database system, delivered to you in Excel from [mysterious source]?

I do this kind of work every day, but the source is Microsoft SQL in the first place so it's pretty much me doing my thing in T-SQL, sometimes extended to SSRS reports.

Access is an in-between option. For whatever reason, my SQL brain finds Access ponderous and incomprehensible, but I know people who would say the same about SQL and can do magic in Access.
posted by Lyn Never at 10:26 AM on June 18, 2012

It would help to know exactly what you're doing with the data.

MATLAB is incredibly powerful for that sort of thing if you're doing somewhat tedious calculations with lots of if/then statements. To learn for free, you can use Octave. Unless you're using some of the newest features of MATLAB, your code should transfer flawlessly between the two.
posted by JacksonEsquire at 10:32 AM on June 18, 2012

Even some of the best default data tools (R, in particular) have issues when scaling up your data. Unfortunately, there isn't a scalability 101 guide for data folks - you're kinda stuck having to figure out the runtime/memory requirements for your analysis and thinking about what tools would be best for your workflow.

That said, most people wind up sampling or taking subsets of their data to work with their tools (in other words, if you can't scale out your data, reduce the amount of data you absolutely have to deal with). If you really need to process ALL of your data, then you might have to get well acquainted with the use and limits of R/SAS/Matlab, conventional relational databases, all the way up through NoSQL databases and hadoop based platforms.

There will be a small jump from processing your data in memory on one machine (Excel, and to some degree stuff like SPSS, R, SAS, Matlab, etc.) to partially on disk on one machine (database vs. in-memory client). There will be a potentially much bigger jump from single-machine in-memory/database stuff to sharding/partitioning across disks on multiple machines (clustered databases--think web scale).

You don't need to know or implement all of this stuff. Wrangling more data is cool and often gives you better results, but it's not absolutely necessary if you're able to deliver insight with smaller data sets or lower-powered tools.
posted by NoRelationToLea at 10:42 AM on June 18, 2012

Google Refine does a lot of this.
posted by tayknight at 10:43 AM on June 18, 2012

Thanks for the suggestions thus far-- much appreciated. Sounds like it would be helpful for me to explain a little more.

- The data comes to me in the form of .csv or .xls exports (from a database that I cannot query directly).

- It comes in several different types of outputs that contain some overlapping data fields, but none of which produce the information I actually need.

Two examples (slightly obfuscated):

- One type of output is a record of sales calls, which would include the name of the salesperson, the office the salesperson works out of, a database ID# associated with the person being called, and the result of the call (not home, not interested, sale closed, etc.), date of the call

- A second type of output is a record of the people who the salespeople have called: Name, database ID#, phone #, address, etc.

I need to generate many different reports based on this kind of information. A typical one might be:

How many people in each of 25 different cities have been called during a given week, and what percentage of those calls resulted in a sale. Since the sales results do not live in the same export as the addresses of the people we've called, I need to match results to city by the database ID#.

This is simple stuff to do in Excel, but the volume of calls is such that I am likely to exceed Excel's record capacity in the not-too-distant future.

An additional difficulty here is that I have no budget for software or training. So free is not just good, but, unfortunately, necessary....
posted by dersins at 10:58 AM on June 18, 2012

Part of getting good at what you're doing is understanding what tools you need for the task at hand. You're probably going to need to learn several things. To all the stuff above I'll add that you would be well served by learning python, especially if your data arrives as flat files.
posted by chairface at 11:01 AM on June 18, 2012

This sounds like the sort of thing that's really easy to do with QlikView.
posted by bzn at 11:09 AM on June 18, 2012

It's good practice to separate the storage of your data from the analysis of it. Are you asking for ways to store the data, or to analyze it?

Storage solutions can be SQL-based, SAS, or flat files, among other options. These three are the ones I'm most familiar with. SQL-based storage solutions include Microsoft Access, Microsoft SQL Server, MySQL, SQLite, PostgreSQL, Oracle, and others. Many have SQL in the name, which stands for "Structured Query Language". SQL is a concise, English-like language for performing database management and set operations: creating sets, or 'tables'; using 'queries' to ask simple questions of the tables; and updating, adding to, or deleting tables. While some analysis, such as sums, group-wise sums, averages, counts, and the like can be performed using queries, generally all your data analysis--from counts to tests of proportions to regressions and more--should be handled with analysis software.
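As a tiny illustration of that query style, here is the kind of group-wise count SQL makes trivial, sketched with Python's built-in sqlite3 module (the table and column names are invented for the example):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # throwaway in-memory database
con.execute("CREATE TABLE calls (salesperson TEXT, result TEXT)")
con.executemany(
    "INSERT INTO calls VALUES (?, ?)",
    [("ann", "sale closed"), ("ann", "not home"), ("bob", "sale closed")],
)

# A group-wise count: total calls and closed sales per salesperson.
# SQLite lets you SUM a boolean expression because it evaluates to 0/1.
rows = con.execute("""
    SELECT salesperson,
           COUNT(*) AS calls,
           SUM(result = 'sale closed') AS sales
    FROM calls
    GROUP BY salesperson
    ORDER BY salesperson
""").fetchall()
print(rows)  # [('ann', 2, 1), ('bob', 1, 1)]
```

The same GROUP BY statement works essentially unchanged in Access, Postgres, or any of the other engines listed above.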

SAS is a whole universe unto itself. You both manage the data and perform analysis in SAS, but the two pieces are well separated inside the SAS universe. SAS is popular, reliable, not easy to learn, and expensive. To add to your confusion, you can manage your data in the SAS universe using SQL statements.

Flat files - text files with each line representing a row in a table and each column separated by a delimiter - comma or tab or space or some such - are quick, take up little space, are universally read by any software, but require a decent management overhead when your data scales up and you have multiple files and versions. Data between entities is often passed as flat files, which are then read into the data management software of choice.

If you are looking to analyze data somebody else stores somehow and sends you in Excel format, and your analysis is even the slightest bit complicated, then I think you're looking at SAS or R. Both can do pretty much anything, but SAS is lame like PCs and R is cool like a Mac. Also, R is free, is in my opinion easier to learn than SAS, and is just plain enjoyable to use. It's programming, and if you can program you'll love it.

I think those are your options. If you have a future in data analysis, then go for a free SQL product and learn R, and realize it will be a long, time-consuming, rewarding slog. If you are looking for a one-shot thing to count up sums, and otherwise don't want a career doing this work, then learn Access. If you want to spend cash upfront, grumble 'I hate SAS' all day under your breath, but have a technical credential (pass SAS exams that aren't particularly hard) that seems to have some genuine professional demand, then go that route.
posted by everythings_interrelated at 11:10 AM on June 18, 2012 [2 favorites]

Hi Dersins,

We have a similar job. The amount of data that I manipulate is about to swallow me whole!

I'd love to dump it into SAS or Oracle. We're getting a consultant in to help with some of the stuff because I am manipulating WAY too much stuff manually in Excel.

I'd like to do Access, but the training is $1500 for 3 levels and it would take too long to ramp it up.

Are you sure you can't do data analysis in your current system? I know we can in ours, we've just not activated the modules (and yes, intense source of frustration #54)
posted by Ruthless Bunny at 11:13 AM on June 18, 2012

I think a RDBMS is the right tool for this. Install a real database on your machine, my suggestion would be PostgreSQL, and import the data into it. Then you can query and summarize it using standard SQL.

You can export the results as a .CSV for import into Excel as necessary if you want to make graphs or use pivot tables to further analyze the data.
posted by ob1quixote at 11:38 AM on June 18, 2012 [1 favorite]

Your example problem sounds tailor-made for a SQL query on a relational database.
posted by scose at 11:43 AM on June 18, 2012

Honestly, for right now, IF you have Access on your system already (and you probably do if you have MS Office), it shouldn't be too difficult to get up to speed on the kind of things you're doing. If you figured it out in Excel you'll be ok figuring it out in Access; indeed Access (or any kind of relational database) is more elegant at hooking data together in the way you're describing than Excel is.

I'd just try messing around with Access and see if you can figure it out. Basically you'll be importing your different 'outputs' as separate tables in Access and then creating queries to join them together and summarize them. "Joins" are doing the work that vlookup was doing in Excel, and there is even a pivot table view in the query interface in Access that will look somewhat familiar to you. Give it a shot, google around for introductions to importing data into Access & to query building, and see if you can make it work.

(The advantage to this is you don't have the additional learning curve of figuring out how to install and configure the various other tools that have been mentioned above.)
posted by yarrow at 11:51 AM on June 18, 2012

If you do end up going the python route, pandas looks quite nice for this kind of thing.
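For the city/week report described above, a minimal pandas sketch might look like this (column names like id, city, result, and date are invented for illustration; in real life you'd load the two exports with pd.read_csv instead of building the frames inline):

```python
import pandas as pd

# Stand-ins for the two exports; real column names will differ.
calls = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "result": ["sale closed", "not home", "sale closed", "not interested"],
    "date": pd.to_datetime(["2012-06-11", "2012-06-12",
                            "2012-06-13", "2012-06-14"]),
})
people = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "city": ["Portland", "Seattle", "Portland", "Boise"],
})

# Match call results to cities by database ID -- the vlookup step.
merged = calls.merge(people, on="id")

# Calls and sale percentage per city for a given week.
week = merged[(merged["date"] >= "2012-06-11") & (merged["date"] <= "2012-06-17")]
report = week.groupby("city").agg(
    calls=("result", "size"),
    sales=("result", lambda s: (s == "sale closed").sum()),
)
report["sale_pct"] = 100 * report["sales"] / report["calls"]
print(report)
```

The merge-then-groupby pattern here is doing exactly the vlookup-plus-pivot-table work described in the question, with no row limit to worry about.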
posted by d11 at 12:18 PM on June 18, 2012 [1 favorite]

Seconding yarrow that if Access is already on your machine, it is free and pretty easy to understand if you are already proficient in Excel. There are tons of free tutorials on the web, too.

You have your call results table and your customers-called table. You can import both of these as tables and then relate those two tables by database#. Now you can write a query that will pull the result of the call, the date, and the city name for each call. You can make it a new table in Access and summarize it there.

Access has size limits as well: 2 GB per database. However, the size of a table is determined by several things, like the number of records, number of fields (columns in Excel), and the size of each field. If your tables have fewer than ten columns, you should not have a problem fitting two tables with two million records each into one db.
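As a rough back-of-the-envelope for that claim (the 50-byte average field size is just a guess; actual storage per field varies):

```python
# Estimate one table's size against Access's 2 GB database cap.
records = 2_000_000
fields = 10
avg_bytes_per_field = 50  # guessed average for short text/number fields

estimated_bytes = records * fields * avg_bytes_per_field
print(estimated_bytes / 1e9)  # 1.0 -- about 1 GB, comfortably under 2 GB
```

Even doubling the per-field guess leaves two such tables at the edge of the cap, so it's worth re-running the arithmetic as the data grows.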
posted by soelo at 12:42 PM on June 18, 2012

seconding ob1quixote

I like postgres a lot. I actually work with Oracle on a daily basis, but that's because we already have the Oracle über enterprise license, and almost all of our stuff is in Oracle already. Postgres does most everything that Oracle does, and really, if you are just using Excel right now, you will not be stressing it much.

Access can give you some bad habits real fast, and you really are bumping right up against its absolute max limits. Also, a big chunk of what people do in my office is take people's horrible Access databases and get them into shape to go into our good Oracle schemas.

If you are smart, you can get started with an O'Reilly book and a little time with Postgres. Also, staying in Postgres lets you then move into building a nice little Django webapp on top of your data. There is a lot more you can do if you keep your data out of the MS silo.
posted by rockindata at 1:40 PM on June 18, 2012

Learn SQL and you'll be able to move up to big data analysis.
posted by blue_beetle at 5:08 PM on June 18, 2012

For what you're doing, it sounds like you can get away with awk.

To take your example report: supposing file one is a CSV with fields user, city and file two has fields result, user, date, you'd call this with "awk -F, -f codebelow.awk fileone filetwo" (the lookup file comes first):
BEGIN { OFS=","; SUBSEP=",";        #Your output will be a CSV
        date="2012-04";             #Do everything from April, 2012
}

FNR == NR { location[$1] = $2 }     #location[user] = city; FNR==NR is true for the first file

FNR != NR && index($3, date) == 1 { #For the second file, checking the date prefix...
  calls[location[$2]]++;            #Increment the number of calls for the user's city
  if ($1 == "sold") {               #If this was a sale...
    sales[location[$2]]++;          #count it
  }
}

END {                               #When we're done reading the input...
  for (city in calls) {             #Cities with no sales read as 0 in the division
    print city, calls[city], sales[city], (sales[city]/calls[city]) #The last one is your sale rate
  }
}
O'Reilly's sed & awk is kind of long-winded but may help you, and this guide should come in handy.

Awk may not work for all your problems for whatever reason; for that I can say that pandas, mentioned earlier, is really fast and a joy to work with--I ran some basic segmenting and counting on three to five million records and it took under a minute. It also has functions with names like "pivot_table", so you should be able to find all the same functionality you're used to.
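To show what that translation looks like, here's a hedged sketch of pandas' pivot_table reproducing a countifs-style crosstab (the data and column names are made up; a dummy counter column keeps the aggregation explicit):

```python
import pandas as pd

# Invented sample data standing in for a call-results export.
calls = pd.DataFrame({
    "city": ["Portland", "Portland", "Seattle", "Seattle"],
    "result": ["sale closed", "not home", "sale closed", "sale closed"],
})

# Count of each result per city, like an Excel pivot table:
# add a counter column of 1s and sum it per (city, result) cell.
pivot = pd.pivot_table(
    calls.assign(n=1),
    index="city", columns="result",
    values="n", aggfunc="sum", fill_value=0,
)
print(pivot)
```

fill_value=0 fills cells with no matching rows, just as an Excel pivot table shows blanks or zeros for empty combinations.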

One million records really shouldn't be too many if you have anything that's resource efficient (so not Excel) and a computer with a few GBs of memory. You really also shouldn't need a paid product, though it could "just do it" for you.

I'm also bewildered people are suggesting databases. Based on your description of the data above, you don't have complex relations, just basic ID lookups, so the only advantage you'd get from a DB would be it handling very basic joins for you and not having to keep everything in memory at once--both helpful. On the flip side, setting one up would be a hassle, of no long-term value, and it would probably be slower than anything else that handled a data set of the same size.
posted by 23 at 6:09 PM on June 18, 2012

Take a look at Google Refine. It's a really great tool that runs as a server on your machine, and you access it through a browser.

You can import data into it from CSVs and it should be able to handle a million records easily. I would imagine that it can make the kinds of reports you're looking for, though I haven't tried to do anything like that yet.
posted by oranger at 7:08 PM on June 18, 2012 [1 favorite]

Try PowerPivot. It is an Excel add-in made by Microsoft that can fetch data (up to 2 GB) right into your Excel workbook. You then analyze in an Excel-like language called DAX, and show results in Excel pivot tables. It's probably the shortest path from your current implementation.
posted by crazycanuck at 8:21 PM on June 18, 2012

I almost envy you this job, as you're about to discover how elegant and powerful SQL is when used for exactly the kind of job it's designed for - which it sounds like yours is.

Any relational database server will do, but I vote for PostgreSQL - it's free and fast. Get into the habit of saving useful snippets of SQL in a text file somewhere. If there are particular reports that you find yourself generating regularly, you'll eventually want to put them into a script so that you can do the whole thing with a single command.
posted by primer_dimer at 1:24 AM on June 19, 2012

I would go with MySQL. You can get it for free, and uploading CSVs into it is super easy. Plus if you can learn SQL it will look good on your resume and open you up to future jobs in business intelligence and other analytical fields.
posted by Aizkolari at 6:51 AM on June 19, 2012

Check out SQL Server Express. It is a free version of SQL Server from Microsoft and has many of the features of full-blown SQL Server. It will play nice with your other Microsoft products too.
posted by TeknoKid at 3:42 PM on June 20, 2012

Thanks so much for all the great suggestions. I do already have Access installed, so I'll definitely start poking around to see if I can figure out how to make it do what I need to do. In the meantime I'll probably also get an introductory book on SQL, because it sounds like that's probably my long-term solution. Any suggestions as to an awesome SQL-for-beginners book (or online tutorial) would be very welcome.
posted by dersins at 9:59 PM on June 20, 2012
