Best way to parse this data
April 11, 2016 11:41 PM   Subscribe

I have a massive data set, 45 to 60 years of 5 different sets with each one needs several different sorting. Good automation of this would be great.

Like question said large data sets, they are groups of stats for each year running back to 1970 with each year have anywhere from 4-7 filters. So a year is sorted by each filter then the top is grouped together, say top 45, and the mean is figured. Out of the top group anything above that mean is pulled into a list.

This process is repeated for each filter and names are compared to each filter as to not have duplicates. At the end you have have a list of which name was above which filter, and if it was above multiple filters. The question lies if anyone knows a good way to automate the filter process?

That way all I'm doing is plugging in data for each year and comparing years to one another. This is what the original project was and it is starting to get a bit out of hand by comparing and sorting over and over again.

Any help is appreciated thanks
posted by bigbadbehr to Technology (10 answers total)
 
Does the data have the same columns (same order and content) in each row for each year of the 5 data sets? If so, you can pivot table the shit out of this.

1. Merge all the years into a single worksheet in excel, adding a column for year. (I'm not fancy enough to automate this but ctrl shift down and ctrl shift right will make this go faster)
2. Check that the data is in a flat format (no breaks in columns or rows) and create a pivot table.
3. Use the necessary variables to filter and sort your info then add year as column to the report
4. You can choose the top x by year for the mean (which can be searched for on Microsoft's support site)

Memail me for help, this is all pretty general but it should speed up your analysis.
posted by toomanycurls at 12:06 AM on April 12, 2016 [1 favorite]


Regarding the data quality aspect of your question, Excel has a find duplicates feature in the conditional formatting menu which can incorporate multiple columns into its logic.
posted by toomanycurls at 12:20 AM on April 12, 2016


How big is massive?
posted by gorcha at 12:54 AM on April 12, 2016


I would look into a more sophisticated tool for data mining such as Knime. It is freeware and allows you to break up your manipulations into discrete chunks called nodes. You can view the output at the end of each stage and there are a large number of nodes for most normal data manipulations built in. You can also use machine learning algorithms on the data. It wouldn't work for very large datasets (1million +) without knowing some hadoop, but for what I am guessing is your definition of "massive" it should do fine.
posted by koolkat at 1:09 AM on April 12, 2016 [1 favorite]


These steps sound quite easy to do with R or Python. What tools are you using now?
posted by demiurge at 5:08 AM on April 12, 2016 [4 favorites]


I'd do Excel PowerPivot with slicers. But that's not going to work with really massive data sets. You're going to want R.
posted by Ruthless Bunny at 5:24 AM on April 12, 2016


You don't need Hadoop for a million rows of data. Puthon or R can do fine(sub second responses) especially if you give them enough memory.
posted by rockindata at 9:52 AM on April 12, 2016


I'd look into pandas (a data-managing library in python, with nice default graphics). Seaborn is another library that does nice slicing, comparison, and plotting.
posted by clew at 3:48 PM on April 12, 2016


I have used R for similar tasks (millions of records, parsing text input etc). It is programming, but it's not too hardcore to have a go at yourself.
posted by Wrinkled Stumpskin at 3:58 PM on April 12, 2016


If I'm understanding your problem reasonably accurately, you haven't described anything too complicated for Excel's native functionality. Sounds like a job for your friendly neighborhood SUMPRODUCT. At worst, a few scraps of VBA to peel out unique values from lists with duplicates. Might require a bit of back & forth to be sure. Let me know if you're interested. I enjoy this sort of thing.
posted by perspicio at 9:36 PM on April 12, 2016


« Older Help me ID this song   |   How do things like live broadcast TV deal with... Newer »
This thread is closed to new comments.