Help me pick a DB platform?
January 30, 2019 8:01 AM Subscribe
For several years now, I've been riding in the Texas MS150, which is a two-day charity ride from Houston to Austin undertaken to raise money for MS.
Turns out, I'm pretty good at fundraising. But to keep improving, I need to build some tools. It looks like this has been asked before, but not since 2014, so...
Up to now, I've been using a spreadsheet the fundraising coordinator gives me every year. It contains a list of ALL donors ever, with columns for each event donation. At this point, there are columns for 2013, 2014, 2015, 2016, 2017, 2018, and now 2019.
Last year, I added a donation date, and I'm tracking it again this year, so that I can see where I am vs. prior years (for example, I know right now I'm slightly ahead of pace in terms of dollars, and VERY ahead in terms of donation count, but also that I was very behind as of 1/1 and made up the difference in January, etc.). I'm a giant data nerd, so I like having this on hand.
However, creating and managing the various quickie-reports and filtered views I want has become cumbersome in Excel because, well, it's Excel. I've considered moving this to a proper DB for years, but now I think I actually want to do it. My impulse is to default to the stack I last used professionally, ca. 2007, which would be basically perl + postgres running locally on my Mac. But it occurs to me that there's probably a better way in the Year of Our Lord 2019, and that the Green might know which way to look.
Already considered and rejected: Anything from Google or Amazon.
Currently playing with, but generally disappointed by: Airtable. It does SOME of what I want -- multiple persistent views & filters -- but suffers from some of the same limitations that are driving me from Excel, such as fiddly in-cell formulas to accomplish trivial things. Its inability to manage a filter with AND and OR without a dedicated formula field is particularly galling, since that's exactly the need that's pushing me away from Excel.
I guess in my heart of hearts, it's hard to improve over the simplicity of a view based on "select email_address from Table where isMatch is false and (2019Donation = "" or AlwaysEmail is true)".
Thoughts?
You should play with Metabase for a bit. If you build the DB behind it, it can do a lot of quick visualization of your queries and some ad hoc reporting. Nice when you don't really know what you want to report on yet.
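If it helps, here's a minimal sketch of the "build the DB behind it" step: loading spreadsheet-shaped data into a local SQLite file that Metabase can point at. The column names and file layout here are made up for illustration.

```python
import sqlite3

import pandas as pd

# Toy stand-in for the coordinator's spreadsheet (columns are assumptions)
df = pd.DataFrame({'Donor': ['A', 'B'], '2019': [50, 0]})

# Load it into SQLite; use a file path (e.g. 'donations.db') instead of
# ':memory:' so Metabase has something on disk to connect to
conn = sqlite3.connect(':memory:')
df.to_sql('donors', conn, if_exists='replace', index=False)

# Sanity check: the table is now queryable with plain SQL
count = conn.execute('select count(*) from donors').fetchone()[0]
print(count)  # 2
conn.close()
```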
posted by advicepig at 8:33 AM on January 30, 2019
Response by poster: The dataset IS very small. Can you provide a link for this dataframes/python idea?
posted by uberchet at 8:52 AM on January 30, 2019
Best answer: If you like the idea of pandas, the good news is that, assuming your data is in some sort of sensible layout (I'm imagining pretty much 'donor, 2013, 2014, 2015...'), you can just get pandas to read all the data in directly from your existing Excel file.
A super popular way to play with pandas is a Jupyter notebook, which is a nice (optionally online) combination of journal and code, very popular in the data-science world at the moment. You end up with a document that both describes and documents your code, with real charts and data, similar to this. So you could create a notebook that answers all your questions and have it easily printable/exportable as a report for others.
Just had a quick look around for an intro tutorial: this looks kinda helpful, and maybe this, if you're coming from Excel.
posted by Static Vagabond at 1:20 PM on January 30, 2019 [2 favorites]
import pandas as pd

# Read the donor spreadsheet straight from Excel
df = pd.read_excel('donations.xlsx')

# Add a total column (sum of the numeric donation columns per donor)
df['Total'] = df.sum(axis=1, numeric_only=True)
print(df)
Response by poster: Ok, pandas + jupyter looks GREAT, and I can get at my data from Excel just fine, but I'm having issues because my first task is apparently weird.
Pandas is apparently super good at "show me column X where X > 50", but I can't seem to sort out how to get to "show me column B when [misc conditions about columns N, O, and P]".
I know this isn't for back & forth, but any advice you have on an effective pandas forum would be welcome.
posted by uberchet at 3:04 PM on January 30, 2019
Response by poster: For the record, I sorted it. I'm adding it here in case someone searches later, because my SQL brain had a hard time getting to it.
To accomplish what I wanted, I needed to do things like
Condition1 = df['Column1'] == 'puffball'
Condition2 = df['Column9'] != -1
and then
df[Condition1 & Condition2]
would give me ALL the records where those two things were true. Then
df[Condition1 & Condition2]['ColumnZ']
would give me only that column from records where those two things are true.
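For comparison with the SQL view from my question, the same AND/OR combination works with | and parentheses around each condition. These column names are just illustrative:

```python
import pandas as pd

# Toy rows standing in for the donor sheet (column names are assumptions)
df = pd.DataFrame({
    'email_address': ['a@x.com', 'b@x.com', 'c@x.com'],
    'isMatch':       [False, True, False],
    '2019Donation':  ['', '50', ''],
    'AlwaysEmail':   [False, False, True],
})

# select email_address where isMatch is false
#   and (2019Donation = '' or AlwaysEmail is true)
mask = ~df['isMatch'] & ((df['2019Donation'] == '') | df['AlwaysEmail'])
print(df.loc[mask, 'email_address'].tolist())  # ['a@x.com', 'c@x.com']
```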
Another key idea I didn't get immediately was the need to do some data cleanup first. Tests for missing values don't seem as straightforward, but it's simple to fill in those "NaN" values with df.fillna(some-value). I used -1, which then made finding the NaN rows easier.
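One alternative to the -1 sentinel, since pandas does have isna()/notna() built in; column names made up again:

```python
import pandas as pd

df = pd.DataFrame({'Donor': ['A', 'B', 'C'],
                   '2018': [100.0, None, 25.0]})

# Find the missing-value rows directly, no sentinel needed
print(df.loc[df['2018'].isna(), 'Donor'].tolist())  # ['B']

# The fillna(-1) approach, for comparison
filled = df.fillna({'2018': -1})
print(int((filled['2018'] == -1).sum()))  # 1
```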
posted by uberchet at 6:20 AM on January 31, 2019
This thread is closed to new comments.