Give Our Database a Bath
May 5, 2009 11:36 AM   Subscribe

Dirty database needs a good cleaning. But how?

We - a regional environmental nonprofit with about 8000 active members and untold additional business/activist/etc. contacts - are implementing to handle our CRM needs. We're currently working with a database that was developed using twigs, hamsters on wheels and a lot of duct tape. (Also MS Access.)

We're at the point of taking our very old, very shoddily constructed database and cleaning it up so that it can be imported and serve as a basis for our "new" database. We've set some of our guidelines - no data older than 5 years, etc. - but we're not sure how to proceed with the clean-up.

Our implementation consultants are not being particularly forthcoming with suggestions about organizing this project - aside from outsourcing it to India. And since we've had a lot of quality control issues with things we outsource (de-duped mailing lists that are not actually de-duped), we'd rather this be handled in-house. Currently, we see this as having a handful of steps: 1) make the data cuts and generate master database to be cleaned (cleaning to take place in Excel); 2) Proof database finding all dirty bits that need cleaning; 3) Check proof to make sure it is accurate; 4) Make changes to database; 5) Check edits to make sure that they are complete.

How should we organize this project to ensure that it is done quickly and expeditiously? And if we're planning to hire someone to do this, what qualifications should we look for?
posted by greekphilosophy to Technology (14 answers total) 7 users marked this as a favorite
Best answer: I have only two remarks, both of which appear to be relatively facile, but god help you if you ignore either one:

1) information integrity is a way of life, not a project goal. Beware the lazy user -- make provisions for ongoing de-duplication & the like.

2) decide what matters most: a perfect database, or one that's good enough. Although lazy users are a source of ongoing damage, you face a separate risk from people who will insist that the database be perfect. This will result in the database never being used.

...these two points are in tension with each other, but such is the natural order of things.

(actually, I have a third remark but it may already be too late: beware your implementation consultants. Be absolutely goddamn certain that the scope of work has been nailed down. Nailed down, not "commonly agreed upon" or anything like that. Nailed down with big scary words and tediously-exact statements.)
posted by aramaic at 11:57 AM on May 5, 2009 [4 favorites]

How big is your database? How many tables? How many rows in each table? Are you talking about scrubbing your data or are you planning to change the schema, too?

With only the information you provided, I'd guess that what you should do is (1) make sure you have a good backup of your database; (2) identify your bad data (e.g., orphaned foreign keys; expired data; whatever else); (3) write SQL code either to delete your bad data or to move it into an error table; (4) run row counts on tables to be scrubbed; (5) run SQL code to scrub data; (5) run row counts on newly-cleaned tables; (6) spot check your data; and, (7) test whatever applications interface with your database to make sure you didn't break anything. If anything's broken, revert back to your pre-scrubbed backup.
posted by Maisie at 11:59 AM on May 5, 2009

Also, if you're hunting for books on the topic, I spent a bit of time leafing through Data Warehouse ELT Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data -- it seems to spend a fair bit of time explaining just the sort of SQL grunt work that Maisie talks about.
posted by verb at 12:54 PM on May 5, 2009 [1 favorite]

A note: While that book's title can be a bit imposing, you don't have to be building a huge-ass data warehouse to find the data scrubbing techniques it discusses useful.
posted by verb at 12:54 PM on May 5, 2009

Response by poster: Good call Maisie. Size Does Matter! And I suspect the other bit of information I've got will have an impact on your answer also. Our membership coordinator estimates that the mass of data that needs cleaning is about 36,000 entries - with each entry containing demographic information and giving/mailing history. Also, I think the database is so shoddily constructed that it can't handle the kind of programmed cleaning that you're suggesting. It appears that this job will have to be done manually. (Then again, it is important to note: none of us really knows how to do that sort of thing. It may be worth getting a second opinion from an outside source about how this could/should be done given our database's poor construction.)

And aramaic: you offer sage wisdom also. We're trying to walk that fine line now, and part of the new regime is that we will have a new database as well as a set of new guidelines for its use to go along with it. Is a written set of rules helpful for this sort of thing? Do we need to set up a system to spot check to make sure that users aren't diluting our data over time?
posted by greekphilosophy at 1:01 PM on May 5, 2009

Before you get to the manual donkeywork stage, make sure someone who knows the data and knows how to manipulate databases (this may have to be more than one person) goes through looking for common patterns/problems. You may find that not only can much of the clean-up be carefully automated, but you can also create simple tools to monitor the whole process and highlight likely problems.
Also, having the most knowledgeable people initially work on the migration is the best way to double-check your new data structures.
posted by malevolent at 1:08 PM on May 5, 2009

Also, I think the database is so shoddily constructed that it can't handle the kind of programmed cleaning that you're suggesting. It appears that this job will have to be done manually. (Then again, it is important to note: none of us really knows how to do that sort of thing.
Well, keep in mind that "programmed cleaning" almost always consists of someone who knows SQL poking vigorously at the database for a week or three, running queries that generate lists of 'likely duplicates' and refining criteria, etc. That kind of stuff isn't necessarily limited by the system you're using and the design of your database; they're the methods you use to migrate the data and massage it into a new structure when "OK, print it all out and type it back in" is unacceptable.
posted by verb at 1:09 PM on May 5, 2009

Best answer: i have experience with this from one of my old jobs. are you going from access to raiser's edge or something like that? or are you going from access to custom-built sql db with a front end of access for the people who get scared of change?

i'm guessing you have a couple of access tables with various relationships set up. i hope. or is it just one big access table with everything in it? either way, your job is going to suck trying to get this done, especially since most of it is going to have to be done manually (i speak from experience). the reason it's gonna be manual is that any queries you run will break down and cry because your data won't be normalized and there will be numerals in fields designated as alpha only....and so on.


first, you'll want to eliminate any record that is older than five years. but what does this mean? not modified in five years? entered more than five years ago? so decide on that. hopefully, your db has a field where it recorded the date entered. if not, you'll have to do some sql stuff that i don't know how to do to delete things that are more than 5 years old.

second, you'll want to find duplicates and delete those. but before doing that, you should make sure that each of the three entries for john q. smith has all the same info. if it does, you're fine. if not, you should merge the records so that the address AND the email address are all in one record. if the records are quite different, it's your call. i personally would delete them both and hope for the best. but that's me.

after you have the old records and the dupe records gone, now you have the working records that you need to clean up and normalize. i'm not sure exactly what you want "cleaned up" so that will determine your next steps.

it sounds like your consultants suck. isn't this what they're getting paid to consult about? they should be telling you about data normalization and creating 5293842309 tables and linking them and stuff like that.

the one other thing to realize, as brought up above, is that your shiny new database will never be perfect. even from the get-go it won't be perfect because the data you're importing from your old hamster database won't be perfect. so anyone who has expectations like that needs to be told that perfection's not happening.

written instructions for your database will be useful for training purposes, and for the first couple of weeks or months while people are getting used to the new system. then they'll get buried and forgotten because people think they know what they're doing. so doing periodic spot checks is essential. this could even be a built in feature of your new database (eg, someone starts entering info for john q. smith and your db dings or something and says "there is already a record for john q. smith, please confirm you want to create a new record for a DIFFERENT john q. smith" or something like that.)

i am no sql/access guru, but people at my other job were. so it was my job to tell them what we needed (and i knew what we needed based on common sense and not being a jackass, two qualities which i know you have) and then offer feedback on what they were building for us. it actually worked really well until jackasses got involved.
posted by misanthropicsarah at 1:19 PM on May 5, 2009 [1 favorite]

I am a little surprised that your consultants are being so unhelpful and I wonder if they feel like this kind of basic-but-time-consuming work will cost your organization more money than you'd like to pay out if they have to assign one of their people to do it. That said, this would actually be a good project to give to a summer intern that you hired from the MIS department at your local university if you can afford to get one.

You almost certainly do not have to do this manually. If I were you, I'd make a copy of the database and experiment with writing SQL scripts to explore the database and come up with likely candidates for deletion. I think it would actually be kind of fun. The most important thing, however, is to make sure you have a good backup of your database before you do anything. Please, please, please learn from the mistakes of the rest of us who mistakenly thought "oh, this SQL statement will just delete one row."

As for how to fix your database going forward, you may want to have your consultants (who I hate already just from their lack of helpfulness on this. Send it to India? Please.) work with you on that. Your database probably needs to be normalized and you may need referential integrity or other constraints.

Anyhow, good luck and let us know how this turns out. Also, feel free to memail me if you have any questions. I used to be an application developer and I'm now a DBA, so I have relevant background.
posted by Maisie at 1:36 PM on May 5, 2009

1) make the data cuts and generate master database to be cleaned (cleaning to take place in Excel)

I agree wholeheartedly with the previous advice, but I want to call this one out: Excel is a decent tool for a lot of things, but dear God, not data cleanup. It will work, but the person using it for data cleanup will hate surely their life before the task is even 25% complete.

I am going to go out on a limb here and suggest using a more user-friendly tool like FileMaker Pro for data cleanup. FileMaker has an excellent calculation engine with a number of text manipulation functions, and it can suck in and spit out data in most/almost all popular file formats. It also has a very easy to use scripting environment.

Disclaimer: I am an inhouse FileMaker developer for a medium sized company. I have been responsible for several large data conversion projects, cleaning up data in FileMaker and then exporting it to Siebel/SQL. Filemaker is not perfect, but IMO it is a much, much better tool for data cleanup than Excel. Just my .02 cents.
posted by mosk at 1:59 PM on May 5, 2009 [1 favorite]

Another way to try is to figure out a way to parse your data in a binary sort of way. Is this entry "good enough" or not? Especially being inexperienced with databases, I'd be wary of deleting data. Instead, parse it for some logical test that relates to some cleanup task, and split the bad data out. Then fix it and merge it back in.

For example, you've got 36,000 records. Suppose 20,000 of them are good enough? Don't waste time messing with them. Pull out the ones that aren't, and proceed. Split that batch into further smaller batches of data with similar problems. You can probably get to a point where cleaning in Excel isn't awful.

Consider adding some database cleanup and integrity to the existing front-end. If someone is updating a record, and it has invalid data of some kind, make a call-out of some kind that makes them fix it.
posted by gjc at 3:21 PM on May 5, 2009

Sometimes it's helpful to only copy the records that are known to be good, and throwing the rest into a holding area for triage / disposal. If you can come up with a standard "fitness" test, then you may get a useful operational DB in a shorter timeframe- this is particularly true with marketing databases, where the cost of cleaning ancient records is often greater than the value of trying to sell to them.
posted by jenkinsEar at 3:54 PM on May 5, 2009

maisie and mosk have it: do your cleanup in the database, not in excel (though I make no comment on mosk's specific suggestion of Filemaker Pro).

Not only will this be easier, you'll be able to use the same scripts again later, if an inadequate front-end again allows the database to be polluted.
posted by orthogonality at 5:05 PM on May 5, 2009

WARNING: Keep all donor information.

If you have a person who has made a contribution, keep that record. You'll need it if you ever have to
- do any historical contribution analysis or reporting. This comes into play if you're entering into a fund raising campaign and need to factor in your previous fund raising efforts into your calculations for your next campaign. Additionally, you'll need it if you ever want to say that your organization has raised X amount since Y date.

- answer questions from that lapsed donor. The donor who has lapsed could contact you and ask how much they gave, when was their last gift, and what it was used for.

- solicit lapsed donors. A lapsed donor is a potential future donor. They have given to your organization in the past. This means they have some sort of affinity to your organization and, unless you pissed them off, should always be considered a potential donor. Lapsed donors do come back into the fold and give again.

- report your contributions to the IRS or any other agency. You do not want the IRS to come a calling and find them asking about contributions and donor histories you've deleted because they were five years old.

If you're going to delete records, then delete those who were added to your system more than 5 years ago but who have NEVER given or had ANY other activity on their record.
posted by onhazier at 6:09 PM on May 5, 2009

« Older How can I be in two places at once when my mail...   |   Things to think through for a family home loan Newer »
This thread is closed to new comments.