Good database management habits?
October 4, 2011 7:55 PM   Subscribe

How can I learn to be better at data management (without taking a class)?

I'm looking for online resources or really, really approachable books on data management. I do a lot of quantitative research and have never really been taught how to manage data, create databases, and so on. Right now I know how to work well with SPSS and excel, have worked with MS Access in the past (but would like to avoid doing more with it in the future). I have had repeated problems with managing databases where multiple users are inputting data, and also with coding, querying, cleaning, and avoiding accidentally dropping cases from large datasets. Part of the issue is that I often don't know the problems that can come up with each of these tasks until several steps after I've already made a mistake.

Ideally I'd like to spend a couple of hours per week learning how to create databases, manage data effectively, run queries, and troubleshoot problems with databases.

So I guess I'm looking for 1) a recommendation for a preferably free database program that's more user friendly than Access; and 2) links to good resources about the key concepts behind database management. Regular advice about how to go about learning this stuff would be welcome too.
posted by _cave to Technology (8 answers total) 19 users marked this as a favorite
 
Best answer: You're not going to find a program to manage databases that is more user friendly than Access for free.

My recommendation would be to pick up a book on databases and work your way through it. Alternatively there is a online class being offered by Stanford you might want to checkout, depending on how that jives with your "without taking a class" stipulation.

The data entry and migration into a database portion is a problem I struggle with. It's really hard to wean people off giving you data as an excel file which takes some mangling to load up into your database of choice.

It pains me to say this but if you're in the social sciences you're going to have trouble leaving Access behind you. it might be a good idea to learn it, most of the skills (database schema, backing up, querying ,etc) are transferable anyways.
posted by EsotericAlgorithm at 8:45 PM on October 4, 2011


Best answer: I'm self taught and have lots of gaps in my understanding as well. As an exercise to help myself as much as you (sorry!), I'm going to regurgitate what I think I've learned in the hopes that you find it useful, and any gaps or mistakes in it are pointed out (bold words may be helpful when researching information):

You'll want to read a lot about normalization as a part of database design. In general, more normalized designs are more flexible and approach the desired platonic ideal, but greater normalization means worse performance. In the real world, all sorts of weird design decisions are made in order to increase optimization. This may be something to look over.

Indexes are a way to hurt write performance (longer to add information due to extra processing) in exchange for greatly accelerating read performance (queries that check indexed values are much quicker than non-indexed). If you're searching by a field or joining on a field, it should be indexed.

Data types such as text, integer, date, varchar, etc., are important because they are one way to enforce data integrity. You can try to store the word "cake" as a date all day long, but the database will never accept it. But if you try to store December 5th 2009 in a text field, it will let you. It will also let you store 12/5/2009, 12-05-09, 5 Dec. 2009, etc. So be as specific as possible; only use text or equivalent fields when you're not storing a more specific data type (date, one of the various-length numbers, binary, whatever else is offered). Different fields take up differing amounts of space, which will eventually be important. See this post for an example of when choosing "char" instead of "varchar" caused massive problems.

Never duplicate data. Only store information in one place and one place only. The big conceptual leap to make is understanding that calculated information is not new information. Just keep the basic, raw data in the database, and extrapolate from it in code outside the database. If A and B are sufficient for you to extrapolate C, only store A and B in the database, and calculate C in code from A and B. In the real world, this is sometimes violated in order to increase performance, but generally with some sort of backup plan for when/if things go wonky, and you have to decide whether to trust A and B or trust C.

Redundant information in a database is bad.

When you're making a program, always use parameterized queries. If you concatenate strings in order to construct your query, you're doing it wrong and are almost guaranteeing the existence of security issues. Parameters are variables that the program sees as exactly that, variables. They are automagically exempt from any injection nonsense that someone eventually will try to pull on you. Not using parameterized queries is like driving drunk: reckless and unacceptable. See this Microsoft article for more information about parameters in Access. (In particular, keep reading til you hit the "Create a form that collects parameters" section.)

Often you will find yourself updating multiple interrelated tables using multiple queries. The only safe way to do this is to use transactions. Transactions are groups of queries that are executed in a batch and then, once all is rosy, you can choose to commit the change or roll back to the original, pristine data. Since the changes are only committed after every command is successful, you don't have to worry about losing power (as an example) between queries three and four; if you do, queries 1 through 3 never actually happen.

Access tries to do a lot of things for you, including locking certain tables/fields to stop two people from editing them at the same time. If you have a multi-user database, you will need to use a combination of locking and/or transactions to keep things peachy.

Parameters and transactions are really big deals and you should be using them.

Find some way to automatically back up your database on a daily/weekly/monthly schedule, depending on what makes sense for your needs. If you're on a linux box or Mac PC use a cron job; on windows, schedule a task. Gmail has lots of storage, can handle large attachments, and will almost certainly never lose your data, if you want peace of mind on the cheap and don't have very large databases.

Finally, get an account as stack overflow and use it to ask specific questions about code that isn't working the way you think it should work.

And that's all the advice I can think of.
posted by jsturgill at 9:05 PM on October 4, 2011 [5 favorites]


Try to figure out the basic SQL query structure. I'm not saying you need to use SQL, but it will give you a much better idea of how to think about data sets.
posted by Gilbert at 10:17 PM on October 4, 2011


If you want a free Windows-centric database technology with some real power, you want MS SQL Server Express. It can do everything you need to learn the system. Is it more user-friendly than access? Probably not, but I don't think it can be. Real database systems are far far more complicated and powerful than Access, and all those options and controls need to go somewhere. It is reasonably easy to use, however.

@jsturgill - The idea that all "redundant information in a database is bad" is a bit too dogmatic. True, this is usually the case, but not always. There are many situations where denormalization (that's what it's called when you deliberately duplicate data that could be referenced in another table) is desirable, even necessary. Reporting databases are one common example. Here's one StackOverflow question dealing with denormalization. Here's another one.
posted by Vorteks at 7:43 AM on October 5, 2011


jsturgill is on the money

adding to that...

Sketch out your tables before you make them, figure out how they relate to each other and where there is redundancy. Consider this your normalization process.

Actually type out queries rather than use a query wizard.

Delving into mySQL is not that hard and takes the training wheels off. It also allows you to easily move your database online.
posted by jander03 at 7:49 AM on October 5, 2011


Best answer: I'm currently enrolled in and following the online Stanford Databases course. It's very heavy on the theory of relational databases, but eventually I think it will touch on the practicalities of using SQLite, which is a popular open source (i.e. free) embedded relational database. Moreover, the course is covering XML in quite some detail.

You may want to consider taking the online O'Reilly course in Database Administration (DBA). I haven't taken this course, and it is very expensive, but seems to be a very thorough and interactive class on managing MySQL on Linux. MySQL is a very popular, open source (i.e. free) relational database.

However, I don't think you're looking for this complex a solution. If you want a database system that is as "friendly" as Microsoft Access, consider Filemaker Pro, which is a fluffly and friendly commercial (i.e. not free) cross-platform (i.e. Mac and Windows) relational database.
posted by asymptotic at 7:55 AM on October 5, 2011


Here is the correct link to the O'Reilly course on DBA.
posted by asymptotic at 7:57 AM on October 5, 2011


Response by poster: These are some really great resources. Thank you!
posted by _cave at 6:59 AM on October 6, 2011


« Older Rage or GOW3   |   I love halloween. I hate costumes. Newer »
This thread is closed to new comments.