What blogs/books should I read before creating an archival database?
March 29, 2010 6:39 PM   Subscribe

I'm making an archival database (in Excel) with about 10,000 rows and 6-7 columns for each row. I'm doing this so the information will be easy to cross-reference and search. I've never done anything like this before and am looking for books, blogs, etc. for tips, suggestions, and anything that will make my life and this project easier.
posted by melodykramer to Computers & Internet (16 answers total) 6 users marked this as a favorite
Basically there are 10,000 separate data points and they've never been cataloged in any way. So I'd like for them to be searchable by type, date, etc. etc. and was wondering if there's anything I should read before I get started.
posted by melodykramer at 6:39 PM on March 29, 2010

online or within excel? If it is within excel read up on pivot tables.
posted by birdherder at 6:53 PM on March 29, 2010

You should read up on basic database principles, like minimizing how much data is in each cell (eg. keep personal names separate from family names, etc), and having a unique entry ID field.

But spreadsheets are great for designing databases as you go, because they are so flexible. You can later query them easily with Access (that's what I use), or a pivot table.
posted by jb at 7:06 PM on March 29, 2010

Pivot tables can be nice, but sometimes they're a pain in the ass too - especially if you want to have functions that are operating on the results of a live pivot table.

Don't underestimate the strength of Excel's summary and cross-reference functions like sumif(), countif(), vlookup(), index() and match() [and sumifs() and countifs() if you're using Excel 2007]. You can actually manipulate and analyze data in some surprisingly complicated ways using just those functions in combination.

And yeah, jb is definitely right that reading on basic principles of database design will give you a useful foundation even if you never actually end up loading your data into a relational database. Also, learning a little bit of VBA would be useful if you want to put together some kind of a front-end form for users to perform routine operations on the data, but Excel is not a great solution for that kind of thing unless your needs are very simple.
posted by strangely stunted trees at 7:34 PM on March 29, 2010 [1 favorite]

I'm not even sure if I'm aware of my needs at this point -- I think that this project will create project after project after the data is loaded.

Can you recommend any books for basic principles of database design? I took AP C++ as a senior in high school, but it's been some years since I programmed anything.
posted by melodykramer at 7:40 PM on March 29, 2010

In my opinion you should spend your time learning how to use a database instead.

Even a sorry excuse for a database like Access will in the long run be orders of magnitude better than keeping the information in Excel. If you prefer, you could learn something approaching a real relational database system, such as MySQL or (my personal preference) PostgreSQL. Both have freely available Windows and Mac versions suitable for use on your current desktop.

As for books on the subject of data modeling, I'd recommend The Data Model Resource Book by Silverston, Data Model Patterns by Hay, The Data Modeling Handbook by Reingruber & Gregory, and Data and Databases: Concepts in Practice by Celko.
posted by ob1quixote at 8:01 PM on March 29, 2010 [3 favorites]

Each member of the Microsoft Office Suite tackles a specific kind of task. Word is for word processing, Outlook for e-mail & task tracking, Power Point for presentation creation. Excel is a wonderful program, but I think you're getting off on the wrong foot: Its fundamental focus is spreadsheet calculation, such as is generally found in accounting.

The reason people keep recommending Access over Excel (for all that the two are compatible) is that Access is specifically designed as a home/office solution for small database needs. 10,000 is *well* within Access' capabilities, and the program is specifically geared to support table cross-referencing, data grouping, and complex search functions. And while SQL and VBA are both useful for refining Access databases, the visual interface allows you to set up fairly complicated systems with little to *no* programming.

I'm not fabulous at Excel, so not much help to offer there. If you'd like to talk about Access and how that could work, feel free to me-mail me.
posted by Ys at 8:24 PM on March 29, 2010 [1 favorite]

Even a sorry excuse for a database like Access will in the long run be orders of magnitude better than keeping the information in Excel.

That's perfectly true from a technical perspective, but pragmatically it can be sort of a dangerous thing to do if you're in a small organization that doesn't have a lot of people that are strong at business and database analysis. The risk is that you end up with a nice database, but you're the only one there that can maintain it or knows anything about how it works.

Congratulations! Now running the damn thing is part of your job description forever. If that's what you want then fine, but sometimes just running the data in Excel is better part of valor. Of course if your Excel project gets complicated enough, that can happen anyway.
posted by strangely stunted trees at 8:31 PM on March 29, 2010 [1 favorite]

Listen to ob1quixote, the idea of using Excel for a 10,000-row database makes me cringe, and I'm not one to shy away from the freaky stuff.

In particular, you are not going to have any data validation, and Excel has all sorts of helpful functionality like autocorrecting stuff or interpreting dates and numbers that is going to trip you constantly. Not to mention the lack of actual database features like enforcing relationship consistency among records or mandatory fields.

You might think Excel is an easier solution because people are more familiar with it, like strangely stunted trees says, but this is the bad kind of easy: Everyone will think they can work with your data and, with Excel not having a good way to implement safeguards against user stupidity ( because it was not designed to, not because it's a bad program), people are going to keep fucking up with your data, and come to you for help. As a bonus, you will have zero tools at your disposal to solve the kind of problems that are bound to come up and will have to resort to stuff like manual data validation or VBA black arts.

It's true that many people work this way, and also that people who know enough VBA to get themselves in trouble have built consulting careers around supporting this kind of insanity, but you need to think carefully before going down this road.
posted by Dr Dracator at 10:36 PM on March 29, 2010 [1 favorite]

ob1quixote: “In my opinion you should spend your time learning how to use a database instead.”

I agree, but I disagree about how.

melodykramer: “Basically there are 10,000 separate data points and they've never been cataloged in any way. So I'd like for them to be searchable by type, date, etc. etc. and was wondering if there's anything I should read before I get started.”

No. You can read after you get started. Learn by doing.

The best database to start with is SQLite. It's small, flexible, and modular; it exists as a file format rather than a database server, so you'll find it much, much easier to use. Moreover, SQLite is suitable for just about everything I imagine most people will ever need. When you need a production database that several people will need to consult many times every day, learn a bit more about PostGreSQL. Until then, stick with SQLite and you will go farther than you might expect.

This is what I think you should do:


(1) Back up your whole database in a different folder. Just in case, y'know?

(2) Save your spreadsheet as a comma-separated value (CSV) file. (Here's a bonus tip: you should almost always save large, database-like spreadsheets as CSV files. That way, you can always open your file, even if you're forced to open it in Notepad.)

(3) Download and install Firefox if you don't have it already; then, install this handy add-on for Firefox called SQLite manager. It's a front-end for SQLite databases that exists inside Firefox.

(4) Once you've restarted Firefox with that add-on installed, go to Tools -> SQLite Manager to start it up. A window will come up with the Manager front-end in it. There it is - your new friend!

(5) Select Database -> New Database and pick a name and location for your new database. Once you've done that, select Database -> Import, and the Import Wizard will come up. Click Select File and pick the CSV file you created earlier.

And - voila! Now your spreadsheet has become a working database. Feel free to play with the various features. You will discover most handily, first off, that the general contents of the database are reflected in the Structure and the Browse & Search tabs. (Notice whether your table was oriented properly for the import - you might have to go back and fix the CSV file and then re-import it.) Get friendly with them and notice how easily manipulable they are, and particularly how useful the Search mechanism can be when you know what you're doing. Notice the (initially very scary) Execute SQL tab - and, when you're feeling froggy there, jump right in and try a SELECT or a JOIN or two. When you're ready to be executing code like that, I highly recommend A Gentle Introduction to SQL, a very good interactive web tutorial that shows you in a blunt and simple way how SQL works in nearly every vernacular.

I say all this because in my experience, with all things software, it's best to just dive in. You can read all day long, and (especially if you're like me) if you keep trying to get all the concepts you might never actually do anything. Starting right away is usually the best course of action, particularly with databases.
posted by koeselitz at 12:22 AM on March 30, 2010 [12 favorites]

Of COURSE I missed the link. Had to have a mistake in there somewhere. Here is that Firefox Add-on, SQLite Manager.
posted by koeselitz at 12:25 AM on March 30, 2010

I'd just like to add that 1) "I Am Not Your DBA", but I am speaking as cynical mercenary who makes his living turning the over-large out-of-control spreadsheets people insist on running their businesses off of into real databases, and 2) koeselitz's is the best answer.
posted by ob1quixote at 3:36 AM on March 30, 2010

Thanks everyone. I will install SQLite and get to work.

I'm not exactly sure if anyone else will be using this at the moment -- actually, I can probably say definitely not.....but I a) need to have some kind of workable system for myself in place for easy cross-referencing and b) would like to be able to give it to anyone who needs it in the future.
posted by melodykramer at 3:47 AM on March 30, 2010

For the record, using Excel as a database is one of the primary reason why there is so much shitty data in organizations today.
posted by jasondigitized at 5:17 AM on March 30, 2010 [1 favorite]

There are free open source databases for archives already that you could perhaps use, the most prominent being the Archivists Toolkit. At the very least, you may wish to see how they set things up. You can use it with MySQL as the backend database. Seems silly to recreate the wheel if you don't have to.
posted by kaybdc at 8:01 AM on March 30, 2010 [1 favorite]

Ignore the spreadsheet haters.

I happily use spreadsheets to create 10,000 record databases, or 14,000 row databases. Because data entry into Access, even with a form, is a bugger compared to the simplicity of data entry into a spreadsheet. People who prefer "real" databases often have data entry monkeys (like me) to put all the data in and/or spend ages making forms (so not worth the time until you are into millions of records). I use Access to query the spreadsheets or, if I absolutely need to link them (not necessary for a simple one table database), import them into Access after I have finished finished the data entry.

Spreadsheets are especially good if you have to design your database on the fly. This is bad database policy and generally to be avoided. But in historical research, it's inevitable -- the data never does what you expect it to because you don't know what you're going to find. Spreadsheet based databases are more forgiving for re-arranging columns, changing type (from number to text and/or back again), etc.

SQL-lite sounds like a great program - I'll check it out because I'd love to finally ditch MS Office entirely (love my OpenOffice Calc). But if the data-entry and re-arrange possibilities are still as crippled as Access, it would still be easier to create a database in a spreadsheet.

As for shitty databases -- the shittiest database I've ever worked with was designed by IT people with all their "real" databasing skills. But it was just designed WRONG (wrong principles of organization - they want to track organizations, and have people as a subset, but the people have ended up being the primary organization), and now it can't be changed at all. My spreadsheets are way cleaner.


That said, I just noticed on re-reading your question: by "archival database", do you mean an archival catalogue or finding aid for manuscripts and assorted records? Because those can be a bugger to put into database form, which is why even some major archives (like the British Library) continue to use searchable finding aids rather than tabular databases. You should check out what archives that hold materials like yours have chosen to do -- specialised archival software might also be helpful.
posted by jb at 9:36 AM on March 31, 2010

« Older Unintended Discontinuation Syndrome...thanks doc!   |   Frame my ethical dilemma. Newer »
This thread is closed to new comments.