How do I format raw data that will eventually be part of a database?
December 10, 2006 2:16 PM   Subscribe

Complete database newbie. I have raw information in the form of text, numbers and photos. This stuff will eventually be fed to users on a web site. I am in the information-collection phase now. What's the best way to collect and format the data so it has the easiest possible path into a database that will serve the web site?

Just so you know, I'm using a Mac to collect and store the data. I have text in the form of single words and/or numbers and text in sentence and paragraph form. I also have photos. Should I just be using something dumb like Word to store the text? Or maybe something more database-y like FileMaker Pro? Do I need to tag it or name variables associated with the text? How do I format it? I want to avoid, if at all possible entering this data once when I collect it and then again when it has to go into a web-serving database.
posted by crosten to Computers & Internet (9 answers total) 1 user marked this as a favorite
 
Put the data into a spreadsheet such as Microsoft Excel, which can easily output into a standard CSV (comma separated values) format, which can then be parsed into an SQL database via a very simple script.
posted by The Confessor at 2:21 PM on December 10, 2006


I'd agree with the Confessor.

If you don't have Excel (or a Mac equivalent), then Google Docs will also do the job. You can save these as CSV files, but you can also allow others to edit the files as well thus saving you time.

The photos should be named as a unique number for each piece of information (maybe just an ID number), and stored somewhere.
posted by jamescridland at 2:36 PM on December 10, 2006


A simple spreadsheet might be fine, but it's worth consulting someone with some experience of database design to avoid any problems (if you can fully explain what you're collecting and how you'll want to use it then it'll only take minutes to devise a plan).

It's great that you're thinking about data quality; most people just leap straight in and either don't break data into sensible, consistent chunks (e.g. separate address fields), or don't consider relationships between entities involved (if you find yourself repeating the same data in different places then something's wrong).
posted by malevolent at 3:11 PM on December 10, 2006


I second CSV. Comma-delimited, quoted values. Throw a timestamp on there for good measure. Maybe IP address. There may be a time where you want to correlate with the server logs, you know?
posted by adipocere at 3:29 PM on December 10, 2006


Any regular format (i.e. any format that follows consistent rules) can be trivially imported into any database with some very simple scripting. It only takes a line or two of perl to parse tab or comma delimited text files. So it doesn't really matter how you format it, as long as it's regular. Now the fact that you may not be a programmer does make it a little bit more difficult, but the suggestion to use a spreadsheet is good as that can output CSV which is easy to import into just about anything. Using Filemaker would be overkill if that's not your actual endpoint for the data, although it would certainly be easy to move around once it's in a database format.

Now, what I think some commenters have hinted at is that the actual database schema that you will use when you go to create this web app will be non-trivial. This involves things like identifying one-to-one, one-to-many, many-to-one, and many-to-many relationships between entities, assigning unique IDs, deciding on field widths and datetime formats, picking sensible column names, normalization, determining what to include in indices, etc. This is where the actual effort will be spent, not on figuring out how to parse tab or comma separated text files, as that's trivial.
posted by Rhomboid at 5:30 PM on December 10, 2006


Most folks find a spreadsheet to be an quick easy way to collect data in an organized, methodical way. Spreadsheets are generally the first step to creating data tables and basic databases.

I'd suggest that you jump straight to FileMaker Pro and start with a database that gives you easy to use tools to format fields. Because FileMaker is made user friendly I think that you'll find it easier and less intimidating than going through multiple steps. Using FileMaker doesn't require much training. Take a look at the Filemaker home page. You can download the trial version and the business service templates for free. You can import your existing data to Filemaker or export the data from Filemaker back to spreadsheets or into Access or another database program. Have a look, try it and if it doesn't suit your needs you haven't spent any money and only invested a little time.
posted by X4ster at 8:53 PM on December 10, 2006


crosten, I hope you don't mind, but I have a few follow up questions that might help here (I too am interested in the 'building a database' bit).

1) I know Word is notorious for adding junk when you use it for anything HTML related (i.e. you have to clean it up after you import it into Dreamweaver etc.); does anyone know if Excel adds any junk to the files? (Don't know if the junk in Word comes from people formatting fonts etc.)

2) Picking up on Rhombold's point, how does database software allow you to set details in the schema like relationships? What's the next step after data in the Excel spreadsheet - straight into an Access database?

3) How do people pick which database language they're going to use (i.e. .asp vs. .cfm etc.)? is it a case of just what infrastructure people have (i.e Microsoft, Adobe, etc.?) How do 'web frameworks' fit into this

any answer will help... thanks and apologies crosten if this is hijacking your thread. No harm intended.
posted by rmm at 9:09 PM on December 10, 2006


does anyone know if Excel adds any junk to the files

As others have noted, you will want to "Save As..." as comma-separated value (CSV) file. Note that Excel will perform some fiddling with the values, such as added quotes around values that contain embedded commas. (Annoyingly, it also does this in tab-separated value files, when it's obviously unnecessary.)

Still, CSV is a standard flat-file formatting style, and nearly any database program will be able to make sense of it.

how does database software allow you to set details in the schema like relationships?

This is where database design comes in. You might want to look into an explanation of third normal form, which basically seeks to reduce redundancy in a database layout.

straight into an Access database?

Access can readily import a data from a CSV, TSV or Excel file. You might consider going Excel > Access directly, and allow Access to glean the column names from the header row of the Excel file.

Two more points: (1) You will save yourself a lot of trouble if you design your database before you collect the data. A little thought up front will save you a lot of anguish in the long run. (2) Be prepared to throw away your initial database design and start again from scratch. In any non-trivial database, no one gets every detail correct the first time. Be prepared to go through a couple of iterations before you arrive at what you need.
posted by SPrintF at 9:55 PM on December 10, 2006


rmm

SPrintF is correct about the design of most first-time databases. I'd actually recommend starting with a small (~50 members) subset of the data, and ensuring that your design is applicable to all of your initial intended uses before entering hundreds or thousands of items worth of data.
posted by The Confessor at 5:46 AM on December 11, 2006


« Older More of that name that tune!   |   But I still sleep in Superman boxers! Newer »
This thread is closed to new comments.