MySQL Normalization Question
April 27, 2006 1:41 PM   Subscribe

DBAs: Is this a case where I *should not* normalize my database? (PHP/MySQL 4.0.24 question). Long explanation of the situation inside.

So here's the deal -- I'm (re)writing an app that needs to allow users to build questions which can be later used in forms presented by that application use in forms. Questions that are straightforward ("Fill out this field", "Pick from these choices") are very simple to do. However, it gets a little more complicated when you have questions like "Pick from these choices, if other, please fill out the blank" or "If the answer to question 2 is yes, please answer question 3?" -- ie, dependent questions.

I'm currently storing questions as a single record in the database. When I have multiple choices for each question, I newline-separate them and stick them into a field called "question_data". When I present this question to a user, I simply grab the question (via ID number) and explode/implode the "question_data" field into the proper HTML markup. I use an arbitrary symbol when I need to do something like the "Other" field above (although I don't yet have a way of doing the "If the answer to question two is yes, please answer question 3" but I'm sure i could write one. When I need to perform calculations using this data (ie, how many people answered question 2 with answer B?), I get all the users answers and tally up the occurence of each -- "other" responses are prefixed with an "Other: " string that is trivial to find with PHP.

Now, this solution has worked well so far because 1) doing database dumps of user demographics is pretty straightforward (I can just dump their question field directly into the report) and 2) everything is just in one field, *even* if it is an "Other" answer -- ie, if the user gives a "non-official" answer, then I don't have to go look for it.

I've been trying to read more about database design in my spare time (I am by no means a DBA) and it seems like what I'm doing with the question_data field is a big no-no. I can think of some advantages to making another table and having individual records for each question option available, but it seems like calling everything up to display the question (which is the more important thing for this app) would take longer than the way I'm currently doing things (ie, it'd be a join AND I'd have like, 25 rows to deal with).

So what should I do here? Am I looking at the problem totally wrong? Should I force normalization onto these question options? Is the way I'm using question_options already normalized?

or is this a case where I should just comma-separate the "question_data" field and upgrade to mysql 5 so I can do FIND_IN_SETs?

i realize this is a confusing configuration to explain without pictures -- I'll do my best to fill in any important holes I might have missed.
posted by fishfucker to Computers & Internet (25 answers total)
 
never, ever internally separate a field. Once you know this rule, your database design follows immediately and naturally.
posted by felix at 1:47 PM on April 27, 2006


In the past I've done a hybrid for this kind of thing.
For question data, I'll shove it all in one column, so the potential answers for "What is your favorite color?" are "Red,Blue,Green"
Basically, I think this works out because you have a lot more control over how you structure questions.

I would split out answers, though. So, if the user answers, "Red" AND "Blue" that's two different rows with references to the question.

The JOINs are definitely a pain, but it's worth it. For one thing, it makes it much easier to write queries to look for specific answers.

I've gotten around the excees rows thing in the past by grabing the question data and the answer data in separate queries. Then I dump the answers into collections (an Array in php, I guess), with indexes to the question ID. Then you can loop over questions and just pull out the answers related to that question.

You actually have to be careful of over-normalizing here. If you split out the actual question choices into another table, and then just reference them by ID in the answers, you have to worry about what happens if the choice is changed.
So, a user might have answered "Red", but then you go back and change the question choice to "Bright Red." The user's answer is now automatically changed. That may or may not be what you're after.
posted by frufry at 1:57 PM on April 27, 2006


"it seems like calling everything up to display the question (which is the more important thing for this app) would take longer than the way I'm currently doing things (ie, it'd be a join AND I'd have like, 25 rows to deal with)."

I doubt it would be noticably slower if you have the database and indexes set up correctly. As for looping throught the resulting rowset from a query- how is this any slower than parsing the strings to find the new lines? I bet over all, normalizing would actually make things faster as you're letting the database create indexes and pre-sort things, instead of depending on your applications code (php or whatever) to parse through text strings.

It's been my experience in situations like this that normalizing is a good thing. A sepearte table for the questions from the answers means you do things like add extra answers, or change the sort order of the answers without doing any potentially messy string operations.
posted by gus at 2:01 PM on April 27, 2006


Just make a new table for each survey :)

What's 'better' or 'worse' depends on what you're going to do with this table later on. Are you going to hand this off to users with an editor that lets them write their own questions, or are you just going to write the questions yourself?
posted by delmoi at 2:10 PM on April 27, 2006


I echo the thoughts above regarding the avoiding potentially messy string operations. However I would strongly advise resisting "normalization fever."

Normalization becomes increasingly important (in my opinion) the bigger your databases get and the more simultaneous users you're likely to have -- but, in my opinion, for some smaller apps/databases, the effort required to normalize may not pay off.

What I would be most concerned about is making sure your app/database is intelligible to others (the string thing would definitely lower its intelligibility). Even if you're relatively certain you're the only person who will ever touch the database, I think it's the ethical thing to do to build it so that someone else could take over where you left off without excessive head-scratching. Who knows, tomorrow you might win the lottery . . .
posted by treepour at 2:26 PM on April 27, 2006


Well, I'm leaning towards a separate table now, if that's what most people think makes sense. This table will have a TON of redundancy, though, because there'll be a lot of questions that have a Yes/No answer

What would be the best way, then, to set up dependencies between questions? Would it just be best to set up another table that says question Y relies on question X and only check that during the form validation process (Ie, I'd say, huh, you answered questoin X with this, so I'll see if you answered question Y and store it, otherwise I'll ignore question Y).

building questions is eventually going to be an end-user ability, so I'd like it to be as flexible as possible.

I guess my main worry is that retrieving them from another table is going to be slower than grabbing the single record and parsing it -- maybe I should just run some tests.
posted by fishfucker at 2:43 PM on April 27, 2006


create table question (
question_id int not null primary key,
question_text text
);

create table choices (
choice_id int not null primary key,
question_id int not null,
choice_text text
);

create table answers (
answer_id int not null,
user_id int not null,
question_id int not null,
choice_id int not null,
answer_text text
);

create table question_visibility (
question_id int not null,
choice_id int not null,
dependent_question_id int not null,
visibility_cd int not null
);

create table visibility_code (
visibility_cd int not null,
visibility_desc text
);

insert into visibility_code(visibility_cd,visibility_desc) values (0,'grey out the dependent question');
insert into visibility_code(visibility_cd,visibility_desc) values (1, 'enable the dependent question');
posted by felix at 3:21 PM on April 27, 2006


Normalize it. I've built a couple of apps like this, including a web app that lets users build online forms via drag-and-drop. Send me an email if you want the SQL for the tables we used (SQL Server, but the idea's the same).

The client has just asked us to start recording the form responses in the database. If the options weren't split out, this would be painful. I totally understand why this feels like overkill, but normalize them out.
posted by yerfatma at 3:55 PM on April 27, 2006


felix:

Well, my one problem with the choices table, as outlined above, is that it doesn't lend itself to a ordering -- so I'd have to add an ordering field (ie, the client may want stuff sorted like this : "B Choice", "A Choice", "C Choice") .

i do like how you've handled the question dependency -- that makes a lot of sense.



maybe i shouldn't let this discourage me. I keep going back and forth -- I was planning on normalizing it when I made my last post, and then I started working on it again and it seemed like it was way more tables than were necessary. I guess maybe I'm suffering from the idea that PHP can do operations faster than MySQL (specifically, that I can break down that string into markup faster than I can make a bunch of db pulls and do markup -- i guess one would be an explode/implode and the other a while loop). Which is true?

I really want to do this right, because this is the second time I'm designing the db and I don't want to do it again. The first one was done in a rush, and I knew even less about relational dbs than I do now, and I am not looking forward to converting the information.

While I've got you attention, I'm also wondering what the consensus is on upgrading to 5.0? My boss installed the server we're using, and the install plan he used called for 4.0.24 -- i'm still slightly miffed about it because it means I can't do subselects and whatnot, but I've heard these functions can be kind of a crutch -- is it worth upgrading, or if I design the database correctly will I be just as well off using 4?
posted by fishfucker at 6:19 PM on April 27, 2006


oh, and thanks for all the great answers so far, everybody. I really appreciate your time!
posted by fishfucker at 6:20 PM on April 27, 2006


My boss installed the server we're using, and the install plan he used called for 4.0.24 -- i'm still slightly miffed about it because it means I can't do subselects and whatnot, but I've heard these functions can be kind of a crutch -- is it worth upgrading, or if I design the database correctly will I be just as well off using 4?

You can use 4 just fine, for the simple task you have in mind. However, nothing about subselects/views/triggers/stored procedures is a "crutch". These are features that almost every other major database supports; MySQL was just a couple years behind the curve.
posted by sbutler at 6:59 PM on April 27, 2006


[My dba experience is with SQL Server].

nothing about subselects/views/triggers/stored procedures is a "crutch"

Agreed. Additionally, they often run faster than the equivalent functionality in the application code (especially sp's). (YMMV, depending on your db setup, your facility with SQL, and which programming language is used in the app.)

Normalization is not a goal in itself. You can think of it as a bell curve, where the ideal for any particular app is somewhere in the hazy middle, and not the extreme at either end.
posted by SuperSquirrel at 7:39 PM on April 27, 2006


If you don't want to have to revist the db, then probably some degree of normalization is a good idea. Use common sense to determine how far you should go.

Myself, I find the concept of "band-aids" useful. A band-aid is some odd little tweak that compensates for things not being set up the right way. I would say that your internally-separated field qualifies as one. One of my rules of thumb is to ensure that I'm not relying on band-aids.
posted by treepour at 7:50 PM on April 27, 2006


[if I may add one more observation to this discussion...]

frufry: You actually have to be careful of over-normalizing here. If you split out the actual question choices into another table, and then just reference them by ID in the answers, you have to worry about what happens if the choice is changed.

So, a user might have answered "Red", but then you go back and change the question choice to "Bright Red." The user's answer is now automatically changed. That may or may not be what you're after.


Statistically, this is absolutely, positively the worst way to handle this situation. If you change the choices for a question then you need to retire the question entirely. Changing the options presented to later users statistically invalidates the earlier results. How can you draw a conclusion from the data if the answer set is different?

To code for this situation, I wouldn't even offer users the option of adding/deleting choices. There should be some interface for creating a new question from the old one.
posted by sbutler at 8:00 PM on April 27, 2006


I'm still kind of up in the air about this. Here's a code snippet that shows what I'm doing now to grab questions (I build an array of questions necessary for the form and then pass that throughout the registration process within a session object:

damn. I can't get pre to work.

$query="SELECT qd.*, eq.question_required, eq.question_position,eq.question_match FROM question_data qd, ebook_questions eq WHERE eq.ebook_id='".$ebook_id."' AND qd.id=eq.question_id";
$result= mysql_query($query) OR die(ErrorPage("QuestionSet","Database Error",__LINE__,mysql_error()));
if (!mysql_num_rows($result)>0) {
die(ErrorPage("QuestionSet class","No question data for that Ebook ID",__LINE__,"Ebook ID=".$ebook_id));
}
else {
$this->questions_array=Array();
$this->ebook_id=$ebook_id;
while($row=mysql_fetch_assoc($result)) {
if (!in_array($row["id"],$user_answered_questions)) {
$this->questions_array[$row["id"]]["question_id"]=$row["id"];
$this->questions_array[$row["id"]]["question_title"]=$row["question_title"];
$this->questions_array[$row["id"]]["question_type"]=$row["question_type"];
$this->questions_array[$row["id"]]["question_options"]=$row["question_options"];
$this->questions_array[$row["id"]]["question_family"]=$row["question_family"];
$this->questions_array[$row["id"]]["question_required"]=$row["question_required"];
$this->questions_array[$row["id"]]["question_match"]=$row["question_match"];
$this->questions_array[$row["id"]]["question_position"]=$row["question_position"];
}
//print_r($this->questions_array);
}
}
}


Right now, the "question_options" field is my internally separated field.

If I were to have another table, I'd probably just make another query, pull the question ID and all the options, and then pop them into the array with $this->questions_array[$row["id"]]["question_options"][]=$row["question_option"].

that would be a while loop for each question (there'd probably be from 3-5 per form).

Is this a dumb way to do this? Am I not getting how normalization would be handy because I'm approaching things totally wrong from the code side?
posted by fishfucker at 8:53 PM on April 27, 2006


I'd probably just make another query, pull the question ID and all the options, and then pop them into the array

You could do it all in one query (I think), but it would require some gymnastics and carrying some redundant data in each row. So two queries probably makes more sense. However, dumping the answers into your questions array seems like one of those things we do to make things a bit easier to parse in our heads. If you make the second query get all questions for the survey and outer join user responses, so you (basically) wind up with a hash table of questions and responses, you could just look up user answers in the loop you're outputting the questions in.

Does that make sense or did I miss the point entirely?

As someone with a tendency to do the exact same thing, don't do SELECT *, esp. if you're worried about performance: the database has to go out and do another query to get the names of the fields in the table. It's more to type, but faster and clearer about what you're doing.
posted by yerfatma at 5:52 AM on April 28, 2006


I did a survey builder app recently, and normalized the crap out of it. Trust me, it's worth it. You wind up with a bunch of tables:

- Question
- QuestionType
- QuestionOption (possible choices for multiple choice)
- Survey
- SurveyQuestion (allows you to re-use questions across different surveys)
- SurveyQuestionAnswer

To code for this situation, I wouldn't even offer users the option of adding/deleting choices. There should be some interface for creating a new question from the old one.


This is an excellent point. You want your business logic to prevent deleting/editing a question that's been attached to a survey (which is easy enough to check via SurveyQuestion).

As for ordering, just use a "Rank" column in SurveyQuestion and QuestionOption.
posted by mkultra at 7:09 AM on April 28, 2006


If you make the second query get all questions for the survey and outer join user responses, so you (basically) wind up with a hash table of questions and responses, you could just look up user answers in the loop you're outputting the questions in.

I think I'm doing it this way because otherwise I'd have a db call every time a select/radio question came up, rather than just one single db call that pulled ALL the questions from the database. I am however, sticking this bad boy into a session object -- still trying to figure out if that is evil or not. So far I've read a lot about ASP sessions being nasty, but i can't find as much information on PHP sessions, so I'm not sure whether it would be faster to rebuild this array every time or store it as a session variable.

mkultra:

Do you mind if I ask why you chose to use another table for the QuestionType? I've just got that sitting in my Question table -- and then I do a nasty switch in an object that produces the question which decides how to format it. I'm wondering if there's something I'm missing with my QuestionTypes.
posted by fishfucker at 9:53 AM on April 28, 2006


Depending on what you're doing, it can be as little as being able to assign a text label to each type, so you can easily manage "Add a new [QuestionType] Question -->" functionality.

I actually went a step further, creating an AnswerDisplayType table, which is what a Question actually links to, and which has a one-to-many relationship with QuestionType. Radio buttons and drop down menus are the same Type of question, just with different display modes. Ditto single-line text boxes and multi-line text boxes. You can also go nuts with different display rules for different media (HTML, WAP, etc.) without affecting the basic logic of how a questionnaire fits together.
posted by mkultra at 10:25 AM on April 28, 2006


oh wow. that's a great idea. I was wondering how to make the building of radio/selects a little more elegant.
posted by fishfucker at 10:28 AM on April 28, 2006


fishfucker-san, do not forget one of the key tenets of app design- keep your business logic separate from your display logic! :)
posted by mkultra at 10:59 AM on April 28, 2006


what exactly does that mean? I've seen it a lot in one of the MySQL books I'm reading (The Art Of SQL), but the author doesn't explain what defines each sphere (presumedly, the reader should already know that information).
posted by fishfucker at 1:28 PM on April 28, 2006


Wikipedia on 3-Tier Architecture

The relevant point here is that your application code should just know how to get a requested questionnaire form, with all the questions in the right order. Each piece of it just needs some kind of identifier. Your UI (page template) should contain the code needed to "write" each component based on that identifier.

Ideally, you can change each of the following with affecting the other two:

- The data in a specific questionnaire
- Rules governing how to put a questionnaire together (which questions, which order, etc.)
- How to draw each element

In PHP, this all gets very cloudy, as the application code (PHP) and display code (HTML) all sit on the same page.
posted by mkultra at 1:39 PM on April 28, 2006


ah. I guess I when I saw the phrase "business logic" I immediately thought of something that, well, would be decided by MBAs.

thanks for all your help mkultra.
posted by fishfucker at 1:48 PM on April 28, 2006


If you're on PHP5 (which I've never gotten to play with), there are a few MVC frameworks out there. Here are two approaches.
posted by yerfatma at 1:49 PM on April 28, 2006


« Older How to get NFL draft day text message updates.   |   How to identifying blank images? Newer »
This thread is closed to new comments.