How much database normalization is too much?
April 25, 2008 3:57 PM
I've listened to the excellent advice in my last question and decided to start over with my web app. I went from three tables to 17, and now I find myself wondering: how much DB normalization is too much?
More specifically, if I have a status field which can contain very few entries (no more than four ever for the rest of this app's life), do I have to pull it out into a separate table?
For example, is this:
Table: course
Fields
courseid (1)
coursename (History)
coursestatus (1)
Table: status
Fields
statusid (1)
status (In Production)
Better than this:
Table: course
Fields
courseid (1)
coursename (History)
coursestatus (In Production)
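(If it helps, here's roughly what I mean in actual DDL - a sketch in MySQL syntax, since that's what I'm on; the column types are just my guesses, and you'd create one course table or the other, not both:)

    -- Option 1: status normalized into its own lookup table
    CREATE TABLE status (
        statusid INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        status   VARCHAR(32) NOT NULL            -- e.g. 'In Production'
    );
    CREATE TABLE course (
        courseid     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        coursename   VARCHAR(64) NOT NULL,       -- e.g. 'History'
        coursestatus INT UNSIGNED NOT NULL,      -- points at status.statusid
        FOREIGN KEY (coursestatus) REFERENCES status (statusid)
    );

    -- Option 2: status stored inline as plain text
    CREATE TABLE course (
        courseid     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        coursename   VARCHAR(64) NOT NULL,
        coursestatus VARCHAR(32) NOT NULL        -- e.g. 'In Production'
    );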
Thanks for your help!
More specifically, if I have a status field which can contain very few entries (no more than four ever for the rest of this app's life), do I have to pull it out into a separate table?
You don't have to, of course, but the reason you would do it doesn't just depend on how many entries there are:
1) Supposing you make a zillion queries on course that do not require knowing the status, it might make some sense to split it off (you're saving yourself from looking up/sending that data when it's not needed).
2) Supposing the few entries you have under status ever change, then normalized is easier to update if you have a zillion courses (yeah, it may just be a simple update, but depending on how many courses you have, it could take significantly longer to update every course record as opposed to just the single record in status).
posted by juv3nal at 4:07 PM on April 25, 2008
Thanks, juv3nal.
So it's typical to separate out the data in a table like I did above? It seems so confusing to me - in one of my examples I have one table with five fields that break out into five separate tables: (summer/fall/winter/spring), (semester 1, semester 2), (Released, not released), etc. I should just get over my perception of this being bad form :)
posted by EduTek at 4:14 PM on April 25, 2008
BTW, the above are kind of just theoretical (hence "zillions"). In practice, I'm guessing it's unlikely to make a noticeable impact either way, but I don't know how big your dataset is or how frequently you expect queries to be run against it.
posted by juv3nal at 4:14 PM on April 25, 2008
or how frequently you expect queries to be run against it.
...and perhaps as importantly, the type of queries that would be most common. The general answer to "How much database normalization is too much?" is to just do it unless you suspect you're paying a performance hit for doing it.
posted by juv3nal at 4:17 PM on April 25, 2008
So if I did decide to put everything back into the original table, is it better to use enum('In Production','Ready','etc....'), or should I just stick with varchar?
posted by EduTek at 4:20 PM on April 25, 2008
Just found the answer. Enum is marginally faster in all cases. See here.
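(For anyone comparing at home, the two column definitions in question would look something like this - MySQL syntax, status values invented:)

    -- enum: the allowed values are baked into the column definition
    ALTER TABLE course MODIFY coursestatus ENUM('In Production','Ready') NOT NULL;

    -- varchar: any string up to the length limit is accepted
    ALTER TABLE course MODIFY coursestatus VARCHAR(32) NOT NULL;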
posted by EduTek at 4:27 PM on April 25, 2008
On preview, a lot of what I was going to say is irrelevant, but this is maybe worth considering:
It does sound like scenario 2 wouldn't apply to the fields you mention (season/semester), right? But who knows. Maybe down the road you'll want to store the first/last day of each semester, in which case the splitting may make more sense. It's more than just what your application does now; it's partly about anticipating future needs.
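Something like this, say (just a sketch, MySQL-ish, names invented):

    CREATE TABLE semester (
        semesterid INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name       VARCHAR(32) NOT NULL,  -- 'Semester 1', 'Semester 2'
        first_day  DATE NOT NULL,         -- the attributes you didn't know
        last_day   DATE NOT NULL          -- you'd need until later
    );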
posted by juv3nal at 4:35 PM on April 25, 2008
In my opinion, normalization is something that is waaay overdone in most web apps today. Its benefits are numerous and varied, but in a lot of cases it's easy to take normalization too far and make your code much more difficult to write, read, and maintain.
This is the reason why major apps have DBAs instead of just letting the developers ham-fist their way through data modeling.
Understand that pulling two variables out of the same table is going to be more efficient than joining two tables to get the same results. So, in many low-use situations, combining tables with 1-to-1 relationships is an okay idea, provided there isn't some external reason to separate them.
posted by toomuchpete at 4:52 PM on April 25, 2008
The answer is: "it depends" - and frankly the only way to find out is to stuff some random data into your structure (build a tool/script to do so), test and then calculate your performance metrics.
Personally, I like very deep normalization. The main system I support is incredibly normalized, even including "metadata" tables for power-user expansion - of course, in previous versions this complicated report building by power-users. What we have now is a tool/utility which generates a "de-normalized" schema, allowing us to have a separate "reporting" database with linked views back to the original database.
It's rather nifty.
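(Conceptually, the generated reporting objects are just flattening views along these lines - a simplified sketch with invented names, not our actual schema:)

    -- reporting sees one wide, pre-joined row per course
    CREATE VIEW reporting_course AS
    SELECT c.courseid,
           c.coursename,
           s.status           -- the human-readable value, already joined in
    FROM   course c
    JOIN   status s ON s.statusid = c.coursestatus;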
posted by jkaczor at 4:55 PM on April 25, 2008
I just had to generate a denormalized table from one of my databases to simplify reports, just like jkaczor said. On the other hand, it's nice to not have to track down various misspellings of what is supposed to be the same thing in a field. Enum works in your case, since you know that the set of possibilities will not change.
posted by ctmf at 5:14 PM on April 25, 2008
I am no expert, but I think--
1. Databases now are pretty forgiving for wasted-space varchars
2. THE FEWER TIMES A (COLUMNNAME) IS USED, THE EASIER MAINTENANCE BECOMES
3. An index will cure most common took-too-long problems, except when
4. A poorly-written query can bring any database to its KNEES!
I am also constitutionally against a table named SUMMER, etc. (unless it really has sums in it).
Just when you think there are only four seasons, you will be hit with six of them.
posted by hexatron at 5:14 PM on April 25, 2008
So it's typical to separate out the data in a table like I did above? It seems so confusing to me - in one of my examples I have one table with five fields that break out into five separate tables: (summer/fall/winter/spring), (semester 1, semester 2), (Released, not released), etc. I should just get over my perception of this being bad form :)
Wow, that is way wrong. Each of those seasons should be rows in one table, with a discriminator column. You would then have a one-to-many relationship between whatever had the five fields and your new 'seasons' table.
Assuming it was a 'course' table that had the five fields, how would you query for, say, any course that had sessions in two or more seasons? With five separate fields, you would need to enumerate every qualifying combination by hand.
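With the normalized layout it's a three-line query (a sketch; table and column names assumed):

    -- one row per (course, season) pairing; season is the discriminator
    CREATE TABLE course_season (
        courseid INT UNSIGNED NOT NULL,
        season   VARCHAR(16) NOT NULL,   -- 'summer', 'fall', 'winter', 'spring'
        PRIMARY KEY (courseid, season)
    );

    -- any course with sessions in two or more seasons
    SELECT courseid
    FROM   course_season
    GROUP  BY courseid
    HAVING COUNT(*) >= 2;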
posted by delmoi at 5:29 PM on April 25, 2008
First, jkaczor is right: it depends (and delmoi is right about cardinality).
Data modeling is all about determining how closely you need to approximate a Platonic truth, and then about how to do it efficiently.
Truth first, efficiency second.
That said, some rules of thumb (it's late, it's been a long week, and I've had a couple of beers, so bear with me if I ramble, please.)
1. An attribute that can have only two values (plus possibly null) can and probably should be modeled as a boolean (bit(1) datatype).
There are many wonderful things that you can do with a value that can only be zero or one: in addition to truth testing (assuming your database supports boolean operators: "create view lolcat as select a.has_caption && a.contains_cat as is_lolcat, a.* from pictures a"), there's cool stuff with aggregation ("select sum(is_lolcat), submitter_id from lolcat group by submitter_id").
2. Avoid enum. MySQL's implementation contains some nasty gotchas. Plus it's just going to make a lookup table that's hidden from you.
3. If an attribute can have three or more values, use a look-up table.
Why? All too often, "only three values" becomes "more than three" in practice. And values that are more than boolean, that is, more than just "true" or "false", tend to have or accumulate additional attributes of their own. Especially if you're just starting your modeling. Having a table allows you to add those attributes. Having only a column does not.
4. And if it turns out things are not so complicated, it's much easier to go from an associated table to an attribute, than from an attribute to a table.
Example: your 'coursestatus' is begging to grow beyond two values: "in production", "produced", "obsolete" (you want to keep historical data indicating this course was offered in 2000 but is no longer offered, right?), etc.
Now imagine you do add "obsolete" as a possible status. Now you're going to want to indicate which course statuses cannot be used for newly created courses. Clearly, the drop-down for a newly created course should not include "obsolete" (but it might include "produced" if we're copying the syllabus and course notes and books from the same course last year, no?). So now we want to add this attribute data to the course status. If we have a table of course statuses, no problem. We just add the column: "alter table course_status add column is_closed boolean".
But if we just have a table of courses, we have to:
1. change a boolean to something else, or change the allowed values in an enum, and
2. add a nasty transitive dependency ("case when course.status = 'obsolete' then true else false end as is_closed").
And with a lookup table, our drop-down selects only statuses where is_closed = false. Note also that our drop-down can display a human-readable course_status.description, something we can't do with an enum, which brings us to rule of thumb
5. A look-up table should probably contain at least (see the sketch at the end of this comment):
an autogenerated synthetic key ("id"),
a unique name (in manifest constant style, so you can make more readable queries, "select * from course a join course_status b on (a.course_status_id = b.id) where b.name = 'INPROD'") and not depend on magic id numbers,
and a human-readable mixed case description ("In Production") (it's easy to make mixed-case upper or lower, impossible to make all upper or lower into mixed).
6. This is likely to be faster than including a string attribute, and more useful than an enum (you can select and display all possible values).
Now you may come back and insist that I'm creating castles in the air or conjuring bogeymen, and course_status can only mean "in production" or "not in production". If so, great, it's a boolean attribute. If it can ever grow beyond two values, or have attributes of its own, it belongs in its own table.
Counter-argument one: but all those tables are confusing! Refutation: "then write a view" (see the sketch below).
Counter-argument two: but it'll be too slow! Partial refutation: prove it, and recall Knuth: "premature optimization is the root of all evil".
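To make rules 3 and 5 concrete, the whole thing as a sketch (MySQL-flavored, all names invented):

    CREATE TABLE course_status (
        id          TINYINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name        VARCHAR(16) NOT NULL UNIQUE,    -- manifest constant: 'INPROD'
        description VARCHAR(64) NOT NULL,           -- human-readable: 'In Production'
        is_closed   BOOLEAN NOT NULL DEFAULT FALSE  -- true for 'OBSOLETE'
    );

    -- and the answer to "all those tables are confusing": hide the join in a view
    CREATE VIEW course_v AS
    SELECT a.*, b.name AS status_name, b.description AS status_description
    FROM   course a
    JOIN   course_status b ON (b.id = a.course_status_id);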
posted by orthogonality at 10:49 PM on April 25, 2008
Anticipated follow-up (from somebody, not the OP): "Can I substitute a code char(1) for the id column in my lookup table?"
Answer: "Yes, if you're supporting a legacy code. If not, avoid this. Get the benefit you're looking for (a smaller key size, which means a smaller index, which means more index entries in memory) by using a smaller integral type as the key. Remember than whatever you do, a key and its foreign key referent need to be of the same type, or it'll cost you. When in doubt, to paraphrase Stroustrup, prefer your database's default autogenned key type."
posted by orthogonality at 11:01 PM on April 25, 2008
Answer: "Yes, if you're supporting a legacy code. If not, avoid this. Get the benefit you're looking for (a smaller key size, which means a smaller index, which means more index entries in memory) by using a smaller integral type as the key. Remember than whatever you do, a key and its foreign key referent need to be of the same type, or it'll cost you. When in doubt, to paraphrase Stroustrup, prefer your database's default autogenned key type."
posted by orthogonality at 11:01 PM on April 25, 2008
Take the below with a grain of salt. I am not highly technical. I do analysis, management, and business development, but I know a lot about table design and software implementation. My perspective is not that of a computer scientist or programmer analyst but that of someone responsible for managing and overseeing implementation projects.
My answer to your question is:
Too much normalization is when the drawbacks start to outweigh the benefits - for example, when users start complaining that they have to look in 6 places to see what was formerly in 1 place, or when queries that were very fast start to take minutes or hours or days.
I work for a company that makes software for large energy companies. When I first started configuring our software, 8 years ago, I learned all about how to create a normalized db and the normal forms. This made my dbs much cleaner and made keeping track of data and data loading much simpler. It did also make data loading from old legacy hierarchical databases difficult in some cases where the data were not normalized. The performance hit was, in most cases, not such a big deal, except for a case in Italy where the client kept track of 7 million meters. We were creating systems for billing thousands or tens of thousands of accounts, getting 1 meter read per month.
Fast forward to today. My clients are larger, with much more complex requirements and, often, much more data. Clients often have hundreds of thousands or millions of meters, sometimes getting as much as 1 read a day, each read including hundreds of values for each 15 minute period during a day. Instead of just billing, they want to analyze the data, figure out if someone is tampering with their meter, apply real-time pricing, do a regression analysis, etc. Much more complicated than 8 years ago. On several recent projects, after designing a normalized db and configuring logic to work with that data structure, we have run performance tests and realized that our processes take far too long to run because every query is hitting 10 tables instead of 1. Our queries and logic are more complex and the data tens or hundreds of times what it was before. I had to rethink my ideas about the usefulness of normalization. I am now not so enthusiastic about a totally normalized db.
The lesson learned here is "consider the complexity of the configuration that will be accessing the database and possible impacts of normalization." When designing, start with a 3NF db and then work backwards as necessary to address usability or performance concerns.
posted by charlesv at 10:06 AM on April 26, 2008
normalise until it hurts, denormalise until it works
posted by The Carolingian at 11:02 AM on April 26, 2008
"How much should I normalize" is really "How much does keeping my schema UML-diagram-perfect cause performance bottlenecks." nthing the others who have said, "It depends." orthogonality's advice is worth its weight in gold.
posted by verb at 11:05 AM on April 26, 2008
charlesv writes "On several recent projects, after designing a normalized db and configuring logic to work with that data structure, we have run performance tests and realized that our processes take far too long to run because every query is hitting 10 tables instead of 1."
Thanks, it's good to see this real-world example. To understand it better, what RDBMS are you using, and how's your index coverage? Are you using a forceplan clause (or similar for non-T-SQL) or views to do grouping first and ref-table joins later? (I'm assuming a lot of your analysis involves aggregate queries.) Any cursors in the queries?
posted by orthogonality at 5:41 PM on April 26, 2008
Orthogonality:
I am not a dba and won't be able to give very satisfactory answers to your questions. I do some configuration but these days mainly manage dbas and other technical types, on whom I depend to know stuff like what you are asking.
We are using Oracle 10g. I couldn't tell you, except at a high level, the details of our indexes, except that they are foreign keys using the number(10) datatype. We are not using any joins. The queries were generated automatically by my firm's proprietary scripting language, which would include cursors. Once the tables were denormalized, the scripts were replaced by stored procedures and we saw big performance improvements.
Yes, we are doing a lot of aggregate queries, like (way oversimplifying here) "look at 1.5 million meters and, aggregating the daily reads, compare each meter's average daily usage for last month to their usage same month last year, then compare their usage two months ago to their usage for that same month last year, but exclude all meters that have had a disconnect, move out or meter swap between the comparison months."
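(The shape of such a query is roughly this - a heavily simplified sketch in Oracle-ish SQL, with all table and column names invented:)

    -- average daily usage per meter: last month vs. the same month last year,
    -- skipping meters with a disconnect, move-out, or meter swap in between
    SELECT r.meter_id,
           AVG(CASE WHEN r.read_date BETWEEN DATE '2008-03-01' AND DATE '2008-03-31'
                    THEN r.usage END) AS avg_last_month,
           AVG(CASE WHEN r.read_date BETWEEN DATE '2007-03-01' AND DATE '2007-03-31'
                    THEN r.usage END) AS avg_same_month_last_year
    FROM   meter_read r
    WHERE  r.meter_id NOT IN (
               SELECT e.meter_id
               FROM   meter_event e
               WHERE  e.event_type IN ('DISCONNECT', 'MOVE_OUT', 'METER_SWAP')
               AND    e.event_date BETWEEN DATE '2007-03-01' AND DATE '2008-03-31')
    GROUP  BY r.meter_id;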
posted by charlesv at 12:34 PM on April 27, 2008
1) Listen to orthogonality.
2) You (the general you) are terrible at determining what will be a performance hit or not. If you don't measure it, it doesn't exist, and don't you forget it. Databases are terrifically complicated beasts with optimizations all over in non-obvious places. And don't forget that they're basically designed to do joins. I've personally gotten an order-of-magnitude speedup by letting the database be a database (JOIN) instead of using it as perl (analyzing a substring of a varchar field mid-query) - see the sketch after this list.
3) Database layout should mirror the nature of the data, even if that's inconvenient. Convenience can be added later with abstraction, either as a view or in the application layer. There are reasons to deviate from this rule but they're pretty rare and are usually good problems to have (too much traffic).
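For the curious, the before/after of that substring-vs-JOIN speedup looked roughly like this (names changed, MySQL-ish sketch):

    -- using the database as perl: parse a packed varchar on every row
    SELECT *
    FROM   orders
    WHERE  SUBSTRING(packed_info, 1, 8) = '00421337';

    -- letting the database be a database: join on a properly keyed table
    SELECT o.*
    FROM   orders o
    JOIN   customers c ON c.id = o.customer_id
    WHERE  c.code = '00421337';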
posted by Skorgu at 7:51 PM on April 27, 2008
* sorry, that should read "We are not using any views," not "We are not using any joins."
posted by charlesv at 8:10 PM on April 27, 2008
This thread is closed to new comments.