SQL Query Question
April 16, 2009 6:10 AM   RSS feed for this thread Subscribe

SQL query help. Given the (fictional but equivalent) table disk_status = computer_id:int, disk_id:int, datetime:datetime, temperature:float, rpm:int, I need to provide a table of disk status for a given computer_id, without knowing in advance how many disks there are. So if there were two disks for a particular computer, the result would have the header: datetime, disk_1_temperature, disk_1_rpm, disk_2_temperature, disk_2_rpm, disk_2_rpm

Long story short (and continuing with the computer/disk example) in the past we only ever needed to track one disk per computer, and so we did not use a disk_id at all. When faced with multiple disks, I considered adding more columns (disk_2_temp, etc) but there may be many, and that seemed like a hack. Instead I added the disk_id column, but now I need a way to get the data out in the right format.

Currently I do two queries: one to determine which disks are available, and then I left join the disk_status table to itself once for each disk:

SELECT disk_1.*, disk2.*
FROM disk_status AS disk_1
LEFT JOIN disk_status AS disk_2
ON disk_2.computer_id=x
AND disk_2.disk_id=z
AND disk_2.datetime = disk_1.datetime
LEFT JOIN disk_status AS disk_3
...
WHERE disk_1.computer_id = x
AND disk_1.disk_id = y

This seems inefficient. It also fails if disk_1 does not report but disk_2 does (disk_2 gets left out too). Is there a better way? Or is there a better way to structure my table?
posted by Nothing to computers & internet (25 comments total)
I'm confused why you can't just SELECT * FROM disk_status WHERE computer_id=? and let whatever program you're using to talk to the database loop over the results. In other words, "a way to get the data out in the right format" seems like the responsibility of the script querying the db, not the db itself, and you've given us no info about the environment. handling input and output
posted by and hosted from Uranus at 6:29 AM on April 16


You need another table, "disks" instead of using additional columns. Any time you're naming columns like "column1, column2, column3" you're misusing a database and need to create another table where those 1-3 items become rows in that table.
posted by odinsdream at 6:29 AM on April 16


Following on odinsdream, I assumed disk_id is the primary key for disk_status and computer_id is a foreign key to another table. The order of the fictional columns seems to contradict this assumption, however. If computer_id is currently the primary key for disk_status, then I agree the schema needs some general reworking.
posted by and hosted from Uranus at 6:35 AM on April 16


Something like:

Table "disk_status":Table "computer" Select ds.disk_number, ds.temp, ds.rpm, comp.computer_name From computer As comp Join disk_status as ds On comp.disk_id = ds.id
posted by odinsdream at 6:45 AM on April 16


By format, I mean the result columns I want from the database are a date, followed by all the readings from a given computer on that date. I am nearly certain it is more efficient to achieve this in the query than performing a join in code.

I may have been unclear. I am not naming columns like "column1", "column2". If I had done so, I would not have this problem (though I would have others). My issue is finding a more efficient way to build the desired output relation, or a better way to design the base tables.

You say I need another table disks. How would I model that, given that disk_id is only unique wuthin the scope of compouter_id? I see a few options, but all of them seem to leave me with the same problem.
posted by Nothing at 6:46 AM on April 16


Actually flip that around maybe, putting a computer_id into the disk_status table, rather than a disk_id in the computer table.
posted by odinsdream at 6:49 AM on April 16


Sorry, the primary key is [computer_id, disk_id, datetime]. computer_id is also a foreign key.
posted by Nothing at 6:49 AM on April 16


And in case it is not clear, this table contains a history of all readings.
posted by Nothing at 6:51 AM on April 16


By format, I mean the result columns I want from the database are a date, followed by all the readings from a given computer on that date.

This is your problem. The results from a database should never be "horizontal" if there is more than one result. They should always appear as separate rows if you've designed the tables appropriately. I would suggest your goal should be an output along these lines:

Computer, Disk, Temp, RPM
Comp1, 1, 20, 7400
Comp1, 2, 24, 5400
Comp1, 3, 22, 5400
Comp2, 1, 22, 7400

Rather than what it seems like you're asking for:
Computer, Disk1 Temp, Disk1 RPM, Disk2 Temp, Disk2 RPM, Disk3 Temp, Disk3 RPM
Comp1, 20, 7400, 24, 5400, NULL, NULL
Comp2, 22, 7400, NULL, NULL, NULL, NULL

You can see that the second type of output is very restrictive because you need additional columns to represent additional data points. You are limited to a number of disks between 1 and the number of columns you've created, whether they're really there or "virtually" in your query.

The first type of output is flexible from 1 disk to N disks, where N is a very large number, and you don't have to rewrite your queries or account for nulls when parsing your output.
posted by odinsdream at 6:55 AM on April 16


I think I have done a poor job explaining. Sorry for the multiple rounds here. There will never be a case where I am returning results from multiple computers. The computer_id will always be a where condition.

Your suggested output is what I get now if I simply do SELECT * FROM disk_status WHERE datetime. I am not seeking to change this structure unless absolutely necessary.

What I need to generate is, for a given computer, the datetime and each disk*parameter value taken at that time. The query I posted does that, but I have the feeling it is not optimal.

To put it another way, at a given time, a disk has properties x and y. At a given time, a computer with disks a and b has properties a.x, a.x, b.x, and b.y. This is the result I want.
posted by Nothing at 7:22 AM on April 16


I think at this point it would be beneficial for you to post the actual schema you're using. Try PasteBin and post the link here.
posted by odinsdream at 7:42 AM on April 16


That is the actual schema, only the names are changed. Only one table is involved in the query. That table has the structure:

CREATE TABLE `disk_status` (
`computer_id` int NOT NULL DEFAULT 0,
`disk_id` int NOT NULL DEFAULT 0,
`datetime` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`temperature` double NOT NULL DEFAULT '00:00:00',
`rpm` int NOT NULL DEFAULT '',
PRIMARY KEY (`computer_id`,`disk_id`,`datetime`)
)
posted by Nothing at 7:51 AM on April 16


Obviously those are the wrong defaults. Copy paste error.
posted by Nothing at 7:52 AM on April 16


Do I understand you right? You want some SQL that will return a variable number of columns depending on how many disks that computer has. So query on computer A returns disk_1_temp, disk_1_rpm, but on computer B returns disk_1_temp, disk_1_rpm, disk_2_temp, disk_2_rpm and computer C might return all the way up to disk_5_temp etc.?

I don't think you can do that in ordinary SQL. A simple SQL query can only return a fixed set of columns. You might have to generate the SQL dynamically. Alternatively concatenate the values into a single string using a function - this might be alright if you only need to view the results from a terminal session, but not if you intend to do anything more with it.
posted by Electric Dragon at 8:02 AM on April 16


SELECT disk_id, datetime, temperature, rpm FROM disk_status WHERE computer_id = 1 ORDER BY disk_id;

Does that not give you what you need? If not, why not?
posted by odinsdream at 8:03 AM on April 16


The sql would be (is) generated dynamically.

Okay, real world example query. I want to generate a graph showing the temperature over time of all the disks attached to the system. Currently I would do this:

SELECT d1.`datetime`, d1.`temperature`, d2.`temperature`, d3.`temperature`
FROM `disk_status` AS 'd1'
LEFT_JOIN `disk_status` AS 'd2'
   ON d2.`datetime` = d1.`datetime`
   AND d2.`computer_id`=1
   AND d2.`disk_id` = 2
LEFT_JOIN `disk_status` AS 'd3'
   ON d3.`datetime` = d1.datetime`
   AND d3.`computer_id`=1
   AND d3.`disk_id` = 3
WHERE d1.`computer_id` = 1
   d1.`disk_id` = 1

(or I would do three separate queries)

I thought this seemed sub-optimal. I am guessing now it actually might be the best solution given the data.
posted by Nothing at 8:25 AM on April 16


It isn't clear why the results from the simpler query I just gave wouldn't be usable in a graph. Excel can certainly make graphs of such results using the Disk column for individual series.
posted by odinsdream at 8:27 AM on April 16


It is sub-optimal. Compare query profiles of the SELECT with multiple JOINS against the single SELECT. It sounds like you want to use SQL results as a report, which isn't what they're designed for. SQL results are designed to be fed into a report system if you need it to "look" a certain way.
posted by odinsdream at 8:38 AM on April 16


I re-created your table structure (which looks fine to me) and tried to see if I could find a way to do this. I used a temporary table containing all unique computer IDs and date/times, then used subqueries to extract the data for each disk number into separate columns. I actually mis-read your question and produced something that gives a table for all computer IDs but it's easy enough to restrict this with a where clause in the final query or to redo it slightly so that only one computer's worth of data goes into the temp table at all.

Anyway, with this approach you will have to specify in advance the maximum number of disks any computer might have (I assumed 4 -- some cut-and-paste can extend this easily). For any disk that does not report or does not exist in that computer, you get a NULL.

In case you are wondering I used 'dt' instead of 'datetime' for one field name in the table... field names the same as data types always make me nervous. ;)

Here's what I came up with:



create table #temp_table
(
dt datetime,
computer_id int
)

insert into #temp_table
select distinct dt,computer_id from disk_status

select
tt.dt,
tt.computer_id,
(select d1.temperature from disk_status as d1 where d1.dt=tt.dt and d1.computer_id=tt.computer_id and d1.disk_id=1 ) as disk1_temp,
(select d2.temperature from disk_status as d2 where d2.dt=tt.dt and d2.computer_id=tt.computer_id and d2.disk_id=2 ) as disk2_temp,
(select d3.temperature from disk_status as d3 where d3.dt=tt.dt and d3.computer_id=tt.computer_id and d3.disk_id=3 ) as disk3_temp,
(select d4.temperature from disk_status as d4 where d4.dt=tt.dt and d4.computer_id=tt.computer_id and d4.disk_id=4 ) as disk4_temp
from #temp_table as tt
order by tt.computer_id,tt.dt

drop table #temp_table


posted by FishBike at 8:54 AM on April 16


Maybe I am thinking about this wrong, but if a computer has properties "disk 1 temperature" and "disk two temperature" at a particular time, the correct tuple is [computer, time, disk1temp, disk2temp]. By splitting that into two rows of [computer, disk, time, temp], I lose the mapping to the real world object, and some operations become more difficult (show me all disk temps where disk1 has not reported in 12 hours).

The difficulty is that my table does not model computers. It models disks. I did this to avoid having many columns of the form "disk_1, disk_2" etc. Perhaps this is the mistake.
posted by Nothing at 9:17 AM on April 16


Nothing:

You're misusing SQL. Use a client/front-end/reporting tool to massage the data to your specification. But the SQL resultset should be one row per entity, and the entity is disk.

No, you do not want multiple columns of xxx1 and xx2 and xxx3 in your database.

odinsdream has given you the right advice.
posted by orthogonality at 9:27 AM on April 16


Okay then. The entity is disk. That was, essentially my problem. Because there was only ever one disk per computer prior to this change, I worked with the assumption that the entity was computer.
posted by Nothing at 9:41 AM on April 16


I agree with everyone else. That said, I guess I'm willing to lay out some Bad advice as well.

OK, so first of all, you could do the Right thing, which is what everyone is telling you to do: output one tuple for each sensor reading or hard drive reading. For example, the former would be something like [datetime, computer_id, disk_id, sensor_reading_type, sensor_reading] and the latter would be [datetime, computer_id, disk_id, temperature, rpm]. You could then do a select on computer_id, order by datetime, and then manually group sets of consecutive datetime readings into a single row when you do post processing in whatever language (perl, python, C, whatever). We all think this is the right move because it makes it easier to do queries about sensor readings or disk statuses, but it does mean you have to postprocess in another language.

(Relatedly, it may be the case that you can do the equivalent of this without leaving your database, if your database can be extended with embedded code of some sort.)

Another option, if you have a real SQL99 implementation, which I'm guessing you don't (who does, really), is you can go Object-Relational or make a user defined type. Basically, rather than each attribute being a regular attribute, I think it can be things like a Vector or an Object. This would mean that you could have a relation which looked like [datetime, computer_id, Vector([(diskrpm, disktemp), (diskrpm, disktemp)...]). Personally, I get the impression that Object-Relational was motivated by a bunch of OO people who didn't understand the relational model in the first place, though interestingly, it might solve your current problem. (That said, I've never used it so I can't vouch for it.)

Another option is to do what you're proposing: a series of nasty self joins that are probably likely to be slow if you have a lot of readings (do make sure to index on datetime, at least). There are a couple of ways you could do this: (1) you could do a big join with one select per potential disk (2) you could do (1), but put the select in a VIEW, which would at least mean you didn't have to type it all the time or (3) you could do (2), but you could make it a MATERIALIZED VIEW, assuming your SQL implementation supports such things efficiently, in which case it might even not be really slow. All of these options have the advantage that they solve your actual problem (you want basically tabular data of the style that you could throw into gnuplot or R or whatever) and they don't make you have an ugly table anywhere (your sensor reading or disk event table can still be reasonable looking), but have the disadvantage that they may be inefficient to a varying degree. In general, if this is a query you're going to be doing a lot, and if there are going to be a lot of tuples, you don't want tons of self joins. (That said, if you do this option, do at least experiment with varying ways of joining---do a full self join, do a self join of multiple subselects that have a where clause restricting to the current computer_id, and so on.)

The final option, which everyone is offended by, is just to make a big table that looks like [datetime, computer_id, disk1rpm, disk1temp, disk2rpm, disk2temp, disk3rpm, disk3temp, ... disk100rpm, disk100temp]. I think everyone thinks that this design is likely to bite you later, but can't come up with any reason why it will. (We also don't know if this is a one-off hack or going to be rolled out to enterprises across the globe.) There's no particular reason why you can't just have NULLs for the disk readings that aren't there and use this messy table. It probably won't take up that much space to store the NULLs either, though it may look weird on whatever SQL client you have to have 200 attributes or whatever. However, if what you really want is an ugly table that you can use for plotting your particular current query, then it's hard to get easier than this.

(On a side note, there's an intermediate option between the final option and the view option, which is that you could try to generate both tables, which might lead to update or delete problems later, or you could periodically regenerate the last table from the "legitimate" table with a SELECT INTO type query. Or you could use a TRIGGER to try to make modifications to the real table modify your bogus table.)

Anyway, the basic problem which I think this illustrates is that vanilla SQL isn't particularly designed, or particularly good at manipulating attributes in a general fashion as opposed to tuples.
posted by pbh at 12:18 PM on April 16


Minor point in case you haven't checked and you don't have other reasons for reinventing the wheel. Isn't this general problem solved by tools for Intel's AMT and IPMI?
posted by pbh at 12:29 PM on April 16


Thanks pbh. The computer / hard drive example was fictional, I am actually dealing with loggers and sensors, and have about 100 data columns logically split into eight tables by type.
posted by Nothing at 1:42 PM on April 17


« Older Laptop filter: please help me ...   |   If I make changes using a comm... Newer »

You are not logged in, either login or create an account to post comments