normalizing a spatio-temporal database?
June 2, 2011 10:23 AM   Subscribe

Relational database normalization of spatio-temporal data?


I am looking at relational database design, but cannot find coverage of what is likely a less common problem. Maybe I am just not using the right language in my searches.

Say you have many points in space that have several attributes that change as a function of time. For example, monitoring stations that take temperature, humidity, and pressure samples every 60 seconds. Rather than having a table for each of these three attributes (with rows for the points and columns for each time step), what would a more normalized solution look like? Are there several ways to do this?

If anyone has any short answers or references to tutorials (examples in MS Access would be best) I would much appreciate it!
posted by figTree to Computers & Internet (12 answers total) 1 user marked this as a favorite
Access will be overwhelmed by a lot of data - it's good for prototyping, but if you want true shared-user access and scalability, look elsewhere (MySQL is a free alternative).

table Station (key: stationID)
table Temperature (key: stationID, time)

Normalization is something of an art -- normalize too much and you may run into query run-time issues (and have to create de-normalized tables, which carry other challenges).
posted by catkins at 10:28 AM on June 2, 2011

You would have two tables: one with the spatial information about the points and a unique id for each, and one with the sample values.

To use your example, table 1 would contain all the attributes of the monitoring station (location, unique id, some human-readable id, and any other information particular to the station itself). Each row in the table would be one station. In the second table, you would have a row for each sample that would contain the id of the station it was taken at, the time of the reading, the temperature, the humidity and the pressure. You would then join the two tables using the unique id.
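A minimal sketch of that two-table layout in SQLite (all table and column names here are just illustrative, not from the question):

```python
import sqlite3

# In-memory database purely for illustration; names are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE station (
    station_id INTEGER PRIMARY KEY,
    name       TEXT,
    latitude   REAL,
    longitude  REAL
);
CREATE TABLE reading (
    station_id  INTEGER REFERENCES station(station_id),
    taken_at    TEXT,        -- ISO-8601 timestamp
    temperature REAL,
    humidity    REAL,
    pressure    REAL,
    PRIMARY KEY (station_id, taken_at)
);
""")
conn.execute("INSERT INTO station VALUES (1, 'North Ridge', 45.1, -121.7)")
conn.execute("INSERT INTO reading VALUES (1, '2011-06-02T10:23:00', 18.5, 0.62, 1013.2)")

# Join the two tables on the shared station_id.
row = conn.execute("""
    SELECT s.name, r.taken_at, r.temperature
    FROM reading r JOIN station s ON s.station_id = r.station_id
""").fetchone()
print(row)  # ('North Ridge', '2011-06-02T10:23:00', 18.5)
```

One reading per row, one station per row, joined on the id: that's the whole design.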
posted by radioaction at 10:31 AM on June 2, 2011

If the measurements are all taken at the same time, then you'd have one table for measurements, one for stations, with the station id linking them. (id, station_id, time, temperature, pressure, humidity)

If the temperature, humidity and pressure are taken at separate times, you'll probably want a different table for each of them: (id, station_id, time, temperature), (id, station_id, time, pressure), etc.
posted by empath at 10:38 AM on June 2, 2011

Access....mostly sucks. I know that you may have budget/development restrictions, but I really.hate.Access.

You will run into problems with the amount of data that you will be collecting (measurements every 60 seconds every day is going to get WOAHBIG), unless you are going to dump that data on a set schedule. Access has limits on the amount of data that it can effectively manage; you'll hit the limit real quick at that speed.

Access isn't particularly stable, either, as an application. It's pretty easy for someone to accidentally delete/freeze the whole shebang.

You can't link data in Access to the web, either--there's a limit on how many users can access the database at one time.

As far as normalization and data structure, I like the idea of having a station_id table with station information and then a separate table with the time and measurements. This could be useful for pulling metrics/running queries later. Your queries would be super simple; just a join and some parameters as needed. You could also use different joins to pull out incorrect readings/equipment malfunctions, etc. (when data wasn't collected, etc)

I wouldn't keep the measurements each in a separate table; I think that's needlessly complicated, though it would work. I've never connected a monitoring station to a database, but I can imagine that setting it up to point to lots of different tables would be pretty fiddly, so keeping it as two tables--stations and readings--would be simplest.

There's really no one way to design a database or normalize your data. One of the reasons we use normalization is that when you make changes to your data (an example for you would be a station name change/location), you don't want to update each and every record with the new information. Time and temperature readings aren't going to change anytime soon--your station information might.
posted by hotelechozulu at 11:51 AM on June 2, 2011

Your temperature, humidity and pressure measurements are really distinct and grouped. You might, at some point, want to query based on all the times when temperature was in a range, but that temperature measurement isn't shared between some other set of data. You wouldn't say "the 32°C measurement". Those measurements may end up having same values, but the only reason is that your measurement devices have limited resolution.

Your primary grouping is that a measurement of those values occurred at a specific place and time.

My guideline is a table for each thing that is actually physically the same thing across different relationships. If your samples are centrally synch'd, the sample time might be a reference to a table, but if the samples are derived from independent clocks at each monitoring station I'd store the sample time with the temperature, humidity and pressure.

The other thing is that in a real database (since everyone else is saying "Not Access!", I'll toss in a vote for PostgreSQL, with PostGIS if you've got geographic data) you can create views, so if you later decide you want to subdivide a table you don't have to change your interface.
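To illustrate that point about views (SQLite here for a self-contained sketch; the names are invented, and the same idea applies in PostgreSQL): the wide reading table has been split into per-measurement tables, but a view preserves the old interface, so existing queries keep working.

```python
import sqlite3

# Hypothetical split: one table per measurement, glued back together by a view.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE temperature (station_id INTEGER, taken_at TEXT, value REAL);
CREATE TABLE pressure    (station_id INTEGER, taken_at TEXT, value REAL);

CREATE VIEW reading AS
SELECT t.station_id, t.taken_at,
       t.value AS temperature, p.value AS pressure
FROM temperature t
JOIN pressure p ON p.station_id = t.station_id AND p.taken_at = t.taken_at;
""")
conn.execute("INSERT INTO temperature VALUES (1, '2011-06-02T10:23:00', 18.5)")
conn.execute("INSERT INTO pressure    VALUES (1, '2011-06-02T10:23:00', 1013.2)")

# Callers still query 'reading' as if it were a single wide table.
row = conn.execute("SELECT temperature, pressure FROM reading").fetchone()
```

The underlying tables changed; the query interface didn't.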
posted by straw at 12:22 PM on June 2, 2011

I think you need three tables:
1) A station_id table like everybody has already mentioned, but I would add two more columns: an effective date and a finished date. This would allow you to sort on stations that are currently being used.
2) My second table would be a measurement_type table with basically two columns: id and description. It would hold something like "1, Humidity". This would allow you to add new measurement types like (for example) wind speed.
3) The third table would hold the actual measurements, with columns like station_id, measurement_type, timestamp, measurement.

This concept is called a parameterized database and can be very flexible.
posted by Confess, Fletch at 1:23 PM on June 2, 2011 [2 favorites]

PostGIS adds geographic extensions to the (free) Postgres database; if I had to track lots of data about points in space, and the spatial information was important (not just the data itself), I'd probably use a tool like that.
posted by jenkinsEar at 2:00 PM on June 2, 2011

I'd suggest looking into Oracle Spatial.
posted by Eltulipan at 2:37 PM on June 2, 2011

The most important design feature of databases containing historical (i.e. time-dependent) data is that in normal operation (as opposed to fixing data entry mistakes), updates to any table will involve appending new table rows rather than modifying existing data.

As hotelechozulu said: "One of the reasons we use normalization is that when you make changes to your data (an example for you would be a station name change/location), you don't want to update each and every record with the new information. Time and temperature readings aren't going to change anytime soon--your station information might."

And when it does, you don't want to alter information held in respect of measurements made when the old station information was applicable.

What you'd actually want to do if you moved a station is add a new record to the Stations table, which would give you a new station ID; subsequent measurements from that station would go in the Measurements table (or tables) with the new ID. That way, the existing measurement data doesn't magically get its actual location pulled out from under it.

If there's some reason you need to capture the association between station-at-old-location and station-at-new-location (perhaps you need to distinguish the act of moving a station from the act of decommissioning an old one and installing a new one) you can add a previous-ID field to the Stations table that's only non-null for stations that have actually moved or had some other vital attribute changed (as opposed to having spelling errors fixed in their names or similar).
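A sketch of that append-only pattern in SQLite (all names invented): moving a station appends a new row with a fresh id, and previous_id records the lineage.

```python
import sqlite3

# Hypothetical stations table where a physical move creates a new row;
# previous_id is NULL except for stations that moved (or otherwise changed).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE station (
    station_id  INTEGER PRIMARY KEY,
    name        TEXT,
    latitude    REAL,
    longitude   REAL,
    previous_id INTEGER REFERENCES station(station_id)
);
""")
conn.execute("INSERT INTO station VALUES (1, 'North Ridge', 45.10, -121.70, NULL)")
# The station is physically relocated: append a new row, never update the old one.
conn.execute("INSERT INTO station VALUES (2, 'North Ridge', 45.12, -121.69, 1)")

# Old measurements keep pointing at station 1, whose location is untouched;
# the move itself is recoverable from previous_id.
moved = conn.execute(
    "SELECT station_id, previous_id FROM station WHERE previous_id IS NOT NULL"
).fetchone()
```

Historical readings stay attached to the coordinates that were true when they were taken.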
posted by flabdablet at 7:15 PM on June 2, 2011

By the way, the only thing wrong with Access as a front end is its idiotic Microsoft licensing terms. It has quite easy and capable form and report generation. Its report generator, in particular, is streets ahead of the one built into OpenOffice Base.

If you're already comfortable with Access and if it will already be installed everywhere you need to use your data, there is nothing at all wrong with using it to prototype your database; going live would then involve attaching your existing Access front end to any real database backend via ODBC. Using a non-Microsoft backend would require a bit of fiddling about with the SQL that Access generates (every SQL implementation has minor syntactic differences with every other) so you might want to start with the free Express edition of one of MS's own SQL Server backends.

But I absolutely concur with others who say that putting a self-contained Access database into production for anything more than single-user access to small volumes of data is the Wrong Thing.
posted by flabdablet at 7:27 PM on June 2, 2011

Having worked on various overly generic systems, I respectfully disagree with Confess, Fletch's suggestions. I think the point of using a good SQL server is that it goes beyond RDF-style triples in providing structure, and changes in that structure are what "ALTER TABLE ..." is for, although if you need a system that flexible it might be worth having separate tables for the three measurements.

To flabdablet's recommendations, absolutely: a moved measurement station is a new measurement station. If I wanted to do grouping of stations I'd add another table rather than having references within a table, and then add references to that new table from the original stations table, just to make the query clauses simpler.
posted by straw at 1:55 PM on June 3, 2011

That's a fair point. SQL is good at many things, but traversing NULL-terminated linked lists within a single table is not one of them.
posted by flabdablet at 1:53 AM on June 4, 2011
