Can you help me design a simple database?
January 11, 2021 1:02 PM

I would like to convert the data store for a small personal project of mine from flat csv files to a relational database. I need help designing the DB for optimal performance.

For context, I'm a comfortable python and bash programmer who knows enough SQL to write simple queries but not how to optimize anything for performance.

My project is simple: I track activity at bikeshare stations and store the number of trips at each station in 1-hour intervals. I have 2 tables:
Trips: time,station_id,n_trips (~5 million records and counting)
Stations: station_id, station_name, .... [more station details] (~200 records)

Currently when I want to do queries, I load everything into Python and use Pandas. This is user-friendly (for me), but there's a lot of overhead in loading everything into memory. I'd like to add a feature to my project that will make quick lookups important, which is why I'm considering an RDBMS.

The kinds of queries I want to optimize for (rough sketches below) are
- Get records between date ranges
- Get the busiest station in a date range
- Group by day and sum the trips at each station
- Basically every query would involve a join on the "stations" table on station_id.
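
Rough sketches of what I mean in SQL (using the column names above; my syntax may be off):

SELECT t.*, s.station_name
FROM trips t
JOIN stations s ON s.station_id = t.station_id
WHERE t.time >= '2020-11-01' AND t.time < '2020-12-01';

SELECT s.station_name, SUM(t.n_trips) AS total_trips
FROM trips t
JOIN stations s ON s.station_id = t.station_id
WHERE t.time >= '2020-11-01' AND t.time < '2020-12-01'
GROUP BY s.station_id
ORDER BY total_trips DESC
LIMIT 1;

SELECT strftime('%Y-%m-%d', time) AS day, station_id, SUM(n_trips) AS daily_trips
FROM trips
GROUP BY day, station_id;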

I've naively set up a test database (no indexes, no primary/foreign keys) to do some benchmarking.
SELECT * FROM trips

took ~2ms.
SELECT strftime('%Y-%m-%d', time) as day, max(n_trips),station_id 
FROM trips
WHERE day > '2020-11-01'
took ~9 s.

I wonder if I can get that second query substantially faster by setting keys or indexes, or other DB magic I'm not aware of. Maybe having a "date" field as well as a "datetime" field to remove the strftime call? If the likelihood of speeding up queries is not great, please let me know so I don't waste more time on this.

I'm running sqlite3 because of its easy integration with python, but I would use a different system if it were worthwhile.

Thanks!
posted by no regrets, coyote to Computers & Internet (10 answers total) 3 users marked this as a favorite
 
I am not very familiar with SQLite; I use it only as a local staging database for Django projects. But generally speaking, indexes increase write times and decrease read times, and the trick is to find the proper balance between the two. Also, every index takes up memory and/or disk space, so there is always a tradeoff.

Also, generally speaking, functions like time conversion take a lot of CPU resources, so I try to either push those operations to the database (CONVERT statements) or add another column holding the pre-converted data (at the cost of more disk space and slower writes).

It is pretty easy to create indexes (although, depending on the available resources, it can take a while on very large tables), and it is also easy to drop them again. My advice would be to create a couple of indexes just to see what happens. You can use EXPLAIN (in SQLite, EXPLAIN QUERY PLAN) to see whether an index is actually being used.
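
For example, in SQLite that might look something like this (table and column names taken from the question):

CREATE INDEX idx_trips_time ON trips(time);
CREATE INDEX idx_trips_station_id ON trips(station_id);

-- check whether a query actually uses an index
EXPLAIN QUERY PLAN
SELECT * FROM trips WHERE time >= '2020-11-01';

-- drop an index that turns out not to help
DROP INDEX idx_trips_time;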
posted by IAr at 1:25 PM on January 11, 2021 [1 favorite]


This structure is totally reasonable; lightweight enough that sqlite should be ok. I don't know sqlite as well as mysql or postgres, but do you need "day" in the WHERE? Does "time > '2020-11-01'" not work?
posted by supercres at 1:27 PM on January 11, 2021


Yeah, the first thing you should do is change your query so that you're filtering based on the field value itself, not based on a computed expression. As a general rule, doing comparisons on the result of a function prevents a lot of optimizations, because the database has to execute the function on every row in order to figure out which ones need further processing.

SQLite doesn't have a dedicated "timestamp" data type; its date functions work with times stored as ISO 8601 strings, Julian day numbers, or Unix timestamps. Assuming you want to use ISO 8601 strings, this is as simple as doing WHERE time >= '2020-11-02 00:00:00' instead of WHERE day > '2020-11-01'.

Once you have your query structured this way, it can take advantage of an index on the time column. The index is basically just a copy of the keys sorted by date, so it lets you more efficiently look at the rows whose dates fall into a particular range. Similarly, you can create an index on station_id to speed up queries that only need to look at a small subset of stations.
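
A rough sketch of both pieces, reusing the names from the question (and adding a GROUP BY, which I assume was the intent of the original query):

CREATE INDEX IF NOT EXISTS idx_trips_time ON trips(time);

SELECT strftime('%Y-%m-%d', time) AS day, station_id, MAX(n_trips)
FROM trips
WHERE time >= '2020-11-02 00:00:00'  -- compares the stored value, so the index can be used
GROUP BY day, station_id;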

But remember, the only time indexes really speed things up is when your query (or one section of a more complex query) only needs to look at a small subset of a particular table, and that subset is determined by keys that fall into a particular range. If your goal is to compute an aggregate summary of the entire dataset every time you run the query, then indexes are unlikely to help much, and you don't really gain much by using an RDBMS as opposed to Pandas.
posted by teraflop at 1:29 PM on January 11, 2021 [3 favorites]


Ah no, my mistake. No decent datetime in sqlite. Seems worthwhile to spin up a localhost postgres or mongodb if you're going to be using this pretty frequently. Or just dump to HDF5 for disk queries. This is old but I still refer back to it occasionally.
posted by supercres at 1:32 PM on January 11, 2021


For your queries you absolutely want an index on `time` (for the date filter queries) and a foreign key on `trips.station_id` to a primary key on `stations.station_id` (for the join queries and for data integrity). Then I'm guessing it'll make sense to have a date field to avoid a function call (substr might be faster than strftime?) in your group-by query, but I'd get a baseline with just the first two indexes first.
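
A sketch of what that schema could look like (the column types are my guesses; adjust to what you actually store):

CREATE TABLE stations (
  station_id   INTEGER PRIMARY KEY,
  station_name TEXT NOT NULL
  -- ... more station details
);

CREATE TABLE trips (
  time       TEXT NOT NULL,  -- ISO 8601 string
  station_id INTEGER NOT NULL REFERENCES stations(station_id),
  n_trips    INTEGER NOT NULL,
  PRIMARY KEY (time, station_id)  -- assumes one row per station per hour
);

-- SQLite only enforces foreign keys if you turn them on per connection:
PRAGMA foreign_keys = ON;

In SQLite the composite primary key is backed by an index whose leading column is time, so the date-range filters can typically use it without a separate index on time.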
posted by john hadron collider at 1:48 PM on January 11, 2021


And seconding setting up a local postgres database - how big is your database? 5 million records can take a hell of a long time to seek, and sqlite never really gains the caching benefits that dedicated databases do. (I mean yes, your OS ought to be caching those pages as they come off the disk, but...)

F'rex, the sqlite copy of one of my work databases with 22,249 rows is 750 megs and a simple select where date > [a few days ago] is a 20 second operation, and 99% of that is IO. The postgres version of that database completes the same query in 85ms. It's magic!

Sqlite can be handy as heck, but it can also be a convenient sql-shaped trap.
posted by Kyol at 2:08 PM on January 11, 2021 [2 favorites]


To be fair, that database is, like, anti-normalized, but even my home's sqlite3 temperature database eventually gets too big and causes my ESP32 clients to time out while reading from it because seeking through the records takes more than 3 seconds, and that table is fairly decently normalized.
posted by Kyol at 2:27 PM on January 11, 2021


If you actually want performance, nothing beats running an actual SQL server. MySQL is free, or you can go for MariaDB, a fork maintained by the original author of MySQL.
posted by kschang at 6:26 PM on January 11, 2021


Good advice above. The performance advantage of an index on trips.time is that (assuming a query that's relatively selective - e.g. a single day's data out of a year) your database can retrieve the result by range-scanning only through the relevant index entries and then visiting only the referenced rows in the trips table. That can save you an awful lot of physical I/O, which is nearly always the limiting factor in database performance (assuming your data model & database config are basically ok).

On the subject of which... I puzzled for a while about how your query could work without a GROUP BY clause, because your MAX looks like it wants to be an aggregate function operating over the data for a given time period. But then I looked back and saw that n_trips is already aggregated hourly. That's going to make things a lot harder / less flexible in SQL terms. If I were writing queries here, I'd expect to count the rows in the trips table for a given date range, then GROUP BY station_id and date if I wanted to find out e.g. which station is busiest on which days - for which I'd use the COUNT function in SQL, which works pretty much like MAX. But that assumes every individual trip maps to a unique row in the trips table. If you're aggregating before you insert data into trips, some kinds of SQL become a lot harder to write.

Basic database concepts: each entity has a table. Each attribute of each entity is a column in the relevant table. Each instance of each entity is a row in the relevant table. Like all rules, you can break them when you understand when & why you should. But when you're starting out, you'll do better if you follow them.

So, if possible, I'd take the raw trip data into the trips table and aggregate later in SQL. The reasoning is that "number of trips per hour" is not an attribute of your trip entity, so it has no reason to be represented as a column in the trips table. It's not an attribute of ANY of your entities - it's just a count of how many instances of the trip entity you have, i.e. the number of rows in the trips table. SQL already has a function for that, so you don't need to maintain it yourself.
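
For instance, if trips held one row per individual trip, the hourly counts could be derived along these lines (a sketch; start_time is an assumed column name for the raw trip timestamp):

SELECT strftime('%Y-%m-%d %H:00', start_time) AS hour, station_id, COUNT(*) AS n_trips
FROM trips
WHERE start_time >= '2020-11-01' AND start_time < '2020-12-01'
GROUP BY hour, station_id;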

I'll also second that you should absolutely define primary keys on each table, and the foreign key constraint on trips.station_id. It won't necessarily make it go faster, but it means the database knows how to enforce referential integrity for you - so you don't need to e.g. write code to check that you're not inserting duplicate values into columns that you need to be unique.

I guess you can also enhance your data model with some other obvious entities like bike and rider, once you're happy with how trips and stations are working. Then you can query e.g. who's your busiest user, or which bikes are used most heavily & therefore due for maintenance.
posted by rd45 at 1:46 AM on January 12, 2021 [1 favorite]


You can write times as YYYYMMDDHHMMSS (e.g., right now is 20210112221500 UTC) and just store them as character data, which lets lexicographic sorting kick in, so you can write date-range filters as YYYYMMDDHHMMSS-style string comparisons. This is sort of a hack, but I bet it would work fine for your purposes.
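
For example, with times stored that way, a date-range query is just a string comparison (a sketch using the question's column names):

SELECT station_id, SUM(n_trips) AS total_trips
FROM trips
WHERE time BETWEEN '20201101000000' AND '20201130235959'
GROUP BY station_id;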
posted by axiom at 2:16 PM on January 12, 2021 [1 favorite]

