No, no—*million* with an M.
July 19, 2006 5:27 AM   Subscribe

MySQLFilter: Help me with my 56 million record table! Lots of cumbersome background inside...

I currently have a single transactions table which is just a data dump of every single transaction (for a major retailer) for the last several years. It includes their internal customer_id, the transaction date and a few other choice bits of information.

Right now, I am attempting to generate a week-by-week breakdown of "total unique customers," "existing" customers and "new" customers. (They're "existing" if we can find their customer_id within the same table, somewhere in the 3.5 years preceding the week in question.)

I have this built into one carefully haphazard query:

SELECT SQL_CALC_FOUND_ROWS t1.customer_number cn, t1.transaction_date, t2.transaction_date
FROM data_import t1
INNER JOIN data_import t2
USING ( customer_number )
WHERE t1.transaction_date
BETWEEN '2006-01-01'
AND '2006-01-07'
AND t2.transaction_date
BETWEEN '2002-07-03'
AND '2005-12-31'
GROUP BY t1.customer_number

And so on, with the dates incremented for each week in 2006.

Surprisingly enough, this works. It just takes absolutely forever. (6 solid days of query munching for Jan 1 to Jun 10.)

In a given week, there will be approximately 240,000 unique customers. The 3.5 year window before it consists of approximately 47 million records. (The database ranks in at 56,233,924 total records.)

Now, I have each field indexed, naturally. I've run an EXPLAIN, and MySQL is using the index. (Explain output visible here.)

Most of the time for a given week is being dedicated to Copying Tmp Table. The IO usage on this drive is skyrocketing, and it's relatively clear to me that it's writing this as a DISK table. (My tmp table max size is 32 MB.)

My questions:

1) Will increasing my temp table size help? (Running this in memory would be ideal, but the machine only has 2 GB of RAM, a P4 2.8 GHz and an IDE drive, which is absolutely killing me.)

2) Where the hell is it writing the temp table to? MySQL System Variables shows blank for the tmp_dir option. I can't find it in /var/lib/mysql/mysql, but I don't even know what I'm looking for. I'd like to see how large the temp tables are becoming so that I can see if using them in memory is feasible.

3) Is there any way to restructure this query such that it won't tear my disk up? It's thrashing like crazy because it's having to copy entire possible record sets to this damn disk-based temp table. It's an IDE drive, which is a major no-no for non-sequential massive R/W like this, but it's what I'm stuck with for now, which leads me into...

4) This database consists of 1.8 GB data, 3.3 GB index. It's a MyISAM table running on MySQL 4.1.15. Is it time to consider a new database platform? Will that see me any performance gains when dealing with this much raw data?

5) We're looking to move this to a new machine. Will loading up with 8 GB of RAM allow us to run the entire thing in memory efficiently? Will smart indexing and query writing allow me to run it in memory with 2 GB? Is there a faster way to do this, or is the harsh reality of 56 million records simply that things will take much longer?

I know there's a lot to mull over. Your help and advice as crazy-talented DBA-folk is much, much appreciated by me. Thanks!
posted by disillusioned to Computers & Internet (12 answers total) 6 users marked this as a favorite
Now, I have each field indexed, naturally

But that's useless. MySQL can only use one index at a time, and you're searching on two columns. Try adding multicolumn indexes like this: INDEX(date, customer) and INDEX(customer, date). I'm not sure which will help most.
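In concrete terms, against the table from the question, that might look like this (the index names idx_date_cust and idx_cust_date are just placeholders):

```sql
-- One index leading with the date filter, one leading with the customer lookup,
-- so either side of the self-join can be satisfied from a single index.
ALTER TABLE data_import ADD INDEX idx_date_cust (transaction_date, customer_number);
ALTER TABLE data_import ADD INDEX idx_cust_date (customer_number, transaction_date);
```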

I'd also suggest creating a temporary table containing the customer ids (ie INSERT INTO temp SELECT customer ... WHERE week=blah GROUP BY customer) for the week you're searching and join that against the main table.
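A rough sketch of that approach, using the table and column names from the question (untested against your data, obviously):

```sql
-- 1. Materialize the week's distinct customers first (~240k rows).
CREATE TEMPORARY TABLE week_customers (
  customer_number INT NOT NULL,
  PRIMARY KEY (customer_number)
);

INSERT INTO week_customers
SELECT DISTINCT customer_number
FROM data_import
WHERE transaction_date BETWEEN '2006-01-01' AND '2006-01-07';

-- 2. Join that small set back against the 3.5-year window,
--    instead of self-joining 56M rows against 56M rows.
SELECT COUNT(DISTINCT h.customer_number) AS existing_customers
FROM week_customers w
INNER JOIN data_import h USING (customer_number)
WHERE h.transaction_date BETWEEN '2002-07-03' AND '2005-12-31';
```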
posted by cillit bang at 6:28 AM on July 19, 2006

How many unique customers in a given week, btw?
posted by cillit bang at 6:34 AM on July 19, 2006

Just a thought, but date functions take a lot longer than integer ones.

Are there autoincremented indices on t1 and t2 that are ordered by time? If there were, you could figure out what the first and last records for the week were, then run the query for records between those values. Could be a lot faster.
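A sketch of that idea, assuming an AUTO_INCREMENT column named id whose order roughly tracks transaction_date (an assumption worth verifying before trusting it):

```sql
-- Find the integer bounds for the week once...
SELECT MIN(id), MAX(id)
FROM data_import
WHERE transaction_date BETWEEN '2006-01-01' AND '2006-01-07';

-- ...then constrain the week's side of the query by the cheap integer range:
-- WHERE t1.id BETWEEN <first_id> AND <last_id>
```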
posted by justkevin at 6:36 AM on July 19, 2006

Best answer: It's taking forever because you're first generating a cartesian product of 56 million * 56 million = 3,136 trillion potential records. You then pare this down with the between-dates bit, and then you do the group by. The group by internally requires a sort, so whatever remains of those 3,136 trillion records then has to be sorted. So six days? Sounds about right.

If you want to make this faster, you have to cut down on the initial multiplier. Here's, off the top of my head, how I'd go about it (since I'm not actually employed to do this, and I don't want to create 56M dummy records, I'm not going to test this):

Use a correlated subquery rather than a join. While in theory this means 56M rows each searching through 56M rows (again 56M²), in practice, with an index, this will use a lot less memory and so will be faster. Do make sure you've not only indexed each field (which may be overkill that actually hurts your speed) but that you've indexed the tuple (customer_number, transaction_date). (cillit bang also mentions this.) The subquery will use that tuple, looking for min( b.transaction_date ) where b.customer_number = a.customer_number and datediff( a.transaction_date, b.transaction_date ) < 365 * 3.5
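Along those same (untested, top-of-the-head) lines, the correlated subquery might look roughly like:

```sql
-- For each transaction, find the customer's earliest prior transaction
-- within the 3.5-year window; NULL means no prior purchase (i.e. "new").
SELECT a.customer_number,
       a.transaction_date,
       ( SELECT MIN(b.transaction_date)
         FROM data_import b
         WHERE b.customer_number = a.customer_number
           AND b.transaction_date < a.transaction_date
           AND DATEDIFF(a.transaction_date, b.transaction_date) < 365 * 3.5
       ) AS initial_transaction_date
FROM data_import a;
```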

Make a view of the above (MYSQL finally has views) and then make another view on top of that, adding calculated fields to express the logic for new or existing customers, which is just that the correlated subquery is null (for new) or not null (for existing). Rather than dealing with the nulls, I'd make two fields, using a subquery expression or a case expression (six of one, half dozen of the other):
select [all the other fields in the first view],
coalesce( (select 1 where initial_transaction_date is null), 0 ) as is_new,
case when initial_transaction_date is not null then 1 else 0 end as is_existing

Then select against the second view for each week of interest: where transaction_date between whatever. If you've made two fields, one for is_new, one for is_existing, you can do all your work with a count(*) and two sums: the sum of is_new and the sum of is_existing. (If you only did one calculated field, you just subtract from count: count - existing = new or count - new - existing. Whatever you like best.)
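The weekly rollup would then be something like this (the view name customer_week_view is mine; note views arrive in MySQL 5.0, so on 4.1 the view body would have to be inlined as a derived table):

```sql
SELECT COUNT(*)         AS total_visits,
       SUM(is_new)      AS new_customers,
       SUM(is_existing) AS existing_customers
FROM customer_week_view
WHERE transaction_date BETWEEN '2006-01-01' AND '2006-01-07';
```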

If MYSQL does views sensibly (like pg and Sybase and sometimes MS-SQL server), limiting your select to transaction_dates in a particular week should happen first. (That is, the where clause in the select should be applied before the subquery and calculations in the views; check the showplan to verify this. If MYSQL doesn't do this, just import the data into pg.). Thus each weekly query should start off with far fewer than 56M rows.

(Finally, talk to your client, and ask if the 3.5 years for existing relationship is really a metric they want to use, or a misguided attempt to make your queries "easier". If the latter, it's of course actually easier to call a customer existing if he ever made a prior purchase, not limiting it to less than 3.5 years ago.)
posted by orthogonality at 6:46 AM on July 19, 2006 [1 favorite]

Oh, and note: no need for the group by at all, so no need to sort N million records, if what you're looking for is just the aggregate weekly numbers.
posted by orthogonality at 6:47 AM on July 19, 2006

Best answer: 56 million records isn't so much, depending on the amount and types of data in the fields of those records.

But, quickly, and off the top of my head, there's a lot that you are doing that doesn't look like "best practice".

First, indexing every field isn't likely to be an optimal strategy. It costs something (computationally and space-wise) to maintain the indexes, and indexes that aren't valuable for improving query performance are wasted.

Second, it looks like your query joins a couple of tables, or maybe joins the one table you talk about back on itself. MyISAM tables don't support foreign keys, so if you convert your database to InnoDB tables, and normalize data with foreign key relations, you might get substantially better performance. OTOH, adding InnoDB support will create transaction log files for the tables, which will add disk requirements, and maybe processing overhead, that may negate any join performance benefit, so this is something of design analysis issue.

It's obvious that you're badly I/O bound. Your query is effectively running at the speed of your disk, so doing what you can to break up the processing in ways that will reduce your working set as soon as possible will greatly improve your situation. One way of doing this would be to revisit the indexes you've built to see if they are actually optimal for extracting the information you want to see. For example, your current query looks through your table, selecting your t1 criteria records, then sorts them, and then selects against t2. The sort is computationally expensive, and probably increases your temp table size quite a bit. Could you do your t1,t2 selections, and then sort the combined result? If so, this could be many, many times faster.

In answer to your questions:

1) The best thing to do when you are having performance problems in MySQL is to turn on some monitors, and maybe use some management tools to collect some information. If your temp table is the problem, enlarging it may help, but if it isn't, it won't.
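For instance, a quick check on whether the temp tables are actually hitting disk (these counters exist in MySQL 4.1):

```sql
SHOW STATUS LIKE 'Created_tmp%';
-- If Created_tmp_disk_tables grows nearly as fast as Created_tmp_tables,
-- the GROUP BY / sort work is spilling out of memory onto disk.
```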

2) The tmp directory can be specified in your MySQL startup file. If you're really banging on the tmp file, and can't stop that by rewriting your indexes or query, you could at least put the tmp file directory on another physical disk, on another controller channel, and significantly improve the disk thrashing that you are getting. With IDE disks, that's cheap and quick, even if you do nothing else.
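For example, a minimal my.cnf change (the path here is purely illustrative):

```ini
[mysqld]
# send sorts and on-disk temp tables to a second physical disk
tmpdir = /mnt/disk2/mysqltmp
```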

3) Yes, see suggestions above.

4) MySQL 4.1 is capable of handling tables of the size you are using, and greater, but IIRC, there's a "limit" of 4 GB on MyISAM tables. I would think you could clearly optimize your tables, indexes, and queries to significantly reduce your processing time, and some other database engines will have a more sophisticated query optimizer which will try to re-arrange poor SQL and table designs, or create on-the-fly indexes for repetitive queries, more than MySQL does. DB2 and Oracle are examples of such products, with advanced optimizers, but your problem probably doesn't warrant changing platforms.

5) If you're running MySQL on any 32 bit Windows platform, Windows will only let you use 2 GB of memory for an application (maybe 3GB, if you use the startup switch and the application knows what to do with the larger address space). 64 bit Windows servers don't do a lot better. 64 bit Unix servers, maybe. But before I went there, I'd really look to see if your table was squished as much as possible. 1.8 GB for only 56 million records seems pretty bloated. Do your data types all match the real field contents, or were they padded or wrongly selected in the beginning? And again, I'd really look at all those indexes...
posted by paulsc at 7:20 AM on July 19, 2006

Best answer: -- Unique customers

SELECT Count(*) AS Cnt, customer_number
FROM Data_Import
WHERE transaction_date BETWEEN '2006-06-01' AND '2006-06-30'
GROUP by customer_number

-- For "new" customers
SELECT customer_number
FROM Data_Import DI
WHERE transaction_date BETWEEN '2006-06-01' AND '2006-06-30'
AND customer_number NOT in
( SELECT HIST.customer_number
FROM Data_Import HIST
WHERE Hist.customer_number = DI.customer_number
AND transaction_date < '2006-06-01' )

For existing customers, simply remove the "NOT" from the above expression.

You should have an index on (customer_number, transaction_date) (in that order).
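In MySQL syntax, that would be something like (the index name is a placeholder):

```sql
ALTER TABLE Data_Import ADD INDEX idx_cust_date (customer_number, transaction_date);
```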
posted by curtm at 7:23 AM on July 19, 2006

Also, your query in your original post isn't strictly a legal query under ANSI 99, because you have columns in the select list that are not in the group by list and are not aggregate expressions. Sybase, I know, at one point let you get away with this, at least prior to 12.x. But Oracle won't, and MS-SQL, which is based on Sybase, also won't, and pg won't.

I don't doubt that MYSQL allows it, but I'm not sure what results you should expect. The problem would come up if a customer visits a store more than once in the weeks you're looking at. With two different transaction dates, the results can't be grouped by customer_id -- you need as many rows as there are unique (customer_id, t1.transaction_date, t2.transaction_date) tuples. So if you're counting rows as customers, each row is really a customer visit (multiplied by the customer's previous visits).

But I don't know how you're using your query.
posted by orthogonality at 7:27 AM on July 19, 2006

curtm writes "( SELECT HIST.customer_number
"FROM Data_Import HIST
"WHERE Hist.customer_number = DI.customer_number
"AND transaction_date < '2006-06-01' )"

curtm is also suggesting a correlated subquery (and unlike me, he was nice enough to give you the SQL). But curtm, you're not taking into account the "a previous transaction must have occurred in the last 3.5 years to qualify the customer as existing" requirement.
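One hypothetical way to fold that in, keeping curtm's shape (2002-12-01 being 3.5 years before the example week):

```sql
AND customer_number NOT IN
( SELECT HIST.customer_number
  FROM Data_Import HIST
  WHERE HIST.customer_number = DI.customer_number
    AND HIST.transaction_date < '2006-06-01'
    AND HIST.transaction_date >= '2002-12-01' )
```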
posted by orthogonality at 7:39 AM on July 19, 2006

Response by poster: Ortho, any chance I could get you to email me (in profile) directly?

I thought MySQL 4.1 was finicky with sub-selects and the like...

Also, I don't know if the assertion that it's 56M * 56M is correct...

The first thing it does is grab a week's worth of records, using the transaction_date and its index. That's about 240,000 records. It *then* looks for each of those customer_ids to see any matches in the 3.5 year period...

Thanks for all your help, everyone. And I'm aware of MyISAM vs. InnoDB and MySQL only using one index, etc.
posted by disillusioned at 3:39 PM on July 19, 2006

Yeah, I'll email you; I actually stopped by to ask you to post your before and after metrics here. MySQL 4.1 probably is finicky with sub-selects, and I'm guessing paulsc knows more about that than I do, but I'll email and see what I can do.
posted by orthogonality at 8:09 PM on July 19, 2006

From my experience, MySQL just becomes very slow when you have tens of millions of records in one table.

We once had a very simple table (one ID field, two integer fields and maybe two or three status fields) with 13 million records and even a simple delete from table where integer_field=1234 which affected just a few hundred rows would take ages.

We ended up splitting up the data into multiple tables.
posted by bloo at 4:33 AM on July 20, 2006

This thread is closed to new comments.