MySQL Monster Query Issue
November 19, 2007 9:07 PM   Subscribe

MySQL monster query is timing out. Anything I can do about it? (Long question, possible a cure for insomnia.)

I set up (using my usual mediocre coding skills) a small PHP app which allows form creation, saves the results, and outputs the results into a tab-delimited file. First two parts seem to work fine. Having some trouble with the last bit.

The form information is saved into two tables: "form" which saves the name, etc., of each individual form, and "field" which saves the fields, using the 'form' row id to keep track of where they belong. A simple form might be saved as

form -> form_id=1, form_name='Test', etc.
field -> field_id=1; field_form=1; field_label='First Name', etc.
field -> field_id=2; field_form=1; field_label='Last Name', etc.

Pretty basic stuff. Likewise, when the form is filled out, the data goes into two tables, "form_save", and "field_save" which mimics the structure above. The 'field_save' row has a column pointing to both the id of the 'form_save' row and 'field' row it is associated with. In essence, for every form in the database, there is one row in 'form' and however many rows in 'field'. For every form someone fills out, there is one row in 'form_save' and however many rows in 'field_save'.

This all works fine on modest forms filled out in reasonable numbers, which is mostly what my clients have. However, a client created a monster 100 question form and has had a large number of people filling it out. It can be edited on the back-end and saved on the front-end just fine, but I cannot get it to spit out a tab-delimited report because the way I'm approaching it in PHP either requires a cajillion queries or a cajillion if/then reiterations to process. This times out on the server, which has time limits on queries.

The "simplest" way to grab results seems to be to:

* Query grabbing all of the 'field' rows for the form type.
* Query grabbing all of the 'field_save' rows which match the form type.

Then it gets ugly. Step through all of the 'form_save' rows, and within that step through the 'field' rows, looking for matches. It works on reasonable results sets but on a form with 100 fields (really) with 15,000 people who have filled it out (really) we're talking about 1.5 million loops. Boom.

My results file looks like this:

[date] [ip] [first name] [last name] [another optional] [another required] [another required] [another optional] etc.

With the date and IP coming from 'form_save'and the rest coming from 'field_save', as matched against the 'field' rows for that particular form. Fields which are optional to fill out may have no saved result in the 'field_save' table and therefore need a tab to keep the columns aligned.

Is there any way I can structure a query to return results combing the 'form_save' and 'field_save' data all in a single neat row, as above, including the "empty" slots which may exist? Or is there a smarter way of doing what I'm doing?

(I think I can move this client to a dedicated box, which would allow my program to run, but it bugs me that I think I'm being horribly inefficient here and would love some pointers on doing a better job with the query.)
posted by maxwelton to Computers & Internet (16 answers total)
 
I'd think an outer join between form_save and field_save is what you're after, but I suspect I may be missing something...
posted by pompomtom at 9:17 PM on November 19, 2007


What pompompom siad... you should be able to write a left outer join that'll do this.
posted by SpecialK at 9:27 PM on November 19, 2007


(can you paste in, using pre tags, the, schema for these two tables?)
posted by SpecialK at 9:27 PM on November 19, 2007


Response by poster: Thanks for any ideas!
CREATE TABLE `form_save` (
  `form_save_id` int(10) unsigned NOT NULL auto_increment,
  `form_save_form` int(10) NOT NULL default '0',
  `form_save_ip` varchar(20) NOT NULL default '',
  `form_save_stamp` timestamp NOT NULL default CURRENT_TIMESTAMP,
  PRIMARY KEY  (`form_save_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Key:

'form_save_form': pointer to 'form' table
CREATE TABLE `form_save_field` (
  `form_save_field_id` int(10) unsigned NOT NULL auto_increment,
  `form_save_field_save` int(10) NOT NULL default '0',
  `form_save_field_data` text NOT NULL,
  `form_save_field_field` int(10) NOT NULL default '0',
  `form_save_field_parent` int(10) NOT NULL default '0',
  `form_save_field_stamp` timestamp NOT NULL default CURRENT_TIMESTAMP,
  PRIMARY KEY  (`form_save_field_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
Key:

'form_save_field_save': pointer to 'form_save'
'form_save_field_data': saved data
'form_save_field_field': pointer to 'field'
'form_save_field_parent': pointer to another row in this table; if the field type is "checkbox" with multiple possible answers, the various answers are added as additional rows while the original row has "checkbox" in the data field. In this way, if you parse the row and find "checkbox" you know to look for additional rows with a parent that matches your existing row. I do this to allow faster searching for specific answers of a multi-choice question.
posted by maxwelton at 9:59 PM on November 19, 2007


Quick and dirty...
SELECT fsf.*, ff.*FROM form_save_field AS fsfLEFT JOIN form_field AS ff ON (ff.field_id = fsf.form_save_field_field)WHERE fsf.form_save_field_save = $form_save_id;

posted by sbutler at 10:15 PM on November 19, 2007


Or... suppose you want all the answers from a particular date range [$date1, $date2). You could do it like this:
SELECT fsf.*, ff.*FROM form_save_field AS fsfLEFT JOIN form_field AS ff ON (ff.field_id = fsf.form_save_field_field)LEFT JOIN form_save AS fs ON (fs.form_save_id = fsf.form_save_field_save)WHERE  (fs.form_save_stamp >= $date1) AND  (fs.form_save_stamp <>ORDER BY fs.form_save_id, fsf.form_save_field_id;
The point is, JOINs are good for more than adding columns to be returned!

Really, in this case that second LEFT JOIN becomes an INNER JOIN because of the WHERE clause (NULL would always exclude the rows anyway). IIRC, all versions of MySQL >= 4 perform this optimization for you... but it's something to be aware of.
posted by sbutler at 10:41 PM on November 19, 2007


Best answer: Ermm.. MeFi ate that from me. How about this:
SELECT fsf.*, ff.*FROM form_save_field AS fsfLEFT JOIN form_field AS ff ON (ff.field_id = fsf.form_save_field_field)LEFT JOIN form_save AS fs ON (fs.form_save_id = fsf.form_save_field_save)WHERE  (fs.form_save_stamp >= $date1) AND  (fs.form_save_stamp < $date2)ORDER BY fs.form_save_id, fsf.form_save_field_id;

posted by sbutler at 10:43 PM on November 19, 2007


Best answer: You're effectively using your database to store a database schema for your forms, as well as the data defined by that schema. That's probably going to make it hard to get query results with one form per row. But if you just do the queries you're already doing and bung in an ORDER BY to sort them according to the ID's the records are associated with, you should be able to generate your TSV output lines just by making one pass through each query, rather than looping and looking for matches.
posted by flabdablet at 11:39 PM on November 19, 2007


On lack of preview: what sbutler said.
posted by flabdablet at 11:39 PM on November 19, 2007


That is: if you use ORDER BY properly, you can make all the saved fields for a particular instance of a particular kind of form cluster together in your query result, and occur in a predictable order. Then all your TSV-generating code has to do is walk the query, spitting out fields in the order they occur in it, and adding an end-of-line every time the form instance ID changes.

It would probably pay to add an extra numeric column that explicitly specifies the column number that the field concerned should occupy in a TSV report, and include that in the ORDER BY as well. That would make it super-easy to figure out how many tabs to emit between fields.
posted by flabdablet at 11:54 PM on November 19, 2007


Response by poster: Thanks! I really appreciate it, I'll code up some stuff tomorrow to try it out.

Now that you mention it, the fields are ordered in that particular table. So if I join it onto the query, I should be able to retrieve results by this?
SELECT fsf.*, ff.*
FROM form_save_field AS fsf
LEFT JOIN form_field AS ff ON (ff.field_id = fsf.form_save_field_field)
LEFT JOIN form_save AS fs ON (fs.form_save_id = fsf.form_save_field_save)
LEFT JOIN form_field AS f ON (f.field_id = fsf.form_save_field_field)
WHERE
  (fs.form_save_stamp >= $date1) AND
  (fs.form_save_stamp <>
ORDER BY fs.form_save_id, f.form_field_order;
flabdablet, am I taking "normalization" too seriously? One of the reasons for setting up the database this way was to allow the "easy" addition and deletion of fields from forms, including the archiving of previously saved data from a field which is no longer in a form. In addition, this allows fields to be re-ordered on the form.

Sadly, I spent a lot of time cobbling this thing together, which gives an idea of my programming abilities...
posted by maxwelton at 12:20 AM on November 20, 2007


Response by poster: Oops. That was already in there. Too tired, must read again. Sorry. Please ignore!
posted by maxwelton at 12:22 AM on November 20, 2007


Response by poster: This works pretty well, actually, though, of course, the fact that some fields are optional makes life interesting. Thanks again.
posted by maxwelton at 3:17 AM on November 20, 2007


The way you've normalized your data is fine. But because each of your form fields is a database row, I think trying to generate an SQL query each of whose result rows has a complete set of form fields is going to be way too hard. I think it would involve, at a minimum, generating the query itself dynamically.

It seems much easier to me to allow each of your form fields to occupy its own query result row, and just use ordering to make sure all the rows for a given form end up occupying sequential query result rows.

The fact that some form fields are optional is why I suggested adding a field to your schema whose only purpose is to specify which column the corresponding form field should occupy when you're dumping forms to your TSV file. That way, you don't have to do anything clever, or make any complicated decisions; you just include the Column field in your ORDER BY, and emit (previous Column field) - (current Column field) tabs before each field as you write the TSV.
posted by flabdablet at 10:45 PM on November 22, 2007


If your f.form_field_order values start at 0 and go up by 1, you could indeed just use that.
posted by flabdablet at 10:52 PM on November 22, 2007


I'd look at storing the form definitions and responses as a single XML-encoded text blob each, unless you have a need to query on the values of individual form fields. It will greatly simplify your queries. It sounds like your only real requirement is schema polymorphism for the form definitions, and XML handles that nicely. With attributes, you can define right in the XML whatever kind of requirements and options you have for your form without mucking up the table definition or resorting to hardcoding that somewhere else.

Also, if you don't care that it takes a long time and you just want to take care of the timeout, you should increase your PHP max execution time value.
posted by Caviar at 5:07 AM on November 26, 2007


« Older History of Breakbeat   |   Help the son and daughter-in-law of a compulsive... Newer »
This thread is closed to new comments.