
Help me write this SQL job
May 31, 2011 8:39 PM   Subscribe

What's the best way to write this SQL job? I have a table with browser session information collected. I want to clear this table and store the information in another table.

In the table that already exists the data is just collected as an http user agent string, a session id and a date of the session.
e.g. 02-FEB-11 ASDFKJH234KJHASKDJF213123 Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12 (.NET CLR 3.5.30729)

I want to clear this table biweekly by running a job and putting the collected data into a separate table where it's structured more like:
browser, version, total number of that browser and version, date.

This is on an Oracle DB. I created the other table. For the job I was messing around with doing a case statement where I'd go something like
WHEN user_agent LIKE '%Firefox%'
THEN insert blah delete
WHEN user_agent LIKE '%MSIE 9%'
THEN insert blah delete
WHEN user_agent LIKE '%Opera%'
THEN insert blah delete ;
etc...but I don't feel like I'm on the right track of coding the job in the best manner.

I know this is pretty simple and my brain just might not be working correctly right now. If this seems simple to you please let me know how you'd structure this or even provide some sample code. Otherwise, I'll just keep plugging away, thanks!
posted by zephyr_words to Computers & Internet (7 answers total) 1 user marked this as a favorite
 
Instead of keeping one table and moving data periodically, perhaps generate new tables with the date as part of the table name, e.g. clients_05_firsthalf_2011, clients_06_secondhalf_2011, etc.

Then you just point your code at the current table name, instead of moving lots of records and data around, in addition to adding new records.
posted by Blazecock Pileon at 8:53 PM on May 31, 2011


A couple comments and then a (better) answer:

1) Don't do what BP is suggesting above unless you are extremely sure you will never have to query across those tables (like a query for everything in 2011), because that is a major pain. In general, if you find yourself splitting the same data across tables, you're making a mistake. (Not always: with hundreds of millions of rows it can be justified, but even then you're better off with a simpler split like current/archival.)

2) If you find yourself doing complicated logic in a database query, you are probably making a mistake (and if you think identifying the user agent and version from a browser string isn't complicated logic, keep working at it and answer again later). You should definitely think about moving the logic into the app.

3) If you are writing a query that analyzes and then deletes, you need to be careful that your analysis and deletes are in sync (even when new records come in as this is going on), so you don't delete anything that wasn't analyzed and vice versa. One possibility is to grab the current date (or have it passed in) at the start of the script, and only operate on records older than that for the rest of the query.

So given all that, assuming you do still want to do it in the database, I'd suggest doing one query that inserts into the analysis table, and a second query that deletes. The first query would presumably have a subselect that gets the rows from the original table, an outer select that aggregates those rows, and then the insert. Something like (my Oracle syntax is a little rusty):

INSERT INTO summary (browser, version, cnt, summary_date)
SELECT browser, version, COUNT(*), SYSDATE
FROM (
  SELECT
    CASE
      WHEN user_agent LIKE '%Firefox%' THEN 'firefox'
      ELSE 'unknown'
    END AS browser,
    'unknown' AS version  -- something crazy with substr to get the version
  FROM sessions
  WHERE session_date < :thrudate
)
GROUP BY browser, version;

DELETE FROM sessions WHERE session_date < :thrudate;
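Putting point 3 into practice, here is a minimal PL/SQL sketch that wraps the two statements in one procedure and captures the cutoff date once, so the insert and the delete operate on exactly the same rows. Table and column names are assumed from the query above, and the version-parsing step is left as a placeholder:

```sql
-- Sketch only: assumes the sessions/summary tables above and a
-- scheduler job calling this procedure biweekly.
CREATE OR REPLACE PROCEDURE archive_sessions AS
  v_thrudate DATE := SYSDATE;  -- captured once so INSERT and DELETE agree
BEGIN
  INSERT INTO summary (browser, version, cnt, summary_date)
  SELECT browser, version, COUNT(*), v_thrudate
  FROM (
    SELECT CASE
             WHEN user_agent LIKE '%Firefox%' THEN 'firefox'
             ELSE 'unknown'
           END AS browser,
           'unknown' AS version  -- version parsing goes here
    FROM sessions
    WHERE session_date < v_thrudate
  )
  GROUP BY browser, version;

  DELETE FROM sessions WHERE session_date < v_thrudate;
  COMMIT;
END;
/
```

Rows that arrive while the procedure runs have a session_date at or after v_thrudate, so they survive the delete and get picked up by the next run.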
posted by inkyz at 10:04 PM on May 31, 2011


I wrote the question a bit incorrectly. I should have stated that I'm going to write a procedure and have the job run that procedure, so PL/SQL is fair game.

Your answer is helpful inkyz, thanks. It looks like I'll just need to build a few cursors following that similar logic pattern.
posted by zephyr_words at 10:33 PM on May 31, 2011


First off, not an Oracle coder, so feel free to ignore my advice.

Not sure why you want to dump the contents of your table into a new table when you could just write a view over the top of it with your summary data, but maybe that's the packrat coder in me.

inkyz's CASE statement will work; the only problem I can see is that you'll have to know ahead of time which browser strings you're looking for, and there's a good chance you'll lose some granularity in the data. For example, you won't be able to disambiguate between FF 3.x and FF 4.x unless you write a massive CASE with every possible option.
Or, take the browser Midori, which has an agent-string like this:
Mozilla/5.0 (X11; U; Linux i686; fr-fr) AppleWebKit/525.1+ (KHTML, like Gecko, Safari/525.1+) midori/1.19

So, if your boss comes to you and asks for data on the hot new browser Midori, you're going to have to go back to your query and tweak your Safari CASE with a sub-CASE to cover Midori, which I'm sure you'd rather avoid.

So, better to parse your logs in the app layer (using PL/SQL): extract a list of distinct browser sigs and pass that back in some dynamic SQL in a loop/cursor to get a count of every distinct browser you've logged, without having to hard-code a million CASE statements. There are probably a couple of libraries somewhere that will do the parsing for you; it can't be a new problem.
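If you do stay in the database, Oracle's REGEXP_SUBSTR can pull a "name/version" token out of most agent strings without a giant CASE. A rough sketch (the browser list is illustrative, and strings like "MSIE 9.0" that use a space instead of a slash would need a second pattern):

```sql
-- Rough sketch: extracts tokens like Firefox/3.6.12 or midori/1.19
-- from the agent string, case-insensitively, then counts them.
SELECT REGEXP_SUBSTR(user_agent,
         '(Firefox|Opera|midori|Chrome)/[0-9.]+', 1, 1, 'i') AS browser_sig,
       COUNT(*) AS hits
FROM sessions
GROUP BY REGEXP_SUBSTR(user_agent,
           '(Firefox|Opera|midori|Chrome)/[0-9.]+', 1, 1, 'i');
```

Agents the pattern doesn't match come back as NULL, which at least tells you what still needs a rule.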
posted by SyntacticSugar at 1:31 AM on June 1, 2011


No need to write a massive CASE statement. Have your procedural logic check the user-agent against a table containing every user-agent found to date (hint: research indexes). You should be storing a recoded numeric value corresponding to the browser anyway, rather than the full string for every record.

That is, you should have a table LOOKUPTABLE_BROWSER that has ID and Description fields. When you come across something like the Midori user-agent, look for it in LOOKUPTABLE_BROWSER. If it's found, use the ID; if it's not found, insert it and use the ID of the newly-inserted record. Then store that ID next to the other log details in your newly-created LOG_RECODED master log table (which should have pretty much all numeric fields -- no text, especially no text you're going to query against).
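A sketch of that get-or-create lookup as a PL/SQL function; the sequence name and exact column definitions are assumptions:

```sql
-- Sketch: returns the ID for a user-agent string, inserting it into
-- LOOKUPTABLE_BROWSER on first sight. Assumes a sequence browser_seq.
CREATE OR REPLACE FUNCTION get_browser_id(p_agent IN VARCHAR2)
  RETURN NUMBER
IS
  v_id NUMBER;
BEGIN
  SELECT id INTO v_id
  FROM lookuptable_browser
  WHERE description = p_agent;
  RETURN v_id;
EXCEPTION
  WHEN NO_DATA_FOUND THEN
    INSERT INTO lookuptable_browser (id, description)
    VALUES (browser_seq.NEXTVAL, p_agent)
    RETURNING id INTO v_id;
    RETURN v_id;
END get_browser_id;
/
```

If several sessions can insert concurrently, you'd also want to catch DUP_VAL_ON_INDEX and retry the select, assuming a unique index on description.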
posted by Doofus Magoo at 5:57 AM on June 1, 2011


Either you have literally insane numbers of rows in your current table or you don't. If you don't, I really don't see the point of moving the data. Just have a view sitting on top that generates the summaries you're after. That way, if you decide later that you're actually interested in different things, you won't have discarded a load of data for nothing.

If you do have insane numbers of rows and need to summarise the data for reasons of space and performance, then for Pete's sake don't use a cursor. You will be an old man/woman before it's done. Do what inkyz told you to do. And yeah, you need a simple lookup table too, like Doofus Magoo said. This is database code, and you're not writing it like it's database code at all, you're trying to do some long-winded quasi-procedural thing.
posted by Acheman at 6:29 AM on June 1, 2011


No need for a procedure, or a lookup table.

Just group by on the user agent:

>> But I want to combine several similar user_agents into one count (I want to ignore version differences, for example)

Do the group by first, then do the heavy transforming on the much smaller set returned by the group by.

And don't swap out the raw data table; just write views over it.
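A minimal sketch of that approach (view name and CASE buckets are placeholders): aggregate on the raw user_agent first, then classify the far smaller set of distinct strings rather than every row:

```sql
-- Sketch: group first on the raw string (cheap), then classify the
-- much smaller distinct set. View name and buckets are placeholders.
CREATE OR REPLACE VIEW v_browser_counts AS
SELECT CASE
         WHEN ua LIKE '%Firefox%' THEN 'firefox'
         WHEN ua LIKE '%Opera%'   THEN 'opera'
         ELSE 'unknown'
       END AS browser,
       SUM(cnt) AS total
FROM (
  SELECT user_agent AS ua, COUNT(*) AS cnt
  FROM sessions
  GROUP BY user_agent
)
GROUP BY CASE
           WHEN ua LIKE '%Firefox%' THEN 'firefox'
           WHEN ua LIKE '%Opera%'   THEN 'opera'
           ELSE 'unknown'
         END;
```

Because it's a view, nothing is thrown away: if you later care about versions or new browsers, you change the CASE, not your data.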
posted by orthogonality at 10:29 AM on June 1, 2011

