How do I proceed with this temp job and not overthink it?
May 17, 2024 5:07 AM   Subscribe

Basically I’m out of work and used to do, for lack of a better term, FAANG-style IT. If you’re processing 24 TB worth of transactions a day, it’s going to be easier, or at least the same price, to build a data warehouse from the file system on up than to force it onto a vendor. Due to the economy I took a job where they have a couple of small 2 GB databases that were developed in the mid-90s. They never went to the cloud and just grew without a plan. This is only temporary; the person they had running it became sick and is probably retiring. They have a lot of transaction and performance issues. Oddly, I don’t know SQL anymore, or the old way this was written. What are some tips to get me through this politically, and where do I learn how mid-90s SQL works? It is running on-premise on the latest edition of MS SQL.

The company has a few hundred employees, and to give you a sense of their thought process, the CEO/owner keeps the servers in her basement since she got fiber. I’m ignoring anything wrong with that and assuming this company is not keen on change.

The problems I dealt with were more on the level of: how do you write a data cluster that can handle both small and large items on a file system while providing the same interface for SaaS users? How do you schedule jobs to keep data hot and cold, or move it geographically to where it’s needed? What’s the smallest amount of work we can do to support vectorized databases without redoing the kernel and file system, while still trying to match the performance of a specialized system? SQL was a supported query language mainly because reporting tools expected SQL; we’d algorithmically analyze the queries and compile them down to another language. Think Aurora and Cosmos DB.

It was made simply to support tools that needed SQL, and relied on simple queries with pretty good eventual consistency. For our purposes that was fine; it’s hard to write a data store that supports relational data and objects and achieves perfect performance at both.

That was my background, and I worked in a different area. I had assumed that, outside of embedded devices or specific scenarios, things like MS SQL did more or less automatic sharding, or at least took a poorly written statement and essentially optimized it the way LLVM does.

If it does, I can’t seem to find it. Their problem is that the queries take days to run and are 1000+ lines, using every possible type of syntax to create temporary tables, and then on a certain day all the employees run reports from their laptops at the same time. From what I gathered during my interview, a long-running job would eventually trigger a timeout, regenerate these temporary tables, and then get hit again by the Excel reports, which made everything look like it took forever.

My problem is I’m intimidated by replacing a guru who would solve the problem with more subqueries and inner joins and complex logic I didn’t know existed. My second problem should be more obvious: all my career I’ve held jobs that more or less gave me room to improve things. Does that make sense? The culture here seems very reliant on a unicorn type who probably fixed a few problems with hacks, and then that just continued. I’m not being asked to change things, just to improve queries while he’s gone. I briefly asked why it wouldn’t be moved to the cloud, or, keeping it simpler, why not have one person produce a single report with generally what everyone needs (the reports all look very much the same) and manually timeshare, if you will, who gets to run their specialized queries and when. I was trying to gauge their comfort with change versus the risk of me breaking a query built on some weird hack and being up until dawn fixing it.

I know that rewriting the system or upgrading the basement servers is not what’s being asked of me, but part of me wants to take on the impossible task of untangling 30 years of undocumented code on a small dataset. It could be solved in a lot of better ways.

How would you approach this task? I’m guessing the guy I’m subbing for knew the database code well enough that when things came up it wasn’t a problem. And going out on a limb, he had a lot of goodwill built up and also knew whom to help first and what was a priority.

So my current plan is to do nothing and hope small adjustments will work. They have no metrics or definition of what “slow” is, so it’s kind of strange this has been ongoing, but it would make sense if there’s a hardware bottleneck and an adjustment inadvertently just kills a process that is hanging. Besides being intimidated by this code that I’m only supposed to improve in the style it was written in, any other recommendations? It’s odd to be in a situation where no one sees a problem and I’m not able to improve things. It’s temporary, but the pay is good and I’d like to not burn bridges.
posted by anonymous to Computers & Internet (8 answers total) 2 users marked this as a favorite
 
I advise multinational corporations on how to manage billions of transactions per day with millisecond latency. This type of problem is very common and difficult to solve. As just one person who wasn't there when the system was built, you're fighting the worst uphill battle. The level of pain of keeping the system going is the same as it would be to modernize it. A transformation from .. whatever that is, to a modern data management system takes experience, skill, deep knowledge, and an appetite and willingness for the entire business to adapt to change. It's not a one person project, it's essentially transforming the entire way the company operates. So forget that.

You exist as a stop-gap insurance policy, which is also very common, and I'd treat this gig like that. You don't have the deep bench and operational experience in managing data, so just keep things from exploding too badly and be honest but reasonable in your assessments and recommendations. Focus on monitoring and observability so you can figure out that things are bad and what's causing the problem. Manage expectations by being clear about your limitations. Make sure any changes you make have a clear rollback plan, because you will fail and need to roll back.
posted by Geckwoistmeinauto at 6:42 AM on May 17 [3 favorites]


Assuming they shelled out for a full version of SQL Server, SQL Server Profiler is your friend for finding out what's going on. If there's not a ton of traffic you can capture everything, or you can filter for things taking more than 5 seconds, for example. It's also possible just to record deadlocks, which can be one of the reasons you're seeing timeouts.
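
If Profiler feels like too much, the plan-cache DMVs are a quick way to get a first list of offenders. A rough sketch (it only sees queries still in the cache, and the millisecond math is approximate):

    -- Top 20 cached statements by total elapsed time.
    SELECT TOP (20)
        qs.total_elapsed_time / 1000 AS total_elapsed_ms,
        qs.execution_count,
        qs.total_elapsed_time / qs.execution_count / 1000 AS avg_elapsed_ms,
        SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
            ((CASE qs.statement_end_offset
                  WHEN -1 THEN DATALENGTH(st.text)
                  ELSE qs.statement_end_offset
              END - qs.statement_start_offset) / 2) + 1) AS statement_text
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY qs.total_elapsed_time DESC;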

Then what I would do is take a backup copy of the database, run the worst-performing of the queries, try to optimise a new version for time, and compare the old and new results to make sure it's still giving the right answers. If that's not practical you can do it on the live DB, but you can end up seeing inconsistent speeds due to other users.
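
A low-tech way to time a rewrite and sanity-check that it returns the same rows (the temp table names here are just placeholders for however you capture each result set):

    -- Show CPU/elapsed time and I/O for whatever runs next in this session.
    SET STATISTICS TIME ON;
    SET STATISTICS IO ON;

    -- Run the old query, then the rewritten one, and compare the Messages tab.

    -- Sanity check: rows the old version returned that the new one is missing.
    SELECT * FROM #old_result
    EXCEPT
    SELECT * FROM #new_result;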

If you use SQL Server Management Studio to test the queries, the query planner / execution plan is good at telling you which parts take the longest. Usually the cheapest win is adding indexes to columns, but of course indexes can take up a lot of space.
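
An index change might look something like this - the table and column names are made up, so adapt them to whatever the plan says it's scanning:

    -- Hypothetical index to support a report that filters on OrderDate
    -- and joins on CustomerID.
    CREATE NONCLUSTERED INDEX IX_Orders_OrderDate_CustomerID
        ON dbo.Orders (OrderDate, CustomerID)
        INCLUDE (TotalAmount);  -- covering column so the query avoids key lookups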

One problem you may run into is that the client software (or Excel VBA macros, as it sounds like they are using) might be doing stupid stuff that isn't fixable database-side, e.g. doing joins on the client instead of the server, or retrieving a lot of rows one query at a time. Although it's considered a bit old-fashioned now, I usually push for the client software to use stored procedures where possible, as they can be optimized without the client being changed.
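
Something like this, with made-up names - the spreadsheets keep calling one sproc name while you tune the body underneath:

    -- Hypothetical report sproc; the body can be rewritten later without
    -- touching any of the Excel files that call it.
    CREATE OR ALTER PROCEDURE dbo.usp_MonthlySalesReport
        @StartDate date,
        @EndDate   date
    AS
    BEGIN
        SET NOCOUNT ON;

        SELECT c.CustomerName,
               SUM(o.TotalAmount) AS TotalSales
        FROM dbo.Orders AS o
        JOIN dbo.Customers AS c ON c.CustomerID = o.CustomerID
        WHERE o.OrderDate >= @StartDate
          AND o.OrderDate <  @EndDate
        GROUP BY c.CustomerName
        ORDER BY TotalSales DESC;
    END;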

Also, if you have access to any of the latest LLMs (e.g. GPT-4) they are surprisingly good at explaining SQL - probably due to the amount of ancient web pages in their training data.
posted by samj at 6:47 AM on May 17


I recognize parts of this scenario. My experience has been that you cannot really fix the core problems because they probably don't want the core problems to be fixed. My guess is they explicitly want you to kick the can down the road, not build an automated, redundant, scalable can-carrying robot.

You get one or maybe two goodwill changes stepping towards good practices or improvements not directly prompted by an emergent issue. "OK, we can try that as long as it doesn't cost much and isn't inconvenient." The cost threshold varies widely, probably based on the owner's mood, and will occasionally be shockingly high or low. Use these opportunities wisely, and only one at a time. An offsite daily backup for the basement servers might be a good initial use, if another one hasn't already sprung to mind.

Triggers and scheduled jobs can pre-calculate expensive values in the database and store them in a new table. This will only help if you can convince people to update their queries to use the pre-generated values.
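
A rough sketch with made-up names; an Agent job would call this overnight (or a trigger could maintain the table row-by-row if it's small):

    -- Hypothetical nightly pre-aggregation into a summary table.
    CREATE OR ALTER PROCEDURE dbo.usp_RefreshDailySalesSummary
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Wrap in a transaction if readers might hit the table mid-refresh.
        TRUNCATE TABLE dbo.DailySalesSummary;

        INSERT INTO dbo.DailySalesSummary (SalesDate, CustomerID, TotalAmount, OrderCount)
        SELECT CAST(o.OrderDate AS date), o.CustomerID,
               SUM(o.TotalAmount), COUNT(*)
        FROM dbo.Orders AS o
        GROUP BY CAST(o.OrderDate AS date), o.CustomerID;
    END;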

Some expensive reports could be made cheap if the backing data is converted to use columnstore instead of rowstore format.
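
One relatively low-risk way to try this (hypothetical names; test on a copy first):

    -- A nonclustered columnstore index on just the reporting columns,
    -- leaving the base table's rowstore layout alone.
    CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_Orders_Reporting
        ON dbo.Orders (OrderDate, CustomerID, TotalAmount);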

Some expensive queries or slow views can be sped up through materialized views.
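
In SQL Server terms that means an indexed view. A minimal sketch with made-up names (it has to be schema-bound, a GROUP BY view needs COUNT_BIG(*), the summed column must be non-nullable, and non-Enterprise editions may need a NOEXPAND hint to actually use it):

    CREATE VIEW dbo.vw_SalesByCustomer
    WITH SCHEMABINDING
    AS
    SELECT o.CustomerID,
           SUM(o.TotalAmount) AS TotalSales,  -- assumes TotalAmount is NOT NULL
           COUNT_BIG(*)       AS RowCnt
    FROM dbo.Orders AS o
    GROUP BY o.CustomerID;
    GO

    -- The unique clustered index is what actually materializes the view.
    CREATE UNIQUE CLUSTERED INDEX IX_vw_SalesByCustomer
        ON dbo.vw_SalesByCustomer (CustomerID);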

If you can talk them into it, large tables can be reduced in size by migrating historical data to an archive table or database, and important / expensive statistics can be pre-calculated per [insert time unit here]. If there are multiple relevant units, you can pre-calculate values for the smallest unit (daily or hourly) and then provide functions and sprocs to generate weekly, quarterly, annual (etc.) results.
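
A bare-bones archive pass might look like this (made-up names and cutoff; batch it if the table is huge):

    BEGIN TRANSACTION;

    -- Copy old rows to the archive table, then remove them from the hot table.
    INSERT INTO dbo.Orders_Archive
    SELECT * FROM dbo.Orders
    WHERE OrderDate < '2015-01-01';

    DELETE FROM dbo.Orders
    WHERE OrderDate < '2015-01-01';

    COMMIT;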

Getting the reports to call sprocs instead of directly making queries will add a welcome layer of standardization, centralization, and indirection that will allow you to solve problems. This can be an issue if people are used to being able to modify their own queries and feel they are losing something with the change, as updates to the sprocs now need to be funneled through your position.
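
If you go that route, you can enforce the funnel with permissions - a sketch with made-up role and login names:

    -- Report users can execute sprocs but get no direct SELECT on base tables.
    CREATE ROLE ReportUsers;
    GRANT EXECUTE ON SCHEMA::dbo TO ReportUsers;
    ALTER ROLE ReportUsers ADD MEMBER [DOMAIN\SomeReportUser];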

Make sure you're aware of the queries being run and who is running them and why. You may need to reach out to individual people and work with them directly to "fix" their spreadsheet macros or whatever.

Clone the database so you can experiment with big changes in a safe manner and time migrations. Ideally the basement servers have enough headroom that you can do this on them without impacting production, but if you need to you can probably argue for a temporary cloud DB for testing, or see if their rules are lax enough for you to just load the data onto your personal workstation. Since it's basement server land, the owner will probably either say "sure" without a second thought, or immediately be convinced that you are trying to steal their precious information to sell to competitors, ymmv.
posted by Number Used Once at 7:33 AM on May 17 [2 favorites]


I am retired now but did a lot of SQL prior to that, though only from the application programming side, not the system administration / performance side. I just came in to ask if you were familiar with SQL Common Table Expressions, which can sometimes simplify and/or improve SQL execution. FWIW.
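
A quick made-up example of the shape, in case it's new to you - nested subqueries become a readable chain of named steps:

    WITH MonthlyTotals AS (
        SELECT CustomerID,
               DATEFROMPARTS(YEAR(OrderDate), MONTH(OrderDate), 1) AS SalesMonth,
               SUM(TotalAmount) AS MonthlyTotal
        FROM dbo.Orders
        GROUP BY CustomerID, DATEFROMPARTS(YEAR(OrderDate), MONTH(OrderDate), 1)
    ),
    TopCustomers AS (
        SELECT CustomerID
        FROM MonthlyTotals
        GROUP BY CustomerID
        HAVING SUM(MonthlyTotal) > 100000
    )
    SELECT mt.*
    FROM MonthlyTotals AS mt
    JOIN TopCustomers  AS tc ON tc.CustomerID = mt.CustomerID
    ORDER BY mt.CustomerID, mt.SalesMonth;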
posted by forthright at 8:03 AM on May 17 [2 favorites]


For SQL (& the like) W3schools remains the go-to, imo.

2 gigabytes of data? The CEO knows you can get flash drives with terabyte capacity for ten bucks these days, right? Just duplicate the infrastructure: open source, distribute it, etc. Then keep the temporary files in RAM.

My problem is I’m intimidated

Chill? 30 years just means they’re doing something right.
posted by HearHere at 8:42 AM on May 17 [1 favorite]


Since you're running MS SQL, Microsoft will have much better resources than w3schools. They have tons of articles and resources like this that will be directly relevant, more technical, and more in-depth:

https://learn.microsoft.com/en-us/troubleshoot/sql/database-engine/performance/troubleshoot-slow-running-queries
posted by Number Used Once at 9:06 AM on May 17


Something to remember is that FAANG tools and approaches are essentially useless at the scale you are working with. Distributed systems, data pipelines, microservices, service mesh, blah blah blah, it’s all immaterial when you are working at the single-gigabyte scale of data.

You need to get some books on SQL server and learn how to do optimization.

Take a workflow and whiteboard out what is actually happening to see if there are some optimizations you can do.

Understand that it is TOTALLY FINE for a query that powers a weekly or monthly report to take hours - just run it over the weekend.
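
If nobody has set that up, a SQL Server Agent job is the usual way. Roughly like this (job, sproc, and database names are all made up, so double-check the msdb parameters against the docs):

    USE msdb;
    GO
    -- Hypothetical weekend job that runs a heavy report sproc early Sunday.
    EXEC dbo.sp_add_job      @job_name = N'Weekly heavy report';
    EXEC dbo.sp_add_jobstep  @job_name = N'Weekly heavy report',
                             @step_name = N'Run report',
                             @subsystem = N'TSQL',
                             @database_name = N'YourDatabase',
                             @command = N'EXEC dbo.usp_WeeklyReport;';
    EXEC dbo.sp_add_schedule @schedule_name = N'Sunday 2am',
                             @freq_type = 8,             -- weekly
                             @freq_interval = 1,         -- Sunday
                             @freq_recurrence_factor = 1,
                             @active_start_time = 020000;
    EXEC dbo.sp_attach_schedule @job_name = N'Weekly heavy report',
                                @schedule_name = N'Sunday 2am';
    EXEC dbo.sp_add_jobserver   @job_name = N'Weekly heavy report',
                                @server_name = N'(LOCAL)';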

Finally, stick with what you can do in SQL Server. RDBMS tools are not mid-nineties tech. SQL (and spreadsheets) make the world go round, and that’s OK! There are a lot of fun problems you can solve if you dig into it.
posted by rockindata at 4:51 AM on May 19 [1 favorite]


I deal with this quite a lot. You've got two major things in your favor: one is the small size of the data, the other is that you're in a phase where you can tinker-- so focus on how to make that tinkering safe.

- Take a backup of the database, and restore it to a different computer that you can break to your heart's content (a rough backup/restore sketch follows below).
- For an easy win-- try running the same stored query there-- it might run a lot faster now that it has a sole user, probably on a way faster CPU, with no row- or table-level locks happening. If it's suddenly running a 24-hour query in 15 minutes... well-- awesome: set up some automated backup/replication from the 'live' server, call your new one a 'data warehouse', and just use it for the scheduled reporting.
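
The backup/restore itself is just a couple of statements - a rough sketch with placeholder names and paths (run RESTORE FILELISTONLY against the .bak to get the real logical file names for the MOVE clauses):

    -- On production: a copy-only backup so the normal backup chain is untouched.
    BACKUP DATABASE YourDb
        TO DISK = N'D:\Backups\YourDb.bak'
        WITH COPY_ONLY, COMPRESSION;

    -- On the test box: restore under a new name.
    RESTORE DATABASE YourDb_Test
        FROM DISK = N'D:\Backups\YourDb.bak'
        WITH MOVE N'YourDb'     TO N'D:\Data\YourDb_Test.mdf',
             MOVE N'YourDb_log' TO N'D:\Data\YourDb_Test_log.ldf',
             REPLACE;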

At the same time, you've now got a no-risk server to play on: develop your skills, learn how the query profiler works, and figure out which steps are doing full-table scans that could be helped by hitting an index.
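
The missing-index DMVs are another quick way to spot those scans - treat the suggestions as leads, not gospel:

    SELECT TOP (20)
        mid.[statement] AS table_name,
        mid.equality_columns,
        mid.inequality_columns,
        mid.included_columns,
        migs.user_seeks,
        migs.avg_total_user_cost * migs.avg_user_impact * migs.user_seeks AS rough_benefit
    FROM sys.dm_db_missing_index_details     AS mid
    JOIN sys.dm_db_missing_index_groups      AS mig  ON mig.index_handle = mid.index_handle
    JOIN sys.dm_db_missing_index_group_stats AS migs ON migs.group_handle = mig.index_group_handle
    ORDER BY rough_benefit DESC;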

Another question you should ask the guy leaving is-- 'Hey, when someone really needs a report, do you ever just reboot the SQL Server instance?' (to release any locks or other long-running processes). It sounds like the company is used to timeouts, so I wouldn't be shocked if that's the quick fix more often than not.
posted by Static Vagabond at 6:49 AM on May 20 [1 favorite]


