care and feeding of messy data
August 20, 2021 3:42 PM   Subscribe

What are sources of inspiration for data engineering done well? Ideally not so much from a “design cloud architecture for a well-resourced firm” perspective but rather “fight in the messy data muck with public sector analysts and make their lives easier with automation/integration pipelines/etc”!

I finally jumped from data science education into a role that’s closer to a Data Integration Developer on a newer Data Engineering team! Yayyy! I get to write R code to automate tricky messy data cleaning pipelines, which has long been a professional goal, and I’m **into it**.

But! I want to learn how to grow and develop and, in particular, do this kind of “messy data engineering” work well. I’m emphasizing the messiness because this data is coming from a wide variety of sources, at varying quality, and though I’d like to contribute to broader data literacy projects, it’s going to take time to make progress upstream (which, luckily, I find an engaging challenge.)

I’m tapped into #rstats and R Ladies communities, which are great, but I notice a disconnect. Much of these communities are focused on the analyst/modeler/researcher perspective more than implementing integrations/automation/etc. On the flip side, I’ve done some BigQuery and cloud automation server stuff in previous workplaces (very peripherally), but our environment is still at least a year or two away from that type of infra. It’s much more like, complicated SQL work across physical servers, helping people to use git/GitHub for the first time, etc. When it comes to designing systems, I also don’t necessary love that IT Systems/Admin/Manager direction, and would prefer to steer my career closer to the users and their code. At the same time, I don’t want to contribute to inefficient mucky solutions if I should be advocating for better data pipeline patterns (which I’m still learning to recognize!) Anyway, this gives you at least a modest glimpse into where I’m at. I also know Python and SQL so it’s not like R-or-bust, but I’ve found it useful to code closer to the analysts’ style of coding given frequent collaboration on data cleaning work (they actually largely use SAS, but, another story there).

My question for you: what sources of inspiration (books? Podcasts? Particular coders? communities?) do you think of with this work? What’s a good roadmap for career development in this area — and is calling it data engineering even the right way to think about it? Would love any thoughts you may have, thank you!
posted by elephantsvanish to Computers & Internet (5 answers total) 9 users marked this as a favorite
Designing data pipelines in the way you describe is software engineering. You should think of your job as building a software system, with your fellow data engineers as the users of this system.

I am not sure that this work is impossible to do in R, but to be honest I haven't seen it done in R. And the kinds of skills that you can build are around modularity/code reuse, testing, devops, productionization/error handling, etc.

I am happy to suggest some resources around software engineering. It's a fun field and you get to dig into lots of interesting stuff that most people on the data "front lines" don't have time for.
posted by goingonit at 6:25 PM on August 20, 2021

Learn you some OpenRefine! For tabular data cleanup it is the bee's knees and the cat's meow.
posted by humbug at 7:12 PM on August 20, 2021

I am not current on things like 'R' and Python, coming from a mainframe environment, but we did have SAS. In my work I did computer performance reporting, based on a tool called MXG which was entirely written in SAS. Some lessons learned which I feel can be passed on:

1. Create a common set of variables/columns for what you need to analyze. Include logical variables to indicate when a variable is missing.
2. Design a spreadsheet, or table in a database, or SAS dataset to contain this data
3. For each data source, design a small program to gather the data, clean it, transform it to your common format, and mark those logical variables
4. Document each of those transform programs very well, including reasons why they don't supply certain variables, etc.
5. write all analysis programs to use the common format as input

From what I'm reading, parts 3 and 4 constitute a 'data pipeline'. The main point is not to mix up your input pipelines with your reporting programs
posted by TimHare at 8:33 PM on August 20, 2021

goingonit is right that at its core it's software engineering, but the more specialized conceptualization of the discipline is not really mature in the way that "data science" or "devops" are. Data engineering for analysts in the public sector is a weird space that touches on a bunch of other stuff, while at the same it always feels like nobody else is quite in the same specific situation. I think in part this is because different types of analysis are often their own islands (think of a SAS programmer doing public health research vs economists who work in Stata vs somebody doing biology stuff in R). Different domains have vastly different data needs (data size, types of datasets), analysis needs (time series predictions, hypothesis testing, all kinds of newer ML algorithms), and data origins (sales data, gene sequencing data, public macroeconomic datasets, messy survey data, internal SQL databases, etc). The best I can do is offer some connections to other domains with some overlap.

Two domains that sound related to your daily work are ETL (extract, transform, load) and MLOps (machine learning ops). ETL is mostly associated with big enterprise software where they are moving data between databases. There are specialized tools for it and it's closer to the data layer -- the kind of thing database people might specialize in, like DBAs that know the ins and outs of Oracle. MLOps is basically DevOps for ML, where you can think of ML as a buzzword for "predictive analytics." Often MLOps is being done by big tech companies, which means they are developing some very cool tools that are sometimes useful, but often too specialized, or way overkill for your public sector analysts. There is stuff for versioning data and models, automating workflows, that kind of thing.

As a guiding principle I would try and think of it in terms of data done well (stored in the right formats in the right places), combined with ETL that is robust (testable, validated), fast, and as transparent to end users as possible.

When you're dealing with tons of different data sources and different types of data transformation, you need a system that is centralized and standardized enough to make updates intuitive and safe, but flexible enough to communicate and pull data from all your sources. R isn't traditionally used for this kind of thing and doesn't have the tooling of e.g. Python or standardization of SAS. R also doesn't have the web application tooling of Python, with Flask and Django and all that, which can be useful if you're building anything that's not a monolith. R Shiny and other tools can be great if you're using them within their intended purpose, but again that intended purpose isn't usually data engineering.

One thing I've found helpful is a thorough understanding of exactly what data is and how it works on the computer. The book Designing Data-Intensive Applications is great for this. (although, again, it touches on a lot of domains but "preparing data for public sector analysts" isn't really one of them)

Another neat tool I've used a bit is dbt, which was originally a simple CLI tool (and that core tool is still FOSS) for maintaining and compiling SQL pipelines but has grown into more of an ETL (or ELT) frontend. There's a sense in which the core tool is pretty similar to a bunch of PROC SQL statements in SAS, but it's built around much better software engineering and data governance principles. (you're limited to just SQL though)

Best of luck with your new position!
posted by ropeladder at 10:27 AM on August 22, 2021

Echoing some things goingonit said, you're perilously close to discovering you're a software engineer whether you intend to become one or not (never fear, it happened to me and I survived). The term you want to search for is ETL ("extract-transform-load") and, you're correct, this is data engineering.

Depending on exactly what your needs are, it may or may not make sense to write your logic in R. People tend not to do so for a couple of reasons--people coming at data engineering from the software engineering side aren't going to know R and, let's be honest, R is a pain in the butt when it comes to testing and other things people like when it comes to writing "production" software. (That said, the tidyverse makes it glorious for data transformation and if you need to share code with people using R, it may be the right choice for you.) You'll want to look into what are (apparently) called "workflow management" libraries--things that allow you to stitch your many steps together into one pipeline. Two big names that are Python-based are Airflow and luigi, but there are loads of others. (Note: you can kick of whatever kind of job via airflow or luigi, so you can use them to stitch together R scripts or SAS scripts or whatever.) At my first job, we kicked luigi flows off with cron jobs, we didn't stand up a permanent UI, but I think it supports that.

In terms of learning resources, I honestly just learned stuff on the fly. I did have more experienced colleagues to lean on, which it sounds like you may not. I did go to Data Day Texas one year when I was living in Texas and that was reasonably educational in terms of giving me some sort of conceptual understanding of various things (I think that's where I learned what Kafka is, for example).
posted by hoyland at 10:38 AM on August 22, 2021 [1 favorite]

« Older Did your youth church group adopt Rivers Of...   |   Help with composing script for calling a store... Newer »

You are not logged in, either login or create an account to post comments