How do I use LLM to help my ETL workflow?
January 4, 2024 6:04 PM

I previously used Apache Airflow, Luigi, etc. to build my ETL processes. I have 40-50 csvs with different data structures and varying quality, about 400k records in total. I’d like to use a local LLM (i.e., not OpenAI, to keep costs down). Which model should I run, and how do I build an ETL process around it in something like Airflow or similar? How do I “code” using an LLM? These are semi-structured business listings.

I have a bunch of csvs for which I’d traditionally have to create a separate mapping for each field. For the most part the data is structured, but sometimes a value is split across several headers. So business hours might be split into several fields, and I need them all in one field as an RFC 5545 recurrence rule:

FREQ=WEEKLY;BYDAY=MO,TU,WE,TH,FR;BYHOUR=8,9,10,11,12,13,14,15;TZID=America/New_York;

That’s Monday through Friday, 8am to 5pm.
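
Building that string is mechanical once the days and hours are parsed out. A rough, untested Python sketch (the function name and inputs are made up):

DAY_CODES = {"monday": "MO", "tuesday": "TU", "wednesday": "WE",
             "thursday": "TH", "friday": "FR", "saturday": "SA", "sunday": "SU"}

def hours_to_rrule(days, open_hour, close_hour, tzid="America/New_York"):
    # days like ["Monday", "Tuesday"]; hours as 24-hour integers
    byday = ",".join(DAY_CODES[d.strip().lower()] for d in days)
    byhour = ",".join(str(h) for h in range(open_hour, close_hour))
    return f"FREQ=WEEKLY;BYDAY={byday};BYHOUR={byhour};TZID={tzid}"

The hard part is extracting clean days and hours from the messy source fields, which is where I’m hoping the LLM comes in.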

Basically I’d like to say “given these csv files, convert them to this json schema, putting hours of operation into RFC 5545 format in the hours-of-operation field, and then format all fields in a pretty format.”

This works surprisingly well in my testing with ChatGPT. Even when I take a single record whose field names don’t match the schema 1:1, it seems to figure it out. The “make pretty” step converts things like an all-caps address field into something closer to what you’d see in Google Maps. For the hours of operation I’d probably add “and make it in the time zone of the address listed here.”

The csvs themselves were scraped from web pages, so the data may be well structured or may just say “8-5.” I’ll need to add additional caveats, but I think it’ll get pretty close and save me from going through every single csv to map the fields, clean up the data, transform it, etc.
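
Roughly what I’m picturing per record, as an untested sketch against a local model. It assumes Ollama’s /api/generate endpoint and a made-up schema prompt; any local runner with a completion API would do:

import csv, json, requests

SCHEMA_HINT = """Return JSON with keys: name, street_address, city, state, zip, hours.
Put hours into an RFC 5545 recurrence rule such as
FREQ=WEEKLY;BYDAY=MO,TU,WE,TH,FR;BYHOUR=8,9,10,11,12,13,14,15;TZID=America/New_York
using the time zone of the address."""

def normalize_row(row):
    # one model call per record; "format": "json" asks Ollama for valid JSON
    prompt = f"{SCHEMA_HINT}\n\nRecord:\n{json.dumps(row)}\n\nJSON:"
    resp = requests.post("http://localhost:11434/api/generate",
                         json={"model": "mistral", "prompt": prompt,
                               "format": "json", "stream": False},
                         timeout=120)
    return json.loads(resp.json()["response"])

with open("listings.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(normalize_row(row))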

This is beginning to feel a bit like a DSL, and I’m wondering if I’m approaching this the right way. I have some known-good listings in the csvs, so I can manually check the web-scraped records against previously hand-entered addresses. Differences like “Street” vs “St” I don’t care about, and I haven’t thought about automated testing yet.
posted by geoff. to Computers & Internet (4 answers total) 1 user marked this as a favorite
 
To clarify, are you wanting to do inference on every row of the csv or have it write the code for you? And what's your tolerance for errors?
posted by hoyland at 7:12 PM on January 4 [1 favorite]


Response by poster: Inference on every row. I’m using this to determine errors and the types of errors. I’d be open to having it write the code for me.

There is no common format for the csvs; whoever made them didn’t come up with a common way to capture the data, so that’s why I’m looking for a way to get around manually mapping each csv, with its myriad columns, to something normalized. Some might have a column called “name,” others might call it “name of business.” Furthermore, some separate the address into street address, city, state, and zip columns.

I’m also making the huge assumption that the validity of the csvs is fine. If there are no hours of operation for a business in Delaware, it’s because the state didn’t provide them.
posted by geoff. at 7:39 PM on January 4


This sounds more like a case for data cleaning than running inference on every row, given the volume of data. Have you tried uploading the files to GPT-4 with Code Interpreter and explaining what you want? You can upload a zip file of the entire folder. You could at least use that to extract the field mappings for each file. I'm guessing that'd be a more productive path.
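
For example, rather than one model call per row, something like this untested sketch (canonical field names made up, local model via Ollama assumed) gets one header mapping per file and then applies it deterministically:

import csv, json, requests

CANONICAL = ["name", "street_address", "city", "state", "zip", "hours"]

def map_headers(headers):
    prompt = (f"Map these csv headers {headers} onto the canonical fields "
              f"{CANONICAL}. Reply with a JSON object of header -> field, "
              f"using null for headers that match nothing. JSON:")
    resp = requests.post("http://localhost:11434/api/generate",
                         json={"model": "mistral", "prompt": prompt,
                               "format": "json", "stream": False},
                         timeout=120)
    return json.loads(resp.json()["response"])

with open("some_listing.csv", newline="") as f:
    reader = csv.DictReader(f)
    mapping = map_headers(reader.fieldnames)  # one model call per file
    for row in reader:
        print({mapping[k]: v for k, v in row.items() if mapping.get(k)})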
posted by lookoutbelow at 9:14 PM on January 4 [2 favorites]


Given that it sounds like the data is in known formats, just not a common format, I suspect it's easiest just to write N standardization scripts. When it becomes apparent that maintenance is death by a thousand paper cuts, then you look at a model, but I'm reasonably certain you'd end up wanting to train the thing yourself.

How you do the standardization is orthogonal to Airflow/Luigi/friends; they're just doing the task orchestration. An Airflow Operator is just some Python code that kicks off a task and raises an exception if it fails. The task can be implemented however is appropriate; you just have to know whether it succeeded or failed (yes, you're taking up a worker slot if you're waiting for some process you've kicked off elsewhere to finish, but ignore that).
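
For example, a minimal DAG where the standardization is just a Python callable (task and DAG names made up; assumes a recent Airflow 2.x):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def standardize():
    # read the raw csv, apply your per-source mapping, write the cleaned
    # output; raise an exception on bad data and Airflow marks the task failed
    ...

with DAG(dag_id="listings_etl", start_date=datetime(2024, 1, 1),
         schedule=None, catchup=False) as dag:
    PythonOperator(task_id="standardize_listings",
                   python_callable=standardize)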
posted by hoyland at 9:41 PM on January 4 [2 favorites]


