Tools for building datasets as I wander through the internet
September 18, 2024 3:38 PM

I often find myself grabbing text off a website and pasting it in a spreadsheet to do quick-and-dirty analysis - for example, making my own table of grants made by a foundation, using pdf annual reports or webpage graphics. I imagine that this need is relevant for tons of professionals - people who do actual research for a living, for sure, but also novelists and lawyers and entrepreneurs. Are there any non-shady*, privacy-oriented, and no-code-necessary tools that folks find useful for this kind of work?

In my daydreams, I imagine something like a browser extension, where I get to highlight a bunch of text info, have the extension say, "We see what looks like 5 columns' worth of data. What are the column names?", and the output is a CSV (or a Google Sheet that preserves links, be still my beating heart), maybe even with a column added for me to add some metadata notes. If I could save table schemas and add/align new info to them? Bliss!
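
For anyone who ends up on the "make it yourself" path, here is a minimal sketch of that daydream in Python, not an existing tool. It assumes the highlighted text arrives tab-separated, one row per line (which is how copied tables often paste), guesses the column count from the most common row width, and writes CSV with an empty "notes" column on the end. The function name `pasted_text_to_csv` is made up for illustration.

```python
import csv
import io
from collections import Counter


def pasted_text_to_csv(pasted: str, column_names: list[str]) -> str:
    """Turn tab-separated pasted text into CSV text with an empty 'notes' column.

    Hypothetical helper: assumes one row per line and tabs between cells.
    """
    rows = [line.split("\t") for line in pasted.splitlines() if line.strip()]
    if not rows:
        raise ValueError("nothing to convert")

    # Guess the column count as the most common row width.
    width = Counter(len(row) for row in rows).most_common(1)[0][0]
    if len(column_names) != width:
        raise ValueError(
            f"looks like {width} columns, but {len(column_names)} names were given"
        )

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(column_names + ["notes"])
    for row in rows:
        if len(row) > width:
            # Fold any overflow cells into the last column rather than drop them.
            cells = row[: width - 1] + [" ".join(row[width - 1:])]
        else:
            # Pad short rows with empty cells so every row matches the header.
            cells = row + [""] * (width - len(row))
        writer.writerow(cells + [""])
    return out.getvalue()
```

Calling `pasted_text_to_csv(text, ["grantee", "year", "amount"])` returns CSV text that can be saved to a file and opened in any spreadsheet. Reusing the same list of column names across calls is a crude stand-in for a saved schema.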

I've noodled on whether other existing productivity software might help with this - I never got into Evernote, and left grad school before Zotero, etc. were integrated with browsers. My bookmarks are legion, but what I'm looking for is something beyond tagging/folder structure for links. (This question has given me reason to go look into Pinboard, which I've seen recommended on MetaFilter a number of times.)

What I'm thinking of feels similar to things like automated resume parsers on job-application websites, or import tools where you can, e.g., create a CSV file to add a heap of contacts to software you use by following their column and data-type structure. Of course, one of my problems is that I want to make MANY spreadsheets, and not just add record after record to the same file.

*On shadiness: when I poke around in the Chrome webstore, I find a handful of web-scraping tools that seem close to what I want to do, but I don't feel great about them. I know that ~"AI" is in the water at this point, but I'm skeeved by tools that scream, "USE AI TO DO CAPITALISM", and would like to avoid tools that use LLMness to deliver the service I'm paying for if at all possible. All the extensions I've seen are free, and that feels fundamentally incompatible with data security - how else would the developers make any money?

"Instant Data Scraper", for example, looks like it does what I want, but I am not reassured by "Developer promise: This extension does not contain any malware or spyware beyond standard Google Analytics." "Simplescraper" looks like it has even more of the features I want, but also gives me bad vibes.

My vibes may be totally off base here! And the answer might be, Learn to code and make it yourself, rrrrrrrrrt. But if you've found good tools for organizing lots of different information in structured ways, I'd love to hear about and learn from your experience.
posted by rrrrrrrrrt to Technology (12 answers total) 20 users marked this as a favorite
 
This is a prime task for an LLM and I’m not sure why they’re off the table. An LLM is just that, a model. Melissa Dell at Harvard has some models that are designed to parse tables in historical documents, but they are far from no-code. Unless the data is always in a similar format, or you’re fantastic at writing regex, this is going to be difficult without some modeling. Perhaps you’re looking for a specific tool that uses LLMs to parse the query but does not share the data that is input into the query itself? That would probably work great.
posted by MisantropicPainforest at 4:12 PM on September 18 [2 favorites]


Google Sheets can import a table pretty easily with the IMPORTHTML formula. I think this would be useful as your first step. I use it to grab tables in webpages all the time. As an example, I use this formula to grab the list of every show currently playing on Broadway:
=IMPORTHTML("https://en.wikipedia.org/wiki/Broadway_theatre","table",1,"en_us")

When I paste that into a cell and hit enter, it pulls the entire table into the sheet and I can copy and paste it (paste special values if needed) into any other tab or file. It works with lists as well as tables. The help page linked above lists some other interesting commands like IMPORTXML, IMPORTFEED (for Atom and RSS), and IMPORTDATA for CSV or TSV.
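
If you ever want the same thing outside of Sheets, here is a rough Python sketch of the idea: pull the Nth table out of a page's HTML and write it as CSV. It uses only the standard library. The names `TableExtractor` and `table_to_csv` are made up for this example, the index counts from 1 to mirror the formula above, and it assumes simple, non-nested tables whose cells have explicit closing tags. It is not how IMPORTHTML works internally, just the same outcome.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every <table> in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows; each row a list of cells
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def table_to_csv(html: str, index: int = 1) -> str:
    """Return table number `index` (counting from 1) as CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    table = parser.tables[index - 1]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(table)
    return out.getvalue()
```

The `html` argument is the page source as a string, for example the contents of a page saved from the browser with "Save Page As".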

Pinboard was a great tool once upon a time. I still use it as I was grandfathered into a one-time fee. But don't count on any support from the site owner. They've basically given up responding to people, or take a very, very long time to do so.
posted by soelo at 4:31 PM on September 18 [5 favorites]


I wouldn’t use a Chrome-based browser to do this, since Google is messing around with its extension capabilities.

However, I will put in a plug for Microsoft Excel’s data import features if you have access.

I do think that an (ideally open source) separate application is more likely to be non sketchy. As a starting place, I would suggest poking around Alternativeto.net. See how they classify different features.
posted by oceano at 4:51 PM on September 18 [1 favorite]


I saw someone do this on YouTube with genealogy, but I don't know enough about what you are asking to know if this is what you are talking about. I think it is.
https://www.youtube.com/watch?v=upecTYEcxnw
Excel talk starts at about 5 minutes.
posted by memoryindustries at 5:03 PM on September 18


I was going to say Excel, as well, which would cover off any HTML tables. And if you get comfortable with Power Query, you could combine and analyze data from different sources together.

If you do want to try the LLM route, this open source browser extension project has a function for extracting structured data from a page, if I recall correctly: https://github.com/SuffolkLITLab/prompts.
posted by lookoutbelow at 6:29 PM on September 18


If you just want to clip pages, I use "Omnivore", but it does not parse. It is basically an offline reader that you can tag later. It also works for email mailing lists, as you can generate email aliases from it to give to email subscriptions.

If you have to grab actual tables of figures off existing pages, you'll probably need some of those Chrome extensions.
posted by kschang at 7:37 PM on September 18


I would not use an LLM for data specifically unless you plan to hand-check every data point compared to the original.

LLMs tend to produce output that looks right, and may even be very close to the original, but I would not trust them to reproduce every item in the original data exactly.

(I'm open to being proven wrong — maybe there is a model out there that is good at the task of 'extracting data tables' specifically — but I would want to see some pretty thorough accuracy proof before trusting it.)
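
A small Python sketch of what a first-pass check could look like, under stated assumptions: you have the original page text and the extracted rows side by side, and you flag every cell whose text does not appear verbatim in the original. The function names are invented for this example. It only catches values that appear nowhere in the source; a real value placed in the wrong row would still pass, so it narrows the hand-checking rather than replacing it.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unverified_cells(
    rows: list[list[str]], source_text: str
) -> list[tuple[int, int, str]]:
    """Return (row, column, value) for each non-empty cell missing from the source.

    Hypothetical helper: a plain substring check, nothing smarter.
    """
    haystack = normalize(source_text)
    missing = []
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            needle = normalize(value)
            if needle and needle not in haystack:
                missing.append((r, c, value))
    return missing
```

An empty result means every cell was found somewhere in the source text; anything returned is a cell to look at by hand first.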
posted by mekily at 7:58 PM on September 18 [2 favorites]


So this literally just launched this week, but it’s by the folks at Fathom who know a thing about data visualization:

https://rowboat.xyz/
Rowboat is our tool for quickly making sense of tabular data (think Excel and CSV files). It runs entirely in a web browser, so visit rowboat.xyz and drop a file on the page. You’ll see its contents instantly visualized, ready for filtering, sorting, and all the other fun things one might do with a dataset.

The general idea is that as more people deal with data in their every day work, they often wind up using the wrong tools, like Excel, which isn’t built around understanding data. Options beyond Excel are so over-built—Enterprise licenses, complex installation, writing code—that there’s an enormous amount of work to do before you can actually see what’s in the data. Which, of course, completely misses the point of collecting the data in the first place.
(From the launch post on Mastodon)
posted by graphweaver at 8:48 PM on September 18 [4 favorites]


Not exactly sure - but maybe check out Steampipe and/or Turbot Pipes
posted by TimHare at 9:02 PM on September 18


When I want to do things like that, I usually copy the table using CopyTables (Firefox extension) and paste stuff into Grist - I use the locally-hosted self-managed open source version. No AI there, but Grist has a ton of nifty templates, and just copying data and pasting it into something database-spreadsheet-esque is a nice start.

Hope that's helpful!
posted by kristi at 10:13 PM on September 18 [1 favorite]


This is a bit tangential to your question, but OpenRefine is a tool for aligning messy data.
posted by adamrice at 5:33 AM on September 19


I have used ChatGPT 4 and more recent versions for this exact task (if I understand correctly, you want to take a screenshot of columnar data and feed it into some tool that will turn it into CSV for you). I agree with others that there's no specific reason that you should fear using AI for this, aside from the fact that as with any tool you should be engaging in some kind of QA process to ensure the results are coming out as you expect them to. In terms of privacy, you should consider any inputs you feed to a publicly hosted LLM to be insecure, but you said these are data you're getting from other people's websites, so the concern is less clear to me.

Modern websites are generally not amenable to being "scraped," which is the term of art for using a 3rd-party tool to extract data as you describe, and many put up a lot of technical roadblocks to doing so (in addition to making it a violation of their terms of service, if you are accessing the content through a login). In many ways, it was technically much easier to automate what you're interested in doing probably 10-15 years ago, but I digress.
posted by telegraph at 9:30 AM on September 19

