My Own Private (or public) Google
December 3, 2020 7:37 AM

I have a hard drive with ~3TB of assorted files (html, video, etc.) scraped from a large (public) website at a single point in time. The files aren't arranged in a particularly human-readable way, but (I think?) in folders by file type. How do I make them my own private Google -- or open to the public is fine?

My goal is to have this archive in a format where a relatively small number of people could pull up a browser, enter text (or filetype) in a search field, and have relevant results pop up -- really, exactly what Google does. It could be a system where they need to set up an account (ideally free for them), or something open to the public (not sensitive, if not popular either).

Difficulty: I understand computers, and 10 years ago might have clawed my way into setting up my own CMS, and maybe an SQL install or something, but I'd rather just have an off-the-shelf product that works quickly and that I can set up with less terminal and more mouse. I'm willing to pay ~$20/month, or maybe more (since this is potentially time limited).

One idea I had was just to set up a Google Drive account, create a shared drive, and upload everything there (though I think uploads are limited to 750GB/day or something). I can try to trim it to under 2TB (the jump from 2TB for $9.99 to 10TB for $49.99 is massive)... Or should I just try to get the data into the cloud somewhere, hope Google indexes it, and create a one-page web interface that routes searches to site:xyz? (Does "hope Google indexes it" work here?)
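
For what it's worth, the "route searches to site:xyz" part is only a few lines. Here is a rough sketch of what the one-page interface would do with the text from the search box. The domain `example.org` is a placeholder for wherever the files end up, and this only helps if Google has actually crawled and indexed those pages.

```python
from urllib.parse import urlencode


def site_search_url(query: str, site: str) -> str:
    """Build a Google search URL restricted to one site via the site: operator."""
    # "site:example.org budget report" limits results to pages on example.org.
    params = {"q": f"site:{site} {query}"}
    # urlencode percent-encodes the colon and turns spaces into plus signs.
    return "https://www.google.com/search?" + urlencode(params)


if __name__ == "__main__":
    # example.org is a placeholder domain, not the real archive location.
    print(site_search_url("budget report", "example.org"))
```

A search form would simply redirect the browser to the URL this returns.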

Other ideas are welcome! Thank you!
posted by rdn to Computers & Internet (5 answers total) 5 users marked this as a favorite
 
Be careful about copyright issues -- just because the files were scraped from a public website doesn't necessarily mean you have the right to distribute them.

Google Drive seems like a good solution. Dropbox also offers 2TB of storage for $22/month. Box seems to offer unlimited storage for $15/month with a business account, though I haven't used their services in the past few years.

If (and only if) you have the right to distribute the files, you can upload them to the Internet Archive.
posted by mekily at 8:27 AM on December 3, 2020 [1 favorite]


Never used it, but Google's Programmable Search Engine seems like it can cover the search index (and interface) part of this if everything is online. (There are plenty of 'off-brand' options too.)

No idea the best way to host 3 TB; list prices are pretty steep for cloud hosting.
posted by mark k at 8:28 AM on December 3, 2020 [1 favorite]


Response by poster: (Thank you all! I should clarify further - the files came from a US government website, so no copyright. Thank you again for the great tips!)
posted by rdn at 10:11 AM on December 3, 2020


Current best practice for home-spun search is Elasticsearch with Logstash ingesters and Kibana/Grafana web user interfaces. The stack is known as ELK, and you can configure how it builds indexes of the material it has scanned. You can run it in containers on local or networked servers, with a web search box for the indexes it has created.
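
As a rough illustration of what the ingest step produces, here is a sketch that walks a folder and builds a request body in the newline-delimited JSON format that Elasticsearch's `_bulk` endpoint accepts. Logstash would normally do this job; the index name `archive` and the field names are made up for the example, and only file metadata is included (for HTML you would also want the extracted text in a field).

```python
import json
from pathlib import Path


def bulk_payload(root: Path, index: str = "archive") -> str:
    """Return an NDJSON body for Elasticsearch's _bulk endpoint, one doc per file."""
    lines = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        # Action line: which index to write to and which document id to use.
        lines.append(json.dumps({"index": {"_index": index, "_id": rel}}))
        # Source line: the searchable fields (hypothetical field names).
        lines.append(json.dumps({
            "path": rel,
            "filetype": path.suffix.lower().lstrip("."),
            "size_bytes": path.stat().st_size,
        }))
    # The bulk format requires a newline after the final line.
    return ("\n".join(lines) + "\n") if lines else ""
```

The resulting string would be sent as the body of a POST to `/_bulk` with the content type `application/x-ndjson`, in batches rather than all 3TB worth of files at once.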
posted by k3ninho at 12:35 PM on December 3, 2020 [5 favorites]


How many is "a relatively small number of users"? Like 5 or 50 or 500, or is it unknown? Will the data change over time, or is it purely a static archive that won't change in the future at all? Are the users likely to be tech savvy? Would they want to access lots of the data or only one or two things from it? (That affects the data transfer costs some cloud providers charge for access: if 50 people all pull down 3TB, that's quite a lot.)
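
To put a rough number on the transfer question, here is a back-of-the-envelope sketch. The $0.09 per GB rate is purely an assumed figure for illustration; check the actual pricing of whichever provider you pick.

```python
def egress_cost_usd(users: int, tb_per_user: float, usd_per_gb: float) -> float:
    """Estimate data-transfer charges if each user downloads tb_per_user terabytes."""
    gb_total = users * tb_per_user * 1000  # decimal units: 1 TB = 1000 GB
    return gb_total * usd_per_gb


if __name__ == "__main__":
    # 50 users each pulling the full 3 TB, at an ASSUMED rate of $0.09 per GB.
    print(f"${egress_cost_usd(50, 3, 0.09):,.0f}")
```

Under that assumed rate, 50 full downloads is 150 TB of transfer and on the order of $13,500, which is why the number of users and how much each one pulls down matters more than the storage price.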

Random thoughts to take a different approach, given that 3TB isn't a small or cheap dataset to host and make searchable for an unknown volume of users:

1. Came in to mention the Internet Archive as well - have you checked if the data is already there, or at least enough of it to be useful? Maybe your problem has already been partially solved for you, especially if it's a US Government website?

2. If it's a *really* small number of users and static (like a few researchers you collaborate with etc), you could just buy a few 3TB external drives and gift them out and tell people to go get a decent desktop search tool (like Copernic) if their desktop OS doesn't do great on indexing.

3. If it's a larger group of people who are interested you could seed a copy of the data into a peer to peer network and people could download their own 3TB of goodness (or more likely split down files by subject/site etc.). Would require enough people who have the data continuing to share / you having a fairly fast connection to seed from etc.
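
For anyone curious what a desktop search tool is doing under the hood in option 2, here is a toy sketch of an inverted index over the text files on a drive. It is an illustration only: everything is held in memory, HTML tags get indexed along with the visible text, and the list of file extensions is an assumption. A real tool handles far more formats and stores its index on disk.

```python
import re
from collections import defaultdict
from pathlib import Path

WORD = re.compile(r"[a-z0-9]+")


def build_index(root: Path, suffixes=(".html", ".htm", ".txt")) -> dict:
    """Map each lowercase word to the set of files (relative paths) containing it."""
    index = defaultdict(set)
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            rel = path.relative_to(root).as_posix()
            for word in set(WORD.findall(text.lower())):
                index[word].add(rel)
    return index


def search(index: dict, query: str) -> list:
    """Return the sorted paths of files that contain every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)
```

Usage would be along the lines of `idx = build_index(Path("/mnt/archive"))` followed by `search(idx, "annual report")`, where the mount path is a placeholder.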
posted by inflatablekiwi at 1:56 PM on December 3, 2020 [2 favorites]


This thread is closed to new comments.