Combining multiple json files in a batch: Help!!
June 2, 2016 10:12 PM   Subscribe

I have about 12K json files that I need to combine into one or two. The only tools I can find online require you to select each file one at a time (or are super pricey for a one-time use). This isn't really feasible with 12K files. Ideally, I could select all files, click upload, and the tool would generate a joined file. This is way beyond my limited coding knowledge. Any thoughts? Thanks!
posted by emkelley to Computers & Internet (14 answers total) 2 users marked this as a favorite
If all you need to do is concatenate text files, you can do that in a bash-like shell thusly:

cat *.json >> one_file_to_rule_them_all

or if they're in subdirectories:

find ./ -iname "*.json" -exec cat "{}" >> one_file_to_rule_them_all \;

If you need the result to comply with some sort of schema things will be harder.
posted by mce at 10:28 PM on June 2, 2016 [1 favorite]

Just catting the files together will result in a large blob of text, not JSON -- you need to have a comma between each object, and slap brackets around the thing, if nothing else.

Do you want to join them into a single usable JSON object, and are you comfortable with throwing together a bit of python?

(I have a little python widget I use for this at work, but it's at work and I have not yet had enough coffee to want to re-make it right now, but maybe I can memail it later if you're interested. Sorry for the half-offer here.)
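In the meantime, here's the gist as a rough sketch -- untested, assuming each file holds a single top-level JSON value; the function and file names are made up:

```python
import glob
import json
import os

def combine_to_array(dirpath, outpath):
    """Parse each .json file in dirpath and write them out as one valid JSON array."""
    combined = []
    for path in sorted(glob.glob(os.path.join(dirpath, '*.json'))):
        with open(path) as f:
            combined.append(json.load(f))
    # json.dump adds the brackets and commas that plain cat leaves out
    with open(outpath, 'w') as f:
        json.dump(combined, f)
    return combined
```

That's the "comma between each object, brackets around the thing" step, done by the json module instead of by hand.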
posted by sldownard at 11:46 PM on June 2, 2016 [1 favorite]

The easiest way would be to load the JSONs in some high-level language, where the data is easier to manage, combine them there, and then dump the result back out as JSON. With Python you'd want something that:

- builds an empty dict: d = {}
- reads all the file names in the dir where you have the JSONs
- loops over the filenames: for filename in filenames:
-- opens one file at a time: f = open(filename)
-- parses it into a new dict: nd = json.load(f)
-- a) saves it into the result dict: d[filename] = nd
-- OR b) updates the result dict, overwriting with new values: d.update(nd)
- then saves the result dict as JSON: f = open('result.json', 'w')
- json.dump(f)
- f.close()

The choice between a and b is because we don't know whether you want the joined files kept as separate entries in one big JSON, e.g. {file1: [a, b], file2: [a, d]}, or merged into the same namespace, e.g. join(file1 = [a, b], file2 = [a, d]) = [a, b, d].
Python's built-in json library can handle all of this; see the json module documentation.
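Spelled out as a runnable sketch (option a, keeping each file under its own key; the function and file names here are just placeholders):

```python
import json
import os

def combine_json_files(dirpath, outpath):
    """Merge every .json file in dirpath into one dict keyed by filename."""
    d = {}
    for filename in sorted(os.listdir(dirpath)):
        if filename.endswith('.json'):
            with open(os.path.join(dirpath, filename)) as f:
                d[filename] = json.load(f)  # option a: one entry per file
    with open(outpath, 'w') as f:
        json.dump(d, f)
    return d
```

For option b you'd replace the d[filename] = ... line with d.update(json.load(f)).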
posted by Free word order! at 1:46 AM on June 3, 2016 [1 favorite]

second last line should be 'json.dump(d, f)'
posted by Free word order! at 1:56 AM on June 3, 2016

Is each file a separate JSON object (with { } around the whole file) or array (with [ ] around it) and you need to merge all the objects or arrays into one? Or does each file contain multiple JSON objects, one per line? Or what? Maybe you could post one of the files here.
posted by nicwolff at 3:49 AM on June 3, 2016

Just a heads up that combining the JSON files into a single file that's technically valid JSON is probably not enough: whatever software reads these makes assumptions about how the JSON is laid out. That's why there's no standard tool for doing this -- nicwolff's question is getting at exactly that. If you're lucky, the files are a sequence of JSON objects and mce's solution will work. Otherwise some programming will be required.
posted by xris at 4:17 AM on June 3, 2016

I'm using the TripAdvisor dataset to do some comps for a business I'm working on. Ideally, I'd get all these files into an Excel sheet with the hotel number as its own column. That's pretty easy for me to drop though. What is the best way of doing analysis on this data? I only have Excel on my machine, but I could download a tool if it's relatively straightforward.

I am trying to get them into one file because I found a guy on Fiverr who can convert a file into an Excel sheet, but it's $5 a file and I'm cheap. I can do this manually -- I did a couple last night -- but unfortunately I really need the total numbers. Doing just a sample is insufficient.

Thank you for your help. I really appreciate it.
posted by emkelley at 4:53 AM on June 3, 2016

Didn't mention - if I don't need to get the fiverr guy to do it, great!

Also, I don't know Python or any code besides some basic HTML and CSS. If this requires a Python solution, I can handle that if someone can point me to a Python-for-dummies guide. Like the code that mce posted? I don't even know where to begin with that.

There are also text .dat files.

here's one of the raw files: dropbox.
posted by emkelley at 5:00 AM on June 3, 2016

So looking at this sample file - what data are you wanting to keep in a CSV?
posted by jferg at 5:42 AM on June 3, 2016

Try this:

It all runs in-browser, so nothing is sent to my server. You can save the .html file to your desktop and run it from there. You can select multiple input files in the file selector.

Since I wasn't sure what data you want to go into the CSV, I just gave some examples - you can save the HTML locally and edit it as you please, or send me memail with more details of what you want in the CSV. Adding columns to the CSV is easy.
posted by jferg at 7:00 AM on June 3, 2016

What would the ideal final data file that you want in Excel look like, what are the columns and rows?
You're not going to load 3 GB of data in Excel and have anyone be happy about it.
posted by demiurge at 10:34 AM on June 3, 2016

here's my attempt
tab separated output
The text output probably isn't what you want, but you can tweak the script to output the text however you like.
posted by andendau at 1:52 AM on June 4, 2016

You're probably not aiming to get all these files into one file at first. Each file is a hotel, with various data about the hotel and its ratings by users. Each row of the spreadsheet would ideally be some aggregation of that data. You could then use Excel to compare all the hotels against each other.
posted by andendau at 9:17 AM on June 4, 2016

This thread is closed to new comments.