Can I streamline this tedious archival work with Python?
November 1, 2023 6:13 AM Subscribe
I have to zip a group of files from a series of folders that also contain files of another type - can I automate this with Python?
I am working on a newspaper archiving project that requires me to create a .zip file of a group of TIFFs of the pages of each issue. There are ~100 issues total that I need to do this for, commingled with file sets for ~500 other issues.
Each of these groups of TIFFs live in a separate folder for each issue that also contain a PDF of the entire issue. The latter is not needed for this part of the project. All of these folders are currently stored on a single external hard drive, and both the folder and file names contain the date for each issue in a consistent format.
Is it possible to create a Python script that will:
1. Create .zip files of the TIFFs for a specific date range or group of dates (I could manually enter these into the code, if needed).
2. Do this while excluding the .pdf file of the full issue in each of these folders?
3. Deliver the output to a specific location specified with a filepath?
I am slowly teaching myself Python, but am a rank n00b, so would greatly appreciate hivemind insight. Figuring this out could potentially save me a lot of time in the future.
I am working on a newspaper archiving project that requires me to create a .zip file of a group of TIFFs of the pages of each issue. There are ~100 issues total that I need to do this for, commingled with file sets for ~500 other issues.
Each of these groups of TIFFs live in a separate folder for each issue that also contain a PDF of the entire issue. The latter is not needed for this part of the project. All of these folders are currently stored on a single external hard drive, and both the folder and file names contain the date for each issue in a consistent format.
Is it possible to create a Python script that will:
1. Create .zip files of the TIFFs for a specific date range or group of dates (I could manually enter these into the code, if needed).
2. Do this while excluding the .pdf file of the full issue in each of these folders?
3. Deliver the output to a specific location specified with a filepath?
I am slowly teaching myself Python, but am a rank n00b, so would greatly appreciate hivemind insight. Figuring this out could potentially save me a lot of time in the future.
one way to do this would be to create an alternate folder and file structure that you can test this on (in case something gets messed up), and then use chatgpt to actually write the code. run it on the test location, see if it works as intended, if not, talk to chatgpt about what you need to revise, test again, and then run it on your target files.
posted by entropone at 6:28 AM on November 1, 2023 [1 favorite]
posted by entropone at 6:28 AM on November 1, 2023 [1 favorite]
Definitely second using ChatGPT. If you share some examples of folder and file names, I'd be happy to run it through GPT4 for you and see what code it shares.
posted by many more sunsets at 6:33 AM on November 1, 2023 [1 favorite]
posted by many more sunsets at 6:33 AM on November 1, 2023 [1 favorite]
While it is not python, for which I apologize, this is the sort of job “find -exec” is purpose-built for, letting you sort by file types and date ranges at your leisure and then performing some action on the files it finds like adding them to a zip file.
It’s certainly possible to write a python program that does this, but a shell script would likely be a simpler and much more reliable approach.
posted by mhoye at 6:36 AM on November 1, 2023 [1 favorite]
It’s certainly possible to write a python program that does this, but a shell script would likely be a simpler and much more reliable approach.
posted by mhoye at 6:36 AM on November 1, 2023 [1 favorite]
Best answer: I recently solved a similar file organization project using DropIt. I think the task you're describing would require two or three rules, but you might still find that easier than writing scripts from scratch.
posted by foursentences at 8:14 AM on November 1, 2023 [1 favorite]
posted by foursentences at 8:14 AM on November 1, 2023 [1 favorite]
Response by poster: If you share some examples of folder and file names, I'd be happy to run it through GPT4 for you and see what code it shares.
Thank you! Here are examples of 1) folder and 2) individual TIFF file names:
1. dcpl_blade_2003-03-28
2. dcpl_blade_2003-03-28_0000001.tif
There are typically 30-40 individual TIFFs making up an issue.
posted by ryanshepard at 8:16 AM on November 1, 2023
Thank you! Here are examples of 1) folder and 2) individual TIFF file names:
1. dcpl_blade_2003-03-28
2. dcpl_blade_2003-03-28_0000001.tif
There are typically 30-40 individual TIFFs making up an issue.
posted by ryanshepard at 8:16 AM on November 1, 2023
Best answer: This is the code I got after a bit of nudging and prodding. (transcript here.) I haven't tested it, but reading over it, looks right to me as long as all the folders are named properly.
posted by many more sunsets at 8:59 AM on November 1, 2023 [1 favorite]
import os import shutil import re def extract_date(foldername): match = re.search(r'\d{4}-\d{2}-\d{2}', foldername) if match: return match.group(0) def zip_tiffs(start_date, end_date, src_directory, dest_directory, dest_zip_name): all_tiffs = [] for foldername, _, filenames in os.walk(src_directory): issue_date = extract_date(foldername) if issue_date and start_date <= issue_date <= end_date: tiff_files = [os.path.join(foldername, f) for f in filenames if f.endswith('.tif')] all_tiffs.extend(tiff_files) if all_tiffs: zipf = shutil.make_archive(os.path.join(dest_directory, dest_zip_name), 'zip', src_directory, all_tiffs) # Use the function like this: zip_tiffs('2022-01-01', '2022-12-31', '/path/to/src', '/path/to/dest', 'archive_name')
posted by many more sunsets at 8:59 AM on November 1, 2023 [1 favorite]
Best answer: As a note, this assumes that the .zip file is going to be created for the entire date range. If you just want a zip file per issue, this is what you want:
posted by many more sunsets at 9:01 AM on November 1, 2023 [1 favorite]
import os import shutil import re def extract_date(foldername): match = re.search(r'\d{4}-\d{2}-\d{2}', foldername) if match: return match.group(0) def zip_tiffs(start_date, end_date, src_directory, dest_directory): for foldername, _, filenames in os.walk(src_directory): issue_date = extract_date(foldername) if issue_date and start_date <= issue_date <= end_date: tiff_files = [os.path.join(foldername, f) for f in filenames if f.endswith('.tif')] if tiff_files: zipf = shutil.make_archive(os.path.join(dest_directory, foldername.split('/')[-1]), 'zip', foldername, tiff_files) # Use the function like this: zip_tiffs('2022-01-01', '2022-12-31', '/path/to/src', '/path/to/dest')
posted by many more sunsets at 9:01 AM on November 1, 2023 [1 favorite]
This thread is closed to new comments.
posted by Alterscape at 6:26 AM on November 1, 2023