Can I streamline this tedious archival work with Python?
November 1, 2023 6:13 AM

I have to zip a group of files from a series of folders that also contain files of another type - can I automate this with Python?

I am working on a newspaper archiving project that requires me to create a .zip file of a group of TIFFs of the pages of each issue. There are ~100 issues total that I need to do this for, commingled with file sets for ~500 other issues.

Each of these groups of TIFFs lives in a separate folder for its issue, and each folder also contains a PDF of the entire issue. The PDF is not needed for this part of the project. All of these folders are currently stored on a single external hard drive, and both the folder and file names contain the date for each issue in a consistent format.

Is it possible to create a Python script that will:

1. Create .zip files of the TIFFs for a specific date range or group of dates (I could manually enter these into the code, if needed).

2. Do this while excluding the .pdf file of the full issue in each of these folders?

3. Deliver the output to a specific location specified with a filepath?

I am slowly teaching myself Python, but am a rank n00b, so would greatly appreciate hivemind insight. Figuring this out could potentially save me a lot of time in the future.
posted by ryanshepard to Computers & Internet (8 answers total) 3 users marked this as a favorite
 
It is entirely possible. You can use the pathlib built-in library + string manipulation (or regular expressions, if the naming scheme for folders is really janky) to identify what folders you need, pathlib to iterate the contents of each folder and build a list of paths to tif files only, and the zipfile built-in library to create an archive from that list.
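
A minimal sketch of that approach, untested against the real archive. It assumes every issue folder sits directly under one source directory and that each folder name contains a date written as YYYY-MM-DD; the function names and paths are hypothetical.

```python
import re
import zipfile
from pathlib import Path

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")  # assumed date format in folder names


def find_issue_folders(src_root, start, end):
    """Return folders directly under src_root whose name holds a date in [start, end]."""
    folders = []
    for folder in sorted(Path(src_root).iterdir()):
        match = DATE_RE.search(folder.name)
        # YYYY-MM-DD dates sort correctly when compared as plain strings.
        if folder.is_dir() and match and start <= match.group(0) <= end:
            folders.append(folder)
    return folders


def zip_issue_tiffs(folder, dest_root):
    """Write <folder name>.zip into dest_root, containing only the folder's TIFFs."""
    tiffs = sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in {".tif", ".tiff"}
    )
    if not tiffs:
        return None  # nothing to archive; the PDF is never picked up
    dest_root = Path(dest_root)
    dest_root.mkdir(parents=True, exist_ok=True)
    zip_path = dest_root / f"{folder.name}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for tiff in tiffs:
            # arcname keeps only the file name, not the full drive path.
            zf.write(tiff, arcname=tiff.name)
    return zip_path


# Example use (placeholder paths):
# for folder in find_issue_folders("/path/to/src", "2003-01-01", "2003-12-31"):
#     zip_issue_tiffs(folder, "/path/to/dest")
```

Filtering on the file suffix is what leaves the PDFs out, and nothing in the source folders is modified.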
posted by Alterscape at 6:26 AM on November 1, 2023


one way to do this would be to create an alternate folder and file structure that you can test this on (in case something gets messed up), and then use chatgpt to actually write the code. run it on the test location, see if it works as intended, if not, talk to chatgpt about what you need to revise, test again, and then run it on your target files.
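
a rough sketch of building that throwaway test structure, so nothing touches the real drive. the `issue` prefix, the page count, and the function name are made up for illustration; swap in whatever matches your real naming scheme.

```python
from pathlib import Path


def make_test_tree(root, dates, pages=3, prefix="issue"):
    """Create fake issue folders, each with a few placeholder TIFFs and one PDF."""
    root = Path(root)
    for date in dates:
        folder = root / f"{prefix}_{date}"
        folder.mkdir(parents=True, exist_ok=True)
        for n in range(1, pages + 1):
            # Placeholder bytes only; these are not valid images.
            (folder / f"{prefix}_{date}_{n:07d}.tif").write_bytes(b"placeholder")
        (folder / f"{prefix}_{date}.pdf").write_bytes(b"placeholder")
    return root


# Example use (placeholder path):
# make_test_tree("/path/to/test_src", ["2003-03-28", "2003-04-04"])
```

point the zipping script at this folder first, open the resulting .zip files, and check that only the .tif files made it in.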
posted by entropone at 6:28 AM on November 1, 2023 [1 favorite]


Definitely second using ChatGPT. If you share some examples of folder and file names, I'd be happy to run it through GPT4 for you and see what code it shares.
posted by many more sunsets at 6:33 AM on November 1, 2023 [1 favorite]


While it is not python, for which I apologize, this is the sort of job “find -exec” is purpose-built for, letting you sort by file types and date ranges at your leisure and then performing some action on the files it finds like adding them to a zip file.

It’s certainly possible to write a python program that does this, but a shell script would likely be a simpler and much more reliable approach.
posted by mhoye at 6:36 AM on November 1, 2023 [1 favorite]


Best answer: I recently solved a similar file organization project using DropIt. I think the task you're describing would require two or three rules, but you might still find that easier than writing scripts from scratch.
posted by foursentences at 8:14 AM on November 1, 2023 [1 favorite]


Response by poster: "If you share some examples of folder and file names, I'd be happy to run it through GPT4 for you and see what code it shares."

Thank you! Here are examples of 1) folder and 2) individual TIFF file names:

1. dcpl_blade_2003-03-28
2. dcpl_blade_2003-03-28_0000001.tif

There are typically 30-40 individual TIFFs making up an issue.
posted by ryanshepard at 8:16 AM on November 1, 2023


Best answer: This is the code I got after a bit of nudging and prodding. (transcript here.) I haven't tested it, but reading over it, looks right to me as long as all the folders are named properly.
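import os
import re
import zipfile

def extract_date(foldername):
    # Pull a YYYY-MM-DD date out of the folder's own name; returns None if there is none.
    match = re.search(r'\d{4}-\d{2}-\d{2}', os.path.basename(foldername))
    if match:
        return match.group(0)

def zip_tiffs(start_date, end_date, src_directory, dest_directory, dest_zip_name):
    all_tiffs = []
    for foldername, _, filenames in os.walk(src_directory):
        issue_date = extract_date(foldername)
        # YYYY-MM-DD dates compare correctly as plain strings.
        if issue_date and start_date <= issue_date <= end_date:
            tiff_files = [os.path.join(foldername, f) for f in filenames if f.lower().endswith('.tif')]
            all_tiffs.extend(tiff_files)

    if all_tiffs:
        # shutil.make_archive cannot take a list of files, so build the zip with zipfile.
        zip_path = os.path.join(dest_directory, dest_zip_name + '.zip')
        with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
            for path in all_tiffs:
                # Store paths relative to the source so each issue keeps its own folder.
                zipf.write(path, arcname=os.path.relpath(path, src_directory))

# Use the function like this:
zip_tiffs('2022-01-01', '2022-12-31', '/path/to/src', '/path/to/dest', 'archive_name')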

posted by many more sunsets at 8:59 AM on November 1, 2023 [1 favorite]


Best answer: As a note, the code in my previous comment assumes that a single .zip file is going to be created for the entire date range. If you just want a zip file per issue, this is what you want:
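import os
import re
import zipfile

def extract_date(foldername):
    # Pull a YYYY-MM-DD date out of the folder's own name; returns None if there is none.
    match = re.search(r'\d{4}-\d{2}-\d{2}', os.path.basename(foldername))
    if match:
        return match.group(0)

def zip_tiffs(start_date, end_date, src_directory, dest_directory):
    for foldername, _, filenames in os.walk(src_directory):
        issue_date = extract_date(foldername)
        # YYYY-MM-DD dates compare correctly as plain strings.
        if issue_date and start_date <= issue_date <= end_date:
            tiff_files = [f for f in filenames if f.lower().endswith('.tif')]
            if tiff_files:
                # One zip per issue, named after the issue folder. shutil.make_archive
                # cannot take a list of files, so build the zip with zipfile.
                zip_path = os.path.join(dest_directory, os.path.basename(foldername) + '.zip')
                with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
                    for f in tiff_files:
                        zipf.write(os.path.join(foldername, f), arcname=f)

# Use the function like this:
zip_tiffs('2022-01-01', '2022-12-31', '/path/to/src', '/path/to/dest')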
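
If you need a hand-picked group of dates instead of a continuous range, a variation along these lines might work. This is a sketch I haven't run against real data; `zip_tiffs_for_dates` is a made-up name, and it assumes the same YYYY-MM-DD date in each folder name.

```python
import os
import re
import zipfile


def zip_tiffs_for_dates(dates, src_directory, dest_directory):
    """Create one zip per issue folder whose name contains a date listed in `dates`."""
    wanted = set(dates)
    os.makedirs(dest_directory, exist_ok=True)
    created = []
    for foldername, _, filenames in os.walk(src_directory):
        match = re.search(r'\d{4}-\d{2}-\d{2}', os.path.basename(foldername))
        if not match or match.group(0) not in wanted:
            continue  # not one of the requested issues
        tiff_files = sorted(f for f in filenames if f.lower().endswith(('.tif', '.tiff')))
        if not tiff_files:
            continue
        zip_path = os.path.join(dest_directory, os.path.basename(foldername) + '.zip')
        with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
            for f in tiff_files:
                # Only TIFFs are written, so the PDF is left out.
                zipf.write(os.path.join(foldername, f), arcname=f)
        created.append(zip_path)
    return created


# Use the function like this:
# zip_tiffs_for_dates(['2003-03-28', '2003-04-11'], '/path/to/src', '/path/to/dest')
```

The function returns the list of zip files it wrote, which makes it easy to check that you got one per requested date.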

posted by many more sunsets at 9:01 AM on November 1, 2023 [1 favorite]

