Recommend a batch tool for converting 1000s of .TXT files to .PDF
April 16, 2020 8:13 AM
I have a client who is a lawyer who received thousands of individual .txt files from his opposing counsel and wants them combined into a smaller set of PDF documents so he can review them more manageably. Is there a batch tool or other automation for Windows 10 you would recommend that can handle the large number of files? He does have Adobe Acrobat installed.
If you want to nerd out, Pandoc is made for this. (Well, I should say that it's a big hammer for this job, but will work.) There are some GUI wrappers too but I don't have any experience with them.
posted by ftm at 8:27 AM on April 16, 2020 [5 favorites]
ImageMagick and a script will do the job. I can provide a sample batch file if you'd like.
posted by jmfitch at 9:30 AM on April 16, 2020 [1 favorite]
Response by poster: jmfitch - check your MeMail for my contact info, thanks!
There are approximately 90,000 files (not an exaggeration), so I really need something more automated than drag-and-drop.
posted by briank at 9:58 AM on April 16, 2020 [1 favorite]
He should look into a tree-based organizer like MyNotesKeeper or Treepad. Converting that many text files to PDF does not help much with the need to organize them and review them.
posted by megatherium at 10:47 AM on April 16, 2020
Goldfynch is an online service that might be an excellent candidate for this task. Make an account and upload your data, and it can take care of the rest. You can also then use it for search and discovery workflow.
*disclaimer: I know the developers of Goldfynch, and the terrible data format challenges they describe make me crawl back to my simple world of geospatial data.
posted by hobu at 10:49 AM on April 16, 2020
Edit: I rushed my response. I use LibreOffice and PDFtk Server to do this for documents (ImageMagick is for images derp). Here's sample batch code ('T:\' is just the source dir):
FOR /r "T:\" %%F IN (*.txt) DO "C:\Program Files\LibreOffice\program\soffice.exe" --headless --convert-to pdf:writer_pdf_Export --outdir "%%~dpF." "%%F"
FOR /f "tokens=*" %%G IN ('dir /b /s /r /a:d "T:\"') DO pdftk "%%G\*.pdf" cat output "%%G.pdf"
This will simply roll up any documents in a subdirectory into a single PDF with that directory's name. Give it a rip and feel free to message with questions!
posted by jmfitch at 11:03 AM on April 16, 2020 [2 favorites]
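The batch script above launches LibreOffice once per file. As a rough sketch of one way to cut down on launches, the Python below hands soffice several files per invocation, grouped by folder so each PDF still lands next to its source. The function names and the `batch_size` default are illustrative assumptions, not anything from the thread; the soffice path and flags are copied from the script above. Keep the batch size modest so the command line stays under the Windows length limit.

```python
from collections import defaultdict
from pathlib import Path

# Install path copied from the batch script above; adjust for your machine.
SOFFICE = r"C:\Program Files\LibreOffice\program\soffice.exe"


def group_by_directory(txt_files):
    """Map each parent directory to the sorted .txt files it contains."""
    groups = defaultdict(list)
    for path in txt_files:
        path = Path(path)
        groups[path.parent].append(path)
    return {folder: sorted(files) for folder, files in groups.items()}


def convert_commands(txt_files, batch_size=100):
    """Build soffice command lines that convert several files per launch.

    batch_size is an assumed value chosen to keep each command line short.
    """
    commands = []
    for folder, files in group_by_directory(txt_files).items():
        for start in range(0, len(files), batch_size):
            batch = files[start:start + batch_size]
            commands.append(
                [SOFFICE, "--headless",
                 "--convert-to", "pdf:writer_pdf_Export",
                 "--outdir", str(folder)]
                + [str(f) for f in batch]
            )
    return commands


def run_all(source_dir):
    """Run every conversion command for the .txt files under source_dir."""
    import subprocess
    for command in convert_commands(Path(source_dir).rglob("*.txt")):
        subprocess.run(command, check=True)
```

Calling `run_all("T:/")` would walk the same source directory as the batch script. The PDFtk roll-up step is unchanged and can still be run afterwards.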
I agree with megatherium: I'd want to keep them as .txt and use Notepad++ or similar on the directory. You can search through all the files in the directory, and it's easier to browse.
If you do want to concatenate them, you can do it without any additional software in Windows, using the command prompt and notepad. Copy all the txt files (or a subset that you want to group) into a folder, say c:\temp\text.
Then at a command prompt, type this:
copy c:\temp\text\*.txt c:\temp\concatenatedfile1.txt
This will merge them all into a file called concatenatedfile1.txt. Then you can open that in Notepad, Word, etc. and print to PDF (built in to Windows) as you wish. Repeat for the next group, changing the 1 to 2 so it doesn't overwrite the first file.
concatenatedfile1.txt does need to either be in a different directory or use a different extension; otherwise it will try to append itself to itself.
posted by Boobus Tuber at 12:15 PM on April 16, 2020 [1 favorite]
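For anyone who would rather script the concatenation than type the copy command per group, here is a minimal Python sketch of the same idea. The function name `concatenate_txt` and the UTF-8 default are assumptions for illustration; the real files may use a different encoding. It skips the output file explicitly, so the append-to-itself problem mentioned above does not arise even if the output lives in the source folder.

```python
from pathlib import Path


def concatenate_txt(source_dir, output_file, encoding="utf-8"):
    """Write every .txt file in source_dir into output_file, in name order.

    Returns the number of files that were merged.
    """
    source_dir = Path(source_dir)
    output_file = Path(output_file).resolve()
    count = 0
    with output_file.open("w", encoding=encoding) as out:
        for path in sorted(source_dir.glob("*.txt")):
            if path.resolve() == output_file:
                continue  # never append the output file to itself
            text = path.read_text(encoding=encoding, errors="replace")
            out.write(text)
            if not text.endswith("\n"):
                out.write("\n")  # keep files from running together
            count += 1
    return count
```

For example, `concatenate_txt(r"c:\temp\text", r"c:\temp\concatenatedfile1.txt")` mirrors the copy command above.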
Is there any particular naming convention for the files that could be used to batch them up into smaller groups? Starting up LibreOffice 90,000 times might take forever. And then that 90,000 filename long PDFtk command line...
If you could somehow group them up into 90 sets of 1000 files... There's probably a batch script that would do something like:
<h2>filename-1</h2>
text of filename-1
<h2>filename-2</h2>
text of filename-2
...
And combine the 1000 files into one HTML file, which you could then convert into PDF somehow and possibly even get autogenerated table-of-contents links to each individual file. That way you'd only have to do the convert-to-PDF step 90 times.
posted by zengargoyle at 12:21 PM on April 16, 2020 [1 favorite]
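The grouping idea above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the names `chunked`, `build_html`, and `write_bundles`, the `bundle-001.html` naming, and the UTF-8 encoding are all made up for the example. Each file's text is HTML-escaped and wrapped in `<pre>` so line breaks survive; converting the resulting HTML files to PDF is left to whatever tool is preferred.

```python
import html
from pathlib import Path


def chunked(items, size):
    """Yield consecutive slices of items, each at most size long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_html(txt_files, encoding="utf-8"):
    """Return one HTML page with an <h2> heading per file, then its text."""
    parts = ['<html><head><meta charset="utf-8"></head><body>']
    for path in txt_files:
        path = Path(path)
        text = path.read_text(encoding=encoding, errors="replace")
        parts.append(f"<h2>{html.escape(path.name)}</h2>")
        parts.append(f"<pre>{html.escape(text)}</pre>")
    parts.append("</body></html>")
    return "\n".join(parts)


def write_bundles(source_dir, out_dir, group_size=1000):
    """Write one HTML file per group of .txt files; return the paths."""
    files = sorted(Path(source_dir).rglob("*.txt"))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, group in enumerate(chunked(files, group_size), start=1):
        target = out_dir / f"bundle-{index:03d}.html"
        target.write_text(build_html(group), encoding="utf-8")
        written.append(target)
    return written
```

With 90,000 files and the default group size, this would produce 90 HTML files to convert.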
Good points. If you want to keep them just as text, Boobus Tuber has it nailed.
posted by jmfitch at 5:51 PM on April 16, 2020
This thread is closed to new comments.
posted by curious nu at 8:22 AM on April 16, 2020 [1 favorite]