Project Gutenberg nightmare, help!
June 9, 2011 1:45 AM

I've downloaded the Project Gutenberg CD, which contains 600 or so classic books on an ISO. I'm wanting to put those books on my Nook via Calibre, and need some help.

Calibre and the Nook can handle the .txt and .html files that are on the ISO, but the file naming system PG uses is really weird. You can only find the .txt and .html files through the included HTML index page. The individual books are, for some completely mind-boggling reason, not labeled with the book title and author. How do I at the very least name these books with title and author? Has anyone else run into this problem?
posted by zardoz to Computers & Internet (8 answers total)
 
Are the files on the CD only txt and html, or does it also include the epub versions?

It looks like PG includes metadata with its epub and mobi files, and to some extent with its html files. From my limited experiments: if you open calibre and import a PG epub file, calibre will list the correct author and title even though they're not included in the filename. If you import a PG html file, it will put both title and author in the title field and leave the author field empty.

Do you need to use calibre to put books on a nook anyway? If not, I'd try just uploading a book or two and seeing if the nook recognized their metadata. If that doesn't work, you can import them all to calibre. If the CD doesn't have epub files and if you're not set on specifically the 600 books included on that ISO, you can use this page (mentioned on your PG link) to download the epub versions of whichever books you like.

If you want exactly those books, they're not in a well-tagged format the nook can read natively, and you don't have a list of the book numbers to feed into the site above, you could write a script to download the epub versions of the books. If you don't know how, maybe someone on an ebook forum like mobileread could help.

Or maybe there's an easier way I'm overlooking :)
posted by trig at 3:43 AM on June 9, 2011


Response by poster: trig--yes, the files are all .txt, and some books have both .txt and .html. Once I can get them imported into Calibre, it's a simple click to convert them to .epub and then transfer them to the Nook.

I know I could just get an .epub of every individual book from the PG site, but it seems to me the whole point of the CD/DVD download is that you can get ALL the books at once without hunting for hours for each one. But the system they use...I can't understand what they were thinking.

For example, Paradise Lost has the cryptic name plrabn12.txt. Why not Paradise Lost by John Milton?

I should add I'm not a programmer in the least. I'll check with the MobileRead forum, thanks for the suggestion.
posted by zardoz at 6:09 AM on June 9, 2011


Gutenberg has its own file naming convention. It's not very useful, but it's historic. I'd stick with trig's epub suggestion unless you want years of fun farting about with file conversions.
posted by scruss at 6:41 AM on June 9, 2011


Gutenberg links to a catalogue of epub books, which may be useful.
posted by jeather at 6:53 AM on June 9, 2011


For example, Paradise Lost has the cryptic name plrabn12.txt. Why not Paradise Lost by John Milton?

Because that would require computers that can handle long filesystems. With 8.3 (i.e., DOS-style) filenames you can read those books on computers and other devices that predate the Windows era or don't implement long filenames, and PG are the types who would care about not arbitrarily excluding people like that.

Their FAQ talks a bit about file naming.
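
To make the 8.3 limit concrete, here is a rough Python sketch (not anything PG itself uses) that checks whether a name fits the DOS-style pattern. The names `EIGHT_DOT_THREE` and `fits_8_3` are made up for illustration, and the allowed character set is a simplifying assumption: real DOS accepted a few more punctuation characters.

```python
import re

# Classic DOS 8.3 shape: a 1-8 character base name, optionally followed by
# a dot and a 1-3 character extension.
# Assumption: only ASCII letters, digits, underscore and hyphen are allowed here.
EIGHT_DOT_THREE = re.compile(r"[A-Za-z0-9_-]{1,8}(\.[A-Za-z0-9_-]{1,3})?")

def fits_8_3(file_name):
    """Return True if file_name fits the simplified 8.3 pattern above."""
    return EIGHT_DOT_THREE.fullmatch(file_name) is not None

for name in ["plrabn12.txt", "Paradise Lost by John Milton.txt"]:
    print(name, fits_8_3(name))
```

The cryptic name passes and the friendly one does not, which is the whole trade-off.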
posted by mendel at 9:58 AM on June 9, 2011


Dammit.

For "computers that can handle long filesystems", please read "computers that can handle long filenames" or even "filesystems that can handle long filenames".
posted by mendel at 10:00 AM on June 9, 2011


It's not hard to write a script to rename them automatically... How do you want to handle multiple copies of the same book with different formatting? Handling special characters is also, uh, more interesting (a pain in the rear), although one solution is to just skip all texts not marked as English; if you don't read other languages, this is a pretty good option. Anyway, here's a quick Python script. If you wanted it to actually make changes, you would replace the "print" with "rename" (but because of the notes above, I do not recommend doing this). I have to go to work, so no more time to work on it right now, but maybe tonight. Oh, and this only handles the .txt files, because most of the .htm files actually do have embedded metadata and Calibre recognizes them just fine.
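from os import listdir, curdir, rename  # rename is unused until you swap it in for print
from re import search

for file_name in listdir(curdir):
    if not file_name.endswith(".txt"):
        continue
    title = author = None
    # Scan for the first "Title:" and "Author:" header lines, in either order.
    # Stopping at end of file avoids looping forever on files with no header.
    with open(file_name, errors="replace") as book:
        for line in book:
            title = title or search(r"^Title: (.*)", line)
            author = author or search(r"^Author: (.*)", line)
            if title and author:
                break
    if title and author:
        # Dry run: show the old name and the proposed new name.
        print(file_name, title.group(1).strip() + " by " + author.group(1).strip() + ".txt")
    else:
        print(file_name, "has no Title/Author header, skipped")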
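
For the two open problems mentioned above (special characters and multiple copies of the same book), one possible approach is sketched below. This is only an illustration under assumptions: `safe_name` and `taken` are hypothetical names, the set of replaced characters is the usual Windows-forbidden list, and duplicates simply get a numeric suffix rather than anything smarter.

```python
import re

def safe_name(title, author, taken):
    """Build 'Title by Author.txt', replacing risky characters and
    avoiding clashes with names already recorded in the set `taken`."""
    base = f"{title} by {author}"
    # Replace characters that are invalid in Windows filenames with "_",
    # then trim leading/trailing spaces and dots.
    base = re.sub(r'[\\/:*?"<>|]', "_", base).strip(" .")
    candidate = base + ".txt"
    counter = 2
    # Compare case-insensitively, since some filesystems ignore case.
    while candidate.lower() in taken:
        candidate = f"{base} ({counter}).txt"
        counter += 1
    taken.add(candidate.lower())
    return candidate

taken = set()
print(safe_name("Paradise Lost", "John Milton", taken))
print(safe_name("Paradise Lost", "John Milton", taken))
print(safe_name("What Is Man? And Other Essays", "Mark Twain", taken))
```

The second call returns a name ending in " (2).txt" instead of overwriting the first copy, and the "?" in the third title becomes "_". Non-English titles with accented characters pass through unchanged here, so whether they survive depends on the filesystem and the device.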

posted by anaelith at 10:35 AM on June 9, 2011


Yeah, the PG naming system really is annoying to work with.

Did you see that link in my post? I haven't tried it, but it looks like it'll let you download a bunch of epubs at once. Still tedious unless you have a list of the book numbers you want, but entering author names might be easier than manual downloads.


One more suggestion: I think you can find nicely renamed bundles of PG books via BitTorrent.
posted by trig at 2:34 PM on June 9, 2011

