How to take a PDF and format it into a novel?
December 12, 2018 12:16 AM

Through my job, I've been engaged to take a PDF of raw text and turn out a novel manuscript. No copyediting or proofreading involved, just taking these 200+ pages of text and formatting them into a novel. I'm not even 100% sure what this entails except that it doesn't involve any graphic design. Does anybody have any advice about this? I've never done this before so I'm not sure how else to do it besides googling "Google Docs Novel Template." I'm obviously extremely clueless but any help/advice/material hints appreciated!
posted by anonymous to Writing & Language (12 answers total) 4 users marked this as a favorite
 
I think we need more information.

What is the output supposed to be: an ebook, a printed book, a PDF?

Do you have, or can you get, the source text in any format other than PDF?

Are there chapters? A table of contents? Is the text already divided into whatever logical units it's supposed to have, meaning what you need to do is figure out a layout for them, or are you also meant to create the units?
posted by trig at 12:53 AM on December 12, 2018 [2 favorites]


If you have a text layer in the PDF and logical divisions are all clear, you can turn this into a nice-looking novel with LaTeX fairly easily. It will do all the text layout for you, and you can tweak some parts by hand if you want to.

It might not be worth learning LaTeX if you only need to do this once ever, but it’s a good tool to have if you expect to need to produce high quality text layout in the future.
posted by SaltySalticid at 1:46 AM on December 12, 2018 [1 favorite]


Vellum is an app specifically for formatting ebooks. Plenty of templates, styles, and different sorts of one-click things it can build. I don’t know if it takes PDFs as an input, but if it doesn’t and you can figure out how to get the PDF into another format (“text layer” sounds encouraging?) Vellum will likely be able to do the rest.
posted by schadenfrau at 1:59 AM on December 12, 2018 [2 favorites]


Do you know who the publisher is? That will change how you format and supply files.
posted by b33j at 3:28 AM on December 12, 2018 [1 favorite]


I have never used LaTeX or Vellum but I am a graphic designer, and in my experience when you extract text from a PDF, it needs copy editing and proofreading. There will be carriage returns, spaces, and tabs where they aren’t supposed to be, some text may be missing or out of order, some characters might be substituted with dingbats. PDF is for presenting a page in a specific visual format, not for containing a usable manuscript.
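
The mechanical part of that cleanup (stray carriage returns, tabs, doubled spaces, lines broken mid-sentence) can be scripted. Below is a minimal Python sketch, not a definitive tool. It assumes paragraphs in the extracted text are separated by at least one blank line, which is not true of every PDF, and the function name `clean_extracted_text` is made up for illustration. It does nothing about missing text, reordered text, or substituted characters; those still need a human proofreader.

```python
import re


def clean_extracted_text(raw: str) -> str:
    """Tidy text pulled out of a PDF: unwrap hard line breaks, squeeze whitespace."""
    # Normalize Windows and old-Mac line endings to plain newlines.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words hyphenated across a line break ("manu-\nscript" -> "manuscript").
    # Caveat: this also joins genuine compounds ("well-\nknown" -> "wellknown").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Assumption: one or more blank lines mark a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for para in paragraphs:
        # Collapse line breaks, tabs, and runs of spaces into single spaces.
        para = re.sub(r"\s+", " ", para).strip()
        if para:
            cleaned.append(para)
    return "\n\n".join(cleaned)
```

Because of the hyphenation caveat in the comments, it is worth searching the result for words that look run together before trusting it.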

Do you have access to the Word file (or other word processing file) that the PDF was originally exported from? You’d get much better results with that.

It is my prediction that anyone who attempts to read a 200 page novel that has not been copy edited or proofread will throw it aside as unreadable before they get five pages in.
posted by ejs at 4:56 AM on December 12, 2018 [11 favorites]


Try Overleaf, a web-based TeX editor. Create an account, choose "create first project," select book, pick one of the templates, and take a look at how it works. The tufte-book example template has example text that explains how to use it.
posted by bdc34 at 6:15 AM on December 12, 2018 [1 favorite]


I've had to do this for a couple of books that I have written or edited. It's been surprising to me that publishers would just ship this work back to authors rather than have the editing done on their side, but I suppose this saves them labor costs and authors aren't in much of a position to refuse.

First off, there absolutely have to be editorial guidelines from the publisher. If you don't even know whether they want you to follow one specific set of style guidelines, e.g. the Chicago Manual of Style, or another, you can't even start. On top of that, they should want specific fonts, margins, and a million other tiny little details that you barely think about when reading a book. You absolutely have to make sure you have that from the publisher, or you're going to be doing this editing part multiple times. Go back and ask; either you missed a link about this in a previous email, or someone screwed up and didn't send it to you.

Second, as others have noted, there are extraction tools you can use to convert a PDF to text (or .docx, or many other formats). You can use one of those, edit the manuscript in something like Word, then convert it to PDF when you're done. (Others have mentioned LaTeX and other more advanced or specialized apps for editing, but I'm guessing by what you've said that you won't be familiar with those.) Some of those converters are free online (e.g. PDF to Word), or you can export as a Word document in Adobe Acrobat, or you can write a script to do the conversion. (Google can give you instructions for the last one, but it may be more than you want to get involved with.)
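
For the script option, here is a rough Python sketch of what such a conversion could look like. It assumes the third-party pypdf package is installed (`pip install pypdf`) and that the PDF has a real text layer. The function names `join_pages` and `pdf_to_text` are made up for this example, and the output is plain text that still needs cleanup and proofreading.

```python
from pathlib import Path


def join_pages(page_texts):
    """Join per-page text with a blank line between pages, skipping empty pages."""
    return "\n\n".join(t.strip() for t in page_texts if t and t.strip())


def pdf_to_text(pdf_path, txt_path):
    """Write the text layer of pdf_path to txt_path and return the page count."""
    # Assumption: pypdf is installed; it is imported here so the helper
    # above can be used even without it.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() can return an empty string for pages with no text layer.
    page_texts = [page.extract_text() or "" for page in reader.pages]
    Path(txt_path).write_text(join_pages(page_texts), encoding="utf-8")
    return len(page_texts)
```

Calling `pdf_to_text("novel.pdf", "novel.txt")` (both file names are placeholders) would leave a text file you can open in Word or any editor.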

Third, a number of people here have mentioned that extracting text from a PDF will leave you with a crazy-looking document, full of weird line breaks and more. That's true. However, I wouldn't let that worry you, or feel like it's put you in a worse position. As an editor, you're going to have to look at every single word, letter, comma, dash, non-breaking space, and more multiple times anyway. If anything, a crazy document that requires a much more intensive review will tend to make it more likely that you'll catch things than a low-stress one that you can skim through. I won't say it makes your job easier, but it is likely to lead to you doing a better job.

Fourth, since this is a novel, it's probably less hassle than other kinds of books. I've had to do academic books with all kinds of quotes, bibliographies, charts, diagrams, formal logical stuff, etc. and that just adds whole new layers of difficulty. A novel should have much less of that, unless you've been handed some insane avant-garde Mark Danielewski thing.

Good luck!
posted by el_lupino at 6:25 AM on December 12, 2018 [5 favorites]


Yeah seconding the need to proof the thing after it’s been extracted. Grammarly doesn’t catch everything but it does catch obviously weird stuff. The combo of extracted Word file + Grammarly + Vellum (which allows you to preview) is probably the easiest and most accurate option. I would still have other people proofread the resulting ebook for you (multiple people, and they will probably find different errors), but also know that I’ve never read a professionally produced ebook without at least a couple of errors. I don’t know what it is, but even with all the people at publishing houses, they get through. So it’s like with contaminant particles in food — don’t let through more than the percentage allowed by the FDA and you’re probably good.
posted by schadenfrau at 6:28 AM on December 12, 2018 [1 favorite]


Completely sideways viewpoint here, but this sounds to me like a job for Amazon Mechanical Turk. Full disclosure: I am a worker on Mechanical Turk; however, I am not your worker, and would not work on the task I am about to describe.

There are tasks on Mechanical Turk every day involving transcribing text from a PDF into another document: a Google Doc, a spreadsheet, a text file. I don't work on them because I find them tedious, monotonous, and mind-numbing, but there are people who absolutely love this type of work.

Were I in your shoes, I would visit the Reddit forum, turkkit, and ask the good folks there how they would go about setting up such a task. What would they want you to pay for it? What would they recommend as a length of PDF file to convert to text for any one single task? How would they recommend you set up your quality control? Do you need one task for workers to copy the PDF into the doc, and a second to verify that the transcription has been faithfully executed? Basically, throw yourself on the mercy of the workers, and they can tell you exactly what you should be doing so that they will work on your task.

Then, if it makes financial sense for you, do whatever it is they say. The workers in the Reddit forum are some of the most conscientious and dedicated workers I have ever come across. It's the only forum I ever visit for Mechanical Turk related questions, issues, and support, and they won't steer you wrong.

(Please forgive weirdness in capitalization and punctuation. I am using the Google Docs talk-to-text function, and haven't quite figured out yet what the heck I'm doing, or how to best repair these issues without doing a whole lot of stuff with my hands, which would defeat the purpose of not doing things with my hands.)

posted by The Almighty Mommy Goddess at 7:19 AM on December 12, 2018 [1 favorite]


I keyed in on the term "novel manuscript" in your question, which implies that they're not looking for an e-book or anything complicated like that. I think what you need to do is convert the PDF to a Word file with just basic formatting for the title page, table of contents, chapter titles and whatever other elements the text may have, which mercifully for a novel should be few and far between.
To do this most efficiently, use a PDF to Word converter such as the one el_lupino suggests. Don't export to Word directly from the PDF file itself. Using a converter will eliminate problems like hard line breaks between each line. There are two types of PDFs: "live text" PDFs that are created directly from some other digital format and "dead image" PDFs that are produced by a scanner. If you are converting from an image PDF you'll definitely want to at least spell check the resulting Word document because this requires an additional step of OCR (optical character recognition), which sometimes misreads characters.
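
If you aren't sure which of the two kinds of PDF you have, one rough check is to see whether any text comes out at all. The Python sketch below is only a heuristic. It assumes you already have the extracted text of each page as a list of strings (from whatever extraction tool you use), and both the function name `looks_scanned` and the threshold of 20 characters per page are arbitrary choices for illustration. A scan that already went through OCR will have a text layer and will not be flagged.

```python
def looks_scanned(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is image-only, given the text extracted from each page."""
    if not page_texts:
        # Nothing was extracted at all, so treat it as image-only.
        return True
    # Count non-whitespace characters across all pages.
    total = sum(len("".join((t or "").split())) for t in page_texts)
    # Assumption: a page of a novel with a real text layer yields far more
    # than a couple dozen characters, so a tiny average suggests a scan.
    return total / len(page_texts) < min_chars_per_page
```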

Then select all the text and set it to a very basic format: 12 pt Times, all full justified, double spaced, no extra line spacing, etc. Go through the document and eliminate hard page breaks and format any special text like chapter titles how you'd like. And there you are: a novel in manuscript format. If you are aware of any more specific publisher manuscript submission formatting guidelines, obviously follow them.
posted by drlith at 7:49 AM on December 12, 2018 [2 favorites]


There will be carriage returns, spaces, and tabs where they aren’t supposed to be

That’s one of the beauties of LaTeX, literally none of that will matter, at all. Raw text is compiled into a suitable output format, and this parsing generally ignores tabs, extra whitespace, and cr/newlines when laying out text; it just works to make things look nice.

LaTeX has been a publishing standard for decades for a reason: it’s the best text layout processing tool out there, it’s insanely stable, customizable, and it’s free.

Keep content and layout separate!
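
As a rough illustration of the LaTeX route, here is a minimal Python sketch that wraps plain text in a bare-bones LaTeX `book` document. It makes several assumptions: paragraphs are separated by blank lines, chapter headings sit on their own line and start with the word "Chapter", and the function names `escape_latex` and `text_to_latex` are made up for this example. Two things LaTeX does not ignore are handled explicitly: blank lines (they start a new paragraph) and special characters such as `&`, `%`, and `$` (they must be escaped).

```python
import re

# Characters with special meaning in LaTeX and their literal replacements.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#", "_": r"\_",
    "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}

# Assumption: a heading is a single line such as "Chapter One" or "CHAPTER 12".
CHAPTER_RE = re.compile(r"^chapter\s+\S+.*$", re.IGNORECASE)


def escape_latex(text):
    """Escape LaTeX special characters in one pass, so replacements are not re-escaped."""
    return re.sub(r"[\\&%$#_{}~^]", lambda m: LATEX_SPECIALS[m.group(0)], text)


def text_to_latex(text, title, author):
    """Return LaTeX source for a simple book built from blank-line-separated paragraphs."""
    lines = [
        r"\documentclass[12pt]{book}",
        r"\title{" + escape_latex(title) + "}",
        r"\author{" + escape_latex(author) + "}",
        r"\begin{document}",
        r"\maketitle",
        "",
    ]
    for para in re.split(r"\n\s*\n", text.strip()):
        para = para.strip()
        if not para:
            continue
        if "\n" not in para and CHAPTER_RE.match(para):
            # Starred form: the heading text already says "Chapter ...",
            # so LaTeX's automatic chapter numbering is suppressed.
            lines.append(r"\chapter*{" + escape_latex(para) + "}")
        else:
            lines.append(escape_latex(para))
        # A blank line in the source is what tells LaTeX to start a new paragraph.
        lines.append("")
    lines.append(r"\end{document}")
    return "\n".join(lines) + "\n"
```

The returned string can be saved as a `.tex` file and compiled with a LaTeX installation or pasted into an online editor. Fonts, margins, and trim size would then be set in the preamble, leaving the text itself untouched.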
posted by SaltySalticid at 10:19 AM on December 12, 2018


When I used to "build" textbooks we would never have accepted a PDF or text file of book contents. In order to create a book, we needed a Word document. The author had to create the PDF from somewhere - probably a Word doc. Are you in touch with the author?
posted by bendy at 12:19 AM on December 13, 2018

