Fastest way to list keywords for a book index?
I need help creating a word list for a book index. You can mark words in MS Word, and it will then create an index for you, but my publisher wants to create the index. From me, she just needs a list of the words that are going to be in the index.

Each chapter of my book is stored in a separate Word file, so I'm trying to figure out the quickest way to do this. I've thought about doing a find/replace and changing all spaces to hard returns. This would put each word on a separate line. Then I could sort them and copy/paste them into Excel, which will let me remove redundancies. Then I guess I will have to go through the list and remove all the unimportant words and different word forms (e.g. cat and cats).

Can anyone think of any techniques or applications (free? shareware?) that will speed this up?

I'm on a WinXP machine.
That is a terrible way to make an index, as it excludes multi-word entries, synonyms that don't actually appear on a page, and any way to reference a topic under a range of pages or using "see also."

What you're talking about making is a concordance, which you often see in Bibles, not an index.

The publisher should have someone on staff who can do a real index without being given a list of words. If not, your index is going to suck anyway, so you may as well just pick a few dozen representative terms by hand and send the publisher that -- it's a lot easier.
See this AskMe thread.
hey what did you write? novel? tech manual? what?
Response by poster: Kindall, suck or not, that's what I've been TOLD to do. And I've read other books from this publisher. Their indexes don't suck. I'll admit, their index-creation method is odd, but somehow it works out in the end.

Still, I am left with the problem of generating the lists in less than 10,000 years.

Xmutex. It's a computer book about a specific application. Not a manual. More like a "dummies" book, but for intermediate users.
If you have Adobe Acrobat Version 4, you can PDF the files with the underlying text, then run the Catalog program (part of the Acrobat package) and it will build an index for you. Very handy tool that might work for you.
Response by poster: Thanks for the help, rhapsodie and vito90. I looked at that thread and I'll look at Acrobat, but I suspect both of those will lead me to index-makers.

Just to clarify: my publisher won't accept an index. In other words, they don't want a list of terms and page numbers on which those terms appear. They just want a list of words and phrases that I think should appear in the index.

I was asking for tips that will help me with this task.
Here is a really primitive way you could do it: Dump the entire text into Word, then break the text into a list of words by doing a search and replace on spaces. Use the "Replace All" button to replace every space with a paragraph mark (^p; Word wants the lowercase p). Select the entire word list when you are done, then do an alphabetical sort. From that point you could just eyeball it to get rid of the multiples, or do more search-and-replaces, such as find "the^pthe^pthe^pthe^pthe^p" and replace with "the^p", running the process again and again until you get only one instance of "the." That should get you to a unique list of words, I think. It will still take a while, but at least it's a little more programmatic.

You are on your own for phrases though...
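For anyone who would rather script the split/sort/dedupe steps than run them through Word, here is a minimal Python sketch of the same idea. It assumes each chapter has first been saved as plain text, and the function name `unique_words` is made up for illustration. It also only handles plain English letters, so accented characters and terms containing digits would need a looser pattern.

```python
import re

def unique_words(text):
    """Return the distinct words in text, lowercased and sorted alphabetically."""
    # A "word" here is a run of letters, optionally joined by internal
    # apostrophes or hyphens (so "didn't" and "plug-in" stay whole).
    words = re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)
    # A set drops the duplicates; sorted() puts the result in alphabetical order.
    return sorted({w.lower() for w in words})

sample = "The cat sat on the mat. The cats didn't sit."
print(unique_words(sample))
# ['cat', 'cats', "didn't", 'mat', 'on', 'sat', 'sit', 'the']
```

Like the Word approach, this only finds single words, and "cat" and "cats" still come out as separate entries.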
Hey grumblebee.

Take the DOC file, remove all breaks, then sort the little bugger in Excel. Export that list as a TXT file.

Send that file to a friend who works in UNIX, and they can run the UNIQ command on it (it takes just a few seconds), and you'll have a list of unique terms that you can then scan for pronouns, prepositions, and the like.

If you don't have a friend that works on UNIX, I do... and for some pittance of remuneration I'd be happy to take care of the whole thing start to finish.

: )
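If no UNIX box is handy, the same result can be approximated in a few lines of Python. This is only a sketch: the stop list is a tiny made-up sample, the function name is hypothetical, and unlike a plain `sort | uniq` it folds everything to lowercase before comparing. (uniq only collapses adjacent duplicates, which is why the list has to be sorted first; a set sidesteps that.)

```python
# Hypothetical, deliberately tiny stop list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "he", "in", "it", "of", "on", "she", "the", "they", "to"}

def dedupe_and_filter(lines, stop_words=STOP_WORDS):
    """Dedupe one-word-per-line input, then drop stop words and blank lines."""
    # Strip whitespace and fold case so "Mask" and "mask" count as one term.
    cleaned = {line.strip().lower() for line in lines}
    # Discard empty lines and anything on the stop list, then sort.
    return sorted(w for w in cleaned if w and w not in stop_words)

lines = ["layer\n", "The\n", "mask\n", "layer\n", "of\n", "\n", "Mask\n"]
print(dedupe_and_filter(lines))
# ['layer', 'mask']
```

Because the function accepts any iterable of lines, an open text file exported from Excel could be passed in directly.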
And if you're on OS X, this little command line tutorial gives you everything you need (and an extra step — the last — I'd skip).
Let's try this again: this little command line tutorial.
In defense of grumblebee's publisher - this is standard practice for publishers when bringing out a technical work. I'm an editor, and when I work on a specialist technical book I often know dick-all about the subject matter. I need to be sure that I'm including the right concepts and terms.
Response by poster: In case anyone is interested, I think I found the easiest way to do it (and I feel dumb that it never occurred to me before).

1. Mark index entries in Word (even though you don't want to create an actual index). You can mark entries by highlighting them and pressing ALT+SHIFT+X (COMMAND+OPTION+SHIFT+X on a Mac).

2. Choose Insert > Reference > Index and have Word create an index.

3. Copy the index and paste it into a new document. (You HAVE to complete this step, or the following one won't work, because Word won't let you do find/replaces within one of its indexes, but a pasted index isn't an index, as far as Word is concerned.)

4. In the pasted-index document, choose Edit > Replace and in the Replace dialogue, click the MORE button.

5. Check the wildcards option (Word's weak version of Regular Expressions), and in the Find field, enter [0-9]. Enter nothing in the Replace field. Click the Replace all button.

6. Replace all commas with nothing.

This will get rid of all the commas (which Word places after index entries) and page number references, and you'll be left with a sorted word list. The only negative is that the first replace will also strip the digits out of any index entries that contain them. So while you're marking words and phrases, you should paste any entries with digits into a separate file.
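If the pasted index is saved as plain text, a script can strip just the trailing page references and leave digits and commas inside the entries alone. The sketch below is a guess at the layout, not something checked against every Word index style: it assumes each line looks like "entry, 12, 45-47", and the function names are invented for the example.

```python
import re

# Assumption: each pasted line is "entry, 12, 45-47", i.e. the entry text
# followed by comma-separated page numbers or page ranges at the end.
PAGE_REFS = re.compile(r"(?:,\s*\d+(?:\s*[-\u2013]\s*\d+)?)+\s*$")

def strip_page_refs(line):
    """Remove the trailing page-number list from one index line."""
    return PAGE_REFS.sub("", line).strip()

def index_to_terms(index_text):
    """Turn pasted index text into a list of bare entries, skipping blanks."""
    terms = (strip_page_refs(line) for line in index_text.splitlines())
    return [t for t in terms if t]

pasted = "Layers, 12, 45-47\nWindows 95, 3\n\nlayers, adjustment, 14"
print(index_to_terms(pasted))
# ['Layers', 'Windows 95', 'layers, adjustment']
```

Only the run of numbers at the end of a line is removed, so "Windows 95" keeps its digits and "layers, adjustment" keeps its comma. An index laid out with tabs or dot leaders before the page numbers would need a different pattern.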
