Text Mining for Lazies
January 22, 2010 2:32 PM

I'm looking for a simple, low-level text mining software.

I have about ten text-based documents in PDF and html. I'd like to tally the amount of times certain words or phrases appear in each document. The documents are each about 10 pages long. This is a fairly small project as far as text mining goes, I believe (I'm not looking at gobs and gobs of unwieldy data, we're talking about less than 100 pages total of pure text with about 75 total queries). Ideally, I wouldn't have to learn Python to do this, although that's on my plate for future projects.

Is there a program that I can use for this purpose, preferably one that is open source or inexpensive? I do have access to some high level statistical software programs (SAS, SPSS, and others) so expensive programs aren't off the table, especially if they're not terribly hard to use. Or is my best bet a brute force method (find and tally by hand) unless I want to learn Python for this particular project?
posted by k8lin to Computers & Internet (20 answers total) 2 users marked this as a favorite
 
Is it viable to convert your PDF to text? You could then use "grep" to do this.
posted by jkaczor at 2:36 PM on January 22, 2010


For text files one can use a program like bbedit which embeds this kind of functionality but usually does it one file at a time. This will still be much faster than tallying by hand.

Doing this for multiple files at once is where the command line shows its advantages; I would imagine the solution would involve find, grep, and sed.
posted by idiopath at 2:41 PM on January 22, 2010


The UNIX tool "grep" will only count instances per line. So if a word occurs more than once per line, your end result will be incorrect. It's easy enough to whip up a script to split input on whitespaces and count word frequencies, though.
posted by Blazecock Pileon at 2:43 PM on January 22, 2010


jkaczor: grep would help, but I think you would need some other utilities too, since grep can only give you the number of lines that match, not the total number of matches.
posted by idiopath at 2:43 PM on January 22, 2010
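
The difference between matching lines and total matches can be illustrated with a short Python sketch. The function names and the sample text are invented for illustration. The first function mimics roughly what a line count such as `grep -c -i` reports, and the second counts every occurrence.

```python
import re

def count_matching_lines(text, word):
    """Number of lines that contain the word at least once (a per-line count)."""
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    return sum(1 for line in text.splitlines() if pattern.search(line))

def count_occurrences(text, word):
    """Total number of non-overlapping occurrences of the word."""
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    return len(pattern.findall(text))

# Hypothetical sample: two of the three lines contain "trust", five times in total.
sample = "Trust but verify. Trust is earned.\nNo match here.\nIn trust we trust, trust me."
print(count_matching_lines(sample, "trust"))
print(count_occurrences(sample, "trust"))
```

For a word that never appears more than once on a line, the two numbers agree, which is why a per-line count can still be a usable estimate.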


PDFs can be difficult to search, depending on how they were created. Some PDFs I have from magazines and newspapers use a form of encryption, and I have had trouble finding tools that can handle it when I want to search across them.
posted by digividal at 2:45 PM on January 22, 2010


I should have been more specific. I am on a PC running Windows.

I could definitely convert the PDF to plain text.
posted by k8lin at 2:48 PM on January 22, 2010


This is super kludgy and I'm sure there's a better way to do this involving actual software or command line stuff or what have you... but you could get the files into plain text, search and replace every space with a line return, and then paste into Excel so you have one word per line and analyze from there using pivot tables (or bring it into Access, or whatever). I don't think you'd hit the row limit: assuming 500 words/page, 100 pages comes to 50,000 or so lines.
posted by yarrow at 2:51 PM on January 22, 2010
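
The same one-word-per-line idea can be sketched in Python without a spreadsheet: split the text into words, then tally them the way a pivot table would. This is only an illustration. The function name `tally_words` and the sample sentence are invented here, and the punctuation handling is deliberately crude.

```python
from collections import Counter

def tally_words(text):
    """Split on whitespace, strip surrounding punctuation, and count lowercase words."""
    words = (w.strip('.,;:!?"()').lower() for w in text.split())
    # Skip tokens that were nothing but punctuation.
    return Counter(w for w in words if w)

# Hypothetical sample text standing in for a converted document.
counts = tally_words("The cat saw the dog. The dog ran!")
for word, n in sorted(counts.items()):
    print(word, n)
```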


Convert to text and open in any text editor that shows the number of results for a search.

I'm using Google Chrome, loading up a text file in it and searching for a word will show the # of matches. Rinse and repeat.
posted by wongcorgi at 2:54 PM on January 22, 2010


You use pdf2html (or whatever plain text converter you have) and throw your text files in a directory, then you type up another text file with all of the words and phrases you want to filter for, then you open up a command line, and you type something like:
for /f %i in (phrasefile.txt) do find /c /i "%i" \pdfdir\*.*

posted by rhizome at 2:57 PM on January 22, 2010


Here's a Perl script (countFrequencies.pl) you can use to count word frequencies:

#!/usr/bin/perl -w

use strict;

my %dictionary;
while (my $line = %lt;STDIN%gt;) {
  $line =~ s/[?;:!,.\"()]|[0-9]//g; # strip punctuation
  my @elements = split (/\s+/,$line);
  if (scalar @elements > 0) {
    foreach my $element (@elements) {
      if (! defined $dictionary{lc($element)}) {
        $dictionary{lc($element)} = 1;
      } else {
        $dictionary{lc($element)}++;
      }
    }
  }
}

foreach my $word (sort keys %dictionary) {
  print "$word: $dictionary{$word}\n";
}


To use it, type in:

countFrequencies.pl < myFile.txt

It will output something like:

a: 4
about: 4
access: 1
although: 1
amount: 1
and: 4
...
total: 2
unless: 1
unwieldy: 1
use: 2
want: 1
we're: 1
with: 1
words: 1
wouldn't: 1


if I use your Ask Metafilter question as input.

I've never done Perl on Windows, but I imagine it would be possible with Cygwin or another open source option.

For your PDF files, use pdf2txt or similar to first convert them to text.
posted by Blazecock Pileon at 3:05 PM on January 22, 2010 [1 favorite]


Sorry, slight typo. That should be:

#!/usr/bin/perl -w

use strict;

my %dictionary;
while (my $line = <STDIN>) {
  $line =~ s/[?;:!,.\"()]|[0-9]//g; # strip punctuation and numbers
  my @elements = split (/\s+/,$line);
  if (scalar @elements > 0) {
    foreach my $element (@elements) {
      if (! defined $dictionary{lc($element)}) {
        $dictionary{lc($element)} = 1;
      } else {
        $dictionary{lc($element)}++;
      }
    }
  }
}

foreach my $word (sort keys %dictionary) {
  print "$word: $dictionary{$word}\n";
}

posted by Blazecock Pileon at 3:07 PM on January 22, 2010 [3 favorites]


The "Find in files" feature of Notepad++ might do what you want (on the text files, at least). Point it at your directory, enter the word you want to find, and it will locate the word and tally the number of instances.
posted by arco at 3:14 PM on January 22, 2010


I would probably go with something like Blazecock Pileon's perl script -- unless there are hyphenation issues. Do you have hyphenation due to line breaks ("predis- posed")? How do you want to treat hyphenated phrases ("hard-earned")?
posted by mhum at 3:19 PM on January 22, 2010
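
One possible way to handle the line-break case, sketched in Python. It assumes the converted text still has its line breaks and that a hyphen at the very end of a line always marks a split word, so it will wrongly join a genuine compound that happens to break there. The function name `join_line_break_hyphens` is hypothetical.

```python
import re

def join_line_break_hyphens(text):
    """Rejoin words split by a hyphen at a line end; leave in-line hyphens alone."""
    # Match a hyphen that follows a letter, then a line break (with optional
    # surrounding whitespace), and is followed by another letter.
    return re.sub(r"(?<=[A-Za-z])-\s*\n\s*(?=[A-Za-z])", "", text)

# Hypothetical sample: "predis-/posed" is split across lines, "hard-earned" is not.
sample = "She was predis-\nposed to save her hard-earned money."
print(join_line_break_hyphens(sample))
```

Running this before counting keeps "hard-earned" as a single hyphenated token while restoring "predisposed".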


I modified rhizome's batch file command like this:
for /f "tokens=*" %i in (phrasefile.txt) do find /c /i "%i" file1.txt
I needed to search for phrases and words contained in the phrase file, which is what tokens=* allows for, and I need to look at each document separately. Otherwise, this command was exactly what I was looking for but didn't necessarily articulate very well in my question, so I'm very grateful that someone figured that out.

I imagine that Blazecock Pileon's perl script would also work (might require some massaging to get it to do phrases also, but I haven't messed with it). Rhizome's is fairly low-effort and works - I checked it using brute force on the test document and it is behaving exactly as I'd wanted it to behave (i.e. the word "trust" also returns for "trustworthy" which is great for my purposes, but might not work in other applications) - so I'll go with that for now. Although it's returning counts per line, that's ok. I really just need a close estimate of the word/phrase count because the difference between zero and one in this particular project is significant.

Thanks for all the helpful and speedy responses.
posted by k8lin at 3:54 PM on January 22, 2010 [1 favorite]
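
If exact totals are ever needed instead of per-line counts, the same phrase-file idea could be sketched in Python roughly as follows. This is an illustration under assumptions, not a drop-in replacement for the batch command: `count_phrases` is a made-up helper, matching is a case-insensitive substring search (so "trust" still matches "trustworthy"), and the phrases and document text are given inline instead of being read from phrasefile.txt and file1.txt.

```python
def count_phrases(text, phrases):
    """Return {phrase: number of non-overlapping, case-insensitive occurrences}."""
    lowered = text.lower()
    # Blank entries (e.g. empty lines in a phrase file) are skipped.
    return {p: lowered.count(p.lower()) for p in phrases if p.strip()}

# Hypothetical document text and phrase list.
document = ("Trust is earned. A trustworthy source builds public trust over time.\n"
            "Public trust matters.")
phrases = ["trust", "public trust", "reliability"]

results = count_phrases(document, phrases)
for phrase, n in results.items():
    print(f"{phrase}: {n}")
```

To run it per document, one could read each text file into a string and call the function once per file, which keeps the counts separate the same way the modified batch command does.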


You said you have access to SPSS, so why don't you use SPSS Text Analysis? This will do exactly what you are asking for and a lot more.

Also, if you want a really quick and dirty analysis of word counts, check out this site.
posted by crapples at 7:32 PM on January 22, 2010


Download the trial version of dtSearch and run its indexer on the ten PDF files (assuming that they were created as text-searchable, not scanned). dtSearch Desktop will give you a count of all the words it indexes.
posted by megatherium at 7:51 PM on January 22, 2010


BP, there's tons of extraneous stuff in that script. 'foreach' on an empty list does nothing, so the 'if scalar...' test is a no-op. Likewise ++ always treats undef as numeric 0 and sets it to 1, so the inner 'if' is a no-op as well. It would be more idiomatic perl to write the loop as:
while (<>) {
  s/[?;:!,.\"()]|[0-9]//g;
  $dictionary{lc($_)}++ foreach (split (/\s+/));
}

posted by Rhomboid at 11:32 PM on January 22, 2010


I'm sure there are lots of ways to improve it.
posted by Blazecock Pileon at 3:34 AM on January 23, 2010


For fun, I took a look at doing this with Python:

#!/usr/bin/env python3

import re
import sys

dictionary = {}
# Strip digits and common punctuation before counting
stripCharacters = re.compile(r"[0-9]+|[.,?!\"():;{}=\-+\[\]]")

for line in sys.stdin:
  strippedLine = stripCharacters.sub('', line)
  # split() with no arguments splits on whitespace and drops empty strings,
  # so blank lines and leading spaces do not produce an empty "word"
  for element in strippedLine.split():
    word = element.lower()
    dictionary[word] = dictionary.get(word, 0) + 1

# Print the words in alphabetical order with their counts
for key in sorted(dictionary):
  print(key + ": " + str(dictionary[key]))

posted by Blazecock Pileon at 10:57 AM on January 23, 2010


You could also use Patrick Juola's authorship attribution Java package JGAAP. Just select Words for Event Set & Null Histogram Analysis to analyze them.
posted by scalefree at 11:27 AM on January 23, 2010

