How to search in conversations
June 25, 2007 5:28 PM
ComputerScienceFilter: What algorithms or existing programs out there are good for searching files of conversations? Say I had the transcripts of every single AOL chatroom conversation. What algorithm would find dialogs about oranges if I searched for "oranges", and rank them according to relevance?
Couldn't you just use grep?
posted by DarkForest at 5:35 PM on June 25, 2007
Response by poster: Good point! Didn't even think about that.
I want to write my own though, so I would need the source code.
posted by lpctstr; at 5:43 PM on June 25, 2007
If you want to get fancy, you could write a little bash script around it that counted the number of times the word appeared in each file and ranked them by that.
If you want to go more complex, you might do well to look at some of the very basic search engine scripts that exist. They're designed for integration into personal websites and are floating around the web in lots of flavors: perl, python, php, VB, etc.
posted by chrisamiller at 5:44 PM on June 25, 2007
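A minimal sketch of that count-and-rank idea, in Python rather than a bash script; the directory and query word are just placeholders you point it at:

import os
import sys

def rank_by_count(root, word):
    """Count occurrences of `word` in every file under `root`, highest count first."""
    word = word.lower()
    scores = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    count = f.read().lower().count(word)
            except OSError:
                continue
            if count:
                scores.append((count, path))
    # Raw occurrence count is a crude relevance measure, but it is a ranking.
    return sorted(scores, reverse=True)

if __name__ == "__main__":
    # e.g. python rank_by_count.py transcripts/ oranges
    for count, path in rank_by_count(sys.argv[1], sys.argv[2]):
        print(count, path)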
Response by poster: chrisamiller: that's also a good idea, but too slow. I don't think a grep would finish quickly enough when run over 400 MB of files. I'm thinking at least an inverted index to achieve constant-time searching.
odinsdream: why? people implement algorithms all the time.
posted by lpctstr; at 5:47 PM on June 25, 2007
Sure, tokenize the words in each file, lowercase them, sort and uniq them, then add a row to your index (for the file), adding columns (for new words) as needed. Once this rough index is constructed, invert it so you can look up all the files that contain a word. In C++ you could map a word (string) to a vector of filenames that contain the word. Or you could use a regular SQL database.
posted by DarkForest at 6:00 PM on June 25, 2007
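A minimal sketch of that scheme, using a Python dict-of-sets in place of the C++ map-of-vectors DarkForest mentions, and assuming plain-text transcripts under one directory:

import os
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9']+")

def build_index(root):
    index = defaultdict(set)              # word -> set of files containing it
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                # tokenize, lowercase, and dedupe: the sort/uniq step
                words = set(TOKEN.findall(f.read().lower()))
            for word in words:
                index[word].add(path)
    return index

# After one pass over the corpus, a query is a single dictionary probe
# instead of a full scan:
# index = build_index("transcripts/")
# print(index.get("oranges", set()))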
For the 'real thing' algorithm, start at Wikipedia's 'see also' links. Warning: wherever the name Donald Knuth comes up, your brain will start to melt.
For big serious searches, start with Lucene.
Even with 400 MB (which is relatively small), you'll need to scan and index the files to get respectable performance - so 'lazy' searching is out of the question.
posted by tmcw at 6:02 PM on June 25, 2007 [1 favorite]
Best answer: The first part of your question, which deals with searching, is most commonly called Full Text Indexing. Bits and pieces of the general scheme of things have been mentioned above (tokenization, sorting, indexing, etc.). Lucene, as mentioned by tmcw, is a great place to start because it is a very well respected open source full text search engine. That's where you are going to see all the hard parts of this problem implemented (index maintenance, adding/removing items in the index, storing the index in some way that makes it usable for fast searches).
As for your second part, relevance, that is much, much harder, and it involves understanding the semantics of what you are indexing (also known as "the corpus"). For example, with your chat transcripts you might weigh the usernames of the participants more or less heavily than the actual chat dialog. Once you enter the real world of documents and files, you can do things like treat subject headings and titles as more important than normal body text. Or, as a more unusual example, I know that the Microsoft SharePoint indexer actually gives more weight to text terms found in spreadsheet documents vs. word processing documents, on the theory that words in spreadsheets are far less common, so if they show up they are likely much more important.
posted by mmascolino at 6:17 PM on June 25, 2007 [3 favorites]
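One standard way to make "relevance" concrete is TF-IDF (term frequency times inverse document frequency); the toy Python sketch below adds the per-field weighting mmascolino describes. The field names, weights, and sample documents are invented for illustration:

import math

# Per-field weights -- these names and numbers are assumptions, not a standard.
FIELD_WEIGHTS = {"username": 0.5, "dialog": 1.0}

def rank(docs, word):
    """docs is a list of {field: text} dicts; returns (score, index), best first."""
    word = word.lower()
    # document frequency: how many docs mention the word in any field
    df = sum(1 for d in docs
             if any(word in t.lower().split() for t in d.values()))
    if df == 0:
        return []
    idf = math.log(len(docs) / df) + 1.0   # +1 keeps ubiquitous words from zeroing out
    scored = []
    for i, d in enumerate(docs):
        tf = sum(w * d.get(f, "").lower().split().count(word)
                 for f, w in FIELD_WEIGHTS.items())
        if tf:
            scored.append((tf * idf, i))
    return sorted(scored, reverse=True)

docs = [
    {"username": "citrusfan", "dialog": "anyone here like oranges"},
    {"username": "oranges", "dialog": "hi everyone"},
]
print(rank(docs, "oranges"))   # the dialog hit outranks the username-only hit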
I don't know if this is still true, but back in the dim and misty past, the first place to look for hardcore computer science was The Art of Computer Programming by Donald Knuth. Volume 3 covers searching and sorting algorithms.
posted by RussHy at 6:20 PM on June 25, 2007
There are probably better books by now. I took a course on information retrieval (in 1983) using Salton's book.
posted by DarkForest at 6:25 PM on June 25, 2007
The second edition of Knuth's volume 3 came out in 1998. But they are both heavyweights in the ACM. I give Knuth the nod until Salton wins his Turing Award. :)
posted by RussHy at 6:32 PM on June 25, 2007
Salton's book might just be a little more specialized on document indexing and retrieval than the Knuth, but I haven't looked at the Knuth Vol. 3, so I couldn't be sure.
posted by DarkForest at 6:49 PM on June 25, 2007
I wrote a trivial IRC log indexer using ferret (a Ruby/C port of Lucene) a couple of months ago; it came to about 40 lines. Maybe a nicer starting point than leaping into Java. It's supposed to be faster too (!).
"I don't think a grep would finish quickly enough if used over 400 MB of files"
Depends what you're doing:
-% du -sh irclogs
1.9G irclogs
-% time grep -Ri oranges irclogs > /dev/null
Real: 0:04.80 CPU: 99.7% (2.233/2.566) Page: 0 Swap: 0 I/O: (0/0) Mem: 606
Keeping in mind that you can run one grep per core, brute force can work quite well if the data fits in memory.
posted by Freaky at 10:21 PM on June 25, 2007
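A rough Python version of that brute-force scan, split across cores with a process pool; the "irclogs" directory just mirrors Freaky's example:

import os
from multiprocessing import Pool

def count_in_file(args):
    path, word = args
    try:
        with open(path, encoding="utf-8", errors="ignore") as f:
            return path, f.read().lower().count(word)
    except OSError:
        return path, 0

def parallel_scan(root, word):
    paths = [os.path.join(d, n) for d, _, ns in os.walk(root) for n in ns]
    with Pool() as pool:                  # one worker process per core by default
        counts = pool.map(count_in_file, [(p, word) for p in paths])
    return sorted(((c, p) for p, c in counts if c), reverse=True)

if __name__ == "__main__":
    for count, path in parallel_scan("irclogs", "oranges"):
        print(count, path)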
"I don't think a grep would finish quickly enough if used over 400 MB of files"
Depends what you're doing:
-% du -sh irclogs
1.9G irclogs
-% time grep -Ri oranges irclogs > /dev/null
Real: 0:04.80 CPU: 99.7% (2.233/2.566) Page: 0 Swap: 0 I/O: (0/0) Mem: 606
Keeping in mind you can run one per core, brute-force can work quite well if the data fits in memory.
posted by Freaky at 10:21 PM on June 25, 2007