How can I load (and sort) a really huge text file in Perl?
October 8, 2008 10:06 PM

How do I load and then sort a really huge (94 MB) text file list of words? Perl runs out of memory trying to load all the lines-- before I have a chance to try to sort it.

My text file contains a list of words, one per line. I'd like to sort this list into alphabetical order (and then ultimately traverse the sorted list to easily parse out all of the non-unique tokens). Here is what I was doing in Perl:


open LARGELIST, "large_source_file.txt" or die $!;
print "Loading file...\n";
@lines = <LARGELIST>;
print "Done";
@sorted = sort(@lines);

After chugging for a while, perl runs out of memory while at the loading-file stage. I'm definitely not a Perl guru, so is there a more memory-efficient way I should do this (or is there some entirely different approach/language I should be trying)?
posted by kosmonaut to Computers & Internet (25 answers total) 4 users marked this as a favorite
 
Best answer: What operating system are you on? If it's a Unix offshoot, there's a "sort" command available from the shell.
posted by Class Goat at 10:20 PM on October 8, 2008


My non-expert solution would be to load chunks of the file, say 1MB each, sort and filter out unique items from each and save to memory or temporary files. You can then combine the sorted lists stepwise --- supposing the number of unique words is significantly smaller than the total --- and sort again, or use some simple (probably horribly inefficient) algorithm like picking the smallest element off the top of each list.

All this would take quite a bit of code to implement, but I don't think there is a trivial solution to your problem.
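
A rough sketch of that in Perl, to give a flavor of it (the 100,000-line chunk size and the output file name are arbitrary, and the combining step is the simple pick-the-smallest-off-the-top approach described above):

use strict;
use warnings;
use File::Temp qw(tempfile);

# Phase 1: read the big file in fixed-size chunks, sort and dedupe
# each chunk, and write every chunk to its own temporary file.
my @chunk_files;
open my $in, '<', 'large_source_file.txt' or die $!;
my @chunk;
while (my $line = <$in>) {
    push @chunk, $line;
    if (@chunk >= 100_000 || eof $in) {
        my %uniq = map { $_ => 1 } @chunk;
        my ($tmp_fh, $tmp_name) = tempfile();
        print {$tmp_fh} $_ for sort keys %uniq;
        close $tmp_fh;
        push @chunk_files, $tmp_name;
        @chunk = ();
    }
}
close $in;

# Phase 2: merge the sorted chunks by repeatedly taking the smallest
# line currently sitting at the head of any chunk file.
my @fhs   = map { open my $fh, '<', $_ or die $!; $fh } @chunk_files;
my @heads = map { scalar readline($_) } @fhs;
open my $out, '>', 'sorted_unique.txt' or die $!;
my $last = '';
while (grep { defined } @heads) {
    my ($best) = sort { $heads[$a] cmp $heads[$b] }
                 grep { defined $heads[$_] } 0 .. $#heads;
    print {$out} $heads[$best] unless $heads[$best] eq $last;  # skip cross-chunk duplicates
    $last = $heads[$best];
    $heads[$best] = readline($fhs[$best]);
}
close $out;
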
posted by ghost of a past number at 10:21 PM on October 8, 2008


Wow -- 94MB at one word per line definitely qualifies as a "really huge" file.

My answer is probably of no help to you, but as a FileMaker developer I can tell you it would be very easy to do these tasks in FileMaker Pro, although it would take a while to sort and then de-dupe the resulting file. The end result would be a sorted text file with a single, unique instance of each word. Although I honestly have no idea how long "a while" is in this case... which sort of has me curious.

If you just want the results and don't mind hosting "large_source_file.txt" somewhere, I'd be glad to download it and send you a text file with the resulting output. However, if you are really more interested in doing this as an exercise, I will cheerfully defer to someone with a broader programming repertoire.
posted by mosk at 10:27 PM on October 8, 2008


TextPad under Windows may be able to do this.
posted by bottlebrushtree at 10:27 PM on October 8, 2008


A simpler way occurred to me, though it is probably very inefficient: read one line at a time and build up a sorted list of unique words as you go. You can't use fancy sort algorithms this way, but at least you won't run out of memory.
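
A minimal sketch of that idea, assuming the set of unique words (though not the whole file) fits in memory; the splice keeps the list sorted as it grows, which makes each insert slow but keeps the code simple:

use strict;
use warnings;

open my $fh, '<', 'large_source_file.txt' or die $!;
my (@sorted, %seen);
while (my $word = <$fh>) {
    chomp $word;
    next if $seen{$word}++;      # skip words we've already placed
    # binary search for the insertion point in the sorted list
    my ($lo, $hi) = (0, scalar @sorted);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($sorted[$mid] lt $word) { $lo = $mid + 1 } else { $hi = $mid }
    }
    splice @sorted, $lo, 0, $word;   # insert, keeping the list in order
}
close $fh;
print "$_\n" for @sorted;
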
posted by ghost of a past number at 10:32 PM on October 8, 2008


Best answer: If you're on a Unix or Unix work-alike, in addition to the sort command, there is the 'uniq' command, which will spit out a file without duplicate lines. Try something like:

uniq -i large_source_file.txt unique_src_file.txt

There are also sorting algorithms that can sort in pieces at a time, like Merge Sort. From Wikipedia:
Merge sort is so inherently sequential that it's practical to run it using slow tape drives as input and output devices. It requires very little memory, and the memory required does not change with the number of data elements.
Send me an e-mail if you want a sample implementation or something.
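
To make the low-memory claim concrete, here is a minimal sketch of the merge step for two already-sorted files (the file names are placeholders); only one line from each input is held in memory at any moment, and repeating this pairwise over sorted chunks is the disk-based merge sort the quote describes:

use strict;
use warnings;

# Merge two already-sorted word files into one sorted output file.
sub merge_sorted_files {
    my ($file_a, $file_b, $out_file) = @_;
    open my $fh_a, '<', $file_a   or die $!;
    open my $fh_b, '<', $file_b   or die $!;
    open my $out,  '>', $out_file or die $!;
    my $line_a = <$fh_a>;
    my $line_b = <$fh_b>;
    while (defined $line_a && defined $line_b) {
        if ($line_a le $line_b) {
            print {$out} $line_a;
            $line_a = <$fh_a>;
        } else {
            print {$out} $line_b;
            $line_b = <$fh_b>;
        }
    }
    # One input ran out; copy whatever is left of the other.
    while (defined $line_a) { print {$out} $line_a; $line_a = <$fh_a>; }
    while (defined $line_b) { print {$out} $line_b; $line_b = <$fh_b>; }
    close $_ for $fh_a, $fh_b, $out;
}

merge_sorted_files('sorted_chunk_1.txt', 'sorted_chunk_2.txt', 'merged.txt');
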
posted by ayerarcturus at 10:36 PM on October 8, 2008


How much RAM is Perl using?? You could write this program in C, and it wouldn't have a problem as long as you had ~94 MB RAM available. I think Perl is doing something bogo.. maybe sort() creates a copy?

Here's Python code that at least does an in-place sort...

f = open('large_source_file.txt')
lines = f.readlines()   # reads the whole file into a list of lines at once
f.close()
lines.sort()            # sorts the list in place, without making a copy
posted by qxntpqbbbqxl at 10:38 PM on October 8, 2008


Best answer: I think ayerarcturus is on the right track. Unix is very good at this and, if you don't have Unix, downloading Cygwin will give you the tools you need. The amazing thing about Unix is that if you want a sorted list of only the unique words, it takes just one command (sort -u).
posted by eisenkr at 10:51 PM on October 8, 2008


Best answer: It's running out of memory when you are dumping the contents of the file into an array. Process the file one line at a time or in chunks and you won't have that problem. Another possible solution is ulimit, assuming you are on a unix machine.
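
A sketch of the one-line-at-a-time version (assuming the distinct words, though not the whole file, fit comfortably in memory); as a bonus, the hash of counts makes it easy to pull out the non-unique tokens the question asks about:

use strict;
use warnings;

my %count;
open my $fh, '<', 'large_source_file.txt' or die $!;
while (my $word = <$fh>) {
    chomp $word;
    $count{$word}++;    # only distinct words are kept in memory
}
close $fh;

# Sorted list of words, with a count next to each.
for my $word (sort keys %count) {
    print "$word\t$count{$word}\n";
}
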
posted by robtf3 at 10:52 PM on October 8, 2008


Sitting at the shell, just do:

sort -u large_source_file.txt > unique_tokens_file.txt

Having said that, I've slurped much, much bigger files using Perl. A properly configured installation on vaguely modern hardware should not keel over on a paltry 94 meg text file, even copying it like you're doing. Either your system or Perl installation is broken in some way.
posted by majick at 11:01 PM on October 8, 2008


# Maybe try the following?


open LARGELIST, "large_source_file.txt" or die $!;
print "Loading file...\n";
while (sort(<LARGELIST>)) {
print $_, "\n";
}
close LARGELIST;

# Should definitely be less overhead. Otherwise
# you could write out to a second file, and handle it that
# way, or use the unix 'sort' command and be done with it.
posted by o0o0o at 11:06 PM on October 8, 2008


Response by poster: Thanks for the great suggestions and explanations.

I'm running a unix (OS X). I'm now trying the sort command in the shell. At the moment it is running and currently consuming a slowly-growing 444 MB of RAM (I have 2 GB). I'm not sure how long it will take to complete (or fail), but if it works, then I'm in great shape. And as eisenkr pointed out, it even has a -u flag for finding unique entries -- however, the uniq command looks like it can also give me the count of each unique item, which would be even better for me. It looks like I still need to have the lines sorted before running uniq, though.

As for where Perl went wrong: I am using the standard install on an OS X system, which I assume should function reasonably. Also, I have been working with other files of similar size (including the file whose words I parsed to create this one-word-per-line file) and Perl didn't choke, so I am thinking it must be like robtf3 suggested-- that there was an array issue because of the ridiculous number of lines, combined with the way I was implementing things in my code.

Again, thanks for all the suggestions!
posted by kosmonaut at 11:17 PM on October 8, 2008


Best answer: sort wordfile | uniq -c | sort -t ' ' -k 1nr,1

will give you the output sorted by frequency. If you drop the final sort, you'll have the results in alphabetic order.
posted by zippy at 11:31 PM on October 8, 2008


I don't know about OSX's sort, but on some unix systems I've used, the sort command will sort smaller files in-core and will do a disk-based mergesort for larger files (presumably doing the first pass in-core).
posted by hattifattener at 11:36 PM on October 8, 2008


In HTML you use &lt;FOO&gt; to get <FOO>, and Metafilter is no exception.
posted by Rhomboid at 11:55 PM on October 8, 2008


Having said that, I've slurped much, much bigger files using Perl. A properly configured installation on vaguely modern hardware should not keel over on a paltry 94 meg text file, even copying it like you're doing. Either your system or Perl installation is broken in some way.

It is my understanding that Perl only runs out of memory when the operating system stops letting it allocate memory. However, if you're using a shared machine there might be a limit on memory per user. You can check this with the command
ulimit -a
which will give an output along the lines of:
userfoo@serverbar:~$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
max nice                        (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) unlimited
max locked memory       (kbytes, -l) 409600
max memory size         (kbytes, -m) 409600
open files                      (-n) 2048
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) unlimited
max rt priority                 (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 150
virtual memory          (kbytes, -v) 409600
file locks                      (-x) unlimited
So user userfoo on server serverbar cannot load more than 409600 kilobytes, or 400 megabytes, into memory. Try to allocate more than that and you'll get an 'out of memory' error like you're seeing.

Memory limits like this make a lot of sense on shared servers because if user A uses up all the physical memory (either with a big file or a misbehaving program) they can dramatically reduce the machine's performance for other users.
posted by Mike1024 at 1:14 AM on October 9, 2008


The sort | uniq solution is probably your best bet. If you want to do this in Perl, look at the module Tie::File. From the documentation:
The file is not loaded into memory, so this will work even for gigantic files.
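
A rough sketch of how that might look here (the counting is the same idea as processing the file a line at a time; Tie::File just handles the on-demand reads and keeps only a small cache in memory):

use strict;
use warnings;
use Tie::File;

# Each array element is one line of the file, fetched from disk on demand.
tie my @lines, 'Tie::File', 'large_source_file.txt'
    or die "Cannot tie file: $!";

my %count;
$count{$_}++ for @lines;    # walks the file without slurping it

for my $word (sort keys %count) {
    print "$word\t$count{$word}\n";
}

untie @lines;
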
posted by singingfish at 4:48 AM on October 9, 2008


I'm no Perl whiz, but if you really want to do it in a script I'd consider using a database like SQLite to store the sorted version, and then just read BIGFILE in a line (or chunk) at a time.
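
A rough sketch of that approach, assuming the DBI and DBD::SQLite modules are installed (the database file and table names here are made up for illustration):

use strict;
use warnings;
use DBI;

# The words live on disk in the SQLite file, so Perl's memory stays small;
# the database handles the sorting and the duplicate counting.
my $dbh = DBI->connect('dbi:SQLite:dbname=words.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });
$dbh->do('CREATE TABLE IF NOT EXISTS words (word TEXT)');

my $insert = $dbh->prepare('INSERT INTO words (word) VALUES (?)');
$dbh->begin_work;    # one big transaction keeps the inserts reasonably fast
open my $fh, '<', 'large_source_file.txt' or die $!;
while (my $word = <$fh>) {
    chomp $word;
    $insert->execute($word);
}
close $fh;
$dbh->commit;

# Sorted output, one row per distinct word, with its count.
my $rows = $dbh->selectall_arrayref(
    'SELECT word, COUNT(*) FROM words GROUP BY word ORDER BY word');
print "$_->[0]\t$_->[1]\n" for @$rows;

$dbh->disconnect;
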
posted by meta_eli at 5:51 AM on October 9, 2008


Wow. No one else mentioned it, so here goes.

You made a rookie Perl mistake. You read the WHOLE file into memory first:

open LARGELIST, "large_source_file.txt" or die $!;
print "Loading file...\n";
@lines = <LARGELIST>;
print "Done";
@sorted = sort(@lines);


As mentioned earlier, you need to use something sequential like a merge sort that allows you to look at only pieces at a time. A bubblesort would work as well.

Someone suggested

while (sort(<LARGELIST>)) {

Which is also wrong. The sort interprets your file as an array, loading it ALL at once. You need to do something like:
my $a = '';
while $b () {
if ( $b > $a ) ... bubblesort stuff, look up the algo, it's the same in every language.
}

posted by judge.mentok.the.mindtaker at 6:55 AM on October 9, 2008


Oops forgot to encode my entities in my haste -- should be:

my $a = '';
while (my $b = <LARGELIST>) {
    if ( $b gt $a ) ... bubblesort stuff, look up the algo, it's the same in every language
}

posted by judge.mentok.the.mindtaker at 6:57 AM on October 9, 2008


You should probably just do it brute-force. Write a C program that does a bubble sort on the input file, and outputs the results to the output file. Terribly inefficient, but with today's processors, should work fine.

Basic process:

Outside the program, copy the existing file to "FILE1"

Open FILE1 for input. Open FILE2 for output.

Read in first word as WORD1.

Start loop.

Read in next word as WORD2.

Compare the two.

Output the lesser (or greater) of the two words to FILE2.

Move remaining word to WORD1.

Repeat loop until EOF, then write out the remaining WORD1.

Rename FILE2 to FILE1.

Using boolean flags, repeat this process until no more word swaps have been made.
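
For what it's worth, a sketch of that process in Perl rather than C (FILE1 and FILE2 are the scratch names from the steps above; as noted, it gets there, just very slowly):

use strict;
use warnings;
use File::Copy qw(copy move);

copy('large_source_file.txt', 'FILE1') or die "copy failed: $!";

my $swapped = 1;
while ($swapped) {              # one full bubble pass per iteration
    $swapped = 0;
    open my $in,  '<', 'FILE1' or die $!;
    open my $out, '>', 'FILE2' or die $!;
    my $word1 = <$in>;
    while (defined(my $word2 = <$in>)) {
        if ($word1 gt $word2) {             # out of order: emit the lesser, carry the greater
            print {$out} $word2;
            $swapped = 1;
        } else {
            print {$out} $word1;
            $word1 = $word2;
        }
    }
    print {$out} $word1 if defined $word1;  # write out the remaining word
    close $in;
    close $out;
    move('FILE2', 'FILE1') or die "rename failed: $!";
}
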
posted by gjc at 6:59 AM on October 9, 2008


This is why computer science textbooks gave us binary trees.

#!/usr/bin/perl
use Tree::RedBlack; # available at a CPAN mirror near you

$t = Tree::RedBlack->new;

# Insert every line as a key; the tree keeps the keys in sorted order.
while (<>) {
    $t->insert($_, 0);
}

# Iterative in-order traversal: walk as far left as possible,
# pop a node, print its key, then descend into its right subtree.
$n = $t->root;
while (1) {
    while ($n) {
        push @s, $n;
        $n = $n->left;
    }
    $n = pop @s;
    last unless $n;
    print $n->key;
    $n = $n->right;
}

posted by Zed_Lopez at 10:34 AM on October 9, 2008


gjc, bubble sort on 94 meg of words would be a legendarily bad idea. How bad? Using some figures from an algorithm analysis table:
Say there are 470 million words there (average 5 letters/word).
Bubble sort would take on average O(n^2), or 2.2 * 10^17 comparisons.
Quicksort would take O(n*log(n)), or 4 * 10^9 comparisons.
At, say, a billion comparisons per second (1 GHz), that's 61,361 hours - about 7 years - for bubble sort vs. 4 seconds for quicksort.

Friends don't let friends bubble sort*.

*For any n over, say, 30. Even then, it's likely a bad idea. I'm rusty on my algorithms.
posted by Pronoiac at 3:07 PM on October 9, 2008 [2 favorites]


How do you fit 470 million words into a 94 meg uncompressed file?
posted by Class Goat at 5:56 PM on October 9, 2008


Use a smaller typeface?

How I bungled the math: multiply instead of divide.

So, with about 19 million words:
Bubble sort would "only" take about 100 hours, or roughly four days.
Quicksort would take about .15 second.
posted by Pronoiac at 6:21 PM on October 9, 2008

