Comments on: How can I load (and sort) a really huge text file in Perl?

Question: How can I load (and sort) a really huge text file in Perl?

kosmonaut — Wed, 08 Oct 2008 22:06:49 -0800

How do I load and then sort a really huge (94 MB) text file list of words? Perl runs out of memory trying to load all the lines-- before I have a chance to try to sort it.

My text file contains a list of words, one per line. I'd like to sort this list into alphabetical order (and then ultimately traverse the sorted list to easily parse out all of the non-unique tokens). Here is what I was doing in Perl:



open LARGELIST, "large_source_file.txt" or die $!;

print "Loading file...\n";

@lines = [LARGELIST];

print "Done";

@sorted = sort(@lines);

(The brackets around LARGELIST are supposed to be the less-than and greater-than signs, but that would get stripped out in AskMeFi for looking like an HTML tag.)

After chugging for a while, perl runs out of memory while at the loading-file stage. I'm definitely not a Perl guru, so is there a more memory-efficient way I should do this (or is there some entirely different approach/language I should be trying)?

By: Class Goat

Class Goat — Wed, 08 Oct 2008 22:20:33 -0800

What operating system are you on? If it's a Unix offshoot, there's a "sort" command available from the shell.

By: ghost of a past number

ghost of a past number — Wed, 08 Oct 2008 22:21:58 -0800

My non-expert solution would be to load chunks of the file, say 1MB each, sort and filter out unique items from each and save to memory or temporary files. You can then combine the sorted lists stepwise --- supposing the number of unique words is significantly smaller than the total --- and sort again, or use some simple (probably horribly inefficient) algorithm like picking the smallest element off the top of each list.

All this would take quite a bit of code to implement, but I don't think there is a trivial solution to your problem.

By: mosk

mosk — Wed, 08 Oct 2008 22:27:22 -0800

Wow -- 94MB at one word per line definitely qualifies as a "really huge" file.

My answer is probably of no help to you, but as a FileMaker developer
I can tell you it would be very easy to do these tasks in FileMaker Pro, although it would take a while to sort and then de-dupe the resulting file. The end result would be a sorted text file with a single, unique instance of each word. Although I honestly have no idea how long "a while" is in this case...which sort of has me curious.

If you just want the results and don't mind hosting "large_source_file.txt" somewhere, I'd be glad to download it and send you a text file with the resulting output. However, if you are really more interested in doing this as an exercise, I will cheerfully defer someone with a broader programming repertoire.

By: bottlebrushtree

bottlebrushtree — Wed, 08 Oct 2008 22:27:49 -0800

Textpad under windows may be able to do this.

By: ghost of a past number

ghost of a past number — Wed, 08 Oct 2008 22:32:50 -0800

A simpler way occured to me, which however is probably very inefficent: Read one line at a time and build up a sorted list of unique words as you go. You can't use fancy sort algorithms this way, but at least you won't run out of memory.

By: ayerarcturus

ayerarcturus — Wed, 08 Oct 2008 22:36:20 -0800

If you're on a Unix or Unix work-alike, in addition to the sort command, there is the 'uniq' command, which will spit out a file without duplicate lines. Try something like:

uniq -i large_source_file.txt unique_src_file.txt

There are also sorting algorithms that can sort in pieces at a time, like Merge Sort. From Wikipedia:

Merge sort is so inherently sequential that it's practical to run it using slow tape drives as input and output devices. It requires very little memory, and the memory required does not change with the number of data elements.

Send me an e-mail if you want a sample implementation or something.

By: qxntpqbbbqxl

qxntpqbbbqxl — Wed, 08 Oct 2008 22:38:35 -0800

How much RAM is Perl using?? You could write this program in C, and it wouldn't have a problem as long as you had ~94 MB RAM available. I think Perl is doing something bogo.. maybe sort() creates a copy?

Here's Python code that at least does an in-place sort...

f = open('large_source_file.txt')
lines = f.readlines()
f.close()
lines.sort()

By: eisenkr

eisenkr — Wed, 08 Oct 2008 22:51:42 -0800

I think ayerarcturus is onto the right track. Unix is very good at this and, if you don't have unix, downloading CYGWIN will give you the tools you need. The amazing thing about unix is that if you want a sorted list of only the unique words, it takes just one commend (sort -u).

By: robtf3

robtf3 — Wed, 08 Oct 2008 22:52:21 -0800

It's running out of memory when you are dumping the contents of the file into an array. Process the file one line at a time or in chunks and you won't have that problem. Another possible solution is ulimit, assuming you are on a unix machine.

By: majick

majick — Wed, 08 Oct 2008 23:01:22 -0800

Sitting at the shell, just do:

sort -u large_source_file.txt > unique_tokens_file.txt

Having said that, I've slurped much, much bigger files using Perl. A properly configured installation on vaguely modern hardware should not keel over on a paltry 94 meg text file, even copying it like you're doing. Either your system or Perl installation is broken in some way.

By: o0o0o

o0o0o — Wed, 08 Oct 2008 23:06:30 -0800

# Maybe try the following?

open LARGELIST, "large_source_file.txt" or die $!;
print "Loading file...\n";
while (sort([LARGELIST])) {
print $_, "\n";
}
close LARGELIST;

# Should definitely be less overhead. Otherwise
# you could write out to a second file, and handle it that
# way, or use the unix 'sort' command and be done with it.
#
# Note: using [] in keeping with your example above.

By: kosmonaut

kosmonaut — Wed, 08 Oct 2008 23:17:23 -0800

Thanks for the great suggestions and explanations.

I'm running a unix (OS X). I'm now trying the sort command in the shell. At the moment it is running and currently consuming a slowly-growing 444 MB of RAM (I have 2 GB). I'm not sure how long it will take to complete (or fail), but if it works, then I'm in great shape. And as eisenkr pointed out, it even has a -u flag for finding unique entries–– however, the uniq command looks like it can also give me the count of each unique item, which would be even better for me. It looks like I still need to have the lines sorted before running uniq, though.

As for where Perl went wrong: I am using the standard install on an OS X system, which I assume should function reasonably. Also, I have been working with other files of similar size (including the file whose words I parsed to create this one-word-per-line file) and Perl didn't choke, so I am thinking it must be like robtf3 suggested-- that there was an array issue because of the ridiculous number of lines, combined with the way I was implementing things in my code.

Again, thanks for all the suggestions!

By: zippy

zippy — Wed, 08 Oct 2008 23:31:40 -0800

sort wordfile | uniq -c | sort -t ' ' -k 1nr,1

will give you the output sorted by frequency. If you drop the final sort, you'll have the results in alphabetic order.

By: hattifattener

hattifattener — Wed, 08 Oct 2008 23:36:49 -0800

I don't know about OSX's sort, but on some unix systems I've used, the sort command will sort smaller files in-core and will do a disk-based mergesort for larger files (presumably doing the first pass in-core).

By: Rhomboid

Rhomboid — Wed, 08 Oct 2008 23:55:05 -0800

In HTML you use <FOO> to get <FOO>, and Metafilter is no exception.

By: Mike1024

Mike1024 — Thu, 09 Oct 2008 01:14:27 -0800

Having said that, I've slurped much, much bigger files using Perl. A properly configured installation on vaguely modern hardware should not keel over on a paltry 94 meg text file, even copying it like you're doing. Either your system or Perl installation is broken in some way.

It is my understanding that Perl only runs out of memory when the operating system stops letting it allocate memory. However, if you're using a shared machine there might be a limit on memory per user. You can check this with the command

ulimit -a

which will give an output along the lines of:

userfoo@serverbar:~$ ulimit -a

core file size          (blocks, -c) 0

data seg size           (kbytes, -d) unlimited

max nice                        (-e) 0

file size               (blocks, -f) unlimited

pending signals                 (-i) unlimited

max locked memory       (kbytes, -l) 409600

max memory size         (kbytes, -m) 409600

open files                      (-n) 2048

pipe size            (512 bytes, -p) 8

POSIX message queues     (bytes, -q) unlimited

max rt priority                 (-r) 0

stack size              (kbytes, -s) 8192

cpu time               (seconds, -t) unlimited

max user processes              (-u) 150

virtual memory          (kbytes, -v) 409600

file locks                      (-x) unlimited

So user userfoo on server serverbar cannot load more than 409600 kilobytes, or 400 megabytes, into memory. Try to allocate more than that and you'll get an 'out of memory' error like you're seeing.

Memory limits like this make a lot of sense on shared servers because if user A uses up all the physical memory (either with a big file or a misbehaving program) they can dramatically reduce the machine's performance for other users.

By: singingfish

singingfish — Thu, 09 Oct 2008 04:48:11 -0800

The sort | uniq solutionis probably your best bet. If you want to do this in Perl, look at the module Tie::File. From the documentation:

The file is not loaded into memory, so this will work even for gigantic files.

By: meta_eli

meta_eli — Thu, 09 Oct 2008 05:51:39 -0800

I'm no Perl whiz, but if you really want to do it in a script I'd consider using a database like SQLite to store the sorted version, and then just read BIGFILE in a line (or chunk) at a time.

By: judge.mentok.the.mindtaker

judge.mentok.the.mindtaker — Thu, 09 Oct 2008 06:55:53 -0800

Wow. Noone else mentioned it so here it goes.

You made a rookie Perl mistake. You read the WHOLE file into memory first:

open LARGELIST, "large_source_file.txt" or die $!;
print "Loading file...\n";
@lines = [LARGELIST];
print "Done";
@sorted = sort(@lines);

As mentioned earlier, you need to use something sequential like a merge sort that allows you to look at only pieces at a time. A bubblesort would work as well.

Someone suggested

while (sort([LARGELIST])) {

Which is also wrong. The sort interprets your file as an array, loading it ALL at once. You need to do something like:
my $a = '';
while $b () {
if ( $b > $a ) ... bubblesort stuff, look up the algo, it's the same in every language.
}

By: judge.mentok.the.mindtaker

judge.mentok.the.mindtaker — Thu, 09 Oct 2008 06:57:03 -0800

Oops forgot to encode my entities in my haste -- should be:

my $a = '';
while $b (<LARGELIST>) {
if ( $b > $a ) ... bubblesort stuff, look up the algo, it's the same in every language.
}

By: gjc

gjc — Thu, 09 Oct 2008 06:59:51 -0800

You should probably just do it brute-force. Write a C program that does a bubble sort on the input file, and outputs the results to the output file. Terribly inefficient, but with today's processors, should work fine.

Basic process:

Outside the program, copy the existing file to "FILE1"

Open FILE1 for input. Open FILE2 for output.

Read in first word as WORD1.

Start loop.

Read in next word as WORD2.

Compare the two.

Output the lesser (or greater) of the two words to FILE2.

Move remaining word to WORD1.

Repeat loop until EOF.

Rename FILE2 to FILE1.

Using boolean flags, repeat this process until no more word swaps have been made.

By: Zed_Lopez

Zed_Lopez — Thu, 09 Oct 2008 10:34:04 -0800

This is why computer science textbooks gave us binary trees.

#!/usr/bin/perl

use Tree::RedBlack; # available at a CPAN mirror near you



$t=Tree::RedBlack->new;

while (<>) {

  $t->insert($_,0)

}



$n =  $t->root;

while (1) {

  while ($n) {

    push @s, $n;

    $n = $n->left;

  }

  $n = pop @s;

  last unless $n;

  print $n->key;

  $n = $n->right;

}

By: Pronoiac

Pronoiac — Thu, 09 Oct 2008 15:07:34 -0800

gjc, bubble sort on 94 meg of words would be a legendarily bad idea. How bad? Using some figures from an algorithm analysis table:
Say there are 470 million words there (average 5 letters/word).
Bubble sort would take on average O(n^2), or 2.2 * 10^17 comparisons.
Quicksort would take O(n*log(n)), or 4 * 10^9 comparisons.
At, say, 1Ghz for comparisons, that's 61,361 hours - 7 years for bubble sort - vs 4 seconds for quicksort.

Friends don't let friends bubble sort^*.

^*For any n over, say, 30. Even then, it's likely a bad idea. I'm rusty on my algorithms.

By: Class Goat

Class Goat — Thu, 09 Oct 2008 17:56:29 -0800

How do you fit 470 million words into a 94 meg uncompressed file?

By: Pronoiac

Pronoiac — Thu, 09 Oct 2008 18:21:39 -0800

Use a smaller typeface?

How I bungled the math: multiply instead of divide.

So, with about 19 million words:
Bubble sort would "only" take about 100 days.
Quicksort would take about .15 second.