How to find-and-replace on a BIIIIIIG file?
June 12, 2018 2:53 PM

I have an XML file that is large. I mean 250MB or so, large. It's actually nothing but text, but there's a lot of it. There is a certain short phrase (about 10 characters) that occurs about 120,000 times, and each instance of it needs to be swapped out with another phrase that's around the same size. I'm not used to working with documents this big, how do I do it?

Here's what you need to know:

Computer level: 2011 Macbook Air. (A Chromebook is also available if that helps.)
The story so far: Both MS Word and LibreOffice crash the computer upon opening the file. I am only able to even open the file using TextEdit. When I tried to use the find-and-replace function in TextEdit, it hung, then crashed the computer again.

What other options can I try? I also have Homebrew installed on the Mac so if the solution calls for installing some package, I'm OK with it.
posted by The Pluto Gangsta to Computers & Internet (15 answers total) 3 users marked this as a favorite
 
Download Sublime Text and use the 'search and replace' functionality. 250 MB isn't really all that large these days; Sublime Text should handle it fine.
posted by Fidel Cashflow at 2:56 PM on June 12, 2018 [5 favorites]


Are you at all comfortable in the terminal?

sed is a Unix utility that is already installed on your system and can do this sort of thing about as quickly as possible.
sed -i 's/original_text_here/new_text_here/g' some_really_big_file_name_here
posted by mce at 3:00 PM on June 12, 2018 [11 favorites]


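In macOS, you'll need to do sed like so:

sed -i '.bak' 's/old_string/new_string/g' big_file.txt

...where the argument after '-i' is the suffix for a backup copy of the original (here, big_file.txt.bak), not the filetype. Pass '' instead if you don't want a backup.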

Works a treat, though.
posted by sandettie light vessel automatic at 3:07 PM on June 12, 2018 [1 favorite]


mce's example assumes you're using GNU sed, whereas macOS ships with BSD sed, so the command is slightly different:

sed -i "" 's#ORIGINAL_PHRASE#REPLACEMENT_PHRASE#g' FILENAME

You'll need to replace FILENAME with the full path to the file you want to change.
Replace ORIGINAL_PHRASE and REPLACEMENT_PHRASE as well; hopefully that's straightforward.
Also note that because the sed command uses # as its separator, neither phrase can contain a bare # anywhere. If one does, you'll need to escape it with \ (that is, write \#).

To sum up: to replace all occurrences of the phrase old and busted with the phrase new hotness in a file named bigfile.xml that is on your desktop, open the mac terminal and:

sed -i "" 's#old and busted#new hotness#g' ~/Desktop/bigfile.xml

and hit enter. I did this on my Mac and it took about 12 seconds for a 282MB text file; YMMV slightly.
posted by namewithoutwords at 3:09 PM on June 12, 2018 [4 favorites]


And you don't even have to be comfortable in the terminal to use the nice template command supplied by namewithoutwords.

I only barely know sed, but it is still enormously useful. Sed is like a shark: sleek and powerful. A sort of living fossil that has remained basically unchanged for a really long time, because it's just so good at what it does. Even if you don't know how to do something, the world of friendly internet strangers who love to spread the regex way will give you the answer within a day or so, or in this case under 30 minutes.

So when I have weird/hard problems with text processing, I start googling sed manuals, and if I give up and ask for help on an appropriate forum, they answer my question and teach me something new :)
posted by SaltySalticid at 3:11 PM on June 12, 2018 [6 favorites]


If you want a MacOS app, I've used BBEdit with multi-gigabyte files, and it handled them with considerable aplomb.
posted by RichardP at 3:17 PM on June 12, 2018


The only thing I'd add is that namewithoutwords's command edits the file in place and overwrites it. If you're not comfortable with that, I'd change it to

sed -i ".old" 's#old and busted#new hotness#g' ~/Desktop/bigfile.xml

and that will save the original copy to bigfile.xml.old, in case you mess up.

Or just make a copy of it somewhere before you get started.
posted by thewumpusisdead at 3:32 PM on June 12, 2018 [3 favorites]
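
For anyone who wants to see the backup behaviour before trusting it with the real file, here is a minimal sketch. The scratch directory and file are made up for illustration. It attaches the suffix directly to -i (-i.old, no space), which as far as I know both the BSD sed on macOS and GNU sed accept; the spaced form in the post above is the BSD spelling.

```shell
#!/bin/sh
# Hypothetical scratch directory and file, standing in for the real XML.
workdir=$(mktemp -d "${TMPDIR:-/tmp}/sedbackup.XXXXXX")
file="$workdir/bigfile.xml"
printf '%s\n' '<note>old and busted</note>' > "$file"

# Edit in place, keeping the untouched original as bigfile.xml.old.
sed -i.old 's#old and busted#new hotness#g' "$file"

edited=$(cat "$file")       # contents after the replacement
backup=$(cat "$file.old")   # contents of the automatic backup

# If the result looks wrong, putting the original back is a single move.
mv "$file.old" "$file"
restored=$(cat "$file")

echo "edited:   $edited"
echo "backup:   $backup"
echo "restored: $restored"
```

The backup is a full second copy, so on a 250MB file it needs another 250MB of free disk space.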


I was going to suggest trying BBEdit, too. It's got a free version, so it's easy enough to download and see if it will at least open your file.
posted by leahwrenn at 3:33 PM on June 12, 2018


I use Sublime Text for tasks like this. I'm sure sed is awesome, but it's not a tool I've learned yet. Sublime Text is enough like the applications you're used to that using it is quite intuitive.
posted by infinitewindow at 3:51 PM on June 12, 2018


Response by poster: If I try using sed in the terminal, is it a problem if both the new and old phrases have spaces in them? If I replace them with "%20" will the percent sign screw it up?
posted by The Pluto Gangsta at 3:58 PM on June 12, 2018


You don't need to treat spaces like that. Sed isn't interpreting URLs or HTML, so you should be able to just use spaces in the phrases separated by # or whatever. I'm 90% sure this is the case, anyway :)
posted by Alensin at 4:01 PM on June 12, 2018


Yep, the spaces in the phrase are fine. Note that in the example I provided, both the old and new phrases have spaces! No need to use URL encoding.
posted by namewithoutwords at 4:07 PM on June 12, 2018
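
A quick way to check this for yourself, offered as a sketch rather than anything authoritative: pipe a few made-up lines through sed and look at what comes out. Spaces and percent signs have no special meaning to sed. The characters that do need care are regex ones in the search phrase (such as . * [ ] ^ $ and \), & in the replacement, and whichever delimiter you picked (# here).

```shell
#!/bin/sh
# Spaces in both phrases: matched and replaced literally.
with_spaces=$(printf '%s\n' 'that old and busted tag' |
  sed 's#old and busted#new hotness#g')

# Percent signs are literal too, so a %20-encoded phrase can be replaced
# as written, if that is what the file really contains.
with_percent=$(printf '%s\n' 'href="old%20and%20busted"' |
  sed 's#old%20and%20busted#new%20hotness#g')

# A literal # inside a phrase has to be escaped, because # is the delimiter.
with_hash=$(printf '%s\n' 'issue #12 is open' |
  sed 's#issue \#12#issue \#34#g')

echo "$with_spaces"
echo "$with_percent"
echo "$with_hash"
```

None of these commands touch a file, so they are safe to paste into the terminal as an experiment.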


This may be obvious, but...

Work on a duplicate.
posted by zadcat at 6:16 PM on June 12, 2018 [5 favorites]


Make a tiny sample/test file to practice the sed command. Once you're familiar with them, these tools are easy and automatic, but get comfortable with the notation on a small test sample first.
posted by sammyo at 6:25 PM on June 12, 2018 [4 favorites]
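
A practice run along those lines might look like the sketch below. The sample file, its contents, and the directory name are all invented for illustration. It leaves out -i on purpose, so sed only reads the sample and the result goes to a second file.

```shell
#!/bin/sh
# Made-up practice file; swap in a small excerpt of the real XML if you like.
workdir=$(mktemp -d "${TMPDIR:-/tmp}/sedpractice.XXXXXX")
sample="$workdir/sample.xml"
result="$workdir/sample.new.xml"
printf '%s\n' \
  '<item>old and busted</item>' \
  '<item>something else</item>' \
  '<item>old and busted, old and busted</item>' > "$sample"

# Without -i, sed leaves the input alone and writes the result to stdout,
# so redirecting to a second file is a safe way to practise.
sed 's#old and busted#new hotness#g' "$sample" > "$result"

# grep -o prints each match on its own line, so wc -l counts occurrences.
old_before=$(grep -o 'old and busted' "$sample" | wc -l | tr -d ' ')
old_after=$(grep -o 'old and busted' "$result" | wc -l | tr -d ' ')
new_after=$(grep -o 'new hotness' "$result" | wc -l | tr -d ' ')

echo "old phrase before: $old_before"
echo "old phrase after:  $old_after"
echo "new phrase after:  $new_after"
```

The same three counts work as a sanity check on the real file: the old phrase should drop to 0 and the new phrase should show up about as many times as the old one did.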


perl -i.bak -p -e 's/FROM/TO/g;' infile.xml

Has the side benefit of making a .bak copy of the original.
posted by benzenedream at 6:42 PM on June 12, 2018 [1 favorite]

