How to find-and-replace on a BIIIIIIG file?
June 12, 2018 2:53 PM Subscribe
I have an XML file that is large. I mean 250MB or so, large. It's actually nothing but text, but there's a lot of it. There is a certain short phrase (about 10 characters) that occurs about 120,000 times, and each instance of it needs to be swapped out with another phrase that's around the same size. I'm not used to working with documents this big, how do I do it?
Here's what you need to know:
Computer level: 2011 MacBook Air. (A Chromebook is also available if that helps.)
The story so far: Both MS Word and LibreOffice crash the computer upon opening the file. I am only able to even open the file using TextEdit. When I tried to use the find-and-replace function in TextEdit, it hung, then crashed the computer again.
What other options can I try? I also have Homebrew installed on the Mac so if the solution calls for installing some package, I'm OK with it.
Are you at all comfortable in the terminal?
sed is a unix utility that is already installed on your system that can do this sort of thing about as quickly as possible.
sed -i '/original_text_here/ s//new_text_here/g' some_really_big_file_name_here
posted by mce at 3:00 PM on June 12, 2018 [11 favorites]
In macOS, you'll need to do sed like so:
sed -i 'txt' 's/old_string/new_string/g' big_file.txt
...where the argument after '-i' is the suffix given to a backup copy of the original (here the backup would be named big_file.txttxt).
Works a treat, though.
posted by sandettie light vessel automatic at 3:07 PM on June 12, 2018 [1 favorite]
mce's example assumes you're using GNU sed, whereas macOS ships with BSD sed, so the command is slightly different:
sed -i "" 's#ORIGINAL_PHRASE#REPLACEMENT_PHRASE#g' FILENAME
You'll need to replace FILENAME with the full path to the file you want to change. Replace ORIGINAL_PHRASE and REPLACEMENT_PHRASE as well, hopefully that's straightforward. Also note that by using the # separator in the sed command, the phrase you are replacing cannot have a # in it anywhere - if it does, you'll need to escape it with \.
To sum up: to replace all occurrences of the phrase old and busted with the phrase new hotness in a file named bigfile.xml that is on your desktop, open the mac terminal and:
sed -i "" 's#old and busted#new hotness#g' ~/Desktop/bigfile.xml
and hit enter. I did this on my mac, it took about 12 seconds for a 282MB text file, YMMV slightly.
posted by namewithoutwords at 3:09 PM on June 12, 2018 [4 favorites]
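Before running the in-place edit on the real file, it can help to count how many times the phrase occurs and preview what the substitution would do. The sketch below is only an illustration: the temporary file and its contents are made up, standing in for ~/Desktop/bigfile.xml, and it pipes sed's output instead of using -i so nothing on disk changes.

```shell
# Hypothetical scratch file standing in for ~/Desktop/bigfile.xml.
sample=$(mktemp)
printf '<a>old and busted</a>\n<b>old and busted, old and busted</b>\n' > "$sample"

# grep -o prints every match on its own line and -F treats the phrase as a
# literal string (not a regex), so wc -l counts occurrences rather than lines.
before=$(grep -o -F 'old and busted' "$sample" | wc -l | tr -d ' ')

# Preview the replacement without -i: sed writes to the pipe, the file is untouched.
replaced=$(sed 's#old and busted#new hotness#g' "$sample" | grep -o -F 'new hotness' | wc -l | tr -d ' ')

# Lines that would still contain the old phrase afterwards (grep -c counts lines;
# "|| true" only keeps a zero count from being reported as a failure).
remaining=$(sed 's#old and busted#new hotness#g' "$sample" | grep -c -F 'old and busted' || true)

echo "occurrences before: $before"
echo "replacements previewed: $replaced"
echo "lines still containing the old phrase: $remaining"
```

On the real file, the first count should land near the expected 120,000 or so; if it is wildly off, the phrase probably needs adjusting before anything is rewritten.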
And you don't even have to be comfortable to use the nice template command supplied by namewithoutwords.
I only barely know sed, but it is still enormously useful. Sed is like a shark: sleek and powerful. A sort of living fossil that has remained basically unchanged for a really long time, because it's just so good at what it does. Even if you don't know how to do something, the world of friendly internet strangers who love to spread the regex way will give you the answer within a day or so, or in this case under 30 minutes.
So when I have weird/hard problems with text processing, I start googling sed manuals, and if I give up and ask for help on an appropriate forum, they answer my question and teach me something new :)
posted by SaltySalticid at 3:11 PM on June 12, 2018 [6 favorites]
If you want a macOS app, I've used BBEdit with multi-gigabyte files, and it handled them with considerable aplomb.
posted by RichardP at 3:17 PM on June 12, 2018
The only thing I'd add is that namewithoutwords's command will edit the file in place and resave it. If you're not comfortable with that, I'd change it to
sed -i ".old" 's#old and busted#new hotness#g' ~/Desktop/bigfile.xml
and that will save the original copy to bigfile.xml.old, in case you mess up.
Or just make a copy of it somewhere before you get started.
posted by thewumpusisdead at 3:32 PM on June 12, 2018 [3 favorites]
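As a rough illustration of the backup behaviour described above, here is a sketch on a throwaway file (the directory, file name and contents are made up for the example). It attaches the suffix directly to -i, which as far as I know is accepted by both the BSD sed on macOS and GNU sed, so the same line should work on either.

```shell
# Hypothetical scratch file standing in for ~/Desktop/bigfile.xml.
workdir=$(mktemp -d)
file="$workdir/bigfile.xml"
printf '<note>old and busted</note>\n' > "$file"

# Suffix attached directly to -i: the original is saved as bigfile.xml.old
# before the edited version is written back to bigfile.xml.
sed -i.old 's#old and busted#new hotness#g' "$file"

edited=$(cat "$file")        # contents after the replacement
backup=$(cat "$file.old")    # untouched copy of the original

echo "edited: $edited"
echo "backup: $backup"

# To undo the change, move the backup over the edited file:
# mv "$file.old" "$file"
```

Keep in mind the backup is a full copy, so a 250MB file needs another 250MB of free disk space.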
I was going to suggest trying BBEdit, too. It's got a free version, so it's easy enough to download and see if it will at least open your file.
posted by leahwrenn at 3:33 PM on June 12, 2018
I use Sublime Text for tasks like this. I'm sure sed is awesome, but it's not a tool I've learned yet. Sublime Text is enough like the applications you're used to that using it is quite intuitive.
posted by infinitewindow at 3:51 PM on June 12, 2018
Response by poster: If I try using sed in the terminal, is it a problem if both the new and old phrases have spaces in them? If I replace them with "%20" will the percent sign screw it up?
posted by The Pluto Gangsta at 3:58 PM on June 12, 2018
You don't need to treat spaces like that. Sed isn't interpreting HTML, you should be able to just use spaces in the phrases separated by # or whatever. I'm 90% sure this is the case, anyway :)
posted by Alensin at 4:01 PM on June 12, 2018
yep, the spaces in the phrase are fine - note that in the example I provided, both the old and new phrase have spaces! no need to use html encoding.
posted by namewithoutwords at 4:07 PM on June 12, 2018
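A quick way to convince yourself about spaces and percent signs is to pipe a made-up line through sed and look at the result. The sample strings below are invented for the demonstration. The one character that does need care in the replacement half is &, which sed treats as "the text that was matched" unless it is escaped; that matters for XML, where & shows up in entities.

```shell
# Spaces and % are ordinary characters inside the single-quoted sed script.
result=$(printf 'status: old and busted\n' | sed 's#old and busted#new%%20hotness#g;s#%%#%#g')
echo "$result"

# The same thing written the simple way: % needs no escaping at all.
simple=$(printf 'status: old and busted\n' | sed 's#old and busted#new%20hotness#g')
echo "$simple"

# An ampersand in the replacement must be written as \& to be taken literally.
amp=$(printf 'Tom and Jerry\n' | sed 's#and#\&#g')
echo "$amp"
```

The first command is deliberately roundabout (it doubles the percent sign and then collapses it again) just to show that both forms come out the same; the second form is the one to use.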
This may be obvious, but...
Work on a duplicate.
posted by zadcat at 6:16 PM on June 12, 2018 [5 favorites]
Make a tiny sample/test file to practice the sed command. Once you're familiar with them, these tools are easy and automatic, but get comfortable with the notation on a small test sample first.
posted by sammyo at 6:25 PM on June 12, 2018 [4 favorites]
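One way to get such a practice file is to cut the first few lines off the big one with head. This is a sketch only: the generated file below is a made-up stand-in for the real bigfile.xml, and the 20-line sample size is arbitrary.

```shell
# Build a hypothetical "big" file so the example is self-contained.
workdir=$(mktemp -d)
big="$workdir/bigfile.xml"
i=0
while [ "$i" -lt 1000 ]; do
  echo "<row>old and busted $i</row>"
  i=$((i + 1))
done > "$big"

# Take the first 20 lines as a practice file; the big file is only read.
head -n 20 "$big" > "$workdir/sample.xml"

# Practice the substitution on the sample, writing the result to a new file.
sed 's#old and busted#new hotness#g' "$workdir/sample.xml" > "$workdir/sample.out"

changed=$(grep -c -F 'new hotness' "$workdir/sample.out")    # lines now holding the new phrase
untouched=$(grep -c -F 'old and busted' "$big")              # big file still has every original line

echo "sample lines changed: $changed"
echo "lines in big file still holding the old phrase: $untouched"
```

Opening sample.out in any text editor is then a painless way to eyeball the result before committing to the full file.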
perl -i.bak -p -e 's/FROM/TO/g;' infile.xml
Has the side benefit of making a .bak copy of the original.
posted by benzenedream at 6:42 PM on June 12, 2018 [1 favorite]
posted by Fidel Cashflow at 2:56 PM on June 12, 2018 [5 favorites]