Deleting very large directory
April 10, 2008 9:42 AM Subscribe
How to delete a huge directory on Linux without overloading the system?
I've looked at this question which is basically my exact situation, but I can't get any of the solutions described to work without killing the system.
I've tried:
rm -rf mydir
nice rm -rf mydir
nice -n 19 find mydir -type f -exec rm -v {} \;
nice -n 19 rm -rf mydir&
And many other combinations of rm, find rm, nice find rm, etc, but all cause the server load to rise to dangerous levels quickly (I kill the process when the load in `top` hits 20; I'm assuming the machine would hang if allowed to continue).
So is it possible to remove a directory with a lot of files without killing the server?
find mydir -type f -print0 | perl -0 -MTime::HiRes=sleep -ne 'unlink; sleep .1;'
posted by nicwolff at 9:56 AM on April 10, 2008
(Or sleep .05 or .01 or whatever doesn't crush your CPU.)
posted by nicwolff at 9:58 AM on April 10, 2008
find mydir -type f | while read; do rm -v "$REPLY"; sleep 0.2; done
posted by flabdablet at 10:02 AM on April 10, 2008
Sorry, should be
find mydir -type f | while read -r; do rm -v "$REPLY"; sleep 0.2; done
just in case any of your filenames have backslashes in them. This won't remove files with newlines in their names, but those are pretty rare.
posted by flabdablet at 10:05 AM on April 10, 2008
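If newlines in filenames ever do matter, a null-delimited variant of the same loop (just a sketch, assuming GNU find and bash) copes with any filename: -print0 and read -d '' pass the names separated by null bytes, and IFS= keeps read from trimming whitespace.
find mydir -type f -print0 | while IFS= read -r -d '' f; do rm -v -- "$f"; sleep 0.2; done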
Follow it up with
find mydir -depth -type d | while read -r; do rmdir -v "$REPLY"; sleep 0.2; done
to remove the directory tree, if you have tens of thousands of subdirectories and rm -rf is still too harsh.
posted by flabdablet at 10:09 AM on April 10, 2008
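The same null-delimited treatment works for the directory pass (again just a sketch, assuming GNU find and bash), in case any directory names contain spaces, backslashes or newlines:
find mydir -depth -type d -print0 | while IFS= read -r -d '' d; do rmdir -v -- "$d"; sleep 0.2; done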
It would be interesting to know both how many files are there (ls | wc -l) and what OS/filesystem is in use.
posted by TravellingDen at 10:10 AM on April 10, 2008
Best answer: If there are enough files that rm -rf hangs you up, you don't want to be doing anything that sorts their names - TravellingDen's curiosity would be best served with
find mydir -type f | wc -l
posted by flabdablet at 10:12 AM on April 10, 2008
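If you'd still rather count with ls, GNU ls can at least be told not to sort (a sketch, assuming GNU coreutils): -U lists entries in directory order and -A skips . and .., so this counts the top-level entries without sorting millions of names first.
ls -UA mydir | wc -l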
I did some experiments on this a few years back. Between the 2.4 and 2.6 kernel series, directories got better at adding and opening files, but much slower at deleting them.
Once you've solved your immediate problem, the next step is to avoid building huge directories. One approach is to partition files into subdirectories based on something easily computable, such as the first few characters of the filename, or the md5/sha1 hash of the filename if that gets you a better distribution. This is how git (a distributed version control system) manages huge numbers of files (e.g., the source files for Linux, so if this is how Linux does it for Linux, it's an approach worth considering).
posted by dws at 10:14 AM on April 10, 2008
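As a rough illustration of that kind of layout (just a sketch, with a hypothetical filename and the first two hex characters of its md5 as the bucket):
name="example.dat"
bucket=$(printf '%s' "$name" | md5sum | cut -c1-2)   # two-character bucket from the hash
mkdir -p "mydir/$bucket"
mv "$name" "mydir/$bucket/"                          # ends up in mydir/<bucket>/example.dat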
Response by poster: Using flabdablet's method of counting files (and nicwolff's method of deleting them), there's currently a little over 5 million, dropping at a rate of a few hundred a second. The load in top seems stable, hovering around 3.0, plus or minus.
It's Red Hat ES, kernel 2.6.18.
posted by justkevin at 10:35 AM on April 10, 2008
ionice will probably help:
$ ionice -c 3 rm -rf <dir>
puts rm in the "idle" io scheduler class, which should mean that it only gets to do IO when nobody else wants to.
posted by pharm at 10:40 AM on April 10, 2008 [4 favorites]
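The two kinds of niceness can also be combined (a sketch, assuming the util-linux ionice; -c 3 is the idle class described above, and nice -n 19 drops the CPU priority as well):
ionice -c 3 nice -n 19 rm -rf mydir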
The load from an ionice'd/nice'd command should NOT concern you when it rises.
Trust the scheduler - trust the system. The scheduler *knows* your command has no priority and it will move it aside for other applications when they request it. The load rises because the system starts doing what you requested.
The load of a system is just a measure. The scheduler will still do the correct thing when requested. You are not "killing" anything, I promise.
posted by unixrat at 12:08 PM on April 10, 2008 [2 favorites]
The scheduler only knows that the command has the same priority as every other command unless you tell it otherwise.
Although I agree that a high load value isn't in and of itself a bad thing.
posted by pharm at 12:23 PM on April 10, 2008
Agreed with using ionice, and letting the thing chug away. It will let you use your computer for anything else, since everything else will get higher priority to the disk. And the delete will go when nothing else wants to.
No reason to keep the load down; use the computer to its fullest.
posted by cschneid at 2:49 PM on April 10, 2008
Also: if you're regularly creating directories with millions of files in them, you might want to consider putting those on a ReiserFS file system. Reiser is good at that. But read this first.
posted by flabdablet at 5:21 PM on April 10, 2008