Help me delete a few hundred thousand files from a webserver
December 18, 2012 3:15 PM

Let's say I have 300,000 files on a webserver in a single directory. I want to download them to a local directory and then delete them. My FTP client of choice is choking on the task. What should I do?
posted by fake to Computers & Internet (23 answers total) 3 users marked this as a favorite
 
What OS is running on the server and what OS is running on the machine you are downloading them to?
posted by dgeiser13 at 3:18 PM on December 18, 2012


Do you have SSH/telnet access? If so, I would log in and tar the directory and then download the tar file.


If not, the wget utility might work for you.
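For example (host, credentials, and path all hypothetical), something like this mirrors the directory over FTP without climbing above it:

wget -m -np "ftp://user:password@example.com/path/to/dir/"

Here -m turns on recursive mirroring and -np (--no-parent) keeps wget from following links up out of the starting directory.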
posted by bottlebrushtree at 3:19 PM on December 18, 2012 [2 favorites]


Tar!

You're going to want to create a tarball of the whole directory so that you can download one big file.

The usage will probably be something like "tar -czf foo.tar.gz foo/", which will create a gzipped tar archive of the "foo" directory.
posted by Oktober at 3:20 PM on December 18, 2012 [1 favorite]


Response by poster: I have SSH access, the server is on Dreamhost so assume all Linux/Unix tools and the usual shared hosting stuff is available. I can download to a Windows or Linux machine.

So, if I understand correctly, I connect via SSH, command the server to create a .tar file of all the files in the dir, and then download just that .tar file?
posted by fake at 3:27 PM on December 18, 2012


On the server:
cd ~; tar -cvf mydir.tar.gz ~/foo.com/dir/to/be/tarred


On your local machine:
cd ~; scp foo.com:mydir.tar.gz .

Verify the download before deleting anything.

For deletion:

With around 300,000 files, a wildcard on the remote machine (like rm foo.com/dir/to/be/tarred/*) will fail with "Argument list too long", so you have to either just delete the whole dir (easiest) or use find.
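If you go the find route, something like this sidesteps the argument-list limit (path hypothetical):

find ~/foo.com/dir/to/be/tarred -maxdepth 1 -type f -delete

find walks the directory itself instead of expanding 300,000 names on the command line, and -delete removes each file as it is found.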

I would rm -rf foo.com/dir/to/be/tarred
then
mkdir foo.com/dir/to/be/tarred
chmod 755 foo.com/dir/to/be/tarred

Don't do any of this until you understand what each command does. This is a starting point, not step-by-step instructions.
posted by bensherman at 3:36 PM on December 18, 2012


Yep you've got it. Connect via SSH and create a tarball:
tar -cvjf fakes_archive.tar.bz2 directory/
That will give you a file called 'fakes_archive.tar.bz2' which you may download and use locally as you desire. If you don't want a bzip2 archive, or one of the systems involved doesn't understand bzip2, you can use 'tar -cvzf fakes_archive.tar.gz directory/' instead to create the archive.

You can decompress and expand the archive on another Linux box with 'tar xvjf fakes_archive.tar.bz2' or 'tar -xvzf fakes_archive.tar.gz'
posted by zachlipton at 3:37 PM on December 18, 2012


I would use rsync with the archive flag (-a), plus compression (-z) enabled.
posted by zippy at 3:42 PM on December 18, 2012 [4 favorites]


Using bzip2 will give you a smaller tarball than gzip, at the expense of taking longer to compress. That may be a useful trade if your connection is slow. If your connection is fast, you probably won't gain much from bzip2, and the whole process (making the tarball, copying it over, extracting it) may actually take longer overall.
posted by Blazecock Pileon at 3:43 PM on December 18, 2012


I've tried using FileZilla, and others, for very large FTP projects (400+ GB) and it's just not robust enough. I eventually found that Core FTP was able to handle months of queued operations, connections falling over, servers falling over, etc., with minimal babysitting beyond checking on it and occasionally forcing a reconnection attempt. It's not particularly pretty but it gets the job done when all the other FTP clients are choking on their own intestines.
posted by laconic skeuomorph at 3:43 PM on December 18, 2012 [1 favorite]


Minor correction to bensherman's first command -- it should be
cd ~; tar -czvf mydir.tar.gz ~/foo.com/dir/to/be/tarred

Or use zachlipton's syntax, but to compress you need either a 'z' or a 'j' in the options. If it's a 'z' then put a .gz extension on the filename. If it's 'j', then a .bz2 extension.
posted by McCoy Pauley at 3:43 PM on December 18, 2012


Response by poster: I really appreciate the excellent help, here.
posted by fake at 3:48 PM on December 18, 2012


I'd give rsync a try first: something like

rsync -azv user@host.com:/file/path/here /destination/path/here

should do it from a *nix command line.
posted by holgate at 3:48 PM on December 18, 2012


If you have the time, I'm curious as to how well sshfs would perform were you to mount the directory on your home computer and then simply copy it to another location on your hard drive.
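A rough sketch of that, with hypothetical paths and mountpoint:

mkdir -p ~/mnt ~/local-copy
sshfs user@example.com:/path/to/dir ~/mnt
cp -a ~/mnt/. ~/local-copy/
fusermount -u ~/mnt

My hunch is that 300,000 files would be painfully slow this way, since every file is a separate round trip over the network.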
posted by jsturgill at 3:51 PM on December 18, 2012


There's also SCP...

If you're on windows, I think PuTTY has a version of scp you could use as well (pscp.exe in the install folder).

You'd run a command something like this:

C:\Program Fi[...]\PuTTY> pscp.exe -r username@domain.com:/path/to/source/ /path/to/destination

The same basic idea would work using "scp" instead of "pscp" on your linux computer.
posted by jsturgill at 4:05 PM on December 18, 2012 [1 favorite]


Yeah, rsync is the best tool for the job here. Or tar (or even zip) if it's a one-off thing; it'll be faster than rsyncing many little files.

But for future reference, if you really have to use FTP, the solution is to go old school and use command-line ftp. Details vary a bit depending on your FTP client, but basically it'll be something like:
open example.com
binary
prompt
cd path/to/files
mget *
The binary and prompt commands may or may not be necessary; they tell your FTP client not to do ASCII conversion and not to confirm every single file download. The nice thing about this approach is that your FTP client never sees the whole list of files or tries to do anything clever like figure out dates and sizes for all of them, which is probably what's choking FileZilla. Instead it just grabs the files and you win.

One other bit of old knowledge: turn off antivirus checking temporarily if transferring lots of little files. Last I checked (in 2005!) on Windows, Norton AntiVirus on FileZilla added something like 200ms per file transferred, which adds up fast.
posted by Nelson at 4:42 PM on December 18, 2012 [1 favorite]


Maybe it's because I'm lazy, but from Linux to Linux, I would just scp them.

From the machine you want them to end up on

scp -r username@host:/path/to/dir nameoflocaldir

Then go off and do something else for a while.
posted by advicepig at 4:51 PM on December 18, 2012 [1 favorite]


rsync, for the automagical restart ability. There is also a very clever way to pipe tar across a network connection to take advantage of tar's excellent directory handling, which is more warranted for a complex directory structure.
posted by sammyo at 4:51 PM on December 18, 2012


Can you explain what you mean by choke in more detail? In the past I have resolved problems with Filezila by increasing the default timeouts.
posted by phil at 7:33 PM on December 18, 2012


scp has the advantage that it doesn't create a big archive file anywhere, but it's slow, especially if you have lots of little files. sammyo mentions this solution, which is much faster than scp:

ssh user@host 'cd /path/to/parent/dir && tar czf - dir' | tar xzf -

Basically, this reaches out over ssh to the remote host and tells it to tar up the directory but to print out the tarfile rather than saving it; then as the tarfile is printed back over the ssh connection it is passed to tar on the local machine to be untarred and saved, recreating the directory.
posted by nicwolff at 9:33 PM on December 18, 2012 [5 favorites]


If your environment supports PHP you could possibly use a script like this to delete the files.
posted by rmmcclay at 10:36 PM on December 18, 2012


rsync!

sammyo mentioned its restart capabilities, but I feel the need to emphasize them more strongly: if the transfer glitches for any reason, rsync can pick up where it left off, using checksums on both sides. Dreamhost recommends it. The algorithm is also the subject of one of the more readable CS PhD theses ever written, and just perusing Tridgell's thesis increases my understanding, confidence and joy in using rsync.
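For reference, a resumable pull looks something like this (host and paths hypothetical):

rsync -avz --partial --progress user@example.com:/path/to/dir/ localdir/

--partial keeps any half-transferred file around so that rerunning the exact same command resumes it, and files that already arrived intact are skipped.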
posted by at at 12:07 AM on December 19, 2012 [1 favorite]


WinSCP is an excellent SCP client, and you can launch PuTTY from its menu bar.
posted by Sunburnt at 8:05 AM on December 19, 2012


Nicwolff's solution was used so much at my previous job that there was a piece of whiteboard space in a public area reserved just for it.
posted by azarbayejani at 8:27 AM on December 20, 2012

