Why do you get corrupt FTP transfers if it's over TCP/IP?
November 24, 2004 4:26 AM

FTP on TCP over IP: As far as i can see, ftp should transfer files perfectly. If you trace the responsibility for validation through the relevant RFCs, everything should be handled by the TCP checksum. Yet "everyone" knows that ftp regularly transfers corrupt files. Why? More generally, where is a good forum for discussing this kind of thing? (I'm also interested in, for example, multiple parallel ftps (eg gridftp) and the possibility of checking an "instantaneous" checksum value at various points during a transfer, to detect errors as soon as possible.)
posted by andrew cooke to Computers & Internet (21 answers total)
 
Wouldn't any kind of "real-time" checksum value involve:

1) A lot of overhead communication between the server and the client? (Since the transmission isn't perfectly serial, wouldn't the client have to keep telling the server what it's gotten so far, and asking for a new checksum?)

2) A substantial extra load on the server processor to keep re-calculating interim checksums? (Wouldn't each partial checksum for each download potentially be unique?)

Even if these things are true, I guess that wouldn't make the idea unworkable, but it would make for a substantially less efficient transfer protocol.

On the other hand, I _could_ just be completely wrong--I'm more just asking...
posted by LairBob at 4:38 AM on November 24, 2004


What do you mean, "regularly transfers corrupt files?"

I've been using FTP for years, and the only corruption that I've ever had is when I've forgotten to go to binary mode for binaries.
posted by veedubya at 4:39 AM on November 24, 2004


What veedubya said.

FTP will NOT corrupt your files.
However, if you transfer a binary file in ASCII mode, it will translate line endings and may strip off the high (eighth) bit of every byte. More information here.
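To make that concrete, here's a rough illustration in Python (purely illustrative toy code, not how any particular client implements it) of the two kinds of damage an ASCII-mode transfer can do to binary data:

    # The first 8 bytes of a PNG file; 0x89 and the CR/LF bytes are deliberate
    # "traps" for ASCII-mode transfers.
    original = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

    # 1. Line-ending translation: every bare LF becomes CR LF (or vice versa),
    #    so the file silently changes length and content.
    translated = original.replace(b"\n", b"\r\n")

    # 2. Old 7-bit paths strip the most significant bit of every byte.
    stripped = bytes(b & 0x7F for b in original)

    print(original.hex())    # 89504e470d0a1a0a
    print(translated.hex())  # 89504e470d0d0a1a0d0a -- longer and different
    print(stripped.hex())    # 09504e470d0a1a0a -- 0x89 has become 0x09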
posted by seanyboy at 4:48 AM on November 24, 2004


As you say, the TCP checksum makes it unlikely that data will be corrupted in transit. The confusion perhaps comes from older, command-line clients that required you to set binary mode manually in order to transfer all 8 bits of each byte rather than the 7 needed for text-only files.

I don't understand why FTP is in use any more. HTTP 1.1 offers everything that FTP does, and allows security through https as well. A further advantage of using encryption is that you get per-record integrity checking (effectively the instantaneous checksumming you describe) for free, if that's a real concern.
I also can't see what GridFTP gives over BitTorrent, but I don't know your application.

On preview, what they said.
posted by quiet at 4:53 AM on November 24, 2004


Response by poster: i know about binary mode.

i work in an area (astronomy) where large amounts of data are regularly transferred. the consensus here is that ftp is not reliable (these people also know about binary mode).

i have argued the same points people have said above, and people have got very angry with me, saying that ftp is not reliable. i don't have a reason why, either. it's possible there are problems with poor implementations, or that other layers in the comms stack are corrupting data. it might even be local problems with nfs (all reasons which support, incidentally, a separate end-to-end checksum).

so maybe you are right. but, if so, i would like the opinion of someone with status to back me up. so if you're involved in protocol design, please say so. otherwise, could you point me to where this kind of thing is discussed?

as far as intermittent checksums go: consider transferring 10 GB. the additional overhead of transmitting a checksum every 1 or 10 MB is negligible and would save a lot of effort if there is corruption early in the process. there's no extra load on the processor to sample the instantaneous checksum value compared to calculating a checksum anyway (which you may think unnecessary, since tcp does this, but i am dealing with people who, as i have already explained, do not trust ftp and so want a separate check).
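to make that concrete, here's a minimal sketch (plain python and its standard hashlib; the chunk and sample sizes are arbitrary) of what i mean by sampling an instantaneous checksum as the data flow past:

    import hashlib

    CHUNK = 1024 * 1024          # hash the stream in 1 MB pieces
    SAMPLE_EVERY = 10 * CHUNK    # report an intermediate digest every 10 MB

    def rolling_digests(path):
        md5 = hashlib.md5()
        seen = 0
        with open(path, "rb") as f:
            while True:
                block = f.read(CHUNK)
                if not block:
                    break
                md5.update(block)
                seen += len(block)
                if seen % SAMPLE_EVERY == 0:
                    # .copy() snapshots the running state; the main hash keeps
                    # going, so intermediate digests cost essentially nothing extra.
                    yield seen, md5.copy().hexdigest()
        yield seen, md5.hexdigest()   # final, whole-file checksum

    # sender and receiver both run this and compare digests as they go, so
    # corruption early in a 10 GB transfer shows up after ~10 MB, not at the end.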
posted by andrew cooke at 5:33 AM on November 24, 2004


Response by poster: gridftp is appropriate for us because it helps avoid throttling due to tcp window size restrictions when data rates are very high (those parallel channels are not for broadcast).
posted by andrew cooke at 5:37 AM on November 24, 2004


Well, I'm sceptical. I've used FTP to move around multi-gigabyte files, and have never had any concern about reliability. I guess that's not a lot of help to you, though. I'm curious to know if these people have ever shown you an example of this unreliability, or if it's more a "friend of a friend says" sort of thing.

On the subject of intermediate checksumming, I would think that was a natural for shell scripting. Basically, have a script that takes a file, chops it into multiple parts, and appends an MD5 checksum to each part's file name. Another script could pull each part across and reassemble it, checking the MD5 as it does so.
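Something like this, say (a rough sketch in Python rather than shell; the part size is arbitrary and there's no error handling):

    import hashlib

    PART_SIZE = 100 * 1024 * 1024   # 100 MB parts

    def split_with_checksums(path):
        """Chop a file into parts, baking each part's MD5 into its file name."""
        with open(path, "rb") as f:
            index = 0
            while True:
                part = f.read(PART_SIZE)
                if not part:
                    break
                digest = hashlib.md5(part).hexdigest()
                with open(f"{path}.{index:04d}.{digest}", "wb") as out:
                    out.write(part)
                index += 1

    def verify_and_join(part_names, output_path):
        """Reassemble the parts, checking each MD5 against the one in the name."""
        with open(output_path, "wb") as out:
            for name in sorted(part_names):
                expected = name.rsplit(".", 1)[-1]   # checksum is the last suffix
                with open(name, "rb") as f:
                    data = f.read()
                if hashlib.md5(data).hexdigest() != expected:
                    raise ValueError(f"corrupt part: {name}")
                out.write(data)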
posted by veedubya at 5:57 AM on November 24, 2004


Likely explanations for corrupted ftp transfer: 1. Bad memory on either the server or the client, or 2. A buggy server or a buggy client. In both cases, data could get corrupted before checksumming on the server or after checksumming on the client.

I'd consider a protocol with built-in strong "instantaneous" checksumming, such as BitTorrent or rsync. Both chop your file up into chunks and checksum each one. BitTorrent seems a bit more rigorous with its checksumming, but I'd recommend rsync as it's more suited to single server-client transfers. If an rsync transfer gets corrupted, just run it over again, as it'll get "healed" in a jiffy. I often just run rsync twice on extra-large downloads just to be doubly sure.
posted by zsazsa at 6:07 AM on November 24, 2004


Response by poster: two days after the last discussion we had on this, i got an email from a colleague detailing an exact case. so it's not just rumour. something is flakey somewhere.

as i said, i agree that there should be no reason for failure. but i've also been around computers and software long enough to know that there's a big difference between theory and practice.

doing everything in a script means processing the data many times. the data have to flow through the comms stack, so the most efficient place to checksum them is there. and a pre-existing solution would be preferable (which you might hope exists, if the people i work with are not completely crazy and something really is happening to corrupt things).

zsazsa - yes, i'm tempted to think it's an error in the server or client (memory seems less likely because we don't have machines falling over at random). especially since other tools based on the same technology are more widely trusted (eg gridftp).

something "self-healing" like rsync is a very nice idea, but i'm not sure it would work for us for various reasons (we're not just copying directory trees around).

anyway, it looks like the consensus is with ftp being reliable in theory, which is a help - at least i'm not going daft. thanks all.
posted by andrew cooke at 6:13 AM on November 24, 2004


1) ftp is fine. I've used ftp regularly for about 15 years now, and I don't think I've ever had a problem.

2) rsync is self-healing? clearly you've never REALLY used rsync. If you want to see rsync suck, point it at complex directory structures filled with multi-gigabyte files. rsync loves to suck in a lot of different ways.
posted by mosch at 7:24 AM on November 24, 2004


" Yet 'everyone' knows that ftp regularly transfers corrupt files."

I don't know these 'everyone' people, but they are wrong. While it's true that there are broken FTP implementations out there -- on both the client and server side -- there's nothing wrong with the protocol itself. If something is "flaky somewhere," I'm certain that "somewhere" is either a broken client or a buggy server, not anything attributable to FTP itself.
posted by majick at 7:26 AM on November 24, 2004


I wouldn't count on the network and transport layers being 100% bulletproof, especially with files this gigantic. TCP uses a pretty weak 16-bit checksum and Ethernet uses a relatively weak CRC32 (compared to MD5, for example). With a 16-bit checksum, a randomly corrupted packet has roughly a 1-in-65,536 chance of slipping through undetected. That's pretty bad. CRC32 collisions aren't unheard of, either. It just takes one undetected error along the many network hops and your data is ruined.
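To illustrate (my own toy code, not taken from any networking stack): the 16-bit Internet checksum that TCP uses is just a ones'-complement sum, so any corruption that leaves the sum unchanged, such as two aligned 16-bit words being swapped, passes undetected.

    def internet_checksum(data: bytes) -> int:
        """Toy RFC 1071 checksum: ones'-complement sum of 16-bit words, complemented."""
        if len(data) % 2:
            data += b"\x00"                 # pad odd-length data
        total = sum((data[i] << 8) | data[i + 1] for i in range(0, len(data), 2))
        while total >> 16:                  # fold the carries back in
            total = (total & 0xFFFF) + (total >> 16)
        return ~total & 0xFFFF

    good = bytes.fromhex("deadbeef0102")
    bad = bytes.fromhex("beefdead0102")     # two 16-bit words swapped "in transit"

    print(hex(internet_checksum(good)))     # same checksum...
    print(hex(internet_checksum(bad)))      # ...for different data: corruption undetected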
posted by zsazsa at 7:30 AM on November 24, 2004


Response by poster: yes, zsazsa, i'd forgotten about that. thanks. there's also a known issue with ipv4 and high data rates (frame numbers, iirc) due to a similar problem (fixed, too-small field size), but we've not hit that in practice yet.
posted by andrew cooke at 7:38 AM on November 24, 2004


Response by poster: i've not used rsync to transfer large amounts of data, no. i was referring to the idea that you iterate the process and converge on the correct result - with an iteration cost that scales with the extent of the error rather than the total data volume (which is, i believe, what rsync tries to achieve), you might be able to find a robust practical solution.
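something along these lines, i mean (a toy sketch, nothing to do with rsync's actual rolling-checksum algorithm): compare per-block digests and re-fetch only the blocks that differ, so a retry costs roughly in proportion to the damage rather than the file size.

    import hashlib

    BLOCK = 10 * 1024 * 1024   # 10 MB blocks, arbitrary

    def block_digests(path):
        digests = []
        with open(path, "rb") as f:
            while True:
                block = f.read(BLOCK)
                if not block:
                    break
                digests.append(hashlib.md5(block).hexdigest())
        return digests

    def blocks_to_refetch(local_path, remote_digests):
        # indices of blocks that are missing locally or whose digests disagree
        local = block_digests(local_path)
        return [i for i, digest in enumerate(remote_digests)
                if i >= len(local) or local[i] != digest]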
posted by andrew cooke at 7:46 AM on November 24, 2004


Yep, I'd say there's a problem with your particular network or setup. If you can, I'd go with FTP, ignore your colleagues, and find a way to make it work. Because I've also done massive data transfers using FTP and have not had problems. There is nothing inherently wrong with FTP (or SFTP) that should cause data corruption.
posted by Mo Nickels at 8:21 AM on November 24, 2004


You know, FTP and HTTP are pretty crude, featureless protocols for file transfer. They offer no error correction, no compression, and very little security. It's always been that they're common protocols for file transfer between different systems, so they're convenient and there are lots of clients, most notably (and sadly) the browser.

However, we're really forgetting our salad days. There are plenty of file transfer protocols that offer error correction, compression and security: Kermit (also available on almost every operating system), X/Y/Zmodem... (Zmodem in particular offers some awesome features we've all forgotten about). These protocols were particularly useful when packet loss was a lot more of a problem (CARRIER LOST...), so once connections to networks became more or less permanent and constant, this mattered less and less. Everyone settled on TCP/IP/UDP as the standard network protocols and the rest is forgotten history.

Well, no one wants to add a layer to file transfer, so we use what we have available to us, but it's worth a thought. I'm not sure of the details of your particular case, but I hope that you can consider alternatives to FTP for sensitive data. There have been so many awesome gains in error correction (see Reed-Solomon codes) that I would hope someone besides the usenet people would put them to good use.
posted by Dean_Paxton at 9:26 AM on November 24, 2004


HTTP does not offer easy remote command execution support (for ex: SITE commands). Sooory... Yeah, you can get this through cgi, but that's not really what HTTP is about. Also, HTTP generally only transfers a single file and requires a complicated client to transfer entire directories. With FTP, implementing "MGET" isn't a big deal. Or so it seems.
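For instance, with Python's standard ftplib a bare-bones "MGET" is only a few lines (the host, login and directory below are placeholders, and there's no error handling):

    from ftplib import FTP

    def mget(host, user, password, remote_dir):
        ftp = FTP(host)
        ftp.login(user, password)
        ftp.cwd(remote_dir)
        for name in ftp.nlst():                            # list the remote directory
            with open(name, "wb") as out:
                ftp.retrbinary(f"RETR {name}", out.write)  # binary-mode transfer
        ftp.quit()

    # mget("ftp.example.org", "anonymous", "me@example.org", "/pub/data")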
posted by shepd at 11:01 AM on November 24, 2004


My experience is that FTP is way faster than HTTP for large files (25+ MB), and the FTP client in IE seems to be fundamentally broken.
posted by Mitheral at 11:39 AM on November 24, 2004


I haven't ever had a problem with ftp corrupting the data it delivers and I've been using ftp for 25+ years. However, ftp can quit in the midst of a file transfer and you'll end up with only a portion of the transferred file. Always check file lengths to be certain. This will also catch most of the ascii/binary mode confusion.
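For example, a quick length check with Python's standard ftplib (the host and paths below are placeholders): compare the server's SIZE reply against the local file.

    import os
    from ftplib import FTP

    def sizes_match(host, remote_path, local_path):
        ftp = FTP(host)
        ftp.login()                      # anonymous login
        ftp.voidcmd("TYPE I")            # binary mode, so SIZE reports raw byte counts
        remote_size = ftp.size(remote_path)
        ftp.quit()
        return remote_size == os.path.getsize(local_path)

    # sizes_match("ftp.example.org", "/pub/data/bigfile.dat", "bigfile.dat")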

A bad packet with the correct checksum can get through, but for every such packet that sneaks through there would be a lot of bad packets with incorrect checksums that did get caught, which means you'd have a very noisy, very slow line (slow because the bad packets that got caught would be resent).

All that said, I'd use a command-line ftp client over a browser thingie for huge files, so that I could see the output.
posted by rdr at 12:49 PM on November 24, 2004


I'm a software engineer who worked on the development of a TCP/IP-derived protocol in the past. So I know how FTP should work, but I can also give concrete examples of inexplicably corrupted files following FTP downloads.

I have a website where users can download MPEG-2 files from my server. I could play the files perfectly when I put them on the server, and so could all the users in question. If I download them via SMB, they play fine. However, when I give users the option of downloading over FTP or HTTP, the files are corrupted and will not play on most machines. On some machines they can be played, but those machines are running the same OSes and codecs as the "bad" machines. It doesn't matter which FTP/HTTP client is used to download. I even tried zipping the files first, but that didn't help.

MPEG-2 has its own codec issues, but these files play fine when downloaded via SMB. So, any ideas why an MPEG-2 file specifically would be corrupted by an FTP download?
posted by McGuillicuddy at 2:01 PM on November 24, 2004


rdr, wget gives really good progress information. I greatly prefer to pull stuff down with wget rather than with an FTP client.
And HTTP downloads with wget generally seem faster to me than FTP, although I have no evidence to prove it. Perhaps it's just the better feedback I get from a client like wget.
posted by xiojason at 3:25 PM on November 24, 2004

