Why did tar and nc not play nice?
September 26, 2009 5:54 PM

Can anybody immediately see why nc and tar didn't work together the way I expected they would?

I wanted to copy an Ubuntu installation from a laptop with one filesystem to a desktop box with two. So I booted an Ubuntu live CD on each and opened terminals; then on the laptop did

sudo su -
mount /dev/sda2 /mnt
cd /mnt
tar c . | nc -l -p 10000

and on the desktop box did

sudo su -
mkfs -t ext3 -L root /dev/sda3
mkfs -t ext3 -L home /dev/sda4
mount /dev/sda3 /mnt
mkdir /mnt/home
mount /dev/sda4 /mnt/home
cd /mnt
nc 192.168.1.3 10000 -q5 | tar xv --numeric-owner

As expected, a huge list of filenames scrolled by on the desktop box as tar extracted the files. When that all stopped, I hit ctrl-D on the desktop end to close nc's standard input; five seconds later the shell prompt returned on the laptop as well. So everything seemed to be working as expected.

After making the necessary corrections to /mnt/boot/grub/menu.lst, /mnt/etc/fstab, /mnt/etc/hosts and /mnt/etc/hostname on the desktop box, I umounted everything and rebooted it, but assorted things were badly amiss. Turns out that a random assortment of vital files had been created with zero length and zero permissions instead of being properly copied.

I have since got the machine-to-machine copy done by mounting the laptop's hard drive in a USB enclosure, plugging it into the desktop box and using cp -av so I'm not looking for ways to get the primary job done any more.

What I would like to know: before I spend more time trying to work out why the tar | nc <--> nc | tar method failed, can anybody see some documented reason why it was doomed to do so?
posted by flabdablet to Computers & Internet (21 answers total) 2 users marked this as a favorite
 
From man nc on my linux box; note the sentence about combining -l with -p:
 -l  Used to specify that nc should listen for an incoming connection
     rather than initiate a connection to a remote host.  It is an
     error to use this option in conjunction with the -p, -s, or -z
     options.  Additionally, any timeouts specified with the -w option
     are ignored.

posted by axiom at 6:16 PM on September 26, 2009


Actually, never mind that. I checked on my Ubuntu machine, and your nc invocation is fine. That man page is apparently for some ancient version of nc.
posted by axiom at 6:19 PM on September 26, 2009


I must admit I'm not too familiar with nc so forgive me if I'm missing something, but:

It occurs to me that you have the producer tar paired with the listener nc. This seems kind of counterintuitive because it relies on nc to queue bytes to send to the consumer tar once the connection is made. In other words, the laptop's nc has to queue all the data being spewed out by tar until the desktop connects and the data can be offloaded over the network. I'd expect you to create the consumer tar on the desktop together with the listener nc then cause the producer tar to initiate the connection. Basically, swap the arguments to nc between the desktop and the laptop.

Next time, just use rsync?
posted by axiom at 6:31 PM on September 26, 2009


the laptop's nc has to queue all the data being spewed out by tar until the desktop connects and the data can be offloaded over the network

No. The data is queued by the pipe, not by nc. And when the pipe's buffer fills (typically 4kb), the writer end blocks until the reader end has drained some of the data, so tar will block after writing a few kb.
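A rough local sketch of that blocking behavior, with nc left out entirely (the /tmp path is invented): the reader sleeps before draining, so the writer stalls as soon as the pipe buffer fills, yet every byte still arrives.

```shell
# The kernel pipe buffer (historically ~4 KiB, 64 KiB on modern Linux)
# is what queues the data; once it fills, the writer blocks until the
# reader drains it. Push 256 KiB through a pipe whose reader starts
# late, and count what comes out the other side.
( dd if=/dev/zero bs=1k count=256 2>/dev/null ) | ( sleep 1; wc -c > /tmp/pipe_demo_bytes )
cat /tmp/pipe_demo_bytes   # 262144 -- all 256 KiB made it through
```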

I think nc is a red herring here; it has nothing to do with your problem. There are some things like device files that tar is just incapable of reproducing, though with devfs that shouldn't really matter.
posted by Rhomboid at 6:45 PM on September 26, 2009 [1 favorite]


Yeah, specifically, what files didn't make it? I'd be surprised if any special device files survive that treatment (notably, anything in /dev). You might have better luck with cpio(1), although I haven't tried this myself.
posted by cj_ at 8:18 PM on September 26, 2009


So yeah, confirmed cpio will call mknod to recreate device nodes. To wit:
$ echo "/dev/null" | cpio -o | sudo cpio -i --make-directories
1 block
cpio: Removing leading `/' from member names
1 block
$ ls -l /dev/null dev/null 
crw-rw-rw-  1 root  wheel    0,  10 Sep 26 20:31 /dev/null
crw-rw-rw-  1 root  wheel    0,  10 Sep 26 20:32 dev/null

nc would sit between cpio -o and cpio -i. Docs here (there are a lot of options).
posted by cj_ at 8:35 PM on September 26, 2009


Look at the error messages. Tar should not have any problems creating devices.

My guess is that you ran out of disk space on your target machine.
posted by rdr at 9:33 PM on September 26, 2009


Well, he got the copy done with cp, so I doubt it's space either.

I'm at a loss if it's not block devices giving you trouble. Try running tar without "v" so errors don't get swallowed by the file list spam, or redirect stderr to a logfile.
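For instance, something along these lines on the receiving end (throwaway paths; with GNU tar, extraction errors go to stderr while the -v listing goes to stdout, so a 2> redirect separates them):

```shell
# Same shape as the receiving half of the original pipeline, but with
# stderr split into a log so errors aren't buried in the pathname spew.
mkdir -p /tmp/tar_demo/src /tmp/tar_demo/out
echo hello > /tmp/tar_demo/src/file.txt
( cd /tmp/tar_demo && tar c src ) | ( cd /tmp/tar_demo/out && tar xv 2>/tmp/tar_demo/errors.log ) >/dev/null
# /tmp/tar_demo/errors.log stays empty when nothing went wrong
```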

I also doubt nc is the problem here. If you think it might be, rsync is probably a better way of doing this anyway.
posted by cj_ at 10:03 PM on September 26, 2009


Also, for future reference (for those times when you don't want the authentication that rsync requires): consider dd over nc, which wouldn't even need a filesystem to write to.
posted by dirm at 12:51 AM on September 27, 2009 [1 favorite]


rsync needs rsyncd set up on the source end first, doesn't it? I couldn't be bothered looking up how to do that, or even setting up sshd, and wasn't sure what rsync and scp would do with special files.

I'd successfully used (GNU) tar before to clone an entire Ubuntu installation (including /dev) from a tar archive burned to CD-ROM, so I don't think it's a special files issue. In any case, the files that didn't make it were ordinary files, and surrounded by similar files that did work (a random peppering of the binaries in /usr/bin ended up zero-length, for example).

I did actually try the same thing with nc and cpio, and it failed in a different (but possibly related) way.

I didn't use dd (or just nc with standard input redirected to the source partition) because /home was on a separate filesystem at the target end.

Since it doesn't look like I've made an obvious blunder, I'll just try to reproduce this effect on a non-customer pair of machines and debug it properly in my Copious Free Time (tm).

Thanks for thinking about this.
posted by flabdablet at 2:33 AM on September 27, 2009


"can anybody see some documented reason why it was doomed to [fail]?"

You discarded, ignored, or failed to report error output. You appear to have used your tools mostly-correctly for the task at hand for the most generic case of copying some files, but since it did in fact fail, the chances are you did in fact do something wrong. Not reporting any of the errors means we can't tell you what.

" the files that didn't make it were ordinary files"

Were these files suid or sgid? What did tar say about those files when it was creating the tarstream? What did it say about them when it was expanding the tarstream? You say this is a customer machine. Surely you typescript your activity on customer machines. Do you still have the typescript from this session or did that get chucked? Can you go back and pull out the output of tar to the tty to find out what it said?

"I hit ctrl-D on the desktop end to close nc's standard input"

That's outright wrong -- netcat's STDIN was connected to the pipe, not the tty, so it never read your ^D as input -- but harmless if you really only hit ^D since all you'd have done was type a ^D into the tty buffer for the shell. You can just skip that step next time and just wait for the commands to complete.

"I did actually try the same thing with nc and cpio and it failed in a different (but possibly related) way."

What way? What was the tty output of both ends of cpio during that attempt?

"rsync needs rsyncd set up on the source end first, doesn't it?"

No.

In any case, good luck hunting this down. It's non-obvious what your mistake was, and in my case those are always the kinds of problems that wake me up in the middle of the night.
posted by majick at 5:21 AM on September 27, 2009


Not sure what went wrong with yours, but when I've done this sort of thing, I've always done the nc listening the other way around:

On the receiver (desktop in your case)

nc -l -p 10000 | tar xvSf -

Then on the sender

tar cvSf - . | nc -w 3 192.168.1.3 10000

but yeah, these days I tend to use rsync.
posted by fings at 7:38 AM on September 27, 2009 [1 favorite]


"rsync needs rsyncd set up on the source end first, doesn't it?"

No.


To get a little more specific, it seems rsync can operate over remote shells, in which case the other side of the connection would need rshd or sshd running, for example, and the rsync client.
posted by olaguera at 9:25 AM on September 27, 2009


... you appear to have used your tools mostly-correctly for the task at hand for the most generic case of copying some files

That's what I came here to get confirmation of.

but since it did in fact fail the chances are you did in fact do something wrong. Not reporting any of the errors means we can't tell you what.

Not having seen any, possibly because they were buried in the tar -v filename spew, means I can't tell either. At the time I didn't really have the time to invest in a proper debug session, so I just got the job done a different way without networking. In any case, I didn't come here expecting others to debug my problem by telepathy, just to see if there was anything obviously bogus in the way I was invoking the tools.

Were these files suid or sgid?

That's a good question, and if I manage to duplicate this fault it's one I'll look for an answer to.

What did tar say about those files when it was creating the tarstream?

Nothing. The tar on the sending end had no visible output. Note to self: make sure tar's error output really does go to stderr, not stdout.

What did it say about them when it was expanding the tarstream?

Can't swear to that because all I saw was pathname spew, but scrolling back through the last 500 lines of that didn't reveal anything except pathnames.

You say this is a customer machine. Surely you typescript your activity on customer machines.

Not when I'm doing bare metal builds in customers' houses using live CDs and not expecting tricksy trouble from standard tools that have always worked well before.

"I hit ctrl-D on the desktop end to close nc's standard input"

That's outright wrong -- netcat's STDIN was connected to the pipe, not the tty, so it never read your ^D as input -- but harmless if you really only hit ^D since all you'd have done was type a ^D into the tty buffer for the shell. You can just skip that step next time and just wait for the commands to complete.


Actually it's not wrong at all; nc is a little counterintuitive. When I do

tar c . | nc -l -p 10000

on the server machine, its nc ends up with tar's stdout piped to its stdin, and its own stdout and stderr connected to the tty. On the client machine,

nc 192.168.1.3 10000 -q5 | tar xv --numeric-owner

nc's stdin and stderr are tty, stdout is the pipe to tar, and there's a two-way TCP connection to the server's nc, kind of off to the side of the pipelines. While the two tars are busy doing their thing, you can amuse yourself by typing guff on the client's tty and watching it come out on the server.

When the server tar completes and closes stdout, the server nc does not immediately close the TCP connection. The various flavors of nc have various options to make them close the TCP connection after an EOF on stdin, but in the past I've been burnt by race conditions where the connection is closed before the last few bytes to arrive on stdin have actually been transmitted over it. So I've got out of the habit of using that option on the server end. I use it on the client end instead because I don't care about the integrity of any amusement guff going back from client to server.

What was the tty output of both ends of cpio during that attempt?

All quiet on the server end, the occasional anomaly (maybe not a pathname, maybe just a long pathname wrapped) flicking by in the spew on the client end, and nothing but legit pathnames in the scrollback, which I had unfortunately not had the wit to set to 10000 lines before trying the cpio.

If I recall correctly, cpio didn't actually lose any file contents, but it screwed up lots of permissions, and yes, I did use the -depth option on the find that piped in the list of files to send. I have not yet worked through the cpio documentation enough to find some equivalent of tar's --numeric-owner option, which is probably going to end up being the key to getting permissions right in this use case.

The cpio was the last thing I tried before abandoning networking altogether and reaching for my USB enclosure. I'd prefer to stick with tar if I can make it work, since the cpio and find options are kind of squirrelly.

when I've done this sort of thing, I've always done the nc listening the other way around ... tar cvSf - . | nc -w 3 192.168.1.3 10000

Can't see why it would make a difference except right at the end (doing it my way avoids the race condition implied by the 3 after your -w) but I will certainly be trying this and looking for clues. Thanks for the reminder about the S (sparse) option. Is there any particular reason you explicitly use "f -"?

it seems rsync can operate over remote shells, in which case the other side of the connection would need rshd or sshd running

both of which need rather more fiddling about to set up than tar|nc. This, and uncertainty about how rsync or rcp handle special files, is why I have generally preferred what I thought was a well-tested and reliable archiver.

Thanks again for all your suggestions so far. I'll report back when I've tracked the issue down.
posted by flabdablet at 7:14 PM on September 27, 2009 [1 favorite]


"Is there any particular reason you explicitly use "f -"?"

Personally, I always specify f - because I can't guarantee that any given machine I'm using uses GNU tar as /usr/bin/tar or is known to have sensible defaults. That habit of mine goes back to at least my Ultrix or SVR3 days when I was bitten by a weird default output device that was emphatically not STDOUT. I forget what I was trying to do, but a piece of hardware started moving that should not have been moving at that time.

The behavior you ascribe to netcat sounds broken. Because it's unmaintained code, there are lots and lots of versions of netcat out there -- sounds like you use one that's pretty different from the old l0pht code -- so I could see being wary around it if it's bitten you.

For what it's worth, I've done dd|nc and nc|dd plenty of times before (remind me to tell you about P2V virtualizing the world's funkiest hand-hacked Gentoo server some time) and never had trouble. On the other hand, if I lost the last couple bytes of a filesystem there's a strong chance I'd never know or care unless that final block were touched.

99% of the time when I want to tar files over a network I've ended up with rsync-over-ssh, but permissions, ACLs, ownership, or other metadata is not guaranteed to be preserved in the manner to which you are accustomed as a tar user.

"it seems rsync can operate over remote shell"

I don't think I've seen an actual running rsyncd in many, many years. It's all done over ssh now, as with almost all other forms of remote invocation.
posted by majick at 8:12 PM on September 27, 2009


The behavior you ascribe to netcat sounds broken

Are you referring to (a) the connection model (stdin and/or stdout available to pipe, with a non-pipe TCP connection between the nc client and nc server) or (b) race conditions between stdin EOF and TCP connection close?

(a) doesn't strike me as broken; it strikes me as a reasonable design choice given that TCP connections are inherently bidirectional and pipes aren't.

(b) definitely does strike me as broken. If I could be assured that nc would always flush all available data to the TCP connection before closing it, I'd certainly be putting the -q on the server end. I haven't dug through the nc source to find out, but I can see no point to the -q option's mandatory timeout parameter if there is, by design, no race to be avoided.

permissions, ACLs, ownership, or other metadata is not guaranteed to be preserved in the manner to which you are accustomed as a tar user


When the task at hand is cloning an entire installation and I can't use a straight partition-to-partition device copy because the recipient has a different filesystem vs partition structure to the donor, preserving all that metadata is exactly what's required. If rsync (or scp) don't do that, I think I'll stick with archiver-based methods. Assuming, of course, that I can find out why the bloody thing didn't work. :-)
posted by flabdablet at 9:06 PM on September 27, 2009


For what it's worth, I've done dd|nc and nc|dd plenty of times before

I'm curious about one thing: what value does dd add to that process? In other words, what makes

dd if=/dev/sdX | nc

a better proposition than

nc </dev/sdX

?
posted by flabdablet at 9:12 PM on September 27, 2009


Well, I like using block-aware tools on block devices more as a matter of policy or habit than anything else, since the tools presumably do more buffering and IO clustering than getc/putc-ing a character stream.

Practically speaking, though? The block count summary at completion is reassuring.
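The sort of summary meant here, from a throwaway local copy (path and sizes invented):

```shell
# dd prints records in/out and a byte total on stderr when it finishes;
# matching counts at both ends of a transfer are a cheap sanity check.
dd if=/dev/zero of=/tmp/dd_demo.bin bs=64k count=16
# reports 16+0 records in, 16+0 records out, 1048576 bytes copied
```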
posted by majick at 4:16 AM on September 28, 2009


the tools presumably do more buffering and IO clustering than getc/putc-ing a character stream

My point is that nc will be getc/putc-ing the pipe regardless of how efficiently dd handles the other side of it, and that dd+pipe is therefore nothing but data-move overhead. Don't know if you'd ever actually see that cause a difference in transfer time on a modern system, though.

Block counts: point taken. There is something reassuring about seeing them be the same at both ends.
posted by flabdablet at 5:10 PM on September 28, 2009


"nc will be getc/putc-ing the pipe"

Yes, and presumably nc will be the bottleneck, too. But getc/putc on block devices has historically been really horribly insanely slow on a lot of platforms over the years (Linux included -- you can see the difference yourself by running a bonnie right now; it used to be worse) so it has in the past paid off to use block tools on block devices even if there's a character device somewhere in the pipeline. catting a block special really did used to be that horribly slow, much slower than pipe IPC and much slower than Ethernet.

In cases where you're doing blocks over the network for a large filesystem, it behooves you to be sure that the network will in fact be the bottleneck. Maybe this is generational: on modern hardware and modern OSes that's a pretty likely outcome if you do the obvious thing, but it wasn't always, and I'm not about to break decades of sysadmin habit just because a few recent UNIXes got a little better at IO. =)
posted by majick at 7:01 AM on September 29, 2009


root@jellybelly:~# fdisk -l

Disk /dev/sda: 1500.3 GB, 1500301910016 bytes
255 heads, 63 sectors/track, 182401 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x0000a1d3

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1        1824    14651248+  83  Linux
/dev/sda2            1825        2197     2996122+  82  Linux swap / Solaris
/dev/sda3            2198      111616   878908117+  83  Linux
root@jellybelly:~# time cat </dev/sda2 >/dev/null

real 0m28.881s
user 0m0.160s
sys 0m6.292s
root@jellybelly:~# time dd if=/dev/sda2 of=/dev/null
5992245+0 records in
5992245+0 records out
3068029440 bytes (3.1 GB) copied, 28.4318 s, 108 MB/s

real 0m28.606s
user 0m2.332s
sys 0m11.557s
root@jellybelly:~# time dd if=/dev/sda2 of=/dev/null bs=1M
2925+1 records in
2925+1 records out
3068029440 bytes (3.1 GB) copied, 28.9733 s, 106 MB/s

real 0m28.978s
user 0m0.032s
sys 0m13.889s
root@jellybelly:~#

Real time is pretty much a wash on this box. Interesting that dd appears to be responsible for about twice as much internal data movement.
posted by flabdablet at 5:00 PM on September 29, 2009


This thread is closed to new comments.