How do files on Usenet “decay”?
March 18, 2013 11:56 AM
Large binaries on Usenet are often distributed in pieces, with parity archives included to fix bitrot. Very old files often require substantial repairs, and sometimes can’t be salvaged. I’m curious about the actual mechanism by which this decay takes place: lossy copies? Encoding errors? Solar radiation? How many network hops and transfers do messages typically undergo from their pristine original state to too-far-gone? Why would a digital file “go bad”?
I could be totally off-mark, but it seemed to me that not all Usenet providers were equally reliable in terms of getting the original binary files. I could be wrong, but I vaguely recall downloading binaries with gaps in the middle, not at the beginning.
posted by filthy light thief at 12:20 PM on March 18, 2013
Originally Usenet was transmitted on normal modems (remember them?) and phone line noise could scramble the text.
posted by Chocolate Pickle at 12:50 PM on March 18, 2013
Yeah, the parity files are to defend against a given server not receiving all of the articles in a given post. It's not so much that they go away, it's that they're never received in the first place (or the disk write fails or whatever). Each rar file is split across a very large number of articles because the news servers have fairly low article size limits. 40ish articles per rar is pretty common for TV shows. Blu-Ray rips are more like 200 articles per rar file. Lots of opportunity for something to get lost in transit.
It's pretty easy for one to get lost somehow, so some bright person made up par files. Those weren't great, though, because the block size was fixed to the size of a rar file. So if there were 4 different files that were messed up and only 3 par files, you were completely out of luck even if only one article were missing from each of the four files. Now we have par2 files, which have (smaller) variable block sizes and are therefore more robust.
posted by wierdo at 12:56 PM on March 18, 2013 [1 favorite]
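A rough sketch of the arithmetic behind wierdo's figures. The ~750 KB article payload below is an assumed value for illustration; real article size limits vary by provider and poster.

```python
# Back-of-envelope for the numbers above; the article payload size is an
# assumption, not a fixed Usenet limit.
def articles_per_rar(rar_part_size, article_size=750_000):
    """How many articles one rar part spans (ceiling division)."""
    return -(-rar_part_size // article_size)

print(articles_per_rar(30_000_000))    # ~40 articles for a 30 MB rar part (TV-show sized)
print(articles_per_rar(150_000_000))   # ~200 articles for a 150 MB rar part (Blu-Ray sized)
```

Lose any one of those articles on any server along the way and the whole rar part needs repair.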
Response by poster: This makes sense, thanks.
In my recent experience, I’ve been seeing complete collections of rar files with corrupted content. I.e., they might all be there but each one has some amount of damage that requires fixing. Are rar files and articles 1:1, or do the rar files themselves get split up among multiple articles?
If I wanted to put some large binary file onto Usenet and reliably get it back at some time in the future, what could I do to help ensure its integrity? Substantially increase the number of par2 files? Split it up amongst a larger number of small rars, or keep it as one big slug?
posted by migurski at 1:31 PM on March 18, 2013
As people have alluded to, there are two types of bit errors you're trying to guard against here.
1) You are missing chunks of the file.
As people mentioned, you might only be able to find / download 90% of the posts that contain the bits of your file. To fix this up, you need some of those posts to carry redundant data. Usually people use variations on Reed-Solomon codes for this. The Wikipedia article is way over my head, but thanks to a previous job I know that there are error correction codes that will let you divide data into d data chunks and p parity chunks, and you can lose any p chunks and still recover your data. So for instance, you could have 20 data chunks and 5 parity chunks, and as soon as you have ANY 20 chunks, you have enough bits to reconstruct your data. (A toy sketch of this property follows after this comment.)
2) Bit errors in the chunks themselves.
Usually you defend against these using a different checksumming mechanism, which may or may not be able to recover your data (it seems like maybe par2 supports recovery?). For instance, you might just check the crc of the whole chunk, and if it fails, you throw that out and hope you have enough parity chunks left. You can get more complicated, but I don't know how rar files work, so I can't help you specifically.
Personally, I would not rely on Usenet as long term storage. As others have mentioned, it's unreliable (by design), and hosts only keep a certain amount of data before chucking it in the garbage.
posted by Phredward at 1:38 PM on March 18, 2013 [1 favorite]
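A toy demonstration of the "any d of d+p chunks" property described above. Real par2 uses Reed-Solomon over GF(2^16); this sketch just evaluates and interpolates a polynomial over a prime field to show the idea, and all the values in it are made up.

```python
# Toy erasure code: treat d data words as polynomial coefficients, evaluate at
# d + p points, and recover the coefficients from ANY d surviving points.
PRIME = 2_147_483_647  # a large prime; every data word must be smaller than this

def encode(data_words, n_parity):
    """Evaluate the polynomial at d + p distinct points; each (x, y) pair is one chunk."""
    d = len(data_words)
    return [(x, sum(c * pow(x, i, PRIME) for i, c in enumerate(data_words)) % PRIME)
            for x in range(1, d + n_parity + 1)]

def decode(chunks, d):
    """Recover the d original words from any d surviving chunks by Lagrange interpolation."""
    xs, ys = zip(*chunks[:d])
    coeffs = [0] * d
    for j in range(d):
        # Build the j-th Lagrange basis polynomial in coefficient form.
        basis, denom = [1], 1
        for m in range(d):
            if m == j:
                continue
            basis = [((basis[k - 1] if k > 0 else 0)
                      - xs[m] * (basis[k] if k < len(basis) else 0)) % PRIME
                     for k in range(len(basis) + 1)]
            denom = denom * (xs[j] - xs[m]) % PRIME
        inv = pow(denom, -1, PRIME)  # modular inverse (needs Python 3.8+)
        for k in range(d):
            coeffs[k] = (coeffs[k] + ys[j] * basis[k] % PRIME * inv) % PRIME
    return coeffs

data = [20, 13, 3, 18, 1]              # d = 5 data words
chunks = encode(data, n_parity=2)      # 7 chunks "posted"
survivors = chunks[:2] + chunks[4:]    # chunks 2 and 3 never arrive
assert decode(survivors, d=5) == data  # any 5 of the 7 are enough
```

Drop any two of the seven chunks and the remaining five still reconstruct the data; drop three and there is no longer enough information to pin down the polynomial, which is exactly the trade-off the amount of parity controls.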
Response by poster: Thank you Phredward, makes sense. Going back to my original question though, *how* are these errors introduced into the chunks? I use a paid News service with multiple years of retention, for example. I would imagine that when they received a chunk of data at some time in the past it was written to a disk and then left undisturbed, subject to hardware reliability and the provider’s redundancy policy. Is that the case, or is there some other source of instability in the Usenet protocol?
Sorry if I’m being dense here, I really am asking a “why is the sky blue” question.
posted by migurski at 1:49 PM on March 18, 2013
I don't know the answer for sure, but encoding errors seem like a plausible explanation. For example, binaries on usenet are often base64-encoded. Base64 is a poorly-specified encoding scheme with lots of variants. Even if the original message was readable, any server or filter or other tool along the way that decodes and/or re-encodes a message could cause problems if it does not correctly support the original variant, or because it outputs some variant that a later reader doesn't support.
Or maybe a message can get truncated in transit because it exceeds someone's length restrictions?
Or indeed it might just have a completely random single-bit error introduced by random noise/radiation/hardware failures in transit or storage -- but I would expect these to be very rare since error correction built into various layers like TCP should catch the vast majority of them.
posted by mbrubeck at 2:31 PM on March 18, 2013
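A toy illustration of the general mechanism mbrubeck describes: the binary travels as text, so any text-level mangling in transit becomes binary corruption after decoding. This is only a sketch, not how any particular news server behaves.

```python
import base64

# The binary payload is posted as text; corrupt one character of that text and
# the decoded binary no longer matches the original.
original = bytes(range(48))
text = base64.b64encode(original).decode()

# Flip a single character of the encoded text to another valid base64 character.
mangled = text[:10] + ('B' if text[10] != 'B' else 'C') + text[11:]
damaged = base64.b64decode(mangled)

print(damaged == original)                              # False
print(sum(a != b for a, b in zip(original, damaged)))   # only a byte or two differ
```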
Best answer: I work for a Usenet provider.
Usenet articles are not very big, so in a large binary post each rar is made up of many articles. The errors you are seeing are most likely individual missing articles. It would be very rare for articles to get corrupted once they were on disk. In fact, we replace disks with any errors, re-copying the article from a redundant copy.
The most common cause of missing articles is DMCA takedown notices. Many takedown firms will send takedowns for only a few articles in each file. Just enough to make it unusable.
It used to be the case that articles would be removed once they were older than the provider's retention, but most providers now continually add storage and grow their retention.
posted by mad bomber what bombs at midnight at 3:06 PM on March 18, 2013 [3 favorites]
Rar files and articles are not 1:1; a typical rar file is larger than the article size limits most newshosts will accept. A rar file will be made of dozens to hundreds of articles.
The best way to ensure its integrity is to crosspost a reasonable amount (newshosts don't want you spamming half the newsstream), to repost often (these days, once a year would be often), and to provide lots and lots of parity-archive files; a typical parchive that I see will weigh in at about 10-15% of the full filesize. You might want to bring that closer to 50% or even more.
Why do articles go bad? It doesn't take much disk corruption or transmission error to render an article useless. These articles are passed from server to server and copied many times before they get from the origin posting server to your newsserver.
posted by Sunburnt at 3:08 PM on March 18, 2013 [1 favorite]
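A back-of-envelope way to think about the 50% suggestion above, using the usual par2 rule of thumb that N recovery blocks can repair up to N missing or damaged source blocks. The block counts below are made up for illustration.

```python
# Rule of thumb: par2 can repair as many source blocks as you have recovery blocks.
def survives(source_blocks, redundancy_pct, lost_blocks):
    recovery_blocks = source_blocks * redundancy_pct // 100
    return lost_blocks <= recovery_blocks

# Say the post is cut into 4,000 par2-sized blocks and 20% of them go missing:
print(survives(4000, 15, lost_blocks=800))   # False: a 10-15% parity set is not enough
print(survives(4000, 50, lost_blocks=800))   # True: a 50% parity set covers it easily
```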
1) You are missing chunks of the file.
There's also a 1)a. here in that servers theoretically might receive only part of a chunk, that is, cut off at the end. Since it's ASCII-encoded, the recipient has no way of knowing for sure whether it's the complete chunk or not without attempting to assemble it and checking the parity.
This probably isn't as much of an issue today as it used to be, though.
posted by dhartung at 4:57 PM on March 18, 2013
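As dhartung notes, a bare ASCII-encoded chunk doesn't announce its own completeness; detecting truncation requires the poster to ship a declared size and/or checksum alongside it (yEnc-style posts carry something like this). A minimal sketch of that kind of check, with made-up field names:

```python
import zlib

# Hypothetical completeness check: compare the received chunk against a size
# and CRC32 the poster declared alongside the encoded data.
def looks_complete(chunk: bytes, declared_size: int, declared_crc32: int) -> bool:
    return len(chunk) == declared_size and zlib.crc32(chunk) == declared_crc32

payload = b"example article payload"
size, crc = len(payload), zlib.crc32(payload)

print(looks_complete(payload, size, crc))        # True: arrived intact
print(looks_complete(payload[:-5], size, crc))   # False: cut off in transit
```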
Response by poster: Mad Bomber, very useful inside info! Do you mean that takedown firms know how to calculate the correct amount of degradation that a par2 file will withstand, and DMCA just a little over that? Do they send you article ID’s, or do the takedowns wend their way through Usenet like the original articles?
posted by migurski at 5:49 PM on March 18, 2013
I do not work for a Usenet provider, but my experience has been that Astraweb has been doing near-instantaneous auto take-downs without confirmation, but without actually deleting the file(s) - only replacing them with garbage. Since Astra is huge in the US/world and the takedown happens so quickly after posting, when files on Astra get propagated elsewhere, everybody else's servers have the corrupt files.
Also, it doesn't appear to be "just enough," for example, all the rar files for certain things are routinely completely garbage (at least in the eyes of NewsBin Pro 6.41). Some prominent posters/posting-groups have changed names or changed to lower profile newsgroups to post in, and/or are posting to non Astra servers. Or have stopped posting and instead we now have much better seeded torrents than before.
For a brief period of time I had access to Astra and a non-Astra EU-based server. Quite a lot of stuff (especially if it was originally posted on a non-Astra server) was perfectly intact on the EU-based server but not on Astra's (both US and EU servers).
There appear to be a lot of "fake" posts ("real" titles/names but garbage data of the same approximate size as the "real" thing) that are masquerading as scene posts. Usenet has also been blitzed/flooded with fake crap (of a much smaller file size) for the last year or so in an attempt to overwhelm header searching/browsing. But I think this used to happen every few years and was cyclical.
The cynic in me suspects that there's a lot of steganography-like data dissemination hidden in this sea of fake data, much of it not taken down (despite near-identical header names) and consisting of intact passworded rars or .exe executables. Like, spy stuff or kiddie porn.
I have seen partially corrupted collections that exceed the included par2 recovery set (or the par2 isn't correct), but I suspect that's most likely for a very different set of reasons.
Hmm, is a posted par2 set (basically a partial hash(?) of the allegedly infringing material) actually technically a case of infringement that can be legally taken down?
posted by porpoise at 7:44 PM on March 18, 2013 [1 favorite]
Different takedown firms send different amounts. Some seem to not know about par files, some send takedowns for every single part, including the pars. They email a list of infringing message-ids and it's our responsibility to remove them.
posted by mad bomber what bombs at midnight at 9:06 AM on March 19, 2013