Why is it so difficult to prevent buffer overflow attacks?
May 16, 2019 7:27 PM

I have a layman's understanding of how a buffer overflow attack works and it seems pretty straightforward to me. So help me understand why this type of attack, which continues to be a common vector for malicious code or malware (nobody was surprised this was the source of the recent WhatsApp vulnerability) seems so difficult to account for when writing code. Or is it not difficult to anticipate and there are other reasons why it keeps happening to supposedly well-written code and to operating systems that have been around for years?
posted by theory to Computers & Internet (14 answers total) 4 users marked this as a favorite
It's hard to have the discipline to validate every last bit of input and data moving around, especially when users get to craft the data coming into your software.

In the ideal world, your compiler would check for this but..... *elaborate Gallic shrug*
posted by wenestvedt at 7:30 PM on May 16, 2019 [2 favorites]

Best answer: Humans are not good at the detail work required to capture every place you need to check for a buffer overrun.

Computers are a lot better at figuring out every place where a check could possibly be warranted, but not so good at figuring out where the check is redundant. Therefore, programs that automate these checks tend to be slower than programs written by humans that only check when needed. It is only fairly recently that computing performance has gotten to the point where it doesn't actually matter so much whether redundant checks are done, so we're only slowly putting paradigms that always do these checks into place.

It doesn't help that common computer programming languages designed in the bad old days are still in use, since bounds checks end up having to be a feature of the language - you can't just add them to a language that doesn't natively bounds-check, because the language doesn't even really express what the bounds are. For example, the languages used in iOS still allow unsafe memory accesses. Modern languages allow for marking of bounds and automatic bounds checking, but even if your application developer is using a modern bounds-checked language, execution will eventually end up in system code that is generally (and for good reasons!) capable of unsafe access.

So things are getting better, but we'll still be dealing with bounds check errors for a long time to come.
posted by doomsey at 8:09 PM on May 16, 2019 [3 favorites]

Best answer: Imagine you write contracts in some weird world where technically a contract is only enforceable if every line has an odd number of vowels.

Also, all your references have to pass the same test: you can technically only cite a law, regulation, or decision if every line in that document has an odd number of vowels.

Also, in your world the language of law is Hebrew, so you don't actually write down the vowels; people just figure out from context whether "ppl" means "people" or "apple". So you can't just get a computer to review everything for you.

Also, you've got reams of contracts from before anyone cared about the odd-vowel rule. You have no idea whether they're compliant. Your business partners mostly don't care, because they're in the business of selling apples and not litigating vowel counts.
posted by meaty shoe puppet at 8:15 PM on May 16, 2019 [18 favorites]

Best answer: What it all comes down to is that market-driven systems encourage participants to employ the principle that worse is better.

A hell of a lot of software is written in the C programming language and others descended from it, because that's long been seen as the path of least resistance toward writing performant software.

The reason for that is that C is essentially just a portable assembly language: it gives coders a largely architecture-independent way to specify those kinds of computer behaviour that pretty much all computer architectures are capable of, without requiring them to shift their view very far from what the machine is actually doing. A line of C code usually compiles to a very small number of processor instructions, and programming in C gives the coder quite close control over how their code is actually executed.

The disadvantage of this is that C provides very close to zero protection against unintentionally telling the processor to do horribly unsafe things.

In particular, C has no inbuilt support at all for enforcing the principle that accessing some element of an array by its index must only succeed if that index is smaller than the size of the array. If I have an array of ten characters, and I write a line of C that's supposed to do something with the Jth element of that array, and J happens to contain the value 20, then the C compiler will emit code that accesses the memory location 20 places past the start of the array regardless of the fact that what's stored there has got nothing to do with the array at all.
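
To make the missing check concrete, here is a minimal sketch of the test that C leaves to the programmer. The helper name checked_get is made up for illustration and is not part of any standard library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: read arr[idx] only if idx is within len elements.
   Returns true and stores the element in *out on success; returns false
   and leaves *out untouched when idx is out of range. */
bool checked_get(const char *arr, size_t len, size_t idx, char *out)
{
    assert(arr != NULL && out != NULL);
    if (idx >= len) {
        return false;   /* a plain arr[idx] would touch unrelated memory */
    }
    *out = arr[idx];
    return true;
}
```

With a ten-element array, checked_get(a, 10, 20, &c) returns false, where a bare a[20] would quietly read whatever happens to live 20 places past the start. Note that the caller still has to pass the correct length, because the pointer itself carries no size information.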

A typical buffer overflow involves code that's not quite as obviously wrong as that, but close. If I have a buffer say 80 characters long, and I'm using that to hold say a surname, and I fill that buffer by calling a standard library routine that reads characters into it up to the next newline, then as long as nobody ever enters a surname longer than 80 characters all will be well.

And since computer programs are large and complex, and the place in the source code where those 80 characters are allocated for the buffer might be nowhere near the place where stuff actually gets read into it, the language's lack of any kind of checking will pretty much inevitably result in programmers getting this wrong occasionally.
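
As a rough sketch of the alternative, the standard fgets() takes the buffer capacity as an argument, so the read cannot run past the end. The wrapper below (read_line_bounded is a hypothetical name) also reports when a line did not fit. Keep in mind that an 80-byte C buffer holds at most 79 characters plus the terminating zero byte.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read one line from `in` into buf, whose capacity is
   cap bytes including the terminating '\0'. Returns true only if a whole
   line fit; an over-long line is truncated and its remainder discarded. */
bool read_line_bounded(FILE *in, char *buf, size_t cap)
{
    assert(in != NULL && buf != NULL && cap > 0 && cap <= INT_MAX);
    if (fgets(buf, (int)cap, in) == NULL) {
        buf[0] = '\0';
        return false;              /* end of input or read error */
    }
    size_t n = strlen(buf);
    if (n > 0 && buf[n - 1] == '\n') {
        buf[n - 1] = '\0';         /* complete line: drop the newline */
        return true;
    }
    /* No newline stored: either the input ended, or the line was too long.
       Skip to the end of the line so the next call starts cleanly. */
    bool truncated = false;
    int ch;
    while ((ch = getc(in)) != EOF && ch != '\n') {
        truncated = true;
    }
    return !truncated;
}
```

The design choice here is to fail loudly on over-long input rather than silently keep a partial surname; whether truncation or rejection is right depends on the application.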

The most likely cause for the WhatsApp bug is a programmer having chosen the size of some buffer somewhere in such a way as to make overflow impossible inside the routines that handle the communication protocol involved provided that all such communication was performed in accordance with the protocol specification. This is a very easy trap to fall into, especially for relatively less experienced coders, and it's also why fuzzing is such a valuable testing technique.
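
To illustrate that trap, here is a sketch of a parser that refuses to trust a length field supplied by the sender. The wire format (a 2-byte big-endian length followed by the payload) and the name copy_payload are invented for this example and have nothing to do with WhatsApp's actual protocol or code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wire format, invented for illustration: a 2-byte big-endian
   payload length followed by that many payload bytes. */
bool copy_payload(const uint8_t *pkt, size_t pkt_len,
                  uint8_t *out, size_t out_cap, size_t *out_len)
{
    assert(pkt != NULL && out != NULL && out_len != NULL);
    if (pkt_len < 2) {
        return false;                   /* header itself is incomplete */
    }
    size_t claimed = ((size_t)pkt[0] << 8) | pkt[1];
    if (claimed > pkt_len - 2) {
        return false;                   /* sender claims more than it sent */
    }
    if (claimed > out_cap) {
        return false;                   /* would overflow our buffer */
    }
    memcpy(out, pkt + 2, claimed);
    *out_len = claimed;
    return true;
}
```

Code that skips the two comparisons works perfectly for every well-formed packet, which is exactly why ordinary testing tends not to catch the omission and fuzzing does.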

The fact that buffer overflows are still a thing nearly five decades after the first appearance of C was the main motivation for designing the computer language Rust, which is intended to be very nearly as easy to write performant code in as C while doing much, much more to ensure that memory access is always performed safely.

It remains to be seen whether the kickstart that Rust gets from being adopted by Mozilla for future work on Firefox will be enough to see it begin to displace C and derivatives from the wider software development world. But even if it does, there is so much legacy code out there that's never going to be economically feasible to re-implement, and so much untested potential behaviour of that legacy code, and so many coders writing poorly tested software with bad tools under crazy-stupid deadline pressure, that buffer overflow attacks are likely to remain a rich vein for exploit writers to mine for the foreseeable future.
posted by flabdablet at 8:27 PM on May 16, 2019 [16 favorites]

In short, it's not particularly difficult to prevent buffer overflow attacks. But worse is better when you actually want to ship stuff, and it is a little more difficult to prevent them than not to prevent them, and the tools that would make this untrue are still not in widespread use, so the possibility for them keeps on being designed in. Same applies to every other software failure mode.
posted by flabdablet at 8:35 PM on May 16, 2019 [2 favorites]

and it is a little more difficult to prevent them than not to prevent them, and the tools that would make this untrue are still not in widespread use, so the possibility for them keeps on being designed in

To add to this: if you prevent them in 99.9% of cases but let one slip in indirectly, you are still 100% vulnerable once someone discovers the one you missed.
posted by mark k at 8:42 PM on May 16, 2019 [2 favorites]

Everything is broken
posted by flabdablet at 9:44 PM on May 16, 2019

Best answer: program to user: give me 500 characters and i'm going to check i got 500 characters and assume that a character is one byte which i am sure fits in a 500 byte space.

user: here's 500 chinese characters

program: i have verified receipt of 500 characters and will keep them in memory

memory: thank you for 2000 bytes and damn the consequences, i don't care
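
The gap between "characters" and bytes in that exchange can be sketched in a few lines of C, assuming the text is valid UTF-8 (common Chinese characters typically take three bytes each in UTF-8, or four in UTF-32). The function name utf8_code_points is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count Unicode code points in a NUL-terminated UTF-8 string by counting
   every byte that is not a continuation byte (bit pattern 10xxxxxx).
   Assumes the input is valid UTF-8; it does not validate. */
size_t utf8_code_points(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For the two-character string "\xE4\xBD\xA0\xE5\xA5\xBD" this returns 2 while strlen() returns 6, so a buffer sized from the character count alone would be too small.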
posted by zippy at 11:20 PM on May 16, 2019 [4 favorites]

One thing to keep in mind is that in lower-level programming languages like C and C++, you can have a variable that keeps a pointer to a memory address--and for many common operations, this is the expected or even only way to represent memory containing data you want to keep track of. This pointer is mutable; that is, you can change its value to whatever you want and your program will dutifully do so, and so it can point to any chunk of memory the system allows it to (in modern systems, this typically means pretty much any memory that belongs to the program that's being run). Usually, this pointer is "typed" so that you know, hey, this is a pointer to a string of characters, or an array of integers, or whatever, but there's nothing that prevents you from reinterpreting that pointer as something else, or even using a "void" pointer that has no type information.

Programming languages that allow this low level representation of a computer's memory are largely to blame for these sorts of bugs, as the language can offer no correctness guarantees about what in the end is just a memory address; it doesn't know anything about the nature of the data at that address, beyond what a programmer tells it to expect. Any mismatch of expectations can cause this sort of issue.
posted by Aleyn at 1:04 AM on May 17, 2019

Another factor is the sheer size of modern computer programs.

When I wor but a tiny lad, a program with tens of thousands of lines of source code was considered large. I remember being told, in tones of distinct awe, that the code that ran all the shiny new train arrival and departure displays at all the railway stations in my home city was 22,000 lines of Pascal.

The browser you're probably using to read this comment consists of something like twenty million lines of code, and I think it would be fair to say that there isn't a single person on the entire planet who understands the whole thing. And that's just one application program and doesn't include all the operating system components it needs to interact with in order to actually do something.

Exploitable bugs of all kinds - not just buffer overflows - have just got so much code to hide in these days that stamping them all out is virtually impossible. And as mark k points out, sometimes it only takes one.

The only reason that modern software works at all is because extensive testing is built into every modern development methodology; I'm frequently astonished that it all hangs together as well as it does.
posted by flabdablet at 2:10 AM on May 17, 2019 [5 favorites]

I was going to smugly answer "because programmers are idiots and still use memory unsafe languages like C and C++". And that's mostly true, for most exploits. But I'm not sure that's the problem with this WhatsApp overflow, CVE-2019-3568.

What's interesting is it affects many versions of WhatsApp on Android, iOS, and Windows Phone. Those are programmed in three different languages! And one of them (Java on Android) is memory safe and should be immune to simple overflows. Maybe WhatsApp is passing on corrupt data to the operating system and it's the OS that's actually overflowing? But it's happening on three different OSes! The bug has to do with the network stack in SRTCP, but I'm scratching my head how the exact same bug happens on three platforms.
posted by Nelson at 7:29 AM on May 17, 2019 [1 favorite]

Most often the fault is with C's simplistic string functions like strcpy(), and the way C++ didn't have a standard string class for years so everybody wrote their own one. Dig into any sizeable C++ app and you find half a dozen different string classes with different bugs.

You can make C code that works, but you need to do a lot of bounds checks and correctly handle things that are too long or zero length or NULL pointers.
The length-checking string functions like strncpy() (with an "n" in the middle) have their own problems which I won't get into, and are not a panacea.
With UTF-8 that C code can even cope with Unicode, as long as you don't slice and dice things too enthusiastically, as many characters are now sequences of bytes that should not be broken up. This complicates bounds checking and truncation, as the number of characters typed no longer equals the number of bytes. Although it does for the kind of data that English-speaking programmers test with, so the problem gets found downstream if at all.
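
As one possible sketch of both points: strncpy() leaves the destination without a terminating zero byte when the source is at least as long as the limit, and a naive byte-count truncation can cut a UTF-8 sequence in half. The hypothetical helper below (copy_utf8_truncate is not a standard function) always terminates the result and backs up to a character boundary, assuming the source is valid UTF-8.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src into dst (capacity cap bytes), always
   NUL-terminating and never cutting a UTF-8 sequence in half.
   Returns true if all of src fit, false if it was truncated.
   Assumes src is valid UTF-8. */
bool copy_utf8_truncate(char *dst, size_t cap, const char *src)
{
    assert(dst != NULL && src != NULL && cap > 0);
    size_t len = strlen(src);
    bool fits = len < cap;
    size_t n = fits ? len : cap - 1;
    if (!fits) {
        /* Back up while the first dropped byte is a continuation byte,
           so the cut lands just before the start of a sequence. */
        while (n > 0 && ((unsigned char)src[n] & 0xC0) == 0x80) {
            n--;
        }
    }
    memcpy(dst, src, n);
    dst[n] = '\0';
    return fits;
}
```

This only respects code point boundaries; it makes no attempt to keep combining sequences together, which is a harder problem.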

Things are better in pretty much all newer languages as pointed out by other posters.
They have strings that handle Unicode, strings are variable length (so they don't overflow), and their lifetime is managed for you in some way. I love the way Objective C does strings with NSString, although it is admittedly a lot of typing to do anything. Swift attempts to make this more concise, although I find Swift hard to love.
posted by w0mbat at 9:57 AM on May 17, 2019 [1 favorite]

I love the way Objective C does strings with NSString

Personally I get a bit peeved with any string library that works with UTF-16 internally. It seems to me that if what you actually need is an array of fixed size char, and you're working with Unicode, then your char type ought to be a simple scalar capable of accommodating any Unicode code point natively; the most natural encoding that fits this bill is platform-endian UTF-32.

If you don't need the fast indexing capability you get by representing a string as a simple array of a scalar char type, then picking UTF-8 typically wastes a lot less space and plays nice with long-established string-passing conventions that rely on a zero-byte terminator rather than a length. The distinction between a byte of text-encoding data and a single text character is encountered as a matter of course when working with UTF-8, and is therefore more likely to be handled correctly.

In fact, given the existence of abstract characters whose glyphs can only be specified as multiple Unicode code points, it's arguable that no fixed-size character representation is actually workable at all any more, and that the native representation for a single character in 2019 really does need to be an arbitrary-sized blob of bytes.

UTF-16 offers the worst of both worlds. It positively encourages the programmer to forget about the distinction between a character code and a Unicode code point, thereby creating endless potential bugs as soon as the code tries to deal with code points outside the Basic Multilingual Plane. It won't ever be able to encode more than 2^20 plus loose change code points, unlike UTF-8, which is quite capable of going to 2^31 and is only restricted to the usual 2^21 by arbitrary convention.
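
For readers who haven't met surrogate pairs, this sketch shows how one code point outside the Basic Multilingual Plane becomes two 16-bit units in UTF-16, which is where code that treats one 16-bit unit as one character goes wrong. The function name utf16_encode is made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16. Returns the number of 16-bit
   code units written to out (1 or 2), or 0 if cp is not encodable
   (a surrogate value, or anything above U+10FFFF). */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                  /* fits in a single unit */
        return 1;
    }
    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

U+4F60 encodes as the single unit 0x4F60, while U+1F600 encodes as the pair 0xD83D 0xDE00. The 20 bits left after the subtraction are the source of the 2^20 ceiling on supplementary code points.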

UCS-2 got into Win32, Java and Objective-C back in those days of innocence and light when nobody seriously believed that more than 2^16 code points would ever be required (demonstrating the same flagrant disregard for the principle of None, One or Enough as the apocryphal Bill Gates remark about 640K being enough memory for anybody) and backward compatibility considerations gave subsequent designers very little choice but sneaky redefinition of their 16-bit char type as a 16-bit UTF-16 encoding type instead. But baking UTF-16 into the fundamental data types of any new programming language strikes me as an easily avoidable error and I'm quite cross about it showing up in JavaScript. UTF-8 or GTFO.

All of which constitutes a somewhat elliptical backgrounder for the point that zippy made above: modern encodings can make it quite difficult to predict the amount of memory that arbitrary textual input will ultimately require, even if the length of that text in characters is known in advance.
posted by flabdablet at 12:55 PM on May 17, 2019 [1 favorite]

Blaming software, and C in particular, ignores the even more fundamental issue: this kind of checking should be done in hardware, not in software. Computers with hardware bounds checking existed at one point before instruction sets became a monoculture. The entire industry went backwards and now the price is being paid.
Burroughs Corporation built stack machines with segmented memory. That meant the stack was maintained in hardware and storage regions had hardware bounds checking. There were extra bits for each word identifying what kind of data it contained. Memory storage segment allocation/de-allocation was integrated with the virtual memory mechanism. And all the software was written in a high level language, so programmer assembly errors were impossible.
The kind of security issues that plague all computers today could not happen in this architecture. Newer languages like Java, JavaScript, Python and Rust followed an evolutionary path to provide this kind of software environment but the hardware base is still completely vulnerable. Unless someone decides to build a system with this level of hardware support for memory usage and a language integrating the built in memory model no system will ever be as reliable as Burroughs computers were in the 1980s.
posted by Metacircular at 5:03 PM on May 19, 2019 [2 favorites]

This thread is closed to new comments.