BioHacker News | A Guide to Undefined Behavior in C and C++ (2010)

▲A Guide to Undefined Behavior in C and C++ (2010)(blog.regehr.org)

79 points by GarethX 120 days ago | 6 comments

This is an area the newer languages get right - I don’t think Rust or Go has any undefined behavior? I wish they would have some kind of super strict mode for C or C++ where compilation fails unless you somehow fix things at the call sites to tell the compiler the behavior you want explicitly.

▲zyedidia 117 days ago

I think data races can cause undefined behavior in Go, which can cause memory safety to break down. See https://research.swtch.com/gorace for details.

▲jjmarr 117 days ago

> I wish they would have some kind of super strict mode for C or C++ where compilation fails unless you somehow fix things at the call sites to tell the compiler the behavior you want explicitly.

The C++ language committee _does not_ want to add more annotations to increase memory safety.

▲anon-3988 117 days ago

Not even annotations. The committee standardized this: https://en.cppreference.com/w/cpp/container/span/operator_at

So they clearly don't care, and there's no point trying to convince them.

▲maxlybbert 117 days ago

The committee tends to also provide bounds checked interfaces ( https://en.cppreference.com/w/cpp/container/span/at ). But that requires people read the documentation, and based on the number of people who I see write "std::endl" when they really want "'\n'", I don't have much hope for that ("std::endl" both sends '\n' to the stream, and flushes it; people are often surprised about the stream getting flushed).
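A rough sketch of the difference between the two interfaces. It uses `std::vector` as a stand-in, since `std::span::at` is only available from C++26; `at_rejects` is a made-up helper name, not anything from the standard library.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Returns true if v.at(i) rejected the index by throwing std::out_of_range.
// v[i] with the same out-of-range index would be undefined behavior instead.
bool at_rejects(const std::vector<int>& v, std::size_t i) {
    try {
        (void)v.at(i);  // bounds-checked access
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}
```

With a three-element vector, `at_rejects(v, 2)` is false and `at_rejects(v, 3)` is true.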

▲AlotOfReading 117 days ago

We've known since the 80s that programmers almost always choose the more ergonomic interface and telling people they're holding it wrong doesn't scale.

Besides, throwing an exception is a terrible way to do range checking. There's a huge number of projects out there banning exceptions that would benefit from safe interfaces.

▲shiomiru 116 days ago

> Besides, throwing an exception is a terrible way to do range checking. There's a huge number of projects out there banning exceptions that would benefit from safe interfaces.

I thought "banning exceptions" would mean -fno-exceptions, turning the throw into an abort - that's pretty good for a systems programming language, no?

What other way would you propose?

▲anon-3988 116 days ago

Make it explicit that the API can fail. Panicking is not acceptable in systems programming. Force me to handle all the cases.
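One way this could look in C++17, as a minimal sketch: the failure shows up in the return type instead of as an exception or an abort. `try_get` and `get_or` are hypothetical names chosen for illustration.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper: returns the element, or std::nullopt when the index is
// out of range. Nothing is thrown and nothing aborts.
std::optional<int> try_get(const std::vector<int>& v, std::size_t i) {
    if (i >= v.size()) {
        return std::nullopt;
    }
    return v[i];
}

// The caller decides what the failure case means; here it picks a fallback.
int get_or(const std::vector<int>& v, std::size_t i, int fallback) {
    const std::optional<int> r = try_get(v, i);
    return r.has_value() ? *r : fallback;
}
```

Note that `std::optional` only makes the failure visible in the signature. Dereferencing an empty one is still undefined behavior, so this is weaker than a compiler-enforced "handle all the cases".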

▲imtringued 116 days ago

I'm not trying to be provocative, but I genuinely am not seeing what the benefit of this "doctrine" is if the end result is a bunch of if else statements that do nothing, but bubble the error up and then exit the process anyway.

Panicking is in no way different from say Undefined Behavior, with the exception that panicking tends to be "loud" and therefore fixed promptly.

▲AlotOfReading 116 days ago

There are systems banning exceptions that aren't allowed to crash. The software running cars and planes for example, or LLVM.

▲anon-3988 116 days ago

EDIT: HN formatting is absolutely horrendous so here's a pastebin https://pastebin.com/raw/BUJBAqc1

> but bubble the error up and then exit the process anyway.

What is so bad about this? It's only a problem because all languages are not ergonomic. Even Rust. Here's how to do it properly in my mind.

What I am proposing is not Rust, but what I wish Rust would have been.

First, you have functions that cannot fail

```
fn not_failable() -> Result<u32, ()> {
    return 0;
}
```

Notice that the function still returns Result<T,E> but the error is (). Now what happens if the function can fail?

```
fn failable(value) -> Result<u32, () | A | B | C | D> {
    if value == 0 {
        return Ok(0);
    } else if value == 1 {
        return Err(A());
    } else if value == 2 {
        return Err(B());
    } else if value == 3 {
        return Err(C());
    } else if value == 4 {
        return Err(D());
    }
}
```

Notice that we don't have to specify a type for the errors; they are just the union of all the error types that can possibly be returned by the function. This union could be inferred by the type system (meaning it can be omitted from the type signature for ergonomic purposes).

You might think that this is almost like exceptions. And you are right, but this is where exceptions went wrong: the user of this function.

When using this function, you are forced to handle all the possible error types (exceptions) returned by the function

```
fn use1() -> Result<u32, () | C | D> {
    match failable() {
        Ok(v) => {}
        Err(A) => {}
        Err(B) => {}
        e => return Err(e),
    }
}
```

Notice 3 things:

1. You are FORCED to handle all the possible exceptions.
2. You can specify which exceptions you want to handle and what you throw back. The difference from try/catch here is just the syntax.
3. The function signature can be deduced automatically from the fact that A and B are already handled, so this function can only throw C or D now.

Now you might complain about ergonomics: why can't things just blow up the moment anything bad happens? I propose a trait that will automatically be implemented for all Result<T,E>

```
impl Unwrap for Result {
    fn unwrap(self) -> u32 {
        match self {
            Ok(ok) => return ok,
            Err(err) => panic!("error"),
        }
    }
}
```

Which means that you can simply do this,

```
fn fail_immediately() -> Result<u32, ()> {
    return failable().unwrap();
}
```

Or, if you want to bubble up the errors, you can use ?

```
fn fail_immediately() -> Result<u32, A | B | C | D> {
    return failable()?;
}
```
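For comparison, the "forced to handle every error type" part can be loosely approximated in C++17 with `std::variant` and `std::visit`. This is only a sketch under assumptions: the tag types, `Outcome`, `Describe` and the fall-through to `D` for any other input are invented for illustration, and it does not reproduce the automatic narrowing of the signature to `C | D`.

```cpp
#include <string>
#include <variant>

// Hypothetical error tags, mirroring A, B, C, D from the pseudocode above.
struct A {};
struct B {};
struct C {};
struct D {};

// Either a value (int) or exactly one of the listed errors.
using Outcome = std::variant<int, A, B, C, D>;

Outcome failable(int value) {
    switch (value) {
        case 0: return 0;
        case 1: return A{};
        case 2: return B{};
        case 3: return C{};
        default: return D{};  // assumption: everything else maps to D
    }
}

// One overload per alternative. std::visit does not compile if any
// alternative of Outcome lacks a matching overload.
struct Describe {
    std::string operator()(int v) const { return "ok " + std::to_string(v); }
    std::string operator()(A) const { return "handled A"; }
    std::string operator()(B) const { return "handled B"; }
    std::string operator()(C) const { return "propagate C"; }
    std::string operator()(D) const { return "propagate D"; }
};

std::string use1(int value) {
    return std::visit(Describe{}, failable(value));
}
```

Deleting any one `operator()` from `Describe` turns the missing case into a compile error, which is the property the comment is asking for.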

▲eddd-ddde 116 days ago

That looks just like zig error handling to me. The only missing component being that errors themselves are just a tag without any more data content.

▲nemetroid 116 days ago

Ending your string literal with \n is much easier than using std::endl.

▲AlotOfReading 116 days ago

The comment was about bracket notation versus .at()

My only opinion on the \n versus std::endl discussion is that people set overly aggressive linting rules that always flag endl even when flush is intended.

▲tialaramex 117 days ago

Several programming languages can testify to the fact that a Benevolent Dictator For Life is not a panacea. Several more can testify that having a Committee to design the language is likewise not a panacea. Perhaps uniquely, C++ can clarify for us that both is in fact worse than either alone.

▲pjmlp 116 days ago

Same applies to C.

▲tialaramex 116 days ago

Do you see Brian and Dennis dominating WG14 meetings? Nope. They moved on, Bjarne Stroustrup never did. After his initial prototypes and his first book about C++ he's written lots more books and papers, he's lectured classes, he's given huge numbers of talks, all of them about his baby, C++. If you ask WG21 people directly they'll insist he's just one vote (ah yes, JTC1 consensus "voting") but for example WG21 says it will heed the "advice" of its Direction Group, a self-selecting handful of people which is dedicated to following advice from a book written by Bjarne and weirdly always giving exactly the same advice as Bjarne, which makes sense because its most notable member is Bjarne but this advice is signed "The Direction Group" ...

It's like being surprised that the UN Security Council keeps making decisions which favour Russia.

▲pjmlp 116 days ago

It doesn't change the fact C is equally a design by committee language with all the negativity it entails.

In fact, WG14 very clearly has acted against Dennis when he submitted papers that could have made C safer.

Maybe his fat pointers proposal was not good enough, but apparently it wasn't something worth improving upon either.

C's authors indeed moved on, first with Alef (which granted had a few design issues), then Limbo and finally Go, as C as driven by WG14 was no longer their thing; C on Plan 9 isn't even C89 compliant.

▲steveklabnik 117 days ago

Safe Rust has no undefined behavior. Unsafe Rust does.

▲vlovich123 117 days ago

cough std::env::set_var cough :D.

▲whytevuhuni 116 days ago

std::env::set_var [1] has already been changed to unsafe in the 2024 edition of the compiler [2].

So yeah, such things exist, but what's important is what the compiler devs choose to do once such issues are found. The C++ compiler devs say "That's an unfortunate case that cannot be fixed." The Rust devs say "That's a bug, here's the issue link."

[1] https://doc.rust-lang.org/std/env/fn.set_var.html

[2] https://doc.rust-lang.org/edition-guide/rust-2024/newly-unsa...

▲vlovich123 116 days ago

Yes I know all that. Just pointing out that “safe” Rust in practice can have unsoundness because of unsoundness in the underlying implementation. And yes it can be fixed but it did take quite a while to fix this one unfortunately (not that marking it unsafe meaningfully changes things)

▲steveklabnik 116 days ago

"software has bugs" isn't a particularly interesting thing. The point is, safe Rust has no UB. If mistakes are made, they're fixed.

▲vlovich123 116 days ago

I’m not arguing that purely Safe Rust has UB theoretically and I’m 100% sympathetic to the difficulty. I’m saying in practice unsoundness can creep in from the real world and even the std library is not immune from this. This is even ignoring unsoundness due to compiler bugs which are smaller issues for now (but will become more so as ecosystems lag on updating the compiler).

It doesn’t hurt anything to acknowledge that safe Rust can be unsafe due to mistakes in abstractions while simultaneously acknowledging that Rust still has orders of magnitude fewer memory safety issues than C/C++ even with these problems.

As for this specific bug, this kind of bug took a long time to fix for what it’s worth since it can take up to 3 years for a new edition to allow for fixing it. And “fixing” it doesn’t actually fix the unsoundness in existing code - it just changes the responsibility of who’s supposed to validate the usage is safe. It basically shifts the “blame” to the user for holding their tool wrong because code patterns that had been documented as being sound are now documented as being unsound and the user has to figure out how to make it safe once more.

I’m not trying to cast blame or aspersions - mistakes happen and that this was dealt with shows the strength of Rust’s ability to solve these problems vs c++ which is hopeless. But pretending like safe Rust exists purely in a vacuum devoid of interaction with the real world isn’t helpful I think.

I say all this with the utmost respect to your expertise and we basically agree on a lot of rust-related things. I just disagree slightly on the messaging here.

▲steveklabnik 116 days ago

> we basically agree on a lot of rust-related things.

I suspect we do too!

> It doesn’t hurt anything to acknowledge that safe Rust can be unsafe due to mistakes in abstractions

I think that the core disagreement here is not that I think that it's harmful, it just seems incredibly banal to me. That is, like, of course it can! So bringing it up feels like an attempt at a "gotcha" that's not really a gotcha.

▲ultimaweapon 116 days ago

To be fair, the UB caused by this function comes from the underlying C implementation, and this function is already marked as unsafe in the 2024 edition.

▲pjmlp 116 days ago

Older languages as well, those that weren't a copy-paste from C with extras.

Modula-2, Ada, Object Pascal, Eiffel, Delphi,...

▲almostgotcaught 117 days ago

you realize UB is basically an escape hatch from the standard for compilers right? it's not like a flaw in the language, it's gaps negotiated by the standards committee (well for the most part i guess). so the reason new languages don't have UB is because new languages don't have multiple implementations (go definitely doesn't, does anyone use rust-gcc?).

▲Dylan16807 117 days ago

You only need implementation-defined behavior to grease the wheels of multiple implementations. You don't need the gaping void of undefined for that use.

There's a big difference between "it'll be some number, not promising which one" and "the program loses definition and anything can break, often even retroactively".
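To make the "some number, not promising which one" side concrete, here is a small sketch of two implementation-defined properties that a program can query safely. The helper names are made up for illustration.

```cpp
#include <climits>
#include <cstddef>
#include <limits>

// Implementation-defined: the width of long differs between platforms, but
// every conforming implementation picks one value and documents it.
constexpr std::size_t long_bits() {
    return sizeof(long) * CHAR_BIT;
}

// Also implementation-defined: whether plain char is signed or unsigned.
constexpr bool char_is_signed() {
    return std::numeric_limits<char>::is_signed;
}
```

Whatever values these return, the program stays well-defined. By contrast, an expression such as `INT_MAX + 1` has no value to report at all, because evaluating it is undefined behavior.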

▲almostgotcaught 117 days ago

Potato potato. My point is UB isn't an accident it's intentional. Mind you I'm not saying it's great, just that it's not some kind of slipup.

▲maxlybbert 117 days ago

"Implementation defined" and "undefined" are different things.

On my laptop, sizeof(long) is 8; that's implementation defined. It could be different on my phone, or my desktop, or my work laptop.

Undefined means, roughly, "doing X is considered nonsensical, and the compiler does not have to do anything reasonable with code that does X." In "Design and Evolution of C++," Stroustrup says that undefined behavior applies to things that should be errors, but that for some reason the committee doesn't think the compiler will necessarily be able to catch. When he came up with new ideas for the language, he would often have to choose between making a convoluted rule that his compiler could reliably enforce, or a simple rule that his compiler couldn't always give a sensible error message for.

For instance, the original compilers relied on the system's linker. If the compiler could interact with the linker, it could perhaps detect violations of the One Definition Rule ( https://en.cppreference.com/w/cpp/language/definition ), but since the linker might have been written by a completely different company, and it's acceptable for different source files to be compiled by different compilers (and even be written in other languages -- https://en.cppreference.com/w/cpp/language/language_linkage ) and put together by the linker, and it's common for binary libraries to be sold without source, there's no guarantee that the compiler will ever have the information necessary to detect a violation of the One Definition Rule. So the committee says that violations create a nonsense program, which isn't required to behave in any particular way.

▲Dylan16807 117 days ago

Do you think my last sentence is describing things incorrectly? I don't really understand how you could take that depiction and call it "potato potato".

▲wolvesechoes 116 days ago

"go definitely doesn't"

What? gccgo, TinyGo, and GopherJS.

▲porridgeraisin 116 days ago

More prominently, microsoft go

▲hoseja 116 days ago

Don't let people performatively horrified of undefined behaviour know about Gödel.

▲zombot 116 days ago

The most horrifying aspect of UB is that it can affect your program without the instructions triggering it ever being executed. And many greenhorns don't know that or even believe it to be false. So the effects of Dunning-Kruger may be more severe in C(++) than in other languages.

▲maxlybbert 116 days ago

It’s not just “there’s a line in some file that’s undefined” partly because undefined behavior is often caused by the state of the world. Dereferencing a pointer is defined unless the pointer is invalid, and any particular pointer may be valid sometimes and invalid others.

But since a compiler can do anything when a program is ill-defined, if a line of code could be well-defined in some cases and ill-defined in others, a compiler is allowed to only handle the well-defined cases, knowing that it will do the “wrong” thing when something about the code is undefined (because in that case, there is no “right” or “wrong” behavior).

This does lead to weird things:

    auto val = *ptr;
    if (!ptr) {
        . . .
    }

The compiler can delete the “if” statement because the potential undefined behavior happens before the check. Either the pointer is valid when dereferenced (and the “if” statement gets skipped), or the pointer is invalid, and skipping the “if” statement is acceptable for “the compiler can do anything, even things that can’t be expressed in the language.” But it only does the weird things in cases where an invalid pointer would have been dereferenced.
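For contrast, a minimal sketch of the ordering that keeps the check meaningful: test the pointer first and dereference only afterwards. `read_or` is a hypothetical helper name.

```cpp
// Check first, dereference second. The null test runs before any load, so the
// compiler has no earlier dereference from which to infer ptr is non-null.
int read_or(const int* ptr, int fallback) {
    if (!ptr) {
        return fallback;
    }
    return *ptr;
}
```

Here `read_or(nullptr, -1)` returns -1 instead of invoking undefined behavior.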

▲guimplen 117 days ago

The first example (signed integer overflow) is no longer valid in newer standards of C. Now it should use two's-complement semantics and no UB.

▲Rusky 117 days ago

I believe they only standardized the two's-complement representation (so casts to unsigned have a more specific behavior, for example) but they did not make overflow defined.
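A quick sketch of the distinction, with made-up helper names: converting a signed value to unsigned is defined (the result is the value modulo 2^N), and unsigned arithmetic wraps, while the same addition done directly on `int` would still overflow into undefined behavior.

```cpp
#include <climits>

// Defined: conversion to unsigned yields the value modulo 2^N, so -1 becomes
// UINT_MAX.
constexpr unsigned to_unsigned(int x) {
    return static_cast<unsigned>(x);
}

// Defined: the addition is performed in unsigned arithmetic, which wraps.
// Writing a + b directly on int would be undefined when it overflows.
constexpr unsigned wrapping_add_bits(int a, int b) {
    return static_cast<unsigned>(a) + static_cast<unsigned>(b);
}
```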

▲LegionMammal978 117 days ago

Yeah, signed integer overflow is as UB as ever. I've heard the primary reason for it is to avoid the possibility of wraparound on 'for (int i = 0; i < length; i++)' loops where the 'length' is bigger than an int. (Of course, the more straightforward option would be to use proper types like size_t for all your indices, but it's a classic tradition to use nothing but char and int, and people judge compilers based on existing code.)

▲ForTheKidz 117 days ago

ptrdiff_t is also useful in this case if signed semantics are desired.

▲vlovich123 117 days ago

> I've heard the primary reason for it is to avoid the possibility of wraparound on

Making it UB doesn’t fix that in any way that I can think of.

▲anttihaapala 116 days ago

What it means is that since i as the variable is monotonically increasing, an array indexing operation that is in the loop body can be replaced with an incrementing pointer instead, which eliminates quite a lot of code. An example here: https://pvs-studio.com/en/blog/posts/cpp/0374/
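Roughly, the transformation looks like going from the first function to the second. This sketch sidesteps the overflow question by using `std::size_t` for the index; both function names are invented for illustration.

```cpp
#include <cstddef>

// Index form: the address data + i is recomputed from the index each time.
long long sum_indexed(const int* data, std::size_t length) {
    long long total = 0;
    for (std::size_t i = 0; i < length; ++i) {
        total += data[i];
    }
    return total;
}

// Pointer-walking form: a single pointer is incremented, which is the shape
// an optimizer would like to turn the loop above into.
long long sum_walked(const int* data, std::size_t length) {
    long long total = 0;
    for (const int* p = data; p != data + length; ++p) {
        total += *p;
    }
    return total;
}
```

With a 32-bit `int` index on a 64-bit target, the rewrite is only valid if the index never wraps, which is what treating signed overflow as UB lets the compiler assume.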

▲pajko 116 days ago

UB can be converted to ID by using -fwrapv (to "standardize" the wraparound, which does not necessarily help if the overflow was not intentional) or -ftrapv (generate an exception).
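If changing compiler flags is not an option, the overflow can also be ruled out in the source before it happens. A minimal portable sketch, where `checked_add` is a hypothetical helper rather than a standard function:

```cpp
#include <climits>
#include <optional>

// Returns a + b, or std::nullopt if the mathematical result does not fit in
// an int. The range tests never overflow themselves.
std::optional<int> checked_add(int a, int b) {
    if (b > 0 && a > INT_MAX - b) {
        return std::nullopt;  // would exceed INT_MAX
    }
    if (b < 0 && a < INT_MIN - b) {
        return std::nullopt;  // would go below INT_MIN
    }
    return a + b;
}
```

Unlike -fwrapv, this reports the overflow to the caller instead of silently wrapping, at the cost of two extra comparisons per addition.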

▲fsckboy 117 days ago

my opinion as a very experienced C system programmer:

there must be better sources to guide people than a poorly written and infantilizing article from 15 years ago.

▲jcranmer 117 days ago

My experience is that self-described "very experienced C system programmers" are simultaneously the people who are most in need of a good explainer on undefined behavior and the most likely ones to throw a conniption fit halfway through and stop reading, for the hallmark of a good explainer on UB is that it will explain that a) it exists for a reason; b) no, just doing a "little" UB isn't safe; and c) it's not the compiler's fault that things go awry when you do UB, it's the programmer's fault.

One of the blog posts I've long queued up for writing is "In defense of undefined behavior." It's only half-written, but the gist is justifying UB by pointing out that you can't optimize C code without it (via an example using pointer provenance), then pointing out why uninitialized values look weirder than you think by reference to the effects of system libraries, and then I would actually walk through why specification authors should reach for undefined behavior in various places.

▲raphlinus 117 days ago

Oh hey, I also have "in defense of undefined behavior" in the queue of blog posts I'd like to write some time, with that exact title. What a coincidence. That said, it's unlikely to get written as I have things that are more specific to my actual research ahead of it.

One of the things I'd want to say is that UB is a useful and accurate way to model what happens when, say, a program writes over memory used by the allocator. Languages like Odin might try to pretend they don't have UB, but in my opinion it's impossible to get there just by disabling certain compiler optimizations (see https://news.ycombinator.com/item?id=32800814 for an argument about this).

I see UB as essentially a proof obligation, to be discharged in some other way. A really good way is to have UB in the intermediate representation, and compile a safe language into it (with unsafe escape hatches when needed). But there are other ways, including formal methods, rigorous testing, or just being a really smart solo programmer who's learned how to avoid UB and doesn't have to work in a team.

Feel free to send me your draft.

▲vlovich123 117 days ago

Raph, I think you may be using a different definition of UB than what compiler authors are using? As I understand it in the language sense of the word, UB technically allows the compiler to interpret the code however it wants. To me, getting utility out of UB means relying on some kind of well-defined behavior resulting, which would imply that you are either just relying on today’s behavior OR doing something that’s non-deterministic but not violating language rules? Or some intermediate definition where it violates language rules but no future version of the compiler is likely to be able to detect the UB and change behavior?

UB is very useful for compiler authors because they can apply very useful optimizations with “illegal” code and then emit illegal code constructs when they want those optimizations to apply. I have a hard time understanding how that’s useful to language users though.

▲raphlinus 116 days ago

The argument I have in mind is subtle and nuanced, and I didn't write clearly in that comment (the bit about the smart solo programmer was mostly sarcasm but with a grain of truth). But to try to answer:

The value of UB is to clearly document what the obligations are for valid programs. It's not valuable to indiscriminately expose that to programmers at scale without some mechanism to discharge those obligations. I don't think C's choices for UB are defensible in a modern world, and for part of that evidence see how many misconceptions there are in this thread (just to pick one, that at least some people think the move to two's complement means that signed overflow is no longer UB). On the other hand, unsafe Rust's choice to include more UB (aliasing a mutable reference is UB in unsafe Rust but not UB in C) is defensible, as it makes the whole system safer. And Odin's approach (claiming there's no UB when there actually is UB) is even worse from a "clear communication" perspective.

But maybe I should actually write the blog post some time.

▲AlotOfReading 116 days ago

    The value of UB is to clearly document what the obligations are for valid programs.

I think an argument can be made for C (where annex J exists), but this definitely doesn't apply to C++. Hence the long languishing of P1705.

I would also disagree for C though, because annex J can only cover explicit UB by definition and there's the entire category of implicit UB outside that.

▲vlovich123 116 days ago

Ah gotcha. Yeah it’s useful from a language perspective.

One thing I wish unsafe Rust would do, given that there’s more potential for UB in unsafe Rust than in C/C++, would be some way to enforce more compiler checks against UB you didn’t intend to stumble across but I don’t know how that would work or if it’s even possible. But Rust unsafe is decidedly sharper than C++.

▲pjmlp 116 days ago

Using protective gear, or making cars safer for crashes, also slows down physics versus not using them at all, yet lives are saved every year where people would otherwise die or be crippled.

As someone who rather prefers the Wirth culture of programming languages, UB at the expense of safety isn't a clear win; that is why we end up with security exploits or hardware mitigations for UB-based optimizations gone too far.

▲fsckboy 116 days ago

>most likely ones to throw a conniption fit halfway through and stop reading

listen to you, you didn't read the article it's clear, because the article doesn't agree with the rest of what you said. So, since you're not defending an article, you're just attacking me, which is pretty schmucky

▲jcranmer 116 days ago

I did read the article. I have my disagreements with John Regehr on UB, though, as he's definitely in the camp of "let's try to specify it out entirely from the compiler," which doesn't entirely work.

▲staunton 117 days ago

Being a very experienced programmer, I'm sure you know many such sources. Can you share any?

▲camel-cdr 116 days ago

The C standard Annex J has a list of undefined behavior: https://port70.net/~nsz/c/c99/n1256.pre.html#J

▲AlotOfReading 116 days ago

Annex J is a list of explicit undefined behavior. It can't and doesn't attempt to enumerate the vastly larger universe of implicit undefined behavior.

There's also no official list for C++, just a proposal to make one that's languished in committee for the past 6ish years.

▲rberg 117 days ago

Agreed. Running with a basketball is very much possible, I'm unsure as to why John thinks otherwise.

▲ultrarunner 117 days ago

Perhaps you could draw on your wealth of experience to write one. I’d love to read it!

▲imtringued 116 days ago

In my opinion it's not infantilizing enough. If you are a C developer and have never heard of model checking, then you are grossly incompetent and should never be allowed near a computer.

▲pjmlp 116 days ago

If only folks would write code in a way that made that infantilizing article from 15 years ago less relevant than ever.

▲mwkaufma 117 days ago

Despite the high frequency with which the alarmist "formatting your HDD" is cited in discussing UB, I've never seen it happen. Surely there exist real examples of catastrophic failures which could actually teach us something, beyond making a hyperbolic point.

▲vlovich123 117 days ago

It was intentionally hyperbolic tongue in cheek and understood to be as such. The reason is to fight through the discounting people (at least at the time) had that UB was just a segfault or something. Here’s a kernel exploit that was a result of UB [1]. It’s not hard to imagine that hypothetically UB in the kernel could result in “just so” corruption that would call the “format your HDD routine” even if in practice it’s extremely unlikely (& forensically it would basically be impossible to prove that it was UB that caused it).

[1] https://lwn.net/Articles/342330/

▲AlotOfReading 117 days ago

One example is control flow bending [0], which uses carefully crafted undefined behavior to "bend" the CFG into arbitrary, turing complete shapes despite CFI protections. The author abused this to implement tic tac toe in a single call to printf for a prior obfuscated C contest [1].

Of course, that misses the real point that "formatting your HDD" is simply an allowed possibility rather than a factual statement on the consequences.

[0] https://www.usenix.org/conference/usenixsecurity15/technical...

[1] https://www.ioccc.org/2020/carlini/index.html