Design for Failure: Your software should break

Don't overthink your Error Handling, it's simple.

I've been preaching Design for Failure for a very long time. And I'm still holding on to it but you gotta be careful what you wish for. Many people overdo it and make their development lifecycle harder than it should be.

What is Design for Failure

DFF is a principle that applies to the UI and the Code alike: Make sure errors are handled. This principle doesn't originate in Software. Let me give you an example from the automotive sector: Some cars offer an option to mechanically decouple the steering wheel so you don't interfere with the driving assistant. Now that sounds scary until you understand how it works: It can only stay decoupled as long as the electronic assistant is alive. If the assistant stops working for whatever reason, the permanent signal that's holding back (!) the steering wheel is interrupted and the wheel naturally snaps back into place - this is Design for Failure. It doesn't matter why it fails, the only thing that matters is getting control of the car back.
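To make the steering wheel idea a bit more tangible, here is a minimal sketch in code. Everything in it (the `SteeringCoupling` class, the timeout, the fake clock) is made up by me for illustration and has nothing to do with real automotive software. The point is only the shape: the unsafe state has to be actively kept alive, and silence means going back to the safe state.

```javascript
// Hypothetical model of the steering example: decoupling is only held
// while a fresh "alive" signal from the assistant exists.
class SteeringCoupling {
  constructor(timeoutMs, now = () => Date.now()) {
    this.timeoutMs = timeoutMs; // how long one heartbeat stays valid
    this.now = now;             // injectable clock, handy for testing
    this.lastHeartbeat = null;  // no signal yet means: coupled
  }

  // The assistant calls this regularly for as long as it is healthy.
  heartbeat() {
    this.lastHeartbeat = this.now();
  }

  // Decoupled only while the signal is fresh. Any failure (crash, power
  // loss, frozen process) ends in the same safe state: coupled.
  isDecoupled() {
    if (this.lastHeartbeat === null) return false;
    return this.now() - this.lastHeartbeat <= this.timeoutMs;
  }
}

// Usage with a fake clock so the example is deterministic
let fakeTime = 0;
const steering = new SteeringCoupling(100, () => fakeTime);
steering.heartbeat();
console.log(steering.isDecoupled()); // true
fakeTime = 250; // the assistant went silent
console.log(steering.isDecoupled()); // false
```

Note that there is no code path that asks why the heartbeat stopped. It doesn't have to.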

My Vacuum Robot had bad DFF

I don't like being spied on sitting on the toilet. You don't have to be paranoid, you just have to be realistic. Especially as a Programmer, you should know that things that can spy on you will spy on you - every big company has been hacked and they will be hacked again in the future. Nowadays I own a Valetudo-based robot - it's awesome ❤️.

However, before that, I owned an expensive Vacuum Robot from a green, German brand claiming to have proper data protection (and especially not a camera built-in but lasers).

That thing's Error Design drove me nuts - and I mean that. I've even written the company multiple E-Mails offering them consultancy in Software Architecture but I eventually gave up and they didn't even respond to my last mail. It was the definition of "Hardware that is causing Mental Health Problems".

From a coding perspective it was clear that it had to follow some kind of logic like this:

if (SENSOR_BOX_FULL) {
  console.log("Please clean the container");
} else if (SENSOR_LASER_BLOCKED) {
  console.log("Something is blocking the laser");
}
// else if ........ whatever
else {
  // finally, the catch-all case!
  console.log("I am stuck");
}

That meant that every time there was an error that wasn't specific enough it just said something like "I am stuck". Yeah, right. Stuck in what, with what? Definitely not stuck-stuck because it is free roaming on the floor.

Imagine this company built manned space rockets and there was a leak with massive pressure loss: "Space Rocket is stuck".

It gets worse: The Notifications shown on the Smartphone apparently weren't using the same code logic as the App Code (like wtf?). That meant: My Notification Center showed different error messages than the App did after I tapped the notification, e.g.

  • Notification Center: Clean the Brush

  • App: I am stuck

Needless to say that the brush was new and that it wasn't stuck. I've checked every single micrometer of that piece of ****.
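I obviously don't know what their code looked like, but the usual cure for two channels disagreeing is one shared place where error codes become text. A minimal sketch, where all codes, texts and function names are invented by me:

```javascript
// Hypothetical error catalog: the single place where codes become text.
const ERROR_MESSAGES = new Map([
  ["BOX_FULL", "Please empty the dust container"],
  ["LASER_BLOCKED", "Something is blocking the laser"],
  ["BRUSH_BLOCKED", "Clean the brush"],
]);

// Both channels call this function, so they cannot disagree.
function messageFor(code) {
  // Unknown codes stay visible instead of turning into "I am stuck".
  return ERROR_MESSAGES.get(code) ?? `Unknown error (code: ${code})`;
}

// What the phone's notification center would get
function toNotification(code) {
  return { title: "Vacuum robot", body: messageFor(code) };
}

// What the app screen would get
function toAppScreen(code) {
  return { headline: messageFor(code), code };
}

console.log(toNotification("BRUSH_BLOCKED").body); // Clean the brush
console.log(toAppScreen("BRUSH_BLOCKED").headline); // Clean the brush
```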

Okay, but where are you going with this Vacuum Sample, David?

Error messages are supposed to help. They are not necessarily supposed to solve a problem. E.g.: If your internet is down, knowing that it is down will not solve your problem. And if the pressure in your space rocket is wrong, being told so won't solve the problem either. But it will help you try the right things and call the right people.

This also means: If an error cannot be narrowed down to exactly one cause, you must provide enough details - even if they won't help solve the problem immediately. Even if that means showing "Error X1230900 Stack Overflow In Line 33:20". Because even if this reads obscure, it is way more helpful than a robot telling you "I am stuck" even though it is not stuck. With the obscure message you can at least tell support the exact error message and they will be able to follow up.
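Here is how the catch-all case could look if it followed that rule. This is a sketch under my own assumptions: the sensor names and the `E-UNKNOWN` code are hypothetical, the only thing that matters is that the last branch reports facts instead of guessing.

```javascript
// Hypothetical sensor snapshot -> message. All names are made up.
function describeFailure(sensors) {
  if (sensors.boxFull) return "Please empty the dust container";
  if (sensors.laserBlocked) return "Something is blocking the laser";

  // Catch-all: no guessing, just the raw facts support will ask for.
  const details = Object.entries(sensors)
    .map(([name, value]) => `${name}=${value}`)
    .join(", ");
  return `Unknown error E-UNKNOWN. Please send this to support: ${details}`;
}

console.log(
  describeFailure({ boxFull: false, laserBlocked: false, wheelCurrentMa: 900 })
);
```

It reads ugly, and that's fine. Ugly and precise beats friendly and wrong.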

There is a pretty clear path on how to design for errors

From my own experience, there are 3 (4) essential types and I will show the software design in human-readable samples.

  1. Recoverable Errors: The Vacuum Robot can't clean anymore because the box is full of dirt. This is solvable by emptying the box. The Software should do a couple of things here:

    • Notify on the App ("Clean Me")

    • Wait a reasonable amount of time in place (e.g. 5-8 Minutes) to let the person empty the box.

      Expect the person not to be at home and provide a fallback after the waiting time for the robot to drive back to the station with a full box. This is super-useful because the Robot won't run out of battery and will be ready to continue once the box is emptied.

    • The aforementioned means also saving the position where the robot was and how much it cleaned right before it drove back to the station.

    • In Web applications recoverable errors go hand in hand with good UI Design: Couldn't save data? Don't ask the user to retry. Auto-retry 2 to 3 times before annoying the user and if it still fails tell the user (but then it falls into the definition of 2/3).

  2. Unrecoverable but known Errors: Your coffee machine detects that there is water in the water tank but it can't pull water:

    • Assuming there is an LCD but no App for the Coffee Machine: Show the exact error code.

    • There is no display? No problem. There certainly are some LEDs somewhere. Abuse one or more of them to give indications about the error (e.g. 2 LEDs are permanently blinking)

  3. Unknown Errors: In our interconnected world it can very well be that your software cannot determine exactly why something isn't working. When I am building Software as a Service it will most likely use an external provider for Login (e.g. Auth0, Pocketbase, Supabase, ...). Now such providers often rely on other providers themselves, e.g. for hosting data or for connecting to other services (Apple Login, Google Login, ...). So we have a lot of potential places that could fail.

    There is a simple solution to handle these errors: Your task is to be able to help identify the issue. This is why it's crucial that your software logs every action up to and including the error to some kind of log-store, e.g.

    1. log(loginUserData, currentTime())

    2. log('Clicked on Start', currentTime())

    3. log("Caught Error, couldn't start", Error, currentTime())

Now some people argue: But what if the internet connection is down and you cannot send logs to the server? You don't have to! When the error happens, you show a modal to the user which says something like:  
"Copy this message and send it to support."

The user doesn't care. The user just wants help and the user will get help by contacting you with the full error log. We implemented this in a way that "hides" this from non-technical users by using `base64` encoding which comes down to something like copying `=YWZkcztmZGF...` (simply do not confuse the user with an interpretation of messages)

  4. Errors that aren't errors: When I opened the lid to clean the box, my vacuum cleaner gave constant annoying beeps. Same when I simply tried to move it to a different place in idle mode. Think of a more common example: Cars. Imagine that when you open the car door, the car starts telling you in very annoying ways that the door is open: BEEP BEEP. Even if you're just loading or unloading it, it will constantly beep. That makes total sense when I'm trying to drive but not when it's parked. So stop adding error handling where error handling isn't needed in the first place. If you feel like you missed something: Give a beta out to real users - #usertests.
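The auto-retry from type 1 and the copyable log from type 3 fit together nicely, so here is a rough sketch of both. Treat it as an illustration, not as our actual implementation: the function names are mine, the retry is synchronous to keep it short (a real save would be asynchronous and would probably wait between attempts), and `Buffer` only exists in Node.js, so a browser would need a different base64 encoder.

```javascript
// In-memory action log; it works even while the network is down.
const actionLog = [];

function log(message, details = null) {
  actionLog.push({ time: new Date().toISOString(), message, details });
}

// Retry a (hypothetical) operation a few times before bothering the user.
function withRetries(operation, maxAttempts = 3) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return operation();
    } catch (error) {
      lastError = error;
      log(`Attempt ${attempt} of ${maxAttempts} failed`, String(error));
    }
  }
  throw lastError; // no silent fallback: the caller has to show the error
}

// Pack the whole log into one opaque string the user can copy.
function buildSupportCode() {
  return Buffer.from(JSON.stringify(actionLog), "utf8").toString("base64");
}

// Support decodes the string again to read what happened.
function decodeSupportCode(code) {
  return JSON.parse(Buffer.from(code, "base64").toString("utf8"));
}

// Usage: one user action, then an operation that keeps failing
log("Clicked on Start");
try {
  withRetries(() => {
    throw new Error("Server Error 503");
  });
} catch (error) {
  console.log("Copy this message and send it to support:");
  console.log(buildSupportCode());
}
```

The user only ever sees the one string. Whoever receives it runs `decodeSupportCode` and gets the full timeline, including every failed attempt.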

The Misconception of catching errors

There is a misconception out there: Developers think they have to catch errors and then "gracefully continue" in whichever way - I don't know where people learn this but this doesn't make sense.

A concept that is available in many languages is try-catch. Check this pseudo code:

try {
  basically.all.your.software.runs.here
} catch (Error) { // <- this "Error" contains information about what happened (e.g. "Server Error 503")
  whoopsie.something.bad.happened
}

It makes sure that your application doesn't just quit on failure - which would be even worse because then you cannot do any kind of error handling anymore.

However, inside this catch you have to consider one of the 3 error scenarios from above. So e.g.:

try {
  // ... blabla
} catch (Error) {
    showErrorDialogToUserSoUserCanCopyPasteAndMailMe(Error);
}
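If you want the catch to tell the 3 scenarios apart, the error itself has to carry that information. One possible way to do it, with error classes and action names that I made up for this sketch:

```javascript
// Hypothetical error classes, one per scenario from the list above.
class RecoverableError extends Error {}

class KnownError extends Error {
  constructor(code, message) {
    super(message);
    this.code = code; // e.g. the code shown on the coffee machine's LCD
  }
}

// Decide what to do with a caught error. Returns a plain description so
// the UI layer (not shown here) can act on it.
function handleError(error) {
  if (error instanceof RecoverableError) {
    return { action: "retry", message: error.message };
  }
  if (error instanceof KnownError) {
    return {
      action: "show-code",
      message: `Error ${error.code}: ${error.message}`,
    };
  }
  // Unknown: do not guess, hand over everything we have.
  return {
    action: "show-support-dialog",
    message: String(error?.stack ?? error),
  };
}

// Usage
try {
  throw new KnownError("E21", "Cannot pull water");
} catch (error) {
  console.log(handleError(error).message); // Error E21: Cannot pull water
}
```

Notice that none of the three branches continues as if nothing happened.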

This isn't as obvious as you think

Some developers would think now: That's very obvious, ain't it? If you think so then please go back to your code and tell me if you handle the error or if you try to gracefully gloss over it. Because the latter is what I usually see. And this is really bad as it means covering up errors until errors happen that are even worse.

Let me give you a sample:

let someData;
try {
  someData = loadSomeDataFromServer();
} catch {
  someData = []; // empty data
}

This is bad. Any error happening in your try block will just be gracefully caught and you continue the application. To put it very clearly: This is the exact same thing as having a massive leak in your roof, getting a lot of water in your room every time it rains, and doing nothing but removing the water every single time.

The application will still work because you provide it with fallback data. Now where's the benefit for the user? The error is not resolved and the data is empty. How is the user benefiting from empty data (which also makes the application unusable) rather than an error that will at least let the user know why the app is unusable?
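For contrast, here is one way the same sample could look without the silent fallback. It is a sketch with assumptions: the loader is passed in as a parameter so the example runs on its own, and `render` just returns a string where a real app would draw UI. What matters is that "no data" and "loading failed" are two different states.

```javascript
// loadSomeDataFromServer is passed in; it stands for any call that can fail.
function loadForView(loadSomeDataFromServer) {
  try {
    return { status: "ok", data: loadSomeDataFromServer() };
  } catch (error) {
    // Keep the error instead of replacing it with [].
    return { status: "error", error };
  }
}

// The view can now tell an empty list from a failed request.
function render(state) {
  if (state.status === "error") {
    return `Could not load your data: ${state.error.message}`;
  }
  if (state.data.length === 0) {
    return "You have no entries yet.";
  }
  return `${state.data.length} entries loaded`;
}

console.log(render(loadForView(() => []))); // You have no entries yet.
console.log(
  render(
    loadForView(() => {
      throw new Error("Server Error 503");
    })
  )
); // Could not load your data: Server Error 503
```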

The Misconception of Fallback Data

Bad UX is when you gracefully, silently catch errors because then the user is even more confused about why the app isn't working properly.

I also want to give you another perspective here: Setting default fallbacks with good intentions (shadowing errors that should be errors) can be very harmful and time-consuming. Imagine a few people leaving the company, some new people joining, things seem to work fine, and people start working on the project. They run it on the internal servers: All good.

Now things are supposed to go to production: Nothing works. For the internal servers it worked because somewhere in the project there is a file with something like

if (!config) {
  // fallback
  config = someDefaultConfigThatOnlyWorksOnInternalServers;
}

Without that fallback you would end up with a more consistent solution: Defining configs per environment. And people would learn about it very early on because it's the same requirement on all Environments, no matter if Production or not.

And now comes the worst part about fallbacks: Partial (unintended) fallbacks. I had that in my own project and it's a severe problem: You set up your config but you missed 1 value. Congratulations, now you have fallback values mixed up with Production.
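A sketch of what "no fallback" can look like in practice: check the config once at startup and break loudly, naming every missing value. The key names below are hypothetical, pick whatever your project actually needs.

```javascript
// Hypothetical list of keys every environment has to define.
const REQUIRED_KEYS = ["apiUrl", "authDomain", "logLevel"];

function loadConfig(rawConfig) {
  if (!rawConfig) {
    throw new Error("No config provided for this environment");
  }
  const missing = REQUIRED_KEYS.filter(
    (key) => rawConfig[key] === undefined || rawConfig[key] === ""
  );
  if (missing.length > 0) {
    // Break at startup and name every missing value. No defaults are
    // mixed in, so a partial config can never reach Production.
    throw new Error(`Config is missing: ${missing.join(", ")}`);
  }
  return Object.freeze({ ...rawConfig });
}

// Usage: one value was forgotten
try {
  loadConfig({ apiUrl: "https://api.example.com", authDomain: "example.com" });
} catch (error) {
  console.log(error.message); // Config is missing: logLevel
}
```

The new colleague hits this error on day one, on the internal server, and not on release day.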

Conclusion

Let your software break. It must break. Things must break to be able to get them fixed. Yeah, let that sink in. And that doesn't imply at all that it will be a painful experience for the user. Humans deal with errors all their life. But in the non-technical era they didn't have as many frustrating errors. Your bed broke? Not cool but the error is clear: It's broken. Your pencil is dried out? That's a clear one as well. Water Pipe clogged? Call a plumber.

But a Website saying "Something went wrong, please try again" is the most frustrating thing as there is not a single thing the user is able to do. The user can't give you detailed error information, the user doesn't know if it makes sense to restart/reload, the user knows nothing and is indeed psychologically frozen in that state which can cause extreme frustration.