The Dark Side of Error Logging

Today we have another guest post by Phillip Johnston about error logging. Phillip is an embedded systems consultant specializing in bare-metal firmware development in C and C++. Embedded Artistry is a website dedicated to teaching others about C, C++, embedded systems, and development processes. You can contact him via email, or on Twitter.

Logging Horrors

I was recently reading Timeless Laws of Software Development by Jerry Fitzpatrick when I encountered a mind-jarring section about error logging and its downsides (page 45, “Minimize Error Logging”). I have been a long-time proponent of error logging, and I’ve written many embedded logging libraries over the past decade.

While I initially was skeptical of Fitzpatrick’s attitude toward error logging, I started paying closer attention to the log files I was working with as well as the use of logging in my own code. What I noticed wasn’t pretty.

This quote from a Stack Exchange post neatly summarizes my general experience with error logs:

I recently had to investigate a field issue for our large enterprise application. I was horrified by the logs that I had to comb through in an attempt to find the problem and at the end of the day the logs did not help at all identifying/isolating the bug.

A Recent Experience

I experience this horror on a regular basis. Consider this simple log sample captured on the Snapdragon Flight drone:

[08500/02]  03:03.626  HAP:4133:I2C_Write failed: 285212872 num_bytes_written 0  0414  i2c.c
[08500/00]  03:03.626  HAP:4133:41100 Error: i2c write failed. Reported -1 bytes written   0270  I2CDevObj.cpp
[08500/02]  03:03.626  HAP:4133:I2C_Write failed: 285212872 num_bytes_written 0  0414  i2c.c
[08500/00]  03:03.626  HAP:4133:41416 Error: i2c write failed. Reported -1 bytes written   0270  I2CDevObj.cpp
[08500/02]  03:03.627  HAP:4133:I2C_Write failed: 285212872 num_bytes_written 0  0414  i2c.c
[08500/00]  03:03.627  HAP:4133:41731 Error: i2c write failed. Reported -1 bytes written   0270  I2CDevObj.cpp
[08500/02]  03:03.627  HAP:4133:I2C_Write failed: 285212872 num_bytes_written 0  0414  i2c.c
[08500/00]  03:03.627  HAP:4133:42060 Error: i2c write failed. Reported -1 bytes written   0270  I2CDevObj.cpp
[08500/02]  03:03.627  HAP:4133:I2C_Write failed: 285212872 num_bytes_written 0  0414  i2c.c
[08500/00]  03:03.627  HAP:4133:42400 Error: i2c write failed. Reported -1 bytes written   0270  I2CDevObj.cpp
[08500/02]  03:03.628  HAP:4133:I2C_Write failed: 285212872 num_bytes_written 0  0414  i2c.c
[08500/00]  03:03.628  HAP:4133:42728 Error: i2c write failed. Reported -1 bytes written   0270  I2CDevObj.cpp
[08500/02]  03:03.628  HAP:4133:I2C_Write failed: 285212872 num_bytes_written 0  0414  i2c.c
[08500/00]  03:03.628  HAP:4133:43037 Error: i2c write failed. Reported -1 bytes written   0270  I2CDevObj.cpp
[08500/02]  03:03.628  HAP:4133:I2C_Write failed: 285212872 num_bytes_written 0  0414  i2c.c
[08500/00]  03:03.628  HAP:4133:43346 Error: i2c write failed. Reported -1 bytes written   0270  I2CDevObj.cpp
[08500/02]  03:03.630  HAP:4133:I2C_Write failed: 285212872 num_bytes_written 0  0414  i2c.c
[08500/00]  03:03.630  HAP:4133:43665 Error: i2c write failed. Reported -1 bytes written   0270  I2CDevObj.cpp
[08500/02]  03:03.630  HAP:4133:I2C_Write failed: 285212872 num_bytes_written 0  0414  i2c.c
[08500/00]  03:03.630  HAP:4133:43993 Error: i2c write failed. Reported -1 bytes written   0270  I2CDevObj.cpp
[08500/02]  03:03.630  HAP:4133:I2C_Write failed: 285212872 num_bytes_written 0  0414  i2c.c
[08500/00]  03:03.630  HAP:4133:44343 Error: i2c write failed. Reported -1 bytes written   0270  I2CDevObj.cpp
[08500/02]  03:03.630  HAP:4133:I2C_Write failed: 285212872 num_bytes_written 0  0414  i2c.c
[08500/00]  03:03.630  HAP:4133:44660 Error: i2c write failed. Reported -1 bytes written   0270  I2CDevObj.cpp
[08500/02]  03:03.630  HAP:4133:I2C_Write failed: 285212872 num_bytes_written 0  0414  i2c.c
[08500/00]  03:03.630  HAP:4133:44968 Error: i2c write failed. Reported -1 bytes written   0270  I2CDevObj.cpp
[08500/00]  03:03.630  HAP:4133:44999 Initialized 1 Sensor (max supported: 16)   0130  sensor.cpp
[08500/00]  03:03.630  HAP:4133:45031 --- Sensors Detected ---   0133  sensor.cpp
[08500/00]  03:03.630  HAP:4133:45056 Sensor 0: switchAddr=0x70, switchChannel=0    0139  sensor.cpp

If you’re a new developer working on this project for the first time, or a client who is testing my software, it is reasonable to think that something terrible has happened. Since the log file looks so concerning, you might spend an hour or two investigating these errors before realizing that these I2C write failures are expected and acceptable.

This particular program supports dynamic detection of connected sensors using an I2C write. If the sensor is not connected, the write will fail and we will move on to the next slot. Even worse, the same I2C error spew occurs if you try to run an I2C bus sweep to detect connected devices. The log becomes practically worthless in such a situation, as each I2C sweep call would generate up to 128 I2C error prints. The obvious answer is to disable the log statement, but the underlying I2C layer was provided to us as a pre-compiled library.

This example is hardly a shining beacon of useful error logging, but is indicative of typical logging experiences.

The Dark Side of Error Logging

Logging is a seductive technique. We cannot always debug failures when they occur, and customers and teammates offer us vague descriptions of the problems they run into. Logging is the natural step to bridge this gap. We can create an impartial reference that allows us to reconstruct the sequence of events leading to an error. Once that ball is rolling, why not log as much information as possible? You never know what will be useful, especially when you’re trying to debug a tricky 1-in-100,000 problem.

Consider this: how many times have you successfully identified and fixed a critical issue using existing log statements? I can’t think of a single case where existing log statements were useful enough to debug a problem – and in such cases, the natural reaction is to add additional logging.

Just because a technique is cherished, doesn’t mean it’s wise.
–Jerry Fitzpatrick

Disadvantages

One of the primary disadvantages of error logging is that it adds additional code, and we want to minimize the code we write and maintain. We typically add conditional logic or exception handling blocks in order to detect and log errors. Logging code is unattractive and breaks up the flow of a function, making code more difficult to parse at a glance. Collectively, these factors contribute to an increase in our program size and clutter.

Logging isn’t free. There is always a runtime and log storage cost. In some cases, especially on embedded systems, log statements can create or hide timing-related bugs. Even if logging calls are disabled in release builds, the error detection logic is rarely disabled. Complex inputs to logging functions may still be evaluated even if the results won’t be logged.

Log files are often ignored by developers and users. If a log file indicates an obvious failure or bug, someone has to actively look at the log to notice it. This is why we should prefer assertions and unit tests: they will catch and report failures in an automated fashion, ensuring errors will be fixed quickly.

Once a log statement has been added, it is rarely removed. Many log statements become irrelevant or misleading due to code changes. Our logs end up full of red herring statements that send us chasing false errors or unrelated problems. Teams will fight back against removing statements with vague arguments such as, “You never know when we might need that information”.

Software maintenance is a fight against disorder and clutter, and our log files are no different. Without active push-back against over-logging, our log files become unapproachable, indecipherable, and send us on snipe hunts.

Common Anti-Patterns

Aside from general disadvantages with error logging, I’ve observed a variety of anti-patterns across the numerous code bases I’ve worked on. The three most common are:

  1. Log-and-throw
  2. Logging unexceptional or uninteresting failures
  3. Senseless log statements

For a more detailed dive into exception handling anti-patterns, check out this post by Oracle.

Log-and-throw

The most common (and most annoying) anti-pattern is log-and-throw, which is applicable both to error logging and exception handling.

Consider this exception example:

try
{
  someFunction();
}
catch (…)
{
  log_error("Exception occurred!");
  throw;
}

Or this return example:

int r = someFunction();
if(r < 0)
{
   log_error("SomeFunction returned error code: %d\n", r);
   return r;
}

The general flow of both cases is to check for an error or exception, log that it occurred, and then re-throw the exception or return the error. In most cases, the caller of the function will detect an error, log it, and re-throw or return. This often happens all the way up the call stack. Each additional error check and log statement adds code bloat and duplicated information in the logs.

If we cannot handle an exception, we should not catch it. Additionally, we should either log the error or throw/return it, but never both. This ensures that each error is logged only once.

If you’re looking for a quick way to reduce the lines of code in your programs, eliminate duplicated error detection and log statements.

Logging Unexceptional or Uninteresting Errors

It’s difficult to get entire teams to agree on definitions for FATAL, ERROR, WARNING, INFO, and DEBUG. This is especially true if there is nobody enforcing proper use of logging levels.

In many cases, I observed that developers will always log at whatever levels are shown by default. This is the lazy workaround for changing the logging level during runtime, and it results in decreased log usability.

Additionally, not all errors are created equally. In the I2C write example above, failures are acceptable and expected. By logging that case as an error, the library’s log statement communicates a different level of seriousness than our code does. The I2C write failure should have been demoted to DEBUG, INFO, or WARNING.

By setting realistic expectations for utilizing different logging levels and enforcing the standard, you can keep your logs clean and usable.

Senseless Log Statements

While error logging has its uses, I regularly encounter vague log statements such as:

log_error("Auth failed");
log_info(""Status: %d", resp->status_code");
log_error("Unknown state: %d", state);

While these log statements make sense within the context of the function they’ve been added to, they are rarely understandable when reading through the log. Often I need to do a search through the code base to find out where a specific string is used. Common such strings such as “status” or “Unknown state” are often re-used throughout the code base, making it difficult for developers to understand which case they are hitting.

Make sure that your log statements provide a complete and unique statement that can be used debugging. It is helpful to provide additional context, such as the function or file name where the log function was called.

Error Logging Guidelines

I’ve painted a bleak picture of error logging, but there is no denying that it is a useful tool in our toolkit. However, we need to collectively adopt guidelines for a healthier approach to error logging.

Here are my error logging guidelines:

  1. If it’s worth saving to a log file, it’s worth adding an assertion, unit test, or showing in the user interface. Prefer automated and visible solutions over logging whenever possible.
  2. Keep your logs small and simple. Log only the most obvious or critical failures.
    1. Only allow a minimal amount of logging into your mainline code. Brutally push back on logging during code reviews. Only allow statements in master when need or usefulness is demonstrable.
    2. Focus on only using logging for controlled experiments. When the experiment is finished or the bug is fixed, delete the logging code.
    3. Log each exception or error only once, at the highest level where it is handled.
  3. Add a last-chance exception handler at the entry point of your program for cases where you did not catch an exception.
  4. Create sensible and understandable log statements. These messages should be useful on their own and not require hunting through the code base to find context.
  5. All logging should be disable-able. Logic and input arguments must not remain in the program if logging is disabled.

Further Reading

Previous Post
Next Post
Facebooktwittergoogle_plusredditlinkedinFacebooktwittergoogle_plusredditlinkedinby feather

2 Comments


  1. In the context of embedded C++ this seems to make sense. In other areas I wouldn’t agree as much. Having proper log analytic tools such the ELK-stack is crucial for certain setups. With Kibana you can easily filter for whatever application, section, message, etc. you want, so there’s never really “too much” logging. This of course shouldn’t be an excuse to have useless logging or misuse log levels.

    Consider this: how many times have you successfully identified and fixed a critical issue using existing log statements?

    It’s a trick question, because logs aren’t here to immediately understand the issue and come up with a solution on the spot, but logs exist to a) inform the team that something did go wrong and b) to give context to that situation.

    Consider this instead: how many times have you successfully identified and fixed a critical issue using no logs at all to see what happened and the only error description you have is “it stopped working”? Ah and that just within a few hours and not days or weeks?

    Personally, I’ve identified many issues just based on logs, but of course needed to dig into the code in order to see what the best solution is. But not having worked on any embedded systems, my experience may not apply in a similar fashion.

    Reply

    1. Hi explo9t3r,

      Thank you for sharing your perspective. You raise great points, and I thank you for challenging my views.

      My goal with this article was not to invalidate the idea of error logging as a tool in our arsenal. I use logging on every project that I work on. I notice general a tendency toward terrible error logging practices (even in my own programs!) and wanted to highlight them.

      It’s a trick question, because logs aren’t here to immediately understand the issue and come up with a solution on the spot, but logs exist to a) inform the team that something did go wrong and b) to give context to that situation.

      No “trick” was intended here. I was trying to trigger reflection via a rhetorical question, perhaps unskillfully.

      There are numerous cases in my direct experience where an issue was reported but no trace existed in the log. Even if a helpful log message captures a failure detail, I usually needed MORE information than what is provided in the log.

      Logs certainly provide us with additional information that can help pinpoint the issue, but I am arguing that they cannot be relied upon as the primary tool in our debugging arsenal.

      But not having worked on any embedded systems, my experience may not apply in a similar fashion.

      The primary difference between the embedded use cases and PC/phone cases is the storage, memory, and computational capacity.

      In my embedded programs, I usually am limited to a set log buffer size that limits the log history that can be stored. Additionally, log strings take up precious space in the binary, which can be problematic when your program needs to fit into 512KB-8MB (a typical range I work with).

      Perhaps this fact leads me to experience the more painful side of error logging, since log statements become another contentious resource. It’s much easier to avoid arguing about the usefulness of a particular log statement if you have the memory and storage capacity to handle them 🙂

      Consider this instead: how many times have you successfully identified and fixed a critical issue using no logs at all to see what happened and the only error description you have is “it stopped working”? Ah and that just within a few hours and not days or weeks?

      The “trick” in your version of the question is that “it stopped working” is the only description I receive. Often there is a better description than that, and hopefully I can go and talk to the person who reported the issue to get additional details.

      Regardless of whether I get a log or not, I am typically going to investigate with GDB or by adding custom log statements to pinpoint the problem area. And a major point I wanted to make was that these custom log statements probably don’t need to be permanently added to the log – they can be removed once the error is corrected.

      As with all things, the truth of the matter lies somewhere in the middle of “log everything” and “log nothing”.

      Cheers,

      Phillip

      Reply

Leave a Reply