The one thing that all enterprise applications have in common, regardless of their framework, language, or communication protocols, is that they produce logs: allegedly human-readable lines of text that aim to help us debug, trace, or otherwise inspect what’s going on. The question is, how useful are these logs, and do they help us solve the problems that actually happen? In this post, I will explore this question, highlight some of the challenges that come with logging, and look at alternative solutions.
Historically, logging has always been a part of enterprise software. We’ve all seen a lot of logging frameworks and may even have created our own. There are lots of conversations about supposed best practices for implementing reasonable logs. If you ask developers, logging can be used for debugging, tracing, journaling, monitoring, and printing errors. In general, every piece of information that developers think may be important will be written to a log.
When I refer to logging in this post, I mean writing (more or less) human-readable strings to files or to stdout, similar to:
2022-02-14 07:10:25.800 [http-nio-8080-exec-7] My custom log format INFO com.sebastian_daschner.example.CarOrderController - User info@example.com ordered car with color: red, manufacturer: BMW, model: M3
2022-02-14 07:09:25.915 [http-nio-8080-exec-37] My custom log format INFO com.sebastian_daschner.example.HelloController - /hello called with param World, for the 535th time
2022-02-14 07:09:26.817 [http-nio-8080-exec-5] My custom log format INFO com.sebastian_daschner.example.CarOrderController - User test@example.com ordered car with color: blue, manufacturer: Audi, model: A3
...
You might ask: Sebastian, what exactly is wrong with that?
Shortcomings
When it comes to this kind of logging and the typical JVM logging frameworks (Log4j, SLF4J, Logback, etc.) that I’m describing in this post, we can identify certain issues:
Performance is certainly the biggest one. If you talk to a JVM performance expert, they will tell you that how you log can have one of the biggest, if not the biggest, negative impacts on your application’s performance. If you really want your application to perform poorly, add a lot of logging statements to your main use cases, have that logging create and concatenate a lot of strings, and rest assured that no logging framework comes without several layers of buffering. The biggest performance cost for most applications is Java heap allocation, and logging usually allocates disproportionately compared to typical business code.
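To make the allocation point concrete, here is a minimal sketch using java.util.logging (the class and message are hypothetical) that contrasts an eagerly concatenated message, which is allocated on every call even when the level is disabled, with a supplier-based message that is only built if the level is actually enabled:

import java.util.logging.Logger;

public class CarOrderController {

    private static final Logger LOGGER = Logger.getLogger(CarOrderController.class.getName());

    void orderCar(String user, String color, String model) {
        // eager: the concatenated String (and its intermediate StringBuilder) is
        // allocated on every invocation, even if FINE is disabled
        LOGGER.fine("User " + user + " ordered car with color: " + color + ", model: " + model);

        // lazy: the Supplier defers building the message until the level check passes,
        // avoiding the allocation on the hot path when FINE is off
        LOGGER.fine(() -> "User " + user + " ordered car with color: " + color + ", model: " + model);
    }
}

Even the lazy variant is not free, since the lambda still has to capture its arguments; the cheapest statement remains the one that is not there at all.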
It’s not just the allocation cost: high allocation rates will ultimately hyper-activate the garbage collector, which in turn results in high CPU utilization and more frequent tail latencies. It’s quite interesting to look at such scenarios in production, where an application allegedly utilizes the CPU a lot, and that load turns out to be caused by the garbage collector because of excessive allocation.
Disk I/O is another factor to consider. Writing and flushing a lot of data to disk impacts the performance of all applications running on the same hardware. Even worse, log files that reside on network storage hurt throughput even more, since the write operation hits the operating system I/O twice, with both file system and network calls involved. Logging makes these hardware devices, which are shared between all applications, part of the critical path. We often see this show up as a “noisy neighbor”.
The number of dependencies that logging frameworks bring along, directly or transitively, creates a few potential issues as well. Not only do dependencies and their transitive baggage inflate the application’s size and build time; the more dependencies and moving parts we have, the higher the chances of version conflicts, bugs, or security issues, as the recent Log4Shell vulnerability has shown once again. In general, the fewer dependencies we include, the better.
Log formats, and which one to choose, are another issue: which information should be included (I dare you to have a look at the Log4j layouts), how should we structure our logging messages, which information should be logged at which log level, and so on. On the one hand, logs are produced to be human-readable, but the volume of data they generate creates the need for more tooling, such as log aggregators and search engines. Unfortunately, human-readable formats tend to be expensive for machines to parse, which leads to the conclusion that logs are generally neither really human- nor machine-readable.
In these cases, it makes more sense to consider a binary or otherwise machine-friendly format. Some applications do log lines of JSON, but the question remains: is this really any better? The log format impacts performance not only with regard to how much is added to each line, but also with regard to how many string concatenations, method invocations, and reflection lookups have to be performed.
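For illustration, a single-line JSON event written to stdout could look like the following hand-rolled, hypothetical example; note that it still performs the same kind of string formatting work as a plain-text line, and a real application would use a proper JSON encoder:

public class JsonLogExample {

    public static void main(String[] args) {
        // hypothetical structured event; field names are made up for illustration
        String event = String.format(
                "{\"ts\":\"%s\",\"level\":\"INFO\",\"logger\":\"CarOrderController\","
                        + "\"user\":\"%s\",\"color\":\"%s\",\"model\":\"%s\"}",
                java.time.Instant.now(), "info@example.com", "red", "M3");
        System.out.println(event);
    }
}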
Log levels are another topic that I haven’t seen used reasonably in real-world projects, which I don’t think is the projects’ fault. Log levels sound nice in theory: you observe different levels of detail depending on what information is currently interesting. In practice, that doesn’t really work well. Once a certain condition has happened that you would have liked to debug using log statements, it’s likely that the log level wasn’t detailed enough, since debug and trace logs are usually not available in production. After all, keeping detailed log levels enabled in production, with many statements being written to disk, will hurt your performance. Some implementations support dynamically changing the log level at runtime, but by the time you realize you need more information, it’s likely too late to capture what is needed. Choosing a proper log level, and then of course deciding which statements should be logged at which level, is thus always a trade-off. Often this task is left to individual developers without any architectural guidance, which further degrades the usefulness of the logs.
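As a concrete example of the dynamic approach, java.util.logging allows changing a logger’s level on a running JVM; the logger name and the way the switch is triggered are hypothetical here, and it still does not help once the interesting event has already passed:

import java.util.logging.Level;
import java.util.logging.Logger;

public class LogLevelSwitcher {

    // could be wired to an admin endpoint or a JMX operation (hypothetical)
    static void enableDebugLogging() {
        Logger.getLogger("com.sebastian_daschner.example").setLevel(Level.FINE);
    }
}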
Logs can produce huge amounts of data that is written to log files, which over time results in large files that are cumbersome to handle and expensive to parse. Parsing log formats in general bears quite an avoidable overhead. Especially if we collect all logs in another stack such as ELK, those tools need to parse all these formats yet again, which makes one question whether the approach was a sensible one to begin with.
In general, one might doubt whether logging is the best choice for implementing debugging, tracing, journaling, monitoring, or printing errors. If that is the case, then what are the alternatives? Let’s start this discussion by looking at why we log.
Why developers use logging
Developers have different reasons for using logging in an application running in production. First, let’s look at the negative examples, the concerns that should be implemented differently:
◉ Debugging (appending debug statements into the log)
◉ Journaling (writing business-related events or audits, usually synchronously)
◉ Tracing (printing method invocations, arguments, and stacks)
◉ Monitoring (appending business or technical metrics to the log)
◉ Health checks (writing status messages that ensure the application is still alive)
Using logging for debugging in production not only has a huge negative performance impact but also might not even be of much help. Information that isn’t available at the configured log level won’t help you reproduce a specific bug, and setting a log level that is too low, especially for third-party framework code and libraries, typically results in an overwhelming number of lines being logged per user request. Debugging race conditions and concurrency-related errors via logging will very likely change the race, lead to a different outcome, and again be of little help. When debugging functionality, it’s more advisable to use a debugger, such as the one included in your IDE, that can connect to a running application, either locally or remotely.
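For reference, a JVM can be started with the standard JDWP agent so that an IDE debugger can attach remotely; the port and the application JAR name are only examples:

java -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005 -jar application.jar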
Logging statements that aim to record business-motivated information in order to create an audit trail are akin to a poor man’s version of journaling. This activity is better accomplished by using a journaling solution or event sourcing. If the business requirements demand a journal, audit log, or the like, it should be treated as such and made explicit. Using event sourcing or a proper journaling technology such as Chronicle Queue persists the messages more efficiently, with a lower footprint, lower latency, and higher throughput.
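A minimal sketch of what an explicit journal could look like, using a hypothetical interface rather than a specific library API, so that the business event becomes a first-class, typed record instead of a formatted log line:

import java.time.Instant;

// hypothetical journaling abstraction; an implementation could be backed by
// event sourcing or an append-only store such as Chronicle Queue
interface OrderJournal {
    void append(CarOrdered event);
}

// the business event itself, carrying typed data instead of a pre-rendered string
record CarOrdered(Instant at, String user, String manufacturer, String model, String color) {
}

The point is that the journaling technology can persist such typed events in a compact, binary form instead of rendering text that has to be re-parsed later.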
Business- or technically-motivated tracing should also be made explicit in the application and implemented using a fit-for-purpose solution, such as OpenTracing or a comparable tracing technology. (Mis-)using logging in order to trace method invocations and arguments has the same drawbacks as using logging for debugging.
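A rough sketch of explicit tracing with the OpenTracing API; how the Tracer instance is obtained is framework-specific and left out here, and the operation and tag names are made up:

import io.opentracing.Span;
import io.opentracing.Tracer;

public class CarOrderService {

    private final Tracer tracer;

    public CarOrderService(Tracer tracer) {
        this.tracer = tracer;
    }

    void orderCar(String user, String model) {
        // an explicit span instead of a log line per method invocation
        Span span = tracer.buildSpan("order-car").start();
        span.setTag("user", user);
        span.setTag("model", model);
        try {
            // ... business logic ...
        } finally {
            span.finish();
        }
    }
}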
Instead of using logging to monitor, one should use a proper monitoring solution, which usually means emitting metrics via an endpoint, for example publishing metrics in the Prometheus format via HTTP and having them scraped on the receiving end. All major enterprise frameworks support such solutions out of the box, and even custom-built implementations are usually a better choice for realizing monitoring than appending individual metrics to the log.
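As a sketch of that endpoint approach, using only the JDK’s built-in HTTP server and a hand-written Prometheus text exposition, where the metric name and port are made up:

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.LongAdder;

public class MetricsEndpoint {

    static final LongAdder ORDERS = new LongAdder();

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/metrics", exchange -> {
            // Prometheus text format, scraped periodically by the monitoring system
            byte[] body = ("# TYPE car_orders_total counter\ncar_orders_total " + ORDERS.sum() + "\n")
                    .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "text/plain");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
    }
}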
The same holds true for health checks, for which the application should offer some sort of health-check functionality, such as an endpoint or status page. These solutions integrate nicely with containerized environments like Kubernetes.
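A minimal liveness check with MicroProfile Health, as supported by Quarkus and other MicroProfile runtimes, could look roughly like this; depending on the runtime version the CDI import may be javax instead of jakarta, and the check name is made up:

import jakarta.enterprise.context.ApplicationScoped;
import org.eclipse.microprofile.health.HealthCheck;
import org.eclipse.microprofile.health.HealthCheckResponse;
import org.eclipse.microprofile.health.Liveness;

@Liveness
@ApplicationScoped
public class ApplicationLivenessCheck implements HealthCheck {

    @Override
    public HealthCheckResponse call() {
        // typically exposed under /health/live and consumed by a Kubernetes liveness probe
        return HealthCheckResponse.up("application-alive");
    }
}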
When and how to do logging
Now we’ve seen many cases in which we should avoid traditional logging. So should we use logging at all, and if so, when?
A great use of stdout logging is debugging during development, such as in the Quarkus dev mode. I very often insert a quick System.out.println, reload the application, which in the case of Quarkus takes 1-2 seconds, look at the result, adapt my code, and usually remove the logging statement right away. This is often faster than starting up the IDE’s debug mode, at least for simple debugging. However, one should not leave these statements in the final committed version that runs in production.
Logging is also useful for recording the completion of automated jobs that otherwise won’t easily be seen. Single lines in the log that summarize the outcome of the executed job can turn out to be very helpful, as long as they don’t appear too often, that is, if the job runs infrequently, ideally no more than every other minute.
It is helpful to log errors that are unrecoverable or unexpected, especially when all other means to expose this information have failed. For example, if your application is unable to connect to a database, logging may be the only way to capture the error information. You may want to suppress this error in cases where multiple user invocations would cause a multitude of retries. In that case, we’re not really interested in the performance impact on the “broken” application but more in being a good citizen by minimizing the impact on the rest of the system.
It’s also helpful to log any error that you didn’t expect or account for, such as one that hints at a bug. I know, you might ask: “what error is expected, anyway?”. Take NullPointerException as an example. It implies a faulty, unexpected condition that the code is not handling properly. This is different from a user-provoked error, which usually shouldn’t end up in the log. For example, I was once debugging my router when it showed a NullPointerException stack trace in the web interface. It turned out the code was not properly handling the condition of no cables being connected. This is a user-provoked condition that wouldn’t require further developer intervention, yet instead of signaling that a cable needed to be connected, I was presented with a technical exception that carried no useful information. User-provoked does include technical users and external systems, such as failures caused during an HTTP client request. That’s the distinction I’d make: you can log any exception that implies that the developers need to investigate and fix something.
Containers and orchestration frameworks have had some impact on how logging should be done. Rotating log files aren’t required anymore, and container frameworks typically take the container log stream, that is stdout and stderr, and correlate it further. For that reason, and this might sound very controversial to a few of you, if I do have to log, I use these wonderful loggers that have shipped with Java since JDK 1.0, called System.out and System.err. To some of you this might sound overly simplistic or even ignorant, but quite the contrary: using a simple, straightforward logging solution avoids a multitude of potential transitive problems.
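If plain println calls sound too bare-bones, a tiny hypothetical helper is all it takes to get timestamps and a consistent single-line format on stdout and stderr, without pulling in any dependency:

import java.time.Instant;

// minimal, dependency-free stdout/stderr "logger"; name and format are made up
public final class Out {

    private Out() {
    }

    public static void info(String message) {
        System.out.println(Instant.now() + " INFO " + message);
    }

    public static void error(String message, Throwable t) {
        System.err.println(Instant.now() + " ERROR " + message);
        t.printStackTrace();
    }
}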
The impact that third-party libraries and their transitive dependencies have on our applications is not to be neglected. Unless we have a specific requirement, it makes sense to go with a simple, straightforward solution whose behavior, consequences, and benefits we fully comprehend.