Centralized Logging Using the ELK Stack
When your microservices' instances are volatile, centralized logging is a must!
Why would you need a centralized system that gathers logs? We recently switched one of our projects from an architecture with a dozen monoliths to a microservices-based approach. Each of these microservices is deployed in a cloud environment (AWS). While this has many advantages, it also has some drawbacks. One of the drawbacks we immediately encountered is that you'll never know for sure where an application is hosted. An application could become unhealthy for any reason and could automatically be deployed on a new instance in the cloud. If an application becomes unhealthy, it is important to know the exact reason: maybe the database is unreachable, maybe there is a memory leak, maybe a bug even slipped through to production (hypothetically, of course). So it is really important to know the root cause, and you'll only know that for sure if you can monitor the logs. However, since the instances on which the application is deployed are very volatile, it would be a pain to collect the logfiles. In some cases the instance could already be gone, which would mean that the logfiles are gone too.
Hence the centralized system that gathers logs. Enter the ELK stack.
ELK stack
ELK is an acronym that stands for Elasticsearch, Logstash and Kibana. Elasticsearch is a distributed, RESTful, JSON-based search engine. Logstash is best understood as a powerful ingest pipeline and Kibana is the visualization layer. While ELK is mainly used to facilitate search solutions, it can also be used to centralize logs from different applications. Since Beats was added recently, the ELK stack has been renamed to Elastic Stack. More at elastic.co/elk-stack.
Disclaimer
All our applications are based on Java and Spring Boot, using Logback as logging framework. Examples will be based on that tech stack. If you're mainly interested in visualizations and dashboards, skip ahead to the 'Using the data' chapter.
How
The typical process to ship logs to Elasticsearch works as follows: Beats (Filebeat in particular) sends output from a logfile to Logstash. Logstash filters and transforms this input to a standard format and in turn sends the output to Elasticsearch, where it's ready to be searched and visualized.
One of the proposed options is to ship logfiles directly with Beats (Filebeat). While this is surely a valid option, we didn't want to configure Filebeat every single time we created a new application. So we chose to embed most of the logic in a 'shared library'. Each application can use this shared library, which also ensures that all logging is in the same format.
We often use Logback as logging framework. So we had to find a way to send our logs directly to Logstash, which would process the logs and send them to Elasticsearch. Luckily, this piece of the puzzle had already been invented: the logstash-logback-encoder provides encoders, layouts and appenders to log JSON directly to Logstash.
In our applications we send different kinds of logs: requests, errors, HikariCP pool status ... To keep a clear overview, we will limit the examples to request logs.
The first thing we wanted to share in our library was the Logback appenders:
elk-appenders.xml
<included>
    <appender name="elk-requests" class="net.logstash.logback.appender.LogstashTcpSocketAppender">
        <destination>${logging.logstash.servers:-localhost}</destination>
        <encoder class="net.logstash.logback.encoder.LoggingEventCompositeJsonEncoder">
            <providers>
                <mdc/>
                <context/>
                <version/>
                <logLevel/>
                <loggerName/>
                <threadName/>
                <provider class="com.foreach.across.modules.logging.request.RequestLoggerJsonProvider"/>
                <logstashMarkers/>
                <arguments/>
                <stackTrace/>
                <pattern>
                    <pattern>
                        {
                        "application": "${logging.logstash.application}",
                        "appender": "requests"
                        }
                    </pattern>
                </pattern>
            </providers>
        </encoder>
    </appender>

    <appender name="elk-errors" class="net.logstash.logback.appender.LogstashTcpSocketAppender">
        <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
            <level>ERROR</level>
        </filter>
        <destination>${logging.logstash.servers:-localhost}</destination>
        <encoder class="net.logstash.logback.encoder.LoggingEventCompositeJsonEncoder">
            <providers>
                <mdc/>
                <context/>
                <version/>
                <logLevel/>
                <loggerName/>
                <threadName/>
                <message/>
                <logstashMarkers/>
                <arguments/>
                <stackTrace/>
                <pattern>
                    <pattern>
                        {
                        "application": "${logging.logstash.application}",
                        "appender": "errors"
                        }
                    </pattern>
                </pattern>
            </providers>
        </encoder>
    </appender>
</included>
In this example there are a few important things to note:
destination: the destination of Logstash. This will be replaced with an environment variable. The same is true for ${logging.logstash.application}.
class="net.logstash.logback.appender.LogstashTcpSocketAppender": this appender is async by default. This will avoid that the application has to wait until the logging has been sent.
<provider class="com.foreach.across.modules.logging.request.RequestLoggerJsonProvider"/>: this class will be used to ensure that the logs always have the same fields.
An example of the RequestLoggerJsonProvider could be:
import java.io.IOException;

import com.fasterxml.jackson.core.JsonGenerator;

import ch.qos.logback.classic.spi.ILoggingEvent;
import net.logstash.logback.composite.AbstractFieldJsonProvider;
import net.logstash.logback.composite.JsonWritingUtils;

public class RequestLoggerJsonProvider extends AbstractFieldJsonProvider<ILoggingEvent>
{
    @Override
    public void writeTo( JsonGenerator generator, ILoggingEvent event ) throws IOException {
        try {
            // The request logger writes a single tab-separated line per request
            String message = event.getFormattedMessage();
            String[] split = message.split( "\t" );
            String remoteAddress = split[0];
            String method = split[1];
            String url = split[2];
            String servletPath = split[3];
            String requestMapping = split[4];
            String handlerName = split[5];
            String viewName = split[6];
            int status = Integer.parseInt( split[7] );
            int duration = Integer.parseInt( split[8] );

            // Write every value as a separate JSON field so Elasticsearch can index them individually
            JsonWritingUtils.writeStringField( generator, "remoteAddress", remoteAddress );
            JsonWritingUtils.writeStringField( generator, "method", method );
            JsonWritingUtils.writeStringField( generator, "url", url );
            JsonWritingUtils.writeStringField( generator, "servletPath", servletPath );
            JsonWritingUtils.writeStringField( generator, "requestMapping", requestMapping );
            JsonWritingUtils.writeStringField( generator, "handlerName", handlerName );
            JsonWritingUtils.writeStringField( generator, "viewName", viewName );
            JsonWritingUtils.writeNumberField( generator, "status", status );
            JsonWritingUtils.writeNumberField( generator, "duration", duration );
        }
        catch ( Exception e ) {
            throw new IOException( e );
        }
    }
}
The ILoggingEvent contains the message as formatted by Logback. The logs themselves are generated by the Across logging module.
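If you're not using the Across logging module, a plain servlet filter can produce an equivalent tab-separated line. The sketch below is a hypothetical stand-in (RequestLogFilter is a made-up name and the requestMapping, handlerName and viewName fields are left empty); it merely illustrates the format the RequestLoggerJsonProvider above expects.

import java.io.IOException;

import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.web.filter.OncePerRequestFilter;

public class RequestLogFilter extends OncePerRequestFilter
{
    // Log on the same logger name so it can be routed to the 'elk-requests' appender
    private static final Logger LOG = LoggerFactory.getLogger( "com.foreach.across.modules.logging.request.RequestLogger" );

    @Override
    protected void doFilterInternal( HttpServletRequest request, HttpServletResponse response, FilterChain filterChain )
            throws ServletException, IOException {
        long start = System.currentTimeMillis();
        try {
            filterChain.doFilter( request, response );
        }
        finally {
            long duration = System.currentTimeMillis() - start;
            // remoteAddress, method, url, servletPath, requestMapping, handlerName, viewName, status, duration
            LOG.debug( "{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}",
                       request.getRemoteAddr(), request.getMethod(), request.getRequestURI(),
                       request.getServletPath(), "", "", "", response.getStatus(), duration );
        }
    }
}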
So now that we have our generic appender, which outputs the logs in the same manner every time, it's time to use this appender in our Spring Boot application. An example configuration is the following:
logback-spring.xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <include resource="elk-appenders.xml"/>

    <logger name="com.foreach.across.modules.logging.request.RequestLogger" level="DEBUG" additivity="false">
        <appender-ref ref="elk-requests"/>
    </logger>

    <root level="DEBUG">
        <appender-ref ref="elk-errors"/>
    </root>
</configuration>
application.yml
logging:
  logstash:
    application: name-of-application
    servers: logstash-ip
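With this configuration in place, anything written to the RequestLogger logger goes to the elk-requests appender, while every ERROR logged anywhere in the application reaches elk-errors via the root logger. A minimal sketch of the latter (InvoiceService and processInvoice are made-up names):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Service;

@Service
public class InvoiceService
{
    private static final Logger LOG = LoggerFactory.getLogger( InvoiceService.class );

    public void processInvoice( String invoiceId ) {
        try {
            // business logic ...
        }
        catch ( Exception e ) {
            // ERROR level passes the ThresholdFilter of 'elk-errors' and is shipped to Logstash
            LOG.error( "Failed to process invoice {}", invoiceId, e );
        }
    }
}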
There is only one thing left to do: configure Logstash. Logstash can reformat logs by using the Grok filter plugin.
logback.conf
input {
  tcp {
    port => 5010
    codec => "json_lines"
    add_field => {
      "logback_env" => "prod"
      "logback" => "true"
    }
  }
}

filter {
  if [logback] == "true" {
    if [appender] == "requests" {
      grok {
        match => {
          "message" => "^%{DATA:remoteAddress} %{DATA:method} %{DATA:url} %{DATA:servletPath} %{DATA:requestMapping} %{DATA:handlerName} %{DATA:viewName} %{NUMBER:status} %{NUMBER:duration}$"
        }
      }
    }
  }
}

output {
  if [logback] == "true" {
    elasticsearch {
      hosts => ["location-of-elasticsearch"]
      index => "%{logback_env}-logs-%{application}-%{appender}_%{+YYYY.MM.dd}"
      document_type => "log"
    }
  }
}
A few things to notice:
In the input part we add extra fields with information about our environment. In this case prod is used. (A quick way to test this tcp input is sketched right after this list.)
The filter part contains a Grok filter to parse the request messages into separate fields.
The output part decides where our Elasticsearch is hosted and defines the index. We create an index per environment, application, appender and date. Example: prod-logs-awesome-application-requests_2019.06.18.
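To check the pipeline end to end, you can push a test event straight into the tcp input. The class below is only a quick sketch: it assumes Logstash is reachable on localhost:5010 and the field values are made up.

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class LogstashSmokeTest
{
    public static void main( String[] args ) throws Exception {
        // Mimics the JSON our appenders produce; 'application' and 'appender' drive the index name
        String event = "{\"application\":\"awesome-application\",\"appender\":\"requests\",\"message\":\"test\"}";

        try (Socket socket = new Socket( "localhost", 5010 );
             Writer writer = new OutputStreamWriter( socket.getOutputStream(), StandardCharsets.UTF_8 )) {
            // The json_lines codec expects one JSON document per line
            writer.write( event + "\n" );
            writer.flush();
        }
    }
}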
In the next chapter we'll show how this data can be used to create visualizations and dashboards.
Using the data
First we have to create our index pattern, which can be done in Kibana under Management > Index Patterns. In our case the index pattern will be prod-logs-awesome-application-requests_*. Remember that we created an index per day? By using * we make sure that we can see all logs for all days together.
After creating this index pattern, we are already able to see and search logs for this particular application and appender. An example of one log statement:
Cool! This already works. Wouldn't it be great if we could show a dashboard with a graph of the ratio of failing requests per application? To do this we have to create a new visualization first. Timelion is best suited for this use case. Timelion is a time series data visualizer that enables you to combine totally independent data sources within a single visualization.
(
.es(index=prod-logs-*-requests*, metric=count:port, q="*:* AND NOT status: 200" , split=application.keyword:100)
.label("$1", "^.* > application.keyword:(\S+) > .*")
.divide(.es(index=prod-logs-*-requests*, metric=count:port, split=application.keyword:100 ))
).multiply(100)
While it might be a bit weird to use the port to count the number of requests, the count function needs a numeric input, which in this example could be any numeric field. This expression will result in the following graph:
The created visualization can be placed in a nice dashboard where other visualizations of logs are shown.
It’s possible that you'll see an error message while trying this expression.
This can be solved by updating Elasticsearch.
If that’s not an option, you can create a graph that contains the ratio of failing requests over the total amount of all requests. This means that the percentage for a single application might drop because the total number of requests over all applications rises. Still, the graph will show you the trends and allows you to spot potential problems.
We can use the following Timelion expression:
(
.es(index=prod-logs-*-requests*, metric=count:port, q="*:* AND NOT status: 200" , split=application.keyword:100)
.label("$1", "^.* > application.keyword:(\S+) > .*")
.divide(.es(index=prod-logs-*-requests*, metric=count:port))
).multiply(100)
This expression will count the number of requests that didn’t end in a 200, split and label it per application, divide the number by the total amount of requests for all applications and multiply the result by 100.
The formula would look like this:
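failing ratio = (number of requests with a status other than 200 / total number of requests) * 100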
The result gives us the percentage of failing requests over the total amount of requests. By using a ratio we make sure we spot trends. For example: if there was only a single request and its status code was not 200, we get a ratio of 100%. If there were 50 requests and only one of them failed, our ratio is 2%.
Conclusion
By sending application logs to Elasticsearch we solved the problem of volatile application instances. We created a shared library, so there is no excuse anymore for any application to not send logs. We also created a visualization to monitor our requests.
Having our logs stored in Elasticsearch enables us to act fast when an application is having problems. It also opens up other possibilities: we could use the logs to let our applications react automatically. For example, the error ratio of the application could be monitored by Spring Boot Actuator, and when the ratio gets too high the health status could go down, causing Kubernetes or another orchestrator to react to the unhealthy status and replace the instance. Or we could send messages to our Slack channels when an application has problems, like we do with our Zabbix alerts.
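As a rough sketch of that Actuator idea (ErrorRatioHealthIndicator and the 10% threshold are made up, and the counters would still need to be fed from the request handling):

import java.util.concurrent.atomic.LongAdder;

import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class ErrorRatioHealthIndicator implements HealthIndicator
{
    private final LongAdder totalRequests = new LongAdder();
    private final LongAdder failedRequests = new LongAdder();

    public void registerRequest( boolean failed ) {
        totalRequests.increment();
        if ( failed ) {
            failedRequests.increment();
        }
    }

    @Override
    public Health health() {
        long total = totalRequests.sum();
        double errorRatio = total == 0 ? 0 : (double) failedRequests.sum() / total * 100;

        // Report DOWN when more than 10% of requests fail, so an orchestrator can replace the instance
        Health.Builder builder = errorRatio > 10 ? Health.down() : Health.up();
        return builder.withDetail( "errorRatio", errorRatio ).build();
    }
}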
The possibilities are pretty much endless and can greatly improve the level of monitoring of your applications.