Monday, September 14, 2015

How I Ditched Logstash

Ever since Logstash came on the scene several years ago, ELK (Elasticsearch, Logstash, Kibana) has become the de facto standard for log processing for me and pretty much everybody I talk to. Like any product you use every day, over time you can't help but develop a love-and-hate relationship with it. But lately there has been a lot more hate than love, at least for me.

Our setup is pretty simple and straightforward: a cluster of load-balanced logstash processor nodes sits in front of a cluster of elasticsearch nodes, and we use logstash-forwarder to push logs from our servers to logstash.
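
For context, the processor side of that pipeline looked conceptually like the sketch below. This is a minimal reconstruction rather than our actual config; the port, certificate paths, and elasticsearch host are placeholders.

    # Minimal logstash (1.x) pipeline sketch: accept lumberjack connections
    # from logstash-forwarder and index into elasticsearch.
    # The port, paths, and hostname are placeholders.
    input {
      lumberjack {
        port            => 5043
        ssl_certificate => "/etc/pki/tls/certs/logstash-forwarder.crt"
        ssl_key         => "/etc/pki/tls/private/logstash-forwarder.key"
      }
    }

    filter {
      # grok/date filters for our application logs went here
    }

    output {
      elasticsearch {
        host     => "es.internal.example.com"
        protocol => "http"
      }
    }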

The first issue we encountered was that our logstash 1.4 cluster did not scale well and started to crash under high lumberjack connection counts. The issue was addressed in logstash 1.5 with the implementation of "circuit breakers". While that solved the problem of logstash crashing (we jokingly called it the "shark attack", because the graphs of memory utilization and active connection counts looked like a row of shark fins, climbing slowly and dropping sharply each time logstash crashed), it did not solve the problem of logstash-forwarder connections constantly being dropped and having to reconnect all the time.

Because the bottleneck was how logstash handles lumberjack connections, not CPU load (during the "shark attack" the CPU was 95% idle), the obvious and sensible solution was to put a buffering or queuing service in front to take the incoming messages and let logstash pull them at its own pace. However, logstash-forwarder's outright refusal to implement such a feature (https://github.com/elastic/logstash-forwarder/issues/18) forced our hand to move away from logstash-forwarder altogether. (And no, running an actual logstash in agent mode was not going to work, because it is extremely resource-hungry.)

After a couple of months of smooth sailing, we thought all our logstash problems were behind us: we had implemented a pretty robust Beaver + SQS pipeline that auto-scaled the logstash cluster based on the number of visible messages in the SQS queue. Then an unexpected problem hit us again.
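
(The scaling part is nothing exotic: conceptually it is a CloudWatch alarm on the queue's ApproximateNumberOfMessagesVisible metric wired to an auto scaling policy, roughly like the AWS CLI sketch below. The queue name, threshold, and policy ARN are placeholders.)

    # Hypothetical sketch: alarm when the log queue backs up, and let the
    # alarm action fire an auto scaling policy that adds processor nodes.
    # Queue name, threshold, and policy ARN are placeholders.
    aws cloudwatch put-metric-alarm \
      --alarm-name log-queue-backlog \
      --namespace AWS/SQS \
      --metric-name ApproximateNumberOfMessagesVisible \
      --dimensions Name=QueueName,Value=our-log-queue \
      --statistic Average \
      --period 300 \
      --evaluation-periods 2 \
      --threshold 10000 \
      --comparison-operator GreaterThanThreshold \
      --alarm-actions arn:aws:autoscaling:...:scalingPolicy:...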

This time, we got alerted that our SQS queue had grown alarmingly long, yet all of our logstash instances (including the ones launched by auto scaling) were just sitting idle. After some digging around, we noticed tons of 400 responses from elasticsearch right before logstash completely quit processing anything, with NO ERROR MESSAGES about how or why the log processor stopped. What's worse, it is not an easily detectable crash: the logstash process is still there and running, just not doing anything. Restarting logstash would get it processing logs again, but it would quit again after about 5 minutes. The problem has been reported (https://github.com/logstash-plugins/logstash-output-elasticsearch/issues/144), but it seems nobody knows the exact underlying cause, nor is there a plan to fix it.

We limped along using cron to restart logstash every 5 minutes (yeah, it's that bad) while furiously looking for a fix or an alternative. Fortunately, while all this was going down, we were also looking for a way to log docker container activity, and I was in the process of evaluating fluent (http://www.fluentd.org/). Since we had pretty much hit a brick wall with logstash, I started experimenting with letting fluent pull messages off SQS, process them, and push them directly to elasticsearch. What I found was simply astonishing:

As soon as we brought a single fluent instance online to pull messages off the same SQS queue, running on the same EC2 instance type as our logstash nodes, that one fluent instance handled about 90% of the log messages coming off the queue, while the three logstash instances combined only processed the other 10%. Per instance, fluent basically out-processed logstash by a ratio of about 30 to 1. On top of that, from our observation fluent's memory footprint is tiny (about 40MB, roughly what the fluent web site claims), compared to logstash, where a 300~400MB footprint is quite common.
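
For the curious, the proof-of-concept worker config was conceptually along these lines. This is only a sketch from memory: the SQS source comes from the fluent-plugin-sqs plugin, whose parameter names differ between versions, and the hostnames, queue URL, and credentials are placeholders.

    # Sketch of a fluent worker that drains SQS and indexes into elasticsearch.
    # Requires fluent-plugin-sqs and fluent-plugin-elasticsearch; the SQS
    # parameter names below are assumptions, so check the plugin docs for
    # your version. Hostnames, queue URL, and credentials are placeholders.
    <source>
      type sqs
      sqs_url https://sqs.us-east-1.amazonaws.com/123456789012/log-queue
      aws_key_id YOUR_KEY
      aws_sec_key YOUR_SECRET
      tag sqs.logs
    </source>

    <match sqs.**>
      type elasticsearch
      host es.internal.example.com
      port 9200
      logstash_format true
      flush_interval 10s
    </match>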

At this point, we have replaced the majority of our logstash cluster with fluent, leaving only a few instances behind to serve some legacy nodes that still use the logstash-forwarder/lumberjack protocol, as fluent doesn't have a lumberjack input plugin. We replaced the cluster of 8 logstash instances with 3 fluent instances, and they handle as much throughput as logstash used to, or even more. In about a month, once the legacy nodes still running logstash-forwarder are switched over to fluent pushing logs to SQS (sketched below), we will ditch logstash for good.
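
The plan for those edge nodes is simply a local fluent agent tailing the log files and shipping them to the same SQS queue, conceptually something like this sketch. The out_sqs parameter names are assumptions to verify against the fluent-plugin-sqs docs; the paths, tag, and queue URL are placeholders.

    # Sketch of the edge-node agent: tail application logs and push them to SQS.
    # Uses fluentd's built-in in_tail plus fluent-plugin-sqs for output; the
    # out_sqs parameter names are assumptions, so verify against the plugin docs.
    # Paths, tag, and queue URL are placeholders.
    <source>
      type tail
      path /var/log/app/*.log
      pos_file /var/log/td-agent/app.pos
      tag app.logs
      format none
    </source>

    <match app.**>
      type sqs
      sqs_url https://sqs.us-east-1.amazonaws.com/123456789012/log-queue
      aws_key_id YOUR_KEY
      aws_sec_key YOUR_SECRET
    </match>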