This is a short story of how we managed to overcome the stability and performance issues of our Search and Relevance stack. I've had the great pleasure of working with the Personalization and Relevance Team during the last 10 months. We are in charge of providing "personalized and relevant content to the user" based on rankings and machine learning. We do it through a set of microservices that provide three public endpoints: the Home Feed, Search and Related Items APIs.

Lucene is the engine behind all the calculations and makes the magic happen for rankings and faceting.

Is it possible to do the math for Lucene and check the settings? I can share an approximate result based on a lot of documentation and forum reading; however, its configuration is not as heavy on math as Solr's. It is possible to tune Lucene, but only if you are willing to sacrifice the structure of your documents. Is it really worth the effort? No, and you will find more info as you read further.

Let's say the average document size is 2 KB. Initially, your disk space is going to take at least that per document times your document count. Having multiple shards for one collection does not necessarily result in a more resilient Solr: when one shard has an issue, even though the other shards can still respond, the overall response time is bounded by the slowest shard. When we have multiple shards, we divide the total number of documents by the shard count. This reduces the cache and disk size per node and speeds up the indexing process.

Is it possible that we have an overkill index / update process? Given the result of our experience, it is not overkill. I will leave the analysis of this question for another post; otherwise this one is going to be too extensive. We have reached ~210K updates per hour (peak traffic) on our key markets.

The only job Apache ZooKeeper has in this environment is keeping the cluster state available to all the nodes, as accurately as possible. One common issue, if the replicas are recovering too frequently, is that the cluster state might get out of sync with ZooKeeper. This produces inconsistent states among the running replicas, and the one trying to recover ends up in a long loop that might last hours. ZooKeeper itself is very stable and may fail only due to network resources, or rather the lack of them.

One of the main driving factors for Solr performance is RAM. Solr requires sufficient memory for the Java heap, plus free memory for the OS disk cache.
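As a rough illustration of the sharding arithmetic above, here is a minimal back-of-the-envelope sketch. Only the 2 KB average document size comes from the text; the total document count and shard count are made-up numbers for illustration, and the result is a lower bound that ignores Lucene overhead (stored fields, docValues, merge headroom).

```python
# Back-of-the-envelope sizing for a sharded Solr collection.
# ASSUMPTIONS: 120M documents and 6 shards are hypothetical numbers;
# only the 2 KB average document size is taken from the post.

def docs_per_shard(total_docs: int, shard_count: int) -> int:
    """Each shard holds roughly total_docs / shard_count documents."""
    return total_docs // shard_count

def min_disk_bytes(total_docs: int, avg_doc_size_kb: float = 2.0) -> float:
    """Lower bound on index disk usage: raw documents alone,
    before any Lucene index overhead or merge headroom."""
    return total_docs * avg_doc_size_kb * 1024

total_docs = 120_000_000   # hypothetical corpus size
shards = 6                 # hypothetical shard count

per_shard = docs_per_shard(total_docs, shards)
disk_gb = min_disk_bytes(total_docs) / 1024**3

print(f"{per_shard:,} docs per shard")       # 20,000,000 docs per shard
print(f"at least {disk_gb:.0f} GB on disk")  # at least 229 GB on disk
```

Splitting the corpus this way is what shrinks the per-node cache and disk footprint: each node only has to keep its own shard's fraction of the index hot in the OS disk cache.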