This is a short story of how we managed to overcome the stability and performance issues of our Search and Relevance stack. I've had the great pleasure of working with the Personalization and Relevance Team during the last 10 months. We are in charge of providing "personalized and relevant content to the user" based on rankings and machine learning. We do it through a set of microservices that provide three public endpoints: the Home Feed, Search and Related Items APIs.

Lucene is the engine behind all the calculations and makes the magic happen for rankings and faceting.

Is it possible to do the math for Lucene and check the settings? I can share an approximate result based on a lot of documentation and forum reading; however, its configuration is not as heavy on math as Solr's. It is possible to tune Lucene, but only if you are willing to sacrifice the structure of your documents. Is it really worth the effort? No, and you will find more info as you read further.

Let's say the average document size is 2 KB. Initially, your disk space is going to take at least that per document times your document count. Having multiple shards for one collection does not necessarily result in a more resilient Solr: when one shard has an issue, even though the other shards can still respond, the overall response time is bounded by the slowest shard. When we have multiple shards, we divide the total number of documents by the shard count. This reduces the cache and disk size per node and speeds up the indexing process.

Is it possible that we have an overkill index / update process? Given the result of our experience, it is not overkill. I will leave the analysis of this question for another post; otherwise this one is going to be too extensive. We have reached ~210K updates per hour (peak traffic) on our key markets.

The only job Apache ZooKeeper has in this environment is keeping the cluster state available to all the nodes, as accurately as possible. One common issue, if the replicas are recovering too frequently, is that the cluster state might get out of sync with ZooKeeper. This produces inconsistent states among the running replicas, and the one trying to recover ends up in a long loop that might last hours. ZooKeeper itself is very stable and may fail only due to network resources, or rather the lack of them.

One of the main driving factors for Solr performance is RAM. Solr requires sufficient memory for the Java heap, plus free memory for the OS disk cache.
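As a rough illustration of the sharding arithmetic above, here is a minimal back-of-the-envelope sketch. Only the 2 KB average document size comes from the text; the total document count and shard count are made-up numbers for illustration, and the result is a lower bound that ignores Lucene overhead (stored fields, docValues, merge headroom).

```python
# Back-of-the-envelope sizing for a sharded Solr collection.
# ASSUMPTIONS: 120M documents and 6 shards are hypothetical numbers;
# only the 2 KB average document size is taken from the post.

def docs_per_shard(total_docs: int, shard_count: int) -> int:
    """Each shard holds roughly total_docs / shard_count documents."""
    return total_docs // shard_count

def min_disk_bytes(total_docs: int, avg_doc_size_kb: float = 2.0) -> float:
    """Lower bound on index disk usage: raw documents alone,
    before any Lucene index overhead or merge headroom."""
    return total_docs * avg_doc_size_kb * 1024

total_docs = 120_000_000   # hypothetical corpus size
shards = 6                 # hypothetical shard count

per_shard = docs_per_shard(total_docs, shards)
disk_gb = min_disk_bytes(total_docs) / 1024**3

print(f"{per_shard:,} docs per shard")       # 20,000,000 docs per shard
print(f"at least {disk_gb:.0f} GB on disk")  # at least 229 GB on disk
```

Splitting the corpus this way is what shrinks the per-node cache and disk footprint: each node only has to keep its own shard's fraction of the index hot in the OS disk cache.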