This is the documentation for Cloudera Search CDH 5 Beta 2 and 1.2.0 for CDH 4.
Documentation for other versions is available at Cloudera Documentation.

Tuning the Solr Server

Solr performance tuning is a complex task. The following sections provide more details.

General information on Solr caching is available on the SolrCaching page on the Solr Wiki.

Information on issues that influence performance is available on the SolrPerformanceFactors page on the Solr Wiki.

Solr and HDFS - the Block Cache

Cloudera Search enables Solr to store indexes in an HDFS filesystem. To maintain performance, an HDFS block cache has been implemented using LRU semantics. This enables Solr to cache HDFS index files on read and write, storing portions of each file in JVM "direct memory" (that is, off heap) by default, or optionally in the JVM heap. Direct memory is preferred because it is not affected by garbage collection.

Batch jobs typically do not make use of the cache, while Solr servers (when serving queries or indexing documents) should. When indexing with MapReduce, the MR jobs themselves do not use the block cache: for these jobs, block caching is turned off by default and should be left disabled.

Tuning of this cache is complex and best practices are continually being refined. In general, allocate a cache that is about 10-20% of the amount of memory available on the system. For example, when running HDFS and Solr on a host with 50 GB of memory, typically allocate 5-10 GB of memory using solr.hdfs.blockcache.slab.count. As index sizes grow you may need to tune this parameter to maintain optimal performance.
  Note: Block cache metrics are currently unavailable.
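
The sizing guideline above can be turned into a slab count with a little arithmetic. The following is a minimal sketch, not a Cloudera tool: slab_count is a hypothetical helper name, and it assumes the 128 MB slab size documented for solr.hdfs.blockcache.slab.count and whole-number inputs.

```shell
# Hypothetical helper (not part of Cloudera Search): estimate a value for
# solr.hdfs.blockcache.slab.count.
#   $1 = total host memory in GB
#   $2 = percentage of host memory to give to the block cache (10-20 suggested)
# Assumes each slab is 128 MB; uses integer arithmetic, so results round down.
slab_count() {
  echo $(( $1 * 1024 * $2 / 100 / 128 ))
}

slab_count 50 10   # 5 GB cache on a 50 GB host, prints 40
slab_count 50 20   # 10 GB cache on a 50 GB host, prints 80
```

Treat the result as a starting point only; the right value depends on index size and on what else runs on the host.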

Configuration

The following parameters control caching. They can be configured at the Solr process level by setting the respective system property or by editing the solrconfig.xml directly.

Parameter                                      Default  Description
solr.hdfs.blockcache.enabled                   true     Enable the blockcache.
solr.hdfs.blockcache.read.enabled              true     Enable the read cache.
solr.hdfs.blockcache.write.enabled             true     Enable the write cache.
solr.hdfs.blockcache.direct.memory.allocation  true     Enable direct memory allocation. If this is false, heap is used.
solr.hdfs.blockcache.slab.count                1        Number of memory slabs to allocate. Each slab is 128 MB in size.

  Note:

Increasing the direct memory cache size may make it necessary to increase the maximum direct memory size allowed by the JVM. Add the following to /etc/default/solr to do so. You must also replace MAXMEM with a reasonable upper limit. A typical default JVM value for this is 64 MB.

CATALINA_OPTS="-XX:MaxDirectMemorySize=MAXMEMg -XX:+UseLargePages"

Restart Solr servers after editing this parameter.
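
As a rough sanity check before restarting, the total slab size can be compared with the direct memory limit. This is a sketch under assumptions: cache_fits is a hypothetical helper, it assumes 128 MB slabs, and it ignores any other direct memory the JVM uses, so leave headroom above the block cache size.

```shell
# Hypothetical helper (not part of Cloudera Search): succeed if the block
# cache fits within the JVM direct memory limit.
#   $1 = solr.hdfs.blockcache.slab.count
#   $2 = value used for -XX:MaxDirectMemorySize, in GB
# Assumes 128 MB per slab and does not account for other direct memory users.
cache_fits() {
  [ $(( $1 * 128 )) -le $(( $2 * 1024 )) ]
}

# 100 slabs need 12800 MB, which fits under a 20 GB limit.
if cache_fits 100 20; then
  echo "block cache fits"
else
  echo "raise MaxDirectMemorySize or lower the slab count"
fi
```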

Solr HDFS optimizes caching when performing NRT indexing using Lucene's NRTCachingDirectory.

Lucene caches a newly created segment if both of the following conditions are true:

  • The segment is the result of a flush or a merge and the estimated size of the merged segment is <= solr.hdfs.nrtcachingdirectory.maxmergesizemb.
  • The total cached bytes is <= solr.hdfs.nrtcachingdirectory.maxcachedmb.
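
The two conditions above can be expressed as a simple predicate. This is a simplified model for illustration only, not Lucene's actual implementation: should_cache_segment is a hypothetical name, sizes are whole megabytes, and the defaults mirror the documented defaults for the two parameters.

```shell
# Hypothetical model of the two caching conditions (not Lucene's code).
#   $1 = estimated size of the new segment, in MB
#   $2 = total size currently cached, in MB
#   $3 = solr.hdfs.nrtcachingdirectory.maxmergesizemb (default 16)
#   $4 = solr.hdfs.nrtcachingdirectory.maxcachedmb (default 192)
should_cache_segment() {
  [ "$1" -le "${3:-16}" ] && [ "$2" -le "${4:-192}" ]
}

# An 8 MB segment with 100 MB already cached satisfies both conditions.
if should_cache_segment 8 100; then echo "cached"; else echo "not cached"; fi
# A 32 MB segment exceeds the default maxmergesizemb of 16.
if should_cache_segment 32 100; then echo "cached"; else echo "not cached"; fi
```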

The following parameters control NRT caching behavior:

Parameter                                     Default  Description
solr.hdfs.nrtcachingdirectory.enable          true     Whether to enable the NRTCachingDirectory.
solr.hdfs.nrtcachingdirectory.maxcachedmb     192      Size of the cache in megabytes.
solr.hdfs.nrtcachingdirectory.maxmergesizemb  16       Maximum segment size to cache.

Here is an example of solrconfig.xml with defaults:

<directoryFactory name="DirectoryFactory" class="org.apache.solr.core.HdfsDirectoryFactory">
  <bool name="solr.hdfs.blockcache.enabled">${solr.hdfs.blockcache.enabled:true}</bool>
  <int name="solr.hdfs.blockcache.slab.count">${solr.hdfs.blockcache.slab.count:1}</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">${solr.hdfs.blockcache.direct.memory.allocation:true}</bool>
  <int name="solr.hdfs.blockcache.blocksperbank">${solr.hdfs.blockcache.blocksperbank:16384}</int>
  <bool name="solr.hdfs.blockcache.read.enabled">${solr.hdfs.blockcache.read.enabled:true}</bool>
  <bool name="solr.hdfs.blockcache.write.enabled">${solr.hdfs.blockcache.write.enabled:true}</bool>
  <bool name="solr.hdfs.nrtcachingdirectory.enable">${solr.hdfs.nrtcachingdirectory.enable:true}</bool>
  <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">${solr.hdfs.nrtcachingdirectory.maxmergesizemb:16}</int>
  <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">${solr.hdfs.nrtcachingdirectory.maxcachedmb:192}</int>
</directoryFactory>

The following example illustrates passing Java options by editing the /etc/default/solr configuration file:

CATALINA_OPTS="-Xmx10g -XX:MaxDirectMemorySize=20g -XX:+UseLargePages -Dsolr.hdfs.blockcache.slab.count=100"

For better performance, Cloudera recommends disabling the Linux swap space on all Solr server nodes as shown below:

# minimize swappiness
sudo sysctl vm.swappiness=0
sudo bash -c 'echo "vm.swappiness=0" >> /etc/sysctl.conf'
# disable swap space until next reboot:
sudo /sbin/swapoff -a
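
To confirm that swap is off, one option is to count the entries in /proc/swaps, which lists active swap areas under a single header line. The sketch below uses a hypothetical helper name, active_swap_count, and accepts an alternative file path only so that it can be exercised without changing the host.

```shell
# Hypothetical helper: count active swap areas in a /proc/swaps-style file.
#   $1 = file to read (defaults to /proc/swaps)
# The first line is a header; each later non-empty line is one swap area.
active_swap_count() {
  awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "${1:-/proc/swaps}"
}

# A count of 0 is expected after "swapoff -a" on a Linux host.
if [ -r /proc/swaps ]; then
  active_swap_count
fi
```

The current swappiness setting can be read back with sysctl -n vm.swappiness, which should print 0 after the change above.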

Solr Query Performance

The ImproveSearchingSpeed page on the Lucene-java Wiki highlights some areas to consider for improving query performance.

Solr Indexing Performance

The ImproveIndexingSpeed page on the Lucene-java Wiki highlights some areas to consider for improving indexing performance.

Resource Management with Cloudera Manager

Resource Management describes how to use Cloudera Manager to manage resources, for example with Linux cgroups.