This is the documentation for CDH 5.1.0.
Documentation for other versions is available at Cloudera Documentation.

Tuning the Solr Server

Solr performance tuning is a complex task. The following sections provide more details.

Tuning to Complete During Setup

Some tuning is best completed during the setup of you system or may require some reindexing.

Configuring Lucene Version Requirements

You can configure Solr to use a specific version of Lucene. This can help ensure that the Lucene version that Search uses includes the latest features and bug fixes. At the time that a version of Solr ships, Solr is typically configured to use the appropriate Lucene version, in which case there is no need to change this setting. If a subsequent Lucene update occurs, you can configure the Lucene version requirements by directly editing the luceneMatchVersion element in the solrconfig.xml file. Versions are typically of the form x.y, such as 4.4. For example, to specify version 4.4, you would ensure the following setting exists in solrconfig.xml:
<luceneMatchVersion>4.4</luceneMatchVersion>

Designing the Schema

When constructing a schema, use data types that most accurately describe the data that the fields will contain. For example:
  • Use the tdate type for dates. Do this instead of representing dates as strings.
  • Consider using the text type that applies to your language, instead of using String. For example, you might use text_en. Text types support returning results for subsets of an entry. For example, querying on "john" would find "John Smith", whereas with the string type, only exact matches are returned.
  • For IDs, use the string type.

General Tuning

The following tuning categories can be completed at any time. It is less advantageous to implement these changes before beginning to use your system.

General Tips

  • If your environment does not require Near Real Time (NRT), turn off soft auto-commit in solrconfig.xml.
  • In most cases, do not change the default batch size setting of 1000. If you are working with especially large documents, you may consider decreasing the batch size.
  • To help identify any garbage collector (GC) issues, enable GC logging in production. The overhead is low and the JVM supports GC log rolling as of 1.6.0_34.
    • The minimum recommended GC logging flags are: -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails.
    • To rotate the GC logs: -Xloggc: -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles= -XX:GCLogFileSize=.

Solr and HDFS - the Block Cache

Cloudera Search enables Solr to store indexes in an HDFS filesystem. To maintain performance, an HDFS block cache has been implemented using Least Recently Used (LRU) semantics. This enables Solr to cache HDFS index files on read and write, storing the portions of the file in JVM "direct memory" (meaning off heap) by default or optionally in the JVM heap. Direct memory is preferred as it is not affected by garbage collection.

Batch jobs typically do not make use of the cache, while Solr servers (when serving queries or indexing documents) should. When running indexing using MapReduce, the MR jobs themselves do not make use of the block cache. Block caching is turned off by default and should be left disabled.

Tuning of this cache is complex and best practices are continually being refined. In general, allocate a cache that is about 10-20% of the amount of memory available on the system. For example, when running HDFS and Solr on a host with 50 GB of memory, typically allocate 5-10 GB of memory using solr.hdfs.blockcache.slab.count. As index sizes grow you may need to tune this parameter to maintain optimal performance.
  Note: Block cache metrics are currently unavailable.

Configuration

The following parameters control caching. They can be configured at the Solr process level by setting the respective system property or by editing the solrconfig.xml directly.

Parameter

Default

Description

solr.hdfs.blockcache.enabled

true

Enable the blockcache.

solr.hdfs.blockcache.read.enabled

true

Enable the read cache.

solr.hdfs.blockcache.write.enabled

false

Enable the write cache.

solr.hdfs.blockcache.direct.memory.allocation

true

Enable direct memory allocation. If this is false, heap is used.

solr.hdfs.blockcache.slab.count

1

Number of memory slabs to allocate. Each slab is 128 MB in size.

solr.hdfs.blockcache.global true If enabled, a single HDFS block cache is used for all SolrCores on a node. If blockcache.global is disabled, each SolrCore on a node creates its own private HDFS block cache. Enabling this parameter simplifies managing HDFS block cache memory.
  Note:

Increasing the direct memory cache size may make it necessary to increase the maximum direct memory size allowed by the JVM. Add the following to /etc/default/solr to do so. You must also replace MAXMEM with a reasonable upper limit. A typical default JVM value for this is 64 MB.

CATALINA_OPTS="-XX:MaxDirectMemorySize=MAXMEMg -XX:+UseLargePages"

Restart Solr servers after editing this parameter.

Solr HDFS optimizes caching when performing NRT indexing using Lucene's NRTCachingDirectory.

Lucene caches a newly created segment if both of the following conditions are true:

  • The segment is the result of a flush or a merge and the estimated size of the merged segment is <= solr.hdfs.nrtcachingdirectory.maxmergesizemb.
  • The total cached bytes is <= solr.hdfs.nrtcachingdirectory.maxcachedmb.

The following parameters control NRT caching behavior:

Parameter

Default

Description

solr.hdfs.nrtcachingdirectory.enable

true

Whether to enable the NRTCachingDirectory.

solr.hdfs.nrtcachingdirectory.maxcachedmb

192

Size of the cache in megabytes.

solr.hdfs.nrtcachingdirectory.maxmergesizemb

16

Maximum segment size to cache.

Here is an example of solrconfig.xml with defaults:

 <directoryFactory name="DirectoryFactory" class="org.apache.solr.core.HdfsDirectoryFactory">
    <bool name="solr.hdfs.blockcache.enabled">${solr.hdfs.blockcache.enabled:true}</bool>
    <int name="solr.hdfs.blockcache.slab.count">${solr.hdfs.blockcache.slab.count:1}</int>
    <bool name="solr.hdfs.blockcache.direct.memory.allocation">${solr.hdfs.blockcache.direct.memory.allocation:true}</bool>
    <int name="solr.hdfs.blockcache.blocksperbank">${solr.hdfs.blockcache.blocksperbank:16384}</int>
    <bool name="solr.hdfs.blockcache.read.enabled">${solr.hdfs.blockcache.read.enabled:true}</bool>
    <bool name="solr.hdfs.blockcache.write.enabled">${solr.hdfs.blockcache.write.enabled:true}</bool>
    <bool name="solr.hdfs.nrtcachingdirectory.enable">${solr.hdfs.nrtcachingdirectory.enable:true}</bool>
    <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">${solr.hdfs.nrtcachingdirectory.maxmergesizemb:16}</int>
    <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">${solr.hdfs.nrtcachingdirectory.maxcachedmb:192}</int>
</directoryFactory>

The following example illustrates passing Java options by editing the /etc/default/solr configuration file:

CATALINA_OPTS="-Xmx10g -XX:MaxDirectMemorySize=20g -XX:+UseLargePages -Dsolr.hdfs.blockcache.slab.count=100"

For better performance, Cloudera recommends disabling the Linux swap space on all Solr server nodes as shown below:

# minimize swapiness
sudo sysctl vm.swappiness=0
sudo bash -c 'echo "vm.swappiness=0">> /etc/sysctl.conf'
# disable swap space until next reboot:
sudo /sbin/swapoff -a

Threads

Configure the Tomcat server to have more threads per Solr instance. Note that this is only effective if your hardware is sufficiently powerful to accommodate the increased threads. 10,000 threads is a reasonable number to try in many cases.

To change the maximum number of threads, add a maxThreads element to the Connector definition in the Tomcat server's server.xml configuration file. For example, if you installed Search for CDH 5 using parcels installation, you would modify the Connector definition in the <parcel path>/CDH/etc/solr/tomcat-conf.dist/conf/server.xml file so this:
    <Connector port="${solr.port}" protocol="HTTP/1.1" 
               connectionTimeout="20000" 
               redirectPort="8443" />
Becomes this:
    <Connector port="${solr.port}" protocol="HTTP/1.1" 
               maxThreads="10000" 
               connectionTimeout="20000" 
               redirectPort="8443" />

Garbage Collection

Choose different garbage collection options for best performance in different environments. Some garbage collection options typically chosen include:

  • Concurrent low pause collector: Use this collector in most cases. This collector attempts to minimize "Stop the World" events. Avoiding these events can reduce connection timeouts, such as with ZooKeeper, and may improve user experience. This collector is enabled using -XX:+UseConcMarkSweepGC.
  • Throughput collector: Consider this collector if raw throughput is more important than user experience. This collector typically uses more "Stop the World" events so this may negatively affect user experience and connection timeouts. This collector is enabled using -XX:+UseParallelGC.

You can also affect garbage collection behavior by increasing the Eden space to accommodate new objects. With additional Eden space, garbage collection does not need to run as frequently on new objects.

Replicas

If you have sufficient additional hardware, add more replicas for a linear boost of query throughput. Note that adding replicas may slow write performance on the first replica, but otherwise this should have minimal negative consequences.

Shards

In some cases, oversharding can help improve performance including intake speed. If your environment includes massively parallel hardware and you want to use these available resources, consider oversharding. You might increase the number of replicas per node from 1 to 2 or 3. Making such changes creates complex interactions, so you should continue to monitor your system's performance to ensure that the benefits of oversharding do not outweigh the costs.

Commits

Changing commit values may improve performance in some situation. These changes result in tradeoffs and may not be beneficial in all cases.

  • For hard commit values, the default value of 60000 (60 seconds) is typically effective, though changing this value to 120 seconds may improve performance in some cases. Note that setting this value to higher values, such as 600 seconds may result in undesirable performance tradeoffs.
  • Consider increasing the auto-commit value from 15000 (15 seconds) to 120000 (120 seconds).
  • Enable soft commits and set the value to the largest value that meets your requirements. The default value of 1000 (1 second) is too aggressive for some environments.

Other Resources

  • General information on Solr caching is available on the SolrCaching page on the Solr Wiki.
  • Information on issues that influence performance is available on the SolrPerformanceFactors page on the Solr Wiki.
  • Resource Management describes how to use Cloudera Manager to manage resources, for example with Linux cgroups.
  • For information on improving querying performance, see ImproveSearchingSpeed.

  • For information on improving indexing performance, see ImproveIndexingSpeed.