This is the documentation for CDH 5.1.0.
Documentation for other versions is available at Cloudera Documentation.

Flume Near Real-Time Indexing Reference

The Flume Solr Sink is a flexible, scalable, fault tolerant, transactional, Near Real Time (NRT) oriented system for processing a continuous stream of records into live search indexes. Latency from the time of data arrival to the time of data showing up in search query results is on the order of seconds and is tunable.

Data flows from one or more sources through one or more Flume nodes across the network to one or more Flume Solr Sinks. The Flume Solr Sinks extract the relevant data, transform it, and load it into a set of live Solr search servers, which in turn serve queries to end users or search applications.

The ETL functionality is flexible and customizable using chains of arbitrary morphline commands that pipe records from one transformation command to another. Commands to parse and transform a set of standard data formats such as Avro, CSV, Text, HTML, XML, PDF, Word, or Excel. are provided out of the box, and additional custom commands and parsers for additional file or data formats can be added as morphline plug-ins. This is done by implementing a simple Java interface that consumes a record such as a file in the form of an InputStream plus some headers plus contextual metadata. This record is used to generate output of zero or more records. Any kind of data format can be indexed and any Solr documents for any kind of Solr schema can be generated, and any custom ETL logic can be registered and executed.

Routing to multiple Solr collections is supported to improve multi-tenancy. Routing to a SolrCloud cluster is supported to improve scalability. Flume SolrSink servers can be either co-located with live Solr servers serving end user queries, or Flume SolrSink servers can be deployed on separate industry standard hardware for improved scalability and reliability. Indexing load can be spread across a large number of Flume SolrSink servers for improved scalability. Indexing load can be replicated across multiple Flume SolrSink servers for high availability, for example using Flume features such as Load balancing Sink Processor.

This system provides low latency data acquisition and low latency querying. It complements (rather than replaces) use-cases based on batch analysis of HDFS data using MapReduce. In many use cases, data flows simultaneously from the producer through Flume into both Solr and HDFS using Flume features such as optional replicating channels to replicate an incoming flow into two output flows. Both near real time ingestion as well as batch analysis tools are used in practice.

For a more comprehensive discussion of the Flume Architecture see Large Scale Data Ingestion using Flume.

Once Flume is configured, start Flume as detailed in Flume Installation.

See the Cloudera Search Tutorial for exercises that configure and run a Flume SolrSink to index documents.