As an administrator, you monitor Impala's use of resources and take action when necessary to keep Impala running smoothly and avoid conflicts with other Hadoop components running on the same cluster. When you detect that an issue has happened or could happen in the future, you reconfigure Impala or other components such as HDFS or even the hardware of the cluster itself to resolve or avoid problems.
- Using YARN Resource Management with Impala (CDH 5 Only)
- Managing Disk Space for Impala Data
- Setting Timeout Periods for Daemons, Queries, and Sessions
- Using Impala through a Proxy for High Availability
As an administrator, you can expect to perform installation, upgrade, and configuration tasks for Impala on all machines in a cluster. See Installing Impala, Upgrading Impala, and Configuring Impala for details.
For additional security tasks typically performed by administrators, see Impala Security.
For a detailed example of configuring a cluster to share resources between Impala queries and MapReduce jobs, see Setting up a Multi-tenant Cluster for Impala and MapReduce.
Using YARN Resource Management with Impala (CDH 5 Only)
You can limit the CPU and memory resources used by Impala, to manage and prioritize workloads on clusters that run jobs from many Hadoop components. (Currently, there is no limit or throttling on the I/O for Impala queries.) Impala uses the underlying Apache Hadoop YARN resource management framework, which allocates the required resources for each Impala query. Impala estimates the resources required by the query on each node of the cluster, and requests the resources from YARN. Requests from Impala to YARN go through an intermediary service Llama (Low Latency Application Master). When the resource requests are granted, Impala starts the query and places all relevant execution threads into the CGroup containers and sets up the memory limit on each node. If sufficient resources are not available, the Impala query waits until other jobs complete and the resources are freed. While the waits for resources might make individual queries seem less responsive on a heavily loaded cluster, the resource management feature makes the overall performance of the cluster smoother and more predictable, without sudden spikes in utilization due to memory paging, saturated I/O channels, CPUs pegged at 100%, and so on.
The Llama Daemon
Llama is a system that mediates resource management between Cloudera Impala and Hadoop YARN. Llama enables Impala to reserve, use, and release resource allocations in a Hadoop cluster. Llama is only required if resource management is enabled in Impala.
By default, YARN allocates resources bit-by-bit as needed by MapReduce jobs. Impala needs all resources available at the same time, so that intermediate results can be exchanged between cluster nodes, and queries do not stall partway through waiting for new resources to be allocated. Llama is the intermediary process that ensures all requested resources are available before each Impala query actually begins.
For Llama installation instructions, see Llama installation.
Checking Resource Estimates and Actual Usage
To make resource usage easier to verify, the output of the EXPLAIN SQL statement now includes information about estimated memory usage, whether table and column statistics are available for each table, and the number of virtual cores that a query will use. You can get this information through the EXPLAIN statement without actually running the query. The extra information requires setting the query option EXPLAIN_LEVEL=verbose; see EXPLAIN Statement for details. The same extended information is shown at the start of the output from the PROFILE statement in impala-shell. The detailed profile information is only available after running the query. You can take appropriate actions (gathering statistics, adjusting query options) if you find that queries fail or run with suboptimal performance when resource management is enabled.
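As a sketch of this verification step, you might run the following from a shell; the host and table names here are placeholders, not from the original text:

```shell
# Hypothetical impala-shell session; host and table names are placeholders.
impala-shell -i impala-host-1.example.com <<'EOF'
SET EXPLAIN_LEVEL=verbose;
-- Extended output includes estimated memory usage, the number of
-- virtual cores, and whether table/column statistics are available.
EXPLAIN SELECT COUNT(*) FROM sample_table;
EOF
```

To see the corresponding actual figures, run the query and then issue PROFILE in the same session.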
How Resource Limits Are Enforced
- CPU limits are enforced by the Linux CGroups mechanism. YARN grants resources in the form of containers that correspond to CGroups on the respective machines.
- Memory is enforced by Impala's query memory limits. Once a reservation request has been granted, Impala sets the query memory limit according to the granted amount of memory before executing the query.
Enabling Resource Management for Impala
To enable resource management for Impala, first you set up the YARN and Llama services for your CDH cluster. Then you add startup options and customize resource management settings for the Impala services.
Required CDH Setup for Resource Management with Impala
YARN is the general-purpose service that manages resources for many Hadoop components within a CDH cluster. Llama is a specialized service that acts as an intermediary between Impala and YARN, translating Impala resource requests to YARN and coordinating with Impala so that queries only begin executing when all needed resources have been granted by YARN.
impalad Startup Options for Resource Management
- -enable_rm: Whether to enable resource management or not, either true or false. The default is false. None of the other resource management options have any effect unless -enable_rm is turned on.
- -llama_host: Hostname or IP address of the Llama service that Impala should connect to. The default is 127.0.0.1.
- -llama_port: Port of the Llama service that Impala should connect to. The default is 15000.
- -llama_callback_port: Port that Impala should start its Llama callback service on. Llama reports when resources are granted or preempted through that service.
- -cgroup_hierarchy_path: Path where YARN and Llama will create CGroups for granted resources. Impala assumes that the CGroup for an allocated container is created in the path 'cgroup_hierarchy_path + container_id'.
impala-shell Query Options for Resource Management
Limitations of Resource Management for Impala
Currently, the beta versions of CDH 5 and Impala have the following limitations for resource management of Impala queries:
- The resource management feature is not available for a cluster that uses Kerberos authentication.
- Table statistics are required, and column statistics are highly valuable, for Impala to produce accurate estimates of how much memory to request from YARN. See Table Statistics and Column Statistics for instructions on gathering both kinds of statistics, and EXPLAIN Statement for the extended EXPLAIN output where you can check that statistics are available for a specific table and set of columns.
- If the Impala estimate of required memory is lower than the amount actually required for a query, Impala cancels the query when it exceeds the requested memory size. This could happen in some cases with complex queries, even when table and column statistics are available. You can see the actual memory usage after a failed query by issuing a PROFILE command in impala-shell. Specify a larger memory figure with the MEM_LIMIT query option and re-try the query.
Currently, there are known bugs that could cause the maximum memory usage reported by the PROFILE command to be lower than the actual value.
- The MEM_LIMIT query option, and the other resource-related query options, are not currently settable through the ODBC or JDBC interfaces.
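The recovery sequence described above might look like the following; the table, column, and memory figure are placeholders for illustration:

```shell
# Hypothetical sequence after a query is cancelled for exceeding its
# memory reservation; table and column names are placeholders.
impala-shell <<'EOF'
SELECT COUNT(DISTINCT c1) FROM big_table;   -- fails: memory limit exceeded
PROFILE;                                    -- inspect actual per-node memory use
SET MEM_LIMIT=4000000000;                   -- request roughly 4 GB instead
SELECT COUNT(DISTINCT c1) FROM big_table;   -- re-try with the larger limit
EOF
```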
Managing Disk Space for Impala Data
Although Impala typically works with many large files in an HDFS storage system with plenty of capacity, there are times when you might perform some file cleanup to reclaim space, or advise developers on techniques to minimize space consumption and file duplication.
- Use DROP TABLE statements to remove data files for Impala-managed ("internal") tables. Use DESCRIBE FORMATTED to check whether a particular table is internal (managed by Impala) or external. See DROP TABLE Statement and DESCRIBE Statement for details.
- Be aware of the HDFS trashcan. See DROP TABLE Statement and the FAQ entry Why is space not freed up when I issue DROP TABLE? for details. See User Account Requirements for permissions needed for the HDFS trashcan to operate correctly.
- Use the DESCRIBE FORMATTED statement to see the physical location in HDFS of the data files for a table. See DESCRIBE Statement for details.
- Use external tables to reference HDFS data files in their original location. With this technique, you avoid copying the files, and you can map more than one Impala table to the same set of data files. When you drop the Impala table, the data files are left undisturbed. See External Tables for details.
- Use the LOAD DATA statement to move HDFS files under Impala control. See LOAD DATA Statement for details.
- Drop all tables in a database before dropping the database itself. See DROP DATABASE Statement for details.
- Use compact binary file formats where practical. See How Impala Works with Hadoop File Formats for details, especially Using the Parquet File Format with Impala Tables.
- Clean up temporary files after failed INSERT statements.
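A quick way to apply several of these tips is to find a table's data files and measure their footprint directly; the table name and warehouse path below are placeholder assumptions:

```shell
# Locate a table's HDFS data files and check how much space they use.
# Table name and paths are placeholders.
impala-shell -q "DESCRIBE FORMATTED my_table"   # note the Location: field
hdfs dfs -du -s -h /user/hive/warehouse/my_table
# After DROP TABLE, space might still be held by the HDFS trashcan:
hdfs dfs -ls /user/impala/.Trash
```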
Setting Timeout Periods for Daemons, Queries, and Sessions
Depending on how busy your CDH cluster is, you might increase or decrease various timeout values.
Increasing the Statestore Timeout
If you have an extensive Impala schema, for example with hundreds of databases, tens of thousands of tables, and so on, you might encounter timeout errors during startup as the Impala catalog service broadcasts metadata to all the Impala nodes using the statestore service. To avoid such timeout errors on startup, increase the statestore timeout value from its default of 10 seconds. Specify the timeout value using the -statestore_subscriber_timeout_seconds option for the statestore service, using the configuration instructions in Modifying Impala Startup Options. The symptom of this problem is messages in the impalad log such as:
Connection with state-store lost
Trying to re-register with state-store
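For a non-Cloudera-Manager installation, the change might be sketched as follows in /etc/default/impala; 60 seconds is an arbitrary example value larger than the 10-second default:

```shell
# Illustrative sketch: extend the statestore arguments to raise the
# subscriber timeout above its 10-second default.
export IMPALA_STATE_STORE_ARGS="${IMPALA_STATE_STORE_ARGS:-} \
  -statestore_subscriber_timeout_seconds=60"
```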
Setting the Idle Query and Idle Session Timeouts for impalad
To keep long-running queries or idle sessions from tying up cluster resources, you can set timeout intervals for both individual queries, and entire sessions. Specify the following startup options for the impalad daemon:
- The --idle_query_timeout option specifies the time in seconds after which an idle query is cancelled. This could be a query whose results were all fetched but was never closed, or one whose results were partially fetched and then the client program stopped requesting further results. This condition is most likely to occur in a client program using the JDBC or ODBC interfaces, rather than in the interactive impala-shell interpreter. Once the query is cancelled, the client program cannot retrieve any further results.
- The --idle_session_timeout option specifies the time in seconds after which an idle session is expired. A session is idle when no activity is occurring for any of the queries in that session, and the session has not started any new queries. Once a session is expired, you cannot issue any new query requests to it. The session remains open, but the only operation you can perform is to close it. The default value of 0 means that sessions never expire.
For instructions on changing impalad startup options, see Modifying Impala Startup Options.
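As a sketch for a non-Cloudera-Manager installation, both timeouts might be set like this; the specific values (10 minutes and 1 hour) are illustrative assumptions:

```shell
# Illustrative values: cancel queries idle for 10 minutes and expire
# sessions idle for 1 hour.
export IMPALA_SERVER_ARGS="${IMPALA_SERVER_ARGS:-} \
  --idle_query_timeout=600 \
  --idle_session_timeout=3600"
```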
Using Impala through a Proxy for High Availability
For a busy, heavily loaded cluster, you might set up a proxy server to relay requests to and from Impala. This configuration has the following advantages:
- Applications connect to a single well-known host and port, rather than keeping track of the hosts where the impalad daemon is running.
- If any host running the impalad daemon becomes unavailable, application connection requests will still succeed because you always connect to the proxy server.
- The coordinator node for each Impala query potentially requires more memory and CPU cycles than the other nodes that process the query. The proxy server can issue queries using round-robin scheduling, so that each query uses a different coordinator node. This load-balancing technique lets the Impala nodes share this additional work, rather than concentrating it on a single machine.
The following setup steps are a general outline that apply to any load-balancing proxy software.
- Download the load-balancing proxy software. It should only need to be installed and configured on a single host.
- Configure the software (typically by editing a configuration file). Set up a port that the load balancer will listen on to relay Impala requests back and forth.
- Specify the host and port settings for each Impala node. These are the hosts that the load balancer will choose from when relaying each Impala query. See Appendix A - Ports Used by Impala for when to use port 21000, 21050, or another value depending on what type of connections you are load balancing.
- Run the load-balancing proxy server, pointing it at the configuration file that you set up.
Special Proxy Considerations for Clusters Using Kerberos
In a cluster using Kerberos, applications check host credentials to verify that the host they are connecting to is the same one that is actually processing the request, to prevent man-in-the-middle attacks. To clarify that the load-balancing proxy server is legitimate, perform these extra Kerberos setup steps:
- This section assumes you are starting with a Kerberos-enabled cluster. See Enabling Kerberos Authentication for Impala for instructions for setting up Impala with Kerberos. See the CDH Security Guide for general steps to set up Kerberos: CDH 4 instructions or CDH 5 instructions.
- Choose the host you will use for the proxy server. Based on the Kerberos setup procedure, it should already have an entry impala/proxy_host@realm in its keytab. If not, go back over the initial Kerberos configuration steps.
- Copy the keytab file from the proxy host to all other hosts in the cluster that run the impalad daemon. (For optimal performance, impalad should be running on all DataNodes in the cluster.) Put the keytab file in a secure location on each of these other hosts.
- Add an entry impala/actual_hostname@realm to the keytab on each host running the impalad daemon. (This step applies only to clusters not managed by Cloudera Manager.)
For each impalad node, merge the existing keytab with the proxy’s keytab using ktutil, producing a new keytab file. For example:
$ ktutil
ktutil: read_kt proxy.keytab
ktutil: read_kt impala.keytab
ktutil: write_kt proxy_impala.keytab
ktutil: quit
- Make sure that the impala user has permission to read this merged keytab file.
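To confirm the merge worked before restarting anything, you can list the principals in the new keytab; the file name matches the ktutil example above:

```shell
# klist -k lists the principals contained in a keytab. Expect entries
# for both impala/proxy_host@realm and impala/actual_hostname@realm.
klist -k proxy_impala.keytab
```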
- Modify the impalad startup parameters on each host that participates in the load balancing. See Modifying Impala Startup Options for the procedure to modify the startup options.
In the impalad option definition or the Cloudera Manager safety valve field, add:
--principal=impala/proxy_host@realm --be_principal=impala/actual_host@realm --keytab_file=path_to_merged_keytab
Note: Every host has a different --be_principal because the actual host name is different on each host.
- Restart the impalad daemons on all hosts in the cluster. (In Cloudera Manager, restart the Impala service.)
Example of Configuring HAProxy Load Balancer for Impala
If you are not already using a load-balancing proxy, you can experiment with HAProxy, a free, open source load balancer. This example shows how you might install and configure that load balancer on a Red Hat Enterprise Linux system.
Install the load balancer: yum install haproxy
Set up the configuration file: /etc/haproxy/haproxy.cfg. A sample haproxy.cfg for Impala is shown at the end of this section.
Run the load balancer (on a single host, preferably one not running impalad): /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg
In impala-shell, JDBC applications, or ODBC applications, connect to haproxy_host:25003, rather than port 25000 on a host actually running impalad.
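For example, from impala-shell the connection might look like this; the proxy host name is a placeholder, and 25003 matches the listen port used in this example:

```shell
# Connect through the proxy rather than to an individual impalad.
impala-shell -i haproxy-host.example.com:25003
```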
This is the sample haproxy.cfg used in this example.
global
    # To have these messages end up in /var/log/haproxy.log you will
    # need to:
    #
    # 1) configure syslog to accept network log events. This is done
    #    by adding the '-r' option to the SYSLOGD_OPTIONS in
    #    /etc/sysconfig/syslog
    #
    # 2) configure local2 events to go to the /var/log/haproxy.log
    #    file. A line like the following can be added to
    #    /etc/sysconfig/syslog
    #
    #    local2.*    /var/log/haproxy.log
    #
    log         127.0.0.1 local0
    log         127.0.0.1 local1 notice
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon

    # turn on stats unix socket
    #stats socket /var/lib/haproxy/stats

#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#
# You might need to adjust timing values to prevent timeouts.
#---------------------------------------------------------------------
defaults
    mode        http
    log         global
    option      httplog
    option      dontlognull
    option      http-server-close
    option      forwardfor except 127.0.0.0/8
    option      redispatch
    retries     3
    maxconn     3000
    contimeout  5000
    clitimeout  50000
    srvtimeout  50000

#
# This sets up the admin page for HAProxy at port 25002.
#
listen stats :25002
    balance
    mode http
    stats enable
    stats auth username:password

# This is the setup for Impala. Impala clients connect to load_balancer_host:25003.
# HAProxy balances connections among the list of servers listed below.
# The impalad daemons listen on port 21000 for beeswax (impala-shell) or the original ODBC driver.
# For the JDBC or ODBC version 2.x driver, use port 21050 instead of 21000.
listen impala :25003
    mode tcp
    option tcplog
    balance leastconn

    server symbolic_name_1 impala-host-1.example.com:21000
    server symbolic_name_2 impala-host-2.example.com:21000
    server symbolic_name_3 impala-host-3.example.com:21000
    server symbolic_name_4 impala-host-4.example.com:21000