Impala 1.1 adds a fine-grained authorization framework for Hadoop, by integrating the Sentry open source project. Together with the existing Kerberos authentication framework, Impala 1.1 takes Hadoop security to the level required by highly regulated industries such as healthcare, financial services, and government. Impala 1.1.1 fills in the security feature set even more by adding an auditing capability; Impala generates the audit data, the Cloudera Navigator product consolidates the audit data from all nodes in the cluster, and Cloudera Manager lets you filter, visualize, and produce reports.
The security features of Cloudera Impala have several objectives. At the most basic level, security prevents accidents or mistakes that could disrupt application processing, delete or corrupt data, or reveal data to unauthorized users. More advanced security features and practices can harden the system against malicious users trying to gain unauthorized access or perform other disallowed operations. The auditing feature provides a way to confirm that no unauthorized access occurred, and detect whether any such attempts were made. This is a critical set of features for production deployments in large organizations that handle important or sensitive data. It sets the stage for multi-tenancy, where multiple applications run concurrently and are prevented from interfering with each other.
The material in this section presumes that you are already familiar with administering secure Linux systems. That is, you should know the general security practices for Linux and Hadoop, and their associated commands and configuration files. For example, you should know how to create Linux users and groups, manage Linux group membership, set Linux and HDFS file permissions and ownership, and designate the default permissions and ownership for new files. You should be familiar with the configuration of the nodes in your Hadoop cluster, and know how to apply configuration changes or run a set of commands across all the nodes.
The security features are divided into these broad categories:
- Authorization. Which users are allowed to access which resources, and what operations are they allowed to perform? Impala uses the OS user ID of the user who runs impala-shell or another client program, and associates various privileges with each user. Impala relies on the open source Sentry project for authorization.
- Authentication. How does Impala verify the identity of the user, to confirm that they really are allowed to exercise the privileges assigned to that user? Impala relies on the Kerberos subsystem for authentication.
- Auditing. What operations were attempted, and did they succeed or not? This feature provides a way to look back and diagnose whether attempts were made to perform unauthorized operations. You use this information to track down suspicious activity, and to see where changes are needed in authorization policies. The audit data produced by this feature is collected by the Cloudera Navigator product and then presented in a user-friendly form by Cloudera Manager.
The following sections lead you through the various security-related features of Impala.
Security Guidelines for Impala
The following are the major steps to harden a cluster running Impala against accidents and mistakes, or malicious attackers trying to access sensitive data:
- Secure the root account. The root user can tamper with the impalad daemon, read and write the data files in HDFS, log into other user accounts, and access other system services that are beyond the control of Impala.
- Restrict membership in the sudoers list (in the /etc/sudoers file). The users who can run the sudo command can do many of the same things as the root user.
- Ensure the Hadoop ownership and permissions for Impala data files are restricted.
- Ensure the Hadoop ownership and permissions for Impala log files are restricted.
- Ensure that the Impala web UI (available by default on port 25000 on each Impala node) is password-protected. See Securing the Impala Web User Interface for details.
- Create a policy file that specifies which Impala privileges are available to users in particular Hadoop groups (which by default map to Linux OS groups). Create the associated Linux groups using the groupadd command if necessary.
- The Impala authorization feature makes use of the HDFS file ownership and permissions mechanism; for background information, see the CDH HDFS Permissions Guide. Set up users and assign them to groups at the OS level, corresponding to the different categories of users with different access levels for various databases, tables, and HDFS locations (URIs). Create the associated Linux users using the useradd command if necessary, and add them to the appropriate groups with the usermod command.
- Design your databases, tables, and views so that your authorization policy rules can be simple and consistent. For example, if all tables related to an application are inside a single database, you can assign privileges for that database and use the * wildcard for the table name. If you are creating views with different privileges than the underlying base tables, you might put the views in a separate database so that you can use the * wildcard for the database containing the base tables, while specifying the precise names of the individual views. (For specifying table or database names, you either specify the exact name or * to mean all the databases on a server, or all the tables and views in a database.)
- Enable authorization by running the impalad daemons with the -server_name and -authorization_policy_file options on all nodes. (The authorization feature does not apply to the statestored daemon, which has no access to schema objects or data files.)
- Set up authentication using Kerberos, to make sure users really are who they say they are.
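As an illustration of the policy-file approach described above, a minimal Sentry policy file might look like the following. The group names, role names, and database name are invented for this sketch; the groups on the left side of the [groups] section map to OS groups by default.

```
[groups]
# OS groups on the left, Sentry roles on the right
analysts = analyst_role
etl_users = etl_role

[roles]
# Read-only access to every table in the sales database
analyst_role = server=server1->db=sales->table=*->action=SELECT
# Full access to the sales database for the ETL pipeline
etl_role = server=server1->db=sales
```

The server name (server1 here) must match the value you pass to impalad with the -server_name option.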
Securing Impala Data and Log Files
One aspect of security is to protect files from unauthorized access at the filesystem level. For example, if you store sensitive data in an Impala table, you specify permissions on the associated files and directories in HDFS to restrict read and write permissions to the appropriate users and groups. If you issue queries containing sensitive values in the WHERE clause, such as financial account numbers, those values are stored in Impala log files in the Linux filesystem and you must secure those files also.
For the locations of Impala log files, see Using Impala Logging.
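As a sketch of the filesystem side of this, the following locks down a log directory so that only its owner and group can read it. The directory path is a stand-in, not the real Impala log location, and the HDFS commands shown in comments follow the same pattern for table data files.

```shell
# Stand-in directory for illustration; substitute your actual Impala log
# directory (and run as the appropriate user) on a real cluster.
LOG_DIR=/tmp/impala_log_demo
mkdir -p "$LOG_DIR"

# Owner: full access; group: read/traverse; others: nothing.
chmod 750 "$LOG_DIR"

# The analogous commands for HDFS data files would be along these lines
# (ownership and group names depend on your cluster setup):
#   hdfs dfs -chown -R impala:hive /user/hive/warehouse/sensitive_db.db
#   hdfs dfs -chmod -R 770 /user/hive/warehouse/sensitive_db.db

stat -c '%a' "$LOG_DIR"    # prints 750
```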
Installation Considerations for Impala Security
Impala 1.1 comes set up with all the software and settings needed to enable security when you run the impalad daemon with the new security-related options (-server_name and -authorization_policy_file). You do not need to change any environment variables or install any additional JAR files. In a cluster managed by Cloudera Manager, you do not need to change any settings in Cloudera Manager.
Securing the Hive Metastore Database
It is important to secure the Hive metastore, so that users cannot access the names or other information about databases and tables through the Hive client or by querying the metastore database. Do this by turning on Hive metastore security, using the instructions in the CDH4 Security Guide (or the CDH 5 equivalent) for securing different Hive components:
- Secure the Hive Metastore.
- In addition, allow access to the metastore only from the HiveServer2 server, and then disable local access to the HiveServer2 server.
Securing the Impala Web User Interface
The instructions in this section presume you are familiar with the .htpasswd mechanism commonly used to password-protect pages on web servers.
Password-protect the Impala web UI that listens on port 25000 by default. Set up a .htpasswd file in the $IMPALA_HOME directory, or start both the impalad and statestored daemons with the --webserver_password_file option to specify a different location (including the filename).
This file should only be readable by the Impala process and machine administrators, because it contains (hashed) versions of passwords. The username / password pairs are not derived from Unix usernames, Kerberos users, or any other system. The domain field in the password file must match the domain supplied to Impala by the new command-line option --webserver_authentication_domain. The default is mydomain.com.
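Because the file stores user:domain:hash entries, one per line, you can construct it by hand. The sketch below assumes the hash is MD5 over "user:domain:password", the digest format produced by Apache's htdigest tool; the username, domain, and password are placeholders.

```shell
# Placeholder credentials for illustration only.
USER=admin
DOMAIN=mydomain.com
PASS=impala_demo_password

# MD5 over "user:domain:password", as in Apache htdigest files
# (an assumption based on the domain field described above).
HASH=$(printf '%s:%s:%s' "$USER" "$DOMAIN" "$PASS" | md5sum | cut -d' ' -f1)

rm -f /tmp/impala_htpasswd_demo
printf '%s:%s:%s\n' "$USER" "$DOMAIN" "$HASH" > /tmp/impala_htpasswd_demo
chmod 400 /tmp/impala_htpasswd_demo    # readable only by the owner
cat /tmp/impala_htpasswd_demo
```

On a real node, point --webserver_password_file at the finished file rather than a /tmp path.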
Impala also supports using HTTPS for secure web traffic. To do so, set --webserver_certificate_file to refer to a valid .pem SSL certificate file. Impala will automatically start using HTTPS once the SSL certificate has been read and validated. A .pem file is basically a private key, followed by a signed SSL certificate; make sure to concatenate both parts when constructing the .pem file.
If Impala cannot find or parse the .pem file, it prints an error message and quits.
If the private key is encrypted using a passphrase, Impala will ask for that passphrase on startup, which is not useful for a large cluster. In that case, remove the passphrase and make the .pem file readable only by Impala and administrators.
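To illustrate the layout of such a .pem file, the following sketch generates a throwaway self-signed key and certificate and concatenates them in the order described above (private key first, then certificate). In production you would use a certificate signed by your CA; the file names here are invented.

```shell
cd /tmp

# Throwaway self-signed key + certificate, for illustration only.
openssl req -x509 -newkey rsa:2048 -days 365 -nodes \
    -subj '/CN=impala_host.example.com' \
    -keyout demo_key.pem -out demo_cert.pem 2>/dev/null

# If your real key is passphrase-protected, strip the passphrase first,
# since Impala would otherwise prompt for it on every startup:
#   openssl rsa -in protected_key.pem -out demo_key.pem

# Private key first, then the signed certificate.
rm -f demo_webserver.pem
cat demo_key.pem demo_cert.pem > demo_webserver.pem
chmod 400 demo_webserver.pem

# Sanity check: the certificate inside the combined file is parseable.
openssl x509 -in demo_webserver.pem -noout -subject
```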
When you turn on SSL for the Impala web UI, the associated URLs change from http:// prefixes to https://. Adjust any bookmarks or application code that refers to those URLs.
Enabling Kerberos Authentication for Impala
Impala supports Kerberos authentication. For more information on enabling Kerberos authentication, see the topic on Configuring Hadoop Security in CDH4 in the CDH4 Security Guide (or the CDH 5 equivalent). Impala currently does not support application data wire encryption. When using Impala in a managed environment, Cloudera Manager automatically completes Kerberos configuration. In an unmanaged environment, create a Kerberos principal for each host running impalad or statestored. Cloudera recommends using a consistent format, such as impala/_HOST@Your-Realm, but you can use any three-part Kerberos server principal.
To use Kerberos authentication with impala-shell, install the python ssl module if it is not already present:

$ sudo yum install python-devel openssl-devel python-pip
$ sudo pip-python install ssl
If you plan to use Impala in your cluster, you must configure your KDC to allow tickets to be renewed, and you must configure krb5.conf to request renewable tickets. Typically, you can do this by adding the max_renewable_life setting to your realm in kdc.conf, and by adding the renew_lifetime parameter to the libdefaults section of krb5.conf. For more information about renewable tickets, see the Kerberos documentation.
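For example, the relevant settings might look like the following; the realm name matches the examples later in this section, and the seven-day lifetimes are illustrative values, not recommendations:

```
# kdc.conf -- inside your realm's definition:
[realms]
    TEST.EXAMPLE.COM = {
        max_renewable_life = 7d
    }

# krb5.conf:
[libdefaults]
    renew_lifetime = 7d
```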
Currently, you cannot use the resource management feature in CDH 5 on a cluster that has Kerberos authentication enabled.
Start all impalad and statestored daemons with the --principal and --keytab_file flags set to the principal and full path name of the keytab file containing the credentials for the principal. Impala supports Kerberos authentication through the Cloudera ODBC driver. To use Kerberos through the ODBC driver, set the host type according to the level of the ODBC driver:
- SecImpala for the ODBC 1.0 driver.
- SecBeeswax for the ODBC 1.2 driver.
- Blank for the ODBC 2.0 driver or higher, when connecting to a secure cluster.
- HS2NoSasl for the ODBC 2.0 driver or higher, when connecting to a non-secure cluster.
To enable Kerberos in the Impala shell, start the impala-shell command using the -k flag.
To enable Impala to work with Kerberos security on your Hadoop cluster, make sure you perform the installation and configuration steps in the topic on Configuring Hadoop Security in the CDH4 Security Guide (or the CDH 5 equivalent). Also note that when Kerberos security is enabled in Impala, a web browser that supports Kerberos HTTP SPNEGO is required to access the Impala web console (for example, Firefox, Internet Explorer, or Chrome).
If the NameNode, Secondary NameNode, DataNode, JobTracker, TaskTrackers, ResourceManager, NodeManagers, HttpFS, Oozie, Impala, or Impala statestore services are configured to use Kerberos HTTP SPNEGO authentication, and two or more of these services are running on the same host, then all of the running services must use the same HTTP principal and keytab file used for their HTTP endpoints.
Configuring Impala to Support Kerberos Security
Enabling Kerberos authentication for Impala involves steps that can be summarized as follows:
- Creating service principals for Impala and the HTTP service. Principal names take the form: serviceName/fully.qualified.domain.name@KERBEROS.REALM
- Creating, merging, and distributing key tab files for these principals.
- Editing /etc/default/impala (in a cluster not managed by Cloudera Manager), or editing the Security settings in the Cloudera Manager interface, to accommodate Kerberos authentication.
To enable Kerberos for Impala:
Create an Impala service principal, specifying the name of the OS user that the Impala daemons run under,
the fully qualified domain name of each node running impalad, and the realm name. For example:
$ kadmin
kadmin: addprinc -requires_preauth -randkey impala/impala_host.example.com@TEST.EXAMPLE.COM
Create an HTTP service principal. For example:
kadmin: addprinc -randkey HTTP/impala_host.example.com@TEST.EXAMPLE.COM

Note: The HTTP component of the service principal must be uppercase as shown in the preceding example.
Create keytab files with both principals. For example:
kadmin: xst -k impala.keytab impala/impala_host.example.com
kadmin: xst -k http.keytab HTTP/impala_host.example.com
kadmin: quit
Use ktutil to read the contents of the two keytab files and
then write those contents to a new file. For example:
$ ktutil
ktutil: rkt impala.keytab
ktutil: rkt http.keytab
ktutil: wkt impala-http.keytab
ktutil: quit
(Optional) Test that credentials in the merged keytab file are valid. For example:
$ klist -e -k -t impala-http.keytab
Copy the impala-http.keytab file to the Impala configuration directory. Change the permissions so that the file is readable only by the file owner, and change the file owner to the impala user. By default, the Impala user and group are both named impala. For example:
$ cp impala-http.keytab /etc/impala/conf
$ cd /etc/impala/conf
$ chmod 400 impala-http.keytab
$ chown impala:impala impala-http.keytab
Add Kerberos options to the Impala defaults file, /etc/default/impala. Add the options for both the impalad and statestored daemons, using the IMPALA_SERVER_ARGS and IMPALA_STATE_STORE_ARGS variables. For example, you might add:

-kerberos_reinit_interval=60
-principal=impala_1/impala_host.example.com@TEST.EXAMPLE.COM
-keytab_file=/var/run/cloudera-scm-agent/process/3212-impala-IMPALAD/impala.keytab
For more information on changing the Impala defaults specified in /etc/default/impala, see Modifying Impala Startup Options.
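Putting the earlier steps together, the relevant fragment of /etc/default/impala might end up looking like this. The keytab path and principal are carried over from the examples above, and any other flags already present in these variables are omitted for brevity:

```
IMPALA_SERVER_ARGS=" \
    -kerberos_reinit_interval=60 \
    -principal=impala/impala_host.example.com@TEST.EXAMPLE.COM \
    -keytab_file=/etc/impala/conf/impala-http.keytab"

IMPALA_STATE_STORE_ARGS=" \
    -kerberos_reinit_interval=60 \
    -principal=impala/impala_host.example.com@TEST.EXAMPLE.COM \
    -keytab_file=/etc/impala/conf/impala-http.keytab"
```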
Using a Web Browser to Access a URL Protected by Kerberos HTTP SPNEGO
Your web browser must support Kerberos HTTP SPNEGO (for example, Chrome, Firefox, or Internet Explorer).
To configure Firefox to access a URL protected by Kerberos HTTP SPNEGO:
- Open the advanced settings Firefox configuration page by loading the about:config page.
- Use the Filter text box to find network.negotiate-auth.trusted-uris.
- Double-click the network.negotiate-auth.trusted-uris preference and enter the hostname or the domain of the web server that is protected by Kerberos HTTP SPNEGO. Separate multiple domains and hostnames with a comma.
- Click OK.
Auditing Impala Operations
To monitor how Impala data is being used within your organization, ensure that your Impala authorization and authentication policies are effective, and detect attempts at intrusion or unauthorized access to Impala data, you can use the auditing feature in Impala 1.1.1 and higher:
- Enable auditing by including the option -audit_event_log_dir=directory_path in your impalad startup options. The path refers to a local directory on the server, not an HDFS directory.
- Decide how many queries will be represented in each log file. By default, Impala starts a new log file every 5000 queries. To specify a different number, include the option -max_audit_event_log_file_size=number_of_queries in the impalad startup options. Limiting the size lets you manage disk space by archiving older logs, and reduces the amount of text to process when analyzing activity for a particular period.
- Configure the Cloudera Navigator product to collect and consolidate the audit logs from all the nodes in the cluster.
- Use the Cloudera Manager product to filter, visualize, and produce reports based on the audit data. (The Impala auditing feature works with Cloudera Manager 4.7 or higher.) Check the audit data to ensure that all activity is authorized and/or detect attempts at unauthorized access.
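For example, in a cluster not managed by Cloudera Manager, the auditing options from the first two steps above might be appended to the impalad startup arguments in /etc/default/impala like this (the directory path is illustrative):

```
IMPALA_SERVER_ARGS="${IMPALA_SERVER_ARGS} \
    -audit_event_log_dir=/var/log/impala/audit \
    -max_audit_event_log_file_size=5000"
```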
Durability and Performance Considerations for Impala Auditing
The auditing feature only imposes performance overhead while auditing is enabled.
Because any Impala node can process a query, enable auditing on all nodes where the impalad daemon runs. Each node stores its own log files, in a directory in the local filesystem. The log data is periodically flushed to disk (through an fsync() system call) to avoid loss of audit data in case of a crash.
The runtime overhead of auditing applies to whichever node serves as the coordinator for the query, that is, the node you connect to when you issue the query. This might be the same node for all queries, or different applications or users might connect to and issue queries through different nodes.
To avoid excessive I/O overhead on busy coordinator nodes, Impala syncs the audit log data (using the fsync() system call) periodically rather than after every query. Currently, the fsync() calls are issued at a fixed interval, every 5 seconds.
By default, Impala avoids losing any audit log data in the case of an error during a logging operation (such as a disk full error), by immediately shutting down the impalad daemon on the node where the auditing problem occurred. You can override this setting by specifying the option -abort_on_failed_audit_event=false in the impalad startup options.
Format of the Audit Log Files
The audit log files represent the query information in JSON format, one query per line. Typically, rather than looking at the log files themselves, you use the Cloudera Navigator product to consolidate the log data from all Impala nodes and filter and visualize the results in useful ways. (If you do examine the raw log data, you might run the files through a JSON pretty-printer first.)
All the information about schema objects accessed by the query is encoded in a single nested record on the same line. For example, the audit log for an INSERT ... SELECT statement records that a select operation occurs on the source table and an insert operation occurs on the destination table. The audit log for a query against a view records the base table accessed by the view, or multiple base tables in the case of a view that includes a join query. Every Impala operation that corresponds to a SQL statement is recorded in the audit logs, whether the operation succeeds or fails. Impala records more information for a successful operation than for a failed one, because an unauthorized query is stopped immediately, before all the query planning is completed.
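To make the one-record-per-line layout concrete, here is a mock audit entry and a quick way to pull a field out of it. The field names are invented to mirror the logged categories, not Impala's exact schema:

```shell
# Mock audit records: field names are invented to mirror the logged
# categories, NOT Impala's exact schema; real entries are one JSON
# record per line, as described above.
cat > /tmp/audit_demo.log <<'EOF'
{"session_id":"a1b2","user":"alice","statement_type":"DML","sql_statement":"SELECT * FROM t1","status":"OK"}
{"session_id":"c3d4","user":"bob","statement_type":"DDL","sql_statement":"DROP TABLE t1","status":"AuthorizationException"}
EOF

# List the user behind each logged statement:
grep -o '"user":"[^"]*"' /tmp/audit_demo.log | cut -d'"' -f4
```

In practice you would let Cloudera Navigator do this consolidation and filtering for you, as described above.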
The information logged for each query includes:
- Client session state:
  - Session ID
  - User name
  - Network address of the client connection
- SQL statement details:
  - Query ID
  - Statement Type - DML, DDL, and so on
  - SQL statement text
  - Execution start time, in local time
  - Execution Status - Details on any errors that were encountered
  - Target Catalog Objects:
    - Object Type - Table, View, or Database
    - Fully qualified object name
    - Privilege - How the object is being used (SELECT, INSERT, CREATE, and so on)
Which Operations Are Audited
The kinds of SQL queries represented in the audit log are:
- Queries that are prevented due to lack of authorization.
- Queries that Impala can analyze and parse to determine that they are authorized. The audit data is recorded immediately after Impala finishes its analysis, before the query is actually executed.
The audit log does not contain entries for queries that could not be parsed and analyzed. For example, a query that fails due to a syntax error is not recorded in the audit log. The audit log also does not contain queries that fail due to a reference to a table that does not exist, if you would be authorized to access the table if it did exist.
Certain statements in the impala-shell interpreter, such as CONNECT, PROFILE, SET, and QUIT, do not correspond to actual SQL queries, and these statements are not reflected in the audit log.
Reviewing the Audit Logs
You typically do not review the audit logs in raw form. The Cloudera Manager agent periodically transfers the log information into a back-end database where it can be examined in consolidated form. See the Cloudera Navigator documentation for details.