Impala 1.1 adds a fine-grained authorization framework for Hadoop, by integrating the Sentry open source project. Together with the existing Kerberos authentication framework, Impala 1.1 takes Hadoop security to a new level needed for the requirements of highly regulated industries such as healthcare, financial services, and government. Impala 1.1.1 fills in the security feature set even more by adding an auditing capability; Impala generates the audit data, the Cloudera Navigator product consolidates the audit data from all nodes in the cluster, and Cloudera Manager lets you filter, visualize, and produce reports.
The security features of Cloudera Impala have several objectives. At the most basic level, security prevents accidents or mistakes that could disrupt application processing, delete or corrupt data, or reveal data to unauthorized users. More advanced security features and practices can harden the system against malicious users trying to gain unauthorized access or perform other disallowed operations. The auditing feature provides a way to confirm that no unauthorized access occurred, and detect whether any such attempts were made. This is a critical set of features for production deployments in large organizations that handle important or sensitive data. It sets the stage for multi-tenancy, where multiple applications run concurrently and are prevented from interfering with each other.
The material in this section presumes that you are already familiar with administering secure Linux systems. That is, you should know the general security practices for Linux and Hadoop, and their associated commands and configuration files. For example, you should know how to create Linux users and groups, manage Linux group membership, set Linux and HDFS file permissions and ownership, and designate the default permissions and ownership for new files. You should be familiar with the configuration of the nodes in your Hadoop cluster, and know how to apply configuration changes or run a set of commands across all the nodes.
The security features are divided into these broad categories:
- Which users are allowed to access which resources, and what operations are they allowed to perform? Impala relies on the open source Sentry project for authorization. By default (when authorization is not enabled), Impala does all read and write operations with the privileges of the impala user, which is suitable for a development/test environment but not for a secure production environment. When authorization is enabled, Impala uses the OS user ID of the user who runs impala-shell or other client program, and associates various privileges with each user. See Enabling Sentry Authorization for Impala for details about setting up and managing authorization.
- How does Impala verify the identity of the user to confirm that they really are allowed to exercise the privileges assigned to that user? Impala relies on the Kerberos subsystem for authentication. See Enabling Kerberos Authentication for Impala for details about setting up and managing authentication.
- What operations were attempted, and did they succeed or not? This feature provides a way to look back and diagnose whether attempts were made to perform unauthorized operations. You use this information to track down suspicious activity, and to see where changes are needed in authorization policies. The audit data produced by this feature is collected by the Cloudera Manager product and then presented in a user-friendly form by the Cloudera Manager product. See Auditing Impala Operations for details about setting up and managing auditing.
This introductory section covers the general security guidelines for Impala:
- Security Guidelines for Impala
- Securing Impala Data and Log Files
- Installation Considerations for Impala Security
- Securing the Hive Metastore Database
- Securing the Impala Web User Interface
These separate topics cover how Impala integrates with security frameworks such as Kerberos, Sentry, and LDAP:
Security Guidelines for Impala
The following are the major steps to harden a cluster running Impala against accidents and mistakes, or malicious attackers trying to access sensitive data:
- Secure the root account. The root user can tamper with the impalad daemon, read and write the data files in HDFS, log into other user accounts, and access other system services that are beyond the control of Impala.
- Restrict membership in the sudoers list (in the /etc/sudoers file). The users who can run the sudo command can do many of the same things as the root user.
- Ensure the Hadoop ownership and permissions for Impala data files are restricted.
- Ensure the Hadoop ownership and permissions for Impala log files are restricted.
- Ensure that the Impala web UI (available by default on port 25000 on each Impala node) is password-protected. See Securing the Impala Web User Interface for details.
- Create a policy file that specifies which Impala privileges are available to users in particular Hadoop groups (which by default map to Linux OS groups). Create the associated Linux groups using the groupadd command if necessary.
- The Impala authorization feature makes use of the HDFS file ownership and permissions mechanism; for background information, see the CDH HDFS Permissions Guide. Set up users and assign them to groups at the OS level, corresponding to the different categories of users with different access levels for various databases, tables, and HDFS locations (URIs). Create the associated Linux users using the useradd command if necessary, and add them to the appropriate groups with the usermod command.
- Design your databases, tables, and views with database and table structure to allow policy rules to specify simple, consistent rules. For example, if all tables related to an application are inside a single database, you can assign privileges for that database and use the * wildcard for the table name. If you are creating views with different privileges than the underlying base tables, you might put the views in a separate database so that you can use the * wildcard for the database containing the base tables, while specifying the precise names of the individual views. (For specifying table or database names, you either specify the exact name or * to mean all the databases on a server, or all the tables and views in a database.)
- Enable authorization by running the impalad daemons with the -server_name and -authorization_policy_file options on all nodes. (The authorization feature does not apply to the statestored daemon, which has no access to schema objects or data files.)
- Set up authentication using Kerberos, to make sure users really are who they say they are.
Securing Impala Data and Log Files
One aspect of security is to protect files from unauthorized access at the filesystem level. For example, if you store sensitive data in HDFS, you specify permissions on the associated files and directories in HDFS to restrict read and write permissions to the appropriate users and groups.
If you issue queries containing sensitive values in the WHERE clause, such as financial account numbers, those values are stored in Impala log files in the Linux filesystem and you must secure those files also. For the locations of Impala log files, see Using Impala Logging.
All Impala read and write operations are performed under the filesystem privileges of the impala user. The impala user must be able to read all directories and data files that you query, and write into all the directories and data files for INSERT and LOAD DATA statements. At a minimum, make sure the impala user is in the hive group so that it can access files and directories shared between Impala and Hive. See User Account Requirements for more details.
Setting file permissions is necessary for Impala to function correctly, but is not an effective security practice by itself:
- The way to ensure that only authorized users can submit requests for databases and tables they are allowed to access is to set up Sentry authorization, as explained in Enabling Sentry Authorization for Impala. With authorization enabled, the checking of the user ID and group is done by Impala, and unauthorized access is blocked by Impala itself. The actual low-level read and write requests are still done by the impala user, so you must have appropriate file and directory permissions for that user ID.
- You must also set up Kerberos authentication, as described in Enabling Kerberos Authentication for Impala, so that users can only connect from trusted hosts. With Kerberos enabled, if someone connects a new host to the network and creates user IDs that match your privileged IDs, they will be blocked from connecting to Impala at all from that host.
Installation Considerations for Impala Security
Impala 1.1 comes set up with all the software and settings needed to enable security when you run the impalad daemon with the new security-related options (-server_name and -authorization_policy_file). You do not need to change any environment variables or install any additional JAR files. In a cluster managed by Cloudera Manager, you do not need to change any settings in Cloudera Manager.
Securing the Hive Metastore Database
It is important to secure the Hive metastore, so that users cannot access the names or other information about databases and tables the through the Hive client or by querying the metastore database. Do this by turning on Hive metastore security, using the instructions in the CDH 4 Security Guide or the CDH 5 Security Guide for securing different Hive components:
- Secure the Hive Metastore.
- In addition, allow access to the metastore only from the HiveServer2 server, and then disable local access to the HiveServer2 server.
Securing the Impala Web User Interface
The instructions in this section presume you are familiar with the .htpasswd mechanism commonly used to password-protect pages on web servers.
Password-protect the Impala web UI that listens on port 25000 by default. Set up a .htpasswd file in the $IMPALA_HOME directory, or start both the impalad and statestored daemons with the --webserver_password_file option to specify a different location (including the filename).
This file should only be readable by the Impala process and machine administrators, because it contains (hashed) versions of passwords. The username / password pairs are not derived from Unix usernames, Kerberos users, or any other system. The domain field in the password file must match the domain supplied to Impala by the new command-line option --webserver_authentication_domain. The default is mydomain.com.
Impala also supports using HTTPS for secure web traffic. To do so, set --webserver_certificate_file to refer to a valid .pem SSL certificate file. Impala will automatically start using HTTPS once the SSL certificate has been read and validated. A .pem file is basically a private key, followed by a signed SSL certificate; make sure to concatenate both parts when constructing the .pem file.
If Impala cannot find or parse the .pem file, it prints an error message and quits.
If the private key is encrypted using a passphrase, Impala will ask for that passphrase on startup, which is not useful for a large cluster. In that case, remove the passphrase and make the .pem file readable only by Impala and administrators.
When you turn on SSL for the Impala web UI, the associated URLs change from http:// prefixes to https://. Adjust any bookmarks or application code that refers to those URLs.