This is the documentation for Cloudera Impala 1.2.4.
Documentation for other versions is available at Cloudera Documentation.

Using the Avro File Format with Impala Tables

Cloudera Impala supports using tables whose data files use the Avro file format. Impala can query Avro tables, but currently cannot create them or insert data into them. For those operations, use Hive, then switch back to Impala to run queries.

Table 1. Avro Format Support in Impala
File Type Format Compression Codecs Impala Can CREATE? Impala Can INSERT?
Avro Structured Snappy, GZIP, deflate, BZIP2 No, create using Hive. No. Load data through LOAD DATA on data files already in the right format, or use INSERT in Hive.

Continue reading:

Creating Avro Tables

To create a new table using the Avro file format, use the CREATE TABLE statement with the STORED AS AVRO clause in the Impala shell. For information about loading data into Avro tables through Hive, see Avro page on the Hive wiki.

Once the table is created in Hive, switch back to impala-shell and issue an INVALIDATE METADATA statement. Then you can run queries for that table through impala-shell.

The following example demonstrates creating an Avro table in Hive:
hive> CREATE TABLE new_table
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    > STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    > TBLPROPERTIES ('avro.schema.literal'='{
    >    "name": "my_record",
    >    "type": "record",
    >    "fields": [
    >       {"name":"bool_col", "type":"boolean"},
    >       {"name":"int_col", "type":"int"},
    >       {"name":"long_col", "type":"long"},
    >       {"name":"float_col", "type":"float"},
    >       {"name":"double_col", "type":"double"},
    >       {"name":"string_col", "type":"string"},
    >       {"name": "nullable_int", "type": ["null", "int"]]}');
OK
Time taken: 6.372 seconds

Each field of the record becomes a column of the table. Note that any other information, such as the record name, is ignored.

  Note: For nullable columns, make sure to put the "null" entry before the actual type name.

Using a Hive-Created Avro Table in Impala

Once you have an Avro table created through Hive, you can use it in Impala as long as it contains only Impala-compatible data types. (For example, it cannot contain nested types such as array, map, or struct.) Because Impala and Hive share the same metastore database, Impala can directly access the table definitions and data for tables that were created in Hive.

After you create an Avro table in Hive, issue an INVALIDATE METADATA the next time you connect to Impala through impala-shell. This is a one-time operation to make Impala aware of the new table. If you issue Impala queries through more than one node, issue the INVALIDATE METADATA statement on each node, the first time you connect after creating the new table in Hive.

After you load new data into an Avro table, either through a Hive LOAD DATA or INSERT statement, or by manually copying or moving files into the data directory for the table, issue a REFRESH table_name statement the next time you connect to Impala through impala-shell. If you issue Impala queries through more than one node, issue the REFRESH statement on each node, the first time you connect after loading new data without using Impala.

Impala only supports fields of type boolean, int, long, float, double, and string, or unions of these types with null; for example, ["string", "null"]. Unions with null essentially create a nullable type.

Specifying the Avro Schema through JSON

While you can embed a schema directly in your CREATE TABLE statement, as shown above, column width restrictions in the Hive metastore limit the length of schema you can specify. If you encounter problems with long schema literals, try storing your schema as a JSON file in HDFS instead. Specify your schema in HDFS using table properties similar to the following:

tblproperties ('avro.schema.url'='hdfs//your-name-node:port/path/to/schema.json');

Enabling Compression for Avro Tables

To enable compression for Avro tables, specify settings in the Hive shell to enable compression and to specify a codec, then issue a CREATE TABLE statement as in the preceding examples. Impala supports the snappy and deflate codecs for Avro tables.

For example:

hive> set hive.exec.compress.output=true;
hive> set avro.output.codec=snappy;

How Impala Handles Avro Schema Evolution

Starting in Impala 1.1, Impala can deal with Avro data files that employ schema evolution, where different data files within the same table use slightly different type definitions. (You would perform the schema evolution operation by issuing an ALTER TABLE statement in the Hive shell.) The old and new types for any changed columns must be compatible, for example a column might start as an int and later change to a bigint or float.

As with any other tables where the definitions are changed or data is added outside of the current impalad node, ensure that Impala loads the latest metadata for the table if the Avro schema is modified through Hive. Issue a REFRESH table_name or INVALIDATE METADATA table_name statement. REFRESH reloads the metadata immediately, INVALIDATE METADATA reloads the metadata the next time the table is accessed.

When Avro data files or columns are not consulted during a query, Impala does not check for consistency. Thus, if you issue SELECT c1, c2 FROM t1, Impala does not return any error if the column c3 changed in an incompatible way. If a query retrieves data from some partitions but not others, Impala does not check the data files for the unused partitions.

In the Hive DDL statements, you can specify an avro.schema.literal table property (if the schema definition is short) or an avro.schema.url property (if the schema definition is long, or to allow convenient editing for the definition).

For example, running the following SQL code in the Hive shell creates a table using the Avro file format and puts some sample data into it:

CREATE TABLE avro_table (a string, b string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
  'avro.schema.literal'='{
    "type": "record",
    "name": "my_record",
    "fields": [
      {"name": "a", "type": "int"},
      {"name": "b", "type": "string"}
    ]}');

INSERT OVERWRITE TABLE avro_table SELECT 1, "avro" FROM functional.alltypes LIMIT 1;

Once the Avro table is created and contains data, you can query it through the impala-shell command:

-- [localhost:21000] > select * from avro_table;
-- Query: select * from avro_table
-- Query finished, fetching results ...
-- +---+------+
-- | a | b    |
-- +---+------+
-- | 1 | avro |
-- +---+------+

Now in the Hive shell, you change the type of a column and add a new column with a default value:

-- Promote column "a" from INT to FLOAT (no need to update Avro schema)
ALTER TABLE avro_table CHANGE A A FLOAT;

-- Add column "c" with default
ALTER TABLE avro_table ADD COLUMNS (c int);
ALTER TABLE avro_table SET TBLPROPERTIES (
  'avro.schema.literal'='{
    "type": "record",
    "name": "my_record",
    "fields": [
      {"name": "a", "type": "int"},
      {"name": "b", "type": "string"},
      {"name": "c", "type": "int", "default": 10}
    ]}');

Once again in impala-shell, you can query the Avro table based on its latest schema definition. Because the table metadata was changed outside of Impala, you issue a REFRESH statement first so that Impala has up-to-date metadata for the table.

-- [localhost:21000] > refresh avro_table;
-- Query: refresh avro_table
-- Query finished, fetching results ...

-- Returned 0 row(s) in 0.23s
-- [localhost:21000] > select * from avro_table;
-- Query: select * from avro_table
-- Query finished, fetching results ...
-- +---+------+----+
-- | a | b    | c  |
-- +---+------+----+
-- | 1 | avro | 10 |
-- +---+------+----+
-- Returned 1 row(s) in 0.14s