Joshua Fennessy

Introduction to Hive Complex Data Types – Part 1: Array

In addition to supporting the standard scalar data types, Hive also supports three complex data types:

  • Arrays are collections of related items that are all the same scalar data type
  • Structs are object-like collections wherein each item is made up of multiple pieces of data each with its own data type
  • Maps are key/value pair collections

This first of three related articles will introduce the Array data type that Hive offers.

Hive supports single dimension arrays which are specified during table creation. For the examples in this article, the following table definition will be used:

Defining Array data types and loading data

CREATE TABLE Products
(
id INT,
ProductName STRING,
ProductColorOptions ARRAY<STRING>
);

A simple INSERT statement can be used to get some sample data in the table. Note the ARRAY function is used to load complex data types into the table.INSERT INTO TABLE default.products

SELECT 1, 'Widgets', array('Red', 'Blue', 'Green')
UNION ALL
SELECT 2, 'Cogs', array('Blue', 'Green', 'Yellow');

In a production environment it’s common to use portable formats like JSON to store array values. If TEXTFILES are used for source data storage, the table can be defined using the COLLECTION ITEMS TERMINATED BY option to specify how the array values should be split.

Querying tables with Arrays

A simple data query can be written that will return all values of the array in a single row:

SELECT id, productname, productcoloroptions FROM default.products

Screen Shot 2015-07-08 at 11.23.14 PM

Specific positions can also be referenced when using an array data type to return a single value of the array:

SELECT id, productname, productcoloroptions[0] FROM default.products;

Screen Shot 2015-07-08 at 11.24.51 PM

Finally, the entire array can be flattened using the LATERAL VIEW EXPLODE statement.  This statement returns a single row for each array value similar to how CROSS JOIN works. The syntax for the query is below

SELECT
p.id
,p.productname
,colors.colorselection
FROM default.products P
LATERAL VIEW EXPLODE(p.productcoloroptions) colors as colorselection;

Screen Shot 2015-07-08 at 11.27.26 PM

Conclusion

Arrays are powerful data constructs that are often found in LOB applications. With Hive, the array does not need to be broken apart in ETL/ELT process. Native support for arrays is a powerful feature that separates HiveQL from standard T-SQL implementations.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: