Joshua Fennessy

Hive Patterns — 4 Common Ways to Load Data

It’s often difficult to remember that Hive isn’t a relational database platform. For those of us coming from RDBMS-land, Hive almost feels TOO familiar.

When we create a table in Hive, the data doesn’t live *in* the table; it remains stored in HDFS as text files, ORC files, or whatever format was specified when the table was created. Hive tables are really nothing more than metadata telling the processing engine how to parse the files and what order to expect the columns in.

So, the data doesn’t *live* in the tables, but there has to be a way to get it there. Indeed there is! There are four common ways.

Table creation over resting data

Sometimes, you might get lucky enough that your big data is already well formatted — maybe it’s a set of comma or tab separated files from a data warehouse extract.

If that’s the case, then ‘loading’ your Hive table can be as simple as specifying the file location at the time the table is created. This method is often used with the EXTERNAL keyword, which ensures that the data remains in place if the table is dropped.

Imagine if you will, a well-formatted comma separated file stored in HDFS at the path /user/hue/data/sales

To create the table, run the following DDL command

CREATE EXTERNAL TABLE sales
(
ProductKey INT
,DateKey INT
,SalesAmount DOUBLE
,TaxAmt DOUBLE
,DiscountAmount DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE
LOCATION '/user/hue/data/sales';

When creating this table, simply specify the location of the directory where the data is stored. When new data is copied into that directory, it automatically becomes accessible to Hive.
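Once the table is defined, Hive reads the files in place. A quick sanity check against the columns from the DDL above might look like:

```sql
-- Confirm that Hive can parse the resting files into rows
SELECT ProductKey, DateKey, SalesAmount
FROM sales
LIMIT 10;
```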

INSERT INTO

In some cases during an ELT workflow, you’ll find yourself needing to insert data into a table that already exists. Consider the table SalesByProduct, which contains the columns ProductKey, DateKey, and SalesAmount.

Part of our ELT workflow could be a query that aggregates the latest raw sales transactions and inserts them into this aggregate table. To do this, a simple HQL query can be run to insert data into the existing table.

INSERT INTO TABLE SalesByProduct
SELECT
ProductKey
,DateKey
,SUM(SalesAmount) AS SalesAmount
FROM
sales
GROUP BY ProductKey, DateKey;

Alternatively, the OVERWRITE keyword can be used in place of INTO to replace ALL of the data in the table with the results of the SELECT statement.

This query first runs the job to collect all of the data, then inserts it into the target table. Note that it is very important that the output of the SELECT statement match the structure of the target table, or the job may fail. An alternative, similar approach is to use a CTE (Common Table Expression), which I will explore in a future article.
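For completeness, here is a sketch of the overwrite form of the same load, using the same tables as above:

```sql
-- Replace the entire contents of SalesByProduct with a fresh aggregate
INSERT OVERWRITE TABLE SalesByProduct
SELECT
ProductKey
,DateKey
,SUM(SalesAmount) AS SalesAmount
FROM sales
GROUP BY ProductKey, DateKey;
```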

LOAD DATA INPATH

Commonly, data lands in HDFS in a staging directory. Sometimes that data is already structured, but the table describing its structure points at a different directory than the staging one. In this case, the data can be loaded into the table simply by running a command that MOVEs it from one location to the other. While this move could be done directly in HDFS, it is sometimes a good idea to let Hive do it, especially with Hive-managed tables.

This is also a common scenario when the data to be loaded is stored on the local disk (not HDFS). The statement supports moving data from either the LOCAL disk or HDFS.

The LOAD DATA command supports both appending and overwriting. If the OVERWRITE option is used, the files in the target directory are first deleted, then the new files are moved in. Without the OVERWRITE keyword, the files are simply moved in alongside the existing ones, effectively appending to the table.

LOAD DATA LOCAL INPATH './data/sales/isales.csv' OVERWRITE INTO TABLE sales;

One final important note: in HDFS, the source data is MOVED to the new location, not copied, so it will no longer exist in its previous location after this operation runs. LOCAL data, on the other hand, will remain on the local disk after the load.
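The HDFS (non-local) append form looks like this; the staging path here is hypothetical:

```sql
-- Move files from an HDFS staging directory into the sales table's directory
-- (no OVERWRITE, so existing files are kept and the new ones are appended)
LOAD DATA INPATH '/user/hue/staging/sales' INTO TABLE sales;
```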

CREATE TABLE AS (CTAS)

The final common pattern is the CREATE TABLE AS statement. This is a very handy way to create a new table based on the output of a SELECT statement, without having to execute two operations. This is a common way to convert data stored in TEXTFILE format to a table stored in ORC format to help improve query performance.

To complete this method, a single query is executed. Table options can be specified in the CREATE clause. The table will be created with the same column names and data types as the output of the SELECT statement.

CREATE TABLE SalesByDate STORED AS ORC
AS
SELECT
CAST(DateKey / 100 AS INT) AS MonthKey
,SUM(SalesAmount) AS SalesAmount
,SUM(DiscountAmount) AS DiscountAmount
FROM sales
GROUP BY CAST(DateKey / 100 AS INT);
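Because CTAS infers the schema from the SELECT list, it can be worth confirming what Hive actually created:

```sql
-- Inspect the column names and types Hive derived for the new ORC table
DESCRIBE SalesByDate;
```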


Conclusion

In this post, we looked at four common methods of populating Hive tables with data. Each has its place in a Big Data solution workflow.
