Joshua Fennessy

The Anatomy of a Hive CREATE Table Statement

Creating tables in Hive is very much like creating tables in popular RDBMS like SQL Server. There are some key differences, however, that it is important to understand. In this short article, we’ll take a look at what makes up a Hive CREATE TABLE statement, and discuss some of the basic options.

First, let’s take a look at a generic CREATE TABLE statement in HIVE. Note that this sample create statement does not include all options for creating table, merely the most commonly used.

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] (
{col1},
{col2},
{…}
)
[ROW FORMAT] row_format
[STORED AS] file_format
[LOCATION] hdfs_path

The optional [EXTERNAL] keyword will create a non-managed  table in the Hive metastore.  By default, tables created in Hive are called managed tables. These tables are stored in the Hive warehouse location in HDFS (unless otherwise specified) and have data that is managed by Hive.  When a managed table is dropped, the data is dropped along with it.  By contrast, when an external table is dropped, the data remains in HDFS.  This is a great feature that can be used to create structure around data that you don’t need Hive to manage.

The [ROW FORMAT] option tells Hive how the data in the file described by the table will be stored.  The default storage format is DELIMITED, allowing for common options such as comma or tab separated files.  DELIMITED row formats can also be used to store MAP data or COLLECTION ITEMS. Alternatively, the [ROW FORMAT] option can be use with custom SERDE (serialize/deserialize) formats like JSON.  Typically these formats need to be installed from a custom JAR file.

The [STORED AS] option will direct Hive as to how to store the files. By default all data will be stored in TEXTFILE format.  Alternative formats include SEQUENCEFILE for storing compressed text data  and ORCFILE (Optimized Row Columnar) for column optimized data.

[LOCATION]  is an optional clause that points Hive to where the data will be, or is, stored. If left empty, the data will be stored in the default location; usually /user/hive/warehouse/[db.name]/[table.name].  This option is commonly used when defining EXTERNAL tables, since the data is usually already resting in HDFS. For CTAS tables, it is common to not specify a location and just let Hive manage and direct the data where to live.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: