Joshua Fennessy

Apache Hive: The Keystone of many Big Data Solutions

It’s hard to avoid terms like unstructured data in the growing world of Hadoop. In many ways, the hype around unstructured data is justified: the Hadoop ecosystem is VERY good at ingesting data in any form. An analyst can store absolutely anything that can be translated into 1s and 0s in the HDFS file system. Doing something interesting with that information, however, is another matter.

Not defining structure makes the process of loading data so much easier! Analyzing data without structure, however, is not an easy task. There are many tools in the Hadoop ecosystem that allow an analyst to apply structure to that data and reveal insights from raw information. None of those tools is as prolific as Apache Hive. Hive is the one tool that nearly EVERY Hadoop project uses in some form. In this article, we’ll investigate why:

Hive uses a common query language

At its core, Apache Hive is a SQL-on-Hadoop project. Its query language, HiveQL, is based on the ANSI SQL-92 standard (although it is not fully compliant) and includes several extensions, so analysts can use a language already common to many different data platforms. An analyst well versed in SQL will feel right at home interacting with data stored in HDFS through Apache Hive.
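To make that concrete, here is a minimal sketch (the web_logs table and its columns are hypothetical): a plain SQL-92-style aggregation that runs unchanged in HiveQL, followed by one of Hive’s extensions, LATERAL VIEW with explode(), which flattens an array column into one row per element.

    -- Standard SQL aggregation runs as-is in HiveQL
    SELECT country, COUNT(*) AS visits
    FROM web_logs
    GROUP BY country
    HAVING COUNT(*) > 100;

    -- A HiveQL extension: explode an array column into one row per element
    SELECT page, tag
    FROM web_logs
    LATERAL VIEW explode(tags) t AS tag;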

Hive also supports ACID transactions. So, while Hive is happily presenting data in any format to you, it’s also making sure that the data you’re processing is valid and protected from concurrent changes by other analysts who might be working with the same files.
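As a sketch, here is what a transactional table looks like, assuming Hive 0.14 or later with the transaction manager configured; in that release, ACID tables must be bucketed and stored as ORC (the orders table is hypothetical):

    CREATE TABLE orders (
      order_id INT,
      status   STRING
    )
    CLUSTERED BY (order_id) INTO 8 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');

    -- Row-level changes become possible on transactional tables
    UPDATE orders SET status = 'shipped' WHERE order_id = 42;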

Many common applications can already connect to Apache Hive

Because HiveQL is based on SQL, there are many tools that can already speak to it. Tools like Excel, PowerPivot, Tableau, and MicroStrategy already know how to speak SQL and can therefore talk, in some fashion, to Hive. Many of the major big data vendors offer ODBC drivers that allow these applications to connect to a Hadoop cluster through Apache Hive. This gives analysts accustomed to working in their favorite visualization tool a great advantage. Often there is nothing new to learn to import data into a powerful analytic tool like the Power BI suite and begin delivering self-service BI applications directly on top of Hadoop and HDFS.
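Under the covers, these drivers typically talk to the HiveServer2 service. As a sketch, a JDBC-based client would point at a URL of this form, assuming the default HiveServer2 port of 10000 (the hostname is a placeholder):

    jdbc:hive2://hadoop-cluster:10000/default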

Reduced user training is just another great benefit that Hive provides to solutions built using Hadoop.

Hive adds a much-needed security layer to Hadoop

One of the major obstacles facing many enterprise-level Hadoop projects is the lack of easily administrable user-level security options within Hadoop applications and HDFS. The latest version of Apache Hive supports granular, object-level security. Administrators can grant access on any specific table or view, ranging from no access to read-only to read-write. Using the security layer in Apache Hive can reduce the complexity of securing HDFS directly.
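A sketch of what that administration looks like, assuming SQL standard-based authorization is enabled (the sales table and the user and role names are hypothetical):

    -- Read-only access for a single user
    GRANT SELECT ON TABLE sales TO USER maria;

    -- Read-write access managed through a role
    CREATE ROLE analysts;
    GRANT SELECT, INSERT, UPDATE, DELETE ON TABLE sales TO ROLE analysts;
    GRANT ROLE analysts TO USER jacob;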

Hive supports more than just rows and columns

Hive is more than just a SQL engine sitting on top of Hadoop: it is an application designed to consume data in many different forms using an industry-standard language. Hive is great at consuming data stored in comma-separated files. It’s also great at parsing JSON files (a common semi-structured text format), allowing analysts to easily query data in those files without having to write their own parsing code. Is your data stored in cyclic or acyclic graphs? Hive can consume that too: it knows how to follow the links and provide the answers that analysts seek.
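Two sketches of JSON handling: one uses the built-in get_json_object function over a table of raw JSON strings, and one maps documents to columns with a JSON SerDe (the events_raw table, its json column, and the field names are hypothetical):

    -- Pull individual fields out of a raw JSON string column
    SELECT get_json_object(json, '$.user_id') AS user_id,
           get_json_object(json, '$.event')   AS event
    FROM events_raw;

    -- Or declare the structure once and query it like any other table
    CREATE TABLE events (user_id STRING, event STRING)
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';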

Is your data in a non-standard format that a developer created 20 years ago? Hive can consume it…with some extra work on your part. Hive also supports custom SerDes (SERializer/DEserializers) that can be built to match the needs of your data format.
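Wiring one in is straightforward; in this sketch, the JAR path and the com.example.LegacyFormatSerDe class are hypothetical stand-ins for whatever you build:

    -- Register the custom SerDe, then reference it in the table definition
    ADD JAR /tmp/legacy-format-serde.jar;

    CREATE TABLE legacy_data (id INT, payload STRING)
    ROW FORMAT SERDE 'com.example.LegacyFormatSerDe';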

Hive really can consume nearly any data that you need to analyze.

Paired with HCatalog, Hive provides a cross-application data model sharing platform

HCatalog is a metadata management tool that exposes Hive table definitions to other applications, such as Pig. This allows Pig developers to use and interact with Hive tables without needing to request data definitions or recreate data models in Pig Latin.
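A sketch of what that looks like from the Pig side, assuming a Hive table named web_logs already exists and the HCatalog libraries are on Pig’s classpath (the table and field names are hypothetical):

    -- Load a Hive-defined table directly; no schema declaration needed
    A = LOAD 'web_logs' USING org.apache.hive.hcatalog.pig.HCatLoader();
    B = FILTER A BY country == 'US';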

This is a very important feature for enterprise deployments that involve a variety of the applications available in the Hadoop ecosystem.

Hive provides options for higher-performing operations that avoid MapReduce

Although Hive was originally built on MapReduce, it is not exclusive to that processing framework. Hortonworks, in partnership with Microsoft, released Apache Tez in 2014, and Hive can use it as an alternative execution engine. In addition to Tez, Microsoft engineers contributed to the performance-improvement goals by adding vectorized query execution, which processes rows in batches rather than one at a time.
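Both options are exposed as session-level settings; a minimal sketch, with exact availability depending on your Hive version and distribution:

    -- Run queries on Tez instead of classic MapReduce
    SET hive.execution.engine=tez;

    -- Process rows in batches (requires ORC-backed tables in early versions)
    SET hive.vectorized.execution.enabled=true;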

Upcoming versions of Hive will also support Apache Spark, a new(ish) in-memory processing framework that boasts greatly improved performance over legacy MapReduce.

Hive is fun!

Well…it is! Writing SQL (and HiveQL) is both challenging and rewarding. Hive is especially so: when a particularly complex query written over data that may or may not be ‘rectangular’ (that is, arranged in rows and columns) just works, you get a great sense of accomplishment.

Let’s face it: none of us wants to write MapReduce anymore. Hive is one of the many tools in our Big Data toolbox that helps make it fun!
