Joshua Fennessy

5 Questions with David Klee

I recently had an opportunity to sit down with David Klee (blog | twitter | linkedin), SQL Server MVP, VMware vExpert, and all-around nice guy. David and his family are currently in the middle of a cross-country move, so I greatly appreciate the time he was able to spend on his responses.

Without further ado, here’s David:

Josh: What new technology are you most excited about right now?

David: Oh wow, where do I start? My biggest challenge is that I want to learn everything about everything. From a technology side, I’m most excited about the adoption of flash storage into the datacenter. Used right, it has the power to completely change the performance characteristics of any SQL Server. It can be used for I/O caching underneath your SQL Servers. Workload characteristics can be completely changed by utilizing it. Licensing costs can be reduced by using it wisely. Business problems can be solved with it. The impact to the infrastructure itself, from elevated CPU consumption, to storage interconnect utilization, to the challenges of adjusting the way SQL Server is *using* storage to make the system even more efficient and perform better, is amazing. I want to learn more about it!  (Yeah, I’m a serious geek, I know.)

Josh: We all have a million things to do. How do you organize your tasks?

David: I’m a huge proponent of lists. If I didn’t have my lists, I’d miss so many things that need to get done that I’d probably fall apart. I have three lists – to do today, to do this week, and to do soon. The lists are good old pen and paper. Every morning I re-write each list so that I can re-prioritize what’s most important, add things that have come up, and remove things that I got done yesterday. Re-writing it makes me think about it as I write it, and I can re-prioritize as I write. I then use the list of things to do today to help me manage my time for that day. Come tomorrow, rinse and repeat.

Josh: What new things are you learning?

David: I’m starting to work on learning some advanced statistics for high performance computing and data visualization techniques. Every IT platform and system, databases included, has a number of performance challenges. Measuring them is just the start of the process of getting the meaning out of the raw numbers, because most folks just live with “it’s slow” and have trouble putting objective metrics to subjective observations. Even if they have performance monitoring systems out there, many of the utilities today provide a watered-down view of the data that removes much of its meaning. Showing the correlation of system resource consumption and the visual impact on other systems (and not just saying “it’s slow”) is an area that is becoming ever more critical in today’s datacenter infrastructures, but one that is rarely handled in a meaningful manner. I’m working to fix this!

Josh: Out of all of the training events and/or conferences you’ve attended, which has left the most impact?

David: The PASS Summit conference in 2012 was life-changing, more so than any other technical or training event I’ve ever attended. I was fortunate enough to co-present a session called “Managing SQL Server in a Virtual World” with one of my mentors, the legendary Kevin Kline, to a packed room of over 500 people. I had never spoken at any conference before and was still a novice on the SQL Saturday speaking scene. Kevin had sat in on my very first SQL Saturday session ever in October of the previous year. After that, I started to get to know him through the various events where I bumped into him, and the email about the session changed everything. I gained my “sea legs”, so to speak, and the confidence that he had in me to do a good job in the session was enough to give me the confidence to say yes and to actually do it. The session came and went, I had an absolute blast, and people seemed to get a lot out of it!

Since then, the confidence gained from that session with Kevin, and the thrill (strangely enough instead of fear) of speaking in front of that number of people made me want to keep it up. I’ll never assume that I’ll get picked for any other event, but I’ve at least got the confidence to submit and know that I can present the content so that others can learn from my experiences. I’ll always be indebted to Kevin for giving me that first chance at the big leagues!

Josh: What does SQLFamily mean to you?

David: The SQLFamily means quite a bit to me, and it has had a very real impact on me both personally and professionally. The worldwide SQL Server community is arguably the most tight-knit and most warmly inviting group of technology enthusiasts in the world. Quite a while back, I started attending the SQL Server Users Group in Omaha, after having tried some other user groups. The other user groups were not especially cold, but I just did not feel welcomed, and the mindset was that while they wanted to share certain aspects of the technology, they just were not really passionate about it. To me, technology is not just a job. It’s a lifestyle. It’s my professional livelihood. It’s my primary hobby that I pursue nights and weekends.

At the first SQL Server users group, I spent two additional hours after the user group wrapped up talking shop with the people who stuck around. I got an email the next day asking me if I was going to be at the next one. At the next meeting, the camaraderie was even stronger and I was asked if I could go to lunch with a group just to hang out. It just grew from there.

To be a bit cliche, the thought that hit me was “I’m home.” Every time I go to a SQL Saturday or PASS Summit, it feels like a family reunion (and I mean that in a good way). This is a group of people who are as passionate about technology as I’ve always been, and have the same mindset of being the best technologists they can be. I’m very proud to be a part of it, and encourage everyone that I meet to become a part of the family.

I’ll put it this way… if it were not for the SQLFamily, I would not have started my own company. Period. This community has supported my dream of starting my own business in ways not possible with other groups. I have some of the best business and personal mentors on the planet from this group. I’ve gotten enough work from word of mouth from this community so that I can pay the bills, and am now scouting for technical people at the various events that I go to so that I know who I want to hire when the time is right. It’s been an amazing journey over the last couple of years, and the SQLFamily made it possible. I want to give back to this community as much as I’ve gotten from it.

Introduction to Hive Complex Data Types – Part 1: Array

In addition to supporting the standard scalar data types, Hive also supports three complex data types:

  • Arrays are collections of related items that are all the same scalar data type
  • Structs are object-like collections wherein each item is made up of multiple pieces of data, each with its own data type
  • Maps are key/value pair collections

This first of three related articles will introduce the Array data type that Hive offers.

Hive supports single-dimension arrays, which are specified during table creation. For the examples in this article, the following table definition will be used:

Defining Array data types and loading data

CREATE TABLE Products
(
id INT,
ProductName STRING,
ProductColorOptions ARRAY<STRING>
);

A simple INSERT statement can be used to get some sample data into the table. Note that the array() function is used to load complex data types into the table.

INSERT INTO TABLE default.products
SELECT 1, 'Widgets', array('Red', 'Blue', 'Green')
UNION ALL
SELECT 2, 'Cogs', array('Blue', 'Green', 'Yellow');

In a production environment it’s common to use portable formats like JSON to store array values. If TEXTFILE storage is used for the source data, the table can be defined with the COLLECTION ITEMS TERMINATED BY option to specify how the array values should be split, as sketched below.
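For example, a text-backed version of the Products table might be declared with something like the sketch below. This is only an illustration; the pipe delimiter, table name, and file location are assumptions rather than part of the original example.

CREATE EXTERNAL TABLE products_text
(
id INT,
ProductName STRING,
ProductColorOptions ARRAY<STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/hue/data/products_text';

With a definition like this, a source line such as 1,Widgets,Red|Blue|Green would be parsed into an id, a product name, and a three-element array.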

Querying tables with Arrays

A simple data query can be written that will return all values of the array in a single row:

SELECT id, productname, productcoloroptions FROM default.products


Specific positions can also be referenced when using an array data type to return a single value of the array:

SELECT id, productname, productcoloroptions[0] FROM default.products;


Finally, the entire array can be flattened using the LATERAL VIEW EXPLODE statement. This statement returns a single row for each array value, similar to how a CROSS JOIN works. The syntax for the query is below:

SELECT
p.id
,p.productname
,colors.colorselection
FROM default.products P
LATERAL VIEW EXPLODE(p.productcoloroptions) colors as colorselection;


Conclusion

Arrays are powerful data constructs that are often found in LOB applications. With Hive, the array does not need to be broken apart in the ETL/ELT process. Native support for arrays is a powerful feature that separates HiveQL from standard T-SQL implementations.

5 Questions with Jes Borland

One of the great things about the SQL Server community is the open communication we maintain on Twitter. Jes Borland, a friend across Lake Michigan, mentioned that she was looking to kick-start her blog again and wanted to maybe interview someone. I had been wanting to start an interview ‘pen-pal’ club as well, so we connected and made the plan.

Jes Borland (blog | twitter | linkedin) is a Microsoft SQL Server MVP, founder of the Tech-on-Tap Training Series, blogger at LessThanDot.com, and SQL Server Engineer at Concurrency. I recently asked Jes a few questions about work, life, and career. Here are her responses.

Josh: You know that moment when your brain freezes up and you just can’t continue?  How do you get past it?

Jes: I take a walk (or, if it’s really bad, go for a run). If I continue to sit at my desk and stew on it, I’ll end up going over to Twitter and wasting half an hour, or getting lost on some news site. That doesn’t help clear anything up, and then I feel as if I’ve wasted time. But, if I can get away from the computer, move, and think through what I’ve been doing, it helps – a lot!


Josh: We all have a million things to do. How do you organize your tasks?

Jes: I love lists. Love, love, love. But instead of making un-ending lists, I have two. Every day I make a “Three Things” list for the three things I need to do that day. Three items is manageable. Three things is progress. Three things are not overwhelming. Then I have my “For Later” list, which lists other things I’ll work on as time allows.

Josh: If you could change one thing about your office, what would it be?  What if you could change two things?

Jes: If I could change one thing…I’m on the lookout for a new desk. I want a lot of things out of it, though. I’d like it to be an adjustable sit-down/stand-up/walking desk. I’d also like it to have room for two laptop/dock setups (one for work, one for personal). Then, I need drawers. I hate having things sitting out on the desk where I can see them, unless I’m actively working on or with them, so drawers are necessary.

If I could change a second thing about my at-home office, I’d want it to be a separate room above the garage, with its own bathroom and a small kitchenette. After over three years of working from home, I’ve learned to ignore the distractions of a pile of dirty dishes, the book I didn’t finish reading last night, or starting dinner – most of the time. But not all! A separate area to work would be fantastic.


Josh: What new things are you learning?

Jes: In an effort to earn my MCSA in SQL Server, I’m hard at work on Administering Windows Server 2012. I have a background in Windows setup and administration, but there’s been a lot of new stuff that has emerged or matured in the last five years – particularly when it comes to doing tasks with PowerShell. So I’m spending time focusing on server roles, Active Directory, and PowerShell – it’s a ton of fun!


Josh: We all love SQL Saturday. Do you have a memorable moment from a particular event that stands out?

Jes: Gosh. There are so many. So many. I’m going to call out SQL Saturday Chicago 2011, which was the third I attended and the first I spoke at. My session was “Make Your Voice Heard!”, a session on using LinkedIn, Twitter, and forums to increase your SQL Server community presence and network. I was in the room next to Brent Ozar, and his filled up, so I got some overflow. There were a few MVPs and well-known speakers who drifted over. But there were also a few people new to the community who had never blogged or spoken. Over the last four years, I’ve watched people from that session grow into active bloggers and speakers; I keep up with them on a regular basis, and I couldn’t be happier. That session meant so much to me.


Jes, thank you for taking some time out of your schedule to share with me and the rest of the community. I do hope you find your ‘perfect desk’ soon, and GOOD LUCK ON THE MCSA!

p.s.  Jes also interviewed me.  You can read those responses here.

Configure Apache Hive to Recursively Search Directories for Files

It is common, such as when using Flume to collect log data, for files to end up inside subdirectories in HDFS.

By default, Hive will only look for files in the root of the specified directory, but with a couple of tweaks, it can be configured to look recursively through subdirectories.

Consider the following file structure in HDFS:


root hdfs     133205 2015-06-30 02:14 /test/000000_0
root hdfs          0 2015-06-30 02:14 /test/test_child_directory
root hdfs     133205 2015-06-30 02:14 /test/test_child_directory/000000_0

Running this query:

SELECT COUNT(*) FROM test_table;

Provides this result:

No recursion: 2541 rows

By default, an external Hive table created using the directory /test as the table location contains 2541 rows: the total number of rows found in a single instance of the file 000000_0. This proves that Hive does not traverse into subdirectories to read files.
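For reference, the external table in this example could have been declared over /test with something like the minimal sketch below. The actual column layout isn’t shown in this post, so a single string column and a tab delimiter are assumed here.

CREATE EXTERNAL TABLE test_table
(
line STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/test';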

The Solution

With the addition of a couple of property assignment statements, the default behavior of Hive can be modified to allow recursion.

Consider the following query:

SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;

SELECT COUNT(*) FROM test_table;

Recursion enabled: 5082 rows

Note that the only change to the previous query is the addition of the two property statements that instruct Hive to traverse subdirectories.

The number of rows returned has doubled, as expected. In this query, Hive successfully traversed into the subdirectory and read the second instance of the file.

Adding these properties to each query that should use recursion is a workable solution, but it’s not the only one.

If you find yourself wanting to use recursion by default in your Hive installation, you can add these two properties to your hive-site.xml file. Using Ambari, you would want to enter these properties as new keys in the Custom hive-site.xml section.

Recursion properties in Ambari

Hive Patterns — 4 Common Ways to Load Data

It’s often difficult to remember that Hive isn’t a relational database platform. For those of us coming from RDBMS-land, Hive almost feels TOO familiar.

When we create a table in Hive, the data doesn’t live *in* the table; it remains stored in HDFS as text files, ORC files, or whatever other format was specified when the table was created. Hive tables are really nothing more than metadata instructing the processing engine how to parse the files and what order to expect the columns in.

So, the data doesn’t *live* in the tables, but there has to be a way to get it there, right? Indeed! There are 4 common ways.

Table creation over resting data

Sometimes, you might get lucky enough that your big data is already well formatted — maybe it’s a set of comma or tab separated files from a data warehouse extract.

If that’s the case, then ‘loading’ your Hive table can be as simple as specifying the file location at the time the table is created. This method is often used with the EXTERNAL keyword, to ensure that the data remains in place if the table were to be dropped.

Imagine, if you will, a well-formatted comma-separated file stored in HDFS at the path /user/hue/data/sales.

To create the table, run the following DDL command:

CREATE EXTERNAL TABLE sales
(
ProductKey INT
,DateKey INT
,SalesAmount DOUBLE
,TaxAmt DOUBLE
,DiscountAmount DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE
LOCATION '/user/hue/data/sales';

When creating this table, simply specify the location of the directory where the data is stored. When new data is copied into that directory, it will automatically be accessible by Hive.

INSERT INTO

In some cases during an ELT workflow, you’ll find yourself needing to insert data into a table that already exists. Consider the table [SalesByProduct], which contains the following columns: ProductKey, DateKey, and SalesAmount.

Part of our ELT workflow could be to run a query that aggregates the latest raw sales transactions and inserts the results into this aggregate table. To do this, a simple HQL query can be run to insert data into the already existing table.

INSERT INTO TABLE SalesByProduct
SELECT
ProductKey
,DateKey
,SUM(SalesAmount) AS SalesAmount
FROM
sales
GROUP BY ProductKey, DateKey

Alternatively, the OVERWRITE keyword can be used (in place of the INTO keyword) to replace ALL of the data in the table with the new data produced by the SELECT statement.

This query will first run the job to collect all of the data, then insert it into the target table. Note that it is very important that the output of the SELECT statement match the structure of the target table, or the job may fail. An alternative but similar approach is to use a CTE (Common Table Expression), which I will explore in a future article.
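Returning to the OVERWRITE option for a moment, a sketch of the overwrite variant of the query above might look like this; the existing contents of SalesByProduct would be replaced rather than appended to.

INSERT OVERWRITE TABLE SalesByProduct
SELECT
ProductKey
,DateKey
,SUM(SalesAmount) AS SalesAmount
FROM
sales
GROUP BY ProductKey, DateKey;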

LOAD DATA INPATH

Commonly, data lands in HDFS in a staging directory. Sometimes that data is already structured, but the table defined over that structure lives in a different directory than the staging area. In this case, the data can be loaded into the table simply by running a command to MOVE it from one location to another. While this move could happen within HDFS, it is sometimes a good idea to let Hive do it, especially with Hive-managed tables.

This is also a common scenario when the data to be loaded to a table is stored on the local disk (not HDFS). This statement supports moving data from LOCAL disk or HDFS.

The LOAD DATA command also supports appends and overwrites. If the OVERWRITE option is used, the files in the target directory are first deleted, then the new files are moved in. Without the OVERWRITE keyword, the files are just moved in — effectively appending to the existing table.

LOAD DATA LOCAL INPATH './data/sales/isales.csv' OVERWRITE INTO TABLE sales;

One final important note is that, in HDFS, the source data is MOVED to the new location, not copied. The source data will no longer exist in its previous location after running this operation. For LOCAL data, it will remain on the local disk after the move.
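For comparison with the LOCAL example above, an HDFS-to-HDFS load might look like the line below; the staging path is a placeholder, and without the OVERWRITE keyword the file is appended to the table’s existing data.

LOAD DATA INPATH '/user/hue/staging/sales/isales.csv' INTO TABLE sales;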

CREATE TABLE AS (CTAS)

The final common pattern is the CREATE TABLE AS statement. This is a very handy way to create a new table based on the output of a SELECT statement, without having to execute two operations. This is a common way to convert data stored in TEXTFILE format to a table stored in ORC format to help improve query performance.

To complete this method, a single query can be executed. Table options can be specified in the CREATE clause. The table will be created with the same column names and data types as the output of the SELECT statement.

CREATE TABLE SalesByDate STORED AS ORC
AS
SELECT
CAST(DateKey / 100 AS INT) AS MonthKey
,SUM(SalesAmount) AS SalesAmount
,SUM(DiscountAmount) AS DiscountAmount
FROM sales
GROUP BY CAST(DateKey / 100 AS INT)


Conclusion

In this post, 4 common methods of populating Hive tables with data were shown. Each method has a use case in a Big Data solution workflow.

The Anatomy of a Hive CREATE Table Statement

Creating tables in Hive is very much like creating tables in a popular RDBMS like SQL Server. There are some key differences, however, that are important to understand. In this short article, we’ll take a look at what makes up a Hive CREATE TABLE statement and discuss some of the basic options.

First, let’s take a look at a generic CREATE TABLE statement in Hive. Note that this sample create statement does not include all options for creating a table, merely the most commonly used.

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
(
{col1},
{col2},
{…}
)
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]

The optional [EXTERNAL] keyword will create a non-managed  table in the Hive metastore.  By default, tables created in Hive are called managed tables. These tables are stored in the Hive warehouse location in HDFS (unless otherwise specified) and have data that is managed by Hive.  When a managed table is dropped, the data is dropped along with it.  By contrast, when an external table is dropped, the data remains in HDFS.  This is a great feature that can be used to create structure around data that you don’t need Hive to manage.

The [ROW FORMAT] option tells Hive how the data in the file described by the table is stored.  The default storage format is DELIMITED, allowing for common options such as comma or tab separated files.  DELIMITED row formats can also be used to store MAP data or COLLECTION ITEMS. Alternatively, the [ROW FORMAT] option can be used with custom SERDE (serialize/deserialize) formats like JSON.  Typically these formats need to be installed from a custom JAR file.

The [STORED AS] option will direct Hive as to how to store the files. By default all data will be stored in TEXTFILE format.  Alternative formats include SEQUENCEFILE for storing compressed text data  and ORCFILE (Optimized Row Columnar) for column optimized data.

[LOCATION] is an optional clause that points Hive to where the data will be, or already is, stored. If omitted, the data will be stored in the default location, usually /user/hive/warehouse/[db.name]/[table.name].  This option is commonly used when defining EXTERNAL tables, since the data is usually already resting in HDFS. For CTAS tables, it is common to not specify a location and just let Hive decide where the data should live.
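Putting those clauses together, a complete statement might look like the sketch below. The table name, columns, and path are placeholders for illustration only.

CREATE EXTERNAL TABLE IF NOT EXISTS web_logs
(
log_date STRING,
url STRING,
response_code INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/hue/data/web_logs';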

Apache Hive: The Keystone of many Big Data Solutions

It’s hard to avoid terms like unstructured data in the growing world of Hadoop. In many ways, all of the hype around unstructured data is very true. The Hadoop ecosystem is VERY good at ingesting data in any form. An analyst can store absolutely anything that can be translated into 1’s and 0’s in the HDFS file system. Doing something interesting with that information, however, is another matter.

Not defining structure makes the process of loading data so much easier! Analyzing data without structure is not an easy task, however. There are many tools that work with the Hadoop ecosystem that allow an analyst to apply structure to that data and reveal insights from pure information. None of those tools is as prolific as Apache Hive. Hive is the one single tool that nearly EVERY Hadoop project uses in some form. In this article, we’ll investigate why:

Hive uses a common query language

At its core, Apache Hive is a SQL-on-Hadoop project. It is based on the ANSI SQL-92 standard — although not fully compliant — and includes several extensions to the standard, allowing analysts to use a common language shared across many different data platforms. An analyst well versed in SQL programming will feel right at home interacting with data stored in HDFS using Apache Hive.

Hive also supports ACID transactions. So, while Hive is happily presenting data in any format to you, it’s also making sure that the data you’re processing is valid and safe from other analysts who might be interacting with the same files.
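As a rough sketch, a transactional table in the Hive versions of this era had to be bucketed and stored as ORC, with the transactional table property set. The table below is a made-up example, and additional server-side settings (such as the transaction manager and concurrency support) are also required before ACID operations work.

CREATE TABLE inventory
(
item_id INT,
quantity INT
)
CLUSTERED BY (item_id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');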

Many common applications can already connect to Apache Hive

Because HiveQL is based on SQL, there are many tools that can already speak to it. Tools like Excel, PowerPivot, Tableau, and MicroStrategy already know how to speak SQL and therefore can talk, in some fashion, to Hive. Many of the major big data vendors have ODBC drivers available that allow these applications to connect to a Hadoop cluster through Apache Hive. This gives analysts accustomed to working in their favorite visualization tool a great advantage. Many times, there is nothing new to learn to import data into a powerful analytic tool like the Power BI suite and begin delivering self-service BI applications directly on top of Hadoop and HDFS.

Reduced user training is just another great benefit that Hive provides to solutions built using Hadoop.

Hive adds a much needed security layer to Hadoop

One of the major obstacles facing many enterprise-level Hadoop projects is the lack of easily administrable user-level security options within Hadoop applications and HDFS. The latest versions of Apache Hive support granular object-level security. Administrators can grant access on any specific table or view, from none, to read-only, to read-write. Using the security layer in Apache Hive can reduce the complexity of securing HDFS directly.
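As a sketch of what that looks like in practice (the role and user names below are made up for illustration), access is granted with plain GRANT statements run in the database that holds the table:

CREATE ROLE reporting_team;
GRANT SELECT ON TABLE sales TO USER analyst1;
GRANT SELECT ON TABLE sales TO ROLE reporting_team;
GRANT INSERT ON TABLE sales TO ROLE reporting_team;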

Hive supports more than just rows and columns

Hive is more than just a SQL engine sitting on top of Hadoop: it is an application that is designed to consume data in many different forms using an industry-standard language. Hive is great at consuming data that is stored in comma separated files. It’s also great at parsing JSON files (a common text-based data interchange format), allowing analysts to easily query data in those files without needing to write their own parsers. Is your data stored in cyclic or acyclic graphs? Hive can consume that too: it knows how to follow the links and provide the answers that analysts seek.

Is your data in a non-standard format that a developer created 20 years ago? Hive can consume it…with some extra work on your part. Hive also supports custom SERDE (SERializer/DEserializer) formats that can be built to work with the needs of your data format.
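As a very rough sketch of the pattern, a custom SerDe is typically added to the session and then referenced in the table definition. The JAR path, SerDe class name, columns, and location below are all hypothetical placeholders rather than a real library.

ADD JAR /tmp/legacy-format-serde.jar;

CREATE EXTERNAL TABLE legacy_events
(
event_time STRING,
payload STRING
)
ROW FORMAT SERDE 'com.example.LegacyFormatSerDe'
LOCATION '/data/legacy_events';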

Hive really can consume nearly any data that you need to analyze.

Paired with HCatalog, Hive provides a cross-application data model sharing platform

HCatalog is a metadata management tool that exposes Hive table definitions to other applications, such as Pig. This allows Pig developers to use and interact with Hive tables without the need to request data definitions or recreate data models in Pig Latin.

This is a very important feature for enterprise deployments that involve a variety of the applications that are available in the Hadoop ecosystem.

Hive provides options for higher performing operations that avoid MapReduce

Although Hive was originally built on MapReduce, it is not exclusive to that processing framework. Hortonworks, in partnership with Microsoft, released Tez in 2014, and Hive can run on it as an alternative execution engine. In addition to Tez, Microsoft engineers contributed to the performance improvement goals by adding vectorization features.
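On distributions that ship Tez, switching the execution engine for a session is typically just a property change (the default engine may vary by distribution):

SET hive.execution.engine=tez;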

Upcoming versions of Hive will support Apache Spark — a new(ish) distributed, in-memory processing framework that boasts greatly improved performance over legacy MapReduce.

Hive is fun!

Well…it is! Writing SQL (and HiveQL) is both challenging and rewarding. Hive is especially so: when a particularly complex query written over data that may or may not be ‘rectangular’ — that is, arranged in rows and columns — just works, you get a great sense of accomplishment.

Let’s face it; none of us want to write MapReduce anymore. Hive is one of the many tools in our Big Data toolbox that helps to make it fun!
