Joshua Fennessy

5 Questions with Jes Borland

One of the great things about the SQL Server community is the open communication we maintain on Twitter. Jes Borland, a friend across Lake Michigan, mentioned that she was looking to kick-start her blog again and wanted to maybe interview someone. I had been wanting to start an interview ‘pen-pal’ club as well, so we connected and made the plan.

Jes Borland (blog | twitter | linkedin) is a Microsoft SQL Server MVP, founder of the Tech-on-Tap Training Series, blogger at LessThanDot.com, and SQL Server Engineer at Concurrency. I recently asked Jes a few questions about work, life, and career. Here are her responses.

Jes Borland
Josh: You know that moment when your brain freezes up and you just can’t continue?  How do you get past it?

Jes: I take a walk (or, if it’s really bad, go for a run). If I continue to sit at my desk and stew on it, I’ll end up going over to Twitter and wasting half an hour, or getting lost on some news site. That doesn’t help clear anything up, and then I feel as if I’ve wasted time. But, if I can get away from the computer, move, and think through what I’ve been doing, it helps – a lot!


Josh: We all have a million things to do. How do you organize your tasks?

Jes: I love lists. Love, love, love. But instead of making un-ending lists, I have two. Every day I make a “Three Things” list for the three things I need to do that day. Three items is manageable. Three things is progress. Three things are not overwhelming. Then I have my “For Later” list, which lists other things I’ll work on as time allows.

Josh: If you could change one thing about your office, what would it be?  What if you could change two things?
Jes: If I could change one thing…I’m on the lookout for a new desk. I want a lot of things out of it, though. I’d like it to be an adjustable sit-down/stand-up/walking desk. I’d also like it to have room for two laptop/dock setups (one for work, one for personal). Then, I need drawers. I hate having things sitting out on the desk where I can see them, unless I’m actively working on or with them, so drawers are necessary.

If I could change a second thing about my at-home office, I’d want it to be a separate room above the garage, with its own bathroom and a small kitchenette. After over three years of working from home, I’ve learned to ignore the distractions of a pile of dirty dishes, the book I didn’t finish reading last night, or starting dinner – most of the time. But not all! A separate area to work would be fantastic.


Josh: What new things are you learning?

Jes: In an effort to earn my MCSA in SQL Server, I’m hard at work on Administering Windows Server 2012. I have a background in Windows setup and administration, but there’s been a lot of new stuff that has emerged or matured in the last five years – particularly when it comes to doing tasks with PowerShell. So I’m spending time focusing on server roles, Active Directory, and PowerShell – it’s a ton of fun!


Josh: We all love SQL Saturday. Do you have a memorable moment from a particular event that stands out?

Jes: Gosh. There are so many. So many. I’m going to call out SQL Saturday Chicago 2011, which was the third I attended and the first I spoke at. My session was, “Make Your Voice Heard!”, a session on using LinkedIn, Twitter, and forums to increase your SQL Server community presence and network. I was in the room next to Brent Ozar, and his filled up, so I got some overflow. There were a few MVPs and well-known speakers that drifted over. But there were also a few people new to the community who had never blogged or spoken. Over the last four years, I’ve watched people from that session grow into active bloggers and speakers, and I keep up with them on a regular basis and couldn’t be happier. That session meant so much to me.


Jes, thank you for taking some time out of your schedule to share with me, and the rest of the community. I do hope you find your ‘perfect desk’ soon, and GOOD LUCK ON THE MCSA!

p.s.  Jes also interviewed me.  You can read those responses here.

Configure Apache Hive to Recursively Search Directories for Files

It is common, when using Flume to collect log data for example, for files to end up inside subdirectories in HDFS.

By default, Hive will only look for files in the root of the specified directory, but with a couple of tweaks, it can be configured to look recursively through subdirectories.

Consider the following file structure in HDFS:


root hdfs     133205 2015-06-30 02:14 /test/000000_0
root hdfs          0 2015-06-30 02:14 /test/test_child_directory
root hdfs     133205 2015-06-30 02:14 /test/test_child_directory/000000_0

Running this query:

SELECT COUNT(*) FROM test_table;

provides this result: 2541 rows (no recursion).

By default, an external Hive table created with the directory /test as its location will contain 2541 rows: the total number of rows found in a single instance of the file 000000_0. This shows that Hive does not traverse into subdirectories to read files.
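For reference, the external table used in this test might be defined something like the following. This is a minimal sketch; the column list is hypothetical, since only the TEXTFILE format and the /test location matter for the recursion behavior.

CREATE EXTERNAL TABLE test_table
(
log_line STRING  -- hypothetical column; the real schema isn't shown above
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/test';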

The Solution

With the addition of a couple of property assignments, the default behavior of Hive can be modified to allow recursion.

Consider the following query:

SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;

SELECT COUNT(*) FROM test_table;
Recursion enabled: 5082 rows
Note that the only change to the previous query is the addition of the two property statements that instruct Hive to traverse subdirectories.

The number of rows returned has doubled, as expected: Hive successfully traversed into the subdirectory and read the second instance of the file.

Adding these properties to each query that should use recursion is a perfectly good solution, but it’s not the only one.

If you find yourself wanting to use recursion by default in your Hive installation, you can add these two properties to your hive-site.xml file. Using Ambari, you would want to enter these properties as new keys in the Custom hive-site.xml section.

Recursion properties in Ambari
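If you’re managing hive-site.xml by hand rather than through Ambari, the entries would look something like this (the property names are the same ones used in the SET statements above):

<!-- Tell the input format to descend into subdirectories -->
<property>
  <name>mapred.input.dir.recursive</name>
  <value>true</value>
</property>
<!-- Tell Hive it is safe to build tables over nested directories -->
<property>
  <name>hive.mapred.supports.subdirectories</name>
  <value>true</value>
</property>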

Hive Patterns — 4 Common Ways to Load Data

It’s often difficult to remember that Hive isn’t a relational database platform. For those of us coming from RDBMS-land, Hive almost feels TOO familiar.

When we create a table in Hive, the data doesn’t live *in* the table; it remains stored in HDFS as text files, ORC files, or another format specified when the table was created. Hive tables are really nothing more than metadata instructing the processing engine how to parse the files and what order to expect the columns in.

So, the data doesn’t *live* in the tables, but there has to be a way to get it there, right? Indeed! There are 4 common ways.

Table creation over resting data

Sometimes, you might get lucky enough that your big data is already well formatted — maybe it’s a set of comma or tab separated files from a data warehouse extract.

If that’s the case, then ‘loading’ your Hive table can be as simple as specifying the file location at the time the table is created. This method is often used with the EXTERNAL keyword to ensure that the data remains in place if the table is ever dropped.

Imagine, if you will, a well-formatted comma-separated file stored in HDFS at the path /user/hue/data/sales.

To create the table, run the following DDL command:

CREATE EXTERNAL TABLE sales
(
ProductKey INT
,DateKey INT
,SalesAmount DOUBLE
,TaxAmt DOUBLE
,DiscountAmount DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE
LOCATION '/user/hue/data/sales';

When creating this table, simply specify the location of the directory where the data is stored. When new data is copied into that directory, it will automatically be accessible to Hive.

INSERT INTO

In some cases during an ELT workflow, you’ll find yourself needing to insert data into a table that already exists. Consider the table [SalesByProduct], which contains the following columns: ProductKey, DateKey, and SalesAmount.

Part of our ELT workflow could be a query that aggregates the latest raw sales transactions and inserts the results into this aggregate table. To do this, a simple HQL query can be run to insert data into the existing table.

INSERT INTO TABLE SalesByProduct
SELECT
ProductKey
,DateKey
,SUM(SalesAmount) AS SalesAmount
FROM
sales
GROUP BY ProductKey, DateKey

Alternatively, the OVERWRITE keyword can be used (in place of the INTO keyword) to replace ALL of the data in the table with the new data specified in the SELECT statement.

This query will first run the job to collect all of the data, then insert it into the target table. Note that it is very important that the output of the SELECT statement matches the structure of the target table, or the job may fail. An alternative, similar approach is to use a CTE (Common Table Expression), which I will explore in a future article.
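For reference, here is a minimal sketch of the OVERWRITE variant of the same load, which replaces everything currently in [SalesByProduct]:

INSERT OVERWRITE TABLE SalesByProduct
SELECT
ProductKey
,DateKey
,SUM(SalesAmount) AS SalesAmount
FROM
sales
GROUP BY ProductKey, DateKey;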

LOAD DATA INPATH

Commonly, data that lands in HDFS does so in a staging directory. Sometimes that data is already structured, but the table defined over that structure lives in a different directory than the staging area. In this case, the data can be loaded into the table simply by running a command to MOVE it from one location to another. While this move could happen within HDFS, it is sometimes a good idea to let Hive do it, especially with Hive-managed tables.

This is also a common scenario when the data to be loaded to a table is stored on the local disk (not HDFS). This statement supports moving data from LOCAL disk or HDFS.

The LOAD DATA command also supports appends and overwrites. If the OVERWRITE option is used, the files in the target directory are first deleted, then the new files are moved in. Without the OVERWRITE keyword, the files are simply moved in, effectively appending to the existing table.

LOAD DATA LOCAL INPATH './data/sales/isales.csv' OVERWRITE INTO TABLE sales;

One final important note is that, in HDFS, the source data is MOVED to the new location, not copied. The source data will no longer exist in its previous location after running this operation. For LOCAL data, it will remain on the local disk after the move.
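For completeness, here is a sketch of the HDFS-to-HDFS variant described above, appending staged files into the same table (the staging path is hypothetical):

-- No LOCAL keyword: the files are moved within HDFS, not copied from local disk
LOAD DATA INPATH '/user/hue/staging/sales' INTO TABLE sales;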

CREATE TABLE AS (CTAS)

The final common pattern is the CREATE TABLE AS statement. It’s a very handy way to create a new table from the output of a SELECT statement without having to execute two separate operations, and a common way to convert data stored in TEXTFILE format to a table stored in ORC format to help improve query performance.

To complete this method, a single query can be executed. Table options can be specified in the CREATE clause. The table will be created with the same column names and data types as the output of the SELECT statement.

CREATE TABLE SalesByDate STORED AS ORC
AS
SELECT
CAST(DateKey / 100 AS INT) AS MonthKey
,SUM(SalesAmount) AS SalesAmount
,SUM(DiscountAmount) AS DiscountAmount
FROM sales
GROUP BY CAST(DateKey / 100 AS INT)


Conclusion

In this post, 4 common methods of populating Hive tables with data were shown. Each method has a use case in a Big Data solution workflow.

The Anatomy of a Hive CREATE Table Statement

Creating tables in Hive is very much like creating tables in a popular RDBMS like SQL Server. There are some key differences, however, that are important to understand. In this short article, we’ll take a look at what makes up a Hive CREATE TABLE statement and discuss some of the basic options.

First, let’s take a look at a generic CREATE TABLE statement in Hive. Note that this sample CREATE statement does not include all of the options for creating a table, merely the most commonly used ones.

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name (
{col1},
{col2},
{…}
)
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]

The optional [EXTERNAL] keyword will create a non-managed  table in the Hive metastore.  By default, tables created in Hive are called managed tables. These tables are stored in the Hive warehouse location in HDFS (unless otherwise specified) and have data that is managed by Hive.  When a managed table is dropped, the data is dropped along with it.  By contrast, when an external table is dropped, the data remains in HDFS.  This is a great feature that can be used to create structure around data that you don’t need Hive to manage.

The [ROW FORMAT] option tells Hive how the data in the file described by the table will be stored.  The default storage format is DELIMITED, allowing for common options such as comma or tab separated files.  DELIMITED row formats can also be used to store MAP data or COLLECTION ITEMS. Alternatively, the [ROW FORMAT] option can be used with custom SERDE (serialize/deserialize) formats like JSON.  Typically these formats need to be installed from a custom JAR file.

The [STORED AS] option will direct Hive as to how to store the files. By default all data will be stored in TEXTFILE format.  Alternative formats include SEQUENCEFILE for storing compressed text data  and ORCFILE (Optimized Row Columnar) for column optimized data.

[LOCATION] is an optional clause that points Hive to where the data will be, or is, stored. If left empty, the data will be stored in the default location, usually /user/hive/warehouse/[db.name]/[table.name].  This option is commonly used when defining EXTERNAL tables, since the data is usually already resting in HDFS. For CTAS tables, it is common to not specify a location and just let Hive decide where the data should live.
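Putting those clauses together, a minimal sketch of a managed, delimited table might look like the following (the table and column names are just examples):

-- No EXTERNAL keyword and no LOCATION: Hive manages the data under its
-- warehouse directory, and dropping the table also deletes the files.
CREATE TABLE IF NOT EXISTS web_logs
(
log_date STRING
,url STRING
,bytes_sent INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;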

Apache Hive: The Keystone of many Big Data Solutions

It’s hard to avoid terms like unstructured data in the growing world of Hadoop. In many ways, all of the hype around unstructured data is very true. The Hadoop ecosystem is VERY good at ingesting data in any form. An analyst can store absolutely anything that can be translated into 1’s and 0’s in the HDFS file system. Doing something interesting with that information, however, is another matter.

Not defining structure makes the process of loading data so much easier! Analyzing data without structure is not an easy task, however. There are many tools that work with the Hadoop ecosystem that allow an analyst to apply structure to that data and reveal insights from pure information. None of those tools is as prolific as Apache Hive. Hive is the one single tool that nearly EVERY Hadoop project uses in some form. In this article, we’ll investigate why:

Hive uses a common query language

At its core, Apache Hive is a SQL on Hadoop project. It is based on the ANSI SQL-92 standard — although not fully compliant — and includes several extensions to the standard, letting analysts work in a language common to many different data platforms. An analyst well versed in SQL programming will feel right at home interacting with data stored in HDFS using Apache Hive.

Hive also supports ACID transactions. So, while Hive is happily presenting data in any format to you, it’s also making sure that the data you’re processing is valid and safe from other analysts who might be interacting with the same files.

Many common applications can already connect to Apache Hive

Because HiveQL is based on SQL, there are many tools that can already speak to it. Tools like Excel, PowerPivot, Tableau, and MicroStrategy already know how to speak SQL and therefore can talk in some fashion to Hive. Many of the major big data vendors have ODBC drivers available that allow these applications to connect to a Hadoop cluster through Apache Hive. This gives analysts accustomed to working in their favorite visualization tool a great advantage. Many times, there is nothing new to learn to import data into a powerful analytic tool like the Power BI suite and begin delivering self-service BI applications directly on top of Hadoop and HDFS.

Reduced user training is just another great benefit that Hive provides to solutions built using Hadoop.

Hive adds a much needed security layer to Hadoop

One of the major obstacles facing many enterprise-level Hadoop projects is the lack of easily administrable user-level security options within Hadoop applications and HDFS. The latest version of Apache Hive supports granular, object-level security. Administrators can grant access on any specific table or view, from none, to read-only, to read/write. Using the security layer in Apache Hive can reduce the complexity of securing HDFS directly.

Hive supports more than just rows and columns

Hive is more than just a SQL engine sitting on top of Hadoop: it is an application that is designed to consume data in many different forms using an industry standard language. Hive is great at consuming data that is stored in comma separated files. It’s also great at parsing JSON files (a common semi-structured text format), allowing analysts to easily query data in those files without needing to write their own parsers. Is your data stored in cyclic or acyclic graphs? Hive can consume that too: it knows how to follow the links and provide the answers that analysts seek.

Is your data in a non-standard format that a developer created 20 years ago? Hive can consume it…with some extra work on your part. Hive also supports custom SERDE (SERializer/DEserializer) formats that can be built to handle the needs of your data format.
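As a sketch of what that looks like in practice, here is a table defined over JSON files using the HCatalog JSON SerDe. This assumes the hive-hcatalog-core JAR is available on your cluster; the table, columns, and path are illustrative.

CREATE EXTERNAL TABLE clickstream_json
(
user_id STRING
,page STRING
,event_time BIGINT
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/data/clickstream';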

Hive really can consume nearly any data that you need to analyze.

Paired with HCatalog, Hive provides a cross-application data model sharing platform

HCatalog is a metadata management tool that exposes Hive definitions to other applications such as Pig. This tool allows Pig developers to use and interact with Hive tables without the need to request data definitions or recreate data models in PigLatin.

This is a very important feature for enterprise deployments that involve a variety of the applications that are available in the Hadoop ecosystem.

Hive provides options for higher performing operations that avoid MapReduce

Although Hive was originally based on MapReduce, it is not exclusive to that processing framework. Hortonworks, in partnership with Microsoft, released Tez in 2014, which Hive can use as its execution engine. In addition to Tez, Microsoft engineers contributed to the performance-improvement goals by adding vectorization features.

Upcoming versions of Hive will support Apache Spark, a new(ish) distributed processing framework that boasts greatly improved performance over legacy MapReduce.
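Switching engines is a per-session (or hive-site.xml) property change; a minimal sketch, assuming Tez is installed on the cluster:

-- Run subsequent queries on Tez instead of classic MapReduce
SET hive.execution.engine=tez;

SELECT COUNT(*) FROM sales;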

Hive is fun!

Well…it is! Writing SQL (and HiveQL) is both challenging and rewarding. Hive is especially rewarding: when a particularly complex query written over data that may or may not be ‘rectangular’ (that is, arranged in rows and columns) just works, you get a great sense of accomplishment.

Let’s face it; none of us want to write MapReduce anymore. Hive is one of the many tools in our Big Data toolbox that helps to make it fun!

Creating and Configuring Certificates for Azure PowerShell Management

If you’ve worked at all with Microsoft (previously Windows) Azure, you’ll know that the Management Portal can only get you so far. Eventually, you’re going to run into the desire to script tasks, or the need to use PowerShell because the functionality doesn’t yet exist in the Management Portal.

Once you find yourself opening the Azure PowerShell command prompt, you’ll first have to configure your session to connect to the Azure subscription you want to manage. After you’ve had to do that more than once, you’ll be looking for ways to automate your subscription selection. Fortunately, there is an easy way to do that.

In this post, I’ll show you how to create a certificate, associate it with your subscription, and configure your machine using PowerShell to use that subscription by default. The steps I show here are designed to work in a development environment. Please be sure to review them to ensure that the security settings are compliant with your environment if you wish to do this process in production.

This post will cover, in brief, the following topics:

  • Creating a personal certificate
  • Exporting and uploading the certificate to Azure
  • Configuring Azure PowerShell environment on local machine

Creating a Personal Certificate

The easiest way to create a personal certificate is to use the makecert.exe utility which is installed with Visual Studio. Follow the steps below to create a certificate that is compatible with Windows Azure.

1.) Open a Visual Studio Command Prompt in Administrator Mode.

2.) Enter the following command. Note that Azure-compatible certificates must have a key length of at least 2048 bits and should be stored in the Personal certificate store. Replace the <CertificateName> placeholder with the name of the desired certificate. For more information on creating and managing certificates compatible with Windows Azure, read this TechNet article (http://msdn.microsoft.com/en-us/library/azure/gg551722.aspx)

makecert -sky exchange -r -n "CN=<CertificateName>" -pe -a sha1 -len 2048 -ss My "<CertificateName>.cer"

3.) Upon receiving a successful response from the command, open the “Manage User Certificates” console.


4.) Browse to the Personal store and verify that the certificate just created exists.

Exporting and Uploading the Certificate to Azure

Once you have the certificate created, you’ll need to export a copy of it in .CER format to upload to Azure. By doing this, you’re allowing Azure to verify that a connection coming from your machine is valid and should be trusted.

The certificate that you export shouldn’t contain your private key, so be careful. If you export a copy of your certificate with a private key and someone were to get a hold of that certificate, they could pretend to be YOU in Azure’s eyes, opening up the possibility of a security breach.
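If you prefer to script this step rather than click through the wizard, here is a minimal sketch using the PKI module’s Export-Certificate cmdlet (available on Windows 8 / Server 2012 and later; the thumbprint and output path are placeholders). The GUI walkthrough follows below.

# Grab the certificate created earlier from the Personal store
$cert = Get-Item "Cert:\CurrentUser\My\<CertificateThumbprint>"

# Export-Certificate writes a DER-encoded .cer file and never includes the private key
Export-Certificate -Cert $cert -FilePath "C:\temp\<CertificateName>.cer" -Type CERT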

1.) From the Certificate Store console, right-click the certificate created in the procedure above and select All Tasks -> Export.


2.) At the first step of the Certificate Export wizard, read the information presented and click Next. The next step will give you the choice to export the private key. Be sure to select No, do not export the private key.


3.) On the Export File Format screen, choose one of the options that results in an exported certificate in the .CER format. I chose the first option for my exported certificate.


4.) Provide a filename for the exported certificate and click Next. Review the options and click Finish to complete the export of your certificate.


5.) To upload the certificate to your Azure subscription, first you’ll need to log in to the Azure Management Portal. Once successfully logged in, open the Subscriptions menu and select Manage Subscriptions/Directory.


6.) On the Subscription Settings page, navigate to the Management Certificates section and click the Upload button on the bottom menu.


7.) Browse to the exported certificate created in the steps above and click the checkmark to upload the certificate.


Configuring Azure PowerShell Environment on Local Machine

With the personal certificate created and uploaded to your Windows Azure subscription, the final step is to add this subscription to your PowerShell environment and configure options for its default use.

To complete the following steps you’ll need to have the Windows Azure PowerShell cmdlets installed. You can choose to use either the Windows Azure PowerShell command line or the PowerShell ISE for this next task.

1.) Before you begin entering PowerShell code, you’ll need to collect two pieces of information. First, the certificate thumbprint. This is found by opening the certificate created earlier, browsing to the Details tab, and looking for the Thumbprint property. Copy all of the characters from the thumbprint and remove the spaces. Set this aside for a minute.
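If you’d rather pull the thumbprint from PowerShell than from the certificate dialog, something like this works (a sketch; filter on your certificate’s subject name):

# List the subject and thumbprint of every certificate in the Personal store
Get-ChildItem Cert:\CurrentUser\My | Select-Object Subject, Thumbprint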


2.) The second piece of information you need to collect is the Windows Azure Subscription ID. This is found on the Azure Management Portal in Subscription Settings – the same page you used above to upload the certificate. Copy the SubscriptionID and set it aside for a minute.


3.) Back to your PowerShell window. First you need to create a variable to hold your certificate and execute an Azure cmdlet to add the Subscription to your local configuration. The following two lines of PowerShell code take care of this.

Note that the path to the certificate assumes that you’ve created the certificate the same way as above, so it’s stored in your Personal certificate store. Modify the code below if you have it stored in a different location.

$cert = Get-Item "Cert:\CurrentUser\My\<CertificateThumbprint>"

Set-AzureSubscription -SubscriptionName "<EnterSubscriptionName>" `
    -SubscriptionId "<SubscriptionID>" -Certificate $cert

4.) To verify the subscription was added correctly, run the following command. This command will return a list of all subscriptions configured on your machine.

Get-AzureSubscription

5.) With the subscription now added, you need to select it and optionally set it as default. If set as default, it will be active upon opening a new Azure PowerShell window – a great option if you want to be able to open a command prompt and begin administering Azure right away. To do this, use the following command.

Note that you can add the -Default option to make this the default subscription, as described above.

Select-AzureSubscription -SubscriptionName "<EnterSubscriptionName>"

6.) Now that the new subscription is added and selected, it’s time to test. You can run a command like one of the following to verify that your new certificate-enabled subscription is working properly.

Get-AzureVM

or

Get-AzureAccount

Conclusion

With the certificate now installed and your PowerShell environment configured, it will be easier than ever to administer your Azure environment. You can even upload multiple personal certificates to your Azure subscription, so if you have multiple machines you can configure them all for easy access to your cloud resources.

PASS Business Analytics Conference – Keynote Day 1 Recap

Pass Keynote panorama

The PASS Business Analytics conference has officially begun this morning. Yesterday wrapped with a successful round of pre-conference seminars and the Welcome Reception – the chorizo was pretty amazing, #thankslarock.

I was honored to be asked to sit at a special reserved table for bloggers and tweeters to broadcast keynote announcements live to the cloud! I was joined by many fellow bloggers and friends, including fellow BlueGranite coworker Melissa Coates (b | @sqlchick).

Today starts off with a keynote presentation featuring PASS president Thomas LaRock as well as Microsoft superstars Amir Netz and Kamal Hathi.

Thomas is the first to take the stage and kick off the second PASS Business Analytics conference. Thomas welcomes us and urges us to take hold of our passion for data and use it to give back to the community. PASS is one of the avenues we can use to give back. Join a local user group, join a virtual user group, start a user group if one doesn’t exist near you. Get involved. PASS has grown to over 100,000 members and operates largely on volunteers’ hard work and effort.

John Whitaker from Dell, one of the Platinum Partner sponsors of the PASS Business Analytics conference this year, joined the stage to give us some insights into Big Data and how it’s penetrating the mid-market tier. Big takeaways from this session are:

1.) Big Data Projects are not just for enterprise

2.) One of the largest primary factors of successful implementation is IT to Business alignment

3.) The most successful implementations so far have been Real-Time Analytics and Predictive Analytics

Overall John’s message is great and it’s an important one to hear. It’s also one we’ve heard before. Business has several challenges with data analytics: data complexity, data volume, and budget. While these challenges aren’t new, Big Data tools (like Hadoop) can help address these complexities in new ways that weren’t possible with traditional data analytics tools (like SSAS).

Collaboration between business and IT is also a key to success, which again, is nothing new, but as both tools and data become more complex that collaboration becomes ever more important.

Amir Netz is a Microsoft Technical Fellow and one of those great, engaging speakers who really expresses his enthusiasm for Excel and analytics through his presentations.

Kamal Hathi is a director of engineering for Microsoft BI. Kamal is in charge of the project teams that have created great new tools like Power BI.

Amir and Kamal really took charge of the room and drove up the excitement by announcing some AMAZING new features coming to Power BI. Seriously, there was a LOT of information during this keynote. Here is a brief recap:

The product updates came fast and furious during the demo.

  • Updates to natural language queries to make it even easier for users to get access to data
  • The Field list is now shown by default in Power Q&A
  • SSRS will be available in Power BI this summer
  • Data will be able to be hosted on premises, with reports surfaced in the cloud; a direct connection is possible, so no data refresh is necessary
  • Power BI iOS application will be available this summer
  • In-browser interactivity for hosted Power View reports: users will be able to interact with and edit hosted reports without downloading them to a local machine first
  • New chart types available INCLUDING TREE MAPS!!!
  • Natural language dashboard creation built into Power BI and Power Query
  • Forecasting algorithms built into Power View allow for easy forecasting of current data
  • New chart modification features, including the ability to combine charts, filter specific values out of one chart, and move chart features from one chart to another


As you can see, a massive number of updates were announced today. I’m sure I missed one or two, so be sure to watch the Power BI site for details, as well as the #passbac hashtag for live conference updates.

The keynote at today’s BA Conference opening included some great information about PASS, the Business Analytics community, and new features available within the Microsoft BI ecosystem.

I’ll be back tomorrow live blogging/tweeting the keynote presented by data visualization expert David McCandless. Watch this blog and @joshuafennessy on Twitter for updates.
