Joshua Fennessy

Making the case (or not) for Hadoop and Spark

For the past several years, we’ve been inundated with messages about how Hadoop and similar technologies, most recently Spark, will make our data processing jobs so much faster. There have been plenty of controlled tests and sort benchmarks showing that open-source-project-A is orders of magnitude faster than open-source-project-B, and so on.

But what about real-world usage? That’s exactly what I’m looking to find out. I’ve been told, and I HAVE told customers of mine, that processing large data sets in a distributed environment is faster, but I’m rarely given the opportunity to spend billable hours building two solutions to prove that the cases where Hadoop really is faster actually exist.

And, let’s be honest, it’s often a hard story to stand behind, especially when a customer loads a bunch of data into Hive, runs a query, and then waits minutes (or even hours) for the results; the same query, of course, runs in seconds on SQL Server… boy, does that make me look like a fool (thanks a lot, Hive!).

What’s the plan?

The plan, executed over the next few weeks, is to build two ETL solutions that meet the same requirements: one on SMP SQL Server using SSIS, and one on Hadoop/Spark.

Why SQL Server?

Traditionally, I’m a BI guy. I deal with Data Warehouses (and Marts) most of the time. I rarely work on OLTP systems. Most of my customers are working on Microsoft-based systems using SQL Server and SSAS. So, because I want to be able to use this as a story for BlueGranite’s customers, I’m choosing to use a common system.

Why Hadoop and Spark?

I’ve been focusing on Hadoop for nearly three years now, and as a company, BlueGranite has been cultivating a budding Big Data practice for two years. We’ve already seen massive growth of this practice in 2016, and it’s showing no sign of slowing down. Hadoop is maturing to a level where our customers are ready to trust it. Spark is also maturing and is very quickly taking over batch processing in Hadoop. I’m making it my goal to write zero Pig Latin in 2016 (and beyond).

Did I just say “requirements” to myself?

Yup. I did. Requirements are important for any project. For this effort, I’m going to be using a dataset available from NOAA. Specifically, the NOAA daily dataset. Why this one? Well, frankly, it’s a pretty easy dataset to work with, and for this test I’m not concerned with the complexity of the data processing. It’s also probably big enough: about 24 GB compressed, roughly 2.4 billion rows. I think that should be big enough to show a difference between SMP and MPP processing.

Ok, identifying the dataset is great, but that’s not a requirement by itself. Right, well, here are my requirements (I’ll sketch what the Spark side might look like right after this list):

  • Source data is stored in delimited files
  • The data needs to be stored as a single row per station per day
    • Measurements should be stored as columns for easy consumption with visualization tools
  • The resulting dataset needs to include geographical information
    • It’s weather, so a map visualization kind of makes sense
  • Time series analysis must be possible with the resulting dataset
    • I’ll want to be able to do year-over-year comparisons, etc.
  • The data must end up in a SQL Server table
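
Since the pivot and the SQL Server landing are the interesting parts of the Spark side, here is a rough PySpark sketch of what that solution might look like. To be clear, this is a sketch under assumptions, not the final solution: the file paths, column names, element list, and JDBC connection details are all placeholders, and writing to SQL Server assumes the Microsoft JDBC driver is available on the cluster.

    # Rough sketch only. Paths, column names, element list, and connection
    # details below are placeholder assumptions, not the real solution.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("noaa-daily-etl").getOrCreate()

    # Read the delimited source files (assumed layout: one measurement per row).
    raw = (spark.read.csv("hdfs:///data/noaa/daily/*.csv.gz")
           .toDF("station_id", "obs_date", "element", "value",
                 "m_flag", "q_flag", "s_flag", "obs_time"))

    # Requirement: one row per station per day, measurements pivoted to columns.
    daily = (raw.groupBy("station_id", "obs_date")
                .pivot("element", ["TMAX", "TMIN", "PRCP", "SNOW"])  # illustrative subset
                .agg(F.first("value")))

    # Requirement: include geographical information (hypothetical station lookup file).
    stations = spark.read.option("header", "true").csv("hdfs:///data/noaa/stations.csv")
    result = daily.join(stations, "station_id", "left")

    # Requirement: land the result in a SQL Server table (placeholder connection info,
    # and the Microsoft SQL Server JDBC driver must be on the cluster classpath).
    (result.write.format("jdbc")
           .option("url", "jdbc:sqlserver://sqlserver-host:1433;databaseName=Weather")
           .option("dbtable", "dbo.DailyObservations")
           .option("user", "etl_user")
           .option("password", "********")
           .mode("overwrite")
           .save())

The real solution will need to handle things this sketch glosses over, like casting the measurement values to numeric types and dealing with quality flags, but the shape of it should look roughly like this.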

What about the environment? Where am I going to build this stuff?

I bet you think I’m going to say the Cloud, right? Well, I normally probably would, but in this case, I’m not.  A couple of reasons why:

  • I want to test this on bigger hardware than my meager Azure credit can afford
  • I happen to have this hardware in my physical home office, and am excited to have a use case for it 🙂

I don’t have a huge server farm in my room, but I have a few decent boxes that I’m going to be using for this.

SQL Server

My SQL Server instance will run on a 16 core/96 GB RAM TrendMicro server that I have on hand. It has about 6 TB of disk configured into a couple of RAID 10 arrays. It won’t be completely production-compliant, but I’ll at least be able to keep data and logs on different spindles and have enough space to allocate the databases appropriately. This machine will also run the SSIS ETL.

Hadoop/Spark

My Hadoop cluster is made up of two other servers. The first is a smaller Dell R410 (8 core/64 GB) that runs a virtualized namenode and a virtualized MySQL database server for Hadoop metadata. My data nodes are virtualized and running on one of the ill-fated Dell QuickStart Data Warehouse machines. I would normally try to have separate physical data nodes, but this server allows for some interesting Hadoop virtualization. This QSDW box is a 16 core/96 GB server with four NICs and 27 500 GB 10,000 RPM SCSI drives. Why is this good for Hadoop? Well, it means I can run three data nodes, each with a dedicated NIC and six dedicated drives for HDFS, so I can get truly parallel HDFS reads and writes with virtualized servers. While not as good as true physical servers, it ends up being a pretty good environment to run Hadoop on.

What sort of timeline is this going to happen in?

Great question — I hope to be able to start when I get back home in early April.  I’m travelling around quite a bit right now, so I imagine it will take me a few weeks to complete all of the development and testing.

What will I find out?

I don’t know! That’s the best and/or scariest part. I might find out that SSIS/SQL Server is way better for processing gigabytes of files. I may find out that we should run Hadoop on our phones because it’s so great! I really have very few expectations. I do feel like the Hadoop/Spark solution is going to be faster than the SQL solution for this case, but not enough to bet on it.
