Joshua Fennessy

A Few Tips for Getting Started with Apache Spark

Having been primarily a Microsoft SQL BI architect for the last few years, I struggle to call myself a developer. Close to 20 years ago I wrote Java, and 15 years ago .NET, but since then the amount of code I’ve *really* slung has been pretty minimal. Sure, I’ve written lots of SQL, and some scripting here and there, but it’s all been comparatively light.

For the past three years, I’ve been focused on Hadoop, and that’s required me to dip my toes into the development arena here and there. Still, I’ve primarily stayed in the scripting language camps, until now.

I’ve seen the light, and the light is Apache Spark. I’ve said it before, and I’ll probably say it many more times: Apache Spark is going to revolutionize how we build Big Data solutions and how we approach Modern Data Warehouse projects in the future.

I’m really excited about Apache Spark, and I hope you are too. But if you’re like me, transitioning from a primarily SQL-focused mindset to a programming framework like Spark isn’t going to be easy. If you really want to be working with Spark, you’ll need to pick up some Scala. Spark also supports Python and Java, but Scala is the de facto language of Spark. You can probably get away without learning it for quite a while if you know Python, but I bet you’ll eventually come across something that requires some Scala.

Let me be clear about something up front: I don’t know Scala. At least, I don’t know enough of it to call myself proficient. Sure, I’ve written a couple of Spark applications, but in my mind they are really simple, and I wouldn’t feel comfortable getting a Scala tattoo quite yet.

Tip #1 – Don’t be scared

With the latest release of Apache Spark, 1.6.1, you’ll have access to a mature DataFrames API. What does this mean? For the most part, it means that the Apache Spark engine now includes a built-in optimizer, so you don’t have to be a Scala guru to write well-performing code. Yes, all of you Scala masters, I realize this may introduce some sloppy code to the ecosystem, but I think opening up the APIs to a larger group of developers is a good thing. In my opinion, the more people we can get using Apache Spark, the better the world will be.
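If you want to see the optimizer at work, a DataFrame’s explain() method prints the plan it builds for you. Here’s a minimal sketch in spark-shell terms; the parquet path and the column names are made up for illustration.

```scala
// Spark 1.6, inside spark-shell (sc and sqlContext are provided for you).
// The file path and columns below are placeholders, not real data.
val orders = sqlContext.read.parquet("/data/orders.parquet")

orders
  .filter(orders("status") === "SHIPPED")
  .groupBy("customer_id")
  .count()
  .explain()   // prints the optimized physical plan; no hand-tuning required
```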

Secondly, using the Spark DataFrame API means that the code you write will be somewhat readable by a T-SQL expert. It’s still Scala code, but since DataFrames don’t rely on lambda functions, it’s much more readable to someone who is used to writing SQL.
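To give a sense of what that looks like, here’s a minimal sketch of a DataFrame query in Spark 1.6. The path and column names are invented for the example; the shape of the code is the point.

```scala
// In spark-shell, sc and sqlContext already exist. "/data/sales.parquet"
// and the columns (year, region, amount) are illustrative only.
import org.apache.spark.sql.functions._

val sales = sqlContext.read.parquet("/data/sales.parquet")

// Reads a lot like a SQL query: filter, group, aggregate, order.
val totalsByRegion = sales
  .filter(col("year") === 2015)
  .groupBy("region")
  .agg(sum("amount").as("total_amount"))
  .orderBy(desc("total_amount"))

totalsByRegion.show()
```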

Speaking of writing SQL: if you wanted to, you could write your Spark DataFrame application entirely in SQL, and the optimizer engine will make sure it runs just as well as anything else.
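For instance, the same hypothetical query from the sketch above can be expressed as SQL by registering the DataFrame as a temporary table (Spark 1.6 syntax); the optimizer plans both versions the same way.

```scala
// Reuses the illustrative `sales` DataFrame from the previous sketch.
sales.registerTempTable("sales")

val totalsByRegionSql = sqlContext.sql("""
  SELECT region, SUM(amount) AS total_amount
  FROM sales
  WHERE year = 2015
  GROUP BY region
  ORDER BY total_amount DESC
""")

totalsByRegionSql.show()
```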

Tip #2 – Ditch the IDE

When I first started looking at Spark, my first action was to download Eclipse. My second action was to stare at the Eclipse splash screen and think “well, what the hell do I do now?”  It was a mistake.

If you’re not already a Java or Scala developer, you don’t need an IDE to work with Spark when you’re just learning; remember, learning Spark really means learning Scala too. Actually, I’d postulate that you’ll have a MUCH better experience if you ignore the urge for an IDE and just focus on your data. Getting to the IDE will come in time, and by the time you find you need it, you’ll be comfortable enough with Spark and Scala that setting it up won’t be a big deal.

There are a couple of great options for interacting with Spark without building a custom JAR and submitting it with spark-submit. The first one comes with your Spark installation: spark-shell.

spark-shell is a command-line tool that drops you into a Spark prompt. The prompt accepts Scala line by line and evaluates each line as you enter it, which is great when you’re just learning the syntax. Additionally, spark-shell takes care of a bunch of housekeeping, all of which you’d have to do on your own in an IDE. The downside is that the code you enter isn’t saved anywhere (except the shell history), so you may find yourself re-entering lines of code between sessions.
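A typical first session looks something like this sketch: launch spark-shell from your Spark installation’s bin directory and type lines one at a time. The JSON path is a placeholder.

```scala
// Entered line by line at the spark-shell prompt; sc and sqlContext are
// already created for you, and each line is evaluated as you hit Enter.
// "/data/people.json" is an example path, not real data.
val people = sqlContext.read.json("/data/people.json")
people.printSchema()                   // the shell prints the inferred schema
people.select("name", "age").show()    // and the first rows of the result
```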

Another great option is to use a notebook. If you’re using a Hadoop image, like the Hortonworks HDP Sandbox, then you’ll probably have access to Jupyter. Jupyter is a web-based tool for building living documents. Basically, you get a web page with an open cell; in that cell you can enter all sorts of code (Scala, Markdown, Python, SQL, bash, etc.), and it will be evaluated and presented in the browser. Depending on your Jupyter installation, you may also have access to a web-based terminal, allowing direct machine access to your cluster.

Tip #3 – Focus on the right content

The Spark API is not small, and there are multiple paths to look at. Based on a recent class I attended at Strata+Hadoop World in San Jose, my recommendations are:

  • Spend 95% of your learning time on DataFrames. This will expose you to the DataFrame API and Spark SQL. Use SQL when you can; the Spark execution engine will handle optimizations for you. RDDs are not where your time should be focused. While DataFrames don’t support lambda functions (a function executed for each row of data in the set), you can use Datasets to get that functionality alongside DataFrames without resorting back to RDDs. RDDs are considered too low-level for most developers to worry about at this point in Spark’s maturity.
  • Use HiveContext, org.apache.spark.sql.hive.HiveContext, instead of the plain Spark SQL context; the Hive SQL parser is better than Spark’s. You don’t need to be running Hive to use the HiveContext, it works just fine without it (see the short sketch after this list). Unofficially, the rumor is that the Spark SQL context in its current form will go away and be replaced with HiveContext in the future.
  • Marketing loves to talk about Spark Streaming, but in practice it’s not quite mature yet. Look at it, but don’t treat it as the only streaming solution available. Spark’s bread-and-butter use case is still ETL and batch data processing.
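Here’s a minimal sketch of using a HiveContext in a standalone application, assuming a Spark 1.6 build with Hive support (the standard pre-built downloads include it); in spark-shell you typically get one as sqlContext already. The app name and file path are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// No running Hive installation is required for this to work.
val conf = new SparkConf().setAppName("HiveContextExample")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)

// Use it exactly like the plain SQL context, with a more capable SQL parser.
val people = hiveContext.read.json("/data/people.json")  // placeholder path
people.registerTempTable("people")
hiveContext.sql("SELECT name FROM people WHERE age > 21").show()
```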

Tip #4 – Work through tutorials and get to know the documentation

There are many Spark tutorials available; all of the major Hadoop vendors (Hortonworks, Cloudera, MapR) have Spark tutorials to work through. They are fine places to start and to get your hands on some data. But the real content to focus on is the API documentation. Spark has really good documentation, and the API docs will be very helpful as you explore how to use DataFrames to process data.

Additionally, bookmark spark-packages.org. As you work with different types of data, you’ll often want to check there to make sure that whatever you’re trying to figure out hasn’t already been done by someone else. There is a vibrant community at spark-packages, and chances are you’ll find what you need there.

Tip #5 – Don’t give up!

If you’re not a developer (like me), it will be hard to get started, but keep at it! Spark isn’t nearly as daunting as it looks at first glance. Follow the tips above and you’ll have a better experience than I did when I got started.
