Author Archive: datafoam
The Department of Homeland Security issued guidance for securing the Internet of Things. In-brief: a little more than a month after revealing that it was working on guidelines for securing The Internet of Things, the Department of Homeland Security (DHS) on Tuesday published a set of…
It should come as no surprise that having the right strategy is the key for getting success from your data projects. And this strategy goes well beyond the data itself. It involves having the right people, process, and technology in…
Cyber threats used to be something that humans could handle with the right tools, but today’s threats have grown too big, too fast, and too complex for existing solutions or methodology to handle. Cybersecurity has become a machine-scale problem,…
All businesses have assets that depreciate over time. It’s intuitive that the extended use of a given piece of a equipment decreases its value, and we see this in our daily lives with the cars we drive or the old…
We know that Big Data, analytics and, yes, even Apache Hadoop, have become meaningless buzzwords and jargon for many people. Where there’s jargon, there is misinformation and where there is misinformation there are myths. Our new eBook goes into the…
The technology industry is an essential driver of the US and global economy. Technology is driven by innovation. Over many decades, Silicon Valley has built a powerful engine for new ideas, new companies and new products to change the world…
In a blog posted earlier this week, my esteemed colleague Sean Anderson laid out a powerful argument for machine learning (ML) as a way to fuel recommendation engines, churn reduction engines, and IoT workflows. Leveraging components like Apache Spark, and…
While my motorbike is hibernating, I re-watched Ewan McGregor and Charley Boorman’s Long Way Round as the next best thing to getting to ride myself. The mini-series documents the 19,000 mile journey both actors undertook from start to finish, riding…
A guest blog post from Scott Schlesinger, Principal, America’s EY Advisory EY and Hortonworks formed a strategic business alliance in August 2015 that is focused on helping our valued clients turn big data challenges into big business opportunities. Recognizing that big data is transforming business and technology is driving that change, EY plays a significant role in […]
The post EY shares Key Observations from Hadoop Summit 2016 appeared first on Hortonworks.
Everyone around the internet is constantly talking about the bright future of Apache Spark. How cool it is, how innovative it is, how fast it is moving, how big its community is, how big the investments into it are, etc. But what is really hiding behind this enthusiasm of Spark adepts, and what is the real future of Apache Spark?
In this article I show you the real data and real trends, trying to be as agnostic and unbiased as possible. This article is not affiliated with any vendor.
Let’s start with official position of Databricks on the shiny future of Apache Spark. Here is the slide from Databricks presentation on Apache Spark 2.0, the major new release of this tool:
You can see that 2 out of 3 new major features are related to SQL: SQL 2003 compliance and Tungsten Phase 2, that was targeted to greatly speed up SparkSQL by delivering a big number of performance optimizations. The last improvement is streaming, but again – structured streaming, which would underneath reuse parts of the code introduced for SparkSQL (same presentation):
So it is getting interesting – all the 3 major improvements introduced in Spark 2.0 are about SQL!
Spark Survey 2015
So far so good, let’s take a look at the Spark Survey handled by Databricks one year ago. Most interesting parts are this:
You can see that 69% of the customers are using SparkSQL, and 62% using DataFrames, which essentially use the same processing layer with SparkSQL (Catalyst optimizer and in-memory columnar storage). Also, two biggest use cases for Apache Spark are Business Intelligence (68%) and Data Warehousing (52%), both of them are pure SQL areas.
Apache Spark Code
Again, what was the original idea of Apache Spark, when it was introduced by AMPLab of Berkley? Let’s take a look at Matei Zaharia’s presentation on Apache Spark from Spark Summit 2013:
One of the biggest Apache Spark advantages is simplicity. Its code is compact, and everything is based on the Core engine, introducing RDDs and DAGs. But what about now? We can easily check this, as Apache Spark is an open source project. Here you can find some statistics built on its source code:
Left group of columns represents Apache Spark v1.0.0, approximately the same release Matei was speaking about on the Spark Summit 2013. The right columns represent the current master branch which is approximately the same as Spark v2.0.0 (6 commits ahead). Take a look at who is leading now – the biggest traction in community is caused by SparkSQL and MLlib! Streaming is growing times slower, while GraphX has almost nothing new, its code base has grown by roughly 1000 lines of code.
Apache Spark JIRA
Good, now let’s turn from historical perspective to the future perspectives of Apache Spark. Let’s take a look at the open issues in Apache Spark JIRA, splitting them by the component:
It is no longer a surprise for you, but SQL component is related to 34% of the issues, while Core is only 15%.
Apache Spark Contributions
I will drop some more charts before moving to the conclusions. Here is the number of commits to Apache Spark per month since the project was established:
Orange line is a moving average over the past 6 months, it is used to normalize the contribution peaks and show the general contribution trend. We can see from this chart, that Databricks works on Apache Spark using 3-months development cycles, and the missing peak of 2016’Feb corresponds to the time they were working on Apache Spark 2.0 release.
And now another graph, number of unique contributors to Apache Spark a month:
Again, orange line is a trend line showing moving average across the last 6 months.
Apache Spark was introduced by AMPLab as a general-purpose distributed data processing framework. Databricks was formed from the AMPLab people who worked on Apache Spark, to make this engine a huge commercial success, and this is when the things went wrong. Corporates can vote for the project direction with their money, while everything community can offer is limited individual contributions. Little-by-little Apache Spark is moving from being general purpose execution engine, to the corporate space, where SQL is the main and only standard for data processing. Apache Spark starts to compete with MPP solutions (Teradata, HP Vertica, Pivotal Greenplum, IBM Netezza, etc.) and SQL-on-Hadoop solutions (Cloudera Impala, Apache HAWQ, Apache Hive, etc.). At the moment Apache Spark is not positioned as their competitor because of the obvious fact – in its current state it will lose this battle. But it is getting closer and closer to its real competitors, and here is where the things are getting interesting: enterprises want the functionality they used to, and it is shaping the future of Apache Spark, putting it into the category of solutions where it cannot efficiently compete. Databricks team is putting tremendous efforts in making it a good competitor in SQL space, but it has a very low chance of winning this battle against 30-years veterans like Teradata and 40-years like Oracle.
And here are the promised conclusions:
SparkSQL is the future of Apache Spark. Apache Spark competes in SQL space against MPP databases and SQL-on-Hadoop solutions, and the battle is tough
Apache Spark is getting substantially bigger (650’000 LOC already) and more complex, increasing the entry barrier for new contributors
Enterprise investments to Apache Spark turn out to be the investments in making it capable of integrating with their products (Spark on IBM Mainframe, Spark-Netezza Connector, Spark on Azure, Spark in Power BI, etc.), not really making Apache Spark better
My personal perspective on this is the following:
In 1 year Spark would start being officially competitive with MPP and SQL-on-Hadoop solutions
In 2 years Spark would lose the battle against MPP and MPP-on-Hadoop solutions and take a niche of Hive in Hadoop ecosystem
In 2 years it will lose the market share in stream processing to specialized solutions like Apache Heron and …read more