A guest blog post from Scott Schlesinger, Principal, America’s EY Advisory EY and Hortonworks formed a strategic business alliance in August 2015 that is focused on helping our valued clients turn big data challenges into big business opportunities. Recognizing that big data is transforming business and technology is driving that change, EY plays a significant role in […]
The post EY shares Key Observations from Hadoop Summit 2016 appeared first on Hortonworks.
Everyone around the internet is constantly talking about the bright future of Apache Spark: how cool it is, how innovative it is, how fast it is moving, how big its community is, how large the investments in it are, and so on. But what is really hiding behind this enthusiasm of Spark advocates, and what is the real future of Apache Spark?
In this article I show you the real data and the real trends, trying to stay as agnostic and unbiased as possible. This article is not affiliated with any vendor.
Let’s start with Databricks’ official position on the shiny future of Apache Spark. Here is a slide from the Databricks presentation on Apache Spark 2.0, the major new release of this tool:
You can see that two of the three new major features are related to SQL: SQL:2003 compliance and Tungsten Phase 2, which was targeted at greatly speeding up SparkSQL through a large number of performance optimizations. The third improvement is streaming, but again it is structured streaming, which under the hood reuses parts of the code introduced for SparkSQL (same presentation):
So it is getting interesting: all three major improvements introduced in Spark 2.0 are about SQL!
Spark Survey 2015
So far so good. Now let’s take a look at the Spark Survey conducted by Databricks one year ago. The most interesting parts are these:
You can see that 69% of the respondents use SparkSQL, and 62% use DataFrames, which essentially share the same processing layer with SparkSQL (the Catalyst optimizer and in-memory columnar storage). Also, the two biggest use cases for Apache Spark are Business Intelligence (68%) and Data Warehousing (52%), both of them pure SQL areas.
Apache Spark Code
Again, what was the original idea of Apache Spark when it was introduced by UC Berkeley’s AMPLab? Let’s take a look at Matei Zaharia’s presentation on Apache Spark from Spark Summit 2013:
One of Apache Spark’s biggest advantages is simplicity. Its code is compact, and everything is built on the Core engine, which introduces RDDs and DAGs. But what about now? We can easily check, as Apache Spark is an open source project. Here are some statistics built from its source code:
The left group of columns represents Apache Spark v1.0.0, approximately the release Matei was speaking about at Spark Summit 2013. The right columns represent the current master branch, which is approximately the same as Spark v2.0.0 (6 commits ahead). Look at who is leading now: the biggest traction in the community comes from SparkSQL and MLlib! Streaming is growing many times slower, while GraphX has almost nothing new; its code base has grown by roughly 1,000 lines of code.
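Per-module line counts like the ones above can be reproduced from a checkout of the Spark source tree. Below is a minimal sketch, assuming a simple raw line count over source files; the module directory names in the comment reflect the Spark repository layout, and a serious measurement would also exclude tests, comments, and generated code (tools like cloc do this properly).

```python
import os

def count_lines(root, extensions=(".scala", ".java", ".py")):
    """Count raw lines in all source files under a directory tree."""
    total = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, errors="ignore") as f:
                    total += sum(1 for _ in f)
    return total

# Hypothetical usage against a local clone of the Spark repository:
# for module in ("core", "sql", "streaming", "mllib", "graphx"):
#     print(module, count_lines(os.path.join("spark", module)))
```

Comparing these numbers between the v1.0.0 tag and the current master branch gives the growth per module.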
Apache Spark JIRA
Good. Now let’s turn from the historical perspective to the future perspectives of Apache Spark. Let’s take a look at the open issues in the Apache Spark JIRA, split by component:
By now it should come as no surprise to you: the SQL component is related to 34% of the open issues, while Core accounts for only 15%.
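The per-component share can be derived mechanically from exported JIRA issue data. Here is a hedged sketch; the sample issue list is hypothetical, and real data would come from the ASF JIRA (open issues in the SPARK project, each tagged with one or more components, which is why the percentages can sum to more than 100).

```python
from collections import Counter

def component_shares(issues):
    """Map each component to its percentage of the open issues.

    An issue may carry several components, so shares can overlap.
    """
    counts = Counter()
    for issue in issues:
        for component in issue["components"]:
            counts[component] += 1
    total = len(issues)
    return {c: round(100.0 * n / total, 1) for c, n in counts.items()}

# Hypothetical sample data, for illustration only:
sample = [
    {"components": ["SQL"]},
    {"components": ["SQL"]},
    {"components": ["Core"]},
    {"components": ["MLlib", "SQL"]},
]
print(component_shares(sample))
```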
Apache Spark Contributions
I will show a few more charts before moving to the conclusions. Here is the number of commits to Apache Spark per month since the project was established:
The orange line is a moving average over the previous 6 months; it is used to smooth the contribution peaks and show the general contribution trend. We can see from this chart that Databricks works on Apache Spark in 3-month development cycles, and the missing peak in February 2016 corresponds to the time they were working on the Apache Spark 2.0 release.
And now another chart, the number of unique contributors to Apache Spark per month:
Again, the orange line is a trend line showing the moving average over the previous 6 months.
Apache Spark was introduced by AMPLab as a general-purpose distributed data processing framework. Databricks was formed by the AMPLab people who worked on Apache Spark to make this engine a huge commercial success, and this is where things went wrong: corporations can vote for the project direction with their money, while all the community can offer is limited individual contributions. Little by little, Apache Spark is moving from being a general-purpose execution engine into the corporate space, where SQL is the main and only standard for data processing. Apache Spark is starting to compete with MPP solutions (Teradata, HP Vertica, Pivotal Greenplum, IBM Netezza, etc.) and SQL-on-Hadoop solutions (Cloudera Impala, Apache HAWQ, Apache Hive, etc.). At the moment Apache Spark is not positioned as their competitor for an obvious reason: in its current state it would lose this battle. But it is getting closer and closer to these real competitors, and this is where things get interesting: enterprises want the functionality they are used to, and that demand is shaping the future of Apache Spark, pushing it into a category of solutions where it cannot compete efficiently. The Databricks team is putting tremendous effort into making it a good competitor in the SQL space, but it has a very low chance of winning this battle against 30-year veterans like Teradata and 40-year veterans like Oracle.
And here are the promised conclusions:
SparkSQL is the future of Apache Spark. Apache Spark competes in the SQL space against MPP databases and SQL-on-Hadoop solutions, and the battle is tough
Apache Spark is getting substantially bigger (already 650,000 LOC) and more complex, raising the entry barrier for new contributors
Enterprise investments in Apache Spark turn out to be investments in making it capable of integrating with their products (Spark on IBM Mainframe, the Spark-Netezza Connector, Spark on Azure, Spark in Power BI, etc.), not in making Apache Spark itself better
My personal perspective on this is the following:
Within 1 year, Spark will start being positioned officially as a competitor to MPP and SQL-on-Hadoop solutions
Within 2 years, Spark will lose the battle against MPP and SQL-on-Hadoop solutions and take over Hive’s niche in the Hadoop ecosystem
Within 2 years, it will lose market share in stream processing to specialized solutions like Apache Heron and …
Debugging distributed systems can be difficult largely because they are designed to run on many (possibly thousands) of hosts in a cluster. This process typically involves monitoring and analyzing log files spread across the cluster, and if the necessary information is not being logged, service restarts and job redeployment may be required. Not only is […]
The post What’s New in Apache Storm 1.0 – Part 1 – Enhanced Debugging appeared first on Hortonworks.
Dinah Washington sang “what a difference a day makes” and having lived in London for a year this month, I’m feeling that multiplied by 365! And what a year it has been…. I joined Hortonworks back in 2012 when the company was barely 8 months old and moved to be part of the International team […]
The post Hortonworks’ Customers in Europe Power the Future of Data appeared first on Hortonworks.
Part 1: A Little History In this series of blog posts, we will provide an in-depth look at select features introduced with the release of Apache Storm (Storm) 1.0. To kick off the series, we’ll take a look at how Storm has evolved over the years from its beginnings as an open source project, up to the […]