Latest Posts

the agile manifesto (it is still a good idea)

Blog post added by Lester

The term “agile” has really been on my mind lately, and this simple, novel write-up still holds true.

http://agilemanifesto.org/

Manifesto for Agile Software Development

We are uncovering better ways of developing
software by doing it and helping others do it.
Through this work we have come to value:

Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan

That is, while there is value in the items on
the right, we value the items on the left more.


…read more

joining multiple datasets with pig (i/o courtesy of hcatloader & hcatstorer)

Blog post added by Lester

My blogging has been drying up lately, as I’ve mostly been focused on trying to add value within the Hortonworks Community Connection (HCC) forums, where I ran into this question: https://community.hortonworks.com/questions/50243/pig-inner-join-with-different-keys.html. The poster was having trouble performing an inner join across four datasets with Pig. This blog post shows that this is relatively easy to do.

First, let’s create some datasets and load them into HDFS. For these four files, I’ll be joining on the first two columns of each, so the output will only have three rows.

[root@sandbox hcc]# ls
T1.csv T2.csv T3.csv T4.csv
[root@sandbox hcc]# cat T1.csv
1,1,t1_A,t1_B
1,2,t1_A,t1_B
1,3,t1_A,t1_B
[root@sandbox hcc]# cat T2.csv
1,1,t2_C,t2_D
1,2,t2_C,t2_D
1,3,t2_C,t2_D
[root@sandbox hcc]# cat T3.csv
1,3,t3_E,t3_F
1,2,t3_E,t3_F
1,1,t3_E,t3_F
[root@sandbox hcc]# cat T4.csv
1,3,t4_G,t4_H
1,2,t4_G,t4_H
1,1,t4_G,t4_H
[root@sandbox hcc]#
[root@sandbox hcc]# hdfs dfs -mkdir /tmp/hcc
[root@sandbox hcc]# hdfs dfs -put T1.csv T2.csv T3.csv T4.csv /tmp/hcc
[root@sandbox hcc]# hdfs dfs -ls /tmp/hcc
Found 4 items
-rw-r--r-- 1 root hdfs 42 2016-08-08 09:47 /tmp/hcc/T1.csv
-rw-r--r-- 1 root hdfs 42 2016-08-08 09:47 /tmp/hcc/T2.csv
-rw-r--r-- 1 root hdfs 42 2016-08-08 09:47 /tmp/hcc/T3.csv
-rw-r--r-- 1 root hdfs 42 2016-08-08 09:47 /tmp/hcc/T4.csv

Since we’ll want to leverage HCatLoader (to pick up the schema definitions), let’s create & load some Hive tables the quick way described in hadoop mini smoke test (VERY mini) and shown below.

createLoadTables.hql
CREATE TABLE T1 (k1 int, k2 int, c1 string, c2 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;
LOAD DATA INPATH '/tmp/hcc/T1.csv' INTO TABLE T1;
SELECT * FROM T1;

CREATE TABLE T2 (k1 int, k2 int, c1 string, c2 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;
LOAD DATA INPATH '/tmp/hcc/T2.csv' INTO TABLE T2;
SELECT * FROM T2;

CREATE TABLE T3 (k1 int, k2 int, c1 string, c2 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;
LOAD DATA INPATH '/tmp/hcc/T3.csv' INTO TABLE T3;
SELECT * FROM T3;

CREATE TABLE T4 (k1 int, k2 int, c1 string, c2 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;
LOAD DATA INPATH '/tmp/hcc/T4.csv' INTO TABLE T4;
SELECT * FROM T4;

Let’s run those commands.

[root@sandbox hcc]# hive -f createLoadTables.hql

Loading data to table default.t1
Table default.t1 stats: [numFiles=1, totalSize=42]
OK
1 1 t1_A t1_B
1 2 t1_A t1_B
1 3 t1_A t1_B
Time taken: 0.748 seconds, Fetched: 3 row(s)

Loading data to table default.t2
Table default.t2 stats: [numFiles=1, totalSize=42]
OK
1 1 t2_C t2_D
1 2 t2_C t2_D
1 3 t2_C t2_D
Time taken: 0.583 seconds, Fetched: 3 row(s)

Loading data to table default.t3
Table default.t3 stats: [numFiles=1, totalSize=42]
OK
1 3 t3_E t3_F
1 2 t3_E t3_F
1 1 t3_E t3_F
Time taken: 0.607 seconds, Fetched: 3 row(s)

Loading data to table default.t4
Table default.t4 stats: [numFiles=1, totalSize=42]
OK
1 3 t4_G t4_H
1 2 t4_G t4_H
1 1 t4_G t4_H
Time taken: 0.159 seconds, Fetched: 3 row(s)

Now, let’s write some Pig Latin to join these things up!!

join.pig
t1 = LOAD 'T1' USING org.apache.hive.hcatalog.pig.HCatLoader();
t2 = LOAD 'T2' USING org.apache.hive.hcatalog.pig.HCatLoader();
t3 = LOAD 'T3' USING org.apache.hive.hcatalog.pig.HCatLoader();
t4 = LOAD 'T4' USING org.apache.hive.hcatalog.pig.HCatLoader();

joined = JOIN t1 BY (k1,k2), t2 BY (k1,k2),
t3 BY (k1,k2), t4 BY (k1,k2);

DUMP joined;
DESCRIBE joined;

Here’s the output of this simple script.

[root@sandbox hcc]# pig -x tez -useHCatalog join.pig

Input(s):
Successfully read 3 records (42 bytes) from: "T3"
Successfully read 3 records (42 bytes) from: "T4"
Successfully read 3 records (42 bytes) from: "T1"
Successfully read 3 records (42 bytes) from: "T2"

Output(s):
Successfully stored 3 records (279 bytes) in: "hdfs://sandbox:8020/tmp/temp-2066780368/tmp1329817571"

(1,1,t1_A,t1_B,1,1,t2_C,t2_D,1,1,t3_E,t3_F,1,1,t4_G,t4_H)
(1,2,t1_A,t1_B,1,2,t2_C,t2_D,1,2,t3_E,t3_F,1,2,t4_G,t4_H)
(1,3,t1_A,t1_B,1,3,t2_C,t2_D,1,3,t3_E,t3_F,1,3,t4_G,t4_H)

joined: {t1::k1: chararray,t1::k2: chararray,t1::c1: chararray,t1::c2: chararray,t2::k1: chararray,t2::k2: chararray,t2::c1: chararray,t2::c2: chararray,t3::k1: chararray,t3::k2: chararray,t3::c1: chararray,t3::c2: chararray,t4::k1: chararray,t4::k2: chararray,t4::c1: chararray,t4::c2: chararray}

As you’ll notice, the schema’s column names get pretty detailed and basically show the lineage of where they came from, with the format origin_alias_name::original_column_name. (Also notice the key columns come back as chararray even though the Hive tables declare them as int; the OpenCSVSerde surfaces every column as a string.) If you’d like to store these results in a Hive table, then you will most likely need to project the join results into a relation that has the same schema as the Hive table you want to load into. Here’s a Hive table we can store this into.

createResultsTable.hql
CREATE TABLE JOINED_RESULTS ( k1 int,
t1c1 string, t1c2 string, t2c1 string, t2c2 string,
t3c1 string, t3c2 string, t4c1 string, t4c2 string )
PARTITIONED BY ( k2 int )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;

You’ll notice I created a partition on this table. I wasn’t trying to be nasty; I just wanted to also address the concern from the HCC question I was looking at, where this person wanted to store the results into a partitioned table. In the Pig relation to be loaded into Hive, you’ll need to put the partition column(s) after the “regular” columns, which creates a somewhat weird-looking column order here since I used the second join key (i.e., k2) as the partition column.
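To make that column ordering concrete, here is a minimal, hypothetical sketch of mine (not the post’s actual script, which is cut off below) of the projection and the HCatStorer store against JOINED_RESULTS, assuming the joined relation from join.pig. With no static partition spec passed to HCatStorer, the partition value is picked up dynamically from the data, and the cast on k2 is there only because the partition key is declared as an int.

prepForHive = FOREACH joined GENERATE
    t1::k1 AS k1,
    t1::c1 AS t1c1, t1::c2 AS t1c2,
    t2::c1 AS t2c1, t2::c2 AS t2c2,
    t3::c1 AS t3c1, t3::c2 AS t3c2,
    t4::c1 AS t4c1, t4::c2 AS t4c2,
    (int)(t1::k2) AS k2;  -- partition column goes last; cast because k2 is an int partition key

-- no static partition spec, so HCatStorer partitions dynamically on k2
STORE prepForHive INTO 'JOINED_RESULTS'
    USING org.apache.hive.hcatalog.pig.HCatStorer();

Whether the other columns need casts depends on how the OpenCSVSerde-backed tables surface their types (everything came back as chararray above), so treat this as a starting point rather than a drop-in replacement.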

We can now add the following lines to the earlier join.pig file to project the joined results into something that can be loaded into the new Hive table.

(end of) joinAndSave.pig
prepIt = FOREACH joined GENERATE t1::k1 AS k1,
t1::c1 AS t1c1, t1::c2 AS t1c2, t2::c1 AS t2c1, t2::c2 AS t2c2,
…read more

storing dynamically created file names with pig (piggybank’s multistorage to the rescue)

Blog post added by Lester

A common use case that can be easily addressed with Pig is to break an input file into separate files based on one of the record’s attributes. An easy thing to visualize would be breaking this up on a date as I will show in my quick example, but it could be any relevant attribute such as sales region or originating country. So, let’s start with a simple input file to process.

/tmp/multistore/all.txt
2016-01-01,field1-01a,field2-01a,field3-01a
2016-01-02,field1-02a,field2-02a,field3-02a
2016-01-03,field1-03a,field2-03a,field3-03a
2016-01-01,field1-01b,field2-01b,field3-01b
2016-01-02,field1-02b,field2-02b,field3-02b
2016-01-03,field1-03b,field2-03b,field3-03b
2016-01-01,field1-01c,field2-01c,field3-01c
2016-01-02,field1-02c,field2-02c,field3-02c
2016-01-03,field1-03c,field2-03c,field3-03c

This file, which was placed in /tmp/multistore/all.txt on HDFS in my HDP 2.4 based Hortonworks Sandbox for testing, has nine records total: three records for each of the first three days of January 2016. All I want to show now is how to create three separate files that have the same attributes (Pig can surely do a bunch of transformations before this!!), using the very first attribute (again, the date) as the name of the file that will hold only those particular records. This is easy thanks to Piggybank’s MultiStorage class, as shown in the following script.

testMultiStorage.pig
allTogether = LOAD '/tmp/multistore/all.txt' USING PigStorage(',') AS
(newFileName:chararray, field1:chararray, field2:chararray, field3:chararray);

STORE allTogether INTO '/tmp/multistore/splitUp'
USING org.apache.pig.piggybank.storage.MultiStorage(
'/tmp/multistore/splitUp', '0');

As shown in the JavaDoc, that second parameter is asking for the “index of field whose values should be used to create directories and files”. You can also supply a specific delimiter for the output file, but I’ll let it use the default tab delimiter just to make it clear that the individual attributes are being recognized rather than written as one long string (a comma-delimited variant is sketched after the run output below). Let’s run it!

[root@sandbox multistore]# pig -x tez -f testMultiStorage.pig
Success!
Input(s):
Successfully read 9 records (396 bytes) from: "/tmp/multistore/all.txt"
Output(s):
Successfully stored 9 records (396 bytes) in: "/tmp/multistore/splitUp"
[root@sandbox multistore]#
[root@sandbox multistore]# hdfs dfs -ls -R /tmp/multistore/splitUp
drwxr-xr-x - root hdfs 0 2016-07-26 16:21 /tmp/multistore/splitUp/2016-01-01
-rw-r--r-- 1 root hdfs 132 2016-07-26 16:21 /tmp/multistore/splitUp/2016-01-01/2016-01-01-0,000
drwxr-xr-x - root hdfs 0 2016-07-26 16:21 /tmp/multistore/splitUp/2016-01-02
-rw-r--r-- 1 root hdfs 132 2016-07-26 16:21 /tmp/multistore/splitUp/2016-01-02/2016-01-02-0,000
drwxr-xr-x - root hdfs 0 2016-07-26 16:21 /tmp/multistore/splitUp/2016-01-03
-rw-r--r-- 1 root hdfs 132 2016-07-26 16:21 /tmp/multistore/splitUp/2016-01-03/2016-01-03-0,000
-rw-r--r-- 1 root hdfs 0 2016-07-26 16:21 /tmp/multistore/splitUp/_SUCCESS
[root@sandbox multistore]#
[root@sandbox multistore]# hdfs dfs -cat /tmp/multistore/splitUp/2016-01-01/2016-01-01-0,000
2016-01-01 field1-01a field2-01a field3-01a
2016-01-01 field1-01b field2-01b field3-01b
2016-01-01 field1-01c field2-01c field3-01c
[root@sandbox multistore]# hdfs dfs -cat /tmp/multistore/splitUp/2016-01-02/2016-01-02-0,000
2016-01-02 field1-02a field2-02a field3-02a
2016-01-02 field1-02b field2-02b field3-02b
2016-01-02 field1-02c field2-02c field3-02c
[root@sandbox multistore]# hdfs dfs -cat /tmp/multistore/splitUp/2016-01-03/2016-01-03-0,000
2016-01-03 field1-03a field2-03a field3-03a
2016-01-03 field1-03b field2-03b field3-03b
2016-01-03 field1-03c field2-03c field3-03c
[root@sandbox multistore]#

Pretty sweet!
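One side note on the delimiter option mentioned earlier: if comma-delimited output were preferred over the default tabs, my understanding of the Piggybank JavaDoc is that the four-argument MultiStorage constructor takes a compression setting ('none', 'gz' or 'bz2') followed by the field delimiter. A hypothetical variant of the script above (the splitUpCsv path is made up for illustration):

allTogether = LOAD '/tmp/multistore/all.txt' USING PigStorage(',') AS
  (newFileName:chararray, field1:chararray, field2:chararray, field3:chararray);

-- same split on field 0, but with uncompressed output and ',' as the output field delimiter
STORE allTogether INTO '/tmp/multistore/splitUpCsv'
  USING org.apache.pig.piggybank.storage.MultiStorage(
    '/tmp/multistore/splitUpCsv', '0', 'none', ',');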


…read more

Delivering Business Impacts at BT

Innovating with Hadoop

With £18 billion (about US$30 billion) in revenue in 2015, BT is one of the largest telecommunications providers in the world. For BT, the key to achieving sustainable, profitable growth in today’s competitive landscape is its ability to broaden and deepen customer relationships. To support this goal, BT is using a Cloudera enterprise data hub (EDH) to accelerate data velocity and fast-track the delivery of new offerings to its customers.

Today, BT is utilizing Cloudera and the power of Apache Hadoop to drive a number of compelling use cases, including processing and integrating BT’s customer data faster, bringing together network and performance data to minimize wasted truck rolls, rolling out cybersecurity offerings, and positioning BT to take advantage of the Internet of Things (IoT).

Check out this new video where Phillip Radley, Chief Data Architect at BT, delves into some of the key use cases and the business impacts that BT has been able to realize by utilizing Hadoop.

Video

Accelerating Data Velocity

“We have over 2,000 registered business applications, and connecting any two of those is always a major cost exercise and a big undertaking. So being able to integrate that data in a single place is definitely a big reason for us choosing Hadoop,” says Radley.

One of the first use cases BT applied Hadoop to was optimizing its ETL processing. With nearly one billion records being compared and reconciled daily, BT’s legacy ETL platform, built on a traditional relational database, couldn’t keep up any more. It was taking more than 24 hours to process 24 hours of data, so at any given point in time its business units were working with day-old data.

BT engaged Cloudera to install a production-ready Hadoop cluster that replaced the batch ETL application with MapReduce routines to increase their data velocity. “We are now taking 8 hours to process data that used to take 24. But, in addition to that, they’re processing five times the volume of data, so that’s a 15x improvement in throughput,” said Radley.

Improving Network Performance & Powering IoT

BT is now utilizing Hadoop to bring together network and performance data to provide a deeper analysis of the network and help troubleshoot network issues, thereby minimizing wasted truck rolls. Having the ability to remotely identify network performance issues means sending fewer technicians into the field, and that means big savings for the business; over time, it certainly runs into the millions.

Another key area where BT is utilizing Hadoop is to power its IoT (Internet of Things) journey, with initiatives such as providing fleet vehicle analytics and telematics as a service. “One of the competitive edge features that we can offer is the ability to instrument those vehicles and collect data from them. Where we want to get to is to be able to predict faults, so we can identify a vehicle failing early,” says Radley.

With an ROI of 200-250% within the first year of utilizing Hadoop, BT continues to innovate and unveil new and compelling use cases leveraging Cloudera Enterprise, and is now positioned to take on new projects faster and at a lower cost.

Related Links:

For more details visit: http://www.cloudera.com/customers/bt.html

Read the detailed BT case study here: http://www.cloudera.com/content/dam/www/static/documents/casestudies/bt-casestudy.pdf

The post Delivering Business Impacts at BT appeared first on Cloudera VISION.

…read more

EY shares Key Observations from Hadoop Summit 2016

A guest blog post from Scott Schlesinger, Principal, America’s EY Advisory. EY and Hortonworks formed a strategic business alliance in August 2015 that is focused on helping our valued clients turn big data challenges into big business opportunities. Recognizing that big data is transforming business and technology is driving that change, EY plays a significant role in […]

The post EY shares Key Observations from Hadoop Summit 2016 appeared first on Hortonworks.

…read more

Happy birthday, Cloudera!

Eight years ago today, on June 27, 2008, we filed the incorporation paperwork for Cloudera.

We’ve come a long way since Jeff Hammerbacher, Amr Awadallah, Christophe Bisciglia and I huddled around a borrowed conference table at the Admob offices in San Mateo. The company has grown, of course; we are world-wide, with a big and fast-growing enterprise software business, great customers, outstanding partners and a world-class team of employees. The market has kept pace; when we started, only a small group of hard-core technical people knew what Apache Hadoop was. We had to explain what “big data” was before we could explain why you might want some of it.

We’ve benefited enormously from the global open source community that has built the world standard platform for scale-out big data processing and analysis. We’re pleased and proud to be part of that community. We’ve gotten an enormous boost from our partners generally, and from our close relationship with Intel in particular. Intel’s systems roadmap will drive tremendous changes in scale-out software systems. We see the future more clearly because of their help. We’ve learned a lot from customers around the world, as they’ve embraced big data to solve business problems in virtually every domain.

If you use the founding of some of the leading database companies as a yardstick, the big data industry is still very young. Oracle, for example, was eight years old in 1985. The next thirty years were pretty interesting; I have high hopes for the decades in front of big data, as well!

Happy birthday, Cloudera. I’m a little bit biased, but I’m crazy about you.

The post Happy birthday, Cloudera! appeared first on Cloudera VISION.

…read more

Big Data Governance: Bridging the Gap between Mainframe and Apache Hadoop

As Apache Hadoop celebrates its 10th birthday this year, it has become the central component of the next generation data architecture. Many of the world’s largest organizations have several production workloads running on Hadoop for new revenue generating applications, to stay competitive and relevant in their industry and to become more agile and efficient. As enterprise adoption grew, so did the requirements for security and compliance.

Last year, Syncsort joined Cloudera to provide a unified foundation for open metadata and end-to-end visibility for governance. We helped our joint customers to secure and govern their data and meet regulatory compliance requirements with solutions leveraging Syncsort’s big data integration product, Syncsort DMX-h, tightly integrated with Cloudera Enterprise Data Hub (EDH), Cloudera Manager, Apache Sentry, and Cloudera Navigator.

Many of our joint customers are in banking, financial services, and healthcare. These industries have two things in common: they are heavily regulated and they have a high reliance on mainframes. Unfortunately, accessing and integrating mainframe data with Hadoop in a way that also meets compliance requirements is extremely challenging. So, we saw a great opportunity to help. But, before we get to that, you might be wondering how significant mainframes really are in the age of IoT and streaming data. So, let’s look at some data points:

70-80% of the world’s data either originates on or is stored on mainframes
The IBM z13 system can process up to 2.5 billion transactions per day
71% of Fortune 500 companies have mainframes

The significance is even more apparent in our daily lives. Every time you swipe your credit card, you are accessing a mainframe; every time you make a payment with your mobile phone, you are accessing a mainframe; and of course, your social security checks are generated based on data on mainframes.

New data sources are easily captured in modern enterprise data hubs, but businesses also need to reference customer or transaction history data to make sense of these newer sources. Sensor or mobile data streamed through Apache Kafka still needs to be enriched and integrated with the transaction history or customer reference data, which are often stored on the mainframes and legacy databases.

If we leave these critical data assets outside of the big data analytics platforms and exclude them from the enterprise data hub, it is a missed opportunity. Making these data assets available for predictive and advanced analytics with Apache Spark opens up new business opportunities and significantly increases business agility.

From our experience in customer engagements around the world, we know this is easier said than done. This is a complex process, fraught with governance and compliance challenges. As mentioned above, some of the most promising data analytics insights and initiatives happen to be taking place in highly regulated industries. In order to use data such as personal health records or financial transactions for advanced analytics, enterprises must be able to access it in a secure way, maintain and archive a copy in its original mainframe file format and track where the data has been.

Breaking the data silos also means challenges around data governance. Security and lineage become critical for cross platform data access. To address the data governance and lineage requirements, Cloudera introduced Cloudera Navigator, the leading Hadoop-based metadata management solution, over three years ago. Due to Syncsort DMX-h’s open source contributions and native integration in Hadoop, it seamlessly integrates with Cloudera Navigator, allowing users to search for DMX-h jobs across a unified metadata repository and view data lineage within the Cloudera Navigator user interface.

By using Syncsort DMX-h, one of the first data integration products that was certified on Cloudera Navigator and Apache Sentry, our joint customers can easily get end-to-end data lineage across platforms, accessing and processing their mainframe data in Hadoop or Spark, on premise or in the cloud. DMX-h securely accesses mainframe data, even in its original EBCDIC format, and makes it available to be processed in CDH, like any other data source. The Data Scientists do not need to worry about understanding mainframe data and can focus on the business insights. Syncsort DMX-h can make this data from hundreds of VSAM and sequential files, or from databases like DB2/z and IMS available in Hadoop. It can also map complex COBOL copybook metadata to the Hive metastore automatically.

Alternatively, the data can be kept in its original mainframe record format, fixed or variable, for archival purposes or simply to leverage the cluster for scalable and cost-effective computing. This data can then be written back to the mainframe without format changes, meeting audit and compliance requirements. In essence, Syncsort DMX-h makes mainframe data distributable for Hadoop and Spark processing. Syncsort DMX-h also secures the entire process with certified Apache Sentry integration, native Kerberos and LDAP support, and secure connectivity. The delivery of this flexibility and these strong capabilities was driven by the use cases of our joint customers.

We look forward to continuing to work with Cloudera to offer our customers best-of-breed data management solutions. Watch our video to see how you can easily access and integrate mainframe data into Cloudera EDH.

The post Big Data Governance: Bridging the Gap between Mainframe and Apache Hadoop appeared first on Cloudera VISION.

…read more

Turning big data into insights and actions using Cloudera

In my last post, we discussed the importance of data and insights – insights that help drive board level decisions. There is a plethora of new data sources that could be used by organizations to drive new and meaningful insights.

Many organizations start their big data journey by bringing new and untapped data sources from across the enterprise into Apache Hadoop. According to Forrester, 73% of organizations aspire to be data driven; however, only 29% of these organizations are actually using their data to take action. Successful organizations tackle the data problem with an agile and iterative approach. That approach also focuses on the outcomes and business decisions that will be made from the data, versus collecting data without a clear vision of how it will be embedded in decision making.

In a sensor and app driven economy, instrumenting everything is the new paradigm. Organizations that are customer focused (or obsessed) start using the insights from this data in a timely and meaningful fashion. They feed the actionable insights into applications or systems so that actions are taken automatically. Another level of sophistication is feeding the insights and learnings from customer interactions back in real time to adjust the algorithms that optimize the experience. The observations from those actions feed back into the algorithms for further tuning. New data points, higher frequency, and the sheer volume of data tremendously help optimize the algorithms.

Building insights-driven decision making is a team effort. Various teams across the organization, both IT and business, collaborate throughout the process: from instrumenting and collecting data, to creating meaningful analysis of the data, to embedding insights from the data in applications for automated action.

Cloudera helps organizations achieve an insights-driven culture through world-class training, subject matter expertise (technical and industry specific) and the enterprise data hub: a fast, easy, and secure platform for all your big data needs. A great customer example is Cerner Corporation, a technology company committed to the systemic improvement of healthcare, which uses Cloudera to help save lives through early detection of sepsis, with an end-to-end view of 2PB+ of data.

>>Forrester + Cloudera Webinar: Moving from data to insights: How to effectively drive business decisions & gain competitive advantage

The post Turning big data into insights and actions using Cloudera appeared first on Cloudera VISION.

…read more

Apache Spark Future

Spark 2.0 Major Features

Everyone around the internet is constantly talking about the bright future of Apache Spark. How cool it is, how innovative it is, how fast it is moving, how big its community is, how big the investments into it are, etc. But what is really hiding behind this enthusiasm of Spark advocates, and what is the real future of Apache Spark?

In this article I show you the real data and real trends, trying to be as agnostic and unbiased as possible. This article is not affiliated with any vendor.

Official Position

Let’s start with the official position of Databricks on the shiny future of Apache Spark. Here is a slide from the Databricks presentation on Apache Spark 2.0, the major new release of this tool:

You can see that two of the three new major features are related to SQL: SQL 2003 compliance and Tungsten Phase 2, which was targeted at greatly speeding up SparkSQL by delivering a large number of performance optimizations. The last improvement is streaming, but again, structured streaming, which underneath reuses parts of the code introduced for SparkSQL (same presentation):

So it is getting interesting: all three major improvements introduced in Spark 2.0 are about SQL!

Spark Survey 2015

So far so good; now let’s take a look at the Spark Survey conducted by Databricks one year ago. The most interesting parts are these two charts:

You can see that 69% of the customers are using SparkSQL, and 62% are using DataFrames, which essentially use the same processing layer as SparkSQL (the Catalyst optimizer and in-memory columnar storage). Also, the two biggest use cases for Apache Spark are Business Intelligence (68%) and Data Warehousing (52%), both of which are pure SQL areas.

Apache Spark Code

Again, what was the original idea of Apache Spark when it was introduced by Berkeley’s AMPLab? Let’s take a look at Matei Zaharia’s presentation on Apache Spark from Spark Summit 2013:

One of Apache Spark’s biggest advantages is simplicity. Its code is compact, and everything is based on the Core engine, which introduces RDDs and DAGs. But what about now? We can easily check, as Apache Spark is an open source project. Here are some statistics built from its source code:

The left group of columns represents Apache Spark v1.0.0, approximately the release Matei was speaking about at Spark Summit 2013. The right columns represent the current master branch, which is approximately the same as Spark v2.0.0 (six commits ahead). Take a look at who is leading now: the biggest traction in the community is in SparkSQL and MLlib! Streaming is growing several times more slowly, while GraphX has almost nothing new; its code base has grown by roughly 1,000 lines of code.

Apache Spark JIRA

Good, now let’s turn from the historical perspective to the future of Apache Spark. Let’s take a look at the open issues in the Apache Spark JIRA, split by component:

It should no longer be a surprise to you: the SQL component is related to 34% of the issues, while Core accounts for only 15%.

Apache Spark Contributions

I will drop some more charts before moving to the conclusions. Here is the number of commits to Apache Spark per month since the project was established:

The orange line is a moving average over the previous six months; it is used to smooth the contribution peaks and show the general contribution trend. We can see from this chart that Databricks works on Apache Spark in three-month development cycles, and the missing peak in February 2016 corresponds to the time they were working on the Apache Spark 2.0 release.

And now another graph: the number of unique contributors to Apache Spark per month:

Again, the orange line is a trend line showing the moving average over the previous six months.

Conclusions

Apache Spark was introduced by AMPLab as a general-purpose distributed data processing framework. Databricks was formed by the AMPLab people who worked on Apache Spark to make the engine a huge commercial success, and this is where things went wrong. Corporations can vote for the project’s direction with their money, while all the community can offer is limited individual contributions. Little by little, Apache Spark is moving from being a general-purpose execution engine into the corporate space, where SQL is the main and only standard for data processing. Apache Spark is starting to compete with MPP solutions (Teradata, HP Vertica, Pivotal Greenplum, IBM Netezza, etc.) and SQL-on-Hadoop solutions (Cloudera Impala, Apache HAWQ, Apache Hive, etc.). At the moment, Apache Spark is not positioned as their competitor for an obvious reason: in its current state, it would lose this battle. But it is getting closer and closer to its real competitors, and here is where things get interesting: enterprises want the functionality they are used to, and that is shaping the future of Apache Spark, pushing it into a category of solutions where it cannot compete efficiently. The Databricks team is putting tremendous effort into making it a good competitor in the SQL space, but it has a very low chance of winning this battle against 30-year veterans like Teradata and 40-year veterans like Oracle.

And here are the promised conclusions:

SparkSQL is the future of Apache Spark. Apache Spark competes in the SQL space against MPP databases and SQL-on-Hadoop solutions, and the battle is tough
Apache Spark is getting substantially bigger (650,000 LOC already) and more complex, raising the entry barrier for new contributors
Enterprise investments in Apache Spark turn out to be investments in making it capable of integrating with their products (Spark on IBM Mainframe, Spark-Netezza Connector, Spark on Azure, Spark in Power BI, etc.), not really in making Apache Spark itself better

My personal perspective on this is the following:

In 1 year, Spark will start being officially positioned as a competitor to MPP and SQL-on-Hadoop solutions
In 2 years, Spark will lose the battle against MPP and MPP-on-Hadoop solutions and take over Hive’s niche in the Hadoop ecosystem
In 2 years, it will lose market share in stream processing to specialized solutions like Apache Heron and …read more

Industry, Academia and Public Sector unite in the battle against infectious diseases

With the third baby born in the US with microcephaly related to the Zika virus, the disease is again making headlines. Researchers are working diligently to find prevention and treatment methods, and as with any research, collaboration and data sharing are critical to this process. I learned recently at a Zika Hackathon, hosted by Cloudera Cares, that public data sets related to Zika are hard to find. Hackers have resorted to developing code to scrape the CDC, WHO and other websites to collect information for research when public data is not available. Open data sets for collaboration and research are something that DJ Patil, US Chief Data Scientist, has long been driving, and we have seen the results of this on www.data.gov. But what if we not only have open data, but also systems that can store terabytes of data ready for researchers to collaborate on, with compute just waiting to accelerate analysis? This last point is a goal for the Texas Advanced Computing Center (TACC) at the University of Texas at Austin.

On June 2nd the Texas Advanced Computing Center inaugurated its Advanced Computing Building and announced a $30 million award from the National Science Foundation (NSF) to build Stampede 2, a supercomputer that will surpass the performance of the current Stampede system, doubling the peak performance, memory, storage capacity, and bandwidth. The processors in the system will include a mix of upcoming Intel® Xeon Phi™ Processors, codenamed “Knights Landing,” and future-generation Intel® Xeon® processors, connected by Intel® Omni-Path Architecture. The last phase of the system will include integration of the upcoming 3D XPoint non-volatile memory technology.

The opportunities that an environment like Stampede 2 can unlock for research and science are exciting. We have already seen the results at small scale with our Zika Hackathon, with big data compute and storage resources volunteered by TACC on its powerful Wrangler system. If a small group of volunteers can take an important step toward common goals in the battle against infectious diseases, just think what industry, academia and the public sector can do at large. TACC is looking for hard problems to solve; it wants to collect, store and publicly share big data sets that can be used for research and analytics. TACC makes building and running a Cloudera CDH cluster easy, so any researcher who would like to run a model in R, Scala or Python can do so with just a few clicks and commands. This is huge, as the challenge in research is often building an environment from the ground up.

Cloudera is a strong supporter of data to improve health and quality of life and has joined President Barack Obama’s Precision Medicine Initiative (PMI) by providing training and software, and collaborating with academic and government research using data and analytics. Cloudera has committed to train 1,000 precision medicine researchers on big data science and technology, a three-year commitment in software, services and training.

Industry, academia and public sector are joining forces to unlock the potential of data with advanced analytics, machine learning and data science in the fight against infectious diseases and other illnesses to improve health and quality of life. Become a data citizen and join this effort by learning more on how you can volunteer, individually or through your organization.

Get involved and learn more about the current state of Zika virus at the CDC, WHO and ECDC.
Learn about University of Texas’ involvement in Cloudera’s Academic Partnership (CAP) program.

Read other blogs from Cloudera:
Data Citizens Gather at Hackathon in Austin, TX to Fight Mosquito-Transmitted Disease
Zika and Big Data
Learning from Ebola — How Big Data Can Be Applied to Viral Epidemiology
Applying Big Data to Help Solve Zika
Genome Analysis Toolkit

For more information, contact: Makeda Easter, Texas Advanced Computing Center, 512-471-8217; Faith Singer-Villalobos, Texas Advanced Computing Center, 512-232-5771.

Media Contacts
Gera Jochum, NSF, (703) 292-8794, gjochum@nsf.gov
Faith Singer-Villalobos, University of Texas at Austin, (512) 232-5771, faith@tacc.utexas.edu

Program Contacts
Irene Qualters, NSF, (703) 292-2339, iqualter@nsf.gov
Robert Chadduck, NSF, (703) 292-224, rchadduc@nsf.gov

The post Industry, Academia and Public Sector unite in the battle against infectious diseases appeared first on Cloudera VISION.

…read more