Latest Posts

Community Choice Winner Blog: Overview of Apache Flink – The 4G of Big Data Analytics Frameworks

Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
Author: Slim Baltagi, Director of Big Data engineering, Capital One

I want to thank those of you who voted for my proposal, and I look forward to meeting many of you in Dublin. I’ll be around for the conference and would gladly welcome any follow-on conversations.

About me

I am currently a Director of Big Data engineering at Capital One. Capital One is a leading consumer and commercial banking institution conducting business in the US, Canada and the UK.

I have over 18 years of IT and business experience and I spent the last 5 years of my life Hadooping and more recently Sparking and Flinking! I enjoy evangelizing Big Data technologies by speaking at Big Data events and maintaining a blog and a Knowledge Base on many Apache projects: Hadoop, Spark, Flink… With some fellow squirrels, I also run Apache Flink Meetups in New York City, Chicago, Washington DC, Dallas/Fort Worth, Boston and Paris.

My session is an introductory-level talk about Apache Flink: a multi-purpose Big Data analytics framework leading a movement in open source towards the unification of batch and stream processing, or stream processing first. With the many technical innovations Apache Flink brings, along with its unique vision and philosophy, it is considered the 4G (4th Generation) of Big Data analytics frameworks, providing the only hybrid (Real-Time Streaming + Batch) open source distributed data processing engine supporting many use cases: real-time streaming, batch, machine learning and graph processing.

After attending my talk, you will know more about:

What the Apache Flink stack is: its streaming dataflow execution engine, APIs and domain-specific libraries for batch, streaming, machine learning and graph processing.

How Apache Flink integrates with Hadoop and other open source tools for data input and output as well as deployment.

Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark.

How Apache Flink is used at Capital One and who else has adopted Apache Flink.

Where to learn more about Apache Flink.

To get a preview of my session at the 2016 Hadoop Summit in Dublin, I would like to suggest a couple of related talks that I gave in 2015:

2015 Big Data Scala By the Bay, San Francisco, US: Why Apache Flink is the 4G of Big Data Analytics Frameworks?

2015 Flink Forward, Berlin, Germany: Flink and Spark Similarities and Differences

I would also like to suggest slide decks of a few talks that I gave about Apache Flink, at

My talk is an introductory talk open to technical and non-technical people alike. I look forward to meeting you at the Hadoop Summit, which will take place in Dublin, Ireland on April 13-14, 2016.

The post Community Choice Winner Blog: Overview of Apache Flink – The 4G of Big Data Analytics Frameworks appeared first on Hortonworks.


Hadoop Capacity Planning

Welcome to 2016! As Hadoop races into prime-time computing systems, issues such as capacity planning, assessment and adoption of new tools, backup and recovery, and disaster recovery/continuity planning are becoming serious questions with…

The Object rEvolution


It’s our pleasure to host Ryan Peterson, Chief Solution Strategist at EMC, as a guest blogger to expand upon another great step in our partnership to deliver compelling customer solutions through joint engineering efforts. Follow Ryan @BigDataRyan.

Object storage isn’t a new concept, and EMC has been innovating around it since the beginning. Take our Centera and Atmos products as key examples. The first Centera was created around the idea that objects could store much higher quantities of data than a file system in a single store. The other aspect of Centera was a rich set of security and compliance features that file systems had not been able to achieve. Data shredding, for example, was a feature required by governments and law firms. We all know some politicians who need a Centera system.

Atmos, on the other hand, was designed with a completely different base requirement. The goal was to support a geo-parity environment, mostly seen in large enterprise customers and with service providers. In the Atmos design, data written to one location would be protected by other locations and yet share a common namespace. The design inspired many large internet-scale companies you likely use today, and some of them are even backed by an Atmos system.

But when you are innovating from scratch, you make design decisions that leave things out, and you learn from the 25,000 current EMC object storage customers. So we started with a new baseline of code and added in many of the components of Centera and Atmos to create something new, exciting, and dare I say revolutionary. Enter Elastic Cloud Storage (ECS), which can scale from one rack in one data center to many racks in many data centers, thus encompassing the design requirements of both Atmos and Centera. It also adds new data protection features that increase performance over the original designs, such as a local replica process, erasure coding for high performance, and geo-protection using XOR to reduce overhead. ECS changes the game!
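To make the XOR idea concrete, here is a minimal, generic sketch of XOR parity across sites. It is an illustration of the general technique only, not a description of how ECS implements geo-protection; the site names and chunk contents are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    if len(a) != len(b):
        raise ValueError("chunks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


# Hypothetical chunks held at two different sites (same length).
chunk_site_a = b"site-A data!"
chunk_site_b = b"site-B data?"

# A third site keeps only the XOR of the two chunks
# instead of a full copy of each one.
parity_site_c = xor_bytes(chunk_site_a, chunk_site_b)

# If site A is lost, its chunk can be rebuilt from site B and the parity.
recovered_a = xor_bytes(parity_site_c, chunk_site_b)
print(recovered_a == chunk_site_a)
```

In this toy layout the third site stores one chunk's worth of parity rather than two full copies, which is the sense in which XOR can reduce overhead.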

But with the advent of new technologies such as Hadoop, which is near and dear to my heart, the design needed to include the capability to analyze the data in the entire global namespace and do it efficiently.

ECS includes a mapping to Hadoop using the Hadoop Compatible File System (HCFS) guidelines, the same way you might see Lustre or Gluster connect. The metadata controllers in ECS provide the namespace context and allow Hadoop to see the data on that system the same way it would if it were looking at HDFS. In fact, it’s as simple as using a different URI string to connect, and you don’t have to remove your HDFS DAS if you don’t want to. Simply take your existing Hadoop cluster and point it to viprfs:// as shown below. Hadoop will automatically open a series of connections to access the data at the fastest possible rate.

Before we went out and told the world about this solution, we wanted to enlist the support of the Hadoop distributions and to test it thoroughly. The picture below is a setup of 10 racks of ECS running the Hortonworks and Pivotal distributions of Hadoop. It is one of several setups like it that seek to simplify the implementation process, validate that everything is functional, and provide us a place to test scenarios our customers bring to us.

Our friends at Hortonworks really did an amazing job going through all of the features of Hadoop and validating each and every line of Apache code works on ECS. Click here to see all of the certifications that have already been completed with our geo-scale object platform and Hortonworks.

So what? What does this mean to you? Let’s get serious and clear. Never before has there been an opportunity to purchase your own Analytics-Ready-Cloud-in-a-Box. So who are the customers that might care?

If you need data to be spread across geographies such as the Americas, Europe, and Asia, or even New York, Chicago, and Los Angeles, then relying on a single namespace to support that environment, while keeping the data in a state that can be quickly accessed and analyzed, should be top of mind. Thus far, we’ve seen customers in the following segments (to name a few):

Internet of Things (IoT) such as Connected Cars, Home Automation, Turbines, and Smartphone Backups
Geo-scale Archive – data that you might have sent to tape or moved offsite stays inexpensive and accessible to analytics
Service Providers, Telcos, and Web 2.0 companies that need to service the application generation

Let’s compare this with the existing technologies used by Public Cloud providers not using ECS. Data is collected in multi-tenant object systems, copied to another platform for analysis (a Cloud Data Lake, so to speak), and the results are pushed back into your primary system. Amazon’s S3 and EMR are a good example of that type of legacy cloud architecture. With ECS, we remove the need to move data by allowing analysis to happen against the data set where it sits. Now that’s Revolutionary!

If you have requirements that you believe are met with ECS, whether you want to host the equipment yourself or are looking for an ECS-enabled Public Cloud Service Provider, reach out to your EMC representative or discuss with our friends at Hortonworks. We can meet your needs with this rEvolutionary architecture.

For more information, you can watch this video of my colleagues Nikhil & Priya discussing the internals of the platform and how it works with Hadoop.

You can also download our Hadoop on ECS White Paper.

The post The Object rEvolution appeared first on Hortonworks.


Hadoop on Remote Storage


The question of running Hadoop on remote storage is raised again and again by many independent developers, enterprise users and vendors. And there are still many discussions in the community, with completely opposite opinions. I’d like to state here my personal view on this complex problem.

In this article I will call remote storage “NAS” for simplicity. I will also take it as a given that remote storage is not the same as HDFS but something completely different, ranging from standard storage arrays with LUNs mounted to the servers to various distributed storage systems. For all these systems I assume that they are remote because, unlike HDFS, they don’t allow you to run your custom code on the storage nodes. And they are mostly “storages”, so they use some kind of erasure coding to save space and make the solution more competitive.

If you have been reading my blog for a long time, you might notice that this is the second version of this article. During the last year I was constantly thinking about this problem, and my position has shifted a bit, mostly based on real-world practice and experience.

Read IO Performance. For most Hadoop clusters the limiting factor in performance is IO. The more IO bandwidth you have, the faster your cluster works. You won’t be surprised if I tell you that the IO bandwidth mostly depends on the number of disks you have and their type. For example, a single SATA HDD can deliver roughly 50MB/sec in sequential scans, a SAS HDD can give you 90MB/sec, and an SSD might achieve 300MB/sec. It is simple math to calculate the total platform bandwidth given these numbers. Comparing DAS with NAS does not make much sense in this context, because both a NAS and a cluster with DAS might have the same number of disks and thus would deliver comparable bandwidth. So again, assuming infinite network bandwidth with zero latency, the same RAID controllers and the same number and type of drives, DAS and NAS solutions would deliver the same read IO performance.
Write IO Performance. Here things get a bit more complicated, and you should understand exactly how your NAS solution works to be able to compare it with Hadoop on DAS. HDFS stores a number of exact copies of the data, 3 by default. So if you write X GB of data, it in fact occupies 3*X GB of disk space. And of course, writing 3 copies of the data is 3 times slower than writing a single copy. How do most NAS storage systems work? NAS is an old industry, and its vendors clearly understood that storing many exact copies of the data is very wasteful, so most of them use some kind of erasure coding (such as Reed-Solomon). With RS(10,4), this lets you achieve redundancy similar to storing 3 exact copies of the data with only 40% overhead. But everything comes at a cost, and the cost here is performance. To write a single block in HDFS you just have to write it 3 times. With RS(10,4), to write a single block you have to calculate erasure codes for it, either by reading the other 9 blocks and writing out 4 code blocks, or by having some kind of caching layer with replication and a background compaction process. In short, writing to it will always be slower than writing to a cluster with replication. This is like comparing RAID10 with RAID5: the same logic of replication vs erasure coding.
Read IO performance (degraded). If you lose a single machine or a single drive in a Hadoop cluster with DAS, your read performance is not affected: you read the same data from a different node that is still alive. But what happens in a NAS with RS(10,4)? Right, to restore a single block with RS(10,4) you have to read up to 13 blocks, which would make your system up to 13 times slower! Of course, in most cases you encode sequential blocks and then read sequential blocks, so you can restore the missing one more easily. But still, your performance would degrade 2x in the best scenario and up to 13x in the worst.
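The bandwidth and space arithmetic in the points above can be sketched in a few lines. This is only a back-of-the-envelope illustration: the per-disk figures are the ones quoted in the text, while the cluster size (20 nodes with 12 drives each) and the function names are hypothetical.

```python
# Per-disk sequential throughput figures quoted in the article (MB/sec).
DISK_MBPS = {"sata_hdd": 50, "sas_hdd": 90, "ssd": 300}


def aggregate_read_mbps(disk_type: str, disks_per_node: int, nodes: int) -> int:
    """Upper bound on sequential read bandwidth if every disk is scanned in parallel."""
    return DISK_MBPS[disk_type] * disks_per_node * nodes


def replication_overhead(copies: int) -> float:
    """Extra raw space per unit of user data when keeping full copies."""
    return float(copies - 1)


def erasure_overhead(data_blocks: int, code_blocks: int) -> float:
    """Extra raw space per unit of user data for RS(data_blocks, code_blocks)."""
    return code_blocks / data_blocks


if __name__ == "__main__":
    # Hypothetical cluster: 20 nodes with 12 SATA drives each.
    print(aggregate_read_mbps("sata_hdd", 12, 20))  # 12000 MB/sec
    print(replication_overhead(3))                  # 2.0, i.e. 200% extra space
    print(erasure_overhead(10, 4))                  # 0.4, i.e. 40% extra space
```

The same disk count gives the same upper bound whether the drives sit in DAS or in a NAS, which is the point of the read comparison; the overhead numbers show why erasure coding is attractive for capacity despite its write cost.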

And if you think that the degraded case is not very relevant for you, here are the statistics from Facebook’s Hadoop cluster:

Data Recovery. When you lose a node and replace it, how long does it take to recover the redundancy of your system? For HDFS with DAS you just copy the data for under-replicated blocks to the new node. For RS(10,4) you have to restore the missing blocks by reading all the other blocks in their group and performing computations on top of them. Usually this is 5x-10x slower.

Network. When you run a Hadoop cluster with DAS, the Hadoop framework itself tries to schedule executors as close to the data as possible, usually preferring local IO. In a cluster with NAS, your IO is always remote, with no exceptions. So the network becomes a big pain point: you should plan it very carefully, with no oversubscription both between the compute nodes and between compute and storage. The network rarely becomes a bottleneck if you have enough 10GbE interfaces, but the switches should be good, and you need many more of them than in a solution with DAS. Here’s the slide from Cisco’s presentation regarding this subject:

Local Storage. Having remote HDFS might look like a good option, but what about the local storage on the “compute” nodes? People usually forget that MapReduce stores intermediate data on local storage, and Spark puts all its shuffle intermediate data on local storage too. Plus, Hive and Pig are translated into MR, Tez or Spark jobs, storing their intermediate results on local storage as well. Thus even “compute” nodes should have enough local storage, and the safest option is to have the same amount of raw …

MPP vs Hadoop Talk

Today I gave a talk at the Hadoop User Group Ireland meetup in Dublin. It was an adapted and refactored version of my article on the same subject, MPP vs Hadoop. Here are the slides:

Feel free to comment and share your opinion on this subject.


Big Data & Brews: Anil Chakravarthy Diagrams the Big Data Ecosystem

Our last installment of Big Data & Brews with Anil touches on a cool topic. Of course, I like that we get to use the chalkboard but we also had a chance to break down how Informatica sees the ecosystem (hint, the data intelligence layer is the most promising). We also talked about what he sees happening in the next 10 years that will really accelerate change in the industry.

The full conversation is just a click away – tune in!


Stefan: What would be interesting to see is in this ecosystem really of data technologies, right, where you guys are sitting and then where you see Hadoops, Teradatas, Microstrategies, Datameers. I kind of see you as the fabric that brings it all together. Is there a central brain of that fabric?

Anil: Right. You know, we believe so. Let me just take a stab at how we think of the world. This is obviously a logical view and it has to be translated based on … We see the world as start with this is, think of this as data persistence. This world is obviously changing very rapidly. It was basically the databases of the world. Could be anything from mainframe database to relational database, etc. Now Hadoop and NoSQL and this world could be either on the framework or in the cloud or a combination.

Then we see the world or what we think of as data infrastructure. So this is the world, which we have traditionally played in and this world is also changing rapidly because it obviously, when this changes, this has to change here. You have things like data ingestion, which is changing very rapidly. Somebody once joked to me that whatever IBM worked on in the 1970s always will be useful at some point so it’s like that. Things, concepts like changes and capture. The concepts like real time, streaming, etc. so all of those are coming back, right?

You have ingestion. You have data integration. Obviously that’s where you put it together, the aggregation etc. I think you have a lot of work around data quality, which is increasingly, “How do you do quality, especially on unstructured data” and things like that. That becomes a lot of work to …

Hadoop Manufacturing Innovation & IoT

Grant Bodley

The advent of connected manufacturing has ushered in an era where low-cost machine sensors take thousands of measurements per second at many points across the manufacturing process. This stream of sensor data enables manufacturers to quickly detect emerging anomalies and solve issues before they impact yield and quality.

Big Data insights enable predictive analytics for those rapid, proactive process adjustments. Manufacturers can capitalize on this opportunity by following an approach that combines the power of Teradata with Hortonworks Data Platform’s storage and compute efficiencies at extreme scale. Working together, our technologies enable big data insights that can dramatically improve existing manufacturing processes.

Register for the Teradata Partners Event

On Wednesday October 21st from 12:00-12:45, I will be presenting a webinar along with Dale Glover, Teradata VP of Industry Consulting. Join us for 45 minutes to learn more about how manufacturing companies are utilizing Hadoop to:

Establish a Single View of data on products throughout their entire lifecycles
Build a 360° view of lifetime customer value
Optimize manufacturing quality and yield
Proactively maintain equipment to minimize the risk of downtime

Event Details

Presentation Title: Hadoop for Manufacturing Innovation & IoT
Session Number: 3719
Date & Time: Wednesday October 21st from 12 to 12:45PM PST
Location: 202 AB

About the Speakers

Grant Bodley: Hortonworks GM for Global Manufacturing Solutions

As General Manager of Global Manufacturing Industry Solutions at Hortonworks, Grant Bodley brings over 25 years of manufacturing experience in working with leading Automotive, Industrial, High Tech, and Aerospace Manufacturers in leveraging Big Data Insights and high impact use-cases to transform their businesses. Prior to Hortonworks, Grant was Vice President of Manufacturing Industry Solutions at SAP for more than 10 years.

Dale Glover: Vice President of Industry Consulting for Teradata

Dale Glover is a Vice President of Industry Consulting for Teradata. His Industry Consulting team is responsible for helping clients successfully implement Business Intelligence and Analytics to drive business process impact and value. He is leading the transformation of this organization to support an analytic consulting focus across a broad ecosystem of platforms and tools. His advanced Applied Analytic Team is helping organizations move from Big Data insights into the realization of value from advanced analytics in day to day operations.

The post Hadoop Manufacturing Innovation & IoT appeared first on Hortonworks.
