Latest Posts

From Mechanical Engineer to Oil & Gas Data Scientist

I recently had the pleasure of visiting with Arvind Battula, Sr. Data Scientist at Schlumberger. We discussed his background as a chemical and mechanical engineer and his move onto the Data and Analytics team as a data scientist. The following is a transcript of my conversation with Arvind. We discussed his background, his interesting focus areas for data science in oil and gas, and technologies that he believes will help transform the industry.

Kohlleffel: Arvind, you entered the data science world recently on the Schlumberger Data and Analytics team and have a very interesting background coming from both chemical engineering and mechanical engineering disciplines. Tell me about your experience and engineering background.

Battula: Certainly, my background is diverse. I started my formal training as a chemical engineer. After my bachelors, I applied for graduate school in mechanical engineering to deal mostly with computational fluid dynamics. I wanted to pursue a Ph.D. in the same area, but my doctoral work changed direction to focus on nanophotonics, which is the interaction of nanometer-scale objects with light.

Kohlleffel: That makes for quite a compelling base of experience for your data science work. Now that you’ve moved to the Data and Analytics team, where have you focused so far?

Battula: My mechanical engineering background has been very helpful at Schlumberger since we are dealing with designing products that are used in the harshest conditions imaginable on the planet. In everything we do, we must consider very minute design details to ensure the most robust end product. Before we design and build parts and assemblies, we are very thorough in our calculations and modeling–to quantify our engineering and physics assumptions. This is where we leverage data and analytics to bring a new rigor to the process and move beyond some standard linear assumptions which can be obstacles to efficiently model complex phenomena across all variables.

For example, factors like high temperature, high pressure, stress, vibration, corrosion, aging all act in parallel on the mechanical systems. We can look deeply into that data to better understand the combinations of these variables that are causing mechanical failures and then we can bring together the data streams for both physics and engineering.

This non-linear root cause analysis shows us the real world we deal with on a daily basis. It is ideally suited to leveraging big data and analytics and it benefits multiple groups within our company including engineering, manufacturing, sustaining and maintenance.

In …read more

Strata Hadoop World New York

What I saw at Strata in New York! For whatever reason, these are the things that struck me immediately: Kafka has gained a lot of momentum, and I might wager that 6 months from now, you will be using it…
Read more

Big Data Expo comes to Utrecht, Netherlands


There’s excitement in the air as one of Benelux’s largest Big Data conferences “Big Data Expo”, comes to Utrecht in The Netherlands.

We’re sponsoring and you’ll find our experts Chris Harris and Jhon Masschelein presenting such topics as “5 Steps for Effective use of Apache Spark in Hortonworks Data Platform 2.3” and “Lessons Learned: 5 Common Hadoop Use Cases”. You can register here.

As Hortonworks continues to extended its footprint in Europe, we’re seeing some exciting use cases and an increasing momentum of enterprise adoption of Hadoop. The Hadoop Summit that we organized in Brussels early this year showcased some of the great European use cases. Here’s a short overview of one my favorites:

ING Bank: Destroying Data Silos for Creating a Predictive Bank

Hellmar Becker a Utrecht resident discusses breaking down Data Sillos and creating a centralized Datalake at ING. He also discusses the modernization of their data centers, migrating away from legacy systems within their governance and security framework.

Bart Buler, Hellmar’s co-presenter discusses the banks steps into becoming a truly predictive bank. Bart also provides some do’s, don’ts and difficulties in this journey and talks about the future for the bank including “integrating analytics as part of data flows”, “showing interactive results to individuals without access to the cluster” and many more.

You can more videos listed here

To conclude, Big Data Expo, will showcase an array of new technologies, exciting case studies and organizations making the most out of data. Come visit us at Stand 21 as my colleague Alfie Murray-Dudgeon pictured below awaits.

The post Big Data Expo comes to Utrecht, Netherlands appeared first on Hortonworks.

…read more

Big Data & Brews: Anil Chakravarthy & How Consumer Tech Will Influence Enterprise Tech

If the sky was the limit and we had unlimited storage and compute, what would the future of the data world look like? In part 4 of my interview with Informatica, acting CEO, Anil Chakravarthy, says we’re already seeing a preview of it in the consumer world. What does he mean? Watch below to find out more:


Stefan: Let me switch gears here a little bit. Where do you see the future really in the data world? If sky’s the limit, and we have unlimited storage on compute and, you know, Ray Kurzweil is right and we have chips are faster than our brains in something like five years. Where is this going?

Anil: Yes, to me actually I think we already see a preview of the future. I’m talking about enterprise data right now. I think we see a preview of that feature already in the consumer world. I mean think of the Apple App Store for example – what are there, over a million apps right now at this point? But the apps are already separated from the data. Your data that the apps operate on is kind of under your control; you may have a separate repository that you use for it, either your own or iCloud, etc, and the apps are extremely modular and the apps come and go very quickly, the data lives a lot longer.

If you contrast that with the enterprise world, the enterprise world has been one where the data has been very closely tied to the apps. You know you have ERP apps or CRM apps or other kinds of apps, or custom apps where the data models have been very closely tied. You still have some separation, that’s why you can reuse the data, but the data and the apps have been very closely tied together. To me, that world is going to go the same way as the consumer world already has gone. So if you ask me what’s the future, it’s like, the data models, the understanding of what different data types are, whether it’s schema-on-read or pre-defined schema and things like that, the data will be designed for durability and will be designed essentially to be used by a variety of apps, maybe cloud-based apps, maybe on-premise apps, etc, etc. The apps will become a lot more modular and the apps will come and go, and maybe apps may be …read more

YARN – What’s the big deal


Since the partnership between Hortonworks and SAS we have created some awesome assets (i.e., SAS Data Loader sandbox tutorial, educational webinars and array of blogs) that have enabled Hadoop and Big Data enthusiasts’ hands-on training with Apache Hadoop and SAS’ powerful analytics solutions. You can find more details around our partnership and resources here:

To continue the momentum, we have Paul Kent, Vice President of Big Data at SAS, share his insights on the value of YARN and the benefits it brings to SAS and its users- this time around SAS Grid and YARN.

On my travels and in the SAS Executive Briefing Center, it has become more obvious that many folks have grabbed on to the idea that Hadoop will allow them two things:

to assemble a copy of all their data in one place
to provide enough processing horse power to actually make some sense (business value) of the patterns contained in a holistic view of said data

As they get closer to this goal they realize what a valuable resource the data lake has become. They need an effective means to “share nicely” – its not likely that every department is going to have the resources to establish their own data lake, and even if they do, you’ll be back to arguing about which version of the truth is the correct one.

YARN is the component in the Hadoop eco-system that helps folks share the value gained from building a shared pool of the organizations data.

Move the work to the Data

As the data volumes and velocities grow it has become important to find a strategy that minimized the number of hard (permanent) copies of data (and inherent reconciliation and governance). YARN allows Hadoop to become “the Operating System for your data” – a tool that manages and mediates access to the shared pool of data, as well as the resources to manipulate the pool.

Yarn allows the various patterns of work destined for your cluster to form orderly and rational queues, so that you can set the policy for what is urgent, what is important, what is routine, and what should be allowed to soak up resources so long as no one else requires them at the moment.

Expand then Consolidate

Disruptive technologies like Hadoop are often deployed “at the fringes” of an organization (perhaps in an Innovation Lab). Initial ROI is often …read more

Big Data & Brews: Informatica on Database Schemas

My chat with Informatica’s Anil Chakravarthy touched on the subject of database schemas, ETL and dynamic mapping. With the growing number of data sources and complexity, Anil argues that a purely static schema has only limited use and that flexibility is critical. He also points out that technology doesn’t have to provide the perfect answer, but it should save time, which to me, is the most valuable asset.

Enjoy the next episode of Big Data & Brews!


Stefan: What’s your perspective? As you said, there’s a growing number of data sources and more insights shape up as you’re enriching more the data. Is it really hard to define the static schema that we used to do?

Anil: Yeah, absolutely. Let me actually, just because it’s good conversation, I’ll start with the other extreme because the schema discussion usually goes from either …

Stefan: Black to white.

Anil: Yeah. Either everything is fixed or nothing is fixed. As you mentioned earlier, when I was at Symantec, one of the businesses, product groups, that I ran was data loss prevention, the DLP business. There, there is no schema. It’s basically, how do you un-structure data, especially over email? Somebody might be sending social security numbers, etc. What do you do in those cases? DLP became a very successful category by just having essentially regular expressions. That you look for certain data.

Has that been enough? Clearly not because you look at what’s going on in the world of breeches etc. It’s necessary, but not sufficient. That’s what has shaped our world view. You don’t want to insist on schema everywhere. There will be many, many types of data where you can do perfectly good processing without schema. That is not sufficient by itself. Even in the world of security that we’ve been talking about now, like we just talked about, you need to understand metadata. You need to understand what is valuable data. You cannot combine it with other schemaless… You might, for example, have SharePoint documents where you’ll never get any schema, but they still contain valuable information in order to protect data and process it. You …read more

Is Data Really the New Oil?

In a world that creates 2.5 quintillion bytes of data every year, how can organizations take advantage of unprecedented amounts of data? Is data becoming the largest untapped asset? What architectures do companies need to put in place to deliver new business insights while reducing storage and maintenance costs?

Cisco and Hortonworks have been partnering since 2013 to offer operational flexibility, innovation and simplicity when it comes to deploying and managing Hadoop clusters. UCS Director Express for Big Data provides a single touch solution that automates deployment of Apache Hadoop and gives a single management pane across both physical infrastructure and Hadoop software. It is integrated with major open-source distributions to ensure consistent and repeatable Hadoop UCS cluster configuration, relieving customers from manual operations.

In this video at Hadoop Summit, Bharath Aleti, Product Manager, Cisco UCS Big Data Solutions, talks about the partnership.

Today, a number of enterprises have deployed Hortonworks Data Platform (HDP) and Cisco UCS, not only to deploy big data solutions faster and lower total cost of ownership, but also to extract new and powerful business insights. For instance, a leading North American insurance company allowed their analysts to run 10 to 15 times the number of models they could run before, leveraging 10 billion miles of data on customers’ driving habits. It used to take 14 days for them to process queries, which meant that, by the time they obtained insights, the information was old and potentially inaccurate. Their existing system also placed limitations on the amount of data it would support, their ability to blend their data with external data, and was expensive to operate. The combination of HDP on Cisco UCS now gives them the flexibility to merge all types of data sources, with no limitations on the amount of data they can analyze, ultimately improving customer relationships and driving revenue.

These transformations do not only happen in insurance. In every industry, from healthcare to retail to telecommunications, big data allows companies to create business intelligence they could never have dreamed of and dramatically change the way they do business. Are you ready to leverage the new oil too?

Next Steps

– Meet us in person at Strata in NYC:

Cisco: booth #425
Hortonworks: booth #409

– Learn more about our joint reference architecture.

– Check out our tutorial.

– Visit the Hortonworks – Cisco partner page.

The post Is Data Really the New Oil? appeared first on <a class="colorbox" …read more

Big Data & Brews: Part II on Data Security with Informatica

Informatica’s Anil Chakravarthy and I continue our conversation around data security, this time discussing how risk management is a perfect example of a data-driven exercise. He elaborates that in the past it was either driven by human expertise or by process and increasingly, it’s becoming process-driven.

We also talk about the role Informatics plays and how cloud and data aggregation is their sweet spot.

Don’t miss it! Tune in below for part two of our Big Data & Brews with Informatica.


Stefan: But let’s talk a little bit about that “using data to secure” topic. Where do you see the opportunity in the market?

Anil: You mentioned Splunk earlier. You see a lot of companies now which have really changed the ways essentially security happens. Or, I can even broaden the topic further to your earlier conversation about risk management. When you think of managing your risk, that is essentially a data-driven exercise right now. In the past, it was either human-expertise-driven or process-driven. I think increasingly we’ve seen that it is becoming data-driven. A great example is, think of just what is happening even at the network security level. In the past, it used to be that you had specific devices like routers and firewalls, etc. from which you collected logs and you prerecorded what you were looking for and you basically said, “This is what a security attack looks like.” And then, you look for patterns that match that prerecorded knowledge that you had.

Now that world is changing very quickly even at the network level. You basically now collect logs not only from all the network devices, applications, active directory interface, user access. You pretty much collect all of that information and then you use big data techniques to find the pattern rather than say, “Hey, I already know the pattern of attack and I’m just going to go look for that pattern.” I say, “I don’t know the pattern of attack.” The assumption right now is, I have all this work and attackers only needs one way to get in. Therefore, I don’t know what way they’re using to get in. So, let me get the data and see what the data tells me in terms of what made me abnormal and then use that to find if it’s really a security vulnerability, right? That, to me, is how data is being used to change the world and that’s …read more

Big Data & Brews: Informatica Talks Security

I’m extremely excited to return from our hiatus with a new interview with Informatica’s Acting CEO, Anil Chakravarthy. He has over 15 years of experience in security and given the importance of big data governance, I thought he was the perfect candidate to share what he sees coming down the pipeline.

Tune in below to see the first installment.



Anil Chakravarthy, Acting CEO, Informatica

Stefan: Welcome to Big Data and Brews. It’s been a long time. I’m very excited to start off a new season of Big Data and Brews with Anil Chakravarthy from Informatica. Thanks for joining.

Anil: My pleasure.

Stefan: Usually we ask to please introduce yourself and the brew you brought, but it’s so early in the morning, we decided we’d go for coffee and refreshing water. Tell me a little bit about your background. You have a very interesting background, very security-focused. How did that shape how you got to Informatica and what you’re doing there?

Anil: Yes, as you said, I’ve had a deep background in security for the last 15 years. I was at Symantec, where I ran the enterprise security business. I was at Symantec for nearly 10 years. Before that at VeriSign, where I was responsible for product management of the VeriSign security services. Coming to Informatica, to me, was really a great way to bring that security expertise to the data layer.

As you know, a lot of the security world is still very much at the network layer. It’s creeping up into the application layer, but if you really look at where security can be most affective, it’s really at the data layer. There you know what you are trying to protect, what is sensitive, what is valuable. We at Informatica are taking a new approach, based on my background, but based also on what we see from the industry. We are taking a new data-centric approach to security.

Stefan: I think there are two topics I want to talk to you about today. One is really securing data and one is using data to secure, if that makes sense?

Anil: Yeah, yeah, it does.

Stefan: Why don’t we start with the first one? What’s your perspective about what’s going on in … Maybe we expand it from security to overall data governance. What is really the requirement of the market? Where are the products today? Where do they have to come, where are the shortcomings?

Anil: Yeah, let’s start with …read more

Reflections from the New Guy


It’s been 20 years since I was “the new Guy.”

Hello friends and colleagues. I wanted share some thoughts after my first 90 days at Hortonworks. It’s been a thrill ride to say the least, there is all of the normal new guy / first impression stuff – and for those of you who know me, you know I am very sensitive to all that!

Working with our founders and engineering team has been a blast. Seeing the passion in their eyes, feeling the energy and enthusiasm in their voice, has been inspirational. Their unbridled dedication to our new compute and open source paradigm is evident and infectious.

It is clear that we are at the center of multiple inflection points.

First, the open source paradigm will continue to reshape how software is developed. Leveraging a community of brilliant people means constant innovation. It also means that these talented people actually compete to find the best solutions and approach to data management problems—and the real winners are users of Apache Hadoop and HDP.

Second, it is really about the platform. Providing the first real solution for quickly landing very large and very diverse data, Hadoop along with the broader ecosystem provides the ability to capture data that used to go to waste. This collective pool of information is the raw material for refined and advanced analytics that will drive improved business models.

Meeting our customers and prospects has also been quite revealing. These are companies who are redefining their industries by being data centric and data driven. Along the way, I’ve heard some common themes.

“Get Going.”

At the Hadoop Summit in June, we had a customer panel comprised of some real thought leaders. They all mentioned in their comments that the best thing they did was getting started. The sooner you start collecting and making broad and diverse data available to data scientists and business analysts, the sooner the value shows up. And, while it may seem ‘salesy,’ it’s actually the point. Today’s modern data architecture turns the normal IT projects upside down. Schema on read is the opposite of traditional models, and very relevant today. Big data and sources of big data evolve and change so rapidly that it’s only possible to glean value by landing and analyzing.

“We’ve only just begun.”

The new and innovative use cases are being invented now, taking advantage of the new ‘land it first’ mentality. From …read more