Latest Posts

Moving Streaming Analytics Out of the Data Center

This blog focuses on moving streaming analytics outside the confines of the traditional data center. Moving streaming analytics closer to where data originates can be accomplished by leveraging an enterprise-grade data movement application, married with an extremely lightweight streaming engine. This combination is being used by forward-looking organizations to solve use cases in a […]

The post Moving Streaming Analytics Out of the Data Center appeared first on Hortonworks.


Cross-component Lineage for Apache Hadoop

Apache Hadoop® exists within a broader ecosystem of enterprise analytical packages. This includes ETL tools, ERP and CRM systems, enterprise data warehouses, data marts and others. Modern workloads flow from these various traditional analytical sources into Hadoop and then often back out again. What dataset came from which system, when and how did it change over […]

The post Cross-component Lineage for Apache Hadoop appeared first on Hortonworks.


Data Industry Trends

hadoop and big data trendline

Yesterday my blog got its 100th subscriber. To commemorate this, I prepared a post on the major industry trends in the field of “data”. I might have missed something, so feel free to comment and extend the article with your opinion!

Big data is falling down the hype curve

Even though Gartner removed “Big Data” from last year’s hype diagram, that does not mean it suddenly moved from the peak of the “hype” to the plateau of adoption. Here is what the hype cycle looks like:

And here is what the trends look like for Big Data and Hadoop, according to Google Trends:

The “Big Data” curve looks exactly as you would expect for a hyped technology on the rise. Here is my version of what happened, how it happened and why it happened:

Hadoop was born from Google’s ideas and Yahoo’s technology to meet the biggest internet companies’ need for distributed compute and storage frameworks. 2003-2008 were the early years of Hadoop, when almost no one knew what it was, why it existed or how to use it;
In 2008, a group of enthusiasts formed a company called Cloudera to occupy the market niche of “cloud” and “data” by building a commercial product on top of open source Hadoop. Later they abandoned the “cloud” and focused solely on “data”. In March 2009 they released their first Cloudera Hadoop Distribution. You can see this moment on the trends diagram immediately after the 2009 mark, as the rise of the Hadoop trend. This was a huge marketing push related to the first commercial distribution;
From 2009 to 2011, Cloudera was the one trying to heat up the “Hadoop” market, but it was still too small to create notable buzz around the technology. But the first adopters proved the value of the Hadoop platform, and additional players joined the race: MapR and Hortonworks. Early adopters among startups and internet companies started to play with the technology at this time;
2012 – 2014 are the years “Big Data” became a buzzword, a “must have” thing. This was caused by a massive marketing push from the companies noted above, plus the companies supporting this industry in general. In 2012 alone, major tech companies spent over $15b buying companies doing data processing and analytics. Some of them were bubbles (like Autonomy), some were not. But the demand for “big data” solutions was growing, and analyst publications were heating up the market very hard. Early adopters among enterprises started to play with the promising new technology at this time;
2014 – 2015 are the years “Big Data” approached the hype peak. Intel invested $760m in Cloudera, giving it a valuation of $4.1b, and Hortonworks went public with a valuation of $1b. Major new data technologies emerged, like Apache Spark, Apache Flink, Apache Kafka and others. IBM invested $300m in Apache Spark technology. This is the peak of the hype. In these years massive adoption of “Big Data” in enterprises started, and the architecture concepts of “Data Lake” / “Data Hub” / “Lambda Architecture” emerged to simplify the integration of modern solutions into conventional enterprise infrastructures:

2016 and beyond – this is an interesting time for “Big Data”. Cloudera’s valuation has dropped by 38%. Hortonworks’s valuation has dropped by almost 40%, forcing them to cut the professional services department. Pivotal has abandoned its Hadoop distribution, going to market jointly with Hortonworks. What happened and why? I think the main driver of this decline is the enterprise customers that started adopting the technology in 2014-2015. After a couple of years of playing around with “Big Data”, they have finally understood that Hadoop is only an instrument for solving specific problems; it is not a turnkey solution for overtaking your competitors by leveraging the holy power of “Big Data”. Moreover, you don’t need Hadoop if you don’t really have a problem of huge data volumes in your enterprise, so hundreds of enterprises were hugely disappointed by their useless 2 to 10TB Hadoop clusters – Hadoop technology just doesn’t shine at this scale. All of this has caused a big wave of priority re-evaluation by enterprises, shrinking their investments in “Big Data” and focusing on solving specific business problems. The “Big Data” market is cooling down:

The emergence of Data in the Cloud

This is the second major trend of the “data” industry. IBM acquires Cloudant. Databricks, the company behind Apache Spark, offers its product in the cloud only, in collaboration with AWS. The most common use case for Docker containers is running data services inside them. All the major public cloud companies offer technologies like “managed databases”, or even analytical databases in the cloud. All the major Hadoop vendors have already pushed their “cloud” offerings to the market. The DBaaS industry is getting hotter and hotter, with all the major DBMS vendors offering their solutions in the cloud.

Initially, “cloud” was meant to host applications only (aka 12-factor applications), and the databases had to be managed separately. But time passes, and now many companies are moving to the cloud, hosting their databases there and even running analytics on cloud-hosted distributed processing engines. Amazon Redshift alone is reportedly running more than 100k nodes!

Data is going open

If you have seen my visualization of the open source data community, you understand what I am talking about:

10 years ago the only open source data processing offerings were Postgres and MySQL. Over time, an open source industry has emerged, and over the last couple of years you can see more and more companies going open source!

Pivotal open sources Greenplum, HAWQ and Gemfire (aka Geode). DataTorrent open sources its technology as Apache Apex. Cloudera open sources Kudu. Citus Data open sources CitusDB. Google open sources TensorFlow. Google open sources Dataflow as Apache Beam. There are more examples; just scroll through the visualization to see how the open source data industry has moved from 10 to 100+ projects within the last 10 years.

Artificial Intelligence climbs the hype

“Artificial Intelligence” is starting to climb on the […]

Open Source Data Community Visualization

The open source data community has been growing rapidly over the last 10 years. You can feel this from the emergence of projects like Apache Hadoop, Apache Spark and the like. It is growing so fast that there is almost no chance of keeping up without constantly monitoring the related events, announcements and other changes. 10 years ago it was enough to know “just Oracle” or “just MySQL” to make a successful career in data. Now things have changed greatly, and if you cannot answer questions like “what is the difference between MapReduce and Spark?” and “when would you prefer to use Flink over Storm?” at your job interview, you are screwed.

Github Data Community Graph Snapshot

Also, what would be the “next big thing” in data?

And where is the community heading? You must have seen hundreds of blogs trying to describe their “vision” of data over the next few years. Most of them are sponsored by vendors that try to replace predicting community dynamics with shaping them, making you believe that “product X” is the next big thing so that it really becomes so. This is marketing.

I’m not a big fan of marketing of any kind, and I sincerely believe that good software will find its way even without aggressive marketing. But thinking about the dynamics of the open source data community, I was trying to find a trustworthy source of information, free from the marketing of big enterprises. And I have found it: GitHub. GitHub stores the source code for most of the open source data products. But it is not just sources; it is a complete history of changes for each of them, including information about the author of each change. What if I analyze this information to show what is happening in the community? And here is what I ended up with:

Nodes of the graph represent open source data projects. Each node’s area represents the number of unique contributors to that project over the 10 weeks prior to the moment shown on the timeline at the top of the video. Edges represent relations between the projects. A relation between 2 open source projects exists if at least one person has contributed to both projects within the 10-week window. People usually contribute to related projects; for example, people contributing to Apache Hadoop would likely also contribute to Apache HBase and Apache Hive. The colors of the nodes do not have any particular meaning; they are just selected from the colormap at random. All the animation is done in Python with matplotlib and lots of manual code to make it work the way I want.
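
The node and edge rules described above can be sketched in a few lines of Python. This is only an illustration and not the code behind the actual visualization: the `(project, author, commit_date)` tuple format, the `build_graph` function name and the sample commits are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date, timedelta
from itertools import combinations

# Window length taken from the description: the last 10 weeks.
WINDOW = timedelta(weeks=10)


def build_graph(commits, as_of):
    """Build node sizes and edges from commit records.

    commits: iterable of (project, author, commit_date) tuples
             (a hypothetical format, chosen for this sketch).
    as_of:   the moment of the visualization.
    """
    start = as_of - WINDOW
    contributors = defaultdict(set)  # project -> authors active in the window
    for project, author, day in commits:
        if start < day <= as_of:
            contributors[project].add(author)

    # Node size: number of unique contributors within the window.
    nodes = {project: len(authors) for project, authors in contributors.items()}

    # Edge: exists when at least one person contributed to both projects.
    edges = {}
    for a, b in combinations(sorted(contributors), 2):
        shared = contributors[a] & contributors[b]
        if shared:
            edges[(a, b)] = len(shared)
    return nodes, edges


# Made-up sample data, for illustration only.
commits = [
    ("hadoop", "alice", date(2016, 2, 1)),
    ("hbase", "alice", date(2016, 2, 10)),
    ("hive", "bob", date(2016, 1, 20)),
    ("hadoop", "bob", date(2016, 2, 15)),
    ("spark", "carol", date(2015, 6, 1)),  # older than 10 weeks, ignored
]
nodes, edges = build_graph(commits, as_of=date(2016, 3, 1))
print(nodes)
print(edges)
```

Sliding `as_of` forward week by week and redrawing the result would give the frames of an animation like the one described.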

In fact, there is a huge mutual relationship between many projects. This is why I also include projects that are not directly related to “data” but still help shape the data community. These projects include Node.js, Docker, Kubernetes and the like.

There is also a funny thing I have discovered analyzing all this data. There are only 8333 unique developers that are responsible for this whole progress! And luckily I’m one of them.

Enjoy the visualization and feel free to ask questions.


Living in the Age of Data preparing for the Future


The world’s data now doubles in volume every two years. We’re living in an Age of Data fed by the Internet of Anything.

Life in the Age of Data is always-on and always-connected with easy access to incredibly rich sources of analyzed information coming from the Internet, mobile devices, servers, machines, sensors, and so on.

Every business will have the ability to use this data to convert yesterday’s impossible challenges into today’s new products, cures, and life-saving innovations. Right now, the leading pharma, automotive, electronics and packaged goods companies are already building their factories of the future around the actionable intelligence from this kind of data to do things like improve manufacturing yields. And older industries like automotive, agriculture and retail are catching up by taking modern data architectures on the road, through the field, or to the cash register to do things that were never before possible.

The power of big data is fundamentally changing the delivery of healthcare. Efforts such as the White House’s Precision Medicine Initiative aim to revolutionize how the United States improves health and treats disease. Businesses also use actionable intelligence from big data to fight fraud, viruses and identity theft, and new open source projects like Apache Metron are changing how we think about cyber security technology.

With all that is happening today, I can’t wait for tomorrow and to be part of the movement to a bright future with data lighting the way.

The Power of an Open Approach

Getting to the Age of Data did not happen overnight. Before it came the Age of RDBMS, led by Oracle, and the Age of Web, led by Linux, Red Hat, the Apache Software Foundation and the Apache HTTP Server.

With the emergence of Apache Hadoop in 2006, the Age of Data was born with the Apache Software Foundation playing a key role yet again. Now 100% of businesses say they will adopt Apache Hadoop and its ecosystem of projects such as Apache Hive, HBase, Spark, Kafka, Storm, NiFi as the center of gravity for a modern data architecture.

In the Age of Data, open is simply the norm, and Hortonworks’ philosophy has always been predicated on open innovation, open community, open development, open delivery…a fully open approach.

We’re Open, We’re Public, and We’re Proud

With our IPO in 2014, we proudly became a public bellwether for the Age of Data.

Last month, we reported our financial results for 2015 that included $121.9 million in revenue and $165.9 million in gross billings. We also set guidance for 2016 of continued high growth in revenue and billings. When it comes to spending, our CEO Rob Bearden added: “.. make no mistake, we’re manically focused on achieving adjusted EBITDA breakeven and anticipate doing so by the end of 2016.”

Since our business is built on open source, we’re frequently asked the question “but can an open source business model really scale?”. The proof is in the results. Since a picture is worth a thousand words, I’ve charted the earliest years of inflation-adjusted financial data for Oracle, Red Hat, to provide some perspective.

Noteworthy chart details: Oracle was founded in 1977, and in 1986 they went public and achieved $55.4 million in revenue (which equates to $111.4 million in 2011 dollars). Red Hat was founded in 1993, went public in 1999, and in 2001 achieved more than $100 million in inflation-adjusted revenue. Finally, was founded in 1999 and in 2004 they went public and achieved about $100 million in revenue.

We’re Powering the Future of Data by Focusing on Customer Success
In the Age of Data every business is a data business. Tomorrow’s leaders are already mastering the value of data to their organizations and embracing an open approach. We’re focused on powering the future of data with them by delivering a new class of data management software solutions built on open source technology.

Whether from data at rest with Hortonworks Data Platform or data in motion with Hortonworks DataFlow, our connected data platforms help our customers tap into all data. We give the world’s leading companies and government agencies actionable intelligence to create modern data applications that were never before possible.

Our open approach ensures we partner with our customers on their data journey. That journey can start by renovating IT architectures to reduce costs and boost functionality. Or it can start by innovating modern data applications that differentiate the business or open new revenue streams.

We are thankful to our customers and partners for embracing the open approach and we vow to stay focused on their success while empowering the broader community in the process.
Join us and be part of the movement to a bright future with an open approach leading the way.

The post Living in the Age of Data preparing for the Future appeared first on Hortonworks.


Cybersecurity: Conceptual architecture for analytic response


Welcome back to my blogging adventure. If you’ve been reading my Cybersecurity series (“echo: hello world”, “Cybersecurity: the end of rules are nigh”, and “Cybersecurity: why context matters and how do we find it”), you know just how much time I’ve spent explaining why an integrated cybersecurity analytic solution should focus on delivering value and making the lives of the folks doing incident response easier. As I look across the landscape of security analytic offerings, I see walled gardens consisting of proprietary models and pretty dashboards. Yes, walled gardens are pretty, well-maintained places to visit; however, we can’t live there because they don’t meet our needs. Our offices and living rooms are cluttered and organized around how we live, not around some pretty picture in an interior design magazine. I believe that a real cybersecurity solution should aim to reflect our work spaces: functional and configurable to how we want to work, not some engineer’s idea of what’s best for us.

Conceptual Architecture

Today, we will go over a high-level conceptual architecture for a practical cybersecurity analytic framework that works for us by adapting to how we do business. Before we dive in, let’s take a 100,000-foot look at the conceptual architecture.

Data Flow

The critical path in the architecture is the red arrow in the middle. We need to take raw sensor data and reliably generate an automated response. As with our messy living room or office, it is the output of the work and not the pretty picture that provides value. If the analytic models and response rules can make the call to respond, then no pretty dashboard is required. Why build in a big red button for the SOC analyst to click if an invisible response is faster?


The sensors component is the ingestion point for all machine data in the company and acts as the interface to the data flow. The critical path starts here. Automation and remote management of these sensors allow for efficient operation and flexible response mid-incident if greater data volume or fidelity is required. I foresee a shift from niche security products towards sensors embedded in our application architecture; as monolithic applications transform into as-a-service, cloud-enabled components, our security controls must transform along with them.

Automated Response
This is where the system provides maximum value. Regardless of whether the analytical models & rules, or the workflow and manual review through the user interface triggered the response event, automation of the response activity is part of the critical path. The automated response provides the automation interface to the rest of the company’s assets for command and control. Again, I foresee a shift from specialized security products towards automated response components embedded in our application architecture. These embedded sensors and response components will give a new, truer meaning to data centric security in the internet of anything.

The Internet of Anything


Data Centric Security

Data Lake

Storing data in the data lake for historical analytic replay as new knowledge becomes available is a key advantage of this approach. In addition, this data is available for training models of normal and abnormal behavior, and for simulating new automated response capabilities. The ability to demonstrate with actual data that a new automated blocking capability, when replayed over the last three years of collected data, wouldn’t have had a negative impact on business operations is necessary for gaining approval for implementation.
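
As a rough illustration of what such a replay could look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Event` fields, the `candidate_rule` threshold and the sample history are hypothetical and do not come from any particular product.

```python
from dataclasses import dataclass


@dataclass
class Event:
    source: str
    failed_logins: int
    was_malicious: bool  # assumed label, e.g. assigned later by analysts


def candidate_rule(event):
    """Hypothetical new blocking rule: block after 5 or more failed logins."""
    return event.failed_logins >= 5


def replay(rule, history):
    """Replay a blocking rule over historical events and summarize the impact."""
    blocked = [e for e in history if rule(e)]
    # Legitimate activity the rule would have blocked (business impact).
    false_positives = [e for e in blocked if not e.was_malicious]
    # Malicious activity the rule would have let through.
    missed = [e for e in history if e.was_malicious and not rule(e)]
    return {
        "blocked": len(blocked),
        "false_positives": len(false_positives),
        "missed": len(missed),
    }


# Made-up history standing in for data read from the data lake.
history = [
    Event("10.0.0.1", 12, True),
    Event("10.0.0.2", 1, False),
    Event("10.0.0.3", 6, False),
    Event("10.0.0.4", 3, True),
]
report = replay(candidate_rule, history)
print(report)
```

A non-zero `false_positives` count is the kind of evidence that would send a rule back for tuning before anyone approves automated blocking.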


Both historical data in the data lake and live streaming data flow through the analytical models, which:

Generate baseline understanding of normal and abnormal activity
Create the full picture context of what is happening on the applications, networks, and systems
Enrich and correlate information into full context events for either automated response or manual review.

After the analytic models have transformed the raw data flowing through the system into enriched data elements that are both descriptive and predictive in nature, the rules engine applies the company’s prescriptive rules or policy on how those events need to be handled. This is critically important in allowing an organization to apply its own risk tolerance to the response process.
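
One way to picture this step is as an ordered list of policy rules evaluated against each enriched event. The Python sketch below is an assumption-laden illustration, not a description of any real rules engine: the `risk_score` field, the thresholds and the action names are all hypothetical.

```python
# Hypothetical risk tolerance, configured by the organization.
BLOCK_THRESHOLD = 0.9
REVIEW_THRESHOLD = 0.5

# Ordered policy: the first matching rule decides how the event is handled.
RULES = [
    ("automated_block", lambda e: e["risk_score"] >= BLOCK_THRESHOLD),
    ("manual_review", lambda e: e["risk_score"] >= REVIEW_THRESHOLD),
]


def decide(event, rules=RULES, default="allow"):
    """Apply prescriptive rules to an enriched event and return an action."""
    for action, condition in rules:
        if condition(event):
            return action
    return default


# Enriched events as they might leave the analytic models (made-up values).
events = [
    {"id": 1, "risk_score": 0.95},
    {"id": 2, "risk_score": 0.60},
    {"id": 3, "risk_score": 0.10},
]
decisions = [decide(e) for e in events]
print(decisions)
```

Keeping the thresholds in configuration rather than in the models is what would let each organization express its own risk tolerance without retraining anything.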


Workflow is similar to rules in that it allows the company to configure the solution to meet its needs. Workflow allows the company to configure the incident response steps and automated response in a manner that enables the business, instead of the business bending around the solution. This multi-user/multi-tenant workflow engine allows cross-organization response to be configured. In addition, because it is part of the analytic solution, key performance and risk metrics can be collected to measure the health of the process, allow for security analyst performance review and on-the-job training, and make the work visible in a manner that shows its value to the organization as a whole.


The dashboard layer provides the visual interface elements that visualize data. By refactoring these dashboard elements away from the user interface, we enable each user to create their own user interface experience and provide a consistent visualization of the data across user interface displays for efficient cognitive uptake of information.

User Interface

It is important that the user interface elements are decoupled from the rest of the solution stack. If we are going to hit the goal of a single-pane-of-glass view of the analytic response process, we need a user interface that adapts to the user’s needs and changing roles, enforces fine-grained security for multi-user/multi-tenant access, and has a pluggable design that allows both workflow steps and dashboard elements to be combined for the most efficient response. This solution is open and ever changing, so it is critical that the user interface provides the ability to plug in and organize user interface elements from other areas instead of creating them; otherwise every weekly change requires reprogramming the user interface for the new workflow or data elements. I foresee a future where a vibrant community of public and proprietary analytical components are able to plug into […]

Open Community Innovation

Hadoop just turned 10: the first code check-in was on Feb. 2, 2006 by our very own co-founder, Owen O’Malley. I am tremendously proud to have been a part of these first 10 years, and even more excited about where this open movement is going to take us. Congratulations to everyone in the Community!

We started Hortonworks with a vision that Hadoop would process half of the world’s data and we founded Hortonworks on four key principles:

1. Innovate at the core architecture of Hadoop

2. Commit to Enterprise Hadoop

3. Enable an Open Ecosystem

4. Do everything in Open Source

Not only did we relentlessly follow these four key principles, but in the process we built a great business. We finished our first year as a public company with $122M in revenue and became the fastest software company to reach $100M, doing so in just 4 years. We have over 800 customers that work with us every day, and we have over 1600 partners in this growing and thriving ecosystem.

It is interesting to reflect back on the early days, but what excites me more is where this is all going: what is the future of data? In fact, we at Hortonworks see a future where Hadoop and related technologies will manage all of the world’s data. After all, it’s about All Data, not just Big Data, but All Data – Data from every endpoint, person, device, click, swipe, server log and stream that can be collected, conducted and curated to deliver actionable intelligence for every business. Data is at the heart of every business and one of its most important assets. Data-in-motion and data-at-rest. Data that is real time, predictive, streaming, structured, unstructured, mission-critical and everything in between. Our vision and promise of Powering the Future of Data™ is underway.

Today, in San Francisco, we announced our strategy around Open and Connected Data Platforms for data-in-motion and data-at-rest. We’ve updated our core product, the Hortonworks Data Platform (HDP), and announced a new release model that will keep pace with this amazing market. We announced Spark 1.6 is now available on HDP 2.4. We also announced our data-in-motion Hortonworks DataFlow platform now has integration with streaming analytics engines, Apache Kafka and Storm. Finally, we were proud to stand on stage with Hewlett Packard Enterprise’s CTO, Martin Fink, as we discussed how we are collaborating on more community contributions around optimizing the performance of Spark. It is great to have such a strong partner working with us on open community innovation.

The Apache Hadoop community should be commended for truly tackling the Data challenge and taking the tech to the next level. I personally believe that the community is immensely important to the future. It is critical to have an open, truly open community. Innovation happens as a result of the community. And we at Hortonworks are certainly proud to be a part of it.

Thanks to Open Community Innovation, the next 10 years promise to be even more exciting. We encourage you all to join the conversation surrounding Hortonworks and the #futureofdata.

The post Open Community Innovation appeared first on Hortonworks.
