Monthly Archive: February 2016

Spark Summit: Accelerating Enterprise Spark


I had the pleasure of speaking at Spark Summit in New York today about accelerating the adoption of Spark by mainstream enterprises. I had to admit at the beginning of my talk that I’m an “open source addict”: over the past 12 years I’ve been blessed to have called JBoss, Red Hat, SpringSource, and Hortonworks home. My focus has been the same at each stop: how can we innovate in open source technology and deliver enterprise-scale, easy-to-use products and solutions that can be consumed by mainstream enterprises?

While I’m excited to talk about the technology itself, it’s always important to root the conversation in why enterprises should care. In the case of Apache Spark, the simple answer is: because Spark helps unlock the enormous potential of data for the enterprise.

I have had the pleasure of working with the team at Webtrends, and they are a great example of exactly what I mean. They adopted Hadoop and Spark a while ago, and they have since consolidated their separate Spark and Hadoop clusters into a single Hortonworks Data Platform (HDP) cluster, where Spark runs on YARN as one of many workloads. The company is approaching 1.5 petabytes stored in its HDP data lake, and Spark now processes 13 billion events per day. What I find most compelling is that this modern data architecture enabled them to introduce a new product offering called Webtrends Explore, which lets their customers dive deep into their data and answer important business questions immediately. You can learn more about Webtrends’ use cases and journey by watching the video here.

One of the other examples I presented is how a railroad company is using HDP and Spark to deliver a real-time view of the state of the train tracks. Video images and geolocation are key data elements in the solution, which is focused on preventing accidents before they occur. If this example doesn’t underscore the fact that the age of data has truly arrived for any type of business, then I’m not sure what will.

So with that as context, what are the macro trends we’re seeing?

First, Spark is becoming the de facto data API for many big data processing workloads: to date for analytics and reporting, and more recently for workloads like ETL and streaming. It has become one of the key tools in the toolbox and an important element in a modern data architecture.

Second, Spark is getting broad adoption in the enterprise. A series of use cases is developing rapidly, for example using Spark as a query federation engine, or together with HDP ecosystem projects such as Hive and HBase. Any new apps will likely be built on Spark. But filling in the missing enterprise capabilities is still key. That’s where we can bring our expertise to bear.

Third, agile analytic development and data science remain the frontier. We need to democratize Spark, not only for those who know Scala, Java, Python, and R, but for the broadest community of “developers” possible. We need better tooling for professional developers as well as business “developers.” We need to encourage universities to pay attention to this movement, and we need to reach out to undergrads and encourage them to learn Spark and/or the tools that ride atop it.

In light of this, Hortonworks’ strategy in relation to Apache Spark is threefold:

#1: Make agile analytic development and data science easier and more productive. Highlights include:

Apache Zeppelin: a web-based notebook for agile analytic development. This open source tool provides a visual interactive experience for uncovering insights and sharing those insights with others.
Magellan: an open source library for Geospatial Analytics that uses Spark as the underlying execution engine. Geospatial data is pervasive in mobile devices, sensors, logs, and wearables. If you are working with geospatial data and big data sets that need spatial context, there are limited open source tools that make it easy for you to parse and query at scale, which makes this hard for business intelligence and predictive analytics apps. Magellan facilitates geospatial queries and builds upon Spark to address the hard problems of dealing with geospatial data at scale.

#2: Accelerate capabilities that harden Spark for enterprise use, in areas ranging from encryption and security to data governance, HA, DR, operations, and debugging. We’re also improving data integration with things like RDD caching in HDFS, and providing a unified Hive and Spark connector for HBase that eliminates complexity and improves overall performance.

#3: Continue to innovate at the core. We want to make this the best experience and performance possible with HDP. No secret sauce. All open and all going back into the community. This includes enhanced support for YARN with dynamic executor allocation support in HDP so Spark runs better within multitenant YARN clusters. We’ve also been quietly working with the talented folks at HP Labs on providing an optimized Spark experience at the core. I can’t go into details now, but I encourage you to tune in on March 1st!
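
As a rough illustration of what dynamic executor allocation looks like from the application side, here is a minimal sketch that assembles a spark-submit command line for a YARN cluster. The property names are standard Spark configuration keys; the executor bounds and the application file name are assumptions made up for this example, not HDP defaults or recommendations.

```python
def dynamic_allocation_conf(min_executors=1, max_executors=20):
    """Return Spark properties that turn on dynamic executor allocation.

    The default bounds are illustrative assumptions, not recommended values.
    """
    if min_executors > max_executors:
        raise ValueError("min_executors must not exceed max_executors")
    return {
        # Let Spark grow and shrink the executor count with the workload.
        "spark.dynamicAllocation.enabled": "true",
        # The external shuffle service keeps shuffle files available
        # after an idle executor has been released.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


def spark_submit_args(app_file, conf):
    """Build a spark-submit argument list targeting YARN."""
    args = ["spark-submit", "--master", "yarn"]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app_file)
    return args


if __name__ == "__main__":
    # "events_job.py" is a hypothetical application name.
    conf = dynamic_allocation_conf(min_executors=2, max_executors=50)
    print(" ".join(spark_submit_args("events_job.py", conf)))
```

With settings along these lines, a job gives idle executors back to YARN, which is the behavior that matters when many tenants share one cluster.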

The Spark community is innovating at a fast pace, and we plan on staying in lock step with it. For example, within a few hours of the community release of Spark 1.6, we made a technical preview available for deployment on our current version of HDP, and we’re marching quickly to GA.

We live in an age where every business is a data business. Tomorrow’s leaders are already mastering the value of data and embracing an open approach. If you’re just getting started, don’t be shy. Join the community and be part of this journey.


Shaun Connolly


The post Spark Summit: Accelerating Enterprise Spark appeared first on Hortonworks.


Cybersecurity: Why context matters and how do we find it?


Welcome back to my blogging adventure. If you’ve been reading along, you’re aware of the lightbulb moments from my article “echo: hello world” that allowed me to discover the benefits of an analytic approach to cybersecurity. Next, I shared a little slice of the life of our intrepid SOC analyst in “Cybersecurity: the end of rules are nigh”, where I gave some detail behind my belief that we need to move away from a rules-based detection approach to cybersecurity monitoring. Today, we will spend some more time with our SOC analyst living the life of event triage. My hope is that we come away with a greater understanding of why context matters as I show a high-level process for efficient incident response triage.

The context conundrum

To understand why context is so critically important we need to forget about technology for a moment and focus on people and process. A hard lesson I’ve learned in my career is that when we focus on the technology we end up creating solutions that make the person work for the machine instead of the machine enabling the person. So let’s take a moment and get in the shoes of our intrepid SOC folks and walk through a day in their lives.

Triage Analyst
Typically, the first line is the SOC analyst, who focuses on responding to alerts and determining whether each one is a false positive or something that requires escalation. This is usually a junior, shift-level person in the security equivalent of the help desk call specialist role. They have job guides, run books, or knowledge trees that they follow as they gain experience. The process they follow is probably documented as being as easy as 1-2-3:
1. Look in SIEM and select top alert
2. Review
3. Decide to escalate or filter

Pretty simple process, right? If only the real world worked that way. What probably happens in practice is steps 1-14, give or take another 1 or 14 more:
1. Look at the SOC dashboards to get an overall feel of what’s going on
2. Look at SIEM alert containing two IP addresses and an obscure alert name
3. Go into several other consoles looking up what system owns each IP address
4. Web search if IP address is external to see who owns it and if it has a bad reputation
5. Take the looked up system names and enter yet another console to look up the asset inventory information such as what should be running on the machine and who owns it
6. Send emails to the owners for details since the asset inventory information is probably out of date or incomplete
7. Look in yet another console for details regarding the alert such as what it does and what vulnerabilities it targets
8. Look in yet another console to see if the asset has been patched or fixed as of the last scan, since the scan is probably a few weeks out of date
9. Send email for a one-off vulnerability scan to verify
10. Look in yet another console to see if the systems have been backed up and when the backup was last actually tested
11. Start feeling nervous since most DR tests are yearly
12. Look in yet another console to see if the assets are among the systems logging to the consolidated repository or if more emails need to be sent to get the logs
13. Realize that not every application log on the system is actually logging to the repository
14. Send emails to the applications teams to get those logs…


Two hours later after all the data can be reviewed – false alarm
Time to send emails to escalate all the broken things, like log forwarding that was broken without anyone noticing, backup jobs showing as incomplete or failed, patches marked as installed showing up as no longer installed in the one-off vulnerability scan, etc.
Time for lunch then on to the next alert review.


End of shift turnover to next group of analysts regarding emails still waiting for response and the other few hundred thousand events you didn’t have time to get to.
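
Much of the walk-through above is lookup work that could be attached to the alert before an analyst ever opens it. The following is a minimal sketch of that idea, not a real SIEM integration: the inventory and reputation tables, the `10.0.0.0/8` internal range, the 14-day scan threshold, and every name in the code are hypothetical stand-ins for the separate consoles described above.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

# Hypothetical stand-ins for the consoles an analyst would consult by hand.
INTERNAL_NET = ip_network("10.0.0.0/8")   # assumed internal address range
ASSET_INVENTORY = {
    "10.0.0.5": {"hostname": "web-01", "owner": "web-team",
                 "days_since_vuln_scan": 21},
}
IP_REPUTATION = {"203.0.113.9": "known-bad"}
MAX_SCAN_AGE_DAYS = 14                    # assumed freshness threshold


@dataclass
class Alert:
    name: str
    src_ip: str
    dst_ip: str


def enrich_ip(ip):
    """Collect what is known about one IP address from the lookup tables."""
    internal = ip_address(ip) in INTERNAL_NET
    info = {"ip": ip, "internal": internal}
    if internal:
        info["asset"] = ASSET_INVENTORY.get(ip)  # None if not inventoried
    else:
        info["reputation"] = IP_REPUTATION.get(ip, "unknown")
    return info


def enrich_alert(alert):
    """Attach context to an alert and list the gaps that need follow-up."""
    context = {"src": enrich_ip(alert.src_ip), "dst": enrich_ip(alert.dst_ip)}
    gaps = []
    for side, info in context.items():
        if not info["internal"]:
            continue
        asset = info["asset"]
        if asset is None:
            gaps.append(f"{side}: no asset inventory record for {info['ip']}")
        elif asset["days_since_vuln_scan"] > MAX_SCAN_AGE_DAYS:
            gaps.append(f"{side}: vulnerability scan for {info['ip']} is "
                        f"{asset['days_since_vuln_scan']} days old")
    return {"alert": alert.name, "context": context, "gaps": gaps}


if __name__ == "__main__":
    result = enrich_alert(Alert("obscure-alert", "203.0.113.9", "10.0.0.5"))
    print(result["context"])
    print(result["gaps"])
```

The `gaps` list is the part that corresponds to the emails in the manual process: each entry is a piece of context the machine could not supply and a person still has to chase.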

Security Engineer

The short-term goal of any analyst is to get away from the triage cycle and move up to the role of security engineer. Typically, the security engineer’s primary responsibility is the care and feeding of one or more point security solutions. They do capacity planning, system maintenance, and upgrades, and they are available to assist if their technology is part of an incident response escalation. The promise is a standard work day with the occasional off-hours call if incident response is required. The reality is that change management requires all maintenance to be done outside of business hours, and incident response means helping the triage specialists at all hours several days every week. Since the point security solution is probably a rules or signature engine, the security engineer spends many hours dialing down the rules that generate most of the false positives.

Forensic Investigator
Do you get excited watching paint dry? Great! You have the mental fortitude to be a forensic investigator. Your job is to get involved days after the fact to collect evidence and figure out what actually happened. Chain of custody is a big deal, whether because the business needs to engage law enforcement, because you need to help prepare the lawyer’s response, or because regulations require a forensic response. Making your job difficult is all the activity of the IT and SOC folks who accessed the systems during the triage response and cleanup, which you now have to painstakingly separate from the malicious activity. Since the incident response process probably didn’t ensure chain of custody procedures were followed, you can’t rely on their work and have to recreate it from the point where you could establish custody procedures. Yes, it will take weeks. Yes, you have a several-month backlog, but hey, job security.

IT and business folks
Yes you have a job to do and these emails and tickets from the SOC folks are all marked most urgent. Don’t they …read more