Author Archive: datafoam
Open source data community has been rapidly growing over the last 10 years. You can feel this by the emerge of projects like Apache Hadoop, Apache Spark and the likes. It is growing this fast that there is almost no chance of keeping up with its growth without constantly monitoring the related events, announcements and other changes. 10 years ago it was enough to know “just Oracle” or “just MySQL” to make a successful career in data. Now the things has greatly changed, and if you cannot answer questions like “what is the difference between MapReduce and Spark?” and “when would you prefer to use Flink over Storm?” at your job interview you are screwed.
Github Data Community Graph Snapshot
Also, what would be the “next big thing” in data?
And where the community is moving to? You must have seen hundreds of blogs trying to describe their “vision” of the data over the next number of years. Most of them are sponsored by different vendors that try to replace prediction of community dynamics with shaping it, making you believe that “product X” is a next big thing for it to really become so. This is the marketing.
I’m not a big fan of marketing of any kind and I kindly believe that a good software will find its way even without this aggressive marketing. But thinking about the dynamics of open source data community, I was trying to find a trustful source of information, clean from the marketing of big enterprises. And I have found it: it is a github. Github stores source code for most of the open source data products. But it is not just sources, it is a complete story of changes for each of them, including information about the author of this change. What if I analyze this information to show what is happening in community? And here is what I ended up with:
Nodes of the graph represent open source data projects. Each node’s area is representing the number of unique contributors to the specific open source project over the last 10 weeks, prior to the moment of visualization shown on the timeline on top of the video. Edges represent relations between the projects. Relation between 2 open source projects exist if there is at least a single person that has contributed to both projects within the 10 week time range. Usually people are contributing to the related projects, for example people contributing to Apache Hadoop would likely also contribute to Apache HBase and Apache Hive. Colors of the nodes does not have any particular meaning, they are just selected from the colormap at random. All the animation is done in Python with matplotlib and lots of manual code to make it work the way I want.
In fact, there is a huge mutual relationship between many projects, this is why I also include the projects that are not directly related to “data”, but are also making impact on shaping the community of data. These project include Node.js, Docker, Kubernetes and the likes.
There is also a funny thing I have discovered analyzing all this data. There are only 8333 unique developers that are responsible for this whole progress! And luckily I’m one of them.
Enjoy the visualization and feel free to ask questions.
This is the talk I made on Java Day Kiev 2015. It was a great conference after all
The world’s data now doubles in volume every two years. We’re living in an Age of Data fed by the Internet of Anything.
Life in the Age of Data is always-on and always-connected with easy access to incredibly rich sources of analyzed information coming from the Internet, mobile devices, servers, machines, sensors, and so on.
Every business will have the ability to use this data to convert yesterday’s impossible challenges into today’s new products, cures, and life saving innovations. Right now, the leading pharma, automotive, electronics and packaged goods companies are already building their factories of the future around the actionable intelligence from this kind of data to do things like improve manufacturing yields. And older industries like automotive, agriculture and retail are catching up by taking modern data architectures on the road, through the field, or to the cash register to do things that were before been possible.
The power of big data is fundamentally changing the delivery of healthcare. Efforts such as the White House’s Precision Medicine Initiative aim to revolutionize how the United States improves health and treats disease. Businesses also use actionable intelligence from big data to fight fraud, viruses and identity theft, and new open source projects like Apache Metron are changing how we think about cyber security technology.
With all that is happening today, I can’t wait for tomorrow and be part of the movement to a bright future with data shining the way.
The Power of an Open Approach
Getting to the age of data, did not just happen overnight. Before was the Age of RDBMS led by Oracle, and the Age of Web led by Linux, Red Hat, Apache Software Foundation and the Apache HTTP Server.
With the emergence of Apache Hadoop in 2006, the Age of Data was born with the Apache Software Foundation playing a key role yet again. Now 100% of businesses say they will adopt Apache Hadoop and its ecosystem of projects such as Apache Hive, HBase, Spark, Kafka, Storm, NiFi as the center of gravity for a modern data architecture.
In the Age of Data, open is simply the norm, and Hortonworks philosophy has always been predicated on open innovation, open community, open development, open delivery…a fully open approach.
We’re Open, We’re Public, and We’re Proud
With our IPO in 2014, we proudly became a public bellwether for the Age of Data.
Last month, we reported our financial results for 2015 that included $121.9 million in revenue and $165.9 million in gross billings. We also set guidance for 2016 of continued high growth in revenue and billings. When it comes to spending, our CEO Rob Bearden added: “.. make no mistake, we’re manically focused on achieving adjusted EBITDA breakeven and anticipate doing so by the end of 2016.”
Since our business is built on open source, we’re frequently asked the question “but can an open source business model really scale?”. The proof is in the results. Since a picture is worth a thousand words, I’ve charted the earliest years of inflation-adjusted financial data for Oracle, Red Hat, Salesforce.com to provide some perspective.
Noteworthy chart details: Oracle was founded in 1977, and in 1986 they went public and achieved $55.4 million in revenue (which equates to $111.4 million in 2011 dollars). Red Hat was founded in 1993, went public in 1999, and in 2001 achieved more than $100 million in inflation-adjusted revenue. Finally, Salesforce.com was founded in 1999 and in 2004 they went public and achieved about $100 million in revenue.
We’re Powering the Future of Data by Focusing on Customer Success
In the Age of Data every business is a data business. Tomorrow’s leaders are already mastering the value of data to their organizations and embracing an open approach. We’re focused on powering the future of data with them by delivering a new class of data management software solutions built on open source technology.
Whether from data at rest with Hortonworks Data Platform or data in motion with Hortonworks DataFlow, our connected data platforms help our customers tap into all data. We give the world’s leading companies and government agencies actionable intelligence to create modern data applications that were never before possible.
Our open approach ensures we partner with our customers on their data journey. That journey can start by renovating IT architectures to reduce costs and boost functionality. Or it can start by innovating modern data applications that differentiate the business or open new revenue streams.
We are thankful to our customers and partners for embracing the open approach and we vow to stay focused on their success while empowering the broader community in the process.
Join us and be part of the movement to a bright future with an open approach leading the way.
The post Living in the Age of Data preparing for the Future appeared first on Hortonworks.
Welcome back to my blogging adventure. If you’ve been reading my Cybersecurity series; “echo: hello world”, “Cybersecurity: the end of rules are nigh”, and Cybersecurity: why context matters and how do we find it you know just how much time I’ve spent explaining why an integrated cybersecurity analytic solution should focus on delivering value and making the lives of the folks doing incident response easier. As I look across the landscape of security analytic offerings, I see walled gardens consisting of proprietary models and pretty dashboards. Yes, walled gardens are pretty, well maintained places to visit; however, we can’t live there because they don’t meet our needs. Our offices and living rooms are cluttered and organized around how we live and not some pretty picture in an interior design magazine. I believe that a real cybersecurity solution should aim to reflect our work spaces; functional and configurable to how we want to work and not some engineer’s idea of what’s best for us.
Today, we will go over a high level conceptual architecture for a practical cybersecurity analytic framework that works for us by adapting to how we do business. Before we dive in let’s give the 100,000 foot overview of what the conceptual architecture looks like.
The critical path in the architecture is the red arrow in the middle. We need to take raw sensor data and reliably generate an automated response. Like our messy living room or office; it is the output of the work and not the pretty picture that provides value. If the analytic models and response rules can make that call for response then no pretty dashboard is required. Why build in a big red button for the SOC analyst to click if an invisible response is faster?
The sensors component is the data ingestion point of all machine data in the company and acts as the interface to the data flow. The critical path starts here. Automation and remote management of these sensors allows for efficient operation and flexible response mid-incident if greater data volume or fidelity is required. I foresee a shift from niche security products towards sensors embedded in our application architecture; as monolithic applications transform into as-a-service cloud enabled components, our security controls must transform along with them.
This is where the system provides maximum value. Regardless of whether the analytical models & rules, or the workflow and manual review through the user interface triggered the response event, automation of the response activity is part of the critical path. The automated response provides the automation interface to the rest of the company’s assets for command and control. Again, I foresee a shift from specialized security products towards automated response components embedded in our application architecture. These embedded sensors and response components will give a new, truer meaning to data centric security in the internet of anything.
The Internet of Anything
Data Centric Security
Data is stored in the data lake for historical analytic replay as new knowledge becomes available is a key advantage in this approach. In addition, this data is available for training models regarding normal and abnormal behavior, and allowing the simulation of new automated response capability. The ability to demonstrate with actual data that a new automated blocking capability, when replayed over the last three years of collected data, wouldn’t have caused negative impact to business operations is necessary in gaining approval for implementation.
Both historical data in the data lake and streaming live data flowing through the analytical models:
Generate baseline understanding of normal and abnormal activity
Create the full picture context of what is happening on the applications, networks, and systems
Enrich and correlate information into full context events for either automated response or manual review.
After the analytic models have transformed the raw data flowing through the system into enriched data elements that are both descriptive and predictive in nature the rules engine applies the company’s prescriptive rules or policy on how those events need to be handled. This is critically important in allowing an organization to apply their own risk tolerance to the response process.
Similar to rules in that they allow the company to configure solution to meet their needs. Workflow allows the company to configure the incident response steps and automated response in a manner that enables the business instead of the business bending around the solution. This multi-user/multi-tenant workflow engine allows for cross organization response to be configured. In addition, by being part of the analytic solution, key performance and risk metrics can be collected to: measure the health of the process, allow for security analyst performance review and on the job training, and make the work visible in a manner that shows the value to the organization as a whole.
This is the layer that provides visual interface elements that visualize data. By refactoring these dashboard elements away from the user interface, we enable each user to create their own user interface experience and provide a consistent visualization of the data across user interface displays for efficient cognitive uptake of information.
It is important that the user interface elements are decoupled from the rest of the solution stack. If we are going to hit the goal of a single pane of glass view to the analytic response process we need a user interface that: adapts to the user’s needs and changing roles, fine grained security for multi-user/multi-tenant access through the user interface, and a pluggable design that allows both workflow steps and dashboard elements to be combined for the most efficient response. This solution is open and ever changing so it is critical that the user interface provides the ability to plug in and organize user interface elements from other areas instead of creating them; otherwise every weekly change requires reprogramming the user interface for the new workflow or data elements. I foresee a future where a vibrant community of public and proprietary analytical components are able to plug into …read more
Hadoop just turned 10, the first code check-in was on Feb. 2, 2006 by our very own co-founder, Owen O’Malley. I am tremendously proud to have been a part of this first 10 years, and even more excited on where this open movement is going to take us. Congratulations to everyone in the Community!
We started Hortonworks with a vision that Hadoop would process half of the world’s data and we founded Hortonworks on four key principles:
1. Innovate at the core architecture of Hadoop
2. Commit to Enterprise Hadoop
3. Enable an Open Ecosystem
4. Do everything in Open Source
Not only did we relentlessly follow these four key principles, in the process we built a great business. We finished our first year as a public company with $122M in revenue and became the fastest software company to reach $100M in just 4 years. We have over 800 customers that work with us every day, and we have over 1600 partners in this growing and thriving ecosystem.
It is interesting to reflect back on the early days, but what excites me more is where this is all going, what is the future of data? In fact, we at Hortonworks, see a future where Hadoop and related technologies will manage all of the world’s data. After all, it’s about All Data, not just Big Data, but All Data – Data from every endpoint, person, device, click, swipe, server log and stream that can be collected, conducted and curated to deliver actionable intelligence for every business. Data is at the heart of every business and one of its most important assets. Data-in-motion and data-at-rest. Data that is real time, predictive, streaming, structured, unstructured, mission-critical and everything in between. Our vision and promise of Powering the Future of DataTM is underway.
Today, in San Francisco, we announced our strategy around Open and Connected Data Platforms for data-in-motion and data-at-rest. We’ve updated our core product, the Hortonworks Data Platform (HDP) and announced a new release model that will keep pace with this amazing market. We announced Spark 1.6 is now available on HDP 2.4. We also announced our data-in-motion Hortonworks DataFlow platform now has integration with streaming analytics engines, Apache Kafka and Storm. Finally, we were proud to stand on stage with Hewlett Packard Enterprise’s CTO, Martin Fink as we discussed how we are collaborating on more community contributions around optimizing the performance of Spark. It is great to have such a great partner working together on open community innovation.
The Apache Hadoop community should be commended for truly tackling the Data challenge and taking the tech to the next level. I personally believe that the community is immensely important to the future. That is is critical to have an open, truly open community. Innovation happens as a result of the community. And we at Hortonworks are certainly proud to be a part of it.
Thanks to Open Community Innovation, the next 10 years promise to be even more exciting, we encourage you all to join the conversation surrounding Hortonworks and the #futureofdata.
I’ve said it before and I’ll say it again, we are OPEN, we are PUBLIC and we are PROUD. Hortonworks Data Platform is 100% open source. Hortonworks Data Flow is 100% open source. Apache Metron, the incubating cybersecurity effort Hortonworks is stewarding, is 100% open source.
Our strategy remains committed to 100% open, our products are 100% open, Hortonworks breaks down silos pushes boundaries and enables the entire ecosystem to flourish and innovate.
Will we collaborate with other companies that offer proprietary software? Absolutely. Do we offer proprietary software? Not at all.
There’s been a lot of chatter about who we are and what we do in the last few days. Guys, sometimes a headline is just a headline.
Tomorrow we have some incredible announcements hope you can tune in http://hortonworks.com/march-1/
I had the pleasure to speak at Spark Summit in New York today about accelerating the adoption of Spark by mainstream enterprises. I had to admit at the beginning of my talk that I’m an “open source addict” — over the past 12 years I’ve been blessed to have called JBoss, Red Hat, SpringSource, and Hortonworks home. My focus has been the same at each stop: how can we innovate in open source technology and deliver enterprise-scale, easy to use products and solutions that can be consumed by mainstream enterprises?
While I’m excited to talk about the technology itself, it’s always important to root the conversation in why enterprises should care. In the case of Apache Spark, the simple answer is: because Spark helps unlock the enormous potential of data for the enterprise.
I have had the pleasure to work with the team at Webtrends and they are a great example of exactly what I mean. They adopted Hadoop and Spark a while ago, and they consolidated their Spark and Hadoop clusters into one YARN-based HDP cluster where they run Spark on YARN in the Hortonworks Data Platform (HDP) as one of many workloads. The company is approaching 1.5 petabytes stored in its HDP data lake. Spark now processes 13 billion events per day. What I find most compelling is that this modern data architecture enabled them to introduce a new product offering called Webtrends Explore which allows their customers to dive deep into their data and gain the flexibility of answering important business questions immediately. You can learn more about Webtrends use cases and journey by watching the video here.
One of the other examples I presented is how a railroad company is using HDP and Spark to deliver a realtime view of the state-of-the-train-tracks. Video images and geolocation are key data elements in the solution that’s focused on preventing accidents before they occur. If this example doesn’t underscore the fact that the age of data has truly arrived for any type of business, then I’m not sure what will.
So with that as context, what are the macro trends we’re seeing?
First, Spark is becoming the defacto data API for many big data processing workloads. To date for analytics and reporting and more recently for workloads like ETL and streaming. It’s become one of the key tools in the toolbox and an important element in a modern data architecture.
Second, Spark is getting broad adoption in the enterprise. A series of use cases are developing rapidly. For example using Spark as a query federation engine, or with HDP ecosystem projects such as Hive and HBase. Any new apps will likely be built on Spark. But missing enterprise capabilities is still key. That’s where we can bring our expertise to bear.
Third, agile analytic development and data science still remains the frontier. We need to democratize Spark to not only for those who know Scala, Java, Python, and R but to the broadest community of “developers” possible. We need better tooling for professional developers as well as business “developers.” We need to encourage universities to pay attention to this movement, and we need to reach out to undergrads and encourage them to learn Spark and/or tools that ride atop.
In light of this, Hortonwork’s strategy is threefold in relation to Apache Spark:
#1: Make agile analytic development and data science easier and more productive. Highlights include:
Apache Zeppelin: a web-based notebook for agile analytic development. This open source tool provides a visual interactive experience for uncovering insights and sharing those insights with others.
Magellan: an open source library for Geospatial Analytics that uses Spark as the underlying execution engine. Geospatial data is pervasive in mobile devices, sensors, logs, and wearables. If you are working with geospatial data and big data sets that need spatial context, there are limited open source tools that make it easy for you to parse and query at scale, which makes this hard for business intelligence and predictive analytics apps. Magellan facilitates geospatial queries and builds upon Spark to address the hard problems of dealing with geospatial data at scale.
#2: Accelerate capabilities that harden Spark for enterprise use. In areas ranging from encryption and security, data governance, HA, DR, operations and debugging. We’re also improving data integration with things like RDD caching in HDFS, and providing a unified Hive and Spark connector for HBase that eliminates complexity and improves overall performance.
#3: Continue to innovate at the core. We want to make this the best experience and performance possible with HDP. No secret sauce. All open and all going back into the community. This includes enhanced support for YARN with dynamic executor allocation support in HDP so Spark runs better within multitenant YARN clusters. We’ve also been quietly working with the talented folks at HP Labs on providing an optimized Spark experience at the core. I can’t go into details now, but I encourage you to tune in on March 1st!
The pace of innovation in the Spark community is moving fast, and we plan on staying in lock step with the community. For example within a few hours of the community release of Spark 1.6, we made a technical preview available for deployment on our current version of HDP, and we’re marching quickly to GA.
We live in an age where every business is a data business. Tomorrow’s leaders are already mastering the value of data and embracing an open approach. If you’re just getting started, don’t be shy. Join the community and be part of this journey.
The post Spark Summit: Accelerating Enterprise Spark appeared first on Hortonworks.
Welcome back to my blogging adventure. If you’ve been reading along, you’re aware of the lightbulb moments from my article, “echo: hello world”, that allowed me to discover the benefits of an analytic approach to cybersecurity. Next I gave a little slice in the life of our intrepid SOC analyst in, “Cybersecurity: the end of rules are nigh”, where I gave a little detail behind my belief that we need to move away from a rules detection approach to cybersecurity monitoring. Today, we will spend some more time with our SOC analyst living the life of event triage. My hope is we come away with a greater understanding of why context matters as I show a high level process for efficient incident response triage.
The context conundrum
To understand why context is so critically important we need to forget about technology for a moment and focus on people and process. A hard lesson I’ve learned in my career is that when we focus on the technology we end up creating solutions that make the person work for the machine instead of the machine enabling the person. So let’s take a moment and get in the shoes of our intrepid SOC folks and walk through a day in their lives.
Typically, the first line is the SOC analyst focused on responding to alerts and determine if it’s a false positive or something that requires escalation. Typically, this is a junior shift level person in the security equivalent of the help desk call specialist role. They have job guides, run books, or knowledge trees that they follow as they gain experience. The process they follow is probably documented as follows: Easy as 1-2-3:
1. Look in SEIM and select top alert
3. Decide to escalate or filter
Pretty simple process right? If only the real world worked that way. What actually probably happens are steps 1-14, give or take another 1 or 14 more.
Look at the SOC dashboards to get an overall feel of what’s going on
Look at SIEM alert containing two IP addresses and and obscure alert name
Go into several other consoles looking up what system owns each IP address
Web search if IP address is external to see who owns it and if it has a bad reputation
Take the looked up system names and enter yet another console to look up the asset inventory information such as what should be running on the machine and who owns it
Send emails to the owners for details since the asset inventory information is probably out of date or incomplete
Look in yet another console for details regarding the alert such as what it does and what vulnerabilities it targets
Look in yet another console to see if the asset has been patched or fixed as of the last scan, since the scan is probably a few weeks out of date
Send email for a one-off vulnerability scan to verify
Look in yet another console to see if the systems have been backed up and when the last time the backup has actually been tested
Start feeling nervous since most DR tests are yearly
Look in yet another console to see if the assets are one of the systems logging to the consolidated repository or if more emails need to be sent to get the logs
Realize that not every application log on the system is actually logging to the repository
Send emails to the applications teams to get those logs…
Two hours later after all the data can be reviewed – false alarm
Time to send emails to escalate all the broken things like log forwarding being broken and no one had noticed, backup jobs showing as incomplete or failed, patches marked as installed showing up as no longer installed in the one off vulnerability scan, etc
Time for lunch then on to the next alert review.
End of shift turnover to next group of analysts regarding emails still waiting for response and the other few hundred thousand events you didn’t have time to get to.
The short term goal of any analyst is to get away from the triage cycle and move up to the role of the security engineer. Typically, the security engineer’s primary responsibility is the care and feeding of one or more point security solutions. They do capacity planning, system maintenance and upgrades, and be available to assist if their technology is part of an incident response escalation. The promise is a standard work day with the occasional off hours call if incident response is required. The reality is change management requires all maintenance be done outside of business hours and incident response means helping the triage specialists at all hours several days every week. Since the point security solution is probably a rules or signature engine, the security engineer spends many hours dialing down the rules generating most of the false positives.
Do you get excited watching paint dry? Great! You have the mental fortitude to be a forensic investigator. Your job is to get involved days after the fact to collect evidence and figure out what actually happened. Chain of custody is a big deal either to enable the business to engage law enforcement, you need to help prepare the lawyer’s response, or regulations require a forensic response. Making your job difficult is all the activity of the IT and SOC folks accessing the systems during the triage response and cleanup that you now have to painstakingly separate from the malicious activity. Since the incident response process probably didn’t ensure chain of custody procedures were followed, you can’t rely on their work and have to recreate it from where you could establish custody procedures. Yes, it will take weeks. Yes, you have a several month backlog, but hey, job security.
IT and business folks
Yes you have a job to do and these emails and tickets from the SOC folks are all marked most urgent. Don’t they …read more
Starting Apache Spark version 1.6.0, memory management model has changed. The old memory management model is implemented by StaticMemoryManager class, and now it is called “legacy”. “Legacy” mode is disabled by default, which means that running the same code on Spark 1.5.x and 1.6.0 would result in different behavior, be careful with that. For compatibility, you can enable the “legacy” model with spark.memory.useLegacyMode parameter, which is turned off by default.
Previously I have described the “legacy” model of memory management in this article about Spark Architecture almost one year ago. Also I have written an article on Spark Shuffle implementations that briefly touches memory management topic as well.
This article describes new memory management model used in Apache Spark starting version 1.6.0, which is implemented as UnifiedMemoryManager.
Long story short, new memory management model looks like this:
Apache Spark Unified Memory Manager introduced in v1.6.0+
You can see 3 main memory regions on the diagram:
Reserved Memory. This is the memory reserved by the system, and its size is hardcoded. As of Spark 1.6.0, its value is 300MB, which means that this 300MB of RAM does not participate in Spark memory region size calculations, and its size cannot be changed in any way without Spark recompilation or setting spark.testing.reservedMemory, which is not recommended as it is a testing parameter not intended to be used in production. Be aware, this memory is only called “reserved”, in fact it is not used by Spark in any way, but it sets the limit on what you can allocate for Spark usage. Even if you want to give all the Java Heap for Spark to cache your data, you won’t be able to do so as this “reserved” part would remain spare (not really spare, it would store lots of Spark internal objects). For your information, if you don’t give Spark executor at least 1.5 * Reserved Memory = 450MB heap, it will fail with “please use larger heap size” error message.
User Memory. This is the memory pool that remains after the allocation of Spark Memory, and it is completely up to you to use it in a way you like. You can store your own data structures there that would be used in RDD transformations. For example, you can rewrite Spark aggregation by using mapPartitions transformation maintaining hash table for this aggregation to run, which would consume so called User Memory. In Spark 1.6.0 the size of this memory pool can be calculated as (“Java Heap” – “Reserved Memory”) * (1.0 – spark.memory.fraction), which is by default equal to (“Java Heap” – 300MB) * 0.25. For example, with 4GB heap you would have 949MB of User Memory. And again, this is the User Memory and its completely up to you what would be stored in this RAM and how, Spark makes completely no accounting on what you do there and whether you respect this boundary or not. Not respecting this boundary in your code might cause OOM error.
Spark Memory. Finally, this is the memory pool managed by Apache Spark. Its size can be calculated as (“Java Heap” – “Reserved Memory”) * spark.memory.fraction, and with Spark 1.6.0 defaults it gives us (“Java Heap” – 300MB) * 0.75. For example, with 4GB heap this pool would be 2847MB in size. This whole pool is split into 2 regions – Storage Memory and Execution Memory, and the boundary between them is set by spark.memory.storageFraction parameter, which defaults to 0.5. The advantage of this new memory management scheme is that this boundary is not static, and in case of memory pressure the boundary would be moved, i.e. one region would grow by borrowing space from another one. I would discuss the “moving” this boundary a bit later, now let’s focus on how this memory is being used:
Storage Memory. This pool is used for both storing Apache Spark cached data and for temporary space serialized data “unroll”. Also all the “broadcast” variables are stored there as cached blocks. In case you’re curious, here’s the code of unroll. As you may see, it does not require that enough memory for unrolled block to be available – in case there is not enough memory to fit the whole unrolled partition it would directly put it to the drive if desired persistence level allows this. As of “broadcast”, all the broadcast variables are stored in cache with MEMORY_AND_DISK persistence level.
Execution Memory. This pool is used for storing the objects required during the execution of Spark tasks. For example, it is used to store shuffle intermediate buffer on the Map side in memory, also it is used to store hash table for hash aggregation step. This pool also supports spilling on disk if not enough memory is available, but the blocks from this pool cannot be forcefully evicted by other threads (tasks).
Ok, so now let’s focus on the moving boundary between Storage Memory and Execution Memory. Due to nature of Execution Memory, you cannot forcefully evict blocks from this pool, because this is the data used in intermediate computations and the process requiring this memory would simply fail if the block it refers to won’t be found. But it is not so for the Storage Memory – it is just a cache of blocks stored in RAM, and if we evict the block from there we can just update the block metadata reflecting the fact this block was evicted to HDD (or simply removed), and trying to access this block Spark would read it from HDD (or recalculate in case your persistence level does not allow to spill on HDD).
So, we can forcefully evict the block from Storage Memory, but cannot do so from Execution Memory. When Execution Memory pool can borrow some space from Storage Memory? It happens when either:
There is free space available in Storage Memory pool, i.e. cached blocks don’t use all the memory available there. Then it just reduces the Storage Memory pool size, increasing the Execution Memory pool.
Storage Memory pool size …read more