The Rise of Unstructured Data
The word “data” is ubiquitous in narratives of the modern world. And data, the thing itself, is vital to the functioning of that world. This blog discusses quantifications, types, and implications of data. If you’ve ever wondered how much data there is in the world, what types there are and what that means for AI and businesses, then keep reading!
Quantifications of data
The International Data Corporation (IDC) estimates that by 2025 the sum of all data in the world will be on the order of 175 Zettabytes (one Zettabyte is 10^21 bytes). Most of that data will be unstructured, and only about 10% will be stored. Even less will be analysed.
Seagate Technology forecasts that enterprise data will double from approximately 1 to 2 Petabytes (one Petabyte is 10^15 bytes) between 2020 and 2022. Approximately 30% of that data will be stored in internal data centres, 22% in cloud repositories, 20% in third party data centres, 19% will be at edge and remote locations, and the remaining 9% at other locations.
The amount of data created over the next 3 years is expected to be more than the data created over the past 30 years.
So data is big and growing. At current growth rates, it is estimated that the number of bits produced would exceed the number of atoms on Earth in about 350 years – a physics-based constraint described as an information catastrophe.
The rate of data growth is reflected in the proliferation of storage centres. For example, the number of hyperscale centres is reported to have doubled between 2015 and 2020. Microsoft, Amazon and Google own over half of the 600 hyperscale centres around the world.
And data moves around. Cisco estimates that global IP data traffic grew 3-fold between 2016 and 2021, reaching 3.3 Zettabytes per year. Of that traffic, 46% travels over WiFi, 37% over wired connections, and 17% over mobile networks. Mobile and WiFi transmissions have increased their share of the total over the last five years, at the expense of wired transmissions.
Classifications of data
A first analysis of the world’s data can be taxonomical. There are many ways to classify data: by its representation (structured, semi-structured, unstructured), by its uniqueness (singular or replicated), by its lifetime (ephemeral or persistent), by its proprietary status (private or public), by its location (data centres, edge, or endpoints), etc. Here we mostly focus on structured vs unstructured data.
In terms of representation, data can be broadly classified into two types: structured and unstructured. Structured data can be defined as data that can be stored in relational databases, and unstructured data as everything else. In other words, structured data has a pre-defined data model, whereas unstructured data doesn’t.
Examples of structured data include the Iris Flower data set, where each datum (corresponding to a sample flower) has the same, predefined structure: the flower species and four numerical features, namely the length and width of the petal and the sepal. Examples of unstructured data, on the other hand, include media (video, images, audio), text (email, tweets), and business productivity files (Microsoft Office documents, GitHub code repositories, etc.)
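The distinction can be made concrete in a few lines of code. Below is a minimal sketch (the field names and the `is_structured` helper are illustrative, not a standard API): an Iris-style record conforms to a fixed schema and maps directly onto a relational table, while a tweet does not.

```python
# A structured datum: every record shares the same predefined schema,
# so it maps directly onto a row of a relational table.
iris_sample = {
    "species": "setosa",
    "sepal_length": 5.1, "sepal_width": 3.5,
    "petal_length": 1.4, "petal_width": 0.2,
}

# An unstructured datum: free-form text with no predefined data model.
tweet = "Just spotted an iris in the garden -- gorgeous petals!"

IRIS_SCHEMA = {
    "species": str,
    "sepal_length": float, "sepal_width": float,
    "petal_length": float, "petal_width": float,
}

def is_structured(record, schema):
    """Check that a record conforms to a fixed schema (field name -> type)."""
    return (isinstance(record, dict)
            and set(record) == set(schema)
            and all(isinstance(record[k], t) for k, t in schema.items()))

print(is_structured(iris_sample, IRIS_SCHEMA))  # True
print(is_structured(tweet, IRIS_SCHEMA))        # False
```

The point of the sketch is that structure is a property of the data model, not of the content: the tweet may mention the very same flower, but nothing in its representation tells a database how to index or query it.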
Generally speaking, structured data tends to have a more mature ecosystem for its analysis than unstructured data. However (and this is one of the challenges for businesses) there is an ongoing shift in the world from structured to unstructured data, as reported by IDC. Another report states that between 80% and 90% of the world’s data is unstructured, with about 90% of it having been produced over the last two years alone. Currently only about 0.5% of that data is analysed. Similar figures, of 80% of data being unstructured and growing at a rate of 55% to 65% annually, are reported here.
Data produced by sensors is reported to be one of the fastest growing segments of data, expected to soon surpass all other data types. And it turns out that image and video cameras, although they make up a relatively small portion of all manufactured sensors, are reported to produce the most data among sensors. From this information, it can be argued that images and video make up a very significant contribution to the world’s data.
The IDC categorizes data into four types: entertainment video and images, non-entertainment video and images, productivity data, and data from embedded devices. The last two types, productivity data and data from embedded devices, are reported to be the fastest growing types. Data from embedded devices, in particular, is expected to continue this trend due to the growing number of devices, which itself is expected to increase by a factor of four over the next ten years.
All of the above figures are for data that is produced, but not necessarily transmitted, e.g., between IP addresses. It is estimated that about 82% of the total IP traffic is video, up from 73% in 2016. This trend might be explained by increased usage of Ultra High Definition television, and the increased popularity of entertainment streaming services like Netflix. Video gaming traffic, on the other hand, though much smaller than video traffic, has grown by a factor of three in the last five years, and currently accounts for 6% of the total IP traffic.
Now let’s explore some of the challenges that copious amounts of data bring to the AI, business, and engineering communities.
The challenges of data
Data facilitates, incentivizes, and challenges AI. It facilitates AI because, to be useful, many AI models require large amounts of data for training. Data incentivizes AI because AI is one of the most promising ways to make sense of, and extract value from, the data deluge. And data challenges AI because, in spite of its abundance in raw form, data needs to be annotated, monitored, curated, and scrutinized in its societal effects. Here we briefly describe some of the challenges that data poses to AI.
Abundance of data has been one of the main facilitators of the AI boom of the last decade. Deep Learning, a subset of AI algorithms, typically requires large amounts of human-annotated data to be useful. But performing human annotations is expensive, hard to scale, and ultimately infeasible for all the tasks that AI may be set to perform in the future. This challenges AI practitioners because they need to develop ways to decrease the need for human annotations. Enter the field of learning with limited labeled data.
There is a plethora of efforts to produce models that can learn without labels or with few labels. Since learning with labeled data is known as supervised learning, methods that reduce the need for labels have names such as self-supervision, semi-supervision, weak-supervision, non-supervision, incidental-supervision, few-shot learning, and zero-shot learning. The activity in the field of learning with limited data is reflected in a variety of courses, workshops, reports, blogs and a large number of academic papers (a curated list of which can be found here). It has been argued that self-supervision might be one of the best ways to overcome the need for annotated data.
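To give a feel for one of these techniques, here is a toy sketch of semi-supervision via pseudo-labeling. Everything in it is illustrative (the 1-nearest-neighbour "model", the distance threshold, and the data are all made up): a model trained on a few labeled points labels the unlabeled pool, and confident predictions are absorbed into the training set.

```python
# Toy semi-supervised learning via pseudo-labeling (illustrative only).
# A 1-nearest-neighbour "model" labels unlabeled points; predictions
# within a distance threshold are treated as confident and absorbed.

def distance(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nearest_label(point, labeled):
    """Return (label, distance) of the closest labeled example."""
    return min(((lab, distance(point, p)) for p, lab in labeled),
               key=lambda t: t[1])

def pseudo_label(labeled, unlabeled, threshold=1.0):
    """Iteratively absorb unlabeled points whose nearest neighbour is close."""
    labeled = list(labeled)
    remaining = list(unlabeled)
    changed = True
    while changed and remaining:
        changed = False
        for point in list(remaining):
            label, dist = nearest_label(point, labeled)
            if dist <= threshold:          # "confident" pseudo-label
                labeled.append((point, label))
                remaining.remove(point)
                changed = True
    return labeled

seed = [((0.0, 0.0), "A"), ((5.0, 5.0), "B")]   # two human-labeled points
pool = [(0.5, 0.5), (1.0, 1.0), (4.5, 4.5)]     # unlabeled pool
result = pseudo_label(seed, pool)
print(len(result))  # 5: every pool point was pseudo-labeled
```

Real systems replace the nearest-neighbour heuristic with a trained model and a confidence score, but the loop is the same: two human labels here yield five training examples.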
“Everyone wants to do the model work, not the data work” starts the title of this paper. That paper makes the argument that work on data quality tends to be under-appreciated and neglected. And, it is argued, this is particularly problematic in high-stakes AI, such as applications in medicine, environment preservation and personal finance. The paper describes a phenomenon called Data Cascades, which consists of the compounded negative effects that have their root in poor data quality. Data Cascades are said to be pervasive, to lack immediate visibility, but to eventually impact the world in a negative manner.
Related to the neglect of data quality, it has been observed that much of the efforts in AI have been model-centric, that is, mostly devoted to developing and improving models, given fixed data sets. Andrew Ng argues that it is necessary to place more attention on the data itself – that is, to iteratively improve the data on which models are trained, rather than only or mostly improving the model architectures. This promises to be an interesting area of development, given that improving large amounts of data might itself benefit from AI.
Data fairness is one of the dimensions of ethical AI. It aims to protect AI stakeholders from the effects of biased, compromised or skewed datasets. The Alan Turing Institute proposes a framework for data fairness that includes the following elements:
- Representativeness: using correct data sampling to avoid under- or over-representations of groups.
- Fitness-for-Purpose and Sufficiency: the collection of enough quantities of data, and the relevancy of it to the intended purpose, both of which impact the accuracy and reasonableness of the AI model trained on the data.
- Source Integrity and Measurement Accuracy: ensuring that prior human decisions and judgments (e.g., prejudiced scoring, ranking, interview data, or evaluations) are not biased.
- Timeliness and Recency: data must be recent enough and account for evolving social relationships and group dynamics.
- Domain Knowledge: ensuring that domain experts, who know the population distribution from which data is obtained and understand the purpose of the AI model, are involved in deciding the appropriate categories and sources of measurement of data.
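The representativeness element above lends itself to a simple quantitative check. The sketch below is purely illustrative (the group names, population shares, and 5% tolerance are hypothetical): it compares each group's share in a training sample against its known share in the population.

```python
# Illustrative representativeness check: compare group proportions in a
# training sample against known population shares. Group names, shares,
# and the 5% tolerance are hypothetical values for the example.
from collections import Counter

POPULATION_SHARES = {"group_a": 0.50, "group_b": 0.30, "group_c": 0.20}

def representation_gaps(sample_groups, population_shares):
    """Return each group's (sample share - population share)."""
    counts = Counter(sample_groups)
    total = len(sample_groups)
    return {g: counts.get(g, 0) / total - share
            for g, share in population_shares.items()}

sample = ["group_a"] * 70 + ["group_b"] * 20 + ["group_c"] * 10
gaps = representation_gaps(sample, POPULATION_SHARES)

over = {g for g, d in gaps.items() if d > 0.05}    # over-represented
under = {g for g, d in gaps.items() if d < -0.05}  # under-represented
print(over, under)  # {'group_a'} {'group_b', 'group_c'}
```

In practice a statistical test (e.g., chi-squared) would replace the fixed tolerance, and the population shares would come from the domain experts mentioned in the last bullet.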
There are also proposals to move beyond bias-oriented framings of ethical AI, like the above, and towards a power-aware analysis of datasets used to train AI systems. This involves taking into account “historical inequities, labor conditions, and epistemological standpoints inscribed in data”. This is a complex area of research, involving history, cultural studies, sociology, philosophy, and politics.
Before we discuss the implications of data and their challenges, it is relevant to say a few words about computational resources. In 2019 OpenAI reported that the compute used in the largest AI training runs has been doubling every 3.4 months since 2012. This is much faster than the rate between 1959 and 2012, when requirements doubled only every 2 years, roughly matching the growth of computational power itself (as measured by the number of transistors, per Moore’s law). The report doesn’t explicitly say whether the current compute-hungry era of AI is a result of increasing model complexity or increasing amounts of data, but it is likely a combination of both.
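To appreciate what those doubling times mean, a doubling time of T months implies a yearly growth factor of 2^(12/T). A quick back-of-the-envelope calculation:

```python
# Yearly growth factor implied by a given doubling time in months.
def yearly_growth_factor(doubling_months):
    return 2 ** (12 / doubling_months)

modern = yearly_growth_factor(3.4)   # post-2012 AI compute: ~11.5x per year
classic = yearly_growth_factor(24)   # pre-2012, Moore's-law pace: ~1.41x per year
print(round(modern, 1), round(classic, 2))  # 11.5 1.41
```

In other words, a 3.4-month doubling time means compute demands grow by more than a factor of ten every year, versus less than 50% per year under the older, Moore's-law-like regime.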
Addressing the challenges of data
At Cloudera we have taken on several of the challenges that unstructured data poses to the enterprise. Cloudera Fast Forward Labs produces blogs, code repositories and applied prototypes that specifically target unstructured data like natural language and images, with resources for video processing coming soon. We have also addressed the challenge of learning with limited labeled data and the related topic of few-shot classification for text, as well as the ethics of AI. Additionally, Cloudera Machine Learning supports enterprise AI teams across the full data lifecycle, data pipelines, and scalable computational resources, enabling them to focus on AI models and their productionization.
Perhaps the two most important pieces of information presented above are
- Unstructured data is both the most abundant and the fastest-growing type of data, and
- The vast majority of that data is not being analysed.
Here we explore the implications of these facts from four different perspectives: scientific, engineering, business, and governmental.
From a scientific perspective, the trends described above imply the following: the development of a fundamental understanding of intelligence will continue to be facilitated, incentivized, and challenged by large amounts of unstructured data. One important area of scientific work will continue to be the development of algorithms that require little or no human-annotated data, since the rate at which humans can label data cannot keep pace with the rate at which data is produced. Another area of work that will grow is data-centric development of AI algorithms, which should complement the model-centric paradigm that has been dominant up to now.
There are many implications of large unstructured data for engineering. Here we mention two. One is the continued need to accelerate the maturation process of ecosystems for the development, deployment, maintenance, scaling and productionization of AI. The other is less well defined but points towards innovation opportunities to extend, refine and optimize technologies originally designed for structured data, and make them better suited for unstructured data.
Challenges for business leaders include, on the one hand, understanding the value that data can bring to their organizations, and, on the other, investing and administering the resources necessary to attain that value. This requires, among other things, bridging the gap that often exists between business leadership and AI teams in terms of culture and expectations. AI has dramatically increased its capacity to extract meaning from unstructured data, but that capacity is still limited. Both business leaders and AI teams need to extend their comfort zones in the direction of each other in order to create realistic roadmaps that deliver value.
And last but not least, challenges for governments and public institutions include understanding the societal impact of data in general, and of how unstructured data shapes the development of AI in particular. Based on that understanding, they need to legislate and regulate, where appropriate, practices that ensure positive outcomes of AI for all. Governments also hold at least part of the responsibility for building national AI strategies for economic growth and the technological transformation of society. Those strategies include the development of educational policies, infrastructure, skilled-labour immigration processes, and regulatory processes based on ethical considerations, among many others.
All of those communities, scientific, engineering, business, and governmental, will need to continue to converse with each other, breaking silos and interacting in constructive ways in order to secure the benefits and avoid the drawbacks that AI promises.