Bringing the Community Together with Parquet
This blog post was jointly written by Cloudera (Justin Kestelyn), MapR (Dale Kim), and Twitter (Julien Le Dem), all companies that contribute to the Apache Parquet project.
Since its founding in July 2013, Apache Parquet (incubating) has taken the Apache Hadoop ecosystem by storm and has become the preferred data format in Hadoop. Driven by customer requirements for interactive queries at larger scale, this open source columnar storage format is designed to limit the I/O that long-running queries need, save storage space through better compression, and enable vectorized execution engines. It has seen rapid adoption across the Apache Hadoop ecosystem and has solidified its place as an open standard.
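To make the columnar idea concrete, here is a minimal toy sketch in plain Python. It is not Parquet's actual on-disk format or API; the sample records and the helper names (`to_columns`, `run_length_encode`) are made up for illustration. It shows why storing values column by column lets a query read only the column it needs, and why a column of repeated values compresses well.

```python
from itertools import groupby

# Hypothetical sample records; field names and values are illustrative only.
rows = [
    {"user_id": 1, "country": "US", "clicks": 10},
    {"user_id": 2, "country": "US", "clicks": 3},
    {"user_id": 3, "country": "US", "clicks": 7},
    {"user_id": 4, "country": "FR", "clicks": 1},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    return {name: [r[name] for r in records] for name in records[0]}


def run_length_encode(values):
    """Collapse consecutive repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# An aggregate over one field touches a single column list
# instead of scanning every field of every row.
total_clicks = sum(columns["clicks"])

# Values in one column share a type and often repeat, so even a
# simple scheme like run-length encoding shrinks them.
encoded_country = run_length_encode(columns["country"])

print(total_clicks)
print(encoded_country)
```

A real columnar format layers much more on top of this (row groups, pages, metadata, several encodings), but the same two properties, reading fewer columns and compressing similar values together, are what the design goals above rely on.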
As a standard across the Hadoop ecosystem, Parquet is supported by multiple Hadoop vendors and natively by many popular Hadoop components, which gives users portability and compatibility. Additionally, broad industry contributions and multiple production use cases mean sustainable value for Parquet users, who can trust its quality for the long term.
Users are adopting Parquet for a variety of reasons:
1. History tells us that when an entire ecosystem gets on board with a project (as happened with Apache Spark), sustained contributions and innovation are a key result. For example, we are already seeing players outside the original contributors, such as Criteo, Stripe, Netflix, Salesforce.com, and MapR, make invaluable new contributions to Parquet. This engineering effort ensures ongoing innovation and high-quality advancements.
2. The aforementioned options for commercial support alleviate vendor lock-in concerns, so users are assured portability across multiple vendors.
3. With Parquet supported in multiple projects (Apache Hive, Impala, Apache Spark, Apache Drill, and more), developers can focus on the use cases at hand and continue to use the right tool for the job rather than be tied to a single component.
We are continuing to see excitement and innovation in the Parquet community. A few examples of recent advancements include:
- A Look at Recent Community Contributions to Parquet
- Added Data Types as the Initial Step to Self-Describing Data in Parquet
- Parquet Support to Presto
The entire community deserves congratulations not only for making Parquet as useful as it is, but for making both end-user and developer concerns a top priority as its production use cases expand.