What was the primary focus of Apache Parquet when it was first released over ten years ago?

When Apache Parquet was released over ten years ago, it was designed as a more efficient file format with a specific emphasis on the speed to read data.

How has the focus of data analytics shifted in recent years compared to the early days of 'big data'?

While the early focus was on big data within the Apache Hadoop ecosystem, the current market emphasis has shifted toward 'speed to value' to support the needs of A.I. and M.L. models.

Does the blog post mention Databricks or Snowflake as part of the 'big data' companies from 2013?

No. The author notes that Databricks took its first funding in 2013 and Snowflake did not launch until 2014, so they were not part of the 2013 big data landscape described.

Jay Cuthrell August 28, 2023

Are You Gonna Go Parquet

A look at the past, present, and future of Apache Parquet.

This week we take a look at the past, present, and future of Apache Parquet.

This week’s musical inspiration in title and lyrics:

https://open.spotify.com/track/45Ia1U4KtIjAPPU7Wv1Sea?si=ece0e6f236f348e5

Getting Informed

A year goes by fast. In fact, a year ago I penned the footnote-laden “You Get A Line and I’ll Get A Poll Result” (2022).

Just over ten years ago, Apache Parquet was released as a more efficient file format with an emphasis on _speed to read data_. At the time, “big data” was seeing regular coverage within the Apache Hadoop ecosystem that had developed over the prior seven years.

Today, the need to feed A.I. and M.L. models is part of the growing interest in “data analytics” with increasing emphasis on _speed to value_. In fact, as I mentioned back in “Dig Your Own SQL” (2021), the market emphasis on _speed to value_ is only going to accelerate.

Now it’s time for reading 📖, watching 📺, and listening 🎧 suggestions:

First, 📖 Demystifying the Parquet File Format in which Michael Berk makes the case for why Apache Parquet is such an attractive option.
Second, 📖 How Much Longer Can Computing Power Drive Artificial Intelligence Progress? (2022) in which Andrew Lohn and Micah Musser raise a red flag on the cloud costs of advancing AI models by literally throwing hardware at the problem.
Third, 🎧 How statistics are collected and used when working with Apache Parquet files and Apache Iceberg tables in which Alex Merced breaks down the reasons for using Apache Parquet.
Fourth, 📺 Using AI to Find Truth in Data in which John Furrier and Dave Vellante interview Joel Inman and discuss a different way of thinking about compute for AI-driven SQL architecture.

So that’s why you’ve got to try 🎶

The hottest “big data” companies of 2013, according to CIO, were probably ones that many IT folks with longer-lived careers would recognize. But where are they now?

Cloudant (acquired by IBM in 2014)
ParStream (acquired by Cisco in 2015)
Skytree (acquired by Infosys in 2017)
ScaleArc (acquired by IgniteTech in 2018)
Xplenty (acquired by Xenon Ventures in 2018)
MapR (acquired by HPE in 2019)
Cloudera (acquired by PE firms KKR and CD&R in 2021)
Lucidworks (Series F)
SiSense (Series F)
SumAll (appears to have gone out of business)

Perhaps you are wondering why you don’t see two of the most talked-about “big data” companies of recent years, Databricks and Snowflake. Well, it’s worth noting that among current “data analytics” companies, Databricks took its first funding in 2013 and Snowflake would not launch until 2014.

We must engage and rearrange 🎶

Setting aside “big data” circa 2013, what is the importance of Apache Parquet to modern “data analytics” companies today? Here’s a quick scan for Apache Parquet references across just a few names.

While the list above is not exhaustive in any way, it shows that the adoption of Apache Parquet reflects ten years of progress. It is also worth noting that PrestoDB was open sourced in 2013, Apache Spark became a top-level Apache project in 2014, and PrestoSQL (now Trino) began in 2019.

So, it’s fair to say… this is just the tip of the Apache Iceberg… which began in 2017.

So, what will be the next big thing in Apache Parquet and other speed to value related technologies?

Until then… Place your bets!

🤓
