🙏 A big thank you to our new sponsor, NexusTek! 🙏
Are You Gonna Go Parquet
This week we take a look at the past, present, and future of Apache Parquet.
This week’s musical inspiration in title and lyrics:
https://open.spotify.com/track/45Ia1U4KtIjAPPU7Wv1Sea?si=ece0e6f236f348e5
Getting Informed
A year goes by fast. In fact, a year ago I penned the footnote-laden “You Get A Line and I’ll Get A Poll Result” (2022).
Just over ten years ago, Apache Parquet was released as a more efficient file format with an emphasis on _speed to read data_. At the time, “big data” was seeing regular coverage within the Apache Hadoop ecosystem that had developed over the prior seven years.
Today, the need to feed A.I. and M.L. models is part of the growing interest in “data analytics” with increasing emphasis on _speed to value_. In fact, as I mentioned back in “Dig Your Own SQL” (2021), the market emphasis on _speed to value_ is only going to accelerate.
Now it’s time for reading 📖, watching 📺, and listening 🎧 suggestions:
- First, 📖 Demystifying the Parquet File Format in which Michael Berk makes the case for why Apache Parquet is such an attractive option.
- Second, 📖 How Much Longer Can Computing Power Drive Artificial Intelligence Progress? (2022) in which Andrew Lohn and Micah Musser raise a red flag about the cloud costs of advancing AI models by literally throwing hardware at the problem.
- Third, 🎧 How statistics are collected and used when working with Apache Parquet files and Apache Iceberg tables in which Alex Merced breaks down the reasons for using Apache Parquet.
- Fourth, 📺 Using AI to Find Truth in Data in which John Furrier and Dave Vellante interview Joel Inman and discuss a different way of thinking about compute for AI-driven SQL architecture.
So that’s why you’ve got to try 🎶
Ten years ago, the hottest companies of “big data” in 2013 according to CIO were probably companies many IT folks with longer-lived careers would recognize. But where are they now?
- Cloudant (acquired by IBM in 2014)
- ParStream (acquired by Cisco in 2015)
- Skytree (acquired by Infosys in 2017)
- ScaleArc (acquired by IgniteTech in 2018)
- Xplenty (acquired by Xenon Ventures in 2018)
- MapR (acquired by HPE in 2019)
- Cloudera (acquired by PE firms KKR and CD&R in 2021)
- Lucidworks (Series F)
- SiSense (Series F)
- SumAll (appears to have gone out of business)
Perhaps you are wondering why you don’t see Databricks or Snowflake, two of the most talked-about “big data” companies of recent years. Well, it’s worth noting that Databricks, now counted among the “data analytics” companies, took its first funding in 2013 and Snowflake would not launch until 2014.
We must engage and rearrange 🎶
Setting aside “big data” circa 2013, what is the importance of Apache Parquet to modern “data analytics” companies today? Here’s a quick scan for Apache Parquet references across just a few names.
- Snowflake: How to Batch Ingest Parquet Fast with Snowflake
- Databricks: What is Parquet?
- Airbyte: A Deep Dive into Parquet: The Data Format Engineers Need to Know
- Dremio: Parquet File Best Practices
- AWS: Using the Parquet format in AWS Glue
While the list above is not exhaustive in any way, it shows that the adoption of Apache Parquet reflects ten years of progress. It is also worth noting that PrestoDB began in 2013, Apache Spark became a top-level Apache project in 2014, and PrestoSQL aka Trino began in 2019.
So, it’s fair to say… this is just the tip of the Apache Iceberg… which began in 2017.
So, what will be the next big thing in Apache Parquet and other speed to value related technologies?
Until then… Place your bets!
Disclosure
I am linking to my disclosure.
🤓