A glimpse into the future of open data architecture



Hadoop may have fizzled out as a data platform, but it laid the foundation for an open data architecture that continues to grow and evolve today, much of it in the cloud. We got a glimpse into the future of this open data architecture at the recent Subsurface conference, which brought together the creators of several promising technologies for data lakes and data lakehouses.

Much of the exciting work in data architecture today takes place in the cloud. Thanks to the availability of effectively infinite object-based storage (such as S3) and effectively unlimited on-demand compute (thanks to Docker and Kubernetes), the physical limitations on collecting, storing, and processing massive quantities of data have largely disappeared (this has also introduced new cost concerns, but that’s another topic for another day).

When one problem is fixed, new problems usually appear. In this case, as storage and compute have been ‘solved’, the focus is now on how best to allow the largest group of users to access and use this data in the most efficient way. For various reasons, this problem is not solved, especially in growing big data environments. Attempts to fit legacy data management technologies and techniques into this new cloud data paradigm have had mixed success.

In short, with the new age of cloud data upon us, the thinking goes, we need new tools and technologies to take advantage of it. This is precisely what a new generation of technologists who advocate open data tools for the open data architecture hopes to provide. It is also what the cloud analytics provider Dremio focused on with its Subsurface LIVE conference, which was held virtually at the end of July.

In a Subsurface panel on the future of open data architecture, Gartner analyst Sanjeev Mohan spoke about the future with four people who create these technologies, including Wes McKinney, creator of Pandas and co-creator of Apache Arrow; Ryan Blue, creator of the Iceberg table format; Julien Le Dem, co-creator of Parquet; and Ryan Murray, co-creator of Nessie.

“It’s very exciting to see that a journey that we started in open source decades ago seems to be coming to fruition,” Mohan said. “We finally seem to be in a place where, in the open data architecture, we now have a set of open source projects that complement each other, and they help us build an end-to-end solution.”

Take Apache Iceberg, for example. The technology was originally developed by engineers at Netflix and Apple to address the performance and usability challenges of using Apache Hive tables. While Hive is just one of many SQL analytics engines, the Hive metastore has survived as the de facto glue connecting data stored in HDFS and S3 with modern SQL engines, such as Dremio, Presto, and Spark.

Unfortunately, the Hive metastore does not perform well in dynamic big data environments. Changes to data must be coordinated, which can be a complex and error-prone process. When it is not done correctly, the data can become corrupted. As a replacement for Hive tables, Iceberg supports atomic transactions, which give users a guarantee of correctness.
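To make the idea of an atomic commit concrete, here is a minimal sketch in plain Python. It is not Iceberg's API: `ToyTable`, `Snapshot`, and the file names are hypothetical, and a lock stands in for whatever atomic swap a real catalog provides. It only illustrates the general pattern of publishing a new table state in one step and rejecting writers that worked from a stale state.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Immutable view of a table: a version number and its data files."""
    version: int
    files: tuple


class ToyTable:
    """Hypothetical table whose whole state is one pointer to a snapshot."""

    def __init__(self):
        self._current = Snapshot(version=0, files=())
        self._lock = threading.Lock()  # stands in for a catalog's atomic swap

    def current(self):
        return self._current

    def commit(self, base, new_files):
        """Publish base + new_files only if no one committed since base was read."""
        with self._lock:
            if self._current.version != base.version:
                return False  # conflict: caller must re-read and retry
            self._current = Snapshot(base.version + 1,
                                     base.files + tuple(new_files))
            return True


table = ToyTable()
base = table.current()
first = table.commit(base, ["part-0001.parquet"])    # accepted
second = table.commit(base, ["part-0002.parquet"])   # stale base, rejected
retry = table.commit(table.current(), ["part-0002.parquet"])  # accepted
```

Readers always see either the old snapshot or the new one, never a half-written mix, which is the property the paragraph above describes.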

But that was not enough. As we have learned, when one problem is solved, another tends to emerge. In the case of Project Nessie, it was necessary to provide version control for data stored in table formats such as Iceberg.

“When we started to think about Project Nessie, we started to really think about the progression of the data lake platform over the past 10 or 15 years,” said Murray, an engineer at Dremio. “We saw people [slowly]… build abstractions, whether it’s abstractions to help us compute, or abstractions for things like tables and data files and that sort of thing. We started to think, what’s the next abstraction? What is the thing that makes the most sense?”

For Murray, the next abstraction that was needed was a catalog placed on top of the table formats to promote better interaction with downstream components.

“Just as Ryan Blue felt that Apache Hive was not well suited to the table format – with the single point of failure, the large number of API calls to this metastore, even the Thrift endpoint – it made scalability very difficult, made it really difficult to use effectively, especially cloud native,” Murray said. “So we were looking for something that was going to be cloud native and work with modern table formats, and we could start thinking about extending to all the other wonderful things that my fellow panelists are building.”
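The catalog-with-version-control idea can be sketched in a few lines of plain Python. This is an illustration only, not Nessie's API: `ToyCatalog`, the branch names, and the snapshot ids are all hypothetical. It assumes a Git-like model in which each branch maps table names to the snapshot each table currently points at.

```python
class ToyCatalog:
    """Hypothetical Git-like catalog: each branch maps table names to snapshot ids."""

    def __init__(self):
        self.branches = {"main": {}}

    def create_branch(self, name, source="main"):
        # A new branch starts as a copy of the source branch's table pointers
        self.branches[name] = dict(self.branches[source])

    def commit(self, branch, table, snapshot_id):
        # Only the named branch sees the new snapshot
        self.branches[branch][table] = snapshot_id

    def merge(self, source, target="main"):
        # Publish every table pointer from source onto target in one step
        self.branches[target] = {**self.branches[target],
                                 **self.branches[source]}


catalog = ToyCatalog()
catalog.commit("main", "orders", "snap-1")
catalog.create_branch("etl")
catalog.commit("etl", "orders", "snap-2")
catalog.commit("etl", "customers", "snap-1")
main_before = dict(catalog.branches["main"])  # still only orders -> snap-1
catalog.merge("etl")
main_after = catalog.branches["main"]         # both tables updated together
```

The point of the sketch is that changes to several tables stay invisible on `main` until the merge, at which point they appear together.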

As one of the most popular big data formats, Parquet is another technology that was originally developed for Hadoop but continued to be widely adopted after Hadoop adoption tapered off, thanks to its ability to be used in cloud object stores. The columnar format gives users the ability to run demanding analytical queries, à la Teradata, while its compression and native support for distributed file systems allow it to work in modern big data clusters.
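A small plain-Python sketch shows why a columnar layout suits analytical queries and compresses well. It does not use the Parquet library or its file layout; the sample records and the helper names `to_columns` and `run_length_encode` are made up for illustration, and run-length encoding is used here only as one simple example of column compression.

```python
from itertools import groupby

# Row-oriented records, as an application might produce them (made-up data)
rows = [
    {"user": "ana", "country": "US", "spend": 12.0},
    {"user": "bo", "country": "US", "spend": 7.5},
    {"user": "cy", "country": "FR", "spend": 3.0},
    {"user": "di", "country": "FR", "spend": 9.5},
]


def to_columns(records):
    """Pivot row records into one list per column."""
    return {name: [r[name] for r in records] for name in records[0]}


def run_length_encode(values):
    """Collapse runs of repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)
total_spend = sum(columns["spend"])  # reads only the 'spend' column
encoded_country = run_length_encode(columns["country"])
```

An aggregate over one field touches a single contiguous list instead of every record, and a column of repeated values such as `country` shrinks to a handful of pairs.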

Le Dem co-developed Parquet while working at Twitter, which did much of its data analysis on Hadoop or Vertica. Hadoop could scale for large datasets, but it lacked performance for demanding queries. Vertica was the opposite: it could handle ad hoc queries with good performance, but it just couldn’t handle big data.

“We were always between the two options,” Le Dem said. “And I think some of it was making Hadoop more like a warehouse. Starting from the bottom up, starting with the columnar layout, and making it more efficient, following in the footsteps of these columnar databases.”

Although Parquet has seen mass adoption, there are still fundamental limits to what it can do. “Parquet is just a file format,” Le Dem said. “It makes things better for the query engine, but it doesn’t deal with anything like, how to create a table, how to do all of these things. So we needed a layer on top. It was great to see this happening in the community.”

This brings us to Apache Arrow, which was co-developed by McKinney and which Le Dem also helps develop. Arrow’s contribution to open data architecture is that it provides a very fast in-memory columnar format for sharing data between a large collection of systems and query engines. This heterogeneity is a hallmark of open data architecture, said Le Dem.

“One of the drivers of this open storage architecture is that people don’t use just one tool,” said Le Dem. “They [use] things like Spark, they use things like Pandas. They use warehouses, or SQL-on-Hadoop type things, like Dremio and Presto, but also other proprietary warehouses. So there’s a lot of fragmentation, but they still want to be able to use all of these tools and machine learning on the same data. So having that common storage layer [Arrow], it makes perfect sense to standardize on this so that you can create and transform data from a variety of sources.”
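The benefit of a shared format can be illustrated with Python's standard library alone. The sketch below does not use Arrow; it uses `array` and `memoryview` as stand-ins, and the two "tools" are hypothetical functions. It shows the underlying idea: when consumers agree on one memory layout, they can read the same buffer without converting or copying it.

```python
from array import array

# One column of float64 values held in a single contiguous buffer
column = array("d", [1.5, 2.5, 4.0])

# A memoryview exposes the same bytes without copying them
shared = memoryview(column)


def tool_a_total(buf):
    """Hypothetical first consumer: sums the column."""
    return sum(buf)


def tool_b_peak(buf):
    """Hypothetical second consumer: finds the maximum."""
    return max(buf)


total = tool_a_total(shared)
peak = tool_b_peak(shared)

# A write through the original is visible through the view: no copy was made
column[0] = 10.0
first_value_seen_by_view = shared[0]
```

Because both functions read the same buffer, adding a third consumer costs no extra serialization step, which is the kind of saving a common format aims for across tools and languages.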

The need for Arrow emerged in the middle of the Hadoop hype cycle. “About six years ago, we recognized that… the community had developed Parquet as an open standard for data storage and warehousing for data lakes and for the Hadoop ecosystem,” McKinney said.

“But more and more we were seeing this increase in heterogeneity of applications and programming languages, where moving data between two different stages of the application pipeline is very expensive,” he continued.

McKinney, who recently folded Ursa Computing into his new startup Voltron Data, is now working on Arrow Flight, a framework for fast data transport that relies on gRPC, a remote procedure call (RPC) technology that uses Protocol Buffers to exchange messages between distributed applications. An extension for Arrow Flight could potentially replace JDBC and ODBC, enabling rapid data transformation across the board, McKinney said.

Going forward, as technologies like Arrow, Iceberg, Nessie, and Parquet are integrated into the data ecosystem, this will enable a new generation of productivity among developers and engineers tasked with building data-driven applications, Murray said.

“A lot of the data engineers I interact with are thinking about how big my Parquet file is and what directory it belongs in for partitions to be used, and how to make sure it has the right schema and all that sort of thing,” he said. “And I think we’re so ready to stop talking about this, so engineers can just start writing SQL and applications on top of these.”

Freedom of choice is a hallmark of open data lake architecture, said Tomer Shiran, CTO of Dremio, during his opening address at Subsurface.

“You can choose the best-in-class engine for a given workload,” said Shiran. “Not only that, but in the future, as new engines are created, you will be able to choose those engines as well. It becomes very easy to launch a new engine, point it at your data, open source Parquet files, or open source Iceberg tables, and start querying and modifying that data.”

Open data lakes and lakehouses are gaining ground in the market and, thanks to such technologies, will become the predominant architecture in the future, predicts Dremio CEO Billy Bosworth.

“When you have these architectural changes as we see it today, from classic relational database structures to these open data lake architectures, these types of changes tend to last for decades,” Bosworth said during his Subsurface session. “Our engineers and architects are building that future for all of us, a future where things are more easily accessible, where data arrives faster and the time to value for that data increases rapidly. And it does so in a way that allows people to have the best options in the types of services they want to use against that data.”

