Rethinking data architectures for a cloud world

Data analytics solutions are continuing to emerge at a fast and furious rate. Data teams are at the center of the storm because they have to balance all the demands for access, data integrity, security, and proper governance, which involves compliance with policies and regulations. The companies they serve need information as quickly as possible and have little patience for that precarious balancing act. The data teams have to move fast and smart.

They also have to be fortune tellers because they need to build not just the systems for today, but also the platforms for tomorrow. The first key question the data team must consider is: open or closed data architectures.

Open vs. closed data architecture

Let’s start with the phrase “data architectures.” If I were to show you an architecture diagram from any enterprise over the last 50 years, odds are that their labels for data would in fact be labels representing databases: not the data itself, but the engines that act upon the data. Names here are familiar, both old and new: Oracle, DB2, SQL Server, Teradata, Exadata, Snowflake, etc. These are all databases into which you load your datasets for either operational or analytical purposes, and they’re the foundations of the “data architecture.”

By definition, these databases are what we would call “closed data architectures.” That’s not a value statement; it’s a descriptive one. It means that the data itself is closed off from other applications and must be accessed through the database engine. This is true even for moving data around with ETL jobs because at some point, to do the export or the import, you need to go through the database, whether or not that’s the optimal way to achieve what you want to do. The data is “closed” off from the rest of the architecture in this important sense.

In contrast, an “open data architecture” is one that stores the data in its own independent tier within the architecture, which allows different best-of-breed engines to be used for an organization’s variety of analytic needs. That’s important because there’s never been a silver bullet when it comes to analytic processing needs, and there likely never will be. An open architecture puts you in a great position to be able to use whatever best-of-breed services exist today or in the future.

To summarize: A closed data architecture brings the data to a database engine, and an open data architecture brings the database engine to the data.

An easy way to test if you’re dealing with an open architecture is to consider how hard it would be in the future to adopt a new engine. Will you be able to run the new engine side by side with an existing one (on the same data), or will a wholesale (and likely impractical) migration be required?
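
The side-by-side test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a CSV file stands in for an open format such as Apache Parquet, and the `Engine` interface, the two engine classes, and the sample sales data are all hypothetical, not part of any real product. The point is that both engines query the same file in place, with no export or import step between them.

```python
import csv
import os
import tempfile
from collections import defaultdict
from typing import Dict, Protocol


class Engine(Protocol):
    """Hypothetical interface: anything that can query the shared data tier."""

    def revenue_by_region(self, path: str) -> Dict[str, float]: ...


class ExistingEngine:
    """Stand-in for an engine already in use; aggregates while streaming rows."""

    def revenue_by_region(self, path: str) -> Dict[str, float]:
        totals: Dict[str, float] = defaultdict(float)
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                totals[row["region"]] += float(row["amount"])
        return dict(totals)


class NewEngine:
    """Stand-in for a newly adopted engine; same file, different strategy."""

    def revenue_by_region(self, path: str) -> Dict[str, float]:
        with open(path, newline="") as f:
            rows = list(csv.DictReader(f))
        regions = {r["region"] for r in rows}
        return {
            region: sum(float(r["amount"]) for r in rows if r["region"] == region)
            for region in regions
        }


def write_shared_dataset(path: str) -> None:
    """Write the independent data tier once, in a format any engine can read."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["region", "amount"])
        writer.writerows([["emea", "120.5"], ["apac", "80.0"], ["emea", "30.0"]])


def run_side_by_side() -> Dict[str, Dict[str, float]]:
    """Run both engines against the same file, with no migration or copy."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sales.csv")
        write_shared_dataset(path)
        engines: Dict[str, Engine] = {
            "existing": ExistingEngine(),
            "new": NewEngine(),
        }
        return {name: e.revenue_by_region(path) for name, e in engines.items()}


if __name__ == "__main__":
    print(run_side_by_side())
```

In a closed architecture, the equivalent of `NewEngine` could not open the file at all; it would have to ask the existing engine to export the data first.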

Note that at this point we’ve touched on a critical aspect of “open” that has nothing to do with open source. Step one is deciding that you want your data open and available to any services that wish to take advantage of it, and that brings us to open in a cloud world.

Open, services-oriented data architecture

When applications moved from client-server to web, the fundamental architecture changed. We went from monolithic applications that ran in one process, to services-oriented applications that were broken into smaller, more specialized software services. Eventually, these became known as “microservices” and they remain the dominant design for web and mobile applications. The microservices approach held many advantages that were realized due to the nature of cloud infrastructure. In a scale-out system with on-demand resource models and numerous teams working on pieces of functionality, the “application” became nothing more than a facade for dozens or hundreds of microservices.

Everyone agrees that this approach has many advantages for building modular and scalable applications. For some reason, we’re expected to believe that this paradigm isn’t nearly as effective for data. At Dremio, we believe that’s inaccurate. We believe the logic of looking at our data in the same open, services-oriented manner as our applications is intuitively obvious and desirable. On a practical and strategic level, an open, services-oriented data architecture just makes sense.

That’s why, for us, the issue of open source software is secondary. The primary “open” that matters most is the first step of deciding an open data architecture is more desirable than a closed one. Once that happens, a watershed of goodness is unleashed. Open file and table formats (Apache Parquet, Apache Iceberg, etc.) are critical as they allow for industry-wide innovation. That innovation gets delivered in the form of services that act upon the independent data tier. Messy, costly, fragile, and compliance-undermining copying of data is greatly reduced or even eliminated. The data team gets to choose from best-of-breed services to act upon that data, slotting them into the architecture the same way we have been doing with application services for more than a decade. It’s time for data architectures to catch up.

There is one legitimate claim levied by those disputing the value of open data architectures: They’re too complicated. Complication comes with any major technological shift. Midrange computers were initially more complicated to manage than established mainframes. Then Intel-based servers were initially more complicated to manage than established midrange systems. Managing PCs was initially more complicated than managing established dumb terminals. You see the point. Each time a technology shift happens, it goes through the normal adoption curve into the mainstream. The early days are always more complicated from a management perspective, but with time, new tools and approaches reduce that complexity, resulting in the benefits far outweighing the initial complexity cost. That’s why we have innovation.

Dremio was created to make an open, services-oriented data architecture much, much easier and more powerful. With Dremio, running SQL against a lakehouse is easy because of the way we put all the pieces together. And we’ve created industry-changing open source projects along the way, such as Nessie, Apache Arrow, and Arrow Flight. These are open source projects because open source technology encourages adoption and interoperability, which are critical for service integration layers in an organization’s data architecture. Everyone wins. Customers win because they get a collective industry working on and innovating key pieces of technology to better serve them. Open source enthusiasts win because they get access to the code to better understand it, and even improve it. And we win because we use those innovations to make SQL on lakehouses fast and easy.

To put a fine point on this discussion, the reality is that no matter how “open” a vendor claims to be, no matter how much they talk about supporting open formats and open standards, even if that vendor were open source at its core, if the data architecture is closed, it is closed. Period.

One key point that Snowflake has made in recent articles is that you need to be closed in areas like the data format and storage ownership in order to meet business requirements. While this may have been true 20 years ago, recent advancements such as cloud storage and transactional table formats now enable open architectures to meet these requirements. And if a company can meet its requirements with an open architecture and all the benefits that come with it, why would it choose a closed architecture? We suspect this might be why Snowflake is spending so much time arguing that open doesn’t matter.

Data as a first-class citizen

At Dremio we’re advocating for a world where the data itself becomes a first-class citizen in the architecture. We’re making that easier and easier to realize for companies that want the benefits of an open architecture, such as: (1) flexibility to use best-of-breed engines best suited for different jobs; (2) avoiding being locked into going through a proprietary engine in order to access their data; (3) setting themselves up to take advantage of tomorrow’s innovations; and (4) eliminating the complexity that endless copying and moving of data into and out of data warehouses has created.

We’re not only committed to open standards and open source, important as they may be—we’re first and foremost committed to open data architectures. We believe that as they become easier and easier to implement and use, the advantages are overwhelming when compared to a closed data architecture. We’re also committed to equipping and educating people on this journey with initiatives like our Subsurface industry conference, which attracted over 10,000 attendees in our first-ever events last year. The momentum is building and the destination is a future with open data architectures at its core.

Tomer Shiran is co-founder and chief product officer at Dremio.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to [email protected]

Copyright © 2021 IDG Communications, Inc.
