Snowflake vs Databricks: Where Should You Put Your Data?

Software ate the world, turned it into data, and now it’s backed up and suffering from indigestion: according to a 2019 Splunk survey, 55% of data goes unused. MapReduce is all but dead, and the on-premises EDW is on life support. Welcome to the current state of data infrastructure.

Open-source MapReduce died quickly as a software engine for processing big data stored in “data lakes”. Better open-source alternatives showed up, and since MapReduce was free, it was easy to replace. The outdated proprietary on-premises enterprise data warehouse (EDW), designed to store and process transactional data, continues to die a slow and painful death thanks to all of its locked-in customers, who can’t afford to abandon large investments in the tightly coupled hardware & software systems sitting in their basements.

Fortunately, much-improved alternatives from Snowflake & Databricks have emerged to replace the legacy EDW 1.0 & Data Lake 1.0. They take advantage of new cloud offerings to help us convert more of those data calories into something useful, delivering faster performance at lower cost thanks to the elastic pricing of the cloud.

Snowflake & Databricks best represent the two main ideological camps in data’s digestive tract, each a fresh reboot for the cloud of an approach we’ve seen before. Snowflake offers a cloud-only, proprietary EDW 2.0. Meanwhile, Databricks offers an open-source-based, hybrid on-premises/cloud Data Lake 2.0.

Databricks & Snowflake Heritage

Both camps are competing to become the one-stop shop that handles all your data for any use case. Snowflake launched the first attack on the data lake in 2019 when it suggested that, in addition to a modern EDW, it also provides an improved data lake it calls a “Data Ocean”. In 2020, Databricks launched a counterattack on the data warehouse with its “Data Lakehouse”.

In my career, I have sold both proprietary EDW systems for Oracle and open-source data lake solutions for Pivotal (acquired by VMware in 2019). I’ll help you cut through the complexity and marketing noise surrounding this topic. Here’s what you need to know:

  1. EDW & Data Lake Story: A brief history of the EDW and Data Lake 1.0 (hint: history likes to repeat itself)

  2. Snowflake Cloud Data Platform vs Databricks Data Lakehouse: I’ll give you an “apples-to-apples” comparison of the EDW and Data Lake 2.0

  3. Where Should You Put Your Data — Snowflake vs Databricks: I’ll help you understand the advantages & disadvantages, given what we’ve seen in the past.

The EDW 1.0 Story

Basic Enterprise Data Warehouse (EDW) 1.0 Architecture

The EDW came first, in the 1980s, as data became more available and businesses began to rely more heavily on it to make business-critical decisions. They needed to organize and consolidate their data in a central place. Vendors like Oracle, Teradata, and IBM responded to this need, selling customers on-premises EDW systems (hardware + software) that both stored and processed their data.

Top 3 Characteristics of EDW 1.0:

  1. Structured Data: The traditional EDW only stored data that was organized, or “structured”, into tables with a “schema” of rows and columns, similar to what you see in Excel. This structure lets you quickly access and analyze your data with SQL queries. You needed an ETL (Extract, Transform, Load) solution to grab your data, organize it, and load it into your EDW (see the sketch after this list).

  2. Centralized Storage & Processing: Data was stored and processed in gigantic tables. Some systems made multiple copies of these large tables to speed up processing in a “parallel” architecture.

  3. Expensive: These systems were built for business-critical applications that could not go down. They also required high performance levels for processing SQL queries. To achieve these goals, they built them with expensive, premium hardware.
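
To make the ETL idea concrete, here is a tiny illustrative sketch in Python, with SQLite standing in for the warehouse; the file, table, and column names are all invented for the example.

```python
import csv
import sqlite3

# EXTRACT: pull raw records from a source system (a CSV file here).
with open("sales_raw.csv", newline="") as f:         # hypothetical file
    rows = list(csv.DictReader(f))

# TRANSFORM: force every record into the warehouse schema.
records = [
    (r["order_id"], r["region"].strip().upper(), float(r["amount"]))
    for r in rows
    if r["amount"]                                   # drop malformed rows
]

# LOAD: insert into a structured table with fixed rows and columns,
# ready for SQL queries (SQLite stands in for the EDW).
db = sqlite3.connect("warehouse.db")
db.execute("CREATE TABLE IF NOT EXISTS sales (order_id TEXT, region TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
db.commit()

print(db.execute("SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall())
```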

EDW 1.0 was an effective solution for a while, until the world changed. Businesses began to acquire data at greater volume, variety, and velocity than ever before. Because of this rapid growth, we couldn’t organize our data quickly enough to make it immediately useful in the EDW. We needed a more flexible data storage and processing solution.

The Data Lake 1.0 Story

Basic Data Lake 1.0 Architecture

We knew this ever-increasing pile of data would likely be useful in the future, so instead of throwing it away because we couldn’t organize it quickly enough, we decided to simply start storing all of it in its native format on cheap (or “commodity”) distributed hardware. Essentially, we decided to store now and organize later.

Data lake solutions were born around 2006 within the leading technology companies, Google and Yahoo, out of sheer necessity: they were acquiring data faster than anyone else. Google published papers describing its internal distributed storage and processing systems, and engineers at Yahoo built an open-source implementation that anyone could use and improve at no cost to meet their specific needs. This is how the industry-leading data lake ecosystem, Apache Hadoop, was born.

Hadoop was not, however, suitable for most enterprises right out of the box, and most enterprises needed a vendor to reliably support these systems. As a consequence, vendors like Cloudera, Hortonworks, Pivotal, and others emerged to create fully supported enterprise data lake offerings built around the open-source Apache Hadoop core.

Top 3 Characteristics of Data Lake 1.0:

  1. Unstructured Data: The Data Lake was designed to store data in its original, or “unstructured” format. You didn’t need an ETL system to structure your data before loading and saving it in the data lake.

  2. Decentralized Storage & Processing: Instead of being stored in gigantic centralized tables, data was split up into many smaller “distributed” files stored in the open-source Hadoop Distributed File System (HDFS) across many cheap, connected computers. Only the computer, or “node”, that stored the specific data you wanted to work with processed that data, which was much more efficient than working with gigantic centralized tables or files. You used the MapReduce programming model to process your data on the computer disk where it was stored (see the sketch after this list).

  3. Cheap: Instead of relying on expensive hardware that would fail less often, the open-source HDFS was designed with the assumption that hardware would fail, and to handle those failures automatically. This meant you could now store data on cheaper computers that worked together to be as reliable as the EDW. And whereas EDW customers had to purchase far more expensive capacity up-front than they needed at the moment to plan for the future, data lake customers could simply purchase additional inexpensive hardware when needed.
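
To make the MapReduce model concrete, here is a minimal, illustrative word-count sketch in plain Python. Real Hadoop jobs were typically written in Java against the Hadoop API and ran across many nodes; this single-process toy only mimics the map, shuffle, and reduce phases.

```python
from collections import defaultdict

# MAP: each mapper turns its chunk of the input into (key, value) pairs.
def map_phase(chunk):
    for line in chunk:
        for word in line.split():
            yield (word.lower(), 1)

# SHUFFLE: the framework groups all values by key across mappers.
def shuffle_phase(mapped_pairs):
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

# REDUCE: each reducer aggregates the values for its keys.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

# Toy "distributed" input: two chunks, as if stored on two HDFS nodes.
chunk_a = ["the quick brown fox", "the lazy dog"]
chunk_b = ["the dog barks"]

pairs = list(map_phase(chunk_a)) + list(map_phase(chunk_b))
print(reduce_phase(shuffle_phase(pairs)))  # {'the': 3, 'dog': 2, ...}
```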

The original Hadoop Data Lake 1.0 allowed businesses to store and work with large piles of any type of data in a very efficient and cost-effective manner. There were challenges, however.

The Data Lake couldn’t process SQL queries as quickly as EDW 1.0, and the MapReduce programming model was cumbersome. It required high levels of expertise that many enterprises simply didn’t have in-house.

Then came the cloud, which, like the data lake, made everything cheaper and easier to scale. And unlike both the original EDW and Data Lake systems, the cloud eliminated the need for businesses to purchase and manage hardware altogether. In short, the cloud “democratized” the EDW and the Data Lake, making both instantly available to any company of any size.

Snowflake Cloud Data Platform vs Databricks Data Lakehouse

Databricks Lakehouse vs Snowflake Cloud Data Platform Basic Architecture

Snowflake took the proprietary on-premise EDW 1.0 concept and turned it into a compelling, modern, fully managed cloud replacement. Anyone can get started with Snowflake in the cloud for pennies. Customers no longer have to pay large up-front costs for systems from legacy vendors like Teradata, Oracle, or IBM. In essence, Snowflake “consumerized” these technologies in the cloud almost overnight.

On the other end of the spectrum, Databricks has modernized Data Lake 1.0 for the cloud with incredible success. Databricks was founded by the original creators of Apache Spark, one of the ways the open-source Hadoop ecosystem evolved to fix its own shortcomings. Spark processes data much more quickly “in-memory” (think RAM vs. disk hard-drive), and it largely replaced the MapReduce programming model, which had proved difficult to work with. Spark also evolved to provide interfaces for working with data in a variety of popular programming languages, including R, Python, and Java, as well as SQL.
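
To illustrate how Spark’s modern APIs replaced hand-written MapReduce, here is a short PySpark sketch. It assumes a local `pyspark` installation, and the file name and columns are hypothetical; on Databricks, a ready-made Spark session is provided for you.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session (Databricks supplies one automatically).
spark = SparkSession.builder.appName("spark-demo").getOrCreate()

# Read semi-structured data directly, with no up-front ETL.
# "events.json" and its columns are invented for this example.
events = spark.read.json("events.json")

# The same aggregation through two interfaces: the Python DataFrame API...
events.groupBy("user_id").agg(F.count("*").alias("n_events")).show()

# ...and plain SQL over the same in-memory data.
events.createOrReplaceTempView("events")
spark.sql("SELECT user_id, COUNT(*) AS n_events FROM events GROUP BY user_id").show()
```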

The Databricks value proposition is compelling for some of the same reasons as Snowflake’s: low up-front costs to get started, and pricing that scales elastically with use. Additionally, Databricks handles the complex underlying infrastructure so that customers can focus on generating business value.

Snowflake vs Databricks — Top 3 Differences:

  1. Data Structure

    Snowflake: Unlike EDW 1.0, and similar to a data lake, Snowflake lets you upload and save both structured and semi-structured files (JSON, for example) without using an ETL tool to organize the data before loading it. Once uploaded, Snowflake automatically transforms the data into its internal structured format. Unlike a data lake, however, Snowflake requires you to add structure to truly unstructured data before you can load it and work with it.

    Databricks: Like Data Lake 1.0, Databricks can work with all data types in their original format. In fact, you can use Databricks as an ETL tool to add structure to your unstructured data so that other tools, like Snowflake, can work with it.

  2. Data Ownership

    Snowflake: Snowflake will tell you that, compared with EDW 1.0, it has decoupled the storage and processing layers, meaning you can scale each independently in the cloud according to your needs. This will save you money since, as we’ve learned, we process less than half of the data we store. Like the legacy EDW, however, Snowflake does not decouple data ownership: it still owns both the data storage and processing layers.

    Databricks: With Databricks, on the other hand, the data storage and processing layers are fully decoupled. Databricks focuses on the data processing and application layers: you can leave your data wherever it lives (even on-premises), in any format, and use Databricks to process it.

  3. Use Case Versatility

    Snowflake: Like EDW 1.0, Snowflake is best suited to the SQL-based business intelligence use cases where it shines. To work on data science & machine learning use cases with Snowflake data, you will likely have to rely on its partner ecosystem. Like Databricks, Snowflake provides ODBC & JDBC drivers to integrate with third parties. These partners would typically pull data out of Snowflake, process it with an engine outside of Snowflake, like Apache Spark, and send the results back.

    Databricks: Databricks also enables high-performance SQL queries for business intelligence use cases. Databricks created open-source Delta Lake as a layer that adds reliability on top of Data Lake 1.0, and with the Databricks Delta Engine on top of Delta Lake, you can now run SQL queries at performance levels previously reserved for an EDW (see the sketch after this list).

    In addition to SQL, and unlike with Snowflake, you can also work with your data in a variety of programming languages. This is especially important for data science & machine learning, where practitioners primarily use R & Python to work with big data. Beyond native support for these languages, Databricks provides a collaborative data science & machine learning platform.
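
As a minimal sketch of that Lakehouse workflow, the snippet below reads raw JSON from the lake, adds structure, writes it as a Delta table, and queries it with plain SQL. It assumes a PySpark environment with the `delta-spark` package installed (on Databricks, Delta support is built in), and the paths and column names are invented for the example.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Configure a local Spark session with Delta Lake support
# (a Databricks cluster comes with this preconfigured).
builder = (
    SparkSession.builder.appName("lakehouse-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# ETL step: read raw JSON from the lake and add structure.
raw = spark.read.json("/data/raw/orders.json")  # hypothetical path
clean = raw.select("order_id", "customer_id", "amount").dropna()

# Write a Delta table: the reliability layer (ACID transactions,
# schema enforcement) on top of ordinary data lake files.
clean.write.format("delta").mode("overwrite").save("/data/delta/orders")

# BI-style SQL directly over the lake, with no separate warehouse load.
spark.read.format("delta").load("/data/delta/orders").createOrReplaceTempView("orders")
spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id
    ORDER BY total_spent DESC
""").show()
```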

Where Should You Put Your Data — Snowflake vs Databricks

As any data scientist will tell you, the best way to predict the future is to look at similar historical events and their outcomes. We’ll take the same approach here, using the fate of EDW vs Data Lake 1.0 to train our mental models for what we may see with Snowflake vs Databricks. This will help us decide where we should put our data.

Snowflake

Snowflake will continue to acquire customers for 3 main reasons:

  1. A superior alternative to EDW 1.0: Who wants to buy big metal boxes, real estate to house them, and hire people to manage them? No one. Not even the CIA & NSA.

  2. Business intelligence: Like EDW 1.0, Snowflake can be a great option for business intelligence workloads where it shines the brightest.

  3. Simplicity: Snowflake is incredibly simple to use. As with EDW 1.0, it will continue to appeal to the analyst community for this reason. In the cloud, customers no longer have to worry about managing hardware; with Snowflake, they don’t have to worry about managing the software either.

Word of caution: If data is the new gold, should you put it all in one basket? Or perhaps, should you diversify your portfolio? Certainly, a Snowflake EDW means less vendor lock-in compared with EDW 1.0, mainly because it doesn’t also come packaged with large expensive hardware.

What should give you pause, however, is that EDW 1.0 was largely satisfied to store and process your structured transaction data. In contrast, due to technology improvements, Snowflake can also handle your semi-structured data, and it wants it badly.

Don’t forget, technology in this area advances rapidly. Remember your friends stuck with EDW 1.0 in their basements and proceed with caution.

Databricks

Databricks will continue to acquire customers for 3 main reasons:

  1. Superior technology: Until we see technology leaders like Google, Netflix, Uber, and Facebook transition from open-source to proprietary systems, you can rest assured that open-source-based systems like Databricks are superior from a technology perspective. They’re far more versatile.

  2. Data science & machine learning: As with Data Lake 1.0 vs EDW 1.0, the Databricks platform is, without question, far better suited to data science & machine learning workloads than Snowflake.

  3. Minimal Vendor Lock-In: As with Data Lake 1.0, vendor lock-in is much less of a concern with Databricks, if it’s a concern at all. In fact, you can leave your data wherever you want and use Databricks to connect to it and process it for virtually any use case.

Word of caution: Databricks delivers technology built by experts, largely for experts. To Databricks’ credit, it has simplified the data lake compared with Data Lake 1.0. As we saw with Data Lake 1.0, however, we can still expect the Databricks approach to appeal most to tech-savvy teams.

Snowflake may seem attractive because it is so simple to get started with. But short-term thinking in a technology landscape that will continue to change rapidly can cost you dearly in the long term as you struggle to adapt. If you can, hire for the expertise to manage your data as a core business investment, and diversify appropriately.
