Data can live in many different places. The obvious home is a database, but data stored in many different databases is not always easily accessible for analytics purposes. For that reason corporate data, or copies of it, often resides in data warehouses and, increasingly, data lakes.
The terms data warehouse and data lake are often wrongly used interchangeably, though both are similarly popular with large enterprises. Gartner research indicates that approximately 80% of enterprises use data warehouses or plan to adopt them within the next 12 months, while 73% use or plan to adopt data lakes over the same period.
Data Lake or Data Warehouse: Spot the Difference
Data warehouses and data lakes are two distinct and very different concepts, so what exactly is the difference between the two?
Gartner defines a data warehouse as a collection of data in which two or more disparate data sources can be brought together in an integrated, time-variant information management strategy. Data warehouses generally house well-known and structured data and they support predefined and repeatable analytics needs that can be scaled across many users in the enterprise.
A data lake, on the other hand, collects unrefined data (that is, data in its native form, with limited transformation and quality assurance) and events captured from a diverse array of source systems. Data lakes usually support data preparation, exploratory analysis and data science activities — potentially across a wide range of subjects and constituents.
These two definitions don’t sound very different, so let’s take a closer look at these two ways of storing data and the idea behind each one.
What is a Data Warehouse?
Perhaps the easiest way to appreciate the difference between a data warehouse and a data lake is to think of a data warehouse as a pre-assembled system designed to enable users — and in many cases that means business users — to analyse particular data sets in broadly pre-defined ways.
In order to make the data warehouse work, it is first necessary to decide which data sets are going to be required. These data sets then have to be transformed, processed or normalized into a standard data structure, one optimized for fast SQL queries and accompanied by metadata describing the data, before they can be stored in the data warehouse. The warehouse then serves as a “single source of truth”.
Once the data sets have been transformed, and a process has been put in place so that new data can be similarly transformed and added to the data warehouse, then all that remains is for users to create specific SQL queries (or have someone build the queries for them) so that they can get regular answers to specific questions such as, “how are sales doing in this region?”, and then drill down into the data to get more specific information.
The key point is that a data warehouse is generally designed to provide data to answer specific queries, and the types of queries it will be used for dictate the data that needs to be transformed and fed into the data warehouse. New types of queries may require new types of data, and before that data can be added, new transformations will have to be designed and applied.
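The warehouse pattern above can be sketched in a few lines. This is a minimal illustration using an in-memory SQLite database as a stand-in for a warehouse; the table name, columns and sales figures are all hypothetical, but the shape is the point: a pre-defined schema, loaded in advance, queried with repeatable SQL.

```python
import sqlite3

# Stand-in for a data warehouse: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")

# The schema is decided up front, before any data arrives.
conn.execute(
    """CREATE TABLE sales (
           region  TEXT,
           quarter TEXT,
           amount  REAL
       )"""
)

# Data is transformed into this structure before loading (hypothetical figures).
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EMEA", "2021-Q1", 120000.0),
     ("EMEA", "2021-Q2", 135000.0),
     ("APAC", "2021-Q1",  80000.0)],
)

# "How are sales doing in this region?" as a repeatable, pre-defined query.
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
):
    print(region, total)
```

Answering a genuinely new kind of question in this model means extending the schema and the load process first, which is exactly the constraint described above.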
What is a Data Lake?
In contrast to a data warehouse, a data lake is simply a huge repository into which data flows from all kinds of sources in its raw or native form. The key points here are that data lakes are intended to capture much or all of the data that an enterprise generates, not just data needed for specific types of queries, and that data is poured into the data lake without any transformation or normalization.
That makes it easy to handle a new stream of data because no transformations need to be done, and it also means that data lakes are likely to be much larger than data warehouses.
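This "load first, transform later" approach can be sketched as follows. The directory layout, source names and sample events are hypothetical, and a temporary local directory stands in for what would normally be object storage such as Amazon S3; the point is that records of any shape land in the lake verbatim.

```python
import json
import tempfile
from datetime import date
from pathlib import Path

# Stand-in for a data lake: a temporary local directory
# (a real lake would typically sit on object storage).
lake_root = Path(tempfile.mkdtemp())

def ingest(source: str, record: dict) -> Path:
    """Append a raw record to the lake, partitioned by source and day."""
    partition = lake_root / source / date.today().isoformat()
    partition.mkdir(parents=True, exist_ok=True)
    path = partition / "events.jsonl"
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")  # stored verbatim, untransformed
    return path

# Any shape of data can land in the lake; no schema work is needed first.
ingest("web_clicks", {"user": 42, "page": "/pricing"})
ingest("tweets", {"text": "great product!", "lang": "en", "retweets": 3})
```

Adding a brand-new stream is just another call to `ingest` with a new source name, which is why new data sources are so much cheaper to onboard than in a warehouse.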
Data in data lakes is generally kept forever, in case it is needed in the future, while data in data warehouses often has a lifecycle: after a certain period it may be discarded, or even transferred to a data lake.
In terms of the types of data that may be stored in a data lake, the rule of thumb is that anything goes: structured data, as well as data from social network activity, text, images and anything else that conceivably may be of use. By contrast, data in a data warehouse is mostly structured quantitative metrics along with accompanying metadata.
It is worth noting, however, that vast amounts of data in a data lake can be virtually useless without accompanying metadata describing what the data is and how old it is. Even simple metadata such as “Twitter comments” or “results of 2020 user survey” can prevent data lakes from becoming costly but pointless “data swamps.”
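A metadata catalogue does not have to be elaborate to be useful. The sketch below (with hypothetical paths and entries) records just a description and a capture date per data set, which is already enough to make the lake searchable rather than a pile of anonymous bytes.

```python
# A minimal metadata catalogue for a data lake. In practice this would be
# a dedicated catalogue service; a dictionary illustrates the idea.
catalog = {}

def register(path: str, description: str, captured: str) -> None:
    """Record what a data set is and when it was captured."""
    catalog[path] = {"description": description, "captured": captured}

# Hypothetical entries, echoing the examples in the text.
register("s3://lake/tweets/2020/", "Twitter comments", "2020")
register("s3://lake/surveys/2020.jsonl", "results of 2020 user survey", "2020")

# An analyst can now locate data sets by what they contain,
# instead of guessing from file paths.
matches = [path for path, meta in catalog.items()
           if "survey" in meta["description"]]
print(matches)
```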
Who uses Data Lakes and Data Warehouses?
Data warehouses are generally used by business professionals who need to run similar reports to keep tabs on the business every day, week, or perhaps month.
When it comes to carrying out more in-depth data analytics, or creating unusual data mashups, business users and data scientists are more likely to turn to a data lake with its mass of raw data, and the ease with which new data streams can be added to the lake. Since the structure of the data or schema is not defined, users are not restricted in the types of questions they can ask. Rather than being restricted to SQL queries, analysts can use big data analytics tools, full text searches, real-time analytics and even machine learning algorithms to get insights and useful information from the data.
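This "schema-on-read" freedom can be illustrated with a small sketch. The raw records below are hypothetical and deliberately heterogeneous; because no schema was fixed at load time, each analysis decides at query time which fields matter, so the same raw data can answer completely different questions.

```python
import json

# Heterogeneous raw events, as they might sit in a lake (hypothetical).
raw = [
    '{"type": "tweet", "text": "love the new dashboard", "lang": "en"}',
    '{"type": "support_ticket", "subject": "dashboard is slow", "priority": 2}',
    '{"type": "page_view", "page": "/dashboard", "ms": 5400}',
]

records = [json.loads(line) for line in raw]

# Question 1: which kinds of records mention the dashboard at all?
# (A crude full-text search across whatever fields each record has.)
mentions = [r["type"] for r in records
            if "dashboard" in json.dumps(r).lower()]

# Question 2: an entirely different cut of the same raw data, using
# a field that only one record type even possesses.
slow_views = [r for r in records
              if r["type"] == "page_view" and r.get("ms", 0) > 3000]

print(mentions)
print(len(slow_views))
```

Neither question had to be anticipated when the data was stored, which is the practical meaning of not being restricted by a predefined schema.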
Data Lakes and Data Warehouses are Changing
The biggest influence on data lakes and data warehouses has been the burgeoning market for cheap cloud data storage, through services such as Amazon S3 and Azure Data Lake Storage. Combined with the rapid enterprise take-up of cloud services, it is becoming more economical, and easier on corporate networks, to run data warehouses and data lakes in the cloud.
Add to that the near-infinite, highly elastic compute resources available in these and other clouds, running engines such as Presto, Spark, Apache Kafka (for streaming) and Hadoop, and the distinction between structured data warehouses and unstructured data lakes is also becoming blurred.
For example, Apache Iceberg, the open source table format, can manage data stored in data lakes while providing common data warehouse functionality.
With the growth in remote working over the last twelve months, cloud-based data warehouses and data lakes are likely to become increasingly common, particularly because they can offer far higher performance when accessed directly or via an SD-WAN, bypassing the corporate network entirely.