John Kirby

The Lakehouse - a brief overview

Sounds like a lovely holiday location! In fact, the Lakehouse is a design concept that brings together multiple approaches to storing data in a way that's accessible to 'modern technologies' across a 'modern data scope'.


'Modern technologies' makes some words spring to mind: 'cloud', 'big data', 'data mesh', 'Databricks'. What doesn't spring to mind is the 'classic' on-prem RDBMS technologies like SQL Server, or platforms like SharePoint, that are well embedded in many organisations. Can these classic technologies be included in the term 'modern'? Yes. Azure SQL DB is the latest version of SQL Server (albeit with a few functional differences) - it's essentially the same engine, rebranded and in the cloud. SharePoint also sits alongside the Office 365 cloud. Here's something to push the boundaries of 'modern' technology... Excel. Can Excel be classed as 'modern'? Again, yes: it still has valid use cases, it's kept up to date, and it also sits in the cloud. Integration technologies like Azure Data Factory, and Databricks with Pandas, integrate with Excel.


There are a couple of phrases to bear in mind when designing data solutions: 'KIS' (keep it simple) and 'the right tool for the right job'. Shunning technology that's been around for ages isn't sensible. Similarly, ignoring cutting-edge technologies just because they're not yet proven isn't sensible either. So when designing a Lakehouse, or any new solution, let's be open-minded, do our research, and find the best tech for the job that fits with the overall solution architecture.


As I write this, Delta Lake is quickly becoming a popular part of the Lakehouse design, taking the role of the storage layer on top of the data lake. Databricks is also very popular for data processing. Both are platform agnostic, which is great, and offer a variety of implementation options and languages. Azure Data Factory (ADF) is a great candidate to orchestrate the data movement. ADF can pick up the business's data from its source (Excel, databases, APIs and loads of other places) and feed it to Databricks. Databricks uses Delta Lake to reliably store different incarnations of the data, typically through a three-step curation process: bronze, silver and gold levels of curation.


The 'Lake' in 'Lakehouse'


Data gets stored in a data lake. Delta Lake is a product that integrates nicely with Databricks to make the storage in the data lake ACID compliant.
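The big win here is atomicity: readers never see a half-written dataset. The gist can be sketched with nothing more than an atomic file rename - a toy stand-in, to be clear, not how Delta Lake actually works (Delta uses a transaction log over Parquet files):

```python
import json
import os
import tempfile


def atomic_write(path: str, rows: list[dict]) -> None:
    """Write rows to path so readers never see a partial file.

    A toy illustration of the 'all or nothing' write guarantee that
    Delta Lake's transaction log gives a data lake.
    """
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(rows, f)
        os.replace(tmp, path)  # atomic swap: old file or new file, never half of each
    except BaseException:
        os.remove(tmp)  # failed mid-write: discard the temp file, original untouched
        raise
```

A reader opening the file concurrently sees either the previous complete version or the new complete version - which is the property that was so hard to get from plain files in a lake.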

Bronze is usually a raw copy of the source data, so it can be interrogated without putting stress on the source business system. It's great for developers to analyse when modelling the other two layers, and for auditing purposes - tracing a figure back from an end point like a Power BI dashboard to its source. The use of Delta Lake in bronze can be considered optional. It depends on the source feed: are whole files provided each time, or does the source deliver deltas?
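The 'raw copy plus audit trail' idea behind bronze can be sketched in a few lines. The function and metadata fields below are hypothetical, and a real implementation would land the payload in cloud storage rather than return a dict - but the principle is the same: keep the payload byte-for-byte untouched and wrap it in metadata for tracing:

```python
import datetime
import hashlib


def land_bronze(payload: bytes, source: str) -> dict:
    """Wrap a raw source extract with audit metadata, leaving it untouched.

    Illustrative sketch only: the payload itself is never cleansed or
    reshaped at bronze, so a figure can always be traced back to exactly
    what the source system sent.
    """
    return {
        "source": source,
        "ingested_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "content_hash": hashlib.sha256(payload).hexdigest(),  # proves the copy matches the source
        "payload": payload.decode("utf-8"),
    }
```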

Silver usually has some cleansing done to it, and maybe some modelling too. Some basic cleansing could include: datetime fields from different source systems set to the same format and timezone (UTC); trailing whitespace and non-ASCII characters removed from text fields; columns added for ISO versions of fields like countries and currencies. Data quality checks could happen on this layer against a set of business rules. The silver layer uses Delta Lake to track changes and keep a history, as well as giving ACID consistency to writes so we don't end up with a partial write. This layer could also be modelled using Data Vault 2.0 to help it make sense to its users. Silver layer users are typically data scientists and analysts.
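Those basic cleansing rules are simple enough to sketch framework-free. In practice this would be a Databricks/PySpark job over Delta tables; the column names and the country lookup below are made up for illustration:

```python
import datetime
import re

# Hypothetical lookup - in a real silver layer this would be a reference dataset
ISO_COUNTRY = {"United Kingdom": "GB", "France": "FR", "Germany": "DE"}


def cleanse(row: dict) -> dict:
    """Apply the basic silver-layer rules described above (illustrative)."""
    out = dict(row)
    # Non-ASCII characters and trailing whitespace stripped from text fields
    for key, value in out.items():
        if isinstance(value, str):
            out[key] = re.sub(r"[^\x00-\x7f]", "", value).rstrip()
    # Datetimes from any source timezone normalised to UTC
    if "created_at" in out:
        dt = datetime.datetime.fromisoformat(out["created_at"])
        out["created_at"] = dt.astimezone(datetime.timezone.utc).isoformat()
    # ISO column added alongside the source country field
    if "country" in out:
        out["country_iso"] = ISO_COUNTRY.get(out["country"])
    return out
```

A data quality check against business rules would then run over the cleansed rows before they're written to the silver Delta table.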

Gold is typically a star schema, mainly because its end users in the business are familiar with it, and it integrates well with end-user tools like Power BI (even Excel) for building reports and dashboards. Gold also uses Delta Lake.
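To show the star-schema shape, here's a minimal sketch that splits cleansed silver rows into one dimension and one fact table. The column names are hypothetical, and a real Kimball build involves far more (surrogate key management, slowly changing dimensions, conformed dimensions), but the dimension/fact split is the essence:

```python
def build_star(silver_rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split cleansed rows into a customer dimension and a sales fact table.

    Deliberately tiny illustration of the star-schema shape, not a full
    Kimball implementation.
    """
    dim_customer: list[dict] = []
    facts: list[dict] = []
    keys: dict[str, int] = {}
    for row in silver_rows:
        name = row["customer"]
        if name not in keys:
            keys[name] = len(keys) + 1  # assign a surrogate key on first sight
            dim_customer.append({"customer_key": keys[name], "customer_name": name})
        # Facts carry the surrogate key plus the measures, nothing descriptive
        facts.append({"customer_key": keys[name], "amount": row["amount"]})
    return dim_customer, facts
```

Power BI (or even Excel) then joins the fact table to the dimension on the surrogate key - the pattern business users already know from classic data warehouses.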


The 'House' in Lakehouse


The Gold layer is interesting. In more classic BI implementations, this layer would reside in an RDBMS like SQL Server. Kimball is a great methodology to follow to create this layer, basing the model and data points on the business requirements. But in the Lakehouse it sits within Delta Lake, not a relational database. Power BI can still connect to it (via Databricks' SQL endpoints) as if it were a classic SQL Server database. So, is the SQL Server star-schema database (i.e. a Data Warehouse) redundant? I'd say not yet - not until the integration with end tools like Power BI is more mature. At some point, I think it could be, mainly because of 'modern data scopes'.


Why?


Modern data scopes can include far more sources and types of data than classic BI systems. Sure, operational systems like accountancy packages still provide highly structured data. But businesses need to compete in a technological, data-driven world, and that means bringing every weapon to the battle: analysing and leveraging more data sources, ranging from social media and streaming IoT devices like machine sensors, to third-party data sets, and even media like images of scanned documents. One of the drivers behind the creation of the Lakehouse was the ability to bring all of this together for end users. Can a classic RDBMS handle all this? I'd say not very well - unstructured data in particular.


Really?


Yes, really. The Lakehouse continues to grow in popularity, driven by its success in achieving its goals. No solution is perfect, however. Delta Lake has delivered ACID transactional writes, which were a major challenge for data engineers to implement in a data lake, where the guarantees of the classic RDBMS don't exist. Some challenges still remain: data integrity between entities, usability for end users, and - a big one - data management and governance. With such a broad modern data scope, keeping track of authoritative sources, data lineage, data dictionaries and other governance pieces is an equally broad undertaking.
