What is Microsoft Fabric? Part 6: Lakehouse integrates artificial intelligence and combines disparate data

You can read the previous parts of the blog series through the links below:

Part 1: From the past to the present
Part 2: Technology
Part 3: Performance
Part 4: Licenses
Part 5: Fabric is Microsoft’s favorite child and it is developed rapidly

As I write this, I have been working on real client projects with Fabric for over a year. In my world, the number of clients who have moved to Fabric is in the double digits. In retrospect, the biggest single change for a long-time Microsoft data warehouse professional like me is the component that collects and stores analytical data. That is, Lakehouse. How did this happen?

What is Lakehouse, and what problem does it solve?

As the name suggests, Lakehouse is lake & house. Lake refers to a part that is comparable to the hard disk of a workstation, for example. Folders and files. House refers to the formal (database-like) part that stores data in a table structure. This means that data is stored in rows and columns for querying and further processing, sums, calculations, averages, etc.
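
As a minimal sketch of the two sides, here is what the round trip from Files to Tables could look like in a Fabric notebook attached to a Lakehouse. The file path and column names (sales.csv, amount) are made up for illustration:

```python
# In a Fabric notebook the `spark` session is pre-defined and the default
# Lakehouse is reachable through relative Files/ and table paths.

# "Lake": read a raw file from the Files section (folders and files)
df = spark.read.option("header", True).csv("Files/raw/sales.csv")

# "House": store the same data as a Delta table, in rows and columns
df.write.format("delta").mode("overwrite").saveAsTable("sales")

# Sums, averages and other further processing now work with plain SQL
spark.sql("SELECT SUM(amount) AS total_sales FROM sales").show()
```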

When considering a traditional on-premises data warehouse or more modern cloud storage, there is always a place for file-based data (xls, csv, etc.). In the overall architecture of a data warehouse, SQL Server is usually supported by various file share solutions, and Azure SQL by Azure Data Lake, among others. The focus has been on the database, but the result has been redundancy: data in two places, behind two different sets of access rights, and in different components. Keeping these in sync has usually been a challenge, and maintenance and development have required separate people. Lakehouse does both and provides a flexible way to handle more than just structured data.

Why ignore Lakehouse?

Relational databases at the heart of the data warehouse held their ground for decades. Challengers appeared steadily over the years, and I myself had grown used to the pattern: the challengers (Hadoop, data lakes, in-memory data models, document databases, and so on) arrived provocatively with trumpets blaring, and once the dust had settled, the familiar relational databases (from a Microsoft perspective, SQL Server, Azure SQL or Synapse) were left standing on the battlefield as the winners.

As is well known, Fabric offers several options for storing data. Fabric’s Warehouse is a logical continuation of Microsoft’s databases. Against this history, skipping Lakehouse and reaching for Warehouse happens almost from muscle memory. Lakehouse, however, should by no means be dismissed as an option without good reason.

Why was Lakehouse included in Fabric?

Lakehouse is not Microsoft’s invention. The unofficial story is that a group of students at UC Berkeley developed an in-memory and much more powerful version of Hadoop (Apache Spark), and added an even more powerful storage method (Delta Lake). These pieces of the puzzle later became Databricks (the company and the product) and the term Lakehouse. Today, Lakehouse is part of both Databricks and Fabric.

Spark has appeared in Microsoft’s products before (Synapse Analytics, HDInsight), but it came into the spotlight with Fabric.

Lakehouse is often associated with the term “Big Data”. If Big Data is any data that, because of its size or format, does not fit into an Excel spreadsheet, then you get a good idea of the kind of data Lakehouse shines with.

Power BI reporting and analysis has been the driving force behind Microsoft’s data products for several years. As many as 97% of Fortune 500 companies use Power BI, and when it comes to Big Data, as many as 80% use Spark. Of course, I don’t know why the decision was made to include Lakehouse in Fabric, but it’s logical that Microsoft Fabric brings these success stories together in one package.

The analytic data architecture in Fabric

Let’s talk about storing or moving data. Fabric offers various alternatives. However, we can’t choose them all for client projects. Instead, we need to decide from the outset which components are best suited to the client.

According to current recommendations, the path from operational data to analytical data, including reports, consists of three phases: 

Raw operational data = Bronze 
Cleaned and processed data = Silver
Data ready for business use = Gold

A client of mine summarized this so-called medallion architecture as follows: “the medals describe how Dante’s hell of operational data is transformed into a paradise of analytical data”. 

And that’s what it’s all about. To enable this in projects, different Fabric components are chosen to implement the different medals.
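
As a rough, hedged sketch of the idea, the three medals could be implemented as Delta tables in a single Lakehouse notebook. All table and column names below are invented for illustration:

```python
from pyspark.sql import functions as F

# Bronze: land the raw operational data as-is
bronze = spark.read.option("header", True).csv("Files/landing/orders.csv")
bronze.write.format("delta").mode("append").saveAsTable("bronze_orders")

# Silver: clean and type the data
silver = (
    spark.table("bronze_orders")
    .dropDuplicates(["order_id"])
    .withColumn("order_date", F.to_date("order_date"))
    .filter(F.col("amount").isNotNull())
)
silver.write.format("delta").mode("overwrite").saveAsTable("silver_orders")

# Gold: business-ready aggregates for reporting
gold = silver.groupBy("customer_id").agg(F.sum("amount").alias("total_sales"))
gold.write.format("delta").mode("overwrite").saveAsTable("gold_customer_sales")
```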

Why does Lakehouse seem to win medals like the Norwegian national ski team? 

Lakehouse provides the best and most agile solution for housing raw operational data, because some of the source data is still file-based and an increasing amount of it is semi-structured JSON from REST APIs.

In practice, this means that Lakehouse is the natural choice for Bronze data. Silver and Gold, on the other hand, have a long history in relational databases. Fabric offers alternatives to Lakehouse in the form of Warehouse and SQL database (preview). Warehouse is based on the same Delta format as Lakehouse. Both have T-SQL endpoints for querying. What sets Lakehouse apart is the Files section and the Big Data capabilities of the Spark engine; what sets Warehouse apart is the T-SQL insert and update support familiar from relational databases.
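
For example, the semi-structured JSON mentioned above could first land untouched in the Files section and only then be flattened into a Bronze Delta table with Spark. A sketch under those assumptions; the API URL, paths and field names are placeholders:

```python
import json
import os

import requests

# Hypothetical source API; real sources add authentication and paging
response = requests.get("https://api.example.com/v1/invoices", timeout=30)
response.raise_for_status()

# 1) Land the raw payload as a file (the default Lakehouse is mounted
#    at /lakehouse/default in a Fabric notebook)
raw_path = "/lakehouse/default/Files/bronze/invoices/invoices.json"
os.makedirs(os.path.dirname(raw_path), exist_ok=True)
with open(raw_path, "w", encoding="utf-8") as f:
    json.dump(response.json(), f)

# 2) Read the JSON with Spark and store it as a Bronze Delta table
invoices = spark.read.option("multiLine", True).json("Files/bronze/invoices/")
invoices.write.format("delta").mode("overwrite").saveAsTable("bronze_invoices")
```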

The migration path from previous data warehouse solutions to an SQL Database component is usually short for Silver and Gold. However, I think that clients should listen to the rationale suppliers give for offering SQL Database for Silver and Gold. SQL Database is not designed for analytical data but rather as a database for applications (https://learn.microsoft.com/en-us/fabric/database/sql/overview). Warehouse, on the other hand, is designed to store analytical data, and its technology is like a hybrid of Lakehouse and SQL Database. In real-life Fabric projects, the benefits of Lakehouse always seem to come through. This leaves me wondering if clients’ needs have been aligned with Lakehouse over time, or if we’ve been driving screws into the wall with a hammer because it worked better than the screwdrivers in our toolbox.

Fabric is a data platform born in the AI era. One sign of a changed world is that when the Silver layer is being processed in Lakehouse, the AI team colleagues are doing their work in the same orchestrated data process and even in the same notebooks. With Lakehouse, separate projects by different experts become shared tasks.

If Bronze and Silver are in Lakehouse, it is worth considering whether the technical stack should be expanded by selecting a different type of component for Gold. For practical reasons, however, Lakehouse appears to be a viable option for Gold as well. Continuing with Lakehouse also keeps it flexible to export Gold-level data (with notebooks) to the APIs of external systems, among other things. Even data exported from the system has a logical place in Lakehouse. Details may vary, but in the big picture, Lakehouse appears to be the most cost-effective data warehouse.
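
A sketch of that kind of export from a notebook; the Gold table comes from the earlier sketch and the target API is entirely hypothetical:

```python
import requests

# Read the business-ready Gold table and push it to an external system.
# Endpoint, authentication and payload shape are placeholders.
rows = spark.table("gold_customer_sales").collect()

for row in rows:
    payload = {
        "customerId": row["customer_id"],
        "totalSales": row["total_sales"],
    }
    resp = requests.post(
        "https://erp.example.com/api/sales-summary", json=payload, timeout=30
    )
    resp.raise_for_status()
```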

And as Lakehouse has an SQL endpoint, you can still use your existing skills. I still use Management Studio and T-SQL every day to crawl through data and design models. However, the component behind it all has mostly moved from a database to Lakehouse. And that’s a good thing.
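
The same endpoint can also be used from code outside Fabric. A sketch with pyodbc, assuming Microsoft Entra authentication; the server and database names are placeholders you would copy from the Lakehouse’s SQL analytics endpoint settings, and dbo.sales is the hypothetical table from the first sketch:

```python
import pyodbc

# Connection details are placeholders; copy the real server name from the
# Lakehouse's SQL analytics endpoint. Note that the endpoint is read-only T-SQL.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=your-endpoint.datawarehouse.fabric.microsoft.com;"
    "Database=YourLakehouse;"
    "Authentication=ActiveDirectoryInteractive;"
    "Encrypt=yes;"
)

row = conn.cursor().execute(
    "SELECT SUM(amount) AS total_sales FROM dbo.sales"
).fetchone()
print(row[0])
```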

Read more 

Part 1: From the past to the present
Part 2: Technology
Part 3: Performance
Part 4: Licenses
Part 5: Fabric is Microsoft’s favorite child and it is developed rapidly
Knowledge management and business intelligence  


Jani Laitala

I work at Pinja as a data architect, implementing both data warehouse solutions and reporting. I am interested in the information in the data and I want to communicate it with visuals. In our free time, our family is busy with both culture and sports.

Read more from this author