Solution ideas
This article describes a solution idea. Your cloud architect can use this guidance to help visualize the major components for a typical implementation of this architecture. Use this article as a starting point to design a well-architected solution that aligns with your workload's specific requirements.
This article describes how small and medium businesses (SMBs) can build a modern data platform architecture by combining existing investments in Azure Databricks with a fully managed software as a service (SaaS) data platform such as Microsoft Fabric. SaaS data platforms are end-to-end data analytics solutions that integrate with tools like Azure Machine Learning, Foundry Tools, Power Platform, Microsoft Dynamics 365, and other Microsoft technologies.
Simplified architecture
Download a Visio file of this architecture.
The interoperability between Azure Databricks and Fabric provides a robust solution that minimizes data fragmentation while enhancing analytical capabilities.
Fabric provides an open and governed data lake, called OneLake, as the underlying SaaS storage. OneLake and Azure Databricks both use the Delta Parquet format. To access your Azure Databricks data from OneLake, you can mirror the Azure Databricks Unity Catalog in Fabric to integrate data without replication or data movement. With this integration, you can augment your Azure Databricks analytics systems with generative AI on top of OneLake.
You can also use Direct Lake mode in Power BI on your Azure Databricks data in OneLake. Direct Lake mode simplifies the serving layer and improves report performance. OneLake supports APIs for Azure Data Lake Storage and stores all tabular data in Delta Parquet format.
As a result, Azure Databricks notebooks can use OneLake endpoints to access the stored data. The experience is the same as accessing the data through a Fabric warehouse. With this integration, you can use Fabric or Azure Databricks without reshaping your data.
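As a sketch of this interoperability, an Azure Databricks notebook can address a OneLake lakehouse table through the documented ADLS-compatible URI pattern `abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<item>.<itemtype>/…`. The workspace, lakehouse, and table names below are hypothetical, and the actual Spark read is shown only as a comment because it requires a configured cluster:

```python
# Sketch: reading a Fabric OneLake Delta table from an Azure Databricks
# notebook via the ADLS-compatible endpoint. The workspace ("sales-ws"),
# lakehouse ("sales_lh"), and table ("orders") names are hypothetical.

def onelake_table_uri(workspace: str, lakehouse: str, table: str) -> str:
    """Build the documented OneLake ABFSS URI for a lakehouse table."""
    return (
        f"abfss://{workspace}@onelake.dfs.fabric.microsoft.com/"
        f"{lakehouse}.Lakehouse/Tables/{table}"
    )

uri = onelake_table_uri("sales-ws", "sales_lh", "orders")
print(uri)

# In a Databricks notebook with OneLake access configured, you would then
# read the table with Spark:
# df = spark.read.format("delta").load(uri)
```

Because OneLake stores tabular data in Delta Parquet format, the same URI works whether the table was written by Fabric or by Azure Databricks.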
Architecture
Download a Visio file of this architecture.
Data flow
The following data flow corresponds to the previous diagram:
Use existing Azure Data Factory pipelines to ingest structured and unstructured data from source systems and land it in the existing data lake.
You can use Azure Synapse Link or Microsoft Fabric Link to ingest Microsoft Dynamics 365 data sources and build centralized BI dashboards on augmented datasets. Bring the fused, processed data back into Microsoft Dynamics 365 and Power BI for further analysis.
Streaming data can be ingested through Azure Event Hubs or Azure IoT Hub, depending on the protocols that send these messages.
In the cold path, you can use Azure Databricks to bring the streaming data into the centralized data lake for further analysis, storage, and reporting. This data can then be unified with other data sources for batch analysis.
In the hot path, you can analyze data in real time and create real-time dashboards through Microsoft Fabric Real-Time Intelligence.
You can use the existing Azure Databricks notebooks to perform data cleansing, unification, and analysis. Consider using a medallion architecture with layers such as:
- Bronze, which holds raw data.
- Silver, which contains cleaned, filtered data.
- Gold, which stores aggregated data that's useful for business analytics.
For gold-layer data or a data warehouse, continue to use Azure Databricks SQL, or mirror the Azure Databricks Unity Catalog in Fabric. To enable reporting and analytics on a Fabric lakehouse, create a semantic model explicitly and build Power BI dashboards by using Direct Lake or DirectQuery for high performance. For more information, see Semantic models in Fabric.
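The medallion layering described above can be sketched in plain Python. In practice these layers would be Delta tables transformed by Spark in Azure Databricks notebooks; the records and field names here are hypothetical and serve only to show how each layer refines the previous one:

```python
# Plain-Python illustration of the bronze/silver/gold medallion layers.
# Bronze holds raw data as landed, including malformed rows.
bronze = [
    {"order_id": 1, "region": "west", "amount": "120.50"},
    {"order_id": 2, "region": "west", "amount": "bad-value"},
    {"order_id": 3, "region": "east", "amount": "80.00"},
]

def to_silver(rows):
    """Silver: cleanse and filter -- drop rows whose amount doesn't parse."""
    out = []
    for r in rows:
        try:
            out.append({**r, "amount": float(r["amount"])})
        except ValueError:
            continue  # drop (or quarantine) malformed records
    return out

def to_gold(rows):
    """Gold: aggregate to business-ready totals per region."""
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["amount"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'west': 120.5, 'east': 80.0}
```

The gold output is what a semantic model or Power BI dashboard would typically consume.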
The following tools are used for governance, collaboration, security, performance, and cost monitoring.
Discover and govern:
Microsoft Purview provides data discovery services, sensitive data classification, and governance insights across the data estate.
Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Azure Databricks workspaces.
Platform resources:
Microsoft Entra ID provides single sign-on (SSO) for Azure Databricks users. Azure Databricks supports automated user provisioning with Microsoft Entra ID to:
- Create new users.
- Assign each user an access level.
- Remove users and deny them access.
Microsoft Cost Management provides financial governance services for Azure workloads.
Azure Key Vault manages secrets, keys, and certificates.
Azure Monitor collects and analyzes Azure resource telemetry. This service maximizes performance and reliability by proactively identifying problems.
Microsoft Defender for Cloud provides security posture management and threat protection for Azure resources and workloads.
Azure DevOps provides continuous integration and continuous deployment (CI/CD) and other integrated version control features.
GitHub provides version control and collaborative development capabilities for managing code and deployment pipelines.
Components
Data Lake Storage is a scalable data storage service designed for structured and unstructured data. In this architecture, Data Lake Storage serves as the underlying infrastructure for the Delta Lake. It's the primary storage layer for raw and processed data, which enables efficient data ingestion, storage, and retrieval for analytics and machine learning workloads.
Data Factory is a cloud-based data integration service that orchestrates and automates data movement and transformation. In this architecture, Data Factory creates, schedules, and orchestrates data pipelines that move and transform data across various data stores and services.
Event Hubs is a real-time data ingestion service that can process millions of events per second from any source. In this architecture, Event Hubs captures and streams large volumes of data from various sources to enable real-time analytics and event-driven processing.
IoT Hub is a managed service that enables secure and reliable communication between Internet of Things (IoT) devices and the cloud. In this architecture, IoT Hub facilitates the ingestion, processing, and analysis of telemetry data from IoT devices to provide real-time insights and enable remote monitoring.
Microsoft Dataverse is a scalable data platform that organizations can use to help securely store and manage data that business applications use. In this architecture, it serves as a data source that feeds into the analytics pipeline via Azure Synapse Link or Microsoft Fabric Link.
Azure Synapse Link is a data integration feature that connects Dynamics applications with either Azure Synapse Analytics or Data Lake Storage. In this architecture, it copies data in near real time from Dataverse to Data Lake Storage.
Microsoft Fabric Link is a data integration feature that connects Dynamics applications to Fabric. In this architecture, it replicates data from Dataverse to Fabric in near real time.
Azure Databricks is an Apache Spark-based analytics platform for big data processing, machine learning, and data engineering. In this architecture, it performs data cleansing, transformation, and analysis by using medallion architecture layers.
Delta Lake is an open-source storage layer that brings atomicity, consistency, isolation, and durability (ACID) transactions to Spark and big data workloads. In this architecture, Delta Lake enhances data reliability and performance within the data lake.
Azure Databricks SQL is a SQL-based analytics service that enables users to run SQL queries on data stored in Azure Databricks. In this architecture, Azure Databricks SQL provides a powerful SQL interface to query and analyze data, which enables interactive analytics.
AI and machine learning encompass a range of technologies and services that enable the development, deployment, and management of machine learning models. In this architecture, AI and machine learning services build, train, and deploy predictive models. This capability enables data-driven decision-making.
Unity Catalog is a data governance solution that provides centralized access control, auditing, lineage, and data discovery capabilities across Azure Databricks workspaces. In this architecture, Unity Catalog helps ensure data governance and security by providing fine-grained access controls, auditing, and data lineage tracking.
Medallion lakehouse architecture is a data architecture pattern that organizes data into bronze, silver, and gold layers for efficient data processing and analytics. In this architecture, it structures data processing workflows by using Data Lake Storage, Delta Lake, and Azure Databricks to support scalable analytics.
Fabric is a comprehensive data platform that integrates various data services and tools to provide a seamless data management and analytics experience. In this architecture, Fabric connects and integrates data from multiple sources, which enables comprehensive data analysis and insights across the organization.
Real-Time Intelligence is a data processing capability that enables organizations to ingest, process, and analyze data in real time. Real-Time Intelligence processes streaming data from various sources. In this architecture, it provides real-time insights and enables automated actions based on data patterns.
OneLake shortcuts create an in-place link between OneLake and other data sources. In this architecture, they simplify data access and management, and provide a unified view of data across the organization.
Fabric Copilot is an AI-powered assistant integrated across Fabric workloads. It uses large language models (LLMs) to help users interact with data by using natural language. It simplifies tasks such as generating SQL, DAX, and transformations, and it creates reports or dashboards. Copilot supports conversational context, creates visualizations, and helps build analytics pipelines. It helps organizations accelerate data insights and optimize workflows without requiring deep coding expertise.
A Fabric data agent is an intelligent, LLM-based service in Fabric that organizations use to query and analyze data across multiple sources, including lakehouses, warehouses, semantic models, KQL databases, and mirrored databases, through a single interface. It supports complex multiple-step queries, applies custom logic through example queries and agent or data-source instructions, and publishes to Microsoft 365 Copilot or Teams. It provides business users with secure, governed access to enterprise data in natural language.
Power BI is a business analytics service that provides interactive visualizations and business intelligence (BI) capabilities. In this architecture, Power BI visualizes data from Fabric and Azure Databricks by using Direct Lake mode for improved performance.
Microsoft Purview is a unified data governance service that helps organizations manage and govern their data across various sources. In this architecture, it catalogs data, tracks lineage, and enforces compliance across the data estate. You can integrate Unity Catalog into Purview to access Unity Catalog metadata from Purview.
Microsoft Entra ID is a cloud-based identity and access management solution that helps ensure secure sign-ins and access to resources like Microsoft 365, Azure, and other SaaS applications. In this architecture, Microsoft Entra ID provides secure identity and access management for Azure resources. This feature enables secure sign-ins, manages user identities, and helps ensure authorized access to data and resources.
Cost Management is a suite of FinOps tools that organizations can use to analyze, monitor, and optimize Microsoft Cloud costs. In this architecture, these tools provide financial governance over Azure resources.
Key Vault is a cloud service that stores and manages secrets, such as API keys, passwords, certificates, and cryptographic keys. In this architecture, Azure Databricks can retrieve secrets from Key Vault to authenticate and access Data Lake Storage, which ensures secure integration.
Azure Monitor is a monitoring service that provides full-stack observability for applications, infrastructure, and networks. Azure Monitor enables users to collect, analyze, and act on telemetry data from their Azure and on-premises environments. In this architecture, Azure Monitor ensures performance and reliability by proactively identifying problems.
Defender for Cloud is a cloud-native application protection platform that provides security posture management and threat protection across Azure, hybrid, and multicloud environments. In this architecture, Defender for Cloud secures data platforms and workloads by identifying vulnerabilities, detecting threats, and providing security recommendations across Azure resources.
Azure DevOps is a set of development tools that support a collaborative culture and streamlined processes. These tools enable developers, project managers, and contributors to develop software more efficiently. Azure DevOps provides integrated features such as Azure Boards, Azure Repos, Azure Pipelines, Azure Test Plans, and Azure Artifacts. You can access these features through a web browser or an integrated development environment client. In this architecture, Azure DevOps supports automated deployment and version control for data pipelines and notebooks.
GitHub is a cloud-based Git repository hosting service that simplifies version control and collaboration for developers. Individuals and teams can store and manage their code, track changes, and collaborate on projects. In this architecture, GitHub integrates with Azure DevOps to enforce automation and compliance in development workflows and deployment pipelines for Data Factory, Azure Databricks, and Fabric.
Alternatives
To create an independent Fabric environment, see Greenfield lakehouse on Fabric.
To migrate an on-premises SQL analytics environment to Fabric, see Modern data warehouses for SMBs.
Service alternatives within this architecture
Batch ingestion
- Optionally, use data pipelines in Fabric for data integration instead of Data Factory pipelines. The right choice depends on factors such as your existing pipeline investments and feature requirements. For more information, see Differences between Azure Data Factory and Fabric Data Factory.
Microsoft Dynamics 365 ingestion
If you use Data Lake Storage as your data lake storage and want to ingest Dataverse data, use Azure Synapse Link for Dataverse with Data Lake Storage. For Dynamics 365 Finance and Operations apps, see Choose finance and operations data in Azure Synapse Link for Dataverse.
If you use a Fabric lakehouse as your data lake storage, see Link your Dataverse environment to Fabric.
Streaming data ingestion
- The decision between IoT Hub and Event Hubs depends on the source of the streaming data, whether you need device management and bidirectional communication with the reporting devices, and the required protocols. For more information, see Compare IoT Hub and Event Hubs.
Lakehouse
- A Fabric lakehouse is a unified data architecture platform for managing and analyzing structured and unstructured data in an open format that primarily uses Delta Parquet files. It supports two storage types: managed tables (such as CSV, Parquet, or Delta) and unmanaged files. Managed tables are automatically recognized, but unmanaged files require explicit table creation. The platform enables data transformations via Spark or SQL endpoints and integrates with other Fabric components, which allows data sharing without duplication. This concept aligns with the common medallion architecture that's used in analytic workloads. For more information, see Lakehouse in Fabric.
Real-time analytics
Azure Databricks
- If you have an existing Azure Databricks solution, you might want to continue to use Spark structured streaming for real-time analytics. For more information, see Streaming on Azure Databricks.
Fabric
If you previously used other Azure services for real-time analytics or have no existing real-time analytics solution, see Real-time Intelligence versus Azure streaming solutions.
Fabric structured streaming uses Spark structured streaming to process and ingest live data streams as continuously appended tables. Structured streaming supports various file sources, like CSV, JSON, ORC, and Parquet, and messaging services like Kafka and Event Hubs. This approach provides scalable and fault-tolerant stream processing that's well suited to high-throughput production environments. For more information, see Data streaming into a lakehouse with Spark.
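The core idea behind the "continuously appended table" model can be illustrated in plain Python: each micro-batch of incoming events is appended to a result table that downstream queries always see in a consistent, append-only state. Real Fabric or Azure Databricks code would use `spark.readStream`/`writeStream` instead; the event fields below are hypothetical:

```python
# Toy illustration of structured streaming's append-only table model.
result_table = []

def process_micro_batch(batch, table):
    """Append one micro-batch of events to the continuously growing table."""
    table.extend(batch)
    return len(table)  # rows visible to downstream queries after this batch

# Two micro-batches arriving from a source such as Event Hubs or Kafka.
batch_1 = [{"device": "d1", "temp": 21.5}, {"device": "d2", "temp": 19.0}]
batch_2 = [{"device": "d1", "temp": 22.1}]

process_micro_batch(batch_1, result_table)
total = process_micro_batch(batch_2, result_table)
print(total)  # 3 rows in the appended table
```

Fault tolerance in the real engine comes from checkpointing each micro-batch, so a restart resumes from the last committed batch rather than reprocessing the stream.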
Data engineering
- Use Fabric or Azure Databricks to write Spark notebooks. For more information, see Use Fabric notebooks. To learn how Fabric notebooks compare to what Azure Synapse Spark provides, see Compare Fabric Data Engineering and Azure Synapse Spark. For more information about Azure Databricks notebooks, see Introduction to Azure Databricks notebooks.
Data warehouse or gold layer
- You can use either Fabric or Azure Databricks to create a SQL-based warehouse or gold layer. For a decision guide on how to choose a data warehouse or gold layer storage solution within Fabric, see Choose a data store. For more information about SQL warehouse types in Azure Databricks, see SQL warehouse types.
Data science
Use either Fabric or Azure Databricks for data science capabilities. For more information about the Fabric Data Science offering, see Data Science in Fabric. For more information about the Azure Databricks offering, see AI and machine learning on Azure Databricks.
Fabric Data Science differs from Azure Machine Learning. Azure Machine Learning provides a comprehensive solution for managing workflows and deploying machine learning models, while Fabric Data Science is tailored to analysis and reporting scenarios.
Power BI
Azure Databricks integrated with Power BI enables data processing and visualization. For more information, see Connect Power BI to Azure Databricks.
By mirroring Azure Databricks Unity Catalog in Fabric, you can access data that Azure Databricks Unity Catalog manages directly from the Fabric workload. For more information, see Mirror Azure Databricks Unity Catalog. You can query this data from Power BI in Direct Lake mode without copying the data into the Power BI service.
Scenario details
SMBs that have an existing Azure Databricks environment, and optionally, a lakehouse architecture, can benefit from this pattern. They currently use an Azure extract, transform, load (ETL) tool such as Data Factory and serve reports in Power BI. However, they might also have multiple data sources that use different proprietary data formats on the same data lake, which leads to data duplication and vendor lock-in concerns. This situation can complicate data management and increase dependency on specific vendors. They might also require up-to-date and near real-time reporting for decision-making and want to adopt AI tools across their environment.
Fabric is an open, unified, and governed SaaS foundation that you can use to:
Centralize data in OneLake to store, manage, and analyze data in a single location without vendor lock-in concerns.
Innovate faster with integrations to Microsoft 365 apps.
Gain rapid insights with the benefits of Power BI Direct Lake mode.
Benefit from Copilot in every Fabric experience.
Accelerate analysis by developing AI models on a single foundation.
Keep data in place without movement, which reduces the time that data scientists need to provide value.
Contributors
Microsoft maintains this article. It was originally written by the following contributors.
Principal authors:
- Naren Jogendran | Cloud Solution Architect
- Bonita Rui | Cloud Solution Architect
Next steps
- Learning paths for data engineers
- Fabric - Get started with Microsoft Learn
- Fabric - Microsoft Learn modules
- Create a storage account for Data Lake Storage
- Event Hubs quickstart - Create an event hub by using the Azure portal
- What is the medallion lakehouse architecture?
- What is a lakehouse in Fabric?