This article is also available in German.
Today’s data engineering has shifted from building monolithic data pipelines to modular data products.
A data product is the deliverable that contains everything around a business concept to fulfill a data consumer’s need:
- tables to actually store data
- code that transforms data
- tests to verify and monitor that data is correct
- output ports to make data accessible
- input ports to ingest data from source systems or access other data products
- data contracts to describe the API
- documentation
- metadata, such as ownership
A data product is usually managed in one Git repository. Databricks is one of the most popular modern data platforms, so how can we engineer a professional data product with Databricks?
In this article, we will use Data Contracts and the new Databricks Asset Bundles, which are a great fit for implementing data products. All source code of this example project is available on GitHub.
Define the Data Contract
Before we start implementing, let’s discuss and define the business requirements: What does our data consumer need from us, what is their use case, and what data model do they expect? We also need to make sure that we understand and share the same semantics, quality expectations, and expected service levels.
We call this approach contract-first. We start designing the interface of the provided data model and its metadata as a data contract. We use the data contract to drive the implementation.
In our example, the COO of an e-commerce company wants to know whether there is an issue with articles that have not been sold for a longer period, i.e., articles with no sale during the last three months, the so-called shelf warmers.
In collaboration with the data consumer, we define a data contract as YAML, using the Data Contract Specification.
The dataset will contain all articles that are currently in stock, including the last_sale_timestamp attribute, which is the most relevant one for the COO. The COO can easily filter in their BI tool (such as Power BI, Redash, …) for articles with a last_sale_timestamp older than three months. Terms and service level attributes make it clear that the dataset is updated daily at midnight.
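A trimmed-down sketch of such a data contract, following the Data Contract Specification (the IDs, owner, server details, and exact field set are illustrative assumptions):

```yaml
dataContractSpecification: 0.9.3
id: urn:datacontract:fulfillment:stock-last-sales
info:
  title: Stock Last Sales
  version: 1.0.0
  description: Current stock per article, enriched with the timestamp of its last sale.
  owner: fulfillment-team
servers:
  production:
    type: databricks
    host: dbc-xxxxxxxx.cloud.databricks.com
    catalog: acme_catalog
    schema: stock_last_sales
terms:
  usage: Reports and analytics on slow-moving articles (shelf warmers).
  limitations: Not suitable for real-time use cases.
models:
  stock_last_sales:
    type: table
    description: All articles that are currently in stock, with the timestamp of their last sale.
    fields:
      sku:
        type: string
        description: Stock keeping unit, the business key of an article.
        required: true
      quantity:
        type: long
        description: Number of items currently in stock.
      last_sale_timestamp:
        type: timestamp
        description: Timestamp of the last sale of this article. Null if the article has never been sold.
servicelevels:
  frequency:
    description: Data is updated daily at midnight.
```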
Create the Databricks Asset Bundle
Now it is time to develop a data product that implements this data contract. Databricks recently added the concept of Databricks Asset Bundles, which are a great fit to structure and develop data products. At the time of writing in March 2024, they are in Public Preview, meaning they are ready for production use.
Databricks Asset Bundles include all the infrastructure and code files to actually deploy data transformations to Databricks:
- Infrastructure resources
- Workspace configuration
- Source files, such as notebooks and Python scripts
- Unit tests
The Databricks CLI bundles these assets and deploys them to the Databricks platform; internally, it uses Terraform. Asset Bundles are well-integrated into the Databricks platform: for example, deployed code and jobs cannot be edited directly in Databricks, which enforces strict version control of all code and pipeline configuration.
Bundles are extremely useful when you have multiple environments, such as dev, staging, and production. You can deploy the same bundle to multiple targets with different configurations.
To create a bundle, let’s initialize a new one:
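The Databricks CLI walks us through the setup interactively:

```bash
databricks bundle init
```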
We use this configuration:
- Template to use: default-python
- Unique name for this project: stock_last_sales
- Include a stub (sample) notebook in stock_last_sales/src: yes
- Include a stub (sample) Delta Live Tables pipeline in stock_last_sales/src: no
- Include a stub (sample) Python package in stock_last_sales/src: yes
When we look into the bundle structure, these are the most relevant files:
- databricks.yml The bundle configuration and deployment targets
- src/ The folder for the transformation code
- tests/ The folder to place unit tests
- resources/ The job definition for the workflow
Note: We recommend maintaining an internal bundle template that incorporates the company’s naming conventions, global policies, best practices, and integrations.
With asset bundles, we can write our code locally in our preferred IDE, such as VS Code (using the Databricks extension for Visual Studio Code), PyCharm, or IntelliJ IDEA (using Databricks Connect).
To set up a local Python environment, we can use venv and install the development dependencies:
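For example (assuming the requirements-dev.txt that ships with the default-python template):

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements-dev.txt
```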
Generate Unity Catalog Table
How do we organize the data for our data product? In this example, we use Unity Catalog and store the data as managed tables. In terms of isolation, we decide that one data product corresponds to one schema in Unity Catalog.
We can leverage the data contract YAML to generate infrastructure code.
The model defines the table structure of the target data model. With the Data Contract CLI tool, we can generate the SQL DDL code for the CREATE TABLE statement.
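For example, with the Data Contract CLI:

```bash
datacontract export --format sql datacontract.yaml
```

This prints the CREATE TABLE DDL derived from the model defined in the contract.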
The Data Contract CLI tool is also available as a Python library, datacontract-cli. So let’s add it to requirements-dev.txt and use it directly in a Databricks notebook to actually create the table in Unity Catalog:
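A sketch of such a notebook cell, assuming the library exposes an export function for SQL (check the datacontract-cli documentation for the exact signature):

```python
# Notebook cell: create the Unity Catalog table from the data contract model.
# Assumes datacontract-cli is installed, e.g., via %pip install datacontract-cli.
from datacontract.data_contract import DataContract

data_contract = DataContract(data_contract_file="datacontract.yaml")

# Export the model as a CREATE TABLE statement (SQL DDL) ...
ddl = data_contract.export(export_format="sql")

# ... and execute it against Unity Catalog with the notebook's Spark session.
spark.sql(ddl)
```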
The Unity Catalog table is a managed table that internally uses the Delta format for efficient storage.
Develop Transformation Code
Now, let’s write the core transformation logic. With Python-based Databricks Asset Bundles, we can develop our data pipelines as:
- Databricks Notebooks,
- Delta Live Tables, or
- Python files
In this data product, we’ll write plain Python files for our core transformation logic that will be deployed as Wheel packages.
Our transformation takes all available stocks that we get from an input port, such as the operational system that manages the current stock data, and left-joins this dataframe with the latest sale timestamp for every sku. The sales information also comes from an input port, e.g., another upstream data product provided by the checkout team. We store the resulting dataframe in the previously generated table structure.
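A sketch of the core transformation logic (the input column name sale_timestamp and the exact input schemas are assumptions; sku and last_sale_timestamp follow the data contract):

```python
from pyspark.sql import DataFrame
import pyspark.sql.functions as F


def calculate_last_sales(stocks: DataFrame, sales: DataFrame) -> DataFrame:
    """Enrich the current stock with the timestamp of the last sale per article."""
    # Reduce the sales input to the latest sale timestamp per sku.
    last_sales = sales.groupBy("sku").agg(
        F.max("sale_timestamp").alias("last_sale_timestamp")
    )
    # Keep every article that is currently in stock; articles that have never
    # been sold end up with last_sale_timestamp = NULL.
    return stocks.join(last_sales, on="sku", how="left")
```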
With plain Python files, the code remains reusable, easy to test with unit tests, and we can run it on our local machines. As professional data engineers, we make sure that the calculate_last_sales() function works as expected by writing good unit tests.
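A minimal pytest sketch (the module path and the local SparkSession fixture are assumptions):

```python
from datetime import datetime

import pytest
from pyspark.sql import SparkSession

from stock_last_sales.transform import calculate_last_sales  # hypothetical module path


@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()


def test_articles_without_sales_have_no_last_sale_timestamp(spark):
    stocks = spark.createDataFrame([("sku-1", 5), ("sku-2", 3)], ["sku", "quantity"])
    sales = spark.createDataFrame(
        [("sku-1", datetime(2024, 1, 1, 12, 0))], ["sku", "sale_timestamp"]
    )

    result = {
        row["sku"]: row["last_sale_timestamp"]
        for row in calculate_last_sales(stocks, sales).collect()
    }

    assert result["sku-1"] == datetime(2024, 1, 1, 12, 0)
    assert result["sku-2"] is None
```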
We update the job configuration to run the Python code as a python_wheel_task and configure the scheduler and the appropriate compute cluster.
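A sketch of the job resource (the cluster spec, cron expression, and entry point are illustrative):

```yaml
# resources/stock_last_sales_job.yml
resources:
  jobs:
    stock_last_sales_job:
      name: stock_last_sales_job
      schedule:
        quartz_cron_expression: "0 0 0 * * ?"  # daily at midnight
        timezone_id: UTC
      tasks:
        - task_key: main_task
          job_cluster_key: job_cluster
          python_wheel_task:
            package_name: stock_last_sales
            entry_point: main
          libraries:
            - whl: ../dist/*.whl
      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            spark_version: 14.3.x-scala2.12
            node_type_id: i3.xlarge
            num_workers: 1
```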
When we are confident, we can deploy the bundle to our Databricks dev instances (manually for now):
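Assuming a dev target is defined in databricks.yml:

```bash
databricks bundle deploy --target dev
```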
And let’s trigger a manual run of our workflow:
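The job key below follows the resource definition above and is an assumption:

```bash
databricks bundle run --target dev stock_last_sales_job
```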
In Databricks, we can see that the workflow run was successful:
And we have data in the table that we created earlier.
Test the Data Contract
We are not quite finished with our task. How do we know that the data is correct? While we have unit tests that give us confidence in the transformation code, we also need an acceptance test to verify that we implemented the agreed data contract correctly.
For that, we can use the Data Contract CLI tool to make this check:
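A sketch of the check (the server key production must match the servers section of the contract; connection credentials are provided via environment variables, see the datacontract-cli documentation):

```bash
datacontract test --server production datacontract.yaml
```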
The datacontract tool takes all the schema and format information from the model, the quality attributes, and the metadata, and compares them with the actual dataset. It reads the connection details from the servers section, connects to Databricks, executes all the checks, and gives a comprehensive overview.
We want to execute this test with every pipeline run, so once again, let’s make a Notebook task for the test:
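A sketch of such a notebook cell (the has_passed() call follows the datacontract-cli Python API; check its documentation for details):

```python
# Notebook cell: acceptance test against the data contract.
# Assumes datacontract-cli is installed, e.g., via %pip install datacontract-cli.
from datacontract.data_contract import DataContract

data_contract = DataContract(
    data_contract_file="datacontract.yaml",
    server="production",
)
run = data_contract.test()

if not run.has_passed():
    raise Exception("Data contract validation failed")
```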
Deploy with CI/CD
To automatically test, deploy the Asset Bundle to Databricks, and finally run the job once, we set up a CI/CD pipeline in GitHub, using a GitHub Action.
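A sketch of such a workflow (the secret names, Python version, and job key are assumptions):

```yaml
# .github/workflows/deploy.yml
name: Deploy stock_last_sales bundle

on:
  push:
    branches: [main]

env:
  DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
  DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Run the unit tests on the GitHub runner
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - run: pip install -r requirements-dev.txt
      - run: pytest tests/

      # Deploy the bundle and trigger a run
      - uses: databricks/setup-cli@main
      - run: databricks bundle deploy --target dev
      - run: databricks bundle run --target dev stock_last_sales_job
```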
Now, every time we update the code, the Asset Bundle is automatically deployed to Databricks.
Publish Metadata
For others to find, understand, and trust data products, we want to register them in a data product registry.
In this example, we use Data Mesh Manager, a platform to register, manage, and discover data products, data contracts, and global policies.
Again, let’s create a notebook task (or Python code task) to publish the metadata to Data Mesh Manager and add the task to our workflow. We can use Databricks Secrets to make the API Key available in Databricks.
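A sketch of such a notebook cell (the secret scope, the API endpoint, and the payload file are hypothetical; consult the Data Mesh Manager API documentation for the exact paths and schema):

```python
# Notebook cell: publish the data product metadata to Data Mesh Manager.
import requests

# Read the API key from Databricks Secrets (scope and key names are assumptions).
api_key = dbutils.secrets.get(scope="datamesh-manager", key="api-key")

# A data product descriptor maintained in the repository (hypothetical file name).
with open("dataproduct.yaml") as f:
    data_product_yaml = f.read()

# Assumed endpoint; check the Data Mesh Manager API docs for the exact path.
response = requests.put(
    "https://api.datamesh-manager.com/api/dataproducts/stock_last_sales",
    headers={"x-api-key": api_key, "Content-Type": "application/yaml"},
    data=data_product_yaml,
)
response.raise_for_status()
```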
Conclusion
Now, the COO can connect to this table with a BI tool (such as Power BI, Tableau, Redash, or within Databricks) to answer their business question.
Databricks Asset Bundles are a great fit to develop professional data products on Databricks, as they bundle all the resources and configurations (code, tests, storage, compute, scheduler, metadata, …) that are needed to provide high-quality datasets to data consumers.
It is easy to integrate Data Contracts for defining the requirements and the Data Contract CLI to automate acceptance tests.
Find the source code for the example project on GitHub.