This blog post is also available in German.
What is a data product?
The term data product arises in the context of data mesh architecture.
In the data mesh approach, a team takes ownership of their analytical data. Instead of pouring all data into a single data lake to be analyzed by a central team, each team makes their own data available as data products.
A data product is a set of information relevant for analysis purposes. Initially, the format is of secondary importance.
A data product should meet the following criteria:
- Presentation of business data with description.
- Sufficiently current.
- Easy to access.
- Tested (high probability that correct information is being displayed).
- Designed on the basis of use cases or scenarios (no dumb copy of all data).
- Actively managed by the team supplying the data.
Data products might take the following forms:
- A set of tables provided in a service such as Google BigQuery.
- A file (JSON, CSV, or the like) that is provided regularly (e.g. in an S3 bucket).
- In principle, also some kind of report with tables or graphs.
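To make the first variant concrete, here is a minimal sketch of publishing such a table in BigQuery with a description attached, using the google-cloud-bigquery Python client. All project, dataset, table, and column names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical location: "my-project.product_data.products" stands in
# for the team's actual data product.
schema = [
    bigquery.SchemaField("product_id", "STRING",
                         description="Unique product identifier"),
    bigquery.SchemaField("status", "STRING",
                         description="Lifecycle status, e.g. DRAFT or ONLINE"),
    bigquery.SchemaField("updated_at", "TIMESTAMP",
                         description="Time of the last update"),
]

table = bigquery.Table("my-project.product_data.products", schema=schema)
table.description = "Product data including status information, refreshed daily."
client.create_table(table)
```

The tooling matters less than the result: a table with a defined, documented schema that the owning team actively manages.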
Just one more requirement?
Data products are made available to other stakeholders. A team owns its own data, but stakeholders need this data for analyses.
Waiting until the first requirements are submitted would be one option. Considering data products right from the beginning would be another.
This article shows the benefits of data products from the perspective of what they offer your own team.
Starting out without data products
Let’s begin here with a short story taken from real life. A project involved the provision of product data. This was to take place in multiple steps. The first step was to create a proxy that would read and transform the existing product data so that it could be made available in a new format. The next step was then to replace the existing product data sources. Ideally, the users of the product data would not notice a thing.
Interviews with subject matter and system experts were held in advance. Even a developer with experience in the previous system landscape was actively involved in the development. Hypotheses were suggested, and an attempt was made to test them with random sampling.
The go-live was divided into several steps, and each step raised new questions and problems. The consequence: the number of open questions grew, and with it the workload required to answer them and carry out the necessary data research.
The research typically involved the following tools:
- SQL queries in the databases of the involved software systems that the team was responsible for (see the sketch after this list).
- Individual research via the user interface of the team’s own software systems.
- Existing interfaces to the “old” software systems.
- Delegation of questions to system experts in the “old” software systems.
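To give an idea of what this manual research looked like, here is a rough sketch of a random-sampling script. SQLite serves as a stand-in for the team's own database; the table and column names are made up.

```python
import random
import sqlite3

# SQLite stands in for the team's actual product database;
# table and column names are made up.
conn = sqlite3.connect("products.db")

def draw_sample(n=20):
    """Draw a random sample of product IDs for manual spot checks."""
    ids = [row[0] for row in conn.execute("SELECT product_id FROM products")]
    return random.sample(ids, min(n, len(ids)))

for product_id in draw_sample():
    row = conn.execute(
        "SELECT product_id, status, updated_at FROM products"
        " WHERE product_id = ?",
        (product_id,),
    ).fetchone()
    print(row)  # compared by hand against the "old" systems
```

Every result still had to be checked against the “old” systems by a person, which is exactly why this approach did not scale.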
Over time, the workload just kept growing.
- In some cases, the “old” software systems did not offer any ability to check data in bulk.
- The number of random samples used for testing kept increasing in order to identify patterns.
- The first scripts were written, but these could not be used by everyone in the team.
- The number of inquiries continued to increase.
This tied up a growing amount of the team’s time.
One particular challenge: External data was required to evaluate the priority of the discovered problems. Product data for articles with high stock levels naturally has to be checked more quickly than for articles with low stock levels.
If we fail to sufficiently consider from the start how we want to analyze our data, we will pay the price later.
How data products have helped
The first data products were created in order to deal with the problems.
- Updates from the anti-corruption layer [1]
- Product data, including status information
- Violations of product data quality
The first tables in Google BigQuery were built. These finally made it possible to carry out quantitative analyses. By comparing the update messages with the product data, it was possible to check whether the data was stored correctly in the product information system. It was also possible to check whether unusual updates were being sent from the “old” systems.
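A sketch of what such a comparison might look like as a query, assuming hypothetical table names for the update messages and the stored product data (and, for simplicity, one latest update per product):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical tables: update messages from the anti-corruption layer
# vs. the state stored in the product information system. Simplified:
# assumes the updates table holds one (latest) update per product.
query = """
    SELECT u.product_id,
           u.status AS update_status,
           p.status AS stored_status
    FROM `my-project.product_data.acl_updates` AS u
    JOIN `my-project.product_data.products` AS p USING (product_id)
    WHERE u.status != p.status
"""

for row in client.query(query).result():
    print(row.product_id, row.update_status, row.stored_status)
```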
To permit more detailed analyses, product data from the “old” system was integrated in the next step. Fortunately, this data was also made available as a data product. At this point, a majority of the product data journeys could be represented.
Typical questions were:
- Does the status in the source system match the expected status in the product information system?
- Are the received updates consistent with the data from the source system?
The critical point was quantitative comparability.
- How many products with status X are there in the source system?
- How many data records in the “old” and new systems correspond to status Y?
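Once both sides exist as tables, questions like these shrink to a single query. A sketch with hypothetical table names:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical tables: the "old" source system's data product vs. the
# new product information system.
query = """
    SELECT 'source_system' AS origin, status, COUNT(*) AS records
    FROM `my-project.legacy.source_products`
    GROUP BY status
    UNION ALL
    SELECT 'product_information_system', status, COUNT(*)
    FROM `my-project.product_data.products`
    GROUP BY status
    ORDER BY status, origin
"""

for row in client.query(query).result():
    print(row.status, row.origin, row.records)
```

Any status whose counts diverge between the two origins is a candidate for closer research.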
The product owner of the team now also had the ability to provide support in the form of research and analysis. Piece by piece, the first reports were built in order to carry out analyses. It was now also possible to use BI tools for this, such as Microsoft Power BI and Google Looker.
- Hypotheses could be tested more accurately in this way than with only random sampling.
- Analyses could be carried out on data spanning multiple systems.
- Key figures could be developed based on the data.
- Tests could be integrated to check whether product data was online (see the sketch after this list).
- Reports could be made available to subject matter experts for their own research.
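For the integrated tests mentioned in this list, here is a minimal sketch of one such check, written as a pytest-style test against hypothetical tables: it fails if any product marked as online in the source system is missing from the data product.

```python
from google.cloud import bigquery

def test_no_online_products_missing():
    """Fail if products marked ONLINE in the source system are missing
    from the product data product. All table names are hypothetical."""
    client = bigquery.Client()
    query = """
        SELECT COUNT(*) AS missing
        FROM `my-project.legacy.source_products` AS s
        LEFT JOIN `my-project.product_data.products` AS p USING (product_id)
        WHERE s.status = 'ONLINE' AND p.product_id IS NULL
    """
    missing = next(iter(client.query(query).result())).missing
    assert missing == 0, f"{missing} online products missing from the data product"
```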
A separate article will be published about BI tools. This will also address the question of why development teams should come to grips with something like this.
Data products belong in the backlog
Maybe you were wondering as you read the last two sections: “Why wasn’t all of that done right from the beginning?”
That’s a good question. The simple answer? Everything costs time and money.
At the very start of a development project, a team is often under particularly high pressure. Talking about quality criteria and setting up scenarios? Discussing necessary reports and technical data definitions? There is often no time for all that, or at least that is the impression.
That’s why data products belong in the backlog and must be part of the refinement process. They also belong within the scope of roadmaps and in any meeting in which deliverables, target deadlines, and costs are discussed.
Tip: If no concrete requirements exist, it can be assumed that these will all arrive shortly before the target deadline.
Analysis capability promotes autonomy
Almost every organization has reports with key figures that cover varying levels of detail. Especially in large organizations, data products are the basis for data teams to provide such reports for the C-level and others. [2]
But when the topic is approached from this side, it might first give rise to a different thought: data products are one more burden on teams, yet another deliverable. This makes it important to examine the topic from the perspective of a specific team. My hypothesis is that, in the vast majority of cases, this workload will arise regardless; the question is only when and how it impacts the flow of the team.
As teams, however, we like to enjoy a high level of autonomy. But this also requires that we operate in line with the business goals of our organization. And how is that supposed to work if we don’t have an overview of the most important data concerning our own software product?
Summary
Every team should give thought to their data products. This should begin at the very start of the development work, even if the requirements of the stakeholders are not yet known. When a data mesh approach is followed, data products are created almost automatically. But even without that approach, a team can make use of the data product concept for its own purposes.
Data products make the following possible:
- Answering questions more efficiently.
- Verifying requirements based on actual data.
- Further improving your own data.
- Taking more responsibility for your own product (outside-in perspective).