This article is also available in German.

TL;DR

We spoke with seven people from six different development projects. The projects were primarily implemented with architectures based on self-contained systems (SCS) and microservices. Two frequently employed approaches are the use of HTTP feeds and Apache Kafka as message-oriented middleware. Various decisions were made depending on the non-functional requirements of the project, such as decoupling, capacity for higher data throughput, or system resilience.

Definition of terms

When it comes to asynchronous service-to-service communication, we frequently differentiate between event-driven and data-replicating systems. Put simply, it is sometimes sufficient in the first case to describe an object state exclusively via modifications to the object, while in the case of data replication the entire data aggregate is ideally republished in the event of an update.
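To make the distinction concrete, the following sketch contrasts the two payload styles with two hypothetical Java records (all names are made up for illustration): an event carries only the change, while data replication republishes the entire aggregate.

```java
import java.math.BigDecimal;
import java.time.Instant;

// Event-driven: only the modification to the object is published.
record PriceChanged(String productId, BigDecimal newPrice, Instant occurredAt) {}

// Data replication: the complete aggregate is republished whenever it changes.
record Product(String productId, String name, String description,
               BigDecimal price, boolean available, Instant lastModified) {}
```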

Regardless of whether the services exchange events or data aggregates, the communication over a network results in an unavoidable latency before the state changes are received and processed by the consumers. The goal is to promptly achieve data consistency, but to emphasize that inconsistencies can arise in the meantime, the term “eventual consistency” is used. Depending on the use case, the timespan within which consistency must be achieved can vary, and this depends in turn on pull and push rates or the desired scalability of the overall system.

Apache Kafka

Fig. 1: Kafka architecture from a bird’s eye view

Apache Kafka is popular middleware used in INNOQ projects and elsewhere to provide highly scalable and robust data streams. Kafka uses a distributed and fault-tolerant architectural pattern that allows large quantities of data to be processed efficiently while simultaneously ensuring high availability. Figure 1 shows how message producers can pass information to a Kafka cluster. This transparently distributes the information among its brokers in so-called topics. Consumers can then pick up the messages from the cluster for further processing.
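As a rough illustration of this flow, a producer only needs the address of the cluster and the name of a topic; how the topic is distributed across the brokers is handled transparently by Kafka. A minimal sketch with the plain Kafka client (broker address, topic name, and payload are assumptions for this example):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (var producer = new KafkaProducer<String, String>(props)) {
            // The cluster stores the message in a partition of the "orders" topic;
            // consumers later pick it up from there for further processing.
            producer.send(new ProducerRecord<>("orders", "order-4711", "{\"status\":\"CREATED\"}"));
        }
    }
}
```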

Kafka implements many features that are useful for handling events and data replication. However, because this inevitably results in a steep learning curve for the development team, it is advisable to use only the features required for the use case and to consider the team’s level of expertise when selecting the features.

It can be challenging to operate Apache Kafka, since scaling, configuration, and data security require careful planning. Factors such as monitoring, fault tolerance, and integration with other systems must also be taken into consideration. Alternatively, a managed solution such as Amazon’s Managed Streaming for Apache Kafka (MSK) or Confluent Cloud can be used.

HTTP feeds

Fig. 2: HTTP feed architecture from a bird’s eye view

HTTP feeds are based on the standards of the Hypertext Transfer Protocol, which is supported by every web framework. The server supplies the data records or events in chronological order, which can then be polled by multiple consumers. As shown in Figure 2, central middleware is explicitly avoided, which makes the solution very lean since services can be loosely coupled without the need to introduce an additional infrastructure component. The consuming service is responsible for keeping track of the point to which the feed has been read.
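A polling consumer can therefore be as simple as a scheduled HTTP request plus a persisted read position. The following sketch assumes a hypothetical feed endpoint with a lastEventId query parameter, loosely following the conventions described at http-feeds.org:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FeedPoller {
    private final HttpClient client = HttpClient.newHttpClient();
    private String lastEventId = ""; // the consumer persists how far it has read

    void pollOnce() throws Exception {
        // Hypothetical endpoint; "lastEventId" tells the server where to resume.
        var request = HttpRequest.newBuilder(
                URI.create("https://inventory.example.com/feed?lastEventId=" + lastEventId))
            .GET()
            .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        // Process the returned batch, then remember the id of the last processed entry,
        // e.g. in the consumer's own database, so polling can resume after a restart.
        // lastEventId = idOfLastProcessedEntry(response.body());
    }
}
```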

There is no standardized method for implementing HTTP feeds. The implementation depends on the use case and the requirements of the software system. For an event-based system, it can be very useful to place a strong emphasis on cacheability so that consumers can perform event sourcing without repeatedly putting load on the feed’s origin.

An overview of the topics to be considered when specifying an HTTP feed protocol is provided by Jochen Christ at http-feeds.org.

Project overview

We examined six different projects from the areas of e-commerce, retail, and industrial automation in which INNOQ was involved. All projects have in common that they were implemented as service-oriented architectures based on self-contained systems. Spring Boot with Java or Kotlin was used in all cases.

In terms of team size, the projects were very diverse, ranging from 8 to 80 people. The project size appears relevant for the technology selection to the extent that organizations embarking on larger projects may already provide a Kafka installation. This significantly lowers the barrier to utilizing Kafka, since the required infrastructure and operational processes already exist, making it a more attractive choice. Moreover, the setup and operation are easier to manage in large projects than in small ones, where sufficient staff may simply be lacking.

The reasons given for integrating the self-contained systems via asynchronous communication were also identical in every case. The primary goal is to avoid runtime dependencies in the inter-service communication and thereby profit from the associated advantages, such as better scalability compared with synchronous requests and greater resilience. In addition, the named technologies were used for data replication in all of the examined projects.

In the case of HTTP feeds, the lack of dependency on central middleware and the interoperability between networks were listed as reasons. Reasons stated for choosing Kafka included additional features provided by the software out of the box. One example of this is consumer groups, i.e. the division of data streams into subsets suitable for parallel processing, which results in horizontal scalability for the consumer processes, as illustrated in Figure 3.

Fig. 3: Consumer groups
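For illustration, joining a consumer group only requires that all instances of a service use the same group.id; Kafka then assigns each instance a subset of the partitions. A minimal sketch (broker address, group, and topic names are again assumptions):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("group.id", "order-processing");          // same group.id = one consumer group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (var consumer = new KafkaConsumer<String, String>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                // Each running instance only receives the partitions assigned to it,
                // so starting more instances spreads the load across them.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                records.forEach(record -> System.out.println(record.value()));
            }
        }
    }
}
```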

Project details

The challenges in the various projects are as diverse as the people working on them. It is therefore no surprise that a variety of solutions were utilized for the integration of services. An overview of the unique aspects of the individual projects is provided below.

Project A: HTTP feeds for SCS integration, Kafka to prevent race conditions

Project environment

There are 5 self-contained systems, each handled by 1 to 3 teams of about 8 people each. At the time the interviewee left the project, his team was responsible for an SCS comprising about 10 microservices.

Architecture solution

Various guidelines were defined for the macroarchitecture. For instance, there were requirements for the user interface as well as for communication between the systems. The system needed to be event-driven as far as possible, with the services integrated on a pull basis via HTTP feeds. This was desired in the interests of a lean and flexible solution without any additional middleware.

The interviewed team used Kafka within its SCS. The team established the requirement that events be scalably processed in the correct order, which is ensured by the serialization and partitioning concept of Kafka. This means that events relating to a specific process are processed in sequence and race conditions can be avoided.
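This ordering guarantee hinges on the message key: all messages with the same key end up in the same partition and are therefore consumed in the order they were produced. A sketch of what this could look like, assuming a hypothetical process id is used as the key:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProcessEventProducer {
    private final KafkaProducer<String, String> producer;

    ProcessEventProducer(Properties props) {
        this.producer = new KafkaProducer<>(props);
    }

    void publish(String processId, String eventJson) {
        // Using the process id as the message key routes all events of one process
        // to the same partition, so Kafka preserves their order for the consumer
        // and race conditions between events of the same process are avoided.
        producer.send(new ProducerRecord<>("process-events", processId, eventJson));
    }
}
```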

Satisfaction with the selected solution

Some problems were encountered with the batch processing, especially in a specific import process. There were millions of simultaneous data updates, which led to delays of up to an hour. Originally, the team proposed serverless use of AWS Lambda, which might have been a good alternative. This would probably have solved the batch problem but slowed down the development overall. In general, however, the team is very satisfied with its architecture solution.

Project B: Data replication with HTTP feeds and Amazon S3

Project environment

The team generally consists of 9 people in total. The entire application is a collection of about 21 self-contained systems.

Architecture solution

The non-functional requirements consist of loose coupling, high performance, data replication, and solution simplicity. For this reason, the communication between the services was accomplished primarily via HTTP feeds, with complete snapshots of the current data records always written to the HTTP feed. Due to a guideline covering the integration of existing company applications in a different data center, however, there were exceptions in which data was written in XML or JSON format to an AWS S3 bucket by one system and read by the other. Here it became apparent that other approaches might be more robust since, for instance, schema validation is difficult to implement in this setup.

Satisfaction with the selected solution

The project participants would choose this solution again. All of the developers were well satisfied with the integration via HTTP feeds. Due to schema problems with the file-based solution, however, the general satisfaction was somewhat more reserved but still good.

Project C: HTTP feeds with log compaction and shared database

Project environment

The purpose of the software is the management of products and distribution of orders to a dealer network. It is gradually replacing the existing ERP system.

The team grew continuously over time to as many as 10 people. There are 6 self-contained systems, each containing multiple microservices, generally 1 to 3 per component.

Architecture solution

The self-contained systems are integrated via HTTP feeds. The reasons for choosing HTTP feeds were decoupling, fail-safety, data consistency, and low complexity. It is a lean solution that enables data replication. Potential inconsistencies were accepted since a polling interval of 10 seconds is sufficient.

The concept of Kafka’s log compaction was adopted to further simplify the feeds. The effort for implementing this feature was very low with the HTTP feed implementation in this project because the feeds always publish complete data records, meaning that past events can simply be deleted via the key of the referenced data record.
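A sketch of what this compaction could look like if the feed entries are stored in a relational table (table and column names are hypothetical): before a new complete snapshot is appended, all older entries with the same key are deleted.

```java
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;

@Service
public class FeedPublisher {
    private final JdbcTemplate jdbc;

    public FeedPublisher(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    public void publish(String recordKey, String snapshotJson) {
        // Because every feed entry carries the complete data record, older entries
        // for the same key are obsolete and can simply be removed ("log compaction").
        jdbc.update("DELETE FROM feed_entries WHERE record_key = ?", recordKey);
        jdbc.update("INSERT INTO feed_entries (record_key, payload, published_at) VALUES (?, ?, now())",
                recordKey, snapshotJson);
    }
}
```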

Within a microservice, a common database is used, with data passed to processes via read-only tables. It is important here that these processes run on the same codebase, in other words that they are understood as part of the microservice and are also deployed simultaneously. Using this pattern for the integration of multiple services is not advisable since schema changes would then require simultaneous deployment of various services.

Satisfaction with the selected solution

There were a few problems with schema versions attributable to a faulty interface. Overall, the participants are very satisfied with the solution. The use of middleware like Kafka would have meant an unjustifiable overhead for this use case.

Project D: Solution for schema changes in Kafka topics

Project environment

The project encompasses 5 areas, with each team consisting of 8 to 10 people. The 5 project areas generally consist of 3 to 4 microservices.

Architecture solution

Asynchronous communication between the services was essentially a given with the SCS architecture model, since this form of coupling is preferred there. Kafka was evaluated but not initially selected due to its complexity. As the project was later migrated to AWS and a managed Kafka became available with Amazon MSK, it was eventually adopted for its ability to handle large quantities of data.

In our interview, we also discussed schema changes within Kafka topics. To ensure error-free communication, the consumer and producer must agree on a shared, unambiguous message schema. Whether the producer or the consumer defines the message schema can be determined by the project situation, the working style of the teams, or technical requirements. This is of most interest when incompatible schema changes are encountered. To ensure loose coupling of the systems, existing consumers of the messages must be able to continue processing them until they have been adapted to the changes. A flexible approach that proved itself well in the project is shown in Figure 4. This involves assigning a schema version to the messages and implementing the consumers such that they only read the versions for which they have been implemented. In the event of changes, the messages are distributed by the producers in both the old and new versions until all consumers have been adapted to the new schema.

Fig. 4: Possible handling of schema changes
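On the consumer side, this approach might look like the following sketch: the schema version is read from a message header (a convention assumed for this example, not something prescribed by the project), and any version other than the supported one is skipped until the consumer has been adapted.

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.header.Header;

public class VersionedConsumer {
    // The schema version this consumer was built and tested against.
    private static final String SUPPORTED_VERSION = "2";

    void handle(ConsumerRecord<String, String> record) {
        // Assumes the producer attaches the schema version as a message header;
        // during a migration the producer publishes each message once per version.
        Header versionHeader = record.headers().lastHeader("schema-version");
        String version = versionHeader == null
                ? "1"
                : new String(versionHeader.value(), StandardCharsets.UTF_8);

        if (!SUPPORTED_VERSION.equals(version)) {
            return; // skip messages in versions this consumer does not yet understand
        }
        process(record.value());
    }

    private void process(String payload) {
        // business logic for the supported schema version
    }
}
```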

If you would like to learn more about this topic, Goran recently wrote a blog post on Schema Evolution with Avro.

Satisfaction with the selected solution

The selected architecture solution was rated as very good. There were many questions initially, and onboarding of the developers was time-consuming since they did not have extensive experience with Kafka, but the system proved itself well over time. The consequences of the eventual consistency and the described method for handling schema changes required a certain amount of rethinking, but the systems are now stable. The participants are very satisfied with the solution and would choose it again.

Project E: HiveMQ as MQTT gateway for Kafka and cacheable HTTP feeds

Project environment

The team consists of a mixture of 8–10 persons from INNOQ and internal colleagues and is responsible for the development and operation of 3 service components. This project involves processing data from a large number of IoT devices.

Architecture solution

Individual devices send MQTT messages via an HAProxy to HiveMQ, and from there an importer forwards them to Kafka and finally to the database. The data processing from and to Kafka as well as the enrichment of the data is realized both with plain Java implementations and with the actor framework Akka, which ensures robust concurrency. The team operated the Kafka cluster itself in Kubernetes with the help of an operator.

The reasons for using Kafka were reliability, scalability, and robustness. This meant that individual components could be delivered independently of each other, and downtimes due to deployments or the like do not result in data loss. Competing consumers were implemented to ensure that individual components can scale differently. The temporal decoupling and the practically unlimited buffering offer a safety margin for making mistakes and reloading data in an emergency – even if that has not yet been necessary.

The communication between the components of the service takes place via HTTP feeds. The implementation of the HTTP feeds explicitly ensures that they are immutable and therefore easy to cache and reuse for event sourcing. This means that a service can idempotently restore the state of another service at any time by evaluating the feed from the start, which can be useful for disaster recovery, for example.
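One way to achieve this immutability is a paged feed in which completed pages never change and are served with aggressive cache headers. A sketch with a hypothetical Spring controller (path, page handling, and cache lifetime are assumptions):

```java
import org.springframework.http.HttpHeaders;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class FeedPageController {

    // Hypothetical paged feed: once a page is full, its content never changes again,
    // so it can be cached aggressively and replayed later for event sourcing.
    @GetMapping("/feed/pages/{pageId}")
    public ResponseEntity<String> page(@PathVariable String pageId) {
        String pageJson = loadCompletedPage(pageId); // e.g. from the service's database
        return ResponseEntity.ok()
                .header(HttpHeaders.CACHE_CONTROL, "public, max-age=31536000, immutable")
                .body(pageJson);
    }

    private String loadCompletedPage(String pageId) {
        return "[]"; // placeholder for this sketch
    }
}
```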

The customer required the applications to be provider-independent and suitable for hosting in a hybrid cloud environment. HTTP feeds are particularly well suited for integrating services between various VPCs because the interface can be easily and securely made publicly accessible.

Satisfaction with the selected solution

Kafka Streams could be considered as an alternative to Akka. Kafka Streams is a Java library developed as part of the Apache Kafka ecosystem that allows developers to build real-time applications for processing, transforming, and analyzing data streams. However, this technology was not yet available at the time of implementation.

The architecture solution itself was rated as very good, meaning that the architects were also very satisfied with the decisions afterwards. Satisfaction with the software in general was rated as excellent. In other words, the software worked flawlessly and fulfilled its purpose.

Insight: Error handling can be a particular challenge

One frequent challenge with asynchronous communication is error handling. This was discussed in most of the interviews, which is why we wanted to summarize what we learned for you.

In contrast to the request-response pattern used by REST, a messaging pattern does not allow a direct response, such as an HTTP status code, to be sent back to the producer. In event-driven systems in particular, the communication is not directed to an explicit consumer. Although this behavior is intended, there is sometimes a need for a return channel, for instance to inform the producer of faulty data records.

Error handling on the consumer side is also especially challenging. Message consumers can respond to errors in a variety of ways. With sequential processing of an HTTP feed, the processing can be completely stopped until the producer has corrected the error. Because this increases the coupling to the source, it is only possible when there is access to the producer system. The advantage of this approach is that the processing order of the messages remains guaranteed.

Another option is to use a so-called dead letter queue, or DLQ. As illustrated in Figure 5, all messages that cannot be processed are written to this queue. In the case of Kafka, this would be a separate topic that the message is sent to in the course of the error handling. This message can then be analyzed and further processed. DLQs should always establish a reference to the message consumer to simplify error tracking and correction.

Fig. 5: Dead letter queue
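In a Kafka-based setup this could look like the following sketch: a message that cannot be processed is forwarded to a separate DLQ topic together with headers that reference its origin (topic and header names are made up for this example).

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DeadLetterHandler {
    private final KafkaProducer<String, String> producer;

    public DeadLetterHandler(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    void handle(ConsumerRecord<String, String> record) {
        try {
            process(record.value());
        } catch (Exception e) {
            // Forward the unprocessable message to a separate topic (the DLQ) and
            // record where it came from, so the error can be analyzed and corrected later.
            ProducerRecord<String, String> deadLetter =
                    new ProducerRecord<>("orders.dlq", record.key(), record.value());
            deadLetter.headers().add("source-topic", record.topic().getBytes(StandardCharsets.UTF_8));
            deadLetter.headers().add("error-message",
                    String.valueOf(e.getMessage()).getBytes(StandardCharsets.UTF_8));
            producer.send(deadLetter);
        }
    }

    private void process(String payload) {
        // business logic that may throw on faulty data
    }
}
```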

Because there is no way to directly respond to a producer, good monitoring is all the more important. Various options exist here, which can naturally also be combined. For example, a notification by email or to a Slack channel can be triggered when the number of error logs exceeds a specific threshold. A dedicated error service such as Sentry could be used here, which can provide additional assistance with error analysis. When working with DLQs, it is possible to create a metric with a tool like Grafana that reads the size of the individual DLQs. Sufficient alerting is also a must.
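As a small example of such a metric, a Micrometer counter could be incremented whenever a message lands in the DLQ; Grafana can then graph the rate and trigger an alert above a threshold (metric and tag names are hypothetical).

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;

public class DeadLetterMetrics {
    private final Counter deadLetters;

    public DeadLetterMetrics(MeterRegistry registry) {
        // One counter per DLQ; an alerting rule can fire when the rate exceeds a threshold.
        this.deadLetters = Counter.builder("dead_letter_messages_total")
                .tag("queue", "orders.dlq")
                .register(registry);
    }

    public void recordDeadLetter() {
        deadLetters.increment();
    }
}
```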

Conclusion

Asynchronous communication plays a crucial role in the integration of services within modern architecture. By utilizing this approach, companies can reduce runtime dependencies and robustly replicate data from other services.

One method of asynchronous communication is the use of HTTP feeds. This technique allows services to exchange events and messages via standardized HTTP requests and responses. The advantage of HTTP feeds is that they do not require a central infrastructure component and have a relatively flat learning curve for developers. This simplicity makes it easy to establish and scale communication between services.

Another powerful solution for event-based integration over an asynchronous channel is Apache Kafka. Kafka offers numerous benefits, including reduced latency, higher throughput, and a robust, scalable architecture. With Kafka, services can process events in real-time while leveraging various existing features and integrations. This makes Kafka a popular choice for companies needing scalable and reliable service integration.

Overall, event-based integration through asynchronous service-to-service communication offers numerous benefits. It enables companies to integrate services more efficiently, improve individual scalability through loose coupling and patterns such as consumer groups, and enhance communication flexibility. Whether using HTTP feeds or Apache Kafka, the choice of approach depends on the specific requirements and complexity of the project. However, with the correct implementation and understanding of the various options, companies can significantly benefit from the advantages of event-based integration through asynchronous communication.