As the SaaS industry grows rapidly, dynamic and adaptable architectures are needed to handle the influx of real-time data. Here's how to build them.
Translated from "How To Build a Scalable Platform Architecture for Real-Time Data" by Christina Lin.
The software as a service (SaaS) industry is showing unstoppable growth, with the market size expected to reach US$317.55 billion in 2024 and almost quadruple to US$1,228.87 billion by 2032. This growth underscores the need for more robust data strategies, a trend driven by the increasing volume, velocity and variety of data generated by enterprises and by the integration of artificial intelligence.
However, this growing landscape brings several important challenges, such as managing peak traffic, transitioning from online transaction processing (OLTP) to online analytical processing (OLAP) in real time, ensuring self-service and decoupling, becoming cloud agnostic and supporting multiregion deployment. Addressing these challenges requires a sophisticated architectural framework that ensures high availability and robust failover mechanisms without compromising system performance.
The reference architecture in this article details how to build a scalable, automated, and flexible data platform to support the growing SaaS industry. This architecture supports the technical needs of processing large-scale data while also aligning with business needs for agility, cost-effectiveness, and regulatory compliance.
Technical challenges of data-intensive SaaS services
As the demand for services and data volumes continues to grow, several common challenges arise in the SaaS industry.
Handling peaks and bursts of traffic is critical to efficiently allocating resources to cope with variable traffic patterns. This requires isolating workloads, scaling during peak workloads, and reducing compute resources during off-peak hours while preventing data loss.
Maintaining real-time OLTP and OLAP means seamlessly supporting both OLTP systems, which manage large volumes of fast transactions with a focus on data integrity, and OLAP systems, which deliver rapid analytical insights. This dual support is critical for running complex analytical queries while maintaining peak performance, and it also plays a key role in preparing datasets for machine learning (ML).
Enabling self-service and decoupling requires empowering teams with self-service capabilities to create and manage topics and clusters without relying heavily on a central IT team. This speeds development while allowing applications and services to be decoupled and achieve independent scalability.
Promoting cloud agnosticism and stability provides the agility to operate across different cloud environments such as AWS, Microsoft Azure or Google Cloud, and to deploy across multiple regions.
How to build a SaaS-friendly architecture
To address these challenges, large SaaS companies often adopt an architectural framework that involves running multiple clusters spanning multiple regions and managed by a custom-developed control plane. The design of the control plane enhances the flexibility of the underlying infrastructure while simplifying the complexity of the applications connected to it.
While this strategy is critical for high availability and robust failover, it can become difficult to maintain uniform performance and data integrity across geographically distributed clusters, and challenges arise when scaling resources up or down without impacting performance or introducing latency.
Additionally, certain scenarios may require data to be isolated within a specific cluster for compliance or security reasons. To help you build a robust, flexible architecture that avoids these complexities, I'll walk you through some suggestions.
1. Establish a stable foundation
A major challenge for SaaS services is allocating resources to handle various traffic patterns, including high-frequency and high-volume online queries, data insertion, and internal data exchange.
Converting traffic into asynchronous processes is a common solution that allows for more efficient scaling and rapid allocation of computing resources. Data streaming platforms such as Apache Kafka are ideal for efficiently managing massive amounts of data. But managing a distributed data platform like Kafka brings its own set of challenges. Kafka is notorious for its technical complexity: it requires managing cluster coordination, synchronization and scaling, as well as additional security and recovery protocols.
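To make the hand-off pattern concrete, here is a minimal, hedged sketch in Python of accepting a request and pushing it to a Kafka-compatible topic for asynchronous processing. The broker address, topic name "ingest.events" and the payload shape are illustrative assumptions, not part of the original architecture.

```python
# Minimal sketch: absorb bursty request traffic by handing work off to a
# Kafka-compatible topic instead of processing it synchronously.
# Assumptions: a broker at localhost:9092 and a topic "ingest.events" exist.
import json
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "linger.ms": 5,            # small batching window to smooth bursts
    "compression.type": "lz4"  # reduce bandwidth during traffic peaks
})

def on_delivery(err, msg):
    # Called asynchronously once the broker acknowledges (or rejects) the record.
    if err is not None:
        print(f"delivery failed: {err}")

def handle_request(payload: dict) -> None:
    """Accept a request quickly; downstream consumers process it later."""
    producer.produce(
        "ingest.events",
        key=str(payload.get("user_id", "")),
        value=json.dumps(payload).encode("utf-8"),
        callback=on_delivery,
    )
    producer.poll(0)  # serve delivery callbacks without blocking

if __name__ == "__main__":
    handle_request({"user_id": 42, "action": "checkout"})
    producer.flush()  # drain pending messages before shutdown
```

The same client code works against Kafka-protocol-compatible alternatives, which matters for the broker-side concerns discussed next.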
The Java Virtual Machine (JVM) in Kafka can also cause unpredictable latency spikes, primarily due to the JVM's garbage collection process. Managing the JVM's memory allocation and tuning for Kafka's high throughput requirements is notoriously tedious and can impact the overall stability of the Kafka broker.
Another obstacle is Kafka's data policy management. This includes managing data retention policies, log compaction and data deletion while balancing storage costs, performance and compliance.
In short, effectively managing Kafka-based systems in a SaaS environment is tricky. As a result, many SaaS companies are turning to Kafka alternatives that provide highly scalable data streaming without the need for external dependencies such as JVM or ZooKeeper.
2. Enable self-service streaming data
There is growing demand for self-service solutions that allow developers to create topics from development through to production. The infrastructure or platform service should provide centralized control, manage access credentials and automate the rapid creation and deployment of resources across various platforms and stages.
This raises the need for a control plane, which comes in many forms. Some control planes are only used to manage the lifecycle of a cluster or topic and assign permissions on the streaming platform. Other control planes add a layer of abstraction by virtualizing targets and hiding infrastructure details from users and clients.
When a topic is registered in the control plane of the self-service data platform, different computing resource optimization strategies are applied depending on the stage of the environment. In development, topics often share clusters with other processes, data retention is less emphasized, and most data is discarded within a few days.
However, in production, resource allocation must be carefully planned based on traffic volume. This planning includes determining the number of partitions for consumers, setting data retention policies, deciding on data location, and considering whether you need a dedicated cluster for specific use cases.
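As a hedged illustration of stage-dependent provisioning, the sketch below shows how a control plane might apply different topic settings per environment. The topic names, partition counts and retention values are assumptions chosen for the example, not recommendations.

```python
# Hedged sketch: a control plane provisioning topics with stage-dependent settings.
from confluent_kafka.admin import AdminClient, NewTopic

STAGE_PROFILES = {
    # dev: shared cluster, short retention, few partitions
    "dev":  {"partitions": 3,  "retention_ms": 2 * 24 * 60 * 60 * 1000},
    # prod: sized for expected consumer parallelism, longer retention
    "prod": {"partitions": 24, "retention_ms": 7 * 24 * 60 * 60 * 1000},
}

def provision_topic(admin: AdminClient, name: str, stage: str) -> None:
    profile = STAGE_PROFILES[stage]
    topic = NewTopic(
        name,
        num_partitions=profile["partitions"],
        replication_factor=3,
        config={"retention.ms": str(profile["retention_ms"])},
    )
    # create_topics is asynchronous; wait on the returned future per topic
    for _, future in admin.create_topics([topic]).items():
        future.result()

if __name__ == "__main__":
    admin = AdminClient({"bootstrap.servers": "localhost:9092"})
    provision_topic(admin, "orders.events", "prod")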
For the control plane, it is very helpful to automate the lifecycle management of the streaming platform. This enables the control plane to autonomously debug brokers, monitor performance metrics, and start or stop partition rebalancing to maintain platform availability and stability at scale.
3. Real-time support for OLTP and OLAP
The shift from batch processing to real-time analysis makes integrating OLAP systems into existing infrastructure critical. However, these systems typically handle large amounts of data and require complex data models for in-depth multidimensional analysis.
OLAP relies on multiple data sources, and depending on the maturity of the company, there is usually a data warehouse or data lake to store the data, along with batch processing pipelines that run periodically (usually nightly) to move data from those sources. This process merges data from various OLTP systems and other sources, and maintaining data quality and consistency across it can become complex.
Today, OLAP also integrates AI models with large data sets. Most distributed data processing engines and streaming databases now support real-time consumption, aggregation, summarization, and analysis of streaming data from sources such as Kafka or Redpanda. This trend has led to the rise of extract, transform, load (ETL) and extract, load, transform (ELT) pipelines for real-time data, as well as change data capture (CDC) pipelines that stream event logs from databases.
Real-time pipelines, typically implemented in Java , Python or Golang, require careful planning. To optimize the lifecycle of these pipelines, SaaS companies are embedding pipeline lifecycle management into their control planes to optimize monitoring and resource alignment.
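For orientation, here is a minimal consume-transform-produce loop of the kind such a pipeline stage might contain, written in Python as a hedged sketch. The topic names, consumer group and the enrichment logic are assumptions made for the example.

```python
# Minimal sketch of a real-time pipeline stage: consume, transform, produce.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-enricher",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["orders.raw"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        # Simple illustrative transform: derive dollars from cents.
        event["amount_usd"] = round(event.get("amount_cents", 0) / 100, 2)
        producer.produce("orders.enriched", json.dumps(event).encode("utf-8"))
        producer.poll(0)
finally:
    consumer.close()
    producer.flush()
```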
4. Understand (and optimize) the data pipeline lifecycle
The first step is to choose a technology stack and determine the level of freedom and customization that users creating pipelines will enjoy. Ideally, allow them to select various technologies for different tasks and implement guardrails to limit pipeline construction and expansion.
The following is a brief overview of the stages involved in the pipeline life cycle.
Build and test
Source code is pushed to a Git repository, either directly by pipeline developers or through custom tools in the control plane. This code is then compiled into binary code or an executable program using a language such as C++, Java, or C#. After compilation, the code is packaged into an artifact, a process that may also involve bundling authorized dependencies and configuration files.
The system then executes automated tests to verify the code. During testing, the control plane creates temporary topics specifically for this purpose, and these topics are destroyed as soon as testing is complete.
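A hedged sketch of that ephemeral-topic pattern is a test fixture that creates a uniquely named topic before a test and deletes it afterwards. The broker address, naming scheme and the pytest-based structure are assumptions for illustration.

```python
# Hedged sketch: temporary topics created for tests and destroyed afterwards.
import uuid
import pytest
from confluent_kafka.admin import AdminClient, NewTopic

@pytest.fixture
def temp_topic():
    admin = AdminClient({"bootstrap.servers": "localhost:9092"})
    name = f"test.{uuid.uuid4().hex}"  # unique, disposable topic name
    for _, f in admin.create_topics(
        [NewTopic(name, num_partitions=1, replication_factor=1)]
    ).items():
        f.result()
    yield name  # test code produces/consumes against this topic
    for _, f in admin.delete_topics([name]).items():
        f.result()

def test_pipeline_roundtrip(temp_topic):
    # A real test would run the pipeline against temp_topic and assert on output.
    assert temp_topic.startswith("test.")
```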
Deploy
Artifacts are deployed to a runtime platform (such as Kubernetes) or a streaming database, depending on the technology stack. Some platforms offer more sophisticated release strategies, such as blue/green deployments, which enable fast rollback and minimize downtime. Another strategy is canary release, where the new version is applied to only a small portion of the data, reducing the impact of potential problems.
The disadvantages of these strategies are that rollbacks can be challenging and it can be difficult to isolate the data affected by the new version. Sometimes it's simpler to perform a full release and roll back the entire dataset.
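To make the canary idea above concrete, here is an illustrative sketch (not a library API) of routing a small, deterministic slice of records to a canary version of a pipeline. The 5% threshold and topic names are assumptions.

```python
# Illustrative sketch: deterministic, key-based canary routing for a pipeline.
import zlib

CANARY_PERCENT = 5  # assumed share of traffic sent to the new version

def target_topic(key: str) -> str:
    """Hash the record key so the same key always lands on the same version."""
    bucket = zlib.crc32(key.encode("utf-8")) % 100
    return "orders.enriched.canary" if bucket < CANARY_PERCENT else "orders.enriched"

assert target_topic("customer-42") in {"orders.enriched", "orders.enriched.canary"}
```

Keying the routing decision is what makes later isolation (and rollback) of canary-affected data tractable, which is exactly the difficulty noted above.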
Scale
Many platforms support automatic scaling, such as adjusting the number of running instances based on CPU usage, but the level of automation varies. Some platforms provide this functionality natively, while others require manual configuration, such as setting the maximum number of parallel tasks or worker processes per job.
During deployment, the control plane provides default settings based on anticipated demand, but continues to monitor metrics closely. It then allocates additional resources to the topic by scaling the number of worker processes, tasks, or instances as needed.
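As a hedged sketch of the kind of decision the control plane might make, the function below sizes a worker pool from observed consumer lag. How lag is measured and how workers are actually scaled are left to the platform; the thresholds are assumptions.

```python
# Hedged sketch: a lag-driven sizing decision a control plane might apply.
def desired_workers(total_lag: int,
                    lag_per_worker: int = 50_000,
                    max_workers: int = 24) -> int:
    """Size the worker pool so each instance handles at most lag_per_worker records."""
    needed = max(1, -(-total_lag // lag_per_worker))  # ceiling division
    return min(needed, max_workers)

# Example: 400,000 records of lag -> 8 workers with the assumed defaults.
print(desired_workers(400_000))
```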
Monitor
Monitoring the right metrics in your pipeline and maintaining observability are primary ways to detect problems early. Here are some key metrics you should proactively monitor to ensure the efficiency and reliability of your data processing pipeline.
Resource metrics
- CPU and memory usage are critical to understanding how resources are being consumed.
- Disk I/O is important for evaluating the efficiency of data storage and retrieval operations.
Throughput and latency
- Input/output records per second measure the rate at which data enters and leaves the pipeline.
- Records processed per second represent the processing power of the system.
- End-to-end latency is the total time it takes from data input to output, which is critical to real-time processing performance.
Backpressure and lag
- These help identify bottlenecks in data processing and prevent potential slowdowns.
Error rate
- Tracking error rates helps maintain data integrity and system reliability.
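A minimal instrumentation sketch for the metrics listed above is shown below, using the Prometheus Python client. The metric names, labels and the processing hook are assumptions made for illustration.

```python
# Minimal sketch: exposing pipeline throughput, latency and error metrics.
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

RECORDS_IN = Counter("pipeline_records_in_total", "Records consumed", ["topic"])
RECORDS_OUT = Counter("pipeline_records_out_total", "Records produced", ["topic"])
ERRORS = Counter("pipeline_errors_total", "Processing errors", ["stage"])
LAG = Gauge("pipeline_consumer_lag", "Consumer lag in records", ["partition"])
E2E_LATENCY = Histogram("pipeline_e2e_latency_seconds", "Event time to output time")

def process(event: dict) -> None:
    RECORDS_IN.labels(topic="orders.raw").inc()
    try:
        # ... transform the event here ...
        E2E_LATENCY.observe(time.time() - event["event_ts"])
        RECORDS_OUT.labels(topic="orders.enriched").inc()
    except Exception:
        ERRORS.labels(stage="transform").inc()

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for scraping
    LAG.labels(partition="0").set(0)  # would be updated from consumer offsets
    process({"event_ts": time.time() - 0.25})
```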
5. Improve reliability, redundancy and resiliency
Enterprises prioritize high availability, disaster recovery, and resiliency to maintain continued operations during disruptions. Most data streaming platforms already have strong safeguards and deployment strategies built in, primarily by extending clusters across multiple partitions, data centers, and cloud-agnostic availability zones.
However, it involves trade-offs, such as increased latency, potential data duplication, and higher costs. Here are some suggestions when planning for high availability, disaster recovery, and resiliency.
High availability
An automated deployment process managed by the control plane plays a key role in establishing a robust high availability strategy. This strategy ensures that pipelines, connectors, and streaming platforms are strategically distributed across availability zones or partitions, depending on the cloud provider or data center.
It is critical for data platforms to distribute all data pipelines across multiple availability zones (AZs) to reduce risk. Continuity is supported by running redundant copies of pipelines in different AZs, maintaining uninterrupted data processing in the event of a zone failure.
The streaming platform underlying the data architecture should follow suit and automatically replicate data across multiple AZs to improve resiliency. Solutions like Redpanda can automate data distribution across availability zones, improving the platform's reliability and fault tolerance.
However, consider the potential associated network bandwidth costs, taking into account the location of your applications and services. For example, keeping pipelines close to data stores can reduce network latency and overhead while reducing costs.
Disaster recovery
Faster failure recovery comes at a higher cost: increased data replication raises bandwidth overhead and requires an always-on (active-active) setup, doubling hardware usage. Not all streaming technologies offer this functionality, but enterprise-grade platforms like Redpanda support backing up data and cluster metadata to cloud object storage.
Resiliency
In addition to high availability and disaster recovery, some global enterprises require regional deployment strategies to ensure that their data storage and processing comply with geographic regulations. Conversely, companies that want to share data across regions in real time with minimal management often create a shared cluster that enables brokers to replicate and distribute data across regions.
However, this approach incurs significant network cost and latency, because data is continuously transferred to follower partitions in other regions. To reduce this traffic, follower fetching lets consumers read from the geographically closest follower partition.
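On the consumer side, follower fetching is typically enabled by advertising the client's location, as in the hedged sketch below; it only takes effect if the brokers are configured with a rack-aware replica selector. The rack id, broker address and topic are assumptions.

```python
# Hedged sketch: a consumer advertising its zone so fetches can be served
# by the nearest in-sync replica (requires rack-aware brokers).
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "broker.eu-west-1.example.com:9092",  # assumed address
    "group.id": "analytics-eu",
    "client.rack": "eu-west-1a",  # the consumer's zone/region
    "auto.offset.reset": "latest",
})
consumer.subscribe(["orders.enriched"])
msg = consumer.poll(5.0)  # served by the closest follower when possible
```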
Additionally, scaling clusters for data backfill improves load balancing across data centers. This scalability is critical for managing growing data volumes and network traffic, helping enterprises scale without sacrificing performance or reliability.
Conclusion
As companies undergo digital transformation, real-time data becomes increasingly critical to guiding decision-making. This involves extracting deeper insights from massive data sets, enabling more accurate forecasts, streamlining automated decision-making and delivering more personalized services, all while optimizing costs and operations.
One option is to adopt a reference architecture built around a scalable data streaming platform such as Redpanda, a plug-and-play Kafka replacement implemented in C++. It enables companies to avoid the complexity of real-time data processing by offering seamless scaling, a management API that supports lifecycle automation, tiered storage to reduce storage costs, remote read replicas to simplify setting up cost-effective read-only clusters, and seamless geo-distribution.
With the right technology, SaaS providers can enhance their services, improve customer experience, and increase their competitive advantage in the digital marketplace. Future strategies should continue to optimize these systems for greater efficiency and adaptability so that SaaS platforms can thrive in a data-driven world.
This article was first published on Yunyunzhongsheng (https://yylives.cc/).