Scaling a backend application can be challenging, especially when it’s built around WebSocket servers handling a large number of user sessions. On its own, a single WebSocket server can’t handle all the users, right? So, how do we scale the application effectively to meet increasing demands?

When scaling horizontally, several issues arise. For instance, imagine that User A is connected to Server X, and User B is connected to Server Y. How do they communicate with each other? The challenge here is maintaining seamless communication between users connected to different servers.
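To make the problem concrete, here is a minimal in-memory sketch. The names (`Bus`, `WsServer`, `user_a`, `user_b`) are hypothetical and the "bus" is just a Python object standing in for whatever shared messaging layer sits between the servers. It is not a real WebSocket or broker implementation.

```python
from typing import Callable, Dict, List


class Bus:
    """Stand-in for a shared message bus: fans every message out to all servers."""

    def __init__(self) -> None:
        self.handlers: List[Callable[[str, str], None]] = []

    def subscribe(self, handler: Callable[[str, str], None]) -> None:
        self.handlers.append(handler)

    def publish(self, user: str, text: str) -> None:
        for handler in list(self.handlers):
            handler(user, text)


class WsServer:
    """Models one WebSocket server that only knows its own connections."""

    def __init__(self, name: str, bus: Bus) -> None:
        self.name = name
        self.bus = bus
        # user -> messages pushed down that user's (simulated) socket
        self.local_inbox: Dict[str, List[str]] = {}
        bus.subscribe(self._on_bus_message)

    def connect(self, user: str) -> None:
        self.local_inbox[user] = []

    def send_local_only(self, user: str, text: str) -> bool:
        """Without a shared bus, a server can only reach its own users."""
        if user not in self.local_inbox:
            return False
        self.local_inbox[user].append(text)
        return True

    def send(self, user: str, text: str) -> None:
        """With a bus, publish and let whichever server owns the user deliver."""
        self.bus.publish(user, text)

    def _on_bus_message(self, user: str, text: str) -> None:
        if user in self.local_inbox:
            self.local_inbox[user].append(text)


bus = Bus()
server_x, server_y = WsServer("X", bus), WsServer("Y", bus)
server_x.connect("user_a")
server_y.connect("user_b")

# Server X has no socket for user_b, so a purely local send fails.
delivered_directly = server_x.send_local_only("user_b", "hi")

# Routed through the bus, Server Y picks the message up and delivers it.
server_x.send("user_b", "hi via bus")
```

Everything that follows is about what to use in place of that `Bus` object.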

Redis Streams: The First Solution

Initially, I turned to Redis Streams to solve this problem. Redis Streams is a powerful data structure in Redis that supports message queuing and processing. Here’s how I used it:

  • Centralized Messaging Channel: All microservices read from a shared Redis stream, and whenever a WebSocket server needed to broadcast a message, it appended it to the stream.
  • Message Durability: Since Redis Streams persist messages, any subscriber that temporarily went offline could catch up on missed messages.
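The catch-up behaviour can be illustrated with a small model. This is a plain-Python sketch of the idea (an append-only log plus a "last ID I saw" cursor per consumer), not the Redis API; in Redis the corresponding operations are roughly XADD to append and XREAD with the last-seen ID to read. `StreamModel` and the message payloads are made up for illustration.

```python
from typing import List, Tuple


class StreamModel:
    """Append-only log with increasing IDs; entries stay until removed."""

    def __init__(self) -> None:
        self.entries: List[Tuple[int, str]] = []
        self._next_id = 1

    def append(self, payload: str) -> int:
        entry_id = self._next_id
        self._next_id += 1
        self.entries.append((entry_id, payload))
        return entry_id

    def read_after(self, last_id: int) -> List[Tuple[int, str]]:
        """Return every entry newer than the consumer's cursor."""
        return [(i, p) for i, p in self.entries if i > last_id]


stream = StreamModel()
last_seen = 0  # the consumer's cursor

stream.append("msg-1")
batch = stream.read_after(last_seen)
last_seen = batch[-1][0]  # consumer is now caught up

# The consumer goes offline; producers keep appending.
stream.append("msg-2")
stream.append("msg-3")

# Back online: reading from the saved cursor recovers what was missed.
missed = stream.read_after(last_seen)
```

Note that `stream.entries` still holds all three messages at the end. That retained history is what makes catch-up possible, and it is also what has to be stored somewhere.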

This worked fine in the early stages, but during stress testing, the system slowed down. Why?

  • Storage Overhead: Redis Streams persist messages, and as the volume of messages grew, so did the memory and storage usage.
  • Throughput Limits: Redis executes commands on a single thread, so under concurrent load every stream read and write queued behind the others, creating a bottleneck.

Redis Pub/Sub: A Lightweight Alternative

To address the bottlenecks, I switched to Redis Pub/Sub. Unlike Streams, Pub/Sub doesn’t persist messages; it simply broadcasts them in real time to whichever clients are subscribed at that moment.

  • Improved Performance: By eliminating message storage, Pub/Sub reduced memory usage and improved message throughput.
  • Simpler Design: The absence of persistence meant fewer concerns about memory management.

However, Redis Pub/Sub also has limitations:

  • No Message Durability: If a subscriber is offline when a message is published, the message is lost.
  • Single-Threaded Nature: Redis still executes commands on a single thread, so throughput on one instance doesn’t grow with the number of CPU cores, and adding nodes doesn’t give a linear gain either.
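The durability trade-off is easy to see in a toy model. The sketch below is plain Python, not the Redis client API; `PubSubModel` and the subscriber name are hypothetical. The one detail borrowed from Redis is that publishing reports how many subscribers received the message.

```python
from typing import Callable, Dict, List


class PubSubModel:
    """Fire-and-forget broadcast: nothing is stored after delivery."""

    def __init__(self) -> None:
        self.subscribers: Dict[str, Callable[[str], None]] = {}

    def subscribe(self, name: str, handler: Callable[[str], None]) -> None:
        self.subscribers[name] = handler

    def unsubscribe(self, name: str) -> None:
        self.subscribers.pop(name, None)

    def publish(self, message: str) -> int:
        """Deliver to whoever is subscribed right now; return that count."""
        for handler in list(self.subscribers.values()):
            handler(message)
        return len(self.subscribers)


received: List[str] = []
broker = PubSubModel()

broker.subscribe("server_y", received.append)
first = broker.publish("m1")   # delivered to one subscriber

broker.unsubscribe("server_y")  # subscriber drops offline
second = broker.publish("m2")  # nobody is listening, so the message is gone

broker.subscribe("server_y", received.append)
third = broker.publish("m3")   # only new messages arrive after reconnecting
```

After this runs, `received` contains `m1` and `m3` but never `m2`. There is no log to replay from, which is exactly why memory stays flat and why offline subscribers miss messages.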

Building a Peer-to-Peer Service Discovery

To bypass the limitations of Redis, I attempted to build my own peer-to-peer service discovery system. The idea was to enable direct TCP/IP connections between microservices for communication, eliminating the need for a centralized message broker.

This approach offered:

  • Direct Communication: Services could talk to each other without an intermediary, reducing latency.
  • Customizability: I had complete control over the messaging layer, tailoring it to the application’s specific needs.

However, the drawbacks soon became apparent:

  • Increased Complexity: Managing connections, retries, and private networks added significant operational overhead.
  • Scalability Challenges: As the number of servers grew, maintaining the mesh network became cumbersome. Every new server needed to establish connections with all existing servers, so the total number of connections grew quadratically: n servers need n(n-1)/2 links.
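The connection count for a full mesh can be worked out directly: with n servers, each pair needs one link, which gives n(n-1)/2 connections. The short sketch below compares that with a design where each server holds a single connection to a shared broker. The function names and the server counts are only illustrative.

```python
def full_mesh_links(servers: int) -> int:
    """Connections needed when every server links directly to every other one."""
    return servers * (servers - 1) // 2


def broker_links(servers: int) -> int:
    """Connections needed when every server holds one link to a shared broker."""
    return servers


for n in (5, 10, 50, 100):
    print(f"{n:>3} servers: mesh={full_mesh_links(n):>4}  broker={broker_links(n):>3}")
```

Going from 10 to 100 servers takes the mesh from 45 links to 4,950, while the broker design goes from 10 to 100. Each of those mesh links also needs its own retry and health-check handling.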

Introducing NATS

Realizing the challenges of peer-to-peer architecture, I turned to NATS, a high-performance, lightweight messaging system designed for cloud-native applications. NATS offered a balance between simplicity and scalability, addressing the pain points of both Redis and my custom solution.

Why NATS?

  1. Performance

    NATS is built for speed. The server is written in Go and spreads its work across CPU cores, whereas Redis executes commands on a single thread. NATS can handle millions of messages per second with low latency.

  2. Scalability

    NATS supports horizontal scaling through clustering. Servers can be clustered to form a single logical bus, and clients can connect to any server in the cluster. This makes it easy to scale across multiple servers and even data centers.

  3. Fault Tolerance

    With NATS JetStream, you can enable message persistence, acknowledgments, and replay, providing durability similar to Redis Streams but with better performance under heavy loads.

  4. Simplicity

    NATS is lightweight and easy to deploy, often requiring minimal configuration. It doesn’t demand the operational complexity of a peer-to-peer system or the storage management overhead of Redis Streams.

  5. Multi-Modal Messaging

    NATS supports:

    • Pub/Sub: For real-time communication.
    • Request/Reply: Ideal for RPC-like use cases.
    • Queue Groups: For load balancing consumers.

  6. Auto-Pruning and Optimization

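    Redis Streams are only trimmed when you ask for it, either with XTRIM or with a MAXLEN/MINID cap on each XADD. A NATS JetStream stream, by contrast, is configured once with limits such as maximum age, size, or message count, and the server prunes old messages automatically, simplifying maintenance and keeping resource usage bounded.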

  7. Security Features

    NATS comes with built-in support for:

    • Mutual TLS for secure communication.
    • Token-based and user/password authentication.
    • Fine-grained access control for subjects (NATS’s term for topics).
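Of the messaging modes listed under item 5, queue groups are the least obvious, so here is a small in-memory model of how they differ from plain Pub/Sub. This is not the NATS client API. `BusModel`, the subject `chat.room1`, and the subscriber names are hypothetical, and the model picks group members round-robin to keep the output predictable, whereas a real NATS server picks a member of the group at random.

```python
from collections import defaultdict
from itertools import count
from typing import Callable, DefaultDict, Dict, List, Optional

Handler = Callable[[str], None]


class BusModel:
    """Plain subscribers all get a copy; each queue group gets exactly one."""

    def __init__(self) -> None:
        self.plain: DefaultDict[str, List[Handler]] = defaultdict(list)
        # subject -> queue group name -> members of that group
        self.groups: DefaultDict[str, Dict[str, List[Handler]]] = defaultdict(
            lambda: defaultdict(list)
        )
        self._turn = count()

    def subscribe(self, subject: str, handler: Handler,
                  queue: Optional[str] = None) -> None:
        if queue is None:
            self.plain[subject].append(handler)
        else:
            self.groups[subject][queue].append(handler)

    def publish(self, subject: str, message: str) -> None:
        # Pub/Sub: every plain subscriber receives the message.
        for handler in self.plain[subject]:
            handler(message)
        # Queue groups: one member per group receives it (round-robin here).
        turn = next(self._turn)
        for members in self.groups[subject].values():
            members[turn % len(members)](message)


log: DefaultDict[str, List[str]] = defaultdict(list)
bus = BusModel()

# Two WebSocket servers both need every chat message.
bus.subscribe("chat.room1", log["ws_x"].append)
bus.subscribe("chat.room1", log["ws_y"].append)

# Two workers share the load of one job, such as archiving messages.
bus.subscribe("chat.room1", log["worker_1"].append, queue="archivers")
bus.subscribe("chat.room1", log["worker_2"].append, queue="archivers")

for i in range(4):
    bus.publish("chat.room1", f"m{i}")
```

Both WebSocket servers end up with all four messages, while the two workers split them between themselves. For the WebSocket fan-out problem from the start of this post, plain subscriptions are the mode that matters; queue groups become useful once background consumers need to scale out too.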

Conclusion

Scaling a WebSocket-based backend is no small feat, and there’s no one-size-fits-all solution. Redis worked well in the early stages but hit its limits as the application grew. Building a custom peer-to-peer system taught me valuable lessons about complexity and trade-offs.

NATS emerged as the right tool for the job, striking a balance between performance, scalability, and simplicity. Its cloud-native design and robust feature set make it an excellent choice for modern distributed systems. If you’re building scalable, real-time applications, I highly recommend exploring NATS.