Welcome!

Java Authors: Don MacVittie, Maureen O'Gara, Liz McMillan, Walter H. Pinson, III, Yakov Werde

Related Topics: Java

Java: Article

Slow Receivers in a Distributed Management System

Slow receivers explained

In connection-oriented protocols like TCP, the sender has to copy data into its kernel buffers and send it out to each receiver individually. The send completes only when the data has been delivered to the receiver's kernel socket buffers. If the receiver's socket buffers are full, the send blocks until the buffers become available, slowing down the performance of other receivers that can't get messages from this sender because the sender is blocked trying to send a message to this slow receiver.

In connection-less protocols like reliable multicast, the sender sends the data out to the multicast network by copying it out once to the Ethernet card and then broadcasting it out on the network with the appropriate time to live parameters. The sender isn't bogged down by receivers at the network buffer level.

Protocol reliability is achieved by having the sender maintain a buffer of sent messages and wait to get ACK messages from the receivers that they've gotten a particular message. The senders' buffer is the limiting factor when it comes to retransmissions to receivers that can't pick up data from the receiver buffers fast enough and then request the sender to retransmit the lost data. Even in this case, one can see that senders end up spending CPU cycles and memory resources that slow receivers and bog down system throughput. Slow receivers are often referred to as crybaby receivers in network parlance.

Slow Receivers & Cache Consistency
The ability to receive and process every piece of relevant data is critical to the functioning of a distributed cache. It's assumed that the messages coming in are relevant to the receiver and to maintain cache consistency, it's essential to make attempts to process the incoming data and provide some cache consistency guarantees to the consuming application.

At the same time, this desire to receive and process every message can result in a system that runs at the speed of the slowest consumer, clearly something that most distributed applications wouldn't want to tolerate.

The solution is to define the consistency level that the cache elements in an application need and then provide a solution that deals with receiver slowness.

But before looking at solutions, let's consider the situations that result in a slow receiver.

  • Born slow receiver: Consider a system that's comprised of 16 servers and a desktop machine that needs to get data from one or more servers. If the desktop is configured as a peer, clearly its CPU will be unable to keep up with the flood of messages coming from the servers. Eventually (eventually is usually a couple of seconds at most in such systems), the desktop application's socket buffers fill up, bringing the publishing servers to a standstill, even though there are other consuming servers that can keep up with the publishers' rate.
  • Slow decline into slow receivership: In this system, all nodes start out as equals. Activity levels on different nodes, however, tend to vary, and one of the nodes ends up having to deal with heap fragmentation issues or garbage collection issues or one of its threads starts to run hot. In this kind of system, the performance drop is gradual and it takes a bit longer for its affects to be felt by the rest of the system, but the end result is the same nonetheless.
  • Poorly written applications: This category of slow receivers usually has two components. The component that picks up data from the socket buffers and hands it over for processing, and the component that does the actual processing. The first component works fine but the second is unable to keep up. This kind of slow receiver usually dies a quick death but the effects are felt later on. If the failed application was a server, the clients that it was processing quickly fail over to the other servers decreasing their throughput.
  • Receiver living in a hostile neighborhood: TCP applications are like well-mannered suburban drivers, making way for one another, going at the same speed as everyone else and generally living a fair life from a network bandwidth perspective. When a multicast application steps into the TCP neighborhood, unless the multicast application is designed to have some group control rate, the network suddenly looks a lot like the crowded streets of big cities, where fairness is no longer the norm. In such cases, a previously well-behaved TCP receiver starts to look more and more like a slow receiver slowing down the whole system.
Detecting a Slow Receiver
For every message sent from a sender to a receiver, the sender maintains some stats on the average time to completion. When the time-to-completion stat starts showing an upward trend and breaches a threshold, the sender flags that receiver as a slow receiver. This sort of detection works well in connection-oriented environments where the sender and receiver share a connection.

In connection-less environments, the sender has to maintain stats on the number of retransmission requests made by the receiver and, when that crosses a certain threshold, tag the receiver as a slow receiver.

A third class of slow receiver detection isn't really detection. Rather. a slow receiver on failing to keep up with the rest of the system or finding excessive use of memory in its application announces itself as a slow receiver, allowing the rest of the system to activate policies that have been configured for slow receivers.

Each member of the distributed system has stats that let the member detect that it's entering into slow receiver mode and can be configured with policies to deal with it.

Dealing with Slow Receivers
When it comes to slow receivers, there is no "one size fits all" policy that works (that works well anyway). The options that the system has once it encounters a slow receiver depend on its data consistency policy. What this implies is that a node has set certain data consistency expectations with other system members. These expectations play a major role in deciding how the member will be dealt with once it goes into slow receiver mode.

The slow receiver can choose to drop data, fire data loss notifications to the application, and catch up if the problem was temporary. This implies that not every update coming into the system has to be processed in order and if the application has to fetch data from the cache, it will be fetched from other nodes on-demand.

The slow receiver can send out a notification to other nodes stating that it is unable to accept any data until further notice. The remaining nodes would then ignore the member until they get a notice that the member was again open for business. Cache misses on other nodes wouldn't be directed to this node, and data on the slow receiver would be considered suspect for the rest of the system, even though the local cache on the slow receiver would continue to serve the application and clients it was attached to.


More Stories By Sudhir Menon

Sudhir Menon is the director of engineering for GemStone Systems and works closely with various development teams (both onsite and offshore) working on the Gemfire Enterprise Data Fabric. With over 17 years of cutting-edge software experience with marquee firms like Gemstone, Intel, EDS and CenterSpan communications, he is one of the key architects for the Gemfire Enterprise Data Fabric. His expertise in distributed data management spans multiple languages (Java, C++ and .NET) and multiple platforms and he has architected and developed network stacks for the last 10+ years.

Comments (0)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.