As part of a broader organisational restructure, data networking research at Swinburne University of Technology has moved from the Centre for Advanced Internet Architecture (CAIA) to the Internet For Things (I4T) Research Lab.

Although CAIA no longer exists, this website reflects CAIA's activities and outputs between March 2002 and February 2017, and is being maintained as a service to the broader data networking research community.

Incast congestion control

Background

TCP employs dynamic flow and congestion control to balance a flow's throughput against the impact on the network and receiver, lowering throughput to protect the network and/or receiver as required. Originally designed for operation over the public Internet, TCP is now so ubiquitously deployed that it constantly finds itself in environments which fall well outside its original design assumptions. The protocol's impressive robustness allows it to keep operating in these environments, but often at a significant cost to performance.

Datacenter networks are one such environment, and they pose difficult challenges for regular TCP. Current datacenters typically have Round Trip Times (RTTs) measured in microseconds, bandwidth measured in Gbps, large host fan-in/fan-out, and switch port buffers able to absorb only a very small burst of line-rate traffic.
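To put these numbers in perspective, the following back-of-the-envelope sketch (in Python, with assumed, illustrative figures for link rate, RTT and per-port buffering rather than the specifications of any particular switch) compares the bandwidth-delay product with how long a shallow port buffer can absorb a line-rate burst.

# Back-of-the-envelope datacenter buffering arithmetic.
# All figures are illustrative assumptions, not measurements of any particular switch.

LINK_RATE_BPS = 10e9       # assumed 10 Gbps link
RTT_S = 100e-6             # assumed 100 microsecond round trip time
PORT_BUFFER_BYTES = 128e3  # assumed 128 KB of buffering behind one switch port

# Bandwidth-delay product: the bytes in flight needed to keep the link busy.
bdp_bytes = (LINK_RATE_BPS / 8) * RTT_S

# Time to fill the port buffer if packets arrive at line rate while the
# output is stalled (e.g. during a fan-in burst).
time_to_fill_s = PORT_BUFFER_BYTES * 8 / LINK_RATE_BPS

print(f"Bandwidth-delay product: {bdp_bytes / 1e3:.0f} KB")
print(f"Port buffer absorbs roughly {time_to_fill_s * 1e6:.0f} microseconds of line-rate traffic")

With these assumed figures the buffer holds roughly one RTT's worth of traffic, so any sustained excess arrival quickly results in drops.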

One problem experienced by TCP operating in such an environment is known as incast [1]. The commonly used “divide and conquer” approach to serving data from large machine clusters routinely triggers this problem.

After an aggregation node has sent data requests to the appropriate nodes in the cluster hierarchy, the responses return asynchronously. There is a reasonable probability that many of the responses arrive at about the same time (a microburst), triggering momentary congestion at the switch port or in the kernel of the aggregation node. At high data rates, such an event can exhaust buffers and cause severe drop-tail packet loss. Making matters worse, TCP's millisecond-granularity timers can take many orders of magnitude longer than the actual network RTT to detect and recover from such events, increasing the time taken to service client queries and wasting useful equipment cycles.
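As a rough illustration of the fan-in arithmetic, the sketch below (again in Python, with assumed figures for the number of responders, response size, port buffering and the sender's minimum retransmission timeout) estimates the size of a fully synchronised burst arriving at the aggregation node's switch port, and how badly a timeout-based recovery mismatches the network RTT.

# Illustrative incast microburst arithmetic; every figure below is an assumption.

NUM_RESPONDERS = 50         # assumed fan-in of the aggregation query
RESPONSE_BYTES = 32e3       # assumed per-server response size (32 KB)
PORT_BUFFER_BYTES = 128e3   # assumed buffering behind the aggregation node's port
RTT_S = 100e-6              # assumed 100 microsecond network RTT
MIN_RTO_S = 0.2             # assumed 200 ms minimum retransmission timeout

burst_bytes = NUM_RESPONDERS * RESPONSE_BYTES
overflow_bytes = max(0.0, burst_bytes - PORT_BUFFER_BYTES)

print(f"Synchronised burst arriving at the port: {burst_bytes / 1e3:.0f} KB")
print(f"Excess over available buffering:         {overflow_bytes / 1e3:.0f} KB")

# Losses that can only be recovered via a retransmission timeout stall the
# query for MIN_RTO_S, orders of magnitude longer than the network RTT.
print(f"RTO / RTT ratio: {MIN_RTO_S / RTT_S:.0f}x")

Even with generous buffering assumptions the synchronised burst dwarfs the available buffer, and any loss that falls back to a retransmission timeout stalls the query for thousands of RTTs.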

Related work

Both academia and industry are paying increased attention to the challenge of keeping datacenter networks performing optimally while migrating to commodity network technologies. The consolidation of switching fabrics into a single, low-cost, high-speed Ethernet-based fabric has driven the development and specification of Datacenter Bridging [2]. Work has also been progressing separately on changes to TCP itself to make it perform better in datacenter environments.

Deploying finer-grained TCP timers [3] addresses the “impedance mismatch” between a TCP state machine operating on millisecond timescales and a datacenter network operating on microsecond timescales. DCTCP [4] uses Explicit Congestion Notification (ECN) signals from the network to adjust TCP's behaviour and avoid the accumulation of standing queues in network devices along the path between sender and receiver. ICTCP [5] adds receiver-driven flow fate-sharing information into the congestion control loop, an avenue closely aligned with the one we wish to explore further.
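To make DCTCP's adjustment concrete, the simplified sketch below (Python pseudocode of the update rule described in [4]; variable names and the example numbers are our own, and a real implementation lives inside the kernel's TCP stack) maintains a moving estimate of the fraction of ECN-marked bytes and scales the congestion window back in proportion to that estimate, rather than halving it on every congestion signal.

# Simplified sketch of DCTCP's congestion window adjustment, per [4].
# This is illustrative user-space pseudocode, not a kernel implementation.

G = 1.0 / 16.0  # gain for the moving average; [4] suggests g = 1/16

class DctcpState:
    def __init__(self, cwnd_bytes: float):
        self.cwnd = cwnd_bytes  # congestion window in bytes
        self.alpha = 0.0        # estimated fraction of ECN-marked bytes

    def on_window_acked(self, acked_bytes: float, marked_bytes: float) -> None:
        """Invoked roughly once per window of acknowledged data."""
        frac_marked = marked_bytes / acked_bytes if acked_bytes else 0.0

        # alpha <- (1 - g) * alpha + g * F
        self.alpha = (1.0 - G) * self.alpha + G * frac_marked

        if marked_bytes > 0:
            # cwnd <- cwnd * (1 - alpha / 2): back off in proportion to the
            # extent of congestion instead of halving on every signal.
            self.cwnd *= 1.0 - self.alpha / 2.0

# Example: one window in which 30% of the acknowledged bytes carried ECN marks.
state = DctcpState(cwnd_bytes=125e3)
state.on_window_acked(acked_bytes=125e3, marked_bytes=37.5e3)
print(f"alpha = {state.alpha:.3f}, cwnd = {state.cwnd / 1e3:.1f} KB")

Because alpha stays small when only a few packets are marked, DCTCP sheds just enough load to keep switch queues short without sacrificing throughput the way a full window halving would.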

These proposals represent the state of the art in datacenter TCP research and, importantly, are not mutually exclusive. A key point is that there is no silver bullet: improving the status quo requires a multifaceted engineering effort across multiple aspects of TCP's operation.

In addition to this TCP engineering work, others have examined datacenter traffic in more detail to better understand the problem space [6,7]. A systematic, detailed examination of microbursts in production networks would be of immense use.

Over recent years CAIA has also been engaged in a range of related TCP congestion control research projects funded by Cisco and the FreeBSD Foundation [8,9,10], empirically evaluating loss-based and delay-based TCP algorithms, publishing peer-reviewed academic papers and releasing upgrades to the FreeBSD network stack for community evaluation and use.

References

[1] Nagle, D., Serenyi, D., and Matthews, A., “The Panasas ActiveScale Storage Cluster: Delivering Scalable High Bandwidth Storage”, SC '04: Proceedings of the 2004 ACM/IEEE conference on Supercomputing.
[2] IEEE 802.1 Data Center Bridging Task Group, http://www.ieee802.org/1/pages/dcbridges.html
[3] Vasudevan, V., Phanishayee, A., Shah, H., Krevat, E., Andersen, D. G., Ganger, G. R., Gibson, G. A., and Mueller, B., “Safe and effective fine-grained TCP retransmissions for datacenter communication”, SIGCOMM '09: Proceedings of the ACM SIGCOMM 2009 conference on Data communication.
[4] Alizadeh, M., Greenberg, A., Maltz, D. A., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., and Sridharan, M., “DCTCP: Efficient Packet Transport for the Commoditized Data Center”, ACM SIGCOMM 2010.
[5] Wu, H., Feng, Z., Guo, C., and Zhang, Y., “ICTCP: Incast Congestion Control for TCP in data center networks”, ACM CoNEXT 2010.
[6] Chen, Y., Griffith, R., Liu, J., Katz, R. H., and Joseph, A. D., “Understanding TCP incast throughput collapse in datacenter networks”, WREN '09: Proceedings of the 1st ACM workshop on Research on enterprise networking.
[7] Kandula, S., Sengupta, S., Greenberg, A., Patel, P., and Chaiken, R., “The nature of data center traffic: measurements & analysis”, IMC '09: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference.
[8] NewTCP R&D Project, http://caia.swin.edu.au/urp/newtcp/
[9] ETCP R&D Project, http://caia.swin.edu.au/freebsd/etcp09/
[10] 5CC R&D Project, http://caia.swin.edu.au/freebsd/5cc/

Last Updated: Friday 30-Aug-2013 15:04:51 AEST | Maintained by: Lawrence Stewart (lastewart@swin.edu.au) | Authorised by: Grenville Armitage (garmitage@swin.edu.au)