Background
TCP
employs dynamic flow and congestion control to balance a flow's
throughput against the impact on the network and receiver, lowering
throughput to protect the network and/or receiver as required.
TCP was originally designed for operation over the public Internet, but
its ubiquitous deployment means it frequently finds itself in
environments that fall well outside its original design constraints.
While the protocol's impressive robustness allows it to keep operating
in these environments, it often does so at a significant cost to
performance.
Datacenter networks are one such environment, and they pose difficult
challenges for regular TCP to overcome. Current datacenters typically
have Round Trip Times (RTTs) measured in microseconds, bandwidth in
Gbps, large host fan-in/fan-out, and switch port buffers capable of
absorbing only a very small amount of traffic relative to line rate.
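To put those numbers in perspective, a rough back-of-the-envelope
calculation (the port speed and buffer size below are assumed,
illustrative values only) shows how little of a line-rate burst a
shallow commodity buffer can absorb:

    # Illustrative arithmetic only; both figures are assumed example values.
    LINE_RATE_BPS = 10e9        # assumed 10 Gbps switch port
    PORT_BUFFER_BYTES = 128e3   # assumed shallow per-port buffer (128 KB)

    line_rate_bytes_per_s = LINE_RATE_BPS / 8
    absorbable_us = PORT_BUFFER_BYTES / line_rate_bytes_per_s * 1e6
    print(f"One millisecond of line-rate traffic is {line_rate_bytes_per_s / 1000 / 1e3:.0f} KB")
    print(f"The buffer absorbs only ~{absorbable_us:.0f} microseconds of line-rate input")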
One problem experienced by TCP operating in such an environment is
known as incast [1]. The commonly used “divide and conquer” approach to
serving data from large machine clusters routinely triggers this
problem.
After an aggregation node has sent data requests to the appropriate
nodes in the cluster hierarchy, the responses return asynchronously.
There is a reasonable probability that the responses will all arrive at
about the same time (a microburst), triggering momentary congestion at
the switch port or in the kernel of the aggregating node. At high data
rates, such an event can result in buffer exhaustion and severe
drop-tail packet loss.
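A minimal sketch of the mechanism, with all parameter values assumed
purely for illustration, shows how a synchronised burst of responses
can overrun a drop-tail buffer even when any single response would fit
comfortably:

    # Minimal incast sketch; server count, response size, buffer depth and the
    # amount drained during the burst are all assumed, illustrative values.
    SERVERS = 50               # fan-in of one aggregated data request
    RESPONSE_PKTS = 10         # packets per server response
    BUFFER_PKTS = 100          # per-port buffer depth in packets
    DRAINED_DURING_BURST = 20  # packets forwarded while the burst arrives

    arriving = SERVERS * RESPONSE_PKTS
    room = BUFFER_PKTS + DRAINED_DURING_BURST
    dropped = max(0, arriving - room)   # drop-tail: everything beyond 'room' is lost
    print(f"{dropped} of {arriving} response packets dropped at the aggregator's port")

Any single 10-packet response fits easily in the buffer; it is the
synchronised 500-packet burst that overwhelms the port.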
Making matters worse, TCP's millisecond-granularity timers can take
many orders of magnitude longer than the actual network RTT to detect
and recover from such events, increasing the time taken to service
client queries and wasting useful equipment cycles.
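The scale of that mismatch is easy to quantify. Assuming, for
illustration, a conventional 200 ms minimum retransmission timeout and
a 100 microsecond datacenter RTT:

    # Illustrative arithmetic only; both values are assumed for the example.
    RTO_MIN_S = 0.200    # conventional minimum retransmission timeout (200 ms)
    DC_RTT_S = 100e-6    # datacenter round trip time (100 microseconds)

    stall_rtts = RTO_MIN_S / DC_RTT_S
    print(f"One timeout stalls the flow for ~{stall_rtts:.0f} RTTs")  # ~2000 RTTs idle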
Related work
Both academia and industry are paying increased attention to the
challenge of keeping datacenter networks performing optimally while
migrating to commodity network technologies. The consolidation of
switching fabrics into a single, low-cost, high-speed Ethernet-based
fabric has driven the development and specification of Data Center
Bridging [2]. Work has also been progressing separately on changes to
TCP itself to make it perform better in datacenter environments.
Deploying finer-grained TCP timers [3] addresses the “impedance
mismatch” between the TCP state machine operating in milliseconds and
the datacenter network operating on microsecond timescales. DCTCP [4]
uses Explicit Congestion Notification (ECN) signals from the network to
adjust TCP's behaviour and avoid the accumulation of standing queues in
network devices along the path between sender and receiver. ICTCP [5]
adds receiver-driven flow fate-sharing information to the congestion
control loop, an approach closely aligned with the avenue we wish to
explore further.
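As a concrete illustration of the DCTCP approach, the sketch below
follows the window adjustment described in [4]: the sender keeps a
moving estimate of the fraction of ECN-marked packets and cuts its
window in proportion to that estimate rather than halving it. The gain
and the traffic figures used here are assumed example values, not a
reference implementation.

    # Hedged sketch of a DCTCP-style window adjustment (per [4]); the gain g and
    # the marking pattern below are assumed example values.
    def dctcp_update(alpha, cwnd, marked_pkts, acked_pkts, g=1 / 16):
        """Update the marking estimate once per window and scale cwnd by it."""
        frac_marked = marked_pkts / acked_pkts if acked_pkts else 0.0
        alpha = (1 - g) * alpha + g * frac_marked    # EWMA of ECN-marked fraction
        if marked_pkts:
            cwnd = max(1.0, cwnd * (1 - alpha / 2))  # cut in proportion to congestion
        return alpha, cwnd

    alpha, cwnd = 0.0, 100.0
    for _ in range(5):   # sustained mild congestion: ~10% of each window marked
        alpha, cwnd = dctcp_update(alpha, cwnd,
                                   marked_pkts=int(0.1 * cwnd), acked_pkts=int(cwnd))
        print(f"alpha={alpha:.3f} cwnd={cwnd:.1f}")

Where standard TCP would halve its window on the first congestion
signal, this proportional reduction keeps queues short without
unnecessarily sacrificing throughput.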
These proposals represent the state of the art in datacenter TCP
research and, importantly, are not mutually exclusive. A key point is
that there is no silver bullet: improving the status quo requires a
multifaceted engineering effort across multiple aspects of TCP
operation.
In addition to the TCP engineering work being undertaken, others have
examined datacenter traffic in more detail to better understand the
problem space [6,7]. A systematic, detailed examination of production
network microbursts would be of immense use.
Over recent years CAIA has also been engaged in a range of related TCP
congestion control research projects funded by Cisco and the FreeBSD
Foundation [8,9,10], empirically evaluating loss-based and delay-based
TCP algorithms, publishing peer-reviewed academic papers, and releasing
upgrades to the FreeBSD network stack for community evaluation and use.
References
[1] Nagle, D., Serenyi, D., and Matthews, A., “The Panasas
ActiveScale Storage Cluster: Delivering Scalable High Bandwidth
Storage”, SC '04: Proceedings of the 2004 ACM/IEEE conference on
Supercomputing.
[2] IEEE 802.1 Data Center Bridging Task Group,
http://www.ieee802.org/1/pages/dcbridges.html
[3] Vasudevan, V., Phanishayee, A., Shah, H., Krevat, E., Andersen, D.
G., Ganger, G. R., Gibson, G. A., and Mueller, B., “Safe and effective
fine-grained TCP retransmissions for datacenter communication”, SIGCOMM
'09: Proceedings of the ACM SIGCOMM 2009 conference on Data
communication.
[4] Alizadeh, M., Greenberg, A., Maltz, D. A., Padhye, J., Patel, P.,
Prabhakar, B., Sengupta, S., and Sridharan, M., “DCTCP: Efficient
Packet Transport for the Commoditized Data Center”, ACM SIGCOMM 2010.
[5] Wu, H., Feng, Z., Guo, C., and Zhang, Y., “ICTCP: Incast Congestion
Control for TCP in Data Center Networks”, ACM CoNEXT 2010.
[6] Chen, Y., Griffith, R., Liu, J., Katz, R. H., and Joseph, A. D.,
“Understanding TCP incast throughput collapse in datacenter networks”,
WREN '09: Proceedings of the 1st ACM workshop on Research on enterprise
networking.
[7] Kandula, S., Sengupta, S., Greenberg, A., Patel, P., and Chaiken,
R., “The nature of data center traffic:
measurements & analysis”, IMC '09: Proceedings of the 9th ACM
SIGCOMM conference on Internet measurement conference.
[8] NewTCP R&D Project,
http://caia.swin.edu.au/urp/newtcp/
[9] ETCP R&D Project,
http://caia.swin.edu.au/freebsd/etcp09/
[10] 5CC R&D Project,
http://caia.swin.edu.au/freebsd/5cc/