TCP notes and incast recommendations
Matthew Macy
mmacy at nextbsd.org
Fri Nov 27 01:57:46 UTC 2015
In an effort to be somewhat current on the state of TCP I've collected a small bibliography. I've tried
to summarize RFCs and papers that I believe to be important and to provide some general background for
others who do not have a deeper familiarity with TCP or congestion control - in particular as it impacts DCTCP.
The recommendations reference Phabricator changes.
Table Of Contents:
I) - A Roadmap for Transmission Control Protocol (TCP)
Specification Documents (RFC 7414)
II) - Metrics for the Evaluation of Congestion Control Mechanisms
(RFC 5166)
III) - TCP Congestion Control (RFC 5681)
IV) - Computing TCP's Retransmission Timer (RFC 6298)
V) - Increasing TCP's Initial Window (RFC 6928)
VI) - TCP Extensions for High Performance [RTO updates
and changes to RFC 1323] (RFC 7323)
VII) - Updating TCP to Support Rate-Limited Traffic
[Congestion Window Validation] (RFC 7661)
VIII) - Active Queue Management (AQM)
IX) - Explicit Congestion Notification (ECN)
X) - AccurateECN (AccECN)
XI) - Incast Causes and Solutions
XII) - Data Center Transmission Control Protocol (DCTCP)
XIII) - Incast TCP (ICTCP)
XIV) - Quantized Congestion Notification (QCN)
XV) - Recommendations
A Roadmap for Transmission Control Protocol (TCP)
Specification Documents [important]:
https://tools.ietf.org/html/rfc7414
A correct and efficient implementation of the Transmission Control
Protocol (TCP) is a critical part of the software of most Internet
hosts. As TCP has evolved over the years, many distinct documents
have become part of the accepted standard for TCP. At the same time,
a large number of experimental modifications to TCP have also been
published in the RFC series, along with informational notes, case
studies, and other advice.
As an introduction to newcomers and an attempt to organize the
plethora of information for old hands, this document contains a
roadmap to the TCP-related RFCs. It provides a brief summary of the
RFC documents that define TCP. This should provide guidance to
implementers on the relevance and significance of the standards-track
extensions, informational notes, and best current practices that
relate to TCP.
This roadmap includes a brief description of the contents of each
TCP-related RFC [N.B. I only include an excerpt of the summary for those
that I consider interesting or important]. In some cases, we simply supply
the abstract or a key summary sentence from the text as a terse description.
In addition, a letter code after an RFC number indicates its category in the
RFC series (see BCP 9 [RFC2026] for explanation of these categories):
S - Standards Track (Proposed Standard, Draft Standard, or Internet
Standard)
E - Experimental
I - Informational
H - Historic
B - Best Current Practice
U - Unknown (not formally defined)
[2.] Core Functionality
A small number of documents compose the core specification of TCP.
These define the required core functionalities of TCP's header
parsing, state machine, congestion control, and retransmission
timeout computation. These base specifications must be correctly
followed for interoperability.
RFC 793 S: "Transmission Control Protocol", STD 7 (September 1981)
(Errata)
This is the fundamental TCP specification document [RFC793].
Written by Jon Postel as part of the Internet protocol suite's
core, it describes the TCP packet format, the TCP state machine
and event processing, and TCP's semantics for data transmission,
reliability, flow control, multiplexing, and acknowledgment.
RFC 1122 S: "Requirements for Internet Hosts - Communication Layers"
(October 1989)
This document [RFC1122] updates and clarifies RFC 793 (see above
in Section 2), fixing some specification bugs and oversights. It
also explains some features such as keep-alives and Karn's and
Jacobson's RTO estimation algorithms [KP87][Jac88][JK92]. ICMP
interactions are mentioned, and some tips are given for efficient
implementation. RFC 1122 is an Applicability Statement, listing
the various features that MUST, SHOULD, MAY, SHOULD NOT, and MUST
NOT be present in standards-conforming TCP implementations.
Unlike a purely informational roadmap, this Applicability
Statement is a standards document and gives formal rules for
implementation.
RFC 2460 S: "Internet Protocol, Version 6 (IPv6) Specification"
(December 1998) (Errata)
This document [RFC2460] is of relevance to TCP because it defines
how the pseudo-header for TCP's checksum computation is derived
when 128-bit IPv6 addresses are used instead of 32-bit IPv4
addresses. Additionally, RFC 2675 (see Section 3.1 of this
document) describes TCP changes required to support IPv6
jumbograms.
RFC 2873 S: "TCP Processing of the IPv4 Precedence Field" (June 2000)
(Errata)
This document [RFC2873] removes from the TCP specification all
processing of the precedence bits of the TOS byte of the IP
header. This resolves a conflict over the use of these bits
between RFC 793 (see above in Section 2) and Differentiated
Services [RFC2474].
RFC 5681 S: "TCP Congestion Control" (August 2009)
Although RFC 793 (see above in Section 2) did not contain any
congestion control mechanisms, today congestion control is a
required component of TCP implementations. This document
[RFC5681] defines the congestion avoidance and control mechanisms for
TCP, based on Van Jacobson's 1988 SIGCOMM paper [Jac88].
A number of behaviors that together constitute what the community
refers to as "Reno TCP" are described in RFC 5681. The name "Reno"
comes from the Net/2 release of the 4.3 BSD operating system.
This is generally regarded as the least common denominator among
TCP flavors currently found running on Internet hosts. Reno TCP
includes the congestion control features of slow start, congestion
avoidance, fast retransmit, and fast recovery.
RFC 5681 details the currently accepted congestion control
mechanism, while RFC 1122, (see above in Section 2) mandates that
such a congestion control mechanism must be implemented. RFC 5681
differs slightly from the other documents listed in this section,
as it does not affect the ability of two TCP endpoints to
communicate.
RFCs 2001 and 2581 are the conceptual precursors of RFC 5681. The
most important changes relative to RFC 2581 are:
(a) The initial window requirements were changed to allow larger
Initial Windows as standardized in [RFC3390] (see Section 3.2
of this document).
(b) During slow start and congestion avoidance, the usage of
Appropriate Byte Counting [RFC3465] (see Section 3.2 of this
document) is explicitly recommended.
(c) The use of Limited Transmit [RFC3042] (see Section 3.3 of
this document) is now recommended.
RFC 6093 S: "On the Implementation of the TCP Urgent Mechanism"
(January 2011)
This document [RFC6093] analyzes how current TCP stacks process
TCP urgent indications, ... and recommends against the use of the
urgent mechanism.
RFC 6298 S: "Computing TCP's Retransmission Timer" (June 2011)
Abstract of RFC 6298 [RFC6298]: "This document defines the
standard algorithm that Transmission Control Protocol (TCP)
senders are required to use to compute and manage their
retransmission timer. It expands on the discussion in
Section 4.2.3.1 of RFC 1122 and upgrades the requirement of
supporting the algorithm from a SHOULD to a MUST." RFC 6298
updates RFC 2988 by _changing_ the initial RTO from _3s_ to _1s_
[emphasis mine].
RFC 6691 I: "TCP Options and Maximum Segment Size (MSS)" (July 2012)
This document [RFC6691] clarifies what value to use with the TCP
Maximum Segment Size (MSS) option when IP and TCP options are in
use.
[3.] Strongly Encouraged Enhancements
This section describes recommended TCP modifications that improve
performance and security. Section 3.1 represents fundamental changes
to the protocol. Sections 3.2 and 3.3 list improvements over the
congestion control and loss recovery mechanisms as specified in RFC
5681 (see Section 2). Section 3.4 describes algorithms that allow a
TCP sender to detect whether it has entered loss recovery spuriously.
Section 3.5 comprises Path MTU Discovery mechanisms. Schemes for
TCP/IP header compression are listed in Section 3.6. Finally,
Section 3.7 deals with the problem of preventing acceptance of forged
segments and flooding attacks.
[3.1.] Fundamental Changes
RFCs 2675 and 7323 represent fundamental changes to TCP by redefining
how parts of the basic TCP header and options are interpreted. RFC
7323 defines the Window Scale option, which reinterprets the
advertised receive window. RFC 2675 specifies that MSS option and
urgent pointer fields with a value of 65,535 are to be treated
specially.
RFC 2675 S: "IPv6 Jumbograms" (August 1999) (Errata)
RFC 7323 S: "TCP Extensions for High Performance" (September 2014)
This document [RFC7323] defines TCP extensions for window scaling,
timestamps, and protection against wrapped sequence numbers, for
efficient and safe operation over paths with large bandwidth-delay
products. These extensions are commonly found in currently used
systems. The predecessor of this document, RFC 1323, was
published in 1992, and is deployed in most TCP implementations.
This document includes fixes and clarifications based on the
gained deployment experience. One specific issue addressed in
this specification is a recommendation on how to modify the algorithm
for estimating the mean RTT when timestamps are used. RFCs 1072,
1185, and 1323 are the conceptual precursors of RFC 7323.
[3.2.] Congestion Control Extensions
Two of the most important aspects of TCP are its congestion control
and loss recovery features. TCP treats lost packets as indicating
congestion-related loss and cannot distinguish between congestion-
related loss and loss due to transmission errors. Even when ECN is
in use, there is a rather intimate coupling between congestion
control and loss recovery mechanisms. There are several extensions
to both features, and more often than not, a particular extension
applies to both. In these two subsections, we group enhancements to
TCP's congestion control, while the next subsection focuses on TCP's
loss recovery.
RFC 3168 S: "The Addition of Explicit Congestion Notification (ECN)
to IP" (September 2001)
This document [RFC3168] defines a means for end hosts to detect
congestion before congested routers are forced to discard packets.
Although congestion notification takes place at the IP level, ECN
requires support at the transport level (e.g., in TCP) to echo the
bits and adapt the sending rate. This document updates RFC 793
(see Section 2 of this document) to define two previously unused
flag bits in the TCP header for ECN support.
RFC 3390 S: "Increasing TCP's Initial Window" (October 2002)
This document [RFC3390] specifies an increase in the permitted
initial window for TCP from one segment to three or four segments
during the slow start phase, depending on the segment size.
RFC 3465 E: "TCP Congestion Control with Appropriate Byte Counting
(ABC)" (February 2003)
This document [RFC3465] suggests that congestion control use the
number of bytes acknowledged instead of the number of
acknowledgments received. This change improves the performance of
TCP in situations where there is no one-to-one relationship
between data segments and acknowledgments (e.g., delayed ACKs or
ACK loss). ABC is recommended by RFC 5681 (see Section 2).
RFC 6633 S: "Deprecation of ICMP Source Quench Messages" (May 2012)
This document [RFC6633] formally deprecates the use of ICMP Source
Quench messages by transport protocols and recommends against the
implementation of [RFC1016].
[3.3.] Loss Recovery Extensions
For the typical implementation of the TCP fast recovery algorithm
described in RFC 5681 (see Section 2 of this document), a TCP sender
only retransmits a segment after a retransmit timeout has occurred,
or after three duplicate ACKs have arrived triggering the fast
retransmit. A single RTO might result in the retransmission of
several segments, while the fast retransmit algorithm in RFC 5681
leads only to a single retransmission. Hence, multiple losses from a
single window of data can lead to a performance degradation.
Documents listed in this section aim to improve the overall
performance of TCP's standard loss recovery algorithms. In
particular, some of them allow TCP senders to recover more
effectively when multiple segments are lost from a single flight of
data.
RFC 2018 S: "TCP Selective Acknowledgment Options" (October 1996)
(Errata)
When more than one packet is lost during one RTT, TCP may
experience poor performance since a TCP sender can only learn
about a single lost packet per RTT from cumulative
acknowledgments. This document [RFC2018] defines the basic
selective acknowledgment (SACK) mechanism for TCP, which can help
to overcome these limitations. The receiving TCP returns SACK
blocks to inform the sender which data has been received. The
sender can then retransmit only the missing data segments.
RFC 3042 S: "Enhancing TCP's Loss Recovery Using Limited Transmit"
(January 2001)
Abstract of RFC 3042 [RFC3042]: "This document proposes a new
Transmission Control Protocol (TCP) mechanism that can be used to
more effectively recover lost segments when a connection's
congestion window is small, or when a large number of segments are
lost in a single transmission window." This algorithm described
in RFC 3042 is called "Limited Transmit". Limited Transmit is
recommended by RFC 5681 (see Section 2 of this document).
RFC 6582 S: "The NewReno Modification to TCP's Fast Recovery
Algorithm" (April 2012)
This document [RFC6582] specifies a modification to the standard
Reno fast recovery algorithm, whereby a TCP sender can use partial
acknowledgments to make inferences determining the next segment to
send in situations where SACK would be helpful but isn't
available. Although it is only a slight modification, the NewReno
behavior can make a significant difference in performance when
multiple segments are lost from a single window of data.
RFC 6675 S: "A Conservative Loss Recovery Algorithm Based on
Selective Acknowledgment (SACK) for TCP" (August 2012)
This document [RFC6675] describes a conservative loss recovery
algorithm for TCP that is based on the use of the selective
acknowledgment (SACK) TCP option [RFC2018] (see above in
Section 3.3). The algorithm conforms to the spirit of the
congestion control specification in RFC 5681 (see Section 2 of
this document), but allows TCP senders to recover more effectively
when multiple segments are lost from a single flight of data.
RFC 6675 is a revision of RFC 3517 to address several situations
that were not handled explicitly before. In particular,
(a) it improves the loss detection in the event that the sender
has outstanding segments that are smaller than Sender Maximum
Segment Size (SMSS).
(b) it modifies the definition of a "duplicate acknowledgment" to
utilize the SACK information in detecting loss.
(c) it maintains the ACK clock under certain circumstances
involving loss at the end of the window.
3.4. Detection and Prevention of Spurious Retransmissions
Spurious retransmission timeouts are harmful to TCP performance and
multiple algorithms have been defined for detecting when spurious
retransmissions have occurred, but they respond differently with
regard to their manners of recovering performance. The IETF defined
multiple algorithms because there are trade-offs in whether or not
certain TCP options need to be implemented and concerns about IPR
status. The Standards Track RFCs in this section are closely related
to the Experimental RFCs in Section 4.5 also addressing this topic.
RFC 2883 S: "An Extension to the Selective Acknowledgement (SACK)
Option for TCP" (July 2000)
This document [RFC2883] extends RFC 2018 (see Section 3.3 of this
document). It enables use of the SACK option to acknowledge
duplicate packets. With this extension, called DSACK, the sender
is able to infer the order of packets received at the receiver
and, therefore, to infer when it has unnecessarily retransmitted a
packet. A TCP sender could then use this information to detect
spurious retransmissions (see [RFC3708]).
RFC 4015 S: "The Eifel Response Algorithm for TCP" (February 2005)
Abstract of RFC 4015 [RFC4015]: "Based on an appropriate detection
algorithm, the Eifel response algorithm provides a way for a TCP
sender to respond to a detected spurious timeout."
RFC 5682 S: "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting
Spurious Retransmission Timeouts with TCP" (September
2009)
The F-RTO detection algorithm [RFC5682], originally described in
RFC 4138, provides an option for inferring spurious retransmission
timeouts. Unlike some similar detection methods (e.g., RFCs 3522
and 3708, both listed in Section 4.5 of this document), F-RTO does
not rely on the use of any TCP options. The basic idea is to send
previously unsent data after the first retransmission after a RTO.
If the ACKs advance the window, the RTO may be declared spurious.
[3.5.] Path MTU Discovery
RFC 1191 S: "Path MTU Discovery" (November 1990)
RFC 1981 S: "Path MTU Discovery for IP version 6" (August 1996)
RFC 4821 S: "Packetization Layer Path MTU Discovery" (March 2007)
Abstract of RFC 4821 [RFC4821]: "This document describes a robust
method for Path MTU Discovery (PMTUD) that relies on TCP or some
other Packetization Layer to probe an Internet path with
progressively larger packets."
[3.6.] Header Compression
Especially in streaming applications, the overhead of TCP/IP headers
could correspond to more than 50% of the total amount of data sent.
Such large overheads may be tolerable in wired LANs where capacity is
often not an issue, but are excessive for WANs and wireless systems
where bandwidth is scarce. Header compression schemes for TCP/IP
like RObust Header Compression (ROHC) can significantly compress this
overhead. It performs well over links with significant error rates
and long round-trip times.
RFC 1144 S: "Compressing TCP/IP Headers for Low-Speed Serial Links"
(February 1990)
RFC 6846 S: "RObust Header Compression (ROHC): A Profile for TCP/IP
(ROHC-TCP)" (January 2013)
3.7. Defending Spoofing and Flooding Attacks
By default, TCP lacks any cryptographic structures to differentiate
legitimate segments from those spoofed from malicious hosts.
Spoofing valid segments requires correctly guessing a number of
fields. The documents in this subsection describe ways to make that
guessing harder or to prevent it from being able to affect a
connection negatively.
RFC 4953 I: "Defending TCP Against Spoofing Attacks" (July 2007)
RFC 4987 I: "TCP SYN Flooding Attacks and Common Mitigations" (August
2007)
RFC 5925 S: "The TCP Authentication Option" (June 2010)
RFC 5926 S: "Cryptographic Algorithms for the TCP Authentication
Option (TCP-AO)" (June 2010)
RFC 5927 I: "ICMP Attacks against TCP" (July 2010)
RFC 5961 S: "Improving TCP's Robustness to Blind In-Window Attacks"
(August 2010)
RFC 6528 S: "Defending against Sequence Number Attacks" (February
2012)
[4.] Experimental Extensions
The RFCs in this section are either Experimental and may become
Proposed Standards in the future or are Proposed Standards (or
Informational), but can be considered experimental due to lack of
wide deployment. At least part of the reason that they are still
experimental is to gain more wide-scale experience with them before a
standards track decision is made.
[4.1.] Architectural Guidelines
As multiple flows may share the same paths, sections of paths, or
other resources, the TCP implementation may benefit from sharing
information across TCP connections or other flows. Some experimental
proposals have been documented and some implementations have included
the concepts.
RFC 2140 I: "TCP Control Block Interdependence" (April 1997)
RFC 3124 S: "The Congestion Manager" (June 2001)
This document [RFC3124] is a related proposal to RFC 2140 (see
above in Section 4.1). The idea behind the Congestion Manager,
moving congestion control outside of individual TCP connections,
represents a modification to the core of TCP, which supports
sharing information among TCP connections. Although a Proposed
Standard, some pieces of the Congestion Manager support
architecture have not been specified yet, and it has not achieved
use or implementation beyond experimental stacks, so it is not
listed among the standard TCP enhancements in this roadmap.
[4.2.] Fundamental Changes
Like the Standards Track documents listed in Section 3.1, there also
exist new Experimental RFCs that specify fundamental changes to TCP.
At the time of writing, the only example so far is TCP Fast Open, which
deviates from the standard TCP semantics of [RFC793].
RFC 7413 E: "TCP Fast Open" (December 2014)
This document [RFC7413] describes TCP Fast Open that allows data
to be carried in the SYN and SYN-ACK packets and consumed by the
receiver during the initial connection handshake.
[4.3.] Congestion Control Extensions
TCP congestion control has been an extremely active research area for
many years (see RFC 5783 discussed in Section 7.6 of this document),
as it determines the performance of many applications that use TCP.
A number of Experimental RFCs address issues with flow start up,
overshoot, and steady-state behavior in the basic algorithms of RFC
5681 (see Section 2 of this document). In these subsections,
enhancements to TCP's congestion control are listed.
RFC 2861 E: "TCP Congestion Window Validation" (June 2000)
RFC 3540 E: "Robust Explicit Congestion Notification (ECN) Signaling
with Nonces" (June 2003)
RFC 3649 E: "HighSpeed TCP for Large Congestion Windows" (December
2003)
RFC 3742 E: "Limited Slow-Start for TCP with Large Congestion
Windows" (March 2004)
RFC 4782 E: "Quick-Start for TCP and IP" (January 2007) (Errata)
RFC 5562 E: "Adding Explicit Congestion Notification (ECN) Capability
to TCP's SYN/ACK Packets" (June 2009)
RFC 5690 I: "Adding Acknowledgement Congestion Control to TCP"
(February 2010)
RFC 6928 E: "Increasing TCP's Initial Window" (April 2013)
This document [RFC6928] proposes to increase the TCP initial
window from between 2 and 4 segments, as specified in RFC 3390
(see Section 3.2 of this document), to 10 segments with a fallback
to the existing recommendation when performance issues are
detected.
[4.4.] Loss Recovery Extensions
RFC 5827 E: "Early Retransmit for TCP and Stream Control Transmission
Protocol (SCTP)" (April 2010)
This document [RFC5827] proposes the "Early Retransmit" mechanism
for TCP (and SCTP) that can be used to recover lost segments when
a connection's congestion window is small. In certain special
circumstances, Early Retransmit reduces the number of duplicate
acknowledgments required to trigger fast retransmit to recover
segment losses without waiting for a lengthy retransmission
timeout.
RFC 6069 E: "Making TCP More Robust to Long Connectivity Disruptions
(TCP-LCD)" (December 2010)
RFC 6937 E: "Proportional Rate Reduction for TCP" (May 2013)
This document [RFC6937] describes an experimental Proportional
Rate Reduction (PRR) algorithm as an alternative to the widely
deployed Fast Recovery algorithm, to improve the accuracy of the
amount of data sent by TCP during loss recovery.
[4.5.] Detection and Prevention of Spurious Retransmissions
In addition to the Standards Track extensions to deal with spurious
retransmissions in Section 3.4, Experimental proposals have also been
documented.
RFC 3522 E: "The Eifel Detection Algorithm for TCP" (April 2003)
RFC 3708 E: "Using TCP Duplicate Selective Acknowledgement (DSACKs)
and Stream Control Transmission Protocol (SCTP) Duplicate
Transmission Sequence Numbers (TSNs) to Detect Spurious
Retransmissions" (February 2004)
RFC 4653 E: "Improving the Robustness of TCP to Non-Congestion
Events" (August 2006)
[4.6.] TCP Timeouts
RFC 5482 S: "TCP User Timeout Option" (March 2009)
[4.7.] Multipath TCP
MultiPath TCP (MPTCP) is an ongoing effort within the IETF that
allows a TCP connection to simultaneously use multiple IP addresses /
interfaces to spread its data across several subflows, while
presenting a regular TCP interface to applications. Benefits of this
include better resource utilization, better throughput and smoother
reaction to failures. The documents listed in this section specify
the Multipath TCP scheme, while the documents in Sections 7.2, 7.4,
and 7.5 provide some additional background information.
RFC 6356 E: "Coupled Congestion Control for Multipath Transport
Protocols" (October 2011)
RFC 6824 E: "TCP Extensions for Multipath Operation with Multiple
Addresses" (January 2013) (Errata)
[5.] TCP Parameters at IANA
RFC 2780 B: "IANA Allocation Guidelines For Values In the Internet
Protocol and Related Headers" (March 2000)
RFC 4727 S: "Experimental Values in IPv4, IPv6, ICMPv4, ICMPv6, UDP,
and TCP Headers" (November 2006)
RFC 6335 B: "Internet Assigned Numbers Authority (IANA) Procedures
for the Management of the Service Name and Transport
Protocol Port Number Registry" (August 2011)
RFC 6994 S: "Shared Use of Experimental TCP Options" (August 2013)
[7.] Support Documents
This section contains several classes of documents that do not
necessarily define current protocol behaviors but that are
nevertheless of interest to TCP implementers. Section 7.1 describes
several foundational RFCs that give modern readers a better
understanding of the principles underlying TCP's behaviors and
development over the years. Section 7.2 contains architectural
guidelines and principles for TCP architects and designers. The
documents listed in Section 7.3 provide advice on using TCP in
various types of network situations that pose challenges above those
of typical wired links. Guidance for developing, analyzing, and
evaluating TCP is given in Section 7.4. Some implementation notes
and implementation advice can be found in Section 7.5. RFCs that
describe tools for testing and debugging TCP implementations or that
contain high-level tutorials on the protocol are listed in Section 7.6.
The TCP Management Information Bases are described in Section 7.7,
and Section 7.8 lists a number of case studies that have explored TCP
performance.
7.4. Guidance for Developing, Analyzing, and Evaluating TCP
Documents in this section give general guidance for developing,
analyzing, and evaluating TCP. Some of the documents discuss, for
example, the properties of congestion control protocols that are
"safe" for Internet deployment as well as how to measure the
properties of congestion control mechanisms and transport protocols.
RFC 5033 B: "Specifying New Congestion Control Algorithms" (August
2007)
This document [RFC5033] considers the evaluation of suggested
congestion control algorithms that differ from the principles
outlined in RFC 2914 (see Section 7.2 of this document). It is
useful for authors of such algorithms as well as for IETF members
reviewing the associated documents.
RFC 5166 I: "Metrics for the Evaluation of Congestion Control
Mechanisms" (March 2008)
This document [RFC5166] discusses metrics that need to be
considered when evaluating new or modified congestion control
mechanisms for the Internet. Among other topics, the document
discusses throughput, delay, loss rates, response times, fairness,
and robustness for challenging environments.
RFC 6077 I: "Open Research Issues in Internet Congestion Control"
(February 2011)
This document [RFC6077] summarizes the main open problems in the
domain of Internet congestion control. As a good starting point
for newcomers, the document describes several new challenges that
are becoming important as the network grows, as well as some
issues that have been known for many years.
RFC 6181 I: "Threat Analysis for TCP Extensions for Multipath
Operation with Multiple Addresses" (March 2011)
This document [RFC6181] describes a threat analysis for Multipath
TCP (MPTCP) (see Section 4.7 of this document). The document
discusses several types of attacks and provides recommendations
for MPTCP designers on how to create an MPTCP specification that is
as secure as the current (single-path) TCP.
RFC 6349 I: "Framework for TCP Throughput Testing" (August 2011)
From the Abstract of RFC 6349 [RFC6349]: "This framework describes
a practical methodology for measuring end-to-end TCP Throughput in
a managed IP network. The goal is to provide a better indication
in regard to user experience. In this framework, TCP and IP
parameters are specified to optimize TCP Throughput."
7.5. Implementation Advice
RFC 794 U: "PRE-EMPTION" (September 1981)
This document [RFC794] clarifies that operating systems need to
manage their limited resources, which may include TCP connection
state, and that these decisions can be made with application
input, but they do not need to be part of the TCP protocol
specification itself.
RFC 879 U: "The TCP Maximum Segment Size and Related Topics"
(November 1983)
RFC 1071 U: "Computing the Internet Checksum" (September 1988)
(Errata)
RFC 1624 I: "Computation of the Internet Checksum via Incremental
Update" (May 1994)
RFC 1936 I: "Implementing the Internet Checksum in Hardware" (April
1996)
RFC 2525 I: "Known TCP Implementation Problems" (March 1999)
RFC 2923 I: "TCP Problems with Path MTU Discovery" (September 2000)
RFC 3493 I: "Basic Socket Interface Extensions for IPv6" (February
2003)
RFC 6056 B: "Recommendations for Transport-Protocol Port
Randomization" (December 2010)
RFC 6191 B: "Reducing the TIME-WAIT State Using TCP Timestamps"
(April 2011)
RFC 6429 I: "TCP Sender Clarification for Persist Condition"
(December 2011)
RFC 6897 I: "Multipath TCP (MPTCP) Application Interface
Considerations" (March 2013)
7.6. Tools and Tutorials
RFC 1180 I: "TCP/IP Tutorial" (January 1991) (Errata)
This document [RFC1180] is an extremely brief overview of the TCP/
IP protocol suite as a whole. It gives some explanation as to how
and where TCP fits in.
RFC 1470 I: "FYI on a Network Management Tool Catalog: Tools for
Monitoring and Debugging TCP/IP Internets and
Interconnected Devices" (June 1993)
A few of the tools that this document [RFC1470] describes are
still maintained and in use today, for example, ttcp and tcpdump.
However, many of the tools described do not relate specifically to
TCP and are no longer used or easily available.
RFC 2398 I: "Some Testing Tools for TCP Implementors" (August 1998)
This document [RFC2398] describes a number of TCP packet
generation and analysis tools. Although some of these tools are
no longer readily available or widely used, for the most part they
are still relevant and usable.
RFC 5783 I: "Congestion Control in the RFC Series" (February 2010)
This document [RFC5783] provides an overview of RFCs related to
congestion control that had been published at the time. The focus
of the document is on end-host-based congestion control.
8. Undocumented TCP Features
There are a few important implementation tactics for TCP that
have not yet been described in any RFC. Although this roadmap is
primarily concerned with mapping the TCP RFCs, this section is
included because an implementer needs to be aware of these important
issues.
Header Prediction
Header prediction is a trick to speed up the processing of
segments. Van Jacobson and Mike Karels developed the technique in
the late 1980s. The basic idea is that some processing time can
be saved when most of a segment's fields can be predicted from
previous segments. A good description of this was sent to the
TCP-IP mailing list by Van Jacobson on March 9, 1988 (see
[Jacobson] for the full message):
Quite a bit of the speedup comes from an algorithm that we
('we' refers to collaborator Mike Karels and myself) are
calling "header prediction". The idea is that if you're in the
middle of a bulk data transfer and have just seen a packet, you
know what the next packet is going to look like: It will look
just like the current packet with either the sequence number or
ack number updated (depending on whether you're the sender or
receiver). Combining this with the "Use hints" epigram from
Butler Lampson's classic "Epigrams for System Designers", you
start to think of the tcp state (rcv.nxt, snd.una, etc.) as
"hints" about what the next packet should look like.
If you arrange those "hints" so they match the layout of a tcp
packet header, it takes a single 14-byte compare to see if your
prediction is correct (3 longword compares to pick up the send
& ack sequence numbers, header length, flags and window, plus a
short compare on the length). If the prediction is correct,
there's a single test on the length to see if you're the sender
or receiver followed by the appropriate processing. E.g., if
the length is non-zero (you're the receiver), checksum and
append the data to the socket buffer then wake any process
that's sleeping on the buffer. Update rcv.nxt by the length of
this packet (this updates your "prediction" of the next
packet). Check if you can handle another packet the same size
as the current one. If not, set one of the unused flag bits in
your header prediction to guarantee that the prediction will
fail on the next packet and force you to go through full
protocol processing. Otherwise, you're done with this packet.
So, the *total* tcp protocol processing, exclusive of
checksumming, is on the order of 6 compares and an add.
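[N.B. A minimal C sketch of the header-prediction idea, assuming invented
struct and field names (this is not Van Jacobson's code or the FreeBSD fast
path); it only illustrates caching the expected header fields as "hints" and
comparing the incoming segment against them before falling back to full
protocol processing:]

    #include <stdint.h>
    #include <stdio.h>

    /* Cached prediction of what the next header should look like. */
    struct tcp_pred {
        uint16_t flags_hl;  /* data offset + flags; 0x5010 = 20-byte hdr, ACK */
        uint16_t win;       /* advertised window                              */
        uint32_t seq;       /* expected sequence number (rcv.nxt)             */
        uint32_t ack;       /* expected ack number (snd.una)                  */
    };

    /* The same fields extracted from an arriving segment. */
    struct tcp_hdr_core {
        uint16_t flags_hl;
        uint16_t win;
        uint32_t seq;
        uint32_t ack;
    };

    /* Nonzero when the segment matches the prediction and can take the fast
     * path; otherwise the caller runs full segment processing. */
    static int
    hdr_predicted(const struct tcp_pred *p, const struct tcp_hdr_core *h)
    {
        return h->flags_hl == p->flags_hl && h->win == p->win &&
               h->seq == p->seq && h->ack == p->ack;
    }

    int
    main(void)
    {
        struct tcp_pred p = { 0x5010, 65535, 1000, 2000 };
        struct tcp_hdr_core h = { 0x5010, 65535, 1000, 2000 };

        printf("fast path: %s\n", hdr_predicted(&p, &h) ? "yes" : "no");
        return 0;
    }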
Forward Acknowledgement (FACK)
FACK [MM96] includes an alternate algorithm for triggering fast
retransmit [RFC5681], based on the extent of the SACK scoreboard.
Its goal is to trigger fast retransmit as soon as the receiver's
reassembly queue is larger than the duplicate ACK threshold, as
indicated by the difference between the forward most SACK block
edge and SND.UNA. This algorithm quickly and reliably triggers
fast retransmit in the presence of burst losses -- often on the
first SACK following such a loss. Such a threshold-based
algorithm also triggers fast retransmit immediately in the
presence of any reordering with extent greater than the duplicate
ACK threshold. FACK is implemented in Linux and turned on by
default.
Congestion Control for High Rate Flows
In the last decade significant research effort has been put into
experimental TCP congestion control modifications for obtaining
high throughput with reduced startup and recovery times. Only a
few RFCs have been published on some of these modifications,
including HighSpeed TCP [RFC3649], Limited Slow-Start [RFC3742],
and Quick-Start [RFC4782] (see Section 4.3 of this document for
more information on each), but high-rate congestion control
mechanisms are still considered an open issue in congestion
control research. Some other schemes have been published as
Internet-Drafts, e.g. CUBIC [CUBIC] (the default TCP congestion
control algorithm in Linux), Compound TCP [CTCP], and H-TCP [HTCP]
or have been discussed a little by the IETF, but much of the work
in this area has not been adopted within the IETF yet, so the
majority of this work is outside the RFC series and may be
discussed in other products of the IRTF Internet Congestion
Control Research Group (ICCRG).
Metrics for the Evaluation of Congestion Control Mechanisms
https://tools.ietf.org/html/rfc5166
Discusses the metrics to be considered in an evaluation
of new or modified congestion control mechanisms for the Internet.
These include metrics for the evaluation of new transport protocols,
of proposed modifications to TCP, of application-level congestion
control, and of Active Queue Management (AQM) mechanisms in the
router. This document is the first in a series of documents aimed at
improving the models that we use in the evaluation of transport
protocols.
Types Of Metrics:
- Throughput, Delay, and Loss Rates
- Throughput: can be measured as
- router-based metric of aggregate link utilization
- flow-based metric of per-connection transfer times
- user-based metric of utility functions or user wait times
- Goodput: sometimes distinguished from throughput where throughput
is the link utilization or flow rate in bytes per second; goodput
is the subset of throughput (also measured in Bytes/s) consisting
of useful traffic [i.e. excluding duplicate packets]
- Delay: Like throughput, delay can be measured as a router-based metric of
queueing delay over time, or as a flow-based metric in terms of
per-packet transfer times. Per-packet delay can also include delay
at the sender waiting for the transport protocol to send the packet.
For reliable transfer, the per-packet transfer time seen by the
application includes the possible delay of retransmitting a lost
packet.
- Packet Loss Rates: can be measured as a network-based or as a
flow-based metric. One network-related reason to avoid high steady-
state packet loss rates is to avoid congestion collapse in environments
containing paths with multiple congested links.
- Response Times and Minimizing Oscillations
- Response to Changes: One of the key concerns in the design of congestion
control mechanisms has been the response times to sudden congestion in the
network. On the one hand, congestion control mechanisms should
respond reasonably promptly to sudden congestion from routing or
bandwidth changes or from a burst of competing traffic. At the same
time, congestion control mechanisms should not respond too severely
to transient changes, e.g., to a sudden increase in delay that will
dissipate in less than the connection's round-trip time.
- Minimizing Oscillations: One goal is that of stability, in terms of
minimizing oscillations of queueing delay or of throughput. In practice,
stability is frequently associated with rate fluctuations or variance.
Rate variations can result in fluctuations in router queue size and
therefore of queue overflows. These queue overflows can cause loss
synchronizations across coexisting flows and periodic under-utilization
of link capacity, both of which are considered to be general signs of
network instability. Thus, measuring the rate variations of flows is
often used to measure the stability of transport protocols. To measure
rate variations, [JWL04], [RX05], and [FHPW00] use the coefficient of
variation (CoV) of per-flow transmission rates, and [WCL05] suggests the
use of standard deviations of per-flow rates. Since rate variations are
a function of time scales, it makes sense to measure these rate variations
over various time scales.
- Fairness and Convergence
- Fairness between Flows: let x_i be the throughput for the i-th connection.
[N.B. a small C sketch of the fairness metrics below appears at the end of
this metrics section.]
- Jain's fairness index: The fairness index in [JCH84] is:
(( sum_i x_i )^2) / (n * sum_i ( (x_i)^2 )),
where there are n users. This fairness index ranges from 0 to 1, and
it is maximum when all users receive the same allocation. This index
is k/n when k users equally share the resource, and the other n-k
users receive zero allocation.
- The product measure:
product_i x_i
the product of the throughput of the individual connections, is also
used as a measure of fairness. (In some contexts x_i is taken as the
power of the i-th connection, and the product measure is referred to
as network power.) The product measure is particularly sensitive to
segregation; the product measure is zero if any connection receives
zero throughput. [N.B. If one normalizes to actual bandwidth by taking
the Nth root of the product, where N = number of connections, this is
the geometric mean. The geometric mean will be less than the arithmetic
mean unless all flows have equivalent throughput.]
- Epsilon-fairness: A rate allocation is defined as epsilon-fair if
(min_i x_i) / (max_i x_i) >= 1 - epsilon
Epsilon-fairness measures the worst-case ratio between any two throughput
rates [ZKL04]. Epsilon-fairness is related to max-min fairness.
- Fairness between Flows with Different Resource Requirements
- Max-min fairness: In order to satisfy the max-min fairness criteria,
the smallest throughput rate must be as large as possible. Given
this condition, the next-smallest throughput rate must be as large as
possible, and so on. Thus, the max-min fairness gives absolute
priority to the smallest flows. (Max-min fairness can be explained
by the progressive filling algorithm, where all flow rates start at
zero, and the rates all grow at the same pace. Each flow rate stops
growing only when one or more links on the path reach link capacity.)
- Proportional fairness: A feasible allocation, x, is
defined as proportionally fair if, for any other feasible allocation
x*, the aggregate of proportional changes is zero or negative:
sum_i ( (x*_i - x_i)/x_i ) <= 0.
"This criterion favours smaller flows, but less emphatically than
max-min fairness" [K01]. (Using the language of utility functions,
proportional fairness can be achieved by using logarithmic utility
functions, and maximizing the sum of the per-flow utility functions;
see [KMT98] for a fuller explanation.)
- Minimum potential delay fairness: Minimum potential delay fairness
has been shown to model TCP [KS03], and is a compromise between
max-min fairness and proportional fairness. An allocation, x, is
defined as having minimum potential delay fairness if:
sum_i (1/x_i)
is smaller than for any other feasible allocation. That is, it would
minimize the average download time if each flow was an equal-sized
file.
- Comments on Fairness
- Trade-offs between fairness and throughput: The fairness measures in
the section above generally measure both fairness and throughput,
giving different weights to each. Potential trade-offs between
fairness and throughput are also discussed by Tang, et al. in
[TWL06], for a framework where max-min fairness is defined as the
most fair. In particular, [TWL06] shows that in some topologies,
throughput is proportional to fairness, while in other topologies,
throughput is inversely proportional to fairness.
- Fairness and the number of congested links: Some of these fairness
metrics are discussed in more detail in [F91]. We note that there is
not a clear consensus for the fairness goals, in particular for
fairness between flows that traverse different numbers of congested
links [F91]. Utility maximization provides one framework for
describing this trade-off in fairness.
- Fairness and round-trip times: One goal cited in a number of new
transport protocols has been that of fairness between flows with
different round-trip times [KHR02] [XHR04]. We note that there is
not a consensus in the networking community about the desirability of
this goal, or about the implications and interactions between this
goal and other metrics [FJ92] (Section 3.3). One common argument
against the goal of fairness between flows with different round-trip
times has been that flows with long round-trip times consume more
resources; this aspect is covered by the previous paragraph.
Researchers have also noted the difference between the RTT-unfairness
of standard TCP, and the greater RTT-unfairness of some proposed
modifications to TCP [LLS05].
- Fairness and packet size: One fairness issue is that of the relative
fairness for flows with different packet sizes. Many file transfer
applications will use the maximum packet size possible; in contrast,
low-bandwidth VoIP flows are likely to send small packets, sending a
new packet every 10 to 40 ms., to limit delay. Should a small-packet
VoIP connection receive the same sending rate in *bytes* per second
as a large-packet TCP connection in the same environment, or should
it receive the same sending rate in *packets* per second? This
fairness issue has been discussed in more detail in [RFC3714], with
[RFC4828] also describing the ways that packet size can affect the
packet drop rate experienced by a flow.
- Convergence times: Convergence times concern the time for convergence
to fairness between an existing flow and a newly starting one, and
are a special concern for environments with high-bandwidth long-delay
flows. Convergence times also concern the time for convergence to
fairness after a sudden change such as a change in the network path,
the competing cross-traffic, or the characteristics of a wireless
link. As with fairness, convergence times can matter both between
flows of the same protocol, and between flows using different
protocols [SLFK03]. One metric used for convergence times is the
delta-fair convergence time, defined as the time taken for two flows
with the same round-trip time to go from shares of 100/101-th and
1/101-th of the link bandwidth, to having close to fair sharing with
shares of (1+delta)/2 and (1-delta)/2 of the link bandwidth [BBFS01].
A similar metric for convergence times measures the convergence time
as the number of round-trip times for two flows to reach epsilon-
fairness, when starting from a maximally-unfair state [ZKL04].
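[N.B. The following self-contained C sketch computes Jain's fairness index,
the geometric mean of the product measure, and the worst-case min/max ratio
used by epsilon-fairness over a set of per-flow throughputs; the helper names
are mine and the program is illustrative only:]

    #include <math.h>
    #include <stdio.h>

    /* Jain's fairness index: (sum x_i)^2 / (n * sum x_i^2), in (0, 1]. */
    static double
    jain_index(const double *x, int n)
    {
        double sum = 0.0, sumsq = 0.0;
        for (int i = 0; i < n; i++) {
            sum += x[i];
            sumsq += x[i] * x[i];
        }
        return (sum * sum) / (n * sumsq);
    }

    /* Geometric mean, i.e. the n-th root of the product measure, which
     * normalizes the product back into bandwidth units. */
    static double
    geometric_mean(const double *x, int n)
    {
        double logsum = 0.0;
        for (int i = 0; i < n; i++)
            logsum += log(x[i]);
        return exp(logsum / n);
    }

    /* Worst-case ratio min_i x_i / max_i x_i; an allocation is epsilon-fair
     * when this ratio is >= 1 - epsilon. */
    static double
    worst_case_ratio(const double *x, int n)
    {
        double min = x[0], max = x[0];
        for (int i = 1; i < n; i++) {
            if (x[i] < min) min = x[i];
            if (x[i] > max) max = x[i];
        }
        return min / max;
    }

    int
    main(void)
    {
        double x[] = { 10.0, 10.0, 10.0, 2.0 };   /* per-flow Mbps */
        int n = sizeof(x) / sizeof(x[0]);

        printf("Jain index:     %.3f\n", jain_index(x, n));
        printf("geometric mean: %.3f\n", geometric_mean(x, n));
        printf("min/max ratio:  %.3f\n", worst_case_ratio(x, n));
        return 0;
    }

For the example allocation above (three equal flows plus one starved flow),
Jain's index is about 0.84 while the min/max ratio is only 0.2, which shows
how much harsher the worst-case metric is on segregation.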
TCP Congestion Control (RFC 5681):
http://www.rfc-editor.org/rfc/rfc5681.txt
Specifies four TCP congestion control algorithms: slow start, congestion
avoidance, fast retransmit, and fast recovery. They were devised
in [Jac88] and [Jac90]. Their use with TCP is standardized in
[RFC1122].
In addition the document specifies what TCP connections should do after
a relatively long idle period, as well as clarifying some of the issues
pertaining to TCP ACK generation.
Obsoletes [RFC2581], which in turn obsoleted [RFC2001].
The slow start and congestion avoidance algorithms MUST be used by the
TCP sender to control the amount of outstanding data being injected into
the network. The algorithms make use of three state variables:
- Congestion Window (cwnd): a sender-side limit on the amount of data
the sender can transmit before receiving an ACK.
- Receiver's Advertised Window (rwnd): a receiver-side limit on the amount
of outstanding data.
- Slow Start Threshold (ssthresh): used to determine whether the slow start
or congestion avoidance algorithm is used to control data transmission.
Slow Start: Used to determine available link capacity at the beginning of a
transfer, after repairing loss detected by the retransmission timer, or
[potentially] after a long idle period. It is additionally used to start the
"ACK clock".
- SMSS: Sender Maximum Segment Size
- IW: Initial Window, the initial value of cwnd, MUST be set using the
following guidelines as an upper bound
If SMSS > 2190 bytes:
IW = 2 * SMSS bytes and MUST NOT be more than 2 segments
If (SMSS > 1095 bytes) and (SMSS <= 2190 bytes):
IW = 3 * SMSS bytes and MUST NOT be more than 3 segments
If SMSS <= 1095 bytes:
IW = 4 * SMSS bytes and MUST NOT be more than 4 segments
- Ssthresh:
- SHOULD be set arbitrarily high (e.g., to the size of the largest
possible advertised window), but ssthresh MUST be reduced in response
to congestion.
- The slow start algorithm is used when cwnd < ssthresh, while the
congestion avoidance algorithm is used when cwnd > ssthresh. When
cwnd and ssthresh are equal, the sender may use either slow start or
congestion avoidance.
- When a TCP sender detects segment loss using the retransmission timer
and the given segment has not yet been resent once by way of the
retransmission timer, the value of ssthresh MUST be set to no more
than the value given in equation (4):
ssthresh = max (FlightSize / 2, 2*SMSS) (4)
where FlightSize is the amount of outstanding data in the network.
- Growing cwnd: During slow start, a TCP increments cwnd by at most SMSS
bytes for each ACK received that cumulatively acknowledges new data.
Slow start ends when cwnd reaches or exceeds ssthresh.
- Traditionally, TCP implementations have increased cwnd by precisely
SMSS bytes upon receipt of an ACK covering new data; RFC 5681 instead
RECOMMENDS that TCP implementations increase cwnd per:
cwnd += min (N, SMSS) (2)
where N is the number of previously unacknowledged bytes acknowledged
in the incoming ACK. [See the sketch after this list.]
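[N.B. A small C sketch of the initial-window guideline and the slow-start
increase of equation (2) above; the variable and function names are invented
for illustration and this is not stack code:]

    #include <stdint.h>
    #include <stdio.h>

    /* Upper bound on IW as a function of SMSS (RFC 5681 guidelines above). */
    static uint32_t
    initial_window(uint32_t smss)
    {
        if (smss > 2190)
            return 2 * smss;    /* and no more than 2 segments */
        else if (smss > 1095)
            return 3 * smss;    /* and no more than 3 segments */
        else
            return 4 * smss;    /* and no more than 4 segments */
    }

    /* Slow-start growth on an ACK of N previously unacknowledged bytes:
     * cwnd += min(N, SMSS), equation (2). */
    static uint32_t
    slow_start_ack(uint32_t cwnd, uint32_t acked, uint32_t smss)
    {
        return cwnd + (acked < smss ? acked : smss);
    }

    int
    main(void)
    {
        uint32_t smss = 1460;
        uint32_t cwnd = initial_window(smss);   /* 3 * 1460 = 4380 bytes */

        printf("IW = %u bytes\n", cwnd);
        /* A delayed ACK covering two full segments still grows cwnd by at
         * most one SMSS under equation (2). */
        cwnd = slow_start_ack(cwnd, 2 * smss, smss);
        printf("cwnd after first ACK = %u bytes\n", cwnd);
        return 0;
    }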
Congestion Avoidance: during congestion avoidance, cwnd is incremented by
roughly 1 full-sized segment per RTT. Congestion avoidance continues until
congestion is detected. The basic guidelines for incrementing cwnd are:
- MAY increment cwnd by SMSS bytes
- SHOULD increment cwnd per equation (2) once per RTT
- MUST NOT increment cwnd by more than SMSS bytes
[RFC3465] allows for cwnd increases of more than SMSS bytes for incoming
acknowledgments during slow start on an experimental basis; however, such
behavior is not allowed as part of the standard.
Another common formula that a TCP MAY use to update cwnd during
congestion avoidance is given in equation (3):
cwnd += SMSS*SMSS/cwnd (3)
This adjustment is executed on every incoming ACK that acknowledges
new data. Equation (3) provides an acceptable approximation to the
underlying principle of increasing cwnd by 1 full-sized segment per
RTT.
Upon a timeout (as specified in [RFC2988]) cwnd MUST be
set to no more than the loss window, LW, which equals 1 full-sized
segment (regardless of the value of IW). Therefore, after
retransmitting the dropped segment the TCP sender uses the slow start
algorithm to increase the window from 1 full-sized segment to the new
value of ssthresh, at which point congestion avoidance again takes over.
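[N.B. A companion C sketch of the congestion-avoidance increase of equation
(3), the ssthresh reduction of equation (4), and the collapse of cwnd to the
loss window after an RTO; again the names are illustrative, not taken from
any stack:]

    #include <stdint.h>
    #include <stdio.h>

    /* Congestion avoidance, per ACK of new data: cwnd += SMSS*SMSS/cwnd,
     * roughly one SMSS per RTT (equation (3)); round up to 1 byte when the
     * formula yields 0. */
    static uint32_t
    cong_avoid_ack(uint32_t cwnd, uint32_t smss)
    {
        uint32_t incr = (smss * smss) / cwnd;
        return cwnd + (incr > 0 ? incr : 1);
    }

    /* ssthresh after loss, equation (4): max(FlightSize/2, 2*SMSS). */
    static uint32_t
    ssthresh_on_loss(uint32_t flightsize, uint32_t smss)
    {
        uint32_t half = flightsize / 2;
        return half > 2 * smss ? half : 2 * smss;
    }

    /* On an RTO, cwnd collapses to the loss window of one full segment. */
    static uint32_t
    cwnd_on_rto(uint32_t smss)
    {
        return smss;
    }

    int
    main(void)
    {
        uint32_t smss = 1460, cwnd = 10 * smss, flight = 10 * smss;

        printf("CA step:   cwnd %u -> %u\n", cwnd, cong_avoid_ack(cwnd, smss));
        printf("ssthresh:  %u\n", ssthresh_on_loss(flight, smss));
        printf("after RTO: cwnd = %u\n", cwnd_on_rto(smss));
        return 0;
    }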
Fast Retransmit/Fast Recovery: A TCP receiver SHOULD send an immediate
duplicate ACK when an out-of-order segment arrives. The purpose of this ACK
is to inform the sender that a segment was received out-of-order and which
sequence number is expected. In addition, a TCP receiver SHOULD send an
immediate ACK when the incoming segment fills in all or part of a gap in the
sequence space. This will generate more timely information for a sender
recovering from a loss through a retransmission timeout, a fast retransmit, or
an advanced loss recovery algorithm.
The TCP sender SHOULD use the "fast retransmit" algorithm to detect and repair
loss, based on incoming duplicate ACKs. The fast retransmit algorithm uses the
arrival of 3 duplicate ACKs as an indication that a segment has been lost.
TCP then performs a retransmission of what appears to be the missing segment,
without waiting for the retransmission timer to expire.
The fast retransmit and fast recovery algorithms are implemented
together as follows.
1. On the first and second duplicate ACKs received at a sender, a
TCP SHOULD send a segment of previously unsent data per [RFC3042]
provided that the receiver's advertised window allows, the total
FlightSize would remain less than or equal to cwnd plus 2*SMSS,
and that new data is available for transmission. Further, the
TCP sender MUST NOT change cwnd to reflect these two segments
[RFC3042]. Note that a sender using SACK [RFC2018] MUST NOT send
new data unless the incoming duplicate acknowledgment contains
new SACK information.
2. When the third duplicate ACK is received, a TCP MUST set ssthresh
to no more than the value given in equation (4). When [RFC3042]
is in use, additional data sent in limited transmit MUST NOT be
included in this calculation.
3. The lost segment starting at SND.UNA MUST be retransmitted and
cwnd set to ssthresh plus 3*SMSS. This artificially "inflates"
the congestion window by the number of segments (three) that have
left the network and which the receiver has buffered.
4. For each additional duplicate ACK received (after the third),
cwnd MUST be incremented by SMSS. This artificially inflates the
congestion window in order to reflect the additional segment that
has left the network.
Note: [SCWA99] discusses a receiver-based attack whereby many
bogus duplicate ACKs are sent to the data sender in order to
artificially inflate cwnd and cause a higher than appropriate
sending rate to be used. A TCP MAY therefore limit the number of
times cwnd is artificially inflated during loss recovery to the
number of outstanding segments (or, an approximation thereof).
Note: When an advanced loss recovery mechanism (such as outlined
in section 4.3) is not in use, this increase in FlightSize can
cause equation (4) to slightly inflate cwnd and ssthresh, as some
of the segments between SND.UNA and SND.NXT are assumed to have
left the network but are still reflected in FlightSize.
5. When previously unsent data is available and the new value of
cwnd and the receiver's advertised window allow, a TCP SHOULD
send 1*SMSS bytes of previously unsent data.
6. When the next ACK arrives that acknowledges previously
unacknowledged data, a TCP MUST set cwnd to ssthresh (the value
set in step 2). This is termed "deflating" the window.
This ACK should be the acknowledgment elicited by the
retransmission from step 3, one RTT after the retransmission
(though it may arrive sooner in the presence of significant out-
of-order delivery of data segments at the receiver).
Additionally, this ACK should acknowledge all the intermediate
segments sent between the lost segment and the receipt of the
third duplicate ACK, if none of these were lost.
Note: This algorithm is known to generally not recover efficiently
from multiple losses in a single flight of packets.
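[N.B. The numbered steps above can be summarized in the following C skeleton
driven by duplicate ACKs; the state layout and the retransmit_una() /
send_new_data() hooks are invented placeholders, and the window-availability
and SACK preconditions of step 1 are elided:]

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct cc_state {
        uint32_t cwnd, ssthresh, smss, flightsize;
        int      dupacks;
        bool     in_recovery;
    };

    /* Stand-ins for the real transmit paths. */
    static void retransmit_una(void) { printf("  retransmit SND.UNA\n"); }
    static void send_new_data(uint32_t n) { printf("  send %u new bytes\n", n); }

    static void
    on_dup_ack(struct cc_state *s)
    {
        s->dupacks++;
        if (s->dupacks < 3) {
            /* Step 1: limited transmit may send one previously unsent
             * segment, without changing cwnd. */
            send_new_data(s->smss);
        } else if (s->dupacks == 3) {
            /* Steps 2-3: ssthresh per equation (4), retransmit the lost
             * segment, inflate cwnd by the three departed segments. */
            uint32_t half = s->flightsize / 2;
            s->ssthresh = half > 2 * s->smss ? half : 2 * s->smss;
            retransmit_una();
            s->cwnd = s->ssthresh + 3 * s->smss;
            s->in_recovery = true;
        } else {
            /* Steps 4-5: each further duplicate ACK means another segment
             * has left the network. */
            s->cwnd += s->smss;
            send_new_data(s->smss);
        }
    }

    /* Step 6: an ACK of new data ends recovery and deflates the window. */
    static void
    on_new_ack(struct cc_state *s)
    {
        if (s->in_recovery) {
            s->cwnd = s->ssthresh;
            s->in_recovery = false;
        }
        s->dupacks = 0;
    }

    int
    main(void)
    {
        struct cc_state s = { .cwnd = 14600, .smss = 1460, .flightsize = 14600 };

        for (int i = 0; i < 5; i++) {
            printf("dup ACK %d (cwnd=%u)\n", i + 1, s.cwnd);
            on_dup_ack(&s);
        }
        on_new_ack(&s);
        printf("recovery done: cwnd=%u ssthresh=%u\n", s.cwnd, s.ssthresh);
        return 0;
    }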
RTO:
https://tools.ietf.org/html/rfc6298
Does not modify the behaviour in RFC 5681.
The RTO is a function of two state variables, SRTT and RTTVAR. The
following constants are used for calculations:
G <- clock granularity in seconds
K <- 4
[(2.1)] Until a round-trip time (RTT) measurement has been made for a segment
sent between the sender and the receiver, the sender SHOULD set RTO <- 1 second,
[i.e. not the outdated 3s currently in FreeBSD] - the "backing off" on repeated
retransmission still applies.
[(2.2)] When the first RTT measurement R is made, the host MUST set
SRTT <- R
RTTVAR <- R/2
RTO <- SRTT + max (G, K*RTTVAR)
[(2.3)] When a subsequent RTT measurement R' is made, a host MUST set
RTTVAR <- (1 - beta)*RTTVAR + beta * |SRTT - R'|
SRTT <- (1 - alpha)*SRTT + alpha*R'
The value of SRTT used in updating RTTVAR is the one prior to the update
in the second assignment - i.e. RTTVAR is updated first, then SRTT. [A C
sketch of this computation follows the notes below.]
The above calculation SHOULD be done with alpha=1/8 and beta=1/4 (as
suggested in [JK88]). [N.B. Should these values be smaller in the data
center so that the SRTT maintains a longer memory and isn't compromised
by a transient microburst?].
[(2.4)] Whenever RTO is computed, if it is less than 1 second, then the
RTO SHOULD be rounded up to 1 second. [See the incast section
for why this is unequivocally wrong in the data center]
Traditionally, TCP implementations use coarse grain clocks to
measure the RTT and trigger the RTO, which imposes a large
minimum value on the RTO. Research suggests that a large
minimum RTO is needed to keep TCP conservative and avoid
spurious retransmissions [AP99]. Therefore, this specification
requires a large minimum RTO as a conservative approach, while
at the same time acknowledging that at some future point,
research may show that a smaller minimum RTO is acceptable or
superior. [Vasudevan09 (incast section) clearly shows this to
be the case.]
Note that a TCP implementation MAY clear SRTT and RTTVAR after
backing off the timer multiple times as it is likely that the current
SRTT and RTTVAR are bogus in this situation. Once SRTT and RTTVAR
are cleared, they should be initialized with the next RTT sample
taken per (2.2) rather than using (2.3).
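[N.B. A self-contained C sketch of the (2.2)-(2.4) computation with
alpha = 1/8, beta = 1/4, and K = 4; the minimum RTO is a parameter here
rather than the hard-coded 1-second floor, to match the data-center
discussion, and the state/function names are illustrative:]

    #include <math.h>
    #include <stdio.h>

    struct rto_state {
        double srtt;        /* smoothed RTT, seconds          */
        double rttvar;      /* RTT variation, seconds         */
        double rto;         /* current retransmission timeout */
        int    have_sample; /* 0 until the first RTT sample   */
    };

    static void
    rto_sample(struct rto_state *s, double r, double g, double min_rto)
    {
        const double alpha = 1.0 / 8.0, beta = 1.0 / 4.0, k = 4.0;

        if (!s->have_sample) {
            s->srtt = r;                                         /* (2.2) */
            s->rttvar = r / 2.0;
            s->have_sample = 1;
        } else {
            /* (2.3): update RTTVAR using the old SRTT, then SRTT. */
            s->rttvar = (1.0 - beta) * s->rttvar + beta * fabs(s->srtt - r);
            s->srtt = (1.0 - alpha) * s->srtt + alpha * r;
        }
        s->rto = s->srtt + fmax(g, k * s->rttvar);
        if (s->rto < min_rto)           /* (2.4), with the floor tunable */
            s->rto = min_rto;
    }

    int
    main(void)
    {
        struct rto_state s = { 0 };
        double samples[] = { 0.100, 0.102, 0.250, 0.101 };  /* seconds */

        for (int i = 0; i < 4; i++) {
            rto_sample(&s, samples[i], 0.001, 0.2);   /* G=1ms, minRTO=200ms */
            printf("R=%.3f SRTT=%.4f RTTVAR=%.4f RTO=%.4f\n",
                samples[i], s.srtt, s.rttvar, s.rto);
        }
        return 0;
    }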
[(7)] Changes from RFC 2988
This document reduces the initial RTO from the previous 3 seconds
[PA00] to 1 second, unless the SYN or the ACK of the SYN is lost, in
which case the default RTO is reverted to 3 seconds before data
transmission begins.
Increasing TCP's initial window:
http://www.rfc-editor.org/rfc/rfc3390.txt
http://www.rfc-editor.org/rfc/rfc6928.txt
Proposes an experiment to increase the permitted TCP
initial window (IW) from between 2 and 4 segments, as specified in
RFC 3390, to 10 segments with a fallback to the existing
recommendation when performance issues are detected. It discusses
the motivation behind the increase, the advantages and disadvantages
of the higher initial window, and presents results from several
large-scale experiments showing that the higher initial window
improves the overall performance of many web services without
resulting in a congestion collapse.
TCP Modification:
- The upper bound for the initial window will be:
min (10*MSS, max (2*MSS, 14600)) [see the sketch after this list]
- This change applies to the initial window of the connection in the
first round-trip time (RTT) of data transmission during or following
the TCP three-way handshake.
- all the test results described in this document were based
on the regular Ethernet MTU of 1500 bytes. Future study of the
effect of a different MTU may be needed to fully validate (1) above.
- [In contrast to RFC 3390 and RFC 5681] The proposed change to reduce the
default retransmission timeout (RTO) to 1 second [RFC6298] increases the
chance for spurious SYN or SYN/ACK retransmission, thus unnecessarily
penalizing connections with RTT > 1 second if their initial window is
reduced to 1 segment. For this reason, it is RECOMMENDED that
implementations refrain from resetting the initial window to 1 segment,
unless there have been more than one SYN or SYN/ACK retransmissions or
true loss detection has been made.
- TCP implementations use slow start in as many as three different
ways: (1) to start a new connection (the initial window); (2) to
restart transmission after a long idle period (the restart window);
and (3) to restart transmission after a retransmit timeout (the loss
window). The change specified in this document affects the value of
the initial window. Optionally, a TCP MAY set the restart window to
the minimum of the value used for the initial window and the current
value of cwnd (in other words, using a larger value for the restart
window should never increase the size of cwnd). These changes do NOT
change the loss window, which must remain 1 segment of MSS bytes (to
permit the lowest possible window size in the case of severe congestion).
- To limit any negative effect that a larger initial
window may have on links with limited bandwidth or buffer space,
implementations SHOULD fall back to RFC 3390 for the restart window
(RW) if any packet loss is detected during either the initial window
or a restart window, and more than 4 KB of data is sent.
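The sketch referenced above: a literal rendering of the initial-window upper bound, purely for
illustration.

    #include <stdint.h>

    /* RFC 6928 initial window upper bound, in bytes. */
    static uint32_t
    initial_window(uint32_t mss)
    {
        uint32_t ten = 10 * mss;
        uint32_t lo_bound = (2 * mss > 14600) ? 2 * mss : 14600;   /* max(2*MSS, 14600) */
        return (ten < lo_bound) ? ten : lo_bound;                  /* min(10*MSS, ...)  */
    }

With the usual 1460-byte MSS this comes out to 14600 bytes, i.e. ten full-sized segments.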
4. Background
- According to the latest report from Akamai [AKAM10],
the global broadband (> 2 Mbps) adoption has surpassed 50%,
propelling the average connection speed to reach 1.7 Mbps, while the
narrowband (< 256 Kbps) usage has dropped to 5%. In contrast, TCP's
initial window has remained 4 KB for a decade [RFC2414],
corresponding to a bandwidth utilization of less than 200 Kbps per
connection, assuming an RTT of 200 ms.
- A large proportion of flows on the Internet are short web
transactions over TCP and complete before exiting TCP slow start.
- applications have responded to TCP's "slow" start.
Web sites use multiple subdomains [Bel10] to circumvent HTTP 1.1
regulation on two connections per physical host [RFC2616]. As of
today, major web browsers open multiple connections to the same site
(up to six connections per domain [Ste08] and the number is growing).
This trend is to remedy HTTP serialized download to achieve
parallelism and higher performance. But it also implies that today
most access links are severely under-utilized, hence having multiple
TCP connections improves performance most of the time.
- persistent connections and pipelining are designed to
address some of the above issues with HTTP [RFC2616]. Their presence
does not diminish the need for a larger initial window, e.g., data
from the Chrome browser shows that 35% of HTTP requests are made on
new TCP connections. Our test data also shows significant latency
reduction with the large initial window even in conjunction with
these two HTTP features [Duk10].
5. Advantages of Larger Initial Windows
- Reducing Latency
An increase of the initial window from 3 segments to 10 segments
reduces the total transfer time for data sets greater than 4 KB by up
to 4 round trips.
The table below compares the number of round trips between IW=3 and
IW=10 for different transfer sizes, assuming infinite bandwidth, no
packet loss, and the standard delayed ACKs with large delayed-ACK
timer.
-----------------------------------
| total segments |  IW=3 |  IW=10 |
-----------------------------------
|        3       |   1   |    1   |
|        6       |   2   |    1   |
|       10       |   3   |    1   |
|       12       |   3   |    2   |
|       21       |   4   |    2   |
|       25       |   5   |    2   |
|       33       |   5   |    3   |
|       46       |   6   |    3   |
|       51       |   6   |    4   |
|       78       |   7   |    4   |
|       79       |   8   |    4   |
|      120       |   8   |    5   |
|      127       |   9   |    5   |
-----------------------------------
For example, with the larger initial window, a transfer of 32
segments of data will require only 2 rather than 5 round trips to
complete. [A rough sketch of this round-trip arithmetic follows the
next item.]
- Recovering Faster from Loss on Under-Utilized or Wireless Links
A greater-than-3-segment initial window increases the chance to
recover packet loss through Fast Retransmit rather than the lengthy
initial RTO [RFC5681]. This is because the fast retransmit algorithm
requires three duplicate ACKs as an indication that a segment has
been lost rather than reordered. While newer loss recovery
techniques such as Limited Transmit [RFC3042] and Early Retransmit
[RFC5827] have been proposed to help speeding up loss recovery from a
smaller window, both algorithms can still benefit from the larger
initial window because of a better chance to receive more ACKs.
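The round-trip sketch referenced above: a rough slow-start round counter, assuming one ACK per two
segments and one MSS of cwnd growth per ACK. It reproduces most of the table, though boundary cases
can differ by a round trip from the RFC's figures depending on exact delayed-ACK timing.

    /* Rough slow-start round counter; exact values depend on delayed-ACK details. */
    static int
    slow_start_rtts(int total_segments, int iw)
    {
        int cwnd = iw, sent = 0, rtts = 0;

        while (sent < total_segments) {
            sent += cwnd;
            rtts++;
            cwnd += cwnd / 2;   /* one ACK per two segments, +1 segment per ACK */
        }
        return rtts;
    }

For example, slow_start_rtts(127, 3) gives 9 and slow_start_rtts(127, 10) gives 5, matching the
last row of the table above.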
8. Mitigation of Negative Impact
Much of the negative impact from an increase in the initial window is
likely to be felt by users behind slow links with limited buffers.
The negative impact can be mitigated by hosts directly connected to a
low-speed link advertising an initial receive window smaller than 10
segments. This can be achieved either through manual configuration
by the users or through the host stack auto-detecting the low-
bandwidth links.
Additional suggestions to improve the end-to-end performance of slow
links can be found in RFC 3150 [RFC3150].
RTO & High Performance:
https://tools.ietf.org/html/rfc7323
Updates the venerable RFC 1323.
[Also in RFC 1323]
An additional mechanism could be added to the TCP, a per-host
cache of the last timestamp received from any connection. This
value could then be used in the PAWS mechanism to reject old
duplicate segments from earlier incarnations of the connection,
if the timestamp clock can be guaranteed to have ticked at least
once since the old connection was open. This would require that
the TIME-WAIT delay plus the RTT together must be at least one
tick of the sender's timestamp clock. Such an extension is not
part of the proposal of this RFC.
Appendix G. RTO Calculation Modification
Taking multiple RTT samples per window would shorten the history
calculated by the RTO mechanism in [RFC6298], and the below algorithm
aims to maintain a similar history as originally intended by
[RFC6298].
It is roughly known how many samples a congestion window worth of
data will yield, not accounting for ACK compression, and ACK losses.
Such events will result in more history of the path being reflected
in the final value for RTO, and are uncritical. This modification
will ensure that a similar amount of time is taken into account for
the RTO estimation, regardless of how many samples are taken per
window:
ExpectedSamples = ceiling(FlightSize / (SMSS * 2))
alpha' = alpha / ExpectedSamples
beta' = beta / ExpectedSamples
Note that the factor 2 in ExpectedSamples is due to "Delayed ACKs".
Instead of using alpha and beta in the algorithm of [RFC6298], use
alpha' and beta' instead:
RTTVAR <- (1 - beta') * RTTVAR + beta' * |SRTT - R'|
SRTT <- (1 - alpha') * SRTT + alpha' * R'
(for each sample R')
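A small sketch of that scaling, using floating point for brevity (illustrative only):

    #include <math.h>

    struct rtt_est {
        double srtt, rttvar;
    };

    static void
    rtt_sample_scaled(struct rtt_est *e, double r, double flightsize, double smss)
    {
        double expected = ceil(flightsize / (smss * 2.0));  /* the 2 accounts for delayed ACKs */
        if (expected < 1.0)
            expected = 1.0;

        double alpha = (1.0 / 8.0) / expected;              /* alpha' */
        double beta  = (1.0 / 4.0) / expected;              /* beta'  */

        e->rttvar = (1.0 - beta) * e->rttvar + beta * fabs(e->srtt - r);
        e->srtt   = (1.0 - alpha) * e->srtt + alpha * r;
    }
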
Appendix H. Changes from RFC 1323
Several important updates and clarifications to the specification in
RFC 1323 are made in this document. The [important] technical changes are
summarized below:
(d) The description of which TSecr values can be used to update the
measured RTT has been clarified. Specifically, with timestamps,
the Karn algorithm [Karn87] is disabled. The Karn algorithm
disables all RTT measurements during retransmission, since it is
ambiguous whether the <ACK> is for the original segment, or the
retransmitted segment. With timestamps, that ambiguity is
removed since the TSecr in the <ACK> will contain the TSval from
whichever data segment made it to the destination.
(e) RTTM update processing explicitly excludes segments not updating
SND.UNA. The original text could be interpreted to allow taking
RTT samples when SACK acknowledges some new, non-continuous
data.
(f) In RFC 1323, Section 3.4, step (2) of the algorithm to control
which timestamp is echoed was incorrect in two regards:
(1) It failed to update TS.Recent for a retransmitted segment
that resulted from a lost <ACK>.
(2) It failed if SEG.LEN = 0.
In the new algorithm, the case of SEG.TSval >= TS.Recent is
included for consistency with the PAWS test.
(g) It is now recommended that the Timestamps option is included in
<RST> segments if the incoming segment contained a Timestamps
option.
(h) <RST> segments are explicitly excluded from PAWS processing.
(j) Snd.TSoffset and Snd.TSclock variables have been added.
Snd.TSclock is the sum of my.TSclock and Snd.TSoffset. This
allows the starting points for timestamp values to be randomized
on a per-connection basis. Setting Snd.TSoffset to zero yields
the same results as [RFC1323]. Text was added to guide
implementers to the proper selection of these offsets, as
entirely random offsets for each new connection will conflict
with PAWS.
Congestion Window Validation (CWV):
http://www.ietf.org/proceedings/69/slides/tcpm-7.pdf
https://tools.ietf.org/html/rfc7661
Provides a mechanism to address issues that arise when
TCP is used for traffic that exhibits periods where the sending rate
is limited by the application rather than the congestion window. This
RFC provides an experimental update to TCP that allows a TCP sender to
restart quickly following a rate-limited interval. This method is
expected to benefit applications that send rate-limited traffic using
TCP while also providing an appropriate response if congestion is
experienced.
Motivation:
Standard TCP states that a TCP sender SHOULD set cwnd to no more than
the Restart Window (RW) before beginning transmission if the TCP
sender has not sent data in an interval exceeding the retransmission
timeout, i.e., when an application becomes idle [RFC5681]. [RFC2861]
notes that this TCP behaviour was not always observed in current
implementations. Experiments confirm this to still be the case (see
[Bis08]).
Congestion Window Validation (CWV) [RFC2861] introduced the term
"application-limited period" for the time when the sender sends less
than is allowed by the congestion or receiver windows.
Standard TCP does not impose additional restrictions on the growth of
the congestion window when a TCP sender is unable to send at the
maximum rate allowed by the cwnd. In this case, the rate-limited
sender may grow a cwnd far beyond that corresponding to the current
transmit rate, resulting in a value that does not reflect current
information about the state of the network path the flow is using.
Use of such an invalid cwnd may result in reduced application
performance and/or could significantly contribute to network
congestion.
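For reference, the standard idle-restart rule from RFC 5681 that the paragraphs above refer to
could be sketched as follows (names illustrative; RFC 7661 replaces this with a gentler treatment
of the non-validated cwnd):

    #include <stdint.h>

    /* Clamp cwnd to the restart window after an idle period longer than the RTO. */
    static void
    restart_after_idle(uint32_t *cwnd, uint32_t iw, uint64_t idle_us, uint64_t rto_us)
    {
        if (idle_us > rto_us) {
            uint32_t rw = (iw < *cwnd) ? iw : *cwnd;    /* RW = min(IW, cwnd) */
            *cwnd = rw;
        }
    }
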
Active Queue Management (AQM):
Active Queue Management is an effort to avoid the latency increases (and increase in time in the
feedback loop) and bursty losses caused by naive tail drop in intermediate buffering. The concept
was introduced along with a discussion of the queue management algorithm "RED" (Random Early
Detection/Drop) by RFC 2309. The most current RFC is 7567.
The usual mix of long high throughput and short low latency flows place conflicting demands on
the queue occupancy of a switch:
o The queue must be short enough that it does not impose excessive
latency on short flows.
o The queue must be long enough to buffer sufficient data for the
long flows to saturate the path capacity.
o The queue must be short enough to absorb incast bursts without
excessive packet loss.
RED:
The RED algorithm itself consists of two main parts: estimation of
the average queue size and the decision of whether or not to drop an
incoming packet.
(a) Estimation of Average Queue Size
RED estimates the average queue size, either in the forwarding
path using a simple exponentially weighted moving average (such
as presented in Appendix A of [Jacobson88]), or in the
background (i.e., not in the forwarding path) using a similar
mechanism.
(b) Packet Drop Decision
In the second portion of the algorithm, RED decides whether or
not to drop an incoming packet. It is RED's particular
algorithm for dropping that results in performance improvement
for responsive flows. Two RED parameters, minth (minimum
threshold) and maxth (maximum threshold), figure prominently in
this decision process. Minth specifies the average queue size
*below which* no packets will be dropped, while maxth specifies
the average queue size *above which* all packets will be
dropped. As the average queue size varies from minth to maxth,
packets will be dropped with a probability that varies linearly
from 0 to maxp.
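A compact sketch of the two parts (classic RED; real implementations also scale the probability by
the count of packets since the last drop):

    #include <stdbool.h>
    #include <stdlib.h>

    struct red {
        double avg;                 /* EWMA of the queue size */
        double wq;                  /* EWMA weight, e.g. 0.002 */
        double minth, maxth, maxp;
    };

    static bool
    red_should_drop(struct red *r, double qlen)
    {
        /* (a) estimate the average queue size */
        r->avg = (1.0 - r->wq) * r->avg + r->wq * qlen;

        /* (b) drop (or ECN-mark) with probability rising linearly from 0 to maxp */
        if (r->avg < r->minth)
            return false;
        if (r->avg >= r->maxth)
            return true;
        double p = r->maxp * (r->avg - r->minth) / (r->maxth - r->minth);
        return ((double)rand() / RAND_MAX) < p;
    }
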
Recommendations on Queue Management and Congestion Avoidance
in the Internet
https://tools.ietf.org/html/rfc2309
IETF Recommendations Regarding Active Queue Management
https://tools.ietf.org/html/rfc7567
https://en.wikipedia.org/wiki/Active_queue_management
Explicit Congestion Notification (ECN):
At its core ECN in TCP allows compliant routers to provide compliant senders with notification
of "virtual drops" as a congestion indicator, prompting the sender to halve its congestion window.
The sender does not have to wait for a retransmit timeout or duplicate ACKs to learn of a congestion
event, and the receiver avoids the latency induced by drop/retransmit. ECN relies on some form of
AQM in the intermediate routers/switches to decide when to mark the CE (Congestion Experienced)
bit in the IP header; it is then the receiver's responsibility to set the ECE (ECN-Echo) flag in
the TCP header of the subsequent ACK. The receiver will continue to send packets marked with
the ECE bit until it receives a packet with the CWR (Congestion Window Reduced) bit set. Note
that although this last design decision makes ECN robust in the presence of ACK loss (the
original ECN specification requires that ACKs / SYNs / SYN-ACKs not be marked as ECN capable and
thus they are not eligible for marking), it limits ECN to signalling at most one congestion event
per RTT. As we'll see later this leads to interoperability issues with DCTCP.
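A minimal sketch of the receiver-side behaviour just described (illustrative only; the names are
made up, this is not any stack's actual code):

    #include <stdbool.h>

    struct ecn_rx {
        bool ece_pending;   /* keep setting ECE on outgoing ACKs until CWR arrives */
    };

    static void
    ecn_rx_segment(struct ecn_rx *e, bool ip_ce, bool tcp_cwr)
    {
        if (ip_ce)
            e->ece_pending = true;      /* a router marked Congestion Experienced */
        if (tcp_cwr)
            e->ece_pending = false;     /* sender says it has reduced its window */
    }

    static bool
    ecn_rx_ece_flag(const struct ecn_rx *e)
    {
        return e->ece_pending;          /* ECE flag to place in the next ACK */
    }
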
ECN is negotiated at connection time. In FreeBSD it is configured by a sysctl defaulting to off
for all connections. Enabling the sysctl enables it for all connections. The last time a survey
was done, 2.7% of the internet would not respond to a SYN negotiating ECN. This isn't fatal as
subsequent SYNs will switch to not requesting ECN. This just adds the default RTO to connection
establishment (3s in FreeBSD, 1s per RFC6298 - discussed later).
Linux has some very common sense configurability improvements. Its ECN knob (net.ipv4.tcp_ecn)
takes on _3_ values: 0) never request or accept; 1) request on outgoing connections and accept on
incoming; 2) accept on incoming connections but do not request. The default is (2), supporting ECN
for those adventurous enough to request it. The route command can additionally specify ECN by
subnet, in effect allowing servers / clients to only use it within a data center or between
compliant data centers.
ECN sees very little usage due to continued compatibility concerns. Although the difficulty of
correctly tuning maxth and minth in RED and many other AQM mechanisms is not specific to ECN,
RED et al. are necessary to use ECN and thus further add to the difficulty of deploying it.
Talks:
More Accurate ECN Feedback in TCP (AccECN)
- https://www.ietf.org/proceedings/90/slides/slides-90-tcpm-10.pdf
ECN is slow and does not report the extent of congestion, just its existence. It lacks
interoperability with DCTCP. Need to add a mechanism for negotiating finer-grained,
adaptive congestion notification.
RFCS:
A Proposal to add Explicit Congestion Notification (ECN) to IP
- https://tools.ietf.org/html/rfc2481
Initial proposal.
The Addition of Explicit Congestion Notification (ECN) to IP
- https://tools.ietf.org/html/rfc3168
Elaboration and further specification of how to tie it in to TCP.
Adding Explicit Congestion Notification (ECN) Capability to TCP's SYN/ACK Packets
- https://tools.ietf.org/html/rfc5562
Sometimes referred to as ECN+. This extends ECN to SYN/ACK packets. Note that SYN
packets are still not covered, being considered a potential security hole.
Accurate ECN (AccECN)
Problem Statement and Requirements for Increased Accuracy
in Explicit Congestion Notification (ECN) Feedback
- https://tools.ietf.org/html/rfc7560
"A primary motivation
for this document is to intervene before each proprietary
implementation invents its own non-interoperable handshake, which
could lead to _de facto_ consumption of the few flags or codepoints
that remain available for standardizing capability negotiation."
Incast:
The term was coined in [PANFS] for the case of increasing the number of
simultaneously initiated, effectively barrier-synchronized, fan-in flows
into a single port to the point where the instantaneous switch / NIC buffering
capacity is exceeded, causing a decline in aggregate bandwidth as the need
for retransmits increases. This is further exacerbated by tail-drop behavior in
the switch, whereby multiple losses within individual streams exceed the
recovery abilities of duplicate ACKs or SACK, leading to RTOs before the flow is
resumed.
The Panasas ActiveScale Storage Cluster - Delivering Scalable
High Bandwidth Storage [PANFS]
- http://acm.supercomputing.org/sc2004/schedule/pdfs/pap207.pdf
Focuses on the Object-based Storage Device (OSD) component backing the PanFS
distributed file system. PanFS runs on the client, backend storage consists of
networked block devices (OSD). The intelligence consists in how stripes are laid
out across OSD. PanFS relies on a Metadata Server (MDS) to control the interaction
of clients with the objects on OSDs and maintain cache coherency.
Scalable bandwidth is achieved through aggregation by striping data across many
OSDs. Although in principle it would be desirable to stripe files as widely as
possible, in practice, in their 1Gbps testbed (this is 2004), bandwidth scaled
linearly from 3 to 7 OSDs, but beyond 14 OSDs aggregate bandwidth actually
decreased. With a 10ms disk access latency, if just one OSD experienced enough
packet loss to result in one 200ms RTO the system would suffer a 10x decrease in
performance.
Changes to address the incast problem:
- Reduce the minRTO from 200ms to 50ms.
- Tuning the _individual_ socket buffer size. While a client must have a large
aggregate receive buffer size, each individual stream's receive buffer should
be relatively small. Thus they reduced the clients' (per OSD) receive socket
buffer to under 64K.
- To reduce the size of a single synchronized incast response PanFS implements
a two-level striping pattern. The first level is optimized for RAID's parity
update performance and read overhead. The second level of striping is designed
to resist incast-induced bandwidth penalties by stacking successive parity
stripes in the same subset of objects. They call N sequential parity stripes
stacked in the same set of objects a 'visit', because a client repeatedly
fetches data from just a few OSDs (whose number is controlled by the parity
stripe width) for a while, then moves on to the next set
of OSDs. This striping pattern minimizes simultaneous fan-in and thus the
potential for incast. Typically PanFS stripes about 1GB of data per visit,
using a round-robin layout algorithm of visits across all OSDs.
Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems
- https://www.usenix.org/legacy/event/fast08/tech/full_papers/phanishayee/phanishayee_html/
Attempts to do a more general analysis of incast than [PANFS]. Analysis is based on
the model of a cluster-based storage system with data blocks striped over a number of
servers. They refer to a single block fragmented over multiple servers as a Server
Request Unit (SRU). A subsequent block request will only be made after the client
has received all the data for the current block. They refer to such reads as
'synchronized reads'. The paper makes three contributions to the literature:
- Explores the root causes of incast, characterizing it under a variety of
conditions (buffer space, varying number of servers, etc.). Buffer space can
delay the onset of Incast, but any particular switch configuration will have
some maximum number of servers that can send simultaneously before
throughput collapse occurs.
- Reproduce incast collapse on 3 different models of switches. In some cases
disabling QoS can help delay incast by freeing up packet buffers for
general switching.
- Demonstrate applicability of simulation by showing that the throughput
collapse curve produced by ns-2 with a simulated 32KB buffer closely
matches that shown by the HP Procurve 2848 with QoS disabled.
- Analysis of TCP traces obtained from simulation reveals that TCP
retransmission timeouts are the primary cause of incast.
- Displays the effect of varying the switch buffer size. Doubling the size
of the switch's output port buffer doubles the number of servers that can
be supported before the system experiences incast.
- TCP performs well in settings without synchronized reads, which can
be modelled by an infinite SRU size. Running netperf across many servers
does not induce incast. With larger SRU sizes servers can use the spare
link capacity made available by any stalled flow waiting for a timeout
event.
- Examines the effectiveness of existing TCP variants (e.g. Reno, NewReno,
SACK, and limited transmit). Although the move from Reno to NewReno
improves performance, none of the additional improvements help. When TCP
loses all packets in its window or loses retransmissions, no clever loss
recovery algorithms can help.
- Examine a set of techniques that are moderately effective in masking Incast,
such as drastically reducing TCP's retransmission timeout timer. None of
these techniques are without drawbacks.
- reducing RTOmin from 200ms to 200us improves throughput by an order of
magnitude for 8-32 servers. However, at the time of the paper Linux and
BSD TCP implementations were unable to provide a timer of sufficient
granularity to calculate RTTs at finer than the system clock tick.
Understanding TCP Incast Throughput Collapse in Datacenter Networks
- http://conferences.sigcomm.org/sigcomm/2009/workshops/wren/papers/p73.pdf
Proposes an analytical model of limited generality based on the results
observed in two test beds.
- Observed little benefit from disabling delayed acks
- Observed a much shallower decline in throughput after 4 servers with 1ms
minRTO vs 200ms minRTO. No benefit was shown for 200us over 1ms. [The
next paper concludes that this was because the calculated RTO never went
below 5ms, so a 200us minRTO was equivalent to disabling minRTO in this
setting].
- For large RTO timer values, reducing the RTO timer value is a first-order
mitigation. For smaller RTO timer values, intelligently controlling the
inter-packet wait time [pacing] becomes crucial.
- Observes two regions of throughput increase. Following the initial
throughput decline there is an increasing region. They reason that: As
the number of senders increase, 'T' increases, and there is less
overlap in the RTO periods for different senders. This means
the impact of RTO events is less severe - a mitigating effect.
(Prob(enter RTO at t) = { 1/T : d < t < d + T, 0: otherwise} - d is the
delay for congestion info to propagate back to the sender and T is the
width of the uniform distribution in time.)
- The smaller the RTO timer values, the faster the rate of recovery between
the throughput minimum and the second order throughput maximum. For smaller
RTO timer values, the same increase in 'T' will have a larger mitigating
effect. Hence, as the number of senders increases, the same increase in 'T'
will result in a faster increase in the goodput for smaller RTO timer
values.
- After the second order goodput maximum, the slope of throughput decrease is the
same for different RTO timer values. When 'T' becomes comparable or larger than
the RTO timer value, the amount of interference between retransmits after RTO
and transmissions before RTO no longer depends on the value of the RTO timer.
Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication
- https://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/ekrevat/docs/SIGCOMMIncast.pdf
Effectively makes the case for using high resolution timers to enable microsecond
granularity TCP timeouts. They claim that they demonstrate that this technique is
effective in avoiding TCP incast collapse in both simulation and real-world
experiments.
- Prototype uses Linux's high resolution kernel timers.
- Demonstrate that this change prevents incast collapse in practice for up
to 47 senders.
- Demonstrate that simply reducing RTOmin in today's [2009] TCP
implementations without also improving the timing granularity does not
prevent TCP incast.
- Even without incast patterns, the RTO can determine observed performance.
Simple example: They started ten bulk-data transfer TCP flows from ten
clients to one server. They then had another client issue small
request packets for 1KB of data from the server, waiting for
the response before sending the next request. Approximately
1% of these requests experienced a TCP timeout, delaying
the response by at least 200ms. Finer-grained re-transmission handling
can improve the performance of latency sensitive applications.
Evaluating Throughput with Fine-Grained RTO:
- to be maximally effective timers must operate on a granularity close to
the RTT of the network.
- Jacobson RTO Estimation:
- The standard RTO estimator [Jacobson88] tracks a smoothed
estimate of the round-trip time, and sets the timeout to this RTT
estimate plus 4 times the mean deviation (a simpler calculation
than the standard deviation; given a normal distribution of
prediction errors, mdev = sqrt(2/pi)*sdev).
- RTO = SRTT + (4xRTTMDEV)
- Two factors set lower bounds on the value that the RTO can achieve:
- the explicit configuration parameter RTOmin
- the implicit effects of the granularity with which the RTT is
measured and with which the kernel sets and checks timers.
Most implementations track RTTs and timers at a granularity
of 1ms or larger. Thus the minimum achievable RTO is 5ms.
- In Simulation (simulate one client with multiple servers connected
through a single switch with an unloaded RTT of 100us, each node has
a 1Gbps link, the switch buffers have 32KB of space per output port,
and a random timer scheduling delay of up to 20us to account for
real-world variance):
- With an RTOmin of 200ms throughput drops by an order of magnitude
with 8 concurrent senders.
- Reducing RTOmin to 1ms is effective for 8-16 concurrent senders,
fully utilizing the client's link. However, throughput declines
as the number of servers is increased. 128 concurrent senders
use only 50% of the available link bandwidth even with a 1ms
RTOmin.
- In Real Clusters (sixteen node cluster w/ HP Procurve 2848 &
48 node cluster w/ Force10 S50 switch - all nodes 1Gbps and a
client to server RTT of ~100us):
- Modified the Linux 2.6.28 kernel to use 'microsecond-accurate'
timers with microsecond granularity RTT estimation.
- For all configurations, throughput drops with increasing RTOmin
above 1ms. For 8 and 16 concurrent senders, the default RTOmin
of 200ms results in nearly 2 orders of magnitude drop in throughput.
- Results show identical performance for RTOmin values of 200us and
1 ms. Although the baseline RTTs can be between 50-100us, increased
congestion causes RTTs to rise to 400us on average with spikes as
high as 850us. Thus the higher RTTs combined with increased RTT
variance causes the RTO estimator to set timeouts of 1-3ms and an
RTOmin below 1ms will not lead to shorter retransmission times.
In effect, specifying an RTOmin <= 1ms is equivalent to eliminating
RTOmin.
Next-Generation Datacenters:
- 10Gbps networks have smaller RTTs than 1Gbps - port-to-port latency
can be as low as 10us. In a sampling of an active storage node at
LANL 20% of RTTs are below 100us even when accounting for kernel
scheduling.
- smaller RTO values are required to avoid idle link time.
- Scaling to Thousands [simulating large numbers of servers on a 10Gbps network]
(reduce baseline RTTs from 100us to 20us, eliminate 20us timer scheduling
variance, increase link capacity to 10Gbps, set per-port buffer size to 32KB,
increase blocksize to 80MB to ensure each flow can saturate a 10Gbps link,
vary the number of servers from 32 to 2048):
- Having an artificial bound of either 1ms or 200us results in low throughput
in a network whose RTTs are 20us - underscoring the requirement that
retransmission timeouts should be on the same timescale as network latency
to avoid incast collapse.
- Eliminating a lower bound on RTO performs well for up to 512 concurrent
senders. For 1024 servers and beyond, even the aggressively low RTO
configuration sees up to a 50% reduction in throughput resulting from
significant periods of link idle time caused by repeated, simultaneous,
successive timeouts.
- For incast communication the standard exponential backoff increase of
RTO can overshoot some portion of the time the link is actually idle.
Because only one flow must overshoot to delay the entire transfer,
the probability of overshooting increases with increased number of
flows.
- Decreased throughput for a large number of flows can be attributed to
many flows timing out simultaneously, backing off deterministically,
and retransmitting at the same time. While some flows are successful
on this retransmission, a majority of flows lose their retransmitted
packet and backoff by another factor of two, sometimes far beyond
when the link becomes idle.
- Desynchronizing Retransmissions
- Adding some randomness to the RTO will desynchronize retransmissions.
- Adding an adaptive randomized RTO to the scheduled timeout:
timeout = (RTO + (rand(0.5) x RTO)) x 2^backoff
performs well regardless of the number of concurrent senders [a small
sketch follows this paper summary]. Nonetheless, real-world variance
may be large enough to avoid the need for explicit randomization in
practice.
- Do not evaluate the impact on wide area flows.
- Implementing fine-grained retransmissions
- Three changes to the Linux TCP stack were required:
- microsecond resolution time accounting to track RTTs with greater
precision - store microseconds in the TCP timestamp option
[timestamp resolution can go as high as 57ns without violating the
requirements of PAWS]
- redefinition of TCP constants - timer constants formerly defined in
terms of jiffies [ticks] are converted to absolute values (e.g. 1ms
instead of 1 jiffy)
- replacement of low-resolution timers with hrtimers - replace standard
timer objects in the socket structure with the hrtimer structure,
ensuring that all calls to set, reset, or clear timers use the
hrtimer functions.
- Results:
- Using the default 200ms RTOmin throughput plummets beyond 8
concurrent senders on both testbeds.
- On the 16 server testbed a 5ms jiffy-based RTOmin throughput begins
to drop at 8 servers to ~70% of link capacity and slowly decreases
thereafter. On the 47 server testbed [Force10 switch] the 5ms
RTOmin kernel obtained 70-80% throughput with a substantial
decline after 40 servers.
- TCP hrtimer implementation / microsecond RTO kernel is able to
saturate the link for up to 16/47 servers [total number in
both testbeds].
- Implications of Fine-Grained TCP Retransmissions:
- A receiver's delayed ACK timer should always fire before the sender's
retransmission timer fires to prevent the sender from timing out
waiting for an ACK that is merely delayed. Current systems protect
against this by setting the delayed ACK timer to a value (40ms)
that is safely under the RTOmin (200ms).
- A host with microsecond granularity retransmissions would periodically
experience an unnecessary timeout when communicating with unmodified
hosts in environments where the RTO is below 40ms (e.g., in the data
center and for short flows in the WAN), because the sender incorrectly
assumes that a loss has occurred. In practice the two consequences
are mitigated by newer TCP features and the limited circumstances in
which they occur (and bulk data transfer is essentially unimpacted by
the issue).
- The major potential effect of a spurious timeout is a loss of
performance: a flow that experiences a timeout will reduce
its slow-start threshold (ssthresh) by half, its window to one
and attempt to rediscover link capacity. It is important to
understand that spurious timeouts do not endanger network
stability through increased congestion [On estimating end-to-end
network path properties. SIGCOMM 99]. Spurious timeouts
occur not when the network path drops packets, but rather when
the path observes a sudden, higher delay.
- Several algorithms to undo the effects of spurious timeouts have
been proposed and, in the case of F-RTO [Forward RTO-Recovery,
RFC 4138], adopted in the Linux TCP implementation.
- When seeding torrents over a WAN there was no observable difference
in performance between the 200us and 200ms RTOmin [no penalty].
- Interaction with Delayed ACK in the Datacenter: For servers using a
reduced RTO in a datacenter environment, the server's retransmission
timer may expire long before an unmodified client's 40ms delayed ACK timer
expires. As a result, the server will timeout and resend the unacked
packet, cutting ssthresh in half and rediscovering link capacity using
slow-start. Because the client acknowledges the retransmitted segment
immediately, the server does not observe a coarse-grained 40ms delay,
only an unnecessary timeout.
- Although for full performance delayed acks should be disabled, unmodified
clients still achieve good performance and avoid incast when only the
servers implement fine-grained retransmissions.
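The randomized backoff mentioned in the desynchronization discussion above might be sketched like
this (illustrative; rand() stands in for whatever entropy source the stack actually uses):

    #include <stdlib.h>

    /* timeout = (RTO + rand(0..0.5) * RTO) * 2^backoff */
    static double
    randomized_timeout(double rto, int backoff)
    {
        double jitter = 0.5 * rto * ((double)rand() / RAND_MAX);
        return (rto + jitter) * (double)(1 << backoff);
    }
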
Data Center Transmission Control Protocol (DCTCP):
The Microsoft & Stanford developed CC protocol uses simplified switch RED/ECN CE marking to
provide fine grained congestion notification to senders. RED is enabled in the switch but
minth=maxth=K, where K is an empirically determined constant that is a function of bandwidth
and desired switch utilization vs rate of convergence. Common values for K are 5 for 1Gbps
and 60 for 10Gbps. The value for 40Gbps is presumably on the order of 240. The sender's
congestion window is scaled back once per RTT as a function of (#ECE/(#segments in window))/2.
In the degenerate case of all segments being marked, the window is halved, as for a loss in
Reno. In the steady state latencies are much lower than in Reno due to considerably reduced
switch occupancy.
There is currently no mechanism for negotiating CC protocols and DCTCP's reliance on continuous
ECE notifications is incompatible with ECN's continuous repeating of the same ECE until a CWR
is received. In effect ECN support has to be successfully negotiated when establishing the
connection, but the receiver has to instead provide one ECE per new CE seen.
RFC:
Datacenter TCP (DCTCP): TCP Congestion Control for Datacenters
https://tools.ietf.org/pdf/draft-ietf-tcpm-dctcp-00.pdf
The window scaling constant is referred to as 'alpha'. Alpha=0 corresponds
to no congestion, alpha=1 corresponds to a loss event in Reno or an ECE mark in standard
ECN - resulting in a halving of the congestion window. 'g' is the feedback gain, 'M' is the
fraction of bytes marked to bytes sent. Alpha and the congestion window 'cwnd' are calculated
as follows:
alpha = alpha * (1 - g) + g * M
cwnd = cwnd * (1 - alpha/2)
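A literal sketch of those two updates, applied once per observation window (roughly one RTT);
variable names are illustrative, not the draft's pseudocode:

    /* g: feedback gain (1/16 in the Microsoft implementation); cwnd in segments. */
    static void
    dctcp_window_update(double *alpha, unsigned int *cwnd, double g,
        double marked_bytes, double total_bytes)
    {
        double m = (total_bytes > 0.0) ? marked_bytes / total_bytes : 0.0;

        *alpha = *alpha * (1.0 - g) + g * m;            /* EWMA of the marked fraction */
        if (m > 0.0) {
            *cwnd = (unsigned int)(*cwnd * (1.0 - *alpha / 2.0));
            if (*cwnd < 1)
                *cwnd = 1;                              /* never go below one segment */
        }
    }
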
To cope with delayed acks DCTCP specifies the following state machine - CE refers to DCTCP.CE,
a new Boolean TCP state variable, "DCTCP Congestion Encountered" - which is initialized to
false and stored in the Transmission Control Block (TCB).
Send immediate
ACK with ECE=0
.----. .-------------. .---.
Send 1 ACK / v v | | \
for every | .------. .------. | Send 1 ACK
m packets | | CE=0 | | CE=1 | | for every
with ECE=0 | '------' '------' | m packets
\ | | ^ ^ / with ECE=1
'---' '------------' '----'
Send immediate
ACK with ECE=1
The clear implication is that if an ACK is delayed by more than m packets - e.g. because the peers
assume different values of m, or because ACKs are dropped - the signal can underestimate the level
of congestion encountered. None of the literature suggests that this has been a problem in practice.
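One way to read that state machine as code (a sketch only; the draft's handling of the ACK sent at
a state change is slightly more involved):

    #include <stdbool.h>

    struct dctcp_rx {
        bool ce;        /* DCTCP.CE: CE value of the most recent data segment */
        int  held;      /* segments covered by the currently delayed ACK */
        int  m;         /* delayed-ACK threshold, typically 2 */
    };

    /* Returns true if an ACK must be sent now; *ece is the flag to put in it. */
    static bool
    dctcp_rx_segment(struct dctcp_rx *r, bool seg_ce, bool *ece)
    {
        if (seg_ce != r->ce) {
            *ece = r->ce;       /* flush pending ACK carrying the old CE value */
            r->ce = seg_ce;
            r->held = 1;        /* the new segment waits for its own ACK */
            return true;
        }
        if (++r->held >= r->m) {
            *ece = r->ce;
            r->held = 0;
            return true;
        }
        return false;           /* keep delaying */
    }
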
[Section 3.4 of RFC]
Handling of SYN, SYN-ACK, RST Packets
[RFC3168] requires that a compliant TCP MUST NOT set ECT on SYN or
SYN-ACK packets. [RFC5562] proposes setting ECT on SYN-ACK packets,
but maintains the restriction of no ECT on SYN packets. Both these
RFCs prohibit ECT in SYN packets due to security concerns regarding
malicious SYN packets with ECT set. These RFCs, however, are
intended for general Internet use, and do not directly apply to a
controlled datacenter environment. The switching fabric can drop TCP
packets that do not have the ECT set in the IP header. If SYN and
SYN-ACK packets for DCTCP connections do not have ECT set, they will
be dropped with high probability. For DCTCP connections, the sender
SHOULD set ECT for SYN, SYN-ACK and RST packets.
[Section 4]
Implementation Issues
- the implementation must choose a suitable estimation gain (feedback gain)
- [DCTCP10] provides a theoretical basis for its selection, though in practice
it may be simpler to select it empirically per network/workload
- The Microsoft implementation uses a fixed estimation gain of 1/16
- the implementation must decide when to use DCTCP. DCTCP may not be
suitable or supported for all peers.
- It is RECOMMENDED that the implementation deal with loss episodes in
the same way as conventional TCP.
- To prevent incast throughput collapse, the minimum RTO (MinRTO) should be
lowered significantly. The default value of MinRTO in Windows is 300ms,
Linux 200ms, and FreeBSD 233ms. A lower MinRTO requires a correspondingly
lower delayed ACK timeout on the receiver. Thus, it is RECOMMENDED that an
implementation allow configuration of lower timeouts for DCTCP connections.
- It is also RECOMMENDED that an implementation allow configuration of
restarting the congestion window (cwnd) of idle DCTCP connections as described
in [RFC5681].
- [RFC3168] forbids the ECN-marking of pure ACK packets, because of the
inability of TCP to mitigate ACK-path congestion and protocol-wise
preferential treatment by routers. However, dropping pure ACKs -
rather than ECN marking them - has disadvantages for typical
datacenter traffic patterns. Dropping of ACKs causes subsequent re-
transmissions. It is RECOMMENDED that an implementation provide a
configuration knob that forces ECT to be set on pure ACKs.
[Section 5]
Deployment Issues
- DCTCP and conventional TCP congestion control do not coexist well in
the same network. In DCTCP, the marking threshold is set to a very
low value to reduce queueing delay, and a relatively small amount of
congestion will exceed the marking threshold. During such periods of
congestion, conventional TCP will suffer packet loss and quickly and
drastically reduce cwnd. DCTCP, on the other hand, will use the
fraction of marked packets to reduce cwnd more gradually. Thus, the
rate reduction in DCTCP will be much slower than that of conventional
TCP, and DCTCP traffic will gain a larger share of the capacity
compared to conventional TCP traffic traversing the same path. It is
RECOMMENDED that DCTCP traffic be segregated from conventional TCP traffic.
[MORGANSTANLEY] describes a deployment that uses the IP DSCP bits to
segregate the network such that AQM is applied to DCTCP traffic, whereas
TCP traffic is managed via drop-tail queueing.
- Since DCTCP relies on congestion marking by the switches, DCTCP can
only be deployed in datacenters where the entire network
infrastructure supports ECN. The switches may also support
configuration of the congestion threshold used for marking. The
proposed parameterization can be configured with switches that
implement RED. [DCTCP10] provides a theoretical basis for selecting
the congestion threshold, but as with the estimation gain, it may be
more practical to rely on experimentation or simply to use the
default configuration of the device. DCTCP will degrade to loss-
based congestion control when transiting a congested drop-tail link.
- DCTCP requires changes on both the sender and the receiver, so both
endpoints must support DCTCP. Furthermore, DCTCP provides no
mechanism for negotiating its use, so both endpoints must be
configured through some out-of-band mechanism to use DCTCP. A
variant of DCTCP that can be deployed unilaterally and only requires
standard ECN behavior has been described in [ODCTCP][BSDCAN], but
requires additional experimental evaluation.
[Section 6]
Known Issues
- DCTCP relies on the sender's ability to reconstruct the stream of CE
codepoints received by the remote endpoint. To accomplish this,
DCTCP avoids using a single ACK packet to acknowledge segments
received both with and without the CE codepoint set. However, if one
or more ACK packets are dropped, it is possible that a subsequent ACK
will cumulatively acknowledge a mix of CE and non-CE segments. This
will, of course, result in a less accurate congestion estimate.
o Even with an inaccurate congestion estimate, DCTCP may still
perform better than [RFC3168].
o If the estimation gain is small relative to the packet loss rate,
the estimate may not be too inaccurate.
o If packet loss mostly occurs under heavy congestion, most drops
will occur during an unbroken string of CE packets, and the
estimate will be unaffected
- The effect of packet drops on DCTCP under real world conditions has not been
analyzed.
- Much like standard TCP, DCTCP is biased against flows with longer
RTTs. A method for improving the fairness of DCTCP has been proposed
in [ADCTCP], but requires additional experimental evaluation.
Papers:
Data Center TCP [DCTCP10]
- http://research.microsoft.com/en-us/um/people/padhye/publications/dctcp-sigcomm2010.pdf
The original DCTCP SIGCOMM paper by Stanford and Microsoft Research. It is very accessible
even for those of us not well versed in CC protocols.
- reduce minRTO to 10ms.
- suggest that K > (RTT * C)/7, where C is the sending rate in packets per second.
Attaining the Promise and Avoiding the Pitfalls of TCP
in the Datacenter [MORGANSTANLEY]
- https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-judd.pdf
Real world experience deploying DCTCP on Linux at Morgan Stanley.
- reduce minRTO to 5ms.
- reduce delayed ACK to 1ms.
- Only ToR switches support ECN marking, higher level switches purely tail-drop.
Tests show that DCTCP successfully resorts to loss-based congestion control when
transiting a congested drop-tail link.
- Find that setting ECT on SYN and SYN-ACK is critical for the practical
deployment of DCTCP. Under load, DCTCP would fail to establish network
connections in the absence of ECT in SYN and SYN-ACK packets. (DCTCP+)
- Without correct receive buffer tuning DCTCP will converge _faster_ than TCP,
rather than the theoretical 1.4 x TCP.
Per-packet latency in ms
           TCP      DCTCP+
Mean       4.01     0.0422
Median     4.06     0.0395
Maximum    4.20     0.0850
Minimum    3.32     0.0280
sigma      0.167    0.0106
Extensions to FreeBSD Datacenter TCP for Incremental
Deployment Support [BSDCAN]
- https://www.bsdcan.org/2015/schedule/attachments/315_dctcp-bsdcan2015-paper.pdf
Proposes a variant of DCTCP that can be deployed only on one endpoint of a connection,
provided the peer is ECN-capable.
ODCTCP changes:
- In order to facilitate one-sided deployment, a DCTCP
sender should set the CWR mark after receiving an ECE-marked
ACK once per RTT. It is safe in two-sided deployments,
because a regular DCTCP receiver will simply ignore the
CWR mark.
- A one-sided DCTCP receiver should always delay an ACK for
incoming packets marked with CWR, which is the only indication
of recovery exit.
DCTCP improvements:
- ECE processing: Under standard ECN an ACK with an ECE mark will
trigger congestion recovery. When this happens a sender stops
increasing cwnd for one RTT. For DCTCP there is no reason for
this response. ECEs are used, not for detecting congestion
events, but to quantify the extent of congestion and react
proportionally. Thus, there is no need to stop cwnd from increasing.
- Set initial value of alpha to 0 (i.e. don't halve cwnd on first
ECE seen).
- Idle Periods: The same tradeoffs regarding "slow-start restart"
apply to alpha. The FreeBSD implementation re-initializes alpha
after an idle period longer than the RTO.
- Timeouts and Packet Loss: The DCTCP specification defines the
update interval for alpha as one RTT. To track this DCTCP compares
received ACKs against the sequence numbers of outgoing packets.
This is not robust in the face of packet loss. The FreeBSD
implementation addresses this by updating alpha when it detects
duplicate ACKs or timeouts.
Data Center TCP (DCTCP)
- http://www.ietf.org/proceedings/80/slides/iccrg-3.pdf
Case studies, workloads, latency and flow completion time of TCP vs DCTCP.
Interesting set of slides worth skimming.
- Small (10-100KB & 100KB - 1MB) background flows complete in ~45% less
time than TCP.
- 99th %ile & 99.9th %ile query flows are 2/3rds and 4/7ths respectively
- large (1-10MB & > 10MB) flows unchanged
- query completion time with 10 to 1 background incast unchanged with
DCTCP, ~5x slower with TCP
Analysis of DCTCP: Stability, Convergence, and Fairness [ADCTCP]
- http://sedcl.stanford.edu/files/dctcp-analysis.pdf
Follow up mathematical analysis of DCTCP using a fluid model. Contains
interesting graphs showing how the gain factor affects the convergence rate
between two flows.
- Analyzes the convergence of DCTCP sources to their fair share, obtaining
an explicit characterization of the convergence rate.
- Proposes a simple change to DCTCP suggested by the fluid model which
significantly improves DCTCP's RTT-fairness. It suggests updating the
congestion window continuously rather than once per RTT.
- Finds that with a marking threshold, K, of about 17% of the bandwidth-
delay product, DCTCP achieves 100% throughput, and that even for values
of K as small as 1% of the bandwidth-delay product, its throughput is
at least 94%.
- Show that DCTCP's convergence rate is no more than a factor 1.4 slower than
TCP
Using Data Center TCP (DCTCP) in the Internet [ODCTCP]
- http://www.ikr.uni-stuttgart.de/Content/Publications/Archive/Wa_GLOBECOM_14_40260.pdf
Investigates what would be needed to deploy DCTCP incrementally outside the data
center.
- Proposes finer resolution for alpha value
- Allow the congestion window to grow in the CWR state (similar to [BSDCAN])
- Continuous update of alpha: Define a smaller gain factor (1/2^8 instead of 1/2^4)
to permit an EWMA updated every packet. However, g should actually be a function
of the number of packets in flight.
- Progressive congestion window reduction: Similar to [ADCTCP], reduce the congestion
window on the reception of each ECE.
- develops a formula for AQM RED parameters that always results in equal sharing
between DCTCP and non-DCTCP.
Incast Transmission Control Protocol (ICTCP):
In ICTCP the receiver plays a direct role in estimating the per-flow available bandwidth
and actively re-sizes each connection's receive window accordingly.
- http://research.microsoft.com/pubs/141115/ictcp.pdf
Quantum Congestion Notification (QCN):
Congestion control in Ethernet. Introduced as part of the IEEE 802.1 Standards
Body discussions for Data Center Bridging [DCB], motivated by the needs of FCoE.
The initial congestion control protocol was standardized as 802.1Qau. Unlike
the single bit of congestion information per packet in TCP, QCN uses 6 bits.
The algorithm is composed of two main parts: Switch or Control Point (CP)
Dynamics and Rate Limiter or Reaction Point (RP) Dynamics. [A short sketch
of both parts follows this description.]
- The CP Algorithm runs at the network nodes. Its objective is to maintain the
node's buffer occupancy at the operating point 'Beq'. It computes a
congestion measure Fb and randomly samples an incoming packet with a probability
proportional to the severity of the congestion. The node sends a 6-bit
quantized value of Fb back to the source of the sampled packet.
- B: Value of the current queue length
- Bold: Value of the buffer occupancy when the last feedback message was
generated.
- w: a non-negative constant, equal to 2 for the baseline implementation
- Boff = B - Beq
- Bd = B - Bold
- Fb = Boff + w*Bd
- essentially equivalent to the PI AQM. The first term is the offset
from the target operating point and the second term is proportional
to the rate at which the queue size is changing.
When Fb < 0, there is no congestion, and no feedback messages are sent.
When Fb >= 0, then either the buffers or the link is oversubscribed, and
control action needs to be taken.
- The RP algorithm runs on end systems (NICs) and controls the rate at which
ethernet packets are transmitted. Unlike TCP, the RP algorithm does not
get positive ACKs from the network and thus needs alternative mechanisms
for increasing its sending rate.
- Current Rate (Rc): The transmission rate of the source
- Target Rate (Rt): The transmission rate of the source just before the
arrival of the last feedback message
- Gain (Gd): a constant chosen so that Gd*|Fbmax| = 1/2 - that is to say
the rate can decrease by at most 50%. Only 6 bits are available for
feedback so Fbmax = 64, and thus Gd = 1/128.
- Byte counter: A counter at the RP for counting transmitted bytes; used
to time rate increases
- Timer: A clock at the RP used for timing rate increases.
Rate Decreases:
A rate decrease is only done when a feedback message is received:
- Rt <- Rc
- Rc <- Rc*(1 - Gd*|Fb|)
Rate Increases:
Rate Increase is done in two phases: Fast Recovery and Active Increase.
Fast Recovery (FR): The source enters the FR state immediately after a
rate decrease event - at which point the Byte Counter is reset. FR
consists of 5 cycles, in each of which 150KB of data (assuming full-sized
regular frames) are transmitted (100 packets of 1500 bytes each),
as counted by the Byte Counter. At the end of each cycle, Rt remains
unchanged, and Rc is updated as follows:
Rc <- (Rc + Rt)/2
The rationale being that, when congested, Rate Decrease messages are
sent by the CP once every 100 packets. Thus the absence of a Rate
Decrease message during this interval indicates that the CP is no
longer congested.
Active Increase (AI): After 5 cycles of FR, the source enters the AI
state when it probes for extra bandwidth. AI consists of multiple
cycles of 50 packets each. Rt and Rc are updated as follows:
- Rt <- Rt + Rai
- Rc <- (Rc + Rt)/2
- Rai: a constant set to 5Mbps by default.
When Rc is extremely small after a rate decrease the time required to
send out 150 KB can be excessive. To increase the rate of increase
the source also uses a timer that is used as follows:
1) reset timer when rate decrease message arrives
2) source enters FR and counts out 5 cycles of T ms duration
(T = 10ms in baseline implementation), and in the AI state,
each cycle is T/2 ms long
3) in the AI state, Rc is updated when _either_ the Byte Counter
or the Timer completes a cycle.
4) The source is in the AI state iff either the Byte Counter
or the Timer is in the AI state.
5) if _both_ the Byte Counter and the Timer are in AI the source is
said to be in Hyper-Active Increase (HAI). In this case, at the
completion of the ith Byte Counter and Timer cycle, Rt and Rc
are updated:
- Rt <- Rt + i*Rhai
- Rc <- (Rc + Rt) / 2
- Rhai: 50Mbps in the baseline
[Taken from "Internet Congestion Control" by Subir Varma, ch. 8]
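The sketch referred to above: a rough rendering of the CP feedback computation and the RP rate
adjustments, with constants following the baseline values in the text (this is not vendor firmware):

    struct qcn_cp {
        double beq;     /* target buffer occupancy */
        double bold;    /* occupancy when the last feedback message was generated */
        double w;       /* weight, 2 in the baseline */
    };

    /* Returns Fb; a feedback message is only generated when Fb >= 0. */
    static double
    qcn_cp_feedback(struct qcn_cp *cp, double b)    /* b = current queue length */
    {
        double fb = (b - cp->beq) + cp->w * (b - cp->bold);     /* Boff + w*Bd */
        if (fb >= 0.0)
            cp->bold = b;
        return fb;
    }

    #define QCN_GD   (1.0 / 128.0)  /* Gd, so that Gd * |Fbmax| = 1/2 */
    #define QCN_RAI  5e6            /* 5 Mbps active-increase step */

    static void
    qcn_rp_decrease(double *rc, double *rt, double fb)
    {
        *rt = *rc;                              /* remember the rate before the cut */
        *rc = *rc * (1.0 - QCN_GD * fb);        /* fb is the quantized |Fb| */
    }

    /* One FR or AI cycle boundary: rai is 0 during Fast Recovery, QCN_RAI in AI. */
    static void
    qcn_rp_increase(double *rc, double *rt, double rai)
    {
        *rt += rai;
        *rc = (*rc + *rt) / 2.0;
    }
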
Performance of Quantized Congestion Notification in TCP Incast Scenarios of
Data Centers
- http://eprints.networks.imdea.org/131/1/Performance_of_Quantized_Congestion_Notification_-_2010_EN.pdf
Using the QCN pseudocode version released by Rong Pan [IEEE EDCS-608482], the
authors simulated the performance of QCN at 1Gbps under a number of incast
scenarios, reaching the conclusion that the default QCN behaviors will not scale
to a large number of flows with full link utilization. The paper goes on to propose
a small number of changes to the QCN algorithm that _will_ support a large
number of flows at full link utilization. However, there is no indication in the
literature that these ideas have been taken any further in practice. A survey
paper written in 2014 [A Survey on TCP Incast in Data Center Networks] indicates
that these problems still exist. It is unclear what the current state of the art
is in shipping hardware.
http://www.ieee802.org/3/ar/public/0505/bergamasco_1_0505.pdf
http://www.ieee802.org/1/files/public/docs2007/au-bergamasco-ecm-v0.1.pdf
http://www.cs.wustl.edu/~jain/papers/ftp/bcn.pdf
http://www.cse.wustl.edu/~jain/papers/ftp/icc08.pdf
Recommendations:
RFC 6298:
- change starting RTO from 3s to 1s (in /dctcp) D4294
- DO NOT round the RTO up to 1s, contrary to the RFC's suggestion (long done)
- simplify setting of the minRTO sysctl to eliminate the "slop" component D4294
(in /dctcp)
RFC 6928:
- increase initial / idle window to 10 segments when connecting to (done by hiren)
data center peers
RFC 7323:
- stop truncating SRTT prematurely on low-latency connections, D4293
using the Appendix G calculation to reduce potentially detrimental
fluctuations in the calculated RTO
Incast:
- do SW TSO only
- add rudimentary pacing by interleaving streams
- fine grained timers D4292
- scale RTO down to the same granularity as RTT (patch in progress)
ECN:
- change default to allow ECN on incoming connections
- set ECT on _ALL_ packets sent by a host using a DCTCP connection
- add facility to enable ECN by subnet
DCTCP:
- add facility to enable DCTCP by subnet
- set ECT on _ALL_ packets sent by a host using a DCTCP connection
- update TCP to use microsecond granularity timers and timestamps (patch in progress)
- when using the current coarse-grained timers, reduce minRTO to 3ms D4294
for DCTCP connections; if fine-grained timers are available, disable
minRTO entirely for DCTCP
- reduce delack to 1/5th of min(minRTO, RTO) (reduced to 1/2 in /dctcp) D4294
ICTCP:
- if there is time, investigate its use and the ability to use
socket buffer sizing to communicate the amount of anticipated
data, for purposes of TCBs sharing the port's connections optimally