DCTCP implementation

Mon Mar 31 05:37:12 UTC 2014

Hi FreeBSD developers,

I'm Midori Kato. I'm working on a DCTCP implementation for FreeBSD
with Lars Eggert. I am writing to ask for code review and testing. The
attached patch alone is not enough to test our code. If you are
interested, please reply to me, and I will send you an ECN marking
implementation for dummynet and test scripts directly.

The DCTCP paper was published at SIGCOMM 2010. DCTCP is also published
as an IETF draft [1]. Our implementation contains one change to the
modular congestion control framework and five changes from the original
DCTCP algorithm. I briefly describe each of them below.

<1> A change to the modular congestion control framework
DCTCP processes ECN differently from RFC 3168. We need three
functions to do the proper DCTCP ECN processing.
   a) The kernel decides whether an ECE flag should be set in the
outgoing TCP segment. (tcp_input.c)
   b) The kernel performs congestion control if an ECE flag is set in
the arriving TCP segment. (tcp_input.c)
   c) After the outgoing TCP segment is generated, the kernel decides
whether ECT should be set in the ECN field of the IP header of the
outgoing packet. (tcp_output.c)
The current framework has no housekeeping functions for (a) and (c).
Therefore, I add two functions to the modular cc framework:
ecnpkt_handler() and ect_handler().
   - ecnpkt_handler() allows the kernel to do additional ECN
processing by inspecting the ECN field in the IP header and the flags
in the TCP header. As an option, this function checks a delayed ACK
flag, which tells whether the ACK for the current segment may be
delayed. This function returns an integer value. When the return value
is nonzero, the kernel forcibly disables delayed ACK.
   - ect_handler() allows the kernel to use a different rule from
RFC 3168 for ECT marking of the outgoing segment. This function
returns an integer value. If the value is nonzero, ECT is set on the
outgoing segment.
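
As a rough illustration of the optional-hook pattern described above, here is a minimal user-space sketch. It assumes a cut-down structure with only the two new function pointers; the names `cc_hooks_sketch`, `always_ect_sketch`, `call_ect_hook` and `call_ecnpkt_hook` are made up for this example and are not the kernel definitions. The point is only that an algorithm without a hook keeps the default behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the two new cc_algo members. */
struct cc_hooks_sketch {
	int (*ecnpkt_handler)(uint8_t iptos, int cwr, int is_delayack);
	int (*ect_handler)(void);
};

/* Example hook: always ask for ECT, as the DCTCP module does. */
static int
always_ect_sketch(void)
{
	return (1);
}

/* Returns 1 if the algorithm asks for ECT, 0 if it declines or has no hook. */
static int
call_ect_hook(const struct cc_hooks_sketch *algo)
{
	assert(algo != NULL);
	if (algo->ect_handler != NULL)
		return (algo->ect_handler() != 0);
	return (0);
}

/* Returns the hook's verdict, or 0 when the algorithm has no hook. */
static int
call_ecnpkt_hook(const struct cc_hooks_sketch *algo, uint8_t iptos, int cwr,
    int is_delayack)
{
	assert(algo != NULL);
	if (algo->ecnpkt_handler != NULL)
		return (algo->ecnpkt_handler(iptos, cwr, is_delayack) != 0);
	return (0);
}
```

An algorithm that leaves both pointers NULL gets 0 from both wrappers, so the existing RFC 3168 code path is unchanged for it.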

<2> Five changes from the original DCTCP algorithm
To reflect the motivation behind DCTCP correctly, our patch includes
the following modifications. The first four modifications apply to
senders and the last one applies to receivers.
   (1) ECE processing
   FreeBSD handles ECN as a congestion event, but this does not fit
DCTCP senders. A DCTCP sender uses ECN as a means to estimate the
extent of congestion. Therefore, a modified DCTCP sender never enters
congestion recovery mode in any situation.

   (2) Selectable initial alpha value
   DCTCP defines alpha as a parameter that estimates the depth of
congestion. When the alpha value is large, the CWND of a DCTCP sender
shows saw-toothed behavior. A problem is that the alpha value is not
reliable during the first dozen or so RTTs, because there is no way to
know the depth of congestion in a network from the beginning.
Considering this, I think users should be able to select the initial
alpha according to their applications. When a user chooses DCTCP for
latency-sensitive applications, a nonzero initial alpha is preferred.
Otherwise, DCTCP senders had better set the initial alpha value to
zero.
   The default alpha value is set to zero in our implementation.

   (3) Alpha value initialization after an idle period
   The original DCTCP paper does not define how the sender behaves
after an idle period. In our implementation, a DCTCP sender resets
alpha to the initial value after an idle period.
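
To make the role of alpha concrete, here is a small user-space sketch of the fixed-point arithmetic, modelled on dctcp_update_alpha() and the CC_ECN window reduction in the attached patch. Alpha is scaled so that 1024 means 1.0, and g = 1 / 2^shift_g. The function names are made up for this example, and the 64-bit intermediates are an assumption added here to avoid overflow; the patch itself uses plain int arithmetic.

```c
#include <assert.h>
#include <stdint.h>

#define ALPHA_ONE_SKETCH	1024	/* alpha == 1.0 in 10-bit fixed point */

/*
 * One update step: alpha = (1 - g) * alpha + g * F, where
 * F = bytes_ecn / bytes_total. The result is clamped to [0, 1024].
 */
static uint32_t
alpha_update_sketch(uint32_t alpha, uint32_t bytes_ecn, uint32_t bytes_total,
    uint32_t shift_g)
{
	uint32_t next;

	assert(shift_g <= 10);
	if (bytes_total == 0)
		bytes_total = 1;	/* avoid dividing by zero */
	next = alpha - (alpha >> shift_g) +
	    (uint32_t)(((uint64_t)bytes_ecn << (10 - shift_g)) / bytes_total);
	return (next > ALPHA_ONE_SKETCH ? ALPHA_ONE_SKETCH : next);
}

/*
 * Window reduction on ECE: cwnd * (1 - alpha / 2), rounded down to whole
 * segments and never below two segments.
 */
static uint32_t
cwnd_reduce_sketch(uint32_t cwnd, uint32_t alpha, uint32_t mss)
{
	uint32_t segs;

	assert(mss > 0 && alpha <= ALPHA_ONE_SKETCH);
	segs = (cwnd - (uint32_t)(((uint64_t)cwnd * alpha) >> 11)) / mss;
	if (segs < 2)
		segs = 2;
	return (segs * mss);
}
```

With shift_g = 4, one fully marked window moves alpha from 0 to 64 (that is, 1/16), so a sender starting at zero reacts only mildly to early marks. A sender with alpha = 1024 halves its window, which matches the standard ECN response.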

   The following changes are applied to eliminate compatibility issues
with standard ECN as defined in RFC 3168. Currently, DCTCP and standard
ECN servers have no way to identify which mechanism the peer is using.
Thus, we eliminate the worst cases in a network that mixes DCTCP
senders/receivers and standard ECN senders/receivers.
   (4) Emitting CWRs at one-sided senders
   This change applies to the situation where the sender uses DCTCP and
the receiver uses standard ECN.
   In this situation, we found that the DCTCP sender reduces CWND to
its minimum. Fortunately, the current tcp_input() function already
covers this case, so our patch contains no modification for it.

   (5) Delayed ACK at one-sided receivers
   This change applies to the situation where the sender uses standard
ECN and the receiver uses DCTCP. In this situation, we found that a
standard ECN sender grows CWND less than expected when the one-sided
DCTCP receiver disables delayed ACK for a packet with the CWR flag.
Thus, we always keep delayed ACK when the CWR flag is set in the
arriving packet.
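
The receiver-side rule in (5) can be sketched as a small state machine, modelled on dctcp_ecnpkt_handler() in the attached patch. This is a user-space illustration only: the structure, the constants and the function name are made up for this example, and the constants simply restate the RFC 3168 codepoints in the low two bits of the IP TOS byte. A nonzero return is the signal that the text above describes as forcing delayed ACK off.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ECN codepoints in the low two bits of the IP TOS byte (RFC 3168). */
#define ECN_MASK_SKETCH	0x03
#define ECN_ECT1_SKETCH	0x01
#define ECN_ECT0_SKETCH	0x02
#define ECN_CE_SKETCH	0x03

struct rcv_state_sketch {
	int ce_prev;	/* was the previous segment CE-marked? */
	int send_ece;	/* should outgoing ACKs carry ECE? */
};

/*
 * Returns 1 when the CE state changed while the ACK could be delayed,
 * and 0 otherwise. A segment carrying CWR always yields 0, so delayed
 * ACK is kept for it (change (5)). Not-ECT segments leave the state
 * untouched.
 */
static int
rcv_ecn_sketch(struct rcv_state_sketch *st, uint8_t iptos, int cwr,
    int is_delayack)
{
	int ce, changed;

	assert(st != NULL);
	switch (iptos & ECN_MASK_SKETCH) {
	case ECN_CE_SKETCH:
		ce = 1;
		break;
	case ECN_ECT0_SKETCH:
	case ECN_ECT1_SKETCH:
		ce = 0;
		break;
	default:
		return (0);
	}
	changed = (ce != st->ce_prev) && is_delayack;
	st->ce_prev = ce;
	st->send_ece = ce;
	if (cwr && is_delayack)
		changed = 0;
	return (changed);
}
```

Feeding it CE, CE, ECT0 with is_delayack set returns 1, 0, 1: only the transitions between marked and unmarked segments ask for special ACK handling.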

If you want to understand the detailed background of these
modifications, see my thesis [2], especially sections 3 and 4.
I'm looking forward to hearing from you!

Regards,
-- Midori

[1] http://tools.ietf.org/html/draft-bensley-tcpm-dctcp
[2] https://eggert.org/students/kato-thesis.pdf
-------------- next part --------------

diff --git a/sys/modules/cc/Makefile b/sys/modules/cc/Makefile
index 7b851f5..7f4e94e 100644
--- a/sys/modules/cc/Makefile
+++ b/sys/modules/cc/Makefile
@@ -3,6 +3,7 @@
 SUBDIR=	cc_cdg \
 	cc_chd \
 	cc_cubic \
+	cc_dctcp \
 	cc_hd \
 	cc_htcp \
 	cc_vegas
diff --git a/sys/modules/cc/cc_dctcp/Makefile b/sys/modules/cc/cc_dctcp/Makefile
new file mode 100644
index 0000000..32919cd
--- /dev/null
+++ b/sys/modules/cc/cc_dctcp/Makefile
@@ -0,0 +1,9 @@
+# $FreeBSD$
+
+.include <bsd.own.mk>
+
+.PATH: ${.CURDIR}/../../../netinet/cc
+KMOD=	cc_dctcp
+SRCS=	cc_dctcp.c
+
+.include <bsd.kmod.mk>
diff --git a/sys/netinet/cc.h b/sys/netinet/cc.h
index 14b4a9d..381f94e 100644
--- a/sys/netinet/cc.h
+++ b/sys/netinet/cc.h
@@ -143,6 +143,13 @@ struct cc_algo {
 	/* Called when data transfer resumes after an idle period. */
 	void	(*after_idle)(struct cc_var *ccv);
 
+	/* Called for additional ECN processing apart from RFC 3168. */
+	int	(*ecnpkt_handler)(struct cc_var *ccv, uint8_t iptos, int cwr,
+		    int is_delayack);
+
+	/* Called when the host marks ECN capable transmission (ECT). */
+	int	(*ect_handler)(struct cc_var *ccv);
+
 	STAILQ_ENTRY (cc_algo) entries;
 };
 
diff --git a/sys/netinet/cc/cc_dctcp.c b/sys/netinet/cc/cc_dctcp.c
new file mode 100644
index 0000000..d8cd166
--- /dev/null
+++ b/sys/netinet/cc/cc_dctcp.c
@@ -0,0 +1,442 @@
+/*-
+ * Copyright (c) 2007-2008
+ * 	Swinburne University of Technology, Melbourne, Australia
+ * Copyright (c) 2009-2010 Lawrence Stewart <lstewart at freebsd.org>
+ * Copyright (c) 2014 Midori Kato <katoon at sfc.wide.ad.jp>
+ * Copyright (c) 2014 The FreeBSD Foundation
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+/*
+ * An implementation of the DCTCP algorithm for FreeBSD, based on
+ * "Data Center TCP (DCTCP)" by M. Alizadeh, A. Greenberg, D. A. Maltz,
+ * J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan,
+ * in ACM Conference on SIGCOMM 2010, New York, USA.
+ * Originally released as a contribution of a Microsoft Research project.
+ */
+
+#include <sys/cdefs.h>
+__FBSDID("$FreeBSD$");
+
+#include <sys/param.h>
+#include <sys/kernel.h>
+#include <sys/malloc.h>
+#include <sys/module.h>
+#include <sys/socket.h>
+#include <sys/socketvar.h>
+#include <sys/sysctl.h>
+#include <sys/systm.h>
+
+#include <net/vnet.h>
+
+#include <netinet/in.h>
+#include <netinet/ip.h>
+#include <netinet/cc.h>
+#include <netinet/tcp_seq.h>
+#include <netinet/tcp_var.h>
+
+#include <netinet/cc/cc_module.h>
+
+#define	CAST_PTR_INT(X)	(*((int*)(X)))
+
+static VNET_DEFINE(uint32_t, dctcp_shift_g) = 4;
+static VNET_DEFINE(uint32_t, dctcp_slowstart) = 0;
+#define V_dctcp_shift_g		VNET(dctcp_shift_g)
+#define	V_dctcp_slowstart	VNET(dctcp_slowstart)
+
+struct dctcp {
+	/* # of marked bytes during a RTT */
+	int     bytes_ecn;
+	/* # of acked bytes during a RTT */
+	int     bytes_total;
+	/* the fraction of marked bytes */
+	int     alpha;
+	/* CE state of the last segment */
+	int     ce_prev;
+	/* end sequence number of the current window */
+	int     save_sndnxt;
+	/* ECE flag in this segment */
+	int	is_ece;
+	/* ECE flag in the last segment */
+	int	ece_prev;
+	/* # of congestion events */
+	uint32_t	num_cong_events;
+};
+
+static MALLOC_DEFINE(M_dctcp, "dctcp data",
+    "Per connection data required for the dctcp algorithm");
+
+static void	dctcp_ack_received(struct cc_var *ccv, uint16_t type);
+static void	dctcp_after_idle(struct cc_var *ccv);
+static void	dctcp_cb_destroy(struct cc_var *ccv);
+static int	dctcp_cb_init(struct cc_var *ccv);
+static void	dctcp_cong_signal(struct cc_var *ccv, uint32_t type);
+static void	dctcp_conn_init(struct cc_var *ccv);
+static void	dctcp_post_recovery(struct cc_var *ccv);
+static int	dctcp_ecnpkt_handler(struct cc_var *ccv, uint8_t iptos, int cwr,
+		    int is_delayack);
+static int	dctcp_ecthandler(struct cc_var *ccv);
+static void	dctcp_update_alpha(struct cc_var *ccv);
+
+struct cc_algo dctcp_cc_algo = {
+	.name = "dctcp",
+	.ack_received = dctcp_ack_received,
+	.cb_destroy = dctcp_cb_destroy,
+	.cb_init = dctcp_cb_init,
+	.cong_signal = dctcp_cong_signal,
+	.conn_init = dctcp_conn_init,
+	.post_recovery = dctcp_post_recovery,
+	.ecnpkt_handler = dctcp_ecnpkt_handler,
+	.after_idle = dctcp_after_idle,
+	.ect_handler = dctcp_ecthandler,
+};
+
+static void
+dctcp_ack_received(struct cc_var *ccv, uint16_t type)
+{
+	struct dctcp *dctcp_data;
+	int bytes_acked = 0;
+
+	dctcp_data = ccv->cc_data;
+
+	/*
+	 * DCTCP doesn't regard ECN as a congestion event.
+	 * Thus, DCTCP always executes the ACK processing
+	 * outside of congestion recovery.
+	 */
+	if (IN_CONGRECOVERY(CCV(ccv, t_flags))) {
+		EXIT_CONGRECOVERY(CCV(ccv, t_flags));
+		newreno_cc_algo.ack_received(ccv, type);
+		ENTER_CONGRECOVERY(CCV(ccv, t_flags));
+	} else
+		newreno_cc_algo.ack_received(ccv, type);
+
+	/* Updates the fraction of marked bytes. */
+	if (CCV(ccv, t_flags) & TF_ECN_PERMIT) {
+
+		if (type == CC_DUPACK)
+			bytes_acked = CCV(ccv, t_maxseg);
+
+		if (type == CC_ACK)
+			bytes_acked = ccv->bytes_this_ack;
+
+		/* Update total bytes. */
+		dctcp_data->bytes_total += bytes_acked;
+
+		/* Update total marked bytes. */
+		if (dctcp_data->is_ece) {
+			if (!dctcp_data->ece_prev
+			    && bytes_acked > CCV(ccv, t_maxseg)) {
+				dctcp_data->bytes_ecn +=
+				    (bytes_acked - CCV(ccv, t_maxseg));
+			} else
+				dctcp_data->bytes_ecn += bytes_acked;
+			dctcp_data->ece_prev = 1;
+		} else {
+			if (dctcp_data->ece_prev
+			    && bytes_acked > CCV(ccv, t_maxseg))
+				dctcp_data->bytes_ecn += CCV(ccv, t_maxseg);
+			dctcp_data->ece_prev = 0;
+		}
+		dctcp_data->is_ece = 0;
+
+		/*
+		 * Update the fraction of marked bytes at the end of
+		 * current window size.
+		 */
+		if ((IN_FASTRECOVERY(CCV(ccv, t_flags)) &&
+		    SEQ_GEQ(ccv->curack, CCV(ccv, snd_recover))) ||
+		    (!IN_FASTRECOVERY(CCV(ccv, t_flags)) &&
+		    SEQ_GT(ccv->curack, dctcp_data->save_sndnxt)))
+			dctcp_update_alpha(ccv);
+	}
+}
+
+static void
+dctcp_after_idle(struct cc_var *ccv)
+{
+	struct dctcp *dctcp_data;
+
+	dctcp_data = ccv->cc_data;
+
+	/* Initialize internal parameters after idle time */
+	dctcp_data->bytes_ecn = 0;
+	dctcp_data->bytes_total = 0;
+	dctcp_data->save_sndnxt = CCV(ccv, snd_nxt);
+	dctcp_data->alpha = 0;
+	dctcp_data->is_ece = 0;
+	dctcp_data->ece_prev = 0;
+	dctcp_data->num_cong_events = 0;
+
+	newreno_cc_algo.after_idle(ccv);
+}
+
+static void
+dctcp_cb_destroy(struct cc_var *ccv)
+{
+	if (ccv->cc_data != NULL)
+		free(ccv->cc_data, M_dctcp);
+}
+
+static int
+dctcp_cb_init(struct cc_var *ccv)
+{
+	struct dctcp *dctcp_data;
+
+	dctcp_data = malloc(sizeof(struct dctcp), M_dctcp, M_NOWAIT|M_ZERO);
+
+	if (dctcp_data == NULL)
+		return (ENOMEM);
+
+	/* Initialize some key variables with sensible defaults. */
+	dctcp_data->bytes_ecn = 0;
+	dctcp_data->bytes_total = 0;
+	dctcp_data->alpha = 0;
+	dctcp_data->save_sndnxt = 0;
+	dctcp_data->ce_prev = 0;
+	dctcp_data->is_ece = 0;
+	dctcp_data->ece_prev = 0;
+	dctcp_data->num_cong_events = 0;
+
+	ccv->cc_data = dctcp_data;
+	return (0);
+}
+
+/*
+ * Perform any necessary tasks before we enter congestion recovery.
+ */
+static void
+dctcp_cong_signal(struct cc_var *ccv, uint32_t type)
+{
+	struct dctcp *dctcp_data;
+	u_int win, mss;
+
+	dctcp_data = ccv->cc_data;
+	win = CCV(ccv, snd_cwnd);
+	mss = CCV(ccv, t_maxseg);
+
+	switch (type) {
+	case CC_NDUPACK:
+		if (!IN_FASTRECOVERY(CCV(ccv, t_flags))) {
+			if (!IN_CONGRECOVERY(CCV(ccv, t_flags))) {
+				CCV(ccv, snd_ssthresh) = mss *
+				    max(win / 2 / mss, 2);
+				dctcp_data->num_cong_events++;
+			} else {
+				/* cwnd has already been updated by congestion
+				 * recovery. Restore the cwnd value using
+				 * snd_cwnd_prev and recalculate snd_ssthresh.
+				 */
+				win = CCV(ccv, snd_cwnd_prev);
+				CCV(ccv, snd_ssthresh) =
+				    max(win / 2 / mss, 2) * mss;
+			}
+			ENTER_RECOVERY(CCV(ccv, t_flags));
+		}
+		break;
+	case CC_ECN:
+		/*
+		 * Save current snd_cwnd when the host encounters both
+		 * congestion recovery and fast recovery.
+		 */
+		CCV(ccv, snd_cwnd_prev) = win;
+		if (!IN_CONGRECOVERY(CCV(ccv, t_flags))) {
+			if (V_dctcp_slowstart &&
+			    dctcp_data->num_cong_events++ == 0) {
+				CCV(ccv, snd_ssthresh) =
+				    mss * max(win / 2 / mss, 2);
+				dctcp_data->alpha = 1024;
+				dctcp_data->bytes_ecn = 0;
+				dctcp_data->bytes_total = 0;
+				dctcp_data->save_sndnxt = CCV(ccv, snd_nxt);
+			} else
+				CCV(ccv, snd_ssthresh) = max((win - ((win *
+				    dctcp_data->alpha) >> 11)) / mss, 2) * mss;
+			CCV(ccv, snd_cwnd) = CCV(ccv, snd_ssthresh);
+			ENTER_CONGRECOVERY(CCV(ccv, t_flags));
+		}
+		dctcp_data->is_ece = 1;
+		break;
+	case CC_RTO:
+		if (CCV(ccv, t_flags) & TF_ECN_PERMIT) {
+			CCV(ccv, t_flags) |= TF_ECN_SND_CWR;
+			dctcp_update_alpha(ccv);
+			dctcp_data->save_sndnxt += CCV(ccv, t_maxseg);
+			dctcp_data->num_cong_events++;
+		}
+		break;
+	}
+}
+
+static void
+dctcp_conn_init(struct cc_var *ccv)
+{
+	struct dctcp *dctcp_data;
+
+	dctcp_data = ccv->cc_data;
+
+	if (CCV(ccv, t_flags) & TF_ECN_PERMIT)
+		dctcp_data->save_sndnxt = CCV(ccv, snd_nxt);
+}
+
+/*
+ * Perform any necessary tasks before we exit congestion recovery.
+ */
+static void
+dctcp_post_recovery(struct cc_var *ccv)
+{
+	newreno_cc_algo.post_recovery(ccv);
+
+	if (CCV(ccv, t_flags) & TF_ECN_PERMIT)
+		dctcp_update_alpha(ccv);
+}
+
+static int
+dctcp_ecnpkt_handler(struct cc_var *ccv, uint8_t iptos, int cwr, int is_delayack)
+{
+	struct dctcp *dctcp_data;
+	int ret = 0;
+
+	dctcp_data = ccv->cc_data;
+	/*
+	 * DCTCP sends an ACK immediately
+	 * - when the CE state of this segment differs
+	 *   from that of the last segment
+	 * - when this segment sets the CWR flag
+	 */
+	switch (iptos & IPTOS_ECN_MASK) {
+	case IPTOS_ECN_CE:
+		if (!dctcp_data->ce_prev && is_delayack)
+			ret = 1;
+		dctcp_data->ce_prev = 1;
+		CCV(ccv, t_flags) |= TF_ECN_SND_ECE;
+		break;
+	case IPTOS_ECN_ECT0:
+		if (dctcp_data->ce_prev && is_delayack)
+			ret = 1;
+		CCV(ccv, t_flags) &= ~TF_ECN_SND_ECE;
+		dctcp_data->ce_prev = 0;
+		break;
+	case IPTOS_ECN_ECT1:
+		if (dctcp_data->ce_prev && is_delayack)
+			ret = 1;
+		CCV(ccv, t_flags) &= ~TF_ECN_SND_ECE;
+		dctcp_data->ce_prev = 0;
+		break;
+	}
+	if (cwr && is_delayack)
+		ret = 0;
+
+	return (ret);
+}
+
+static int
+dctcp_ecthandler(struct cc_var *ccv)
+{
+	/* DCTCP always marks ECT */
+	return (1);
+}
+
+/*
+ * Update the fraction of marked bytes named alpha. Then, initialize
+ * several internal parameters at the end of this function.
+ */
+static void
+dctcp_update_alpha(struct cc_var *ccv)
+{
+	struct dctcp *dctcp_data;
+	int alpha_prev;
+
+	dctcp_data = ccv->cc_data;
+
+	alpha_prev = dctcp_data->alpha;
+
+	dctcp_data->bytes_total = max(dctcp_data->bytes_total, 1);
+
+	/*
+	 * Update alpha: alpha = (1 - g) * alpha + g * F.
+	 * Alpha must be clamped to the range 0 - 1024.
+	 * XXXMIDORI Is more fine-grained alpha necessary?
+	 */
+	dctcp_data->alpha = min(alpha_prev - (alpha_prev >> V_dctcp_shift_g) +
+	    (dctcp_data->bytes_ecn << (10 - V_dctcp_shift_g)) /
+	    dctcp_data->bytes_total, 1024);
+
+	/* Initialize internal parameters for next alpha calculation */
+	dctcp_data->bytes_ecn = 0;
+	dctcp_data->bytes_total = 0;
+	dctcp_data->save_sndnxt = CCV(ccv, snd_nxt);
+}
+
+static int
+dctcp_shift_g_handler(SYSCTL_HANDLER_ARGS)
+{
+	int error;
+	uint32_t new;
+
+	new = V_dctcp_shift_g;
+	error = sysctl_handle_int(oidp, &new, 0, req);
+	if (error == 0 && req->newptr != NULL) {
+		if (new > 10)
+			error = EINVAL;
+		else
+			V_dctcp_shift_g = new;
+	}
+
+	return (error);
+}
+
+static int
+dctcp_slowstart_handler(SYSCTL_HANDLER_ARGS)
+{
+	int error;
+	uint32_t new;
+
+	new = V_dctcp_slowstart;
+	error = sysctl_handle_int(oidp, &new, 0, req);
+	if (error == 0 && req->newptr != NULL) {
+		if (CAST_PTR_INT(req->newptr) > 1)
+			error = EINVAL;
+		else
+			V_dctcp_slowstart = new;
+	}
+
+	return (error);
+}
+
+SYSCTL_DECL(_net_inet_tcp_cc_dctcp);
+SYSCTL_NODE(_net_inet_tcp_cc, OID_AUTO, dctcp, CTLFLAG_RW, NULL,
+    "dctcp congestion control related settings");
+
+SYSCTL_VNET_PROC(_net_inet_tcp_cc_dctcp, OID_AUTO, shift_g,
+    CTLTYPE_UINT|CTLFLAG_RW, &VNET_NAME(dctcp_shift_g), 4,
+    &dctcp_shift_g_handler,
+    "IU", "dctcp shift parameter");
+
+SYSCTL_VNET_PROC(_net_inet_tcp_cc_dctcp, OID_AUTO, slowstart,
+    CTLTYPE_UINT|CTLFLAG_RW, &VNET_NAME(dctcp_slowstart), 0,
+    &dctcp_slowstart_handler,
+    "IU", "half CWND reduction after the first slow start");
+
+DECLARE_CC_MODULE(dctcp, &dctcp_cc_algo);
diff --git a/sys/netinet/tcp_input.c b/sys/netinet/tcp_input.c
index 20c22ed..2822248 100644
--- a/sys/netinet/tcp_input.c
+++ b/sys/netinet/tcp_input.c
@@ -455,6 +455,32 @@ cc_post_recovery(struct tcpcb *tp, struct tcphdr *th)
 	tp->t_bytes_acked = 0;
 }
 
+/*
+ * Indicate whether this ack should be delayed.  We can delay the ack if
+ *	- there is no delayed ack timer in progress and
+ *	- our last ack wasn't a 0-sized window.  We never want to delay
+ *	  the ack that opens up a 0-sized window and
+ *		- delayed acks are enabled or
+ *		- this is a half-synchronized T/TCP connection.
+ */
+#define DELAY_ACK(tp)							\
+	((!tcp_timer_active(tp, TT_DELACK) &&				\
+	    (tp->t_flags & TF_RXWIN0SENT) == 0) &&			\
+	    (V_tcp_delack_enabled || (tp->t_flags & TF_NEEDSYN)))
+
+static inline void
+cc_ecnpkt_handler(struct tcpcb *tp, struct tcphdr *th, uint8_t iptos)
+{
+	INP_WLOCK_ASSERT(tp->t_inpcb);
+
+	if (CC_ALGO(tp)->ecnpkt_handler != NULL) {
+		if (CC_ALGO(tp)->ecnpkt_handler(tp->ccv, iptos,
+		    (th->th_flags & TH_CWR), DELAY_ACK(tp))) {
+			tcp_timer_activate(tp, TT_DELACK, tcp_delacktime);
+		}
+	}
+}
+
 static inline void
 tcp_fields_to_host(struct tcphdr *th)
 {
@@ -502,19 +528,6 @@ do { \
 #endif
 
 /*
- * Indicate whether this ack should be delayed.  We can delay the ack if
- *	- there is no delayed ack timer in progress and
- *	- our last ack wasn't a 0-sized window.  We never want to delay
- *	  the ack that opens up a 0-sized window and
- *		- delayed acks are enabled or
- *		- this is a half-synchronized T/TCP connection.
- */
-#define DELAY_ACK(tp)							\
-	((!tcp_timer_active(tp, TT_DELACK) &&				\
-	    (tp->t_flags & TF_RXWIN0SENT) == 0) &&			\
-	    (V_tcp_delack_enabled || (tp->t_flags & TF_NEEDSYN)))
-
-/*
  * TCP input handling is split into multiple parts:
  *   tcp6_input is a thin wrapper around tcp_input for the extended
  *	ip6_protox[] call format in ip6_input
@@ -1539,6 +1552,10 @@ tcp_do_segment(struct mbuf *m, struct tcphdr *th, struct socket *so,
 			TCPSTAT_INC(tcps_ecn_ect1);
 			break;
 		}
+
+		/* Process a packet differently from RFC3168. */
+		cc_ecnpkt_handler(tp, th, iptos);
+
 		/* Congestion experienced. */
 		if (thflags & TH_ECE) {
 			cc_cong_signal(tp, th, CC_ECN);
diff --git a/sys/netinet/tcp_output.c b/sys/netinet/tcp_output.c
index 00d5415..30e9b19 100644
--- a/sys/netinet/tcp_output.c
+++ b/sys/netinet/tcp_output.c
@@ -162,6 +162,18 @@ cc_after_idle(struct tcpcb *tp)
 		CC_ALGO(tp)->after_idle(tp->ccv);
 }
 
+static inline int
+cc_ect_handler(struct tcpcb *tp)
+{
+	INP_WLOCK_ASSERT(tp->t_inpcb);
+
+	if (CC_ALGO(tp)->ect_handler != NULL) {
+		if (CC_ALGO(tp)->ect_handler(tp->ccv))
+			return (1);
+	}
+	return (0);
+}
+
 /*
  * Tcp output routine: figure out what should be sent and send it.
  */
@@ -966,9 +978,15 @@ send:
 		 * If the peer has ECN, mark data packets with
 		 * ECN capable transmission (ECT).
 		 * Ignore pure ack packets, retransmissions and window probes.
+		 * Mark data packet with ECN capable transmission (ECT)
+		 * when CC_ALGO meets specific condition.
+		 * Or, if the peer has ECN, mark data packets with ECT
+		 * (RFC 3168). Ignore pure ack packets, retransmissions
+		 * and window probes.
 		 */
-		if (len > 0 && SEQ_GEQ(tp->snd_nxt, tp->snd_max) &&
-		    !((tp->t_flags & TF_FORCEDATA) && len == 1)) {
+		int mark_ect = cc_ect_handler(tp);
+		if (mark_ect || (len > 0 && SEQ_GEQ(tp->snd_nxt, tp->snd_max)
+		    && !((tp->t_flags & TF_FORCEDATA) && len == 1))) {
 #ifdef INET6
 			if (isipv6)
 				ip6->ip6_flow |= htonl(IPTOS_ECN_ECT0 << 20);