Date: Tue, 12 Mar 2024 11:56:47 GMT
Message-Id: <202403121156.42CBulX6095818@gitrepo.freebsd.org>
To: src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org,
    dev-commits-src-main@FreeBSD.org
From: Randall Stewart
Subject: git: e18b97bd63a8 - main - Update to bring the rack stack with all its fixes in.
List-Id: Commit messages for the main branch of the src repository
List-Archive: https://lists.freebsd.org/archives/dev-commits-src-main
X-Git-Committer: rrs
X-Git-Repository: src
X-Git-Refname: refs/heads/main
X-Git-Reftype: branch
X-Git-Commit: e18b97bd63a8112625f7014d2326ecf533b710dd
Auto-Submitted: auto-generated

The branch main has been updated by rrs:

URL: https://cgit.FreeBSD.org/src/commit/?id=e18b97bd63a8112625f7014d2326ecf533b710dd

commit e18b97bd63a8112625f7014d2326ecf533b710dd
Author:     Randall Stewart
AuthorDate: 2024-03-12 11:55:02 +0000
Commit:     Randall Stewart
CommitDate: 2024-03-12 11:55:02 +0000

    Update to bring the rack stack with all its fixes in.

    This brings the rack stack up to the current level used at NF. Many
    fixes and improvements have been added. I also add a fix to BBR to
    deal with the changes that have been in hpts for a while, i.e. there
    is now only one call no matter if the mbuf queue or tcp_output path
    is taken. The new rack_pcm.c basically does little except BB logs
    and is a placeholder for future work on doing path capacity
    measurements. With a bit of a struggle with git I finally got
    rack_pcm.c into place (apologies for not noticing this error). The
    LINT kernel is running on my box now ... sigh.

    Reviewed by:    tuexen, glebius
    Sponsored by:   Netflix Inc.
    Differential Revision: https://reviews.freebsd.org/D43986
---
 sys/conf/files                       |    1 +
 sys/modules/tcp/rack/Makefile        |    2 +-
 sys/netinet/tcp.h                    |   38 +-
 sys/netinet/tcp_log_buf.h            |    9 +-
 sys/netinet/tcp_stacks/bbr.c         |    4 +-
 sys/netinet/tcp_stacks/rack.c        | 4443 +++++++++++++++++++++++++---------
 sys/netinet/tcp_stacks/rack_pcm.c    |  332 +++
 sys/netinet/tcp_stacks/sack_filter.h |    5 +
 sys/netinet/tcp_stacks/tailq_hash.c  |   33 +-
 sys/netinet/tcp_stacks/tailq_hash.h  |    8 +-
 sys/netinet/tcp_stacks/tcp_rack.h    |  135 +-
 sys/netinet/tcp_subr.c               |   57 +-
 sys/netinet/tcp_syncache.c           |    5 +-
 sys/netinet/tcp_usrreq.c             |    7 +
 sys/netinet/tcp_var.h                |   12 +-
 15 files changed, 3920 insertions(+), 1171 deletions(-)

diff --git a/sys/conf/files b/sys/conf/files
index e57c82238380..c902bcfdbd52 100644
--- a/sys/conf/files
+++ b/sys/conf/files
@@ -4374,6 +4374,7 @@ netinet/tcp_stacks/rack.c optional inet tcphpts tcp_rack | inet6 tcphpts tcp_rac
 netinet/tcp_stacks/rack_bbr_common.c optional inet tcphpts tcp_bbr | inet tcphpts tcp_rack | inet6 tcphpts tcp_bbr | inet6 tcphpts tcp_rack
 netinet/tcp_stacks/sack_filter.c optional inet tcphpts tcp_bbr | inet tcphpts tcp_rack | inet6 tcphpts tcp_bbr | inet6 tcphpts tcp_rack
 netinet/tcp_stacks/tailq_hash.c optional inet tcphpts tcp_bbr | inet tcphpts tcp_rack | inet6 tcphpts tcp_bbr | inet6 tcphpts tcp_rack
+netinet/tcp_stacks/rack_pcm.c optional inet tcphpts tcp_rack | inet6 tcphpts tcp_rack
 netinet/tcp_stats.c optional stats inet | stats inet6
 netinet/tcp_subr.c optional inet | inet6
 netinet/tcp_syncache.c optional inet | inet6

diff --git a/sys/modules/tcp/rack/Makefile b/sys/modules/tcp/rack/Makefile
index c5bb20602337..d5f3ba170f68 100644
--- a/sys/modules/tcp/rack/Makefile
+++ b/sys/modules/tcp/rack/Makefile
@@ -5,7 +5,7 @@ STACKNAME=	rack
 KMOD=	tcp_${STACKNAME}
 
-SRCS=	rack.c sack_filter.c rack_bbr_common.c tailq_hash.c
+SRCS=	rack.c sack_filter.c rack_bbr_common.c tailq_hash.c rack_pcm.c
 SRCS+=	opt_inet.h opt_inet6.h opt_ipsec.h
 SRCS+=	opt_kern_tls.h

diff --git a/sys/netinet/tcp.h b/sys/netinet/tcp.h
index f9e561f6ce35..a8259fa30a3a 100644
--- a/sys/netinet/tcp.h
+++ b/sys/netinet/tcp.h
@@ -334,9 +334,22 @@ __tcp_set_flags(struct tcphdr *th, uint16_t flags)
 #define	TCP_RACK_PACING_DIVISOR 1146	/* Pacing divisor given to rate-limit code for burst sizing */
 #define	TCP_RACK_PACE_MIN_SEG 1147	/* Pacing min seg size rack will use */
 #define	TCP_RACK_DGP_IN_REC 1148	/* Do we use full DGP in recovery? */
-#define	TCP_RXT_CLAMP 1149	/* Do we apply a threshold to rack so if excess rxt clamp cwnd? */
+#define	TCP_POLICER_DETECT 1149	/* Do we apply thresholds to rack to detect and compensate for policers? */
+#define	TCP_RXT_CLAMP TCP_POLICER_DETECT
 #define	TCP_HYBRID_PACING 1150	/* Hybrid pacing enablement */
 #define	TCP_PACING_DND 1151	/* When pacing with rr_config=3 can sacks disturb us */
+#define	TCP_SS_EEXIT 1152	/* Do we do early exit from slow start if no b/w growth */
+#define	TCP_DGP_UPPER_BOUNDS 1153	/* SS and CA upper bound in percentage */
+#define	TCP_NO_TIMELY 1154	/* Disable/enable Timely */
+#define	TCP_HONOR_HPTS_MIN 1155	/* Do we honor the hpts min timeout */
+#define	TCP_REC_IS_DYN 1156	/* Do we allow timely to change recovery multiplier? */
+#define	TCP_SIDECHAN_DIS 1157	/* Disable/enable the side-channel */
+#define	TCP_FILLCW_RATE_CAP 1158	/* Set a cap for DGP's fillcw */
+#define	TCP_POLICER_MSS 1159	/* Policer MSS requirement */
+#define	TCP_STACK_SPEC_INFO 1160	/* Get stack specific information (if present) */
+#define	RACK_CSPR_IS_FCC 1161
+#define	TCP_GP_USE_LTBW 1162	/* how we use lt_bw: 0=not, 1=min, 2=max */
 
 /* Start of reserved space for third-party user-settable options. */
 
 #define	TCP_VENDOR	SO_VENDOR
@@ -447,6 +460,7 @@ struct tcp_info {
 	u_int32_t	tcpi_rcv_adv;		/* Peer advertised window */
 	u_int32_t	tcpi_dupacks;		/* Consecutive dup ACKs recvd */
 
+	u_int32_t	tcpi_rttmin;		/* Min observed RTT */
 	/* Padding to grow without breaking ABI. */
 	u_int32_t	__tcpi_pad[14];		/* Padding. */
 };
 
@@ -463,6 +477,20 @@ struct tcp_fastopen {
 
 #define	TCP_FUNCTION_NAME_LEN_MAX 32
 
+struct stack_specific_info {
+	char stack_name[TCP_FUNCTION_NAME_LEN_MAX];
+	uint64_t policer_last_bw;	/* Only valid if detection enabled and policer detected */
+	uint64_t bytes_transmitted;
+	uint64_t bytes_retransmitted;
+	uint32_t policer_detection_enabled: 1,
+		policer_detected : 1,	/* transport thinks a policer is on path */
+		highly_buffered : 1,	/* transport considers the path highly buffered */
+		spare : 29;
+	uint32_t policer_bucket_size;	/* Only valid if detection enabled and policer detected */
+	uint32_t current_round;
+	uint32_t _rack_i_pad[18];
+};
+
 struct tcp_function_set {
 	char function_set_name[TCP_FUNCTION_NAME_LEN_MAX];
 	uint32_t pcbcnt;
@@ -488,6 +516,7 @@ struct tcp_snd_req {
 	uint64_t start;
 	uint64_t end;
 	uint32_t flags;
+	uint32_t playout_ms;
 };
 
 union tcp_log_userdata {
@@ -518,9 +547,12 @@ struct tcp_log_user {
 #define TCP_HYBRID_PACING_H_MS	0x0008	/* A client hint for maxseg is present */
 #define TCP_HYBRID_PACING_ENABLE 0x0010	/* We are enabling hybrid pacing else disable */
 #define TCP_HYBRID_PACING_S_MSS	0x0020	/* Clent wants us to set the mss overriding gp est in CU */
-#define TCP_HYBRID_PACING_SETMSS 0x1000	/* Internal flag that tellsus we set the mss on this entry */
+#define TCP_HAS_PLAYOUT_MS	0x0040	/* The client included the chunk playout milliseconds: deprecate */
+/* the below are internal only flags */
+#define TCP_HYBRID_PACING_USER_MASK 0x0FFF	/* Non-internal flags mask */
+#define TCP_HYBRID_PACING_SETMSS 0x1000	/* Internal flag that tells us we set the mss on this entry */
 #define TCP_HYBRID_PACING_WASSET 0x2000	/* We init to this to know if a hybrid command was issued */
-
+#define TCP_HYBRID_PACING_SENDTIME 0x4000	/* Duplicate tm to last, use sendtime for catch up mode */
 
 struct tcp_hybrid_req {
 	struct tcp_snd_req req;

diff --git a/sys/netinet/tcp_log_buf.h b/sys/netinet/tcp_log_buf.h
index 1f5b7cf9b54f..2e91d9cbdf3c 100644
--- a/sys/netinet/tcp_log_buf.h
+++ b/sys/netinet/tcp_log_buf.h
@@ -267,7 +267,9 @@ enum tcp_log_events {
 	TCP_RACK_TP_TRIGGERED,	/* A rack tracepoint is triggered 68 */
 	TCP_HYBRID_PACING_LOG,	/* Hybrid pacing log 69 */
 	TCP_LOG_PRU,		/* TCP protocol user request 70 */
-	TCP_LOG_END		/* End (keep at end) 71 */
+	TCP_POLICER_DET,	/* TCP Policer detection 71 */
+	TCP_PCM_MEASURE,	/* TCP Path Capacity Measurement 72 */
+	TCP_LOG_END		/* End (keep at end) 73 */
 };
 
 enum tcp_log_states {
@@ -371,10 +373,11 @@ struct tcp_log_dev_log_queue {
 #define TCP_TP_COLLAPSED_RXT	0x00000004	/* When we actually retransmit a collapsed window rsm */
 #define TCP_TP_REQ_LOG_FAIL	0x00000005	/* We tried to allocate a Request log but had no space */
 #define TCP_TP_RESET_RCV	0x00000006	/* Triggers when we receive a RST */
-#define TCP_TP_EXCESS_RXT	0x00000007	/* When we get excess RXT's clamping the cwnd */
+#define TCP_TP_POLICER_DET	0x00000007	/* When we detect a policer */
+#define TCP_TP_EXCESS_RXT	TCP_TP_POLICER_DET	/* alias */
 #define TCP_TP_SAD_TRIGGERED	0x00000008	/* Sack Attack Detection triggers */
-
 #define TCP_TP_SAD_SUSPECT	0x0000000a	/* A sack has supicious information in it */
+#define TCP_TP_PACED_BOTTOM	0x0000000b	/* We have paced at the bottom */
 
 #ifdef _KERNEL

diff --git a/sys/netinet/tcp_stacks/bbr.c b/sys/netinet/tcp_stacks/bbr.c
index 931beba7a262..934b35bd22d7 100644
--- a/sys/netinet/tcp_stacks/bbr.c
+++ b/sys/netinet/tcp_stacks/bbr.c
@@ -11529,7 +11529,9 @@ bbr_do_segment_nounlock(struct tcpcb *tp, struct mbuf *m, struct tcphdr *th,
 	bbr_set_pktepoch(bbr, cts, __LINE__);
 	bbr_check_bbr_for_state(bbr, cts, __LINE__, (bbr->r_ctl.rc_lost - lost));
 	if (nxt_pkt == 0) {
-		if (bbr->r_wanted_output != 0) {
+		if ((bbr->r_wanted_output != 0) ||
+		    (tp->t_flags & TF_ACKNOW)) {
+			bbr->rc_output_starts_timer = 0;
 			did_out = 1;
 			if (tcp_output(tp) < 0)

diff --git a/sys/netinet/tcp_stacks/rack.c b/sys/netinet/tcp_stacks/rack.c
index 49d946dbb63b..1fe07fa8d641 100644
--- a/sys/netinet/tcp_stacks/rack.c
+++ b/sys/netinet/tcp_stacks/rack.c
@@ -142,9 +142,12 @@ VNET_DECLARE(uint32_t, newreno_beta_ecn);
 #define V_newreno_beta VNET(newreno_beta)
 #define V_newreno_beta_ecn VNET(newreno_beta_ecn)
 
+#define	M_TCPFSB	__CONCAT(M_TCPFSB, STACKNAME)
+#define	M_TCPDO		__CONCAT(M_TCPDO, STACKNAME)
-MALLOC_DEFINE(M_TCPFSB, "tcp_fsb", "TCP fast send block");
-MALLOC_DEFINE(M_TCPDO, "tcp_do", "TCP deferred options");
+MALLOC_DEFINE(M_TCPFSB, "tcp_fsb_" __XSTRING(STACKNAME), "TCP fast send block");
+MALLOC_DEFINE(M_TCPDO, "tcp_do_" __XSTRING(STACKNAME), "TCP deferred options");
+MALLOC_DEFINE(M_TCPPCM, "tcp_pcm_" __XSTRING(STACKNAME), "TCP PCM measurement information");
 
 struct sysctl_ctx_list rack_sysctl_ctx;
 struct sysctl_oid *rack_sysctl_root;
@@ -190,12 +193,24 @@ static int32_t rack_tlp_use_greater = 1;
 static int32_t rack_reorder_thresh = 2;
 static int32_t rack_reorder_fade = 60000000;	/* 0 - never fade, def 60,000,000
 						 * - 60 seconds */
-static uint32_t rack_clamp_ss_upper = 110;
-static uint32_t rack_clamp_ca_upper = 105;
-static uint32_t rack_rxt_min_rnds = 10;	/* Min rounds if drastic rxt clamp is in place */
-static uint32_t rack_unclamp_round_thresh = 100;	/* number of perfect rounds before we unclamp */
-static uint32_t rack_unclamp_rxt_thresh = 5;	/* .5% and under */
-static uint64_t rack_rxt_clamp_thresh = 0;	/* Do we do the rxt clamp thing */
+static uint16_t rack_policer_rxt_thresh = 0;	/* 499 = 49.9%, 0 is off */
+static uint8_t rack_policer_avg_thresh = 0;	/* 3.2 */
+static uint8_t rack_policer_med_thresh = 0;	/* 1 - 16 */
+static uint16_t rack_policer_bucket_reserve = 20;	/* How much % is reserved in the bucket */
+static uint64_t rack_pol_min_bw = 125000;	/* 1mbps in Bytes per sec */
+static uint32_t rack_policer_data_thresh = 64000;	/* 64,000 bytes must be sent before we engage */
+static uint32_t rack_policing_do_bw_comp = 1;
+static uint32_t rack_pcm_every_n_rounds = 100;
+static uint32_t rack_pcm_blast = 0;
+static uint32_t rack_pcm_is_enabled = 1;
+static uint8_t rack_req_del_mss = 18;	/* How many segments need to be sent in a recovery episode to do policer_detection */
+static uint8_t rack_ssthresh_rest_rto_rec = 0;	/* Do we restore ssthresh when we have rec -> rto -> rec */
+
+static uint32_t rack_gp_gain_req = 1200;	/* Amount percent wise required to gain to record a round as "gaining" */
+static uint32_t rack_rnd_cnt_req = 0x10005;	/* Default number of rounds if we are below rack_gp_gain_req where we exit ss */
+
+
+static int32_t rack_rxt_scoreboard_clear_thresh = 2;
 static int32_t rack_dnd_default = 0;		/* For rr_conf = 3, what is the default for dnd */
 static int32_t rack_rxt_controls = 0;
 static int32_t rack_fill_cw_state = 0;
@@ -217,9 +232,8 @@ static int32_t rack_do_hystart = 0;
 static int32_t rack_apply_rtt_with_reduced_conf = 0;
 static int32_t rack_hibeta_setting = 0;
 static int32_t rack_default_pacing_divisor = 250;
-static int32_t rack_uses_full_dgp_in_rec = 1;
 static uint16_t rack_pacing_min_seg = 0;
-
+static int32_t rack_timely_off = 0;
 static uint32_t sad_seg_size_per = 800;	/* 80.0 % */
 static int32_t rack_pkt_delay = 1000;
@@ -235,7 +249,7 @@ static int32_t rack_use_rsm_rfo = 1;
 static int32_t rack_max_abc_post_recovery = 2;
 static int32_t rack_client_low_buf = 0;
 static int32_t rack_dsack_std_based = 0x3;	/* bit field bit 1 sets rc_rack_tmr_std_based and bit 2 sets rc_rack_use_dsack */
-static int32_t rack_bw_multipler = 2;	/* Limit on fill cw's jump up to be this x gp_est */
+static int32_t rack_bw_multipler = 0;	/* Limit on fill cw's jump up to be this x gp_est */
 #ifdef TCP_ACCOUNTING
 static int32_t rack_tcp_accounting = 0;
 #endif
@@ -247,8 +261,9 @@ static int32_t use_rack_rr = 1;
 static int32_t rack_non_rxt_use_cr = 0;	/* does a non-rxt in recovery use the configured rate (ss/ca)? */
 static int32_t rack_persist_min = 250000;	/* 250usec */
 static int32_t rack_persist_max = 2000000;	/* 2 Second in usec's */
+static int32_t rack_honors_hpts_min_to = 1;	/* Do we honor the hpts minimum timeout for pacing timers */
+static uint32_t rack_max_reduce = 10;	/* Percent we can reduce slot by */
 static int32_t rack_sack_not_required = 1;	/* set to one to allow non-sack to use rack */
-static int32_t rack_default_init_window = 0;	/* Use system default */
 static int32_t rack_limit_time_with_srtt = 0;
 static int32_t rack_autosndbuf_inc = 20;	/* In percentage form */
 static int32_t rack_enobuf_hw_boost_mult = 0;	/* How many times the hw rate we boost slot using time_between */
@@ -282,7 +297,6 @@ static int32_t rack_rwnd_block_ends_measure = 0;
 static int32_t rack_def_profile = 0;
 
 static int32_t rack_lower_cwnd_at_tlp = 0;
-static int32_t rack_limited_retran = 0;
 static int32_t rack_always_send_oldest = 0;
 static int32_t rack_tlp_threshold_use = TLP_USE_TWO_ONE;
 
@@ -356,6 +370,7 @@ static int32_t rack_timely_no_stopping = 0;
 static int32_t rack_down_raise_thresh = 100;
 static int32_t rack_req_segs = 1;
 static uint64_t rack_bw_rate_cap = 0;
+static uint64_t rack_fillcw_bw_cap = 3750000;	/* Cap fillcw at 30Mbps */
 
 /* Rack specific counters */
@@ -377,6 +392,7 @@ counter_u64_t rack_tlp_retran;
 counter_u64_t rack_tlp_retran_bytes;
 counter_u64_t rack_to_tot;
 counter_u64_t rack_hot_alloc;
+counter_u64_t tcp_policer_detected;
 counter_u64_t rack_to_alloc;
 counter_u64_t rack_to_alloc_hard;
 counter_u64_t rack_to_alloc_emerg;
@@ -440,7 +456,7 @@ rack_log_progress_event(struct tcp_rack *rack, struct tcpcb *tp, uint32_t tick,
 static int
 rack_process_ack(struct mbuf *m, struct tcphdr *th,
    struct socket *so, struct tcpcb *tp, struct tcpopt *to,
-    uint32_t tiwin, int32_t tlen, int32_t * ofia, int32_t thflags, int32_t * ret_val);
+    uint32_t tiwin, int32_t tlen, int32_t * ofia, int32_t thflags, int32_t * ret_val, int32_t orig_tlen);
 static int
 rack_process_data(struct mbuf *m, struct tcphdr *th,
    struct socket *so, struct tcpcb *tp, int32_t drop_hdrlen, int32_t tlen,
@@ -454,6 +470,8 @@ static struct rack_sendmap *rack_alloc_limit(struct tcp_rack *rack,
 static struct rack_sendmap *
 rack_check_recovery_mode(struct tcpcb *tp,
    uint32_t tsused);
+static uint32_t
+rack_grab_rtt(struct tcpcb *tp, struct tcp_rack *rack);
 static void
 rack_cong_signal(struct tcpcb *tp, uint32_t type, uint32_t ack, int );
@@ -504,13 +522,14 @@ rack_log_ack(struct tcpcb *tp, struct tcpopt *to,
 static void
 rack_log_output(struct tcpcb *tp, struct tcpopt *to, int32_t len,
    uint32_t seq_out, uint16_t th_flags, int32_t err, uint64_t ts,
-    struct rack_sendmap *hintrsm, uint16_t add_flags, struct mbuf *s_mb, uint32_t s_moff, int hw_tls, int segsiz);
+    struct rack_sendmap *hintrsm, uint32_t add_flags, struct mbuf *s_mb, uint32_t s_moff, int hw_tls, int segsiz);
 static uint64_t rack_get_gp_est(struct tcp_rack *rack);
+
 static void
 rack_log_sack_passed(struct tcpcb *tp, struct tcp_rack *rack,
-    struct rack_sendmap *rsm);
+    struct rack_sendmap *rsm, uint32_t cts);
 static void rack_log_to_event(struct tcp_rack *rack, int32_t to_num, struct rack_sendmap *rsm);
 static int32_t rack_output(struct tcpcb *tp);
@@ -526,10 +545,10 @@ static int32_t rack_stopall(struct tcpcb *tp);
 static void rack_timer_cancel(struct tcpcb *tp, struct tcp_rack *rack, uint32_t cts, int line);
 static uint32_t
 rack_update_entry(struct tcpcb *tp, struct tcp_rack *rack,
-    struct rack_sendmap *rsm, uint64_t ts, int32_t * lenp, uint16_t add_flag, int segsiz);
+    struct rack_sendmap *rsm, uint64_t ts, int32_t * lenp, uint32_t add_flag, int segsiz);
 static void
 rack_update_rsm(struct tcpcb *tp, struct tcp_rack *rack,
-    struct rack_sendmap *rsm, uint64_t ts, uint16_t add_flag, int segsiz);
+    struct rack_sendmap *rsm, uint64_t ts, uint32_t add_flag, int segsiz);
 static int
 rack_update_rtt(struct tcpcb *tp, struct tcp_rack *rack,
    struct rack_sendmap *rsm, struct tcpopt *to, uint32_t cts, int32_t ack_type, tcp_seq th_ack);
@@ -538,6 +557,10 @@ static int
 rack_do_close_wait(struct mbuf *m, struct tcphdr *th, struct socket *so,
    struct tcpcb *tp, struct tcpopt *to, int32_t drop_hdrlen, int32_t tlen,
    uint32_t tiwin, int32_t thflags, int32_t nxt_pkt, uint8_t iptos);
+
+static void
+rack_peg_rxt(struct tcp_rack *rack, struct rack_sendmap *rsm, uint32_t segsiz);
+
 static int
 rack_do_closing(struct mbuf *m, struct tcphdr *th, struct socket *so,
    struct tcpcb *tp, struct tcpopt *to, int32_t drop_hdrlen,
@@ -720,6 +743,22 @@ rack_undo_cc_pacing(struct tcp_rack *rack)
 	rack_swap_beta_values(rack, 4);
 }
 
+static void
+rack_remove_pacing(struct tcp_rack *rack)
+{
+	if (rack->rc_pacing_cc_set)
+		rack_undo_cc_pacing(rack);
+	if (rack->r_ctl.pacing_method & RACK_REG_PACING)
+		tcp_decrement_paced_conn();
+	if (rack->r_ctl.pacing_method & RACK_DGP_PACING)
+		tcp_dec_dgp_pacing_cnt();
+	rack->rc_always_pace = 0;
+	rack->r_ctl.pacing_method = RACK_PACING_NONE;
+	rack->dgp_on = 0;
+	rack->rc_hybrid_mode = 0;
+	rack->use_fixed_rate = 0;
+}
+
 static void
 rack_log_gpset(struct tcp_rack *rack, uint32_t seq_end, uint32_t ack_end_t,
    uint32_t send_end_t, int line, uint8_t mode, struct rack_sendmap *rsm)
@@ -742,6 +781,8 @@ rack_log_gpset(struct tcp_rack *rack, uint32_t seq_end, uint32_t ack_end_t,
 		log.u_bbr.pkts_out = line;
 		log.u_bbr.cwnd_gain = rack->app_limited_needs_set;
 		log.u_bbr.pkt_epoch = rack->r_ctl.rc_app_limited_cnt;
+		log.u_bbr.epoch = rack->r_ctl.current_round;
+		log.u_bbr.lt_epoch = rack->r_ctl.rc_considered_lost;
 		if (rsm != NULL) {
 			log.u_bbr.applimited = rsm->r_start;
 			log.u_bbr.delivered = rsm->r_end;
@@ -857,6 +898,7 @@ rack_init_sysctls(void)
 	struct sysctl_oid *rack_measure;
 	struct sysctl_oid *rack_probertt;
 	struct sysctl_oid *rack_hw_pacing;
+	struct sysctl_oid *rack_policing;
 
 	rack_attack = SYSCTL_ADD_NODE(&rack_sysctl_ctx,
 	    SYSCTL_CHILDREN(rack_sysctl_root),
@@ -994,11 +1036,36 @@ rack_init_sysctls(void)
 	    "pacing",
 	    CTLFLAG_RW | CTLFLAG_MPSAFE, 0,
 	    "Pacing related Controls");
+	SYSCTL_ADD_U32(&rack_sysctl_ctx,
+	    SYSCTL_CHILDREN(rack_pacing),
+	    OID_AUTO, "pcm_enabled", CTLFLAG_RW,
+	    &rack_pcm_is_enabled, 1,
+	    "Do we by default do PCM measurements?");
+	SYSCTL_ADD_U32(&rack_sysctl_ctx,
+	    SYSCTL_CHILDREN(rack_pacing),
+	    OID_AUTO, "pcm_rnds", CTLFLAG_RW,
+	    &rack_pcm_every_n_rounds, 100,
+	    "How many rounds before we need to do a PCM measurement");
+	SYSCTL_ADD_U32(&rack_sysctl_ctx,
+	    SYSCTL_CHILDREN(rack_pacing),
+	    OID_AUTO, "pcm_blast", CTLFLAG_RW,
+	    &rack_pcm_blast, 0,
+	    "Blast out the full cwnd/rwnd when doing a PCM measurement");
+	SYSCTL_ADD_U32(&rack_sysctl_ctx,
+	    SYSCTL_CHILDREN(rack_pacing),
+	    OID_AUTO, "rnd_gp_gain", CTLFLAG_RW,
+	    &rack_gp_gain_req, 1200,
+	    "How much do we have to increase the GP to record the round 1200 = 120.0");
+	SYSCTL_ADD_U32(&rack_sysctl_ctx,
+	    SYSCTL_CHILDREN(rack_pacing),
+	    OID_AUTO, "dgp_out_of_ss_at", CTLFLAG_RW,
+	    &rack_rnd_cnt_req, 0x10005,
+	    "How many rounds less than rnd_gp_gain will drop us out of SS");
 	SYSCTL_ADD_S32(&rack_sysctl_ctx,
 	    SYSCTL_CHILDREN(rack_pacing),
-	    OID_AUTO, "fulldgpinrec", CTLFLAG_RW,
-	    &rack_uses_full_dgp_in_rec, 1,
-	    "Do we use all DGP features in recovery (fillcw, timely et.al.)?");
+	    OID_AUTO, "no_timely", CTLFLAG_RW,
+	    &rack_timely_off, 0,
+	    "Do we not use timely in DGP?");
 	SYSCTL_ADD_S32(&rack_sysctl_ctx,
 	    SYSCTL_CHILDREN(rack_pacing),
 	    OID_AUTO, "fullbufdisc", CTLFLAG_RW,
@@ -1017,13 +1084,13 @@ rack_init_sysctls(void)
 	SYSCTL_ADD_S32(&rack_sysctl_ctx,
 	    SYSCTL_CHILDREN(rack_pacing),
 	    OID_AUTO, "divisor", CTLFLAG_RW,
-	    &rack_default_pacing_divisor, 4,
+	    &rack_default_pacing_divisor, 250,
 	    "What is the default divisor given to the rl code?");
 	SYSCTL_ADD_S32(&rack_sysctl_ctx,
 	    SYSCTL_CHILDREN(rack_pacing),
 	    OID_AUTO, "fillcw_max_mult", CTLFLAG_RW,
-	    &rack_bw_multipler, 2,
-	    "What is the multiplier of the current gp_est that fillcw can increase the b/w too?");
+	    &rack_bw_multipler, 0,
+	    "What is the limit multiplier of the current gp_est that fillcw can increase the b/w to, 200 == 200% (0 = off)?");
 	SYSCTL_ADD_S32(&rack_sysctl_ctx,
 	    SYSCTL_CHILDREN(rack_pacing),
 	    OID_AUTO, "max_pace_over", CTLFLAG_RW,
@@ -1039,11 +1106,6 @@ rack_init_sysctls(void)
 	    OID_AUTO, "limit_wsrtt", CTLFLAG_RW,
 	    &rack_limit_time_with_srtt, 0,
 	    "Do we limit pacing time based on srtt");
-	SYSCTL_ADD_S32(&rack_sysctl_ctx,
-	    SYSCTL_CHILDREN(rack_pacing),
-	    OID_AUTO, "init_win", CTLFLAG_RW,
-	    &rack_default_init_window, 0,
-	    "Do we have a rack initial window 0 = system default");
 	SYSCTL_ADD_U16(&rack_sysctl_ctx,
 	    SYSCTL_CHILDREN(rack_pacing),
 	    OID_AUTO, "gp_per_ss", CTLFLAG_RW,
@@ -1079,6 +1141,11 @@ rack_init_sysctls(void)
 	    OID_AUTO, "rate_cap", CTLFLAG_RW,
 	    &rack_bw_rate_cap, 0,
 	    "If set we apply this value to the absolute rate cap used by pacing");
+	SYSCTL_ADD_U64(&rack_sysctl_ctx,
+	    SYSCTL_CHILDREN(rack_pacing),
+	    OID_AUTO, "fillcw_cap", CTLFLAG_RW,
+	    &rack_fillcw_bw_cap, 3750000,
+	    "Do we have an absolute cap on the amount of b/w fillcw can specify (0 = no)?");
 	SYSCTL_ADD_U8(&rack_sysctl_ctx,
 	    SYSCTL_CHILDREN(rack_sysctl_root),
 	    OID_AUTO, "req_measure_cnt", CTLFLAG_RW,
@@ -1317,11 +1384,6 @@ rack_init_sysctls(void)
 	    OID_AUTO, "send_oldest", CTLFLAG_RW,
 	    &rack_always_send_oldest, 0,
 	    "Should we always send the oldest TLP and RACK-TLP");
-	SYSCTL_ADD_S32(&rack_sysctl_ctx,
-	    SYSCTL_CHILDREN(rack_tlp),
-	    OID_AUTO, "rack_tlimit", CTLFLAG_RW,
-	    &rack_limited_retran, 0,
-	    "How many times can a rack timeout drive out sends");
 	SYSCTL_ADD_S32(&rack_sysctl_ctx,
 	    SYSCTL_CHILDREN(rack_tlp),
 	    OID_AUTO, "tlp_cwnd_flag", CTLFLAG_RW,
@@ -1355,6 +1417,26 @@ rack_init_sysctls(void)
 	    "timers",
 	    CTLFLAG_RW | CTLFLAG_MPSAFE, 0,
 	    "Timer related controls");
+	SYSCTL_ADD_U8(&rack_sysctl_ctx,
+	    SYSCTL_CHILDREN(rack_timers),
+	    OID_AUTO, "reset_ssth_rec_rto", CTLFLAG_RW,
+	    &rack_ssthresh_rest_rto_rec, 0,
+	    "When doing recovery -> rto -> recovery do we reset SSthresh?");
+	SYSCTL_ADD_U32(&rack_sysctl_ctx,
+	    SYSCTL_CHILDREN(rack_timers),
+	    OID_AUTO, "scoreboard_thresh", CTLFLAG_RW,
+	    &rack_rxt_scoreboard_clear_thresh, 2,
+	    "How many RTO's are allowed before we clear the scoreboard");
+	SYSCTL_ADD_U32(&rack_sysctl_ctx,
+	    SYSCTL_CHILDREN(rack_timers),
+	    OID_AUTO, "honor_hpts_min", CTLFLAG_RW,
+	    &rack_honors_hpts_min_to, 1,
+	    "Do rack pacing timers honor hpts min timeout");
+	SYSCTL_ADD_U32(&rack_sysctl_ctx,
+	    SYSCTL_CHILDREN(rack_timers),
+	    OID_AUTO, "hpts_max_reduce", CTLFLAG_RW,
+	    &rack_max_reduce, 10,
+	    "Max percentage we will reduce slot by for pacing when we are behind");
 	SYSCTL_ADD_U32(&rack_sysctl_ctx,
 	    SYSCTL_CHILDREN(rack_timers),
 	    OID_AUTO, "persmin", CTLFLAG_RW,
@@ -1434,11 +1516,6 @@ rack_init_sysctls(void)
 	    "features",
 	    CTLFLAG_RW | CTLFLAG_MPSAFE, 0,
 	    "Feature controls");
-	SYSCTL_ADD_U64(&rack_sysctl_ctx,
-	    SYSCTL_CHILDREN(rack_features),
-	    OID_AUTO, "rxt_clamp_thresh", CTLFLAG_RW,
-	    &rack_rxt_clamp_thresh, 0,
-	    "Bit encoded clamping setup bits CCCC CCCCC UUUU UULF PPPP PPPP PPPP PPPP");
 	SYSCTL_ADD_S32(&rack_sysctl_ctx,
 	    SYSCTL_CHILDREN(rack_features),
 	    OID_AUTO, "hybrid_set_maxseg", CTLFLAG_RW,
@@ -1474,6 +1551,53 @@ rack_init_sysctls(void)
 	    OID_AUTO, "hystartplusplus", CTLFLAG_RW,
 	    &rack_do_hystart, 0,
 	    "Should RACK enable HyStart++ on connections?");
+	/* Policer detection */
+	rack_policing = SYSCTL_ADD_NODE(&rack_sysctl_ctx,
+	    SYSCTL_CHILDREN(rack_sysctl_root),
+	    OID_AUTO,
+	    "policing",
+	    CTLFLAG_RW | CTLFLAG_MPSAFE, 0,
+	    "policer detection");
+	SYSCTL_ADD_U16(&rack_sysctl_ctx,
+	    SYSCTL_CHILDREN(rack_policing),
+	    OID_AUTO, "rxt_thresh", CTLFLAG_RW,
+	    &rack_policer_rxt_thresh, 0,
+	    "Percentage of retransmits we need to be a possible policer (499 = 49.9 percent)");
+	SYSCTL_ADD_U8(&rack_sysctl_ctx,
+	    SYSCTL_CHILDREN(rack_policing),
+	    OID_AUTO, "avg_thresh", CTLFLAG_RW,
+	    &rack_policer_avg_thresh, 0,
+	    "What threshold of average retransmits needed to recover a lost packet (1 - 169 aka 21 = 2.1)?");
+	SYSCTL_ADD_U8(&rack_sysctl_ctx,
+	    SYSCTL_CHILDREN(rack_policing),
+	    OID_AUTO, "med_thresh", CTLFLAG_RW,
+	    &rack_policer_med_thresh, 0,
+	    "What threshold of Median retransmits needed to recover a lost packet (1 - 16)?");
+	SYSCTL_ADD_U32(&rack_sysctl_ctx,
+	    SYSCTL_CHILDREN(rack_policing),
+	    OID_AUTO, "data_thresh", CTLFLAG_RW,
+	    &rack_policer_data_thresh, 64000,
+	    "How many bytes must have gotten through before we can start doing policer detection?");
+	SYSCTL_ADD_U32(&rack_sysctl_ctx,
+	    SYSCTL_CHILDREN(rack_policing),
+	    OID_AUTO, "bwcomp", CTLFLAG_RW,
+	    &rack_policing_do_bw_comp, 1,
+	    "Do we raise up low b/w so that at least pace_max_seg can be sent in the srtt?");
+	SYSCTL_ADD_U8(&rack_sysctl_ctx,
+	    SYSCTL_CHILDREN(rack_policing),
+	    OID_AUTO, "recmss", CTLFLAG_RW,
+	    &rack_req_del_mss, 18,
+	    "How many MSS must be delivered during recovery to engage policer detection?");
+	SYSCTL_ADD_U16(&rack_sysctl_ctx,
+	    SYSCTL_CHILDREN(rack_policing),
+	    OID_AUTO, "res_div", CTLFLAG_RW,
+	    &rack_policer_bucket_reserve, 20,
+	    "What percentage is reserved in the policer bucket?");
+	SYSCTL_ADD_U64(&rack_sysctl_ctx,
+	    SYSCTL_CHILDREN(rack_policing),
+	    OID_AUTO, "min_comp_bw", CTLFLAG_RW,
+	    &rack_pol_min_bw, 125000,
+	    "Do we have a min b/w for b/w compensation (0 = no)?");
 	/* Misc rack controls */
 	rack_misc = SYSCTL_ADD_NODE(&rack_sysctl_ctx,
 	    SYSCTL_CHILDREN(rack_sysctl_root),
@@ -1578,31 +1702,8 @@ rack_init_sysctls(void)
 	    OID_AUTO, "autoscale", CTLFLAG_RW,
 	    &rack_autosndbuf_inc, 20,
 	    "What percentage should rack scale up its snd buffer by?");
-	SYSCTL_ADD_U32(&rack_sysctl_ctx,
-	    SYSCTL_CHILDREN(rack_misc),
-	    OID_AUTO, "rnds_for_rxt_clamp", CTLFLAG_RW,
-	    &rack_rxt_min_rnds, 10,
-	    "Number of rounds needed between RTT clamps due to high loss rates");
-	SYSCTL_ADD_U32(&rack_sysctl_ctx,
-	    SYSCTL_CHILDREN(rack_misc),
-	    OID_AUTO, "rnds_for_unclamp", CTLFLAG_RW,
-	    &rack_unclamp_round_thresh, 100,
-	    "Number of rounds needed with no loss to unclamp");
-	SYSCTL_ADD_U32(&rack_sysctl_ctx,
-	    SYSCTL_CHILDREN(rack_misc),
-	    OID_AUTO, "rxt_threshs_for_unclamp", CTLFLAG_RW,
-	    &rack_unclamp_rxt_thresh, 5,
-	    "Percentage of retransmits we need to be under to unclamp (5 = .5 percent)\n");
-	SYSCTL_ADD_U32(&rack_sysctl_ctx,
-	    SYSCTL_CHILDREN(rack_misc),
-	    OID_AUTO, "clamp_ss_upper", CTLFLAG_RW,
-	    &rack_clamp_ss_upper, 110,
-	    "Clamp percentage ceiling in SS?");
-	SYSCTL_ADD_U32(&rack_sysctl_ctx,
-	    SYSCTL_CHILDREN(rack_misc),
-	    OID_AUTO, "clamp_ca_upper", CTLFLAG_RW,
-	    &rack_clamp_ca_upper, 110,
-	    "Clamp percentage ceiling in CA?");
-
+
 	/* Sack Attacker detection stuff */
 	SYSCTL_ADD_U32(&rack_sysctl_ctx,
 	    SYSCTL_CHILDREN(rack_attack),
@@ -1779,6 +1880,13 @@ rack_init_sysctls(void)
 	    OID_AUTO, "alloc_hot", CTLFLAG_RD,
 	    &rack_hot_alloc,
 	    "Total allocations from the top of our list");
+	tcp_policer_detected = counter_u64_alloc(M_WAITOK);
+	SYSCTL_ADD_COUNTER_U64(&rack_sysctl_ctx,
+	    SYSCTL_CHILDREN(rack_counters),
+	    OID_AUTO, "policer_detected", CTLFLAG_RD,
+	    &tcp_policer_detected,
+	    "Total policer_detections");
+
 	rack_to_alloc = counter_u64_alloc(M_WAITOK);
 	SYSCTL_ADD_COUNTER_U64(&rack_sysctl_ctx,
 	    SYSCTL_CHILDREN(rack_counters),
@@ -1957,17 +2065,8 @@ rack_init_sysctls(void)
 static uint32_t
 rc_init_window(struct tcp_rack *rack)
 {
-	uint32_t win;
+	return (tcp_compute_initwnd(tcp_maxseg(rack->rc_tp)));
 
-	if (rack->rc_init_win == 0) {
-		/*
-		 * Nothing set by the user, use the system stack
-		 * default.
-		 */
-		return (tcp_compute_initwnd(tcp_maxseg(rack->rc_tp)));
-	}
-	win = ctf_fixed_maxseg(rack->rc_tp) * rack->rc_init_win;
-	return (win);
 }
 
 static uint64_t
@@ -2071,6 +2170,7 @@ rack_log_hybrid_bw(struct tcp_rack *rack, uint32_t seq, uint64_t cbw, uint64_t t
 		off = (uint64_t)(cur) - (uint64_t)(&rack->rc_tp->t_tcpreq_info[0]);
 		log.u_bbr.bbr_substate = (uint8_t)(off / sizeof(struct tcp_sendfile_track));
 #endif
+		log.u_bbr.inhpts = 1;
 		log.u_bbr.flex4 = (uint32_t)(rack->rc_tp->t_sndbytes - cur->sent_at_fs);
 		log.u_bbr.flex5 = (uint32_t)(rack->rc_tp->t_snd_rxt_bytes - cur->rxt_at_fs);
 		log.u_bbr.flex7 = (uint16_t)cur->hybrid_flags;
@@ -2116,9 +2216,24 @@ rack_log_hybrid_sends(struct tcp_rack *rack, struct tcp_sendfile_track *cur, int
 		memset(&log, 0, sizeof(log));
 		log.u_bbr.timeStamp = tcp_get_usecs(&tv);
-		log.u_bbr.cur_del_rate = rack->rc_tp->t_sndbytes;
 		log.u_bbr.delRate = cur->sent_at_fs;
-		log.u_bbr.rttProp = rack->rc_tp->t_snd_rxt_bytes;
+
+		if ((cur->flags & TCP_TRK_TRACK_FLG_LSND) == 0) {
+			/*
+			 * We did not get a new Rules Applied to set so
+			 * no overlapping send occurred, this means the
+			 * current byte counts are correct.
+			 */
+			log.u_bbr.cur_del_rate = rack->rc_tp->t_sndbytes;
+			log.u_bbr.rttProp = rack->rc_tp->t_snd_rxt_bytes;
+		} else {
+			/*
+			 * Overlapping send case, we switched to a new
+			 * send and did a rules applied.
+			 */
+			log.u_bbr.cur_del_rate = cur->sent_at_ls;
+			log.u_bbr.rttProp = cur->rxt_at_ls;
+		}
 		log.u_bbr.bw_inuse = cur->rxt_at_fs;
 		log.u_bbr.cwnd_gain = line;
 		off = (uint64_t)(cur) - (uint64_t)(&rack->rc_tp->t_tcpreq_info[0]);
@@ -2138,6 +2253,7 @@ rack_log_hybrid_sends(struct tcp_rack *rack, struct tcp_sendfile_track *cur, int
 		log.u_bbr.lt_epoch = (uint32_t)((cur->timestamp >> 32) & 0x00000000ffffffff);
 		/* now set all the flags in */
 		log.u_bbr.pkts_out = cur->hybrid_flags;
+		log.u_bbr.lost = cur->playout_ms;
 		log.u_bbr.flex6 = cur->flags;
 		/*
 		 * Last send time = note we do not distinguish cases
@@ -2146,6 +2262,20 @@ rack_log_hybrid_sends(struct tcp_rack *rack, struct tcp_sendfile_track *cur, int
 		 */
 		log.u_bbr.pkt_epoch = (uint32_t)(rack->r_ctl.last_tmit_time_acked & 0x00000000ffffffff);
 		log.u_bbr.flex5 = (uint32_t)((rack->r_ctl.last_tmit_time_acked >> 32) & 0x00000000ffffffff);
+		/*
+		 * Compose bbr_state to be a bit wise 0000ADHF
+		 * where A is the always_pace flag
+		 * where D is the dgp_on flag
+		 * where H is the hybrid_mode on flag
+		 * where F is the use_fixed_rate flag.
+		 */
+		log.u_bbr.bbr_state = rack->rc_always_pace;
+		log.u_bbr.bbr_state <<= 1;
+		log.u_bbr.bbr_state |= rack->dgp_on;
+		log.u_bbr.bbr_state <<= 1;
+		log.u_bbr.bbr_state |= rack->rc_hybrid_mode;
+		log.u_bbr.bbr_state <<= 1;
+		log.u_bbr.bbr_state |= rack->use_fixed_rate;
 		log.u_bbr.flex8 = HYBRID_LOG_SENT_LOST;
 		tcp_log_event(rack->rc_tp, NULL,
@@ -2299,6 +2429,7 @@ normal_ratecap:
 #ifdef TCP_REQUEST_TRK
 	if (rack->rc_hybrid_mode &&
 	    rack->rc_catch_up &&
+	    (rack->r_ctl.rc_last_sft != NULL) &&
 	    (rack->r_ctl.rc_last_sft->hybrid_flags & TCP_HYBRID_PACING_S_MSS) &&
 	    (rack_hybrid_allow_set_maxseg == 1) &&
 	    ((rack->r_ctl.rc_last_sft->hybrid_flags & TCP_HYBRID_PACING_SETMSS) == 0)) {
@@ -2338,7 +2469,10 @@ rack_get_gp_est(struct tcp_rack *rack)
 	 */
 	uint64_t srtt;
 
-	lt_bw = rack_get_lt_bw(rack);
+	if (rack->dis_lt_bw == 1)
+		lt_bw = 0;
+	else
+		lt_bw = rack_get_lt_bw(rack);
 	if (lt_bw) {
 		/*
 		 * No goodput bw but a long-term b/w does exist
@@ -2374,19 +2508,22 @@ rack_get_gp_est(struct tcp_rack *rack)
 		/* Still doing initial average must calculate */
 		bw = rack->r_ctl.gp_bw / max(rack->r_ctl.num_measurements, 1);
 	}
+	if (rack->dis_lt_bw) {
+		/* We are not using lt-bw */
+		ret_bw = bw;
+		goto compensate;
+	}
 	lt_bw = rack_get_lt_bw(rack);
 	if (lt_bw == 0) {
 		/* If we don't have one then equate it to the gp_bw */
 		lt_bw = rack->r_ctl.gp_bw;
 	}
-	if ((rack->r_cwnd_was_clamped == 1) && (rack->r_clamped_gets_lower > 0)){
-		/* if clamped take the lowest */
+	if (rack->use_lesser_lt_bw) {
 		if (lt_bw < bw)
 			ret_bw = lt_bw;
 		else
 			ret_bw = bw;
 	} else {
-		/* If not set for clamped to get lowest, take the highest */
 		if (lt_bw > bw)
 			ret_bw = lt_bw;
 		else
@@ -2487,6 +2624,8 @@ rack_log_dsack_event(struct tcp_rack *rack, uint8_t mod, uint32_t flex4, uint32_
 		log.u_bbr.flex7 = rack->r_ctl.dsack_persist;
 		log.u_bbr.flex8 = mod;
 		log.u_bbr.timeStamp = tcp_get_usecs(&tv);
+		log.u_bbr.epoch = rack->r_ctl.current_round;
+		log.u_bbr.lt_epoch = rack->r_ctl.rc_considered_lost;
 		TCP_LOG_EVENTP(rack->rc_tp, NULL,
 		    &rack->rc_inp->inp_socket->so_rcv,
 		    &rack->rc_inp->inp_socket->so_snd,
@@ -2535,6 +2674,8 @@ rack_log_hdwr_pacing(struct tcp_rack *rack,
 		else
 			log.u_bbr.cur_del_rate = 0;
 		log.u_bbr.rttProp = rack->r_ctl.last_hw_bw_req;
+		log.u_bbr.epoch = rack->r_ctl.current_round;
+		log.u_bbr.lt_epoch = rack->r_ctl.rc_considered_lost;
 		TCP_LOG_EVENTP(rack->rc_tp, NULL,
 		    &rack->rc_inp->inp_socket->so_rcv,
 		    &rack->rc_inp->inp_socket->so_snd,
@@ -2552,28 +2693,9 @@ rack_get_output_bw(struct tcp_rack *rack, uint64_t bw, struct rack_sendmap *rsm,
 	uint64_t bw_est, high_rate;
 	uint64_t gain;
 
-	if ((rack->r_pacing_discount == 0) ||
-	    (rack_full_buffer_discount == 0)) {
-		/*
-		 * No buffer level based discount from client buffer
-		 * level is enabled or the feature is disabled.
-		 */
-		gain = (uint64_t)rack_get_output_gain(rack, rsm);
-		bw_est = bw * gain;
-		bw_est /= (uint64_t)100;
-	} else {
-		/*
-		 * We have a discount in place apply it with
-		 * just a 100% gain (we get no boost if the buffer
-		 * is full).
-		 */
-		uint64_t discount;
-
-		discount = bw * (uint64_t)(rack_full_buffer_discount * rack->r_ctl.pacing_discount_amm);
-		discount /= 100;
-		/* What %% of the b/w do we discount */
-		bw_est = bw - discount;
-	}
+	gain = (uint64_t)rack_get_output_gain(rack, rsm);
+	bw_est = bw * gain;
+	bw_est /= (uint64_t)100;
 	/* Never fall below the minimum (def 64kbps) */
 	if (bw_est < RACK_MIN_BW)
 		bw_est = RACK_MIN_BW;
@@ -2659,6 +2781,8 @@ log_anyway:
 		log.u_bbr.pkts_out = rack->r_ctl.rc_out_at_rto;
 		log.u_bbr.delivered = rack->r_ctl.rc_snd_max_at_rto;
 		log.u_bbr.pacing_gain = rack->r_must_retran;
+		log.u_bbr.epoch = rack->r_ctl.current_round;
+		log.u_bbr.lt_epoch = rack->r_ctl.rc_considered_lost;
 		TCP_LOG_EVENTP(rack->rc_tp, NULL,
 		    &rack->rc_inp->inp_socket->so_rcv,
 		    &rack->rc_inp->inp_socket->so_snd,
@@ -2698,6 +2822,10 @@ rack_log_to_start(struct tcp_rack *rack, uint32_t cts, uint32_t to, int32_t slot
 		log.u_bbr.lt_epoch = rack->rc_tp->t_rxtshift;
 		log.u_bbr.lost = rack_rto_min;
 		log.u_bbr.epoch = rack->r_ctl.roundends;
+		log.u_bbr.bw_inuse = rack->r_ctl.current_round;
+		log.u_bbr.bw_inuse <<= 32;
+		log.u_bbr.bw_inuse |= rack->r_ctl.rc_considered_lost;
+		log.u_bbr.applimited = rack->rc_tp->t_flags2;
 		TCP_LOG_EVENTP(rack->rc_tp, NULL,
 		    &rack->rc_inp->inp_socket->so_rcv,
 		    &rack->rc_inp->inp_socket->so_snd,
@@ -2731,6 +2859,9 @@ rack_log_to_event(struct tcp_rack *rack, int32_t to_num, struct rack_sendmap *rs
 		log.u_bbr.pkts_out = rack->r_ctl.rc_out_at_rto;
 		log.u_bbr.delivered = rack->r_ctl.rc_snd_max_at_rto;
 		log.u_bbr.pacing_gain = rack->r_must_retran;
+		log.u_bbr.bw_inuse = rack->r_ctl.current_round;
+		log.u_bbr.bw_inuse <<= 32;
+		log.u_bbr.bw_inuse |= rack->r_ctl.rc_considered_lost;
 		TCP_LOG_EVENTP(rack->rc_tp, NULL,
 		    &rack->rc_inp->inp_socket->so_rcv,
 		    &rack->rc_inp->inp_socket->so_snd,
@@ -2780,6 +2911,9 @@ rack_log_map_chg(struct tcpcb *tp, struct tcp_rack *rack,
 		if (rack->rack_no_prr)
 			log.u_bbr.lost = 0;
 		else
 			log.u_bbr.lost = rack->r_ctl.rc_prr_sndcnt;
+		log.u_bbr.bw_inuse = rack->r_ctl.current_round;
+		log.u_bbr.bw_inuse <<= 32;
+		log.u_bbr.bw_inuse |= rack->r_ctl.rc_considered_lost;
 		TCP_LOG_EVENTP(rack->rc_tp, NULL,
 		    &rack->rc_inp->inp_socket->so_rcv,
 		    &rack->rc_inp->inp_socket->so_snd,
@@ -2927,6 +3061,9 @@ rack_log_rtt_sample_calc(struct tcp_rack *rack, uint32_t rtt, uint32_t send_time
 		log.u_bbr.flex4 = where;
 		log.u_bbr.flex7 = 2;
 		log.u_bbr.timeStamp = tcp_get_usecs(&tv);
+		log.u_bbr.bw_inuse = rack->r_ctl.current_round;
+		log.u_bbr.bw_inuse <<= 32;
+		log.u_bbr.bw_inuse |= rack->r_ctl.rc_considered_lost;
 		TCP_LOG_EVENTP(rack->rc_tp, NULL,
 		    &rack->rc_inp->inp_socket->so_rcv,
 		    &rack->rc_inp->inp_socket->so_snd,
@@ -2939,7 +3076,7 @@ rack_log_rtt_sample_calc(struct tcp_rack *rack, uint32_t rtt, uint32_t send_time
 static void
 rack_log_rtt_sendmap(struct tcp_rack *rack, uint32_t idx, uint64_t tsv, uint32_t tsecho)
 {
-	if (tcp_bblogging_on(rack->rc_tp)) {
+	if (rack_verbose_logging && tcp_bblogging_on(rack->rc_tp)) {
 		union tcp_log_stackspecific log;
 		struct timeval tv;
 
@@ -2951,6 +3088,9 @@ rack_log_rtt_sendmap(struct tcp_rack *rack, uint32_t idx, uint64_t tsv, uint32_t
 		log.u_bbr.flex7 = 3;
 		log.u_bbr.rttProp = tsv;
 		log.u_bbr.timeStamp = tcp_get_usecs(&tv);
+		log.u_bbr.bw_inuse = rack->r_ctl.current_round;
+		log.u_bbr.bw_inuse <<= 32;
+		log.u_bbr.bw_inuse |= rack->r_ctl.rc_considered_lost;
 		TCP_LOG_EVENTP(rack->rc_tp, NULL,
 		    &rack->rc_inp->inp_socket->so_rcv,
 		    &rack->rc_inp->inp_socket->so_snd,
@@ -2979,6 +3119,9 @@ rack_log_progress_event(struct tcp_rack *rack, struct tcpcb *tp, uint32_t tick,
 		log.u_bbr.pkts_out = rack->r_ctl.rc_out_at_rto;
 		log.u_bbr.delivered = rack->r_ctl.rc_snd_max_at_rto;
 		log.u_bbr.pacing_gain = rack->r_must_retran;
+		log.u_bbr.bw_inuse = rack->r_ctl.current_round;
+		log.u_bbr.bw_inuse <<= 32;
+		log.u_bbr.bw_inuse |= rack->r_ctl.rc_considered_lost;
 		TCP_LOG_EVENTP(tp, NULL,
 		    &rack->rc_inp->inp_socket->so_rcv,
 		    &rack->rc_inp->inp_socket->so_snd,
@@ -3051,6 +3194,13 @@ rack_log_doseg_done(struct tcp_rack *rack, uint32_t cts, int32_t nxt_pkt, int32_
 		log.u_bbr.pkts_out = rack->r_ctl.rc_out_at_rto;
 		log.u_bbr.delivered = rack->r_ctl.rc_snd_max_at_rto;
 		log.u_bbr.pacing_gain = rack->r_must_retran;
+		log.u_bbr.bw_inuse = rack->r_ctl.current_round;
+		log.u_bbr.bw_inuse <<= 32;
+		log.u_bbr.bw_inuse |= rack->r_ctl.rc_considered_lost;
+		log.u_bbr.epoch = rack->rc_inp->inp_socket->so_snd.sb_hiwat;
+		log.u_bbr.lt_epoch = rack->rc_inp->inp_socket->so_rcv.sb_hiwat;
+		log.u_bbr.lost = rack->rc_tp->t_srtt;
+		log.u_bbr.pkt_epoch = rack->rc_tp->rfbuf_cnt;
 		TCP_LOG_EVENTP(rack->rc_tp, NULL,
 		    &rack->rc_inp->inp_socket->so_rcv,
 		    &rack->rc_inp->inp_socket->so_snd,
@@ -3112,6 +3262,9 @@ rack_log_type_just_return(struct tcp_rack *rack, uint32_t cts, uint32_t tlen, ui
 		log.u_bbr.delivered = rack->r_ctl.rc_snd_max_at_rto;
 		log.u_bbr.pacing_gain = rack->r_must_retran;
 		log.u_bbr.cwnd_gain = rack->rc_has_collapsed;
+		log.u_bbr.bw_inuse = rack->r_ctl.current_round;
+		log.u_bbr.bw_inuse <<= 32;
+		log.u_bbr.bw_inuse |= rack->r_ctl.rc_considered_lost;
 		TCP_LOG_EVENTP(rack->rc_tp, NULL,
 		    &rack->rc_inp->inp_socket->so_rcv,
 		    &rack->rc_inp->inp_socket->so_snd,
@@ -3146,6 +3299,9 @@ rack_log_to_cancel(struct tcp_rack *rack, int32_t hpts_removed, int line, uint32
 		log.u_bbr.pkts_out = rack->r_ctl.rc_out_at_rto;
 		log.u_bbr.delivered = rack->r_ctl.rc_snd_max_at_rto;
 		log.u_bbr.pacing_gain = rack->r_must_retran;
+		log.u_bbr.bw_inuse = rack->r_ctl.current_round;
+		log.u_bbr.bw_inuse <<= 32;
+		log.u_bbr.bw_inuse |= rack->r_ctl.rc_considered_lost;
 		TCP_LOG_EVENTP(rack->rc_tp, NULL,
 		    &rack->rc_inp->inp_socket->so_rcv,
 		    &rack->rc_inp->inp_socket->so_snd,
@@ -3314,6 +3470,7 @@ rack_counter_destroy(void)
 	counter_u64_free(rack_saw_enobuf_hw);
 	counter_u64_free(rack_saw_enetunreach);
 	counter_u64_free(rack_hot_alloc);
+	counter_u64_free(tcp_policer_detected);
 	counter_u64_free(rack_to_alloc);
 	counter_u64_free(rack_to_alloc_hard);
 	counter_u64_free(rack_to_alloc_emerg);
@@ -3475,6 +3632,8 @@ rack_free(struct tcp_rack *rack, struct rack_sendmap *rsm)
 		rack->r_ctl.rc_num_split_allocs--;
 	}
 	if (rsm == rack->r_ctl.rc_first_appl) {
+		rack->r_ctl.cleared_app_ack_seq = rsm->r_start + (rsm->r_end - rsm->r_start);
+		rack->r_ctl.cleared_app_ack = 1;
 		if (rack->r_ctl.rc_app_limited_cnt == 0)
 			rack->r_ctl.rc_first_appl = NULL;
 		else
@@ -3490,7 +3649,7 @@ rack_free(struct tcp_rack *rack, struct rack_sendmap *rsm)
 	rack->r_ctl.rc_sacklast = NULL;
 	memset(rsm, 0, sizeof(struct rack_sendmap));
 	/* Make sure we are not going to overrun our count limit of 0xff */
-	if ((rack->rc_free_cnt + 1) > 0xff) {
+	if ((rack->rc_free_cnt + 1) > RACK_FREE_CNT_MAX) {
 		rack_free_trim(rack);
 	}
 	TAILQ_INSERT_HEAD(&rack->r_ctl.rc_free, rsm, r_tnext);
@@ -3806,6 +3965,8 @@ rack_increase_bw_mul(struct tcp_rack *rack, int timely_says, uint64_t cur_bw, ui
 
 	logged = 0;
 
+	if (rack->rc_skip_timely)
+		return;
 	if (override) {
 		/*
 		 * override is passed when we are
@@ -3976,6 +4137,8 @@ rack_decrease_bw_mul(struct tcp_rack *rack, int timely_says, uint32_t rtt, int32
 	uint64_t logvar, logvar2, logvar3;
 	uint32_t logged, new_per, ss_red, ca_red, rec_red, alt, val;
 
+	if (rack->rc_skip_timely)
+		return;
 	if (rack->rc_gp_incr) {
 		/* Turn off increment counting */
 		rack->rc_gp_incr = 0;
@@ -4177,6 +4340,7 @@ rack_enter_probertt(struct tcp_rack *rack, uint32_t us_cts)
 	 */
 	uint32_t segsiz;
 
+	rack->r_ctl.rc_lower_rtt_us_cts = us_cts;
	if (rack->rc_gp_dyn_mul == 0)
 		return;
 
@@ -4203,7 +4367,6 @@ rack_enter_probertt(struct tcp_rack *rack, uint32_t us_cts)
 	    rack->r_ctl.rc_pace_min_segs);
 	rack->in_probe_rtt = 1;
 	rack->measure_saw_probe_rtt = 1;
-	rack->r_ctl.rc_lower_rtt_us_cts = us_cts;
 	rack->r_ctl.rc_time_probertt_starts = 0;
 	rack->r_ctl.rc_entry_gp_rtt = rack->r_ctl.rc_gp_srtt;
 	if (rack_probertt_use_min_rtt_entry)
@@ -4387,6 +4550,7 @@ static void
 rack_check_probe_rtt(struct tcp_rack *rack, uint32_t us_cts)
 {

*** 7330 LINES SKIPPED ***
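For readers following along: the tcp.h hunks above add user-visible interfaces, notably the TCP_POLICER_DETECT socket option and the TCP_STACK_SPEC_INFO option paired with struct stack_specific_info. The following is a minimal userland sketch, not part of the commit, showing how these might be exercised on a kernel carrying this change. TCP_FUNCTION_BLK and struct tcp_function_set are pre-existing FreeBSD interfaces; the int-valued enable argument for TCP_POLICER_DETECT is an assumption made for illustration.

/*
 * Sketch only: select the rack stack on a socket, enable policer
 * detection, then read back the new stack-specific info block.
 * Assumes the tcp_rack module is loaded (kldload tcp_rack).
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <err.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	struct tcp_function_set tfs;
	struct stack_specific_info ssi;
	socklen_t len;
	int s, on = 1;

	if ((s = socket(AF_INET, SOCK_STREAM, 0)) < 0)
		err(1, "socket");

	/* Switch this connection to the rack stack. */
	memset(&tfs, 0, sizeof(tfs));
	strlcpy(tfs.function_set_name, "rack", sizeof(tfs.function_set_name));
	if (setsockopt(s, IPPROTO_TCP, TCP_FUNCTION_BLK, &tfs, sizeof(tfs)) < 0)
		err(1, "TCP_FUNCTION_BLK");

	/* Ask rack to run policer detection on this connection (assumed int arg). */
	if (setsockopt(s, IPPROTO_TCP, TCP_POLICER_DETECT, &on, sizeof(on)) < 0)
		warn("TCP_POLICER_DETECT");

	/* ... connect and transfer data here ... */

	/* Retrieve the stack-specific information added by this commit. */
	len = sizeof(ssi);
	memset(&ssi, 0, sizeof(ssi));
	if (getsockopt(s, IPPROTO_TCP, TCP_STACK_SPEC_INFO, &ssi, &len) == 0)
		printf("stack %s: policer %sdetected, %ju bytes transmitted\n",
		    ssi.stack_name, ssi.policer_detected ? "" : "not ",
		    (uintmax_t)ssi.bytes_transmitted);
	return (0);
}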
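The new "policing" SYSCTL node is added under rack's sysctl root, so the knobs should appear under the rack stack's tree, conventionally net.inet.tcp.rack.policing.* for the default stack name. A small hedged sketch of tuning one of them from userland; the exact OID prefix is an assumption here and is best confirmed with `sysctl net.inet.tcp.rack` on the running system. Note rxt_thresh is registered with SYSCTL_ADD_U16, hence the uint16_t below.

/*
 * Sketch only: raise the policer-detection retransmit threshold.
 * Per the sysctl description above, 499 means 49.9 percent.
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint16_t rxt_thresh = 499;	/* new value to set */
	uint16_t cur = 0;		/* previous value comes back here */
	size_t len = sizeof(cur);

	/* OID path assumed; verify against the installed rack stack. */
	if (sysctlbyname("net.inet.tcp.rack.policing.rxt_thresh",
	    &cur, &len, &rxt_thresh, sizeof(rxt_thresh)) < 0)
		err(1, "sysctlbyname");
	printf("rxt_thresh: %u -> %u\n", cur, rxt_thresh);
	return (0);
}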