-------------------------------------------------------------------------------
Centre for Advanced Internet Architectures,
Swinburne University of Technology,
Melbourne, Australia

15 October, 2015

Multipath TCP For FreeBSD Kernel Patch v0.51
mptcp-changelog-v0.51.txt

Author: Nigel Williams
-------------------------------------------------------------------------------

v0.51
Date: 15 October, 2015

- Merged with mercurial changeset a5a383383dcb of the 'freebsd-head' branch
  of hg-beta.freebsd.org/base

--------------------------------------------------------------------------------

v0.5
Date: 1 September, 2015

- As the implementation has been extensively re-written, this changelog entry
  is limited to architecturally significant items.
- Merged with FreeBSD-HEAD svn revision 285254 (Jul 7, 2015).
- Multipath Control Block (mpcb) has been split into a Multipath Protocol
  Control Block (mppcb) and mpcb pair. The mppcb is attached to the socket
  and maintains overall connection-level state while the mpcb functions as a
  transport control block. This mimics the "socket <-> IP <-> TCP" control
  block relationship of a standard TCP socket.
- Breakdown of MPTCP connection into a Multipath Socket + (n * Subflow
  Sockets). A subflow socket is a "standard" TCP socket that is instantiated
  and attached to the MPTCP socket, and accessed via a list of subflow
  sockets in the mpcb.
- Defined new protocol hooks for MPTCP for handling user requests/syscalls
  (netinet/mptcp_usrreq.c). Calls to "(so)(pru_*)" now call an MPTCP-specific
  function.
- Created mp_input, mp_do_segment and mp_output (netinet/mptcp_subr.c)
  functions for handling data segments at the mp-level.
- Introduced a series of asynchronous task handlers that separate the subflow
  and mp layers. MP-level processing (such as processing an MP option) is
  performed by an appropriate task handler rather than in the subflow thread.
- New locking scheme.
  Defined new lock for mppcb "MPP-LOCK", held when modifying/reading mp-level
  data or dereferencing the MPTCP Socket. This is similar to the INP-LOCK in
  standard TCP.
- Simplified reassembly. TCP-level reassembly is mostly unchanged from the
  standard implementation. Once in-order, a pointer to the reassembly list
  head is passed to the MP-layer for data-level reassembly. The subflow-level
  receive buffer is currently unused.
- Per-subflow send buffers. Outbound data is mapped by the packet scheduler
  from the connection-level socket to the send buffer of a subflow socket. A
  data-sequence map is also passed to the subflow so that DSNs can be applied
  to outgoing segments. Subflows are not aware of the connection-level send
  buffer.
- Send buffer mapping and mp-level reassembly have been re-written from
  scratch.

--------------------------------------------------------------------------------

v0.4
Date: 11 July, 2014

- Merged up to HEAD revision 241912 (4th May, 2014).
- Modified tcp_drop() to stop setting ETIMEDOUT on so->so_error while the
  subflow count is greater than 1. This enables the connection to stay open
  when an individual subflow times out (failed SYN, met max RTO count, etc.).
- Timed-out, RST subflows no longer close the whole connection. Changed
  setting of INP_DROPPED and INP_SOCKREF to prevent premature free of the
  socket. These flags are cleared if more than one subflow remains, and set
  when closing the last subflow. This should be replaced with a better
  solution in later releases.
- Stop connections from localhost, or a local interface, from becoming
  multipath. Also prevent advertising of an mp_address if it was the default
  interface for the connection.
- Drop discarded tcpcbs from the mp-level subflow list. This was not being
  done previously, so was causing a KASSERT (sf_inp != NULL) to trigger when
  a later call to tcp_usr_disconnect occurred.
- Minor modifications to subflow attach.
  Restrict advertising and attaching a listen PCB when we are a multi-homed
  passive opener and the remote host is also multi-homed. (This behaviour
  will change with the implementation of a path manager.)
- Fixes for re-injection of segments onto an alternate subflow after a
  data-level timeout. Segments are now only retransmitted from the subflow
  that triggered the data-level timeout, and mapping code has been adjusted
  to re-map these segments onto the first subflow that is not in retransmit.
  All outstanding segments are sent back-to-back on the retransmitting
  subflow, overriding the packet scheduler.
- Removed mptcp RTO timer (was implemented in the style of existing tcp
  timers). The data-level RTO is now triggered based on the number of RTOs
  that have fired at the subflow level, i.e. after a certain count of
  rxmtshft the outstanding segments will be queued and sent on an alternate
  subflow.
- Sequence number wrap detection checks when accessing dsmaps. Detects when a
  sequence number being requested was in a map that wrapped over its length.
  Also detects if the requesting sequence number has wrapped.
- Subflow sequence number wrap detection and handling related to marking sent
  maps as acked.
- New macros to calculate the number of payload bytes acked (with wrap
  detection).
- Pass DACK notification to tcp_output via tcpcb sf_flags (so that we no
  longer set this in the mp layer). Added SFF_NEED_DACK to signal this.
- Set additional subflows (started via JOIN) directly into mp_active after
  handshake, rather than calling the mp_init_established function.
- Added HMAC calculations and code to perform a spec-compliant MP_JOIN
  handshake, i.e. HMACs are included in the MP_JOIN options where
  appropriate. The code does not validate the HMAC at this point in time.
- Changed endian-ness of random numbers and keys for compatibility with the
  Linux implementation.
- Fixed retransmit logic in the case of performing retransmit when a SACK
  hole does not exist for the retransmit (e.g.
  window tail loss).
- Synchronised receive window updates across subflows. This prevents a window
  from being closed on one subflow but updated on another subflow (thus
  leaving the first subflow with a zero window). This is done by forcing an
  ACK on each subflow whenever tcp_usr_rcvd() is called.
- The length of the packet to be sent was previously calculated based on the
  window minus the offset in the socket buffer. When sendalot was used,
  offset would track the number of bytes sent relative to una and would
  ensure no more than sendwin bytes were sent. By re-purposing offset for use
  with ds_maps, sendalot has to track the number of bytes sent in the current
  call to tcp_output in a different way. The fix is to calculate sendwin once
  and subtract len from its value each time the function loops. We exit the
  loop once the value of sendwin decreases below zero. This itself caused an
  issue where calls from tcp_usrreq would occur before the cwnd had been
  updated after sending, causing data to be sent past the available
  congestion window. Added a new variable to track the actual available
  window to stop the sending of too much data.
- Corrected sequence number mapping for locating existing ds_maps. Previously
  if a map wrapped, and the sequence number appeared before the wrapping
  point, this would be missed and would trigger a false panic that a map did
  not exist for the sequence number.
- Fixed sequence number wrapping issue with finding ds_maps in the receive
  maps list. Maps for some sequence numbers would not be found if the map end
  had wrapped and the sequence number was less than UINTMAX.
- Ack processing fix: added subtraction of oursynisacked to prevent a KASSERT
  triggered by acked and sb_cc being offset by 1.
- Now advertise a varying, reduced window size to prevent nmclusters from
  being exhausted.
- Retransmits were sending packets with a length of 0. This was because the
  variable map_unsent was used to determine the length in bytes to be
  included in a segment.
  The calculation for map_unsent was wrong, however, as it was subtracted
  from snd_max. This means that once snd_max had progressed beyond the end of
  the map in sequence space, it would not be possible to transmit those bytes
  again if they were lost (this was occurring in drop-tail loss of segments
  at the receiver due to queuing). map_unsent is now based on 'offset'
  (snd_nxt - snd_una), which allows tcp_output to traverse back into a
  previously sent map and retransmit data.
- In multi-packet map scenarios, the segment carrying a map can be lost,
  causing subsequent packets to arrive before the map. This was triggering a
  panic for out-of-map packets. As a temporary fix these packets are now
  dropped after triggering an ACK.
- Change to sysctl mp_addresses to handle '0' as a reset/clearing condition.
- Reversed direction of MP_JOIN. The MP_JOIN SYN is now sent from the host
  that received the ADD_ADDR.
- Added new task handlers for subflow management (joins, subflow allocation,
  data-level rexmit, window updates).
- Fixed issue of creating duplicate inp/tp pairs for subflows in
  syncache_socket_subflow. Now the inp/tp pair created when the address is
  advertised is located using in_pcblookup_mbuf in syncache_socket_subflow.
  This was previously causing a 'connection closed' error on connection
  teardown and the spurious sending of ACKs from each of the active subflows.
- Fixed bug where a single-addressed active opener would crash on connection
  close (to a multi-addressed server) due to the subflow count not being
  incremented correctly.
- The call to check soreadable always evaluated true, and resulted in the
  stack not responding to FIN when connected to a standard TCP
  implementation. This problem manifested as a connection "stall" when in
  fact the connection had completed normally.
- Fixed panics caused during shutdown of multi-subflow connections, where
  subflows have different BDP.
  In these cases the slower of the subflows would still be ACKing outstanding
  data while the faster subflow would be going through
- Fixes to closing sequence to prevent holding of removed locks, flushing of
  buffers/pcbs while subflows are still active.
- Changes to mp_timer to handle data-level retransmits. Will identify maps
  that need to be retransmitted and insert them into the newly created
  mp->rexmit_maps list. A task handler has been added that will retransmit
  the data on the first available subflow that is not in tcp-level
  retransmit.
- Added mp_insert map to handle generic insertion of maps into mp->snd_maps.
  This is used to insert maps from the rexmit maps list, and for maps that
  have been created by mp_get_map.
- Duplicate sent map detection. If the same ds-level sequence space has been
  sent on two different subflows (for instance in rexmit), then there will be
  two entries in mp_sendmaps. If one of the maps has been acked in one of the
  subflows, then we can remove the map from the mp-level maps list. This
  allows the data to be freed from the send buffer, rather than needing to
  wait for both maps to be acked. The map is left in the tp-level sent maps
  list, and is freed in tcp_subr when/if the other subflow eventually acks
  it.
- sbspace macro was using a signed int and imin to calculate space in the
  buffer. As we do not always drop mbufs before calling sowwakeup (in cases
  where data-level rexmit has occurred, we wait for the subflow to time out
  or ack those segments at the subflow level before dropping them from the
  send buffer), sb_mbcnt can grow without bound. Need to use unsigned values
  and comparison so that negative values of sbspace are not returned.
- Modified reass logic for detecting and setting the out-of-order segment (if
  any). There was a logic error where the first segment in the list could be
  disordered but not marked as such, allowing the tcp sequence space to
  progress out of step with the ds-level sequence space.
  There were also instances where multiple disordered segments (e.g. first
  segment and a later segment) would increase tcp_rcv next when the later
  hole was filled (again increasing the tcp sequence space when the first
  segment was out of order).
- Function to check if addresses specified in "mp_addresses" are active at
  connection start. If not then we do not advertise the address via ADD_ADDR.

--------------------------------------------------------------------------------

v0.3
Date: 16 April 2013

- Fixed a connection stall issue caused by small-sized maps not being sent by
  tcp_output().
- Added the net.inet.tcp.override_isn sysctl to manually set the TCP initial
  sequence number (isn) for easier debugging of sequence number wrapping
  issues. Setting this to '1' can mitigate panics caused by subflow sequence
  number wrapping.
- Fixed a connection stall by having all received DACKs trigger a call to
  mp_deferred_sbdrop(). It's possible that DACKs arriving on different
  subflows arrive out of order, potentially leading to one or more maps never
  being garbage collected by mp_deferred_sbdrop(). This in turn left the
  corresponding socket buffer data unfreeable which ultimately caused a
  deadlock condition.
- Fixed an off-by-one SYN related sequence accounting error that caused a
  panic when ssh'ing into an mptcp-enabled host. Reported by Scott Kamp.
- Reassembly accounting has been unified (always keyed off ds_rcv_nxt). Thus
  the same code path is used for active MP sessions and 'standard' tcp
  sessions.
- Fixed receive-side reassembly issue where segments were not being delivered
  to the application due to incorrect updating of pointer to first
  out-of-order segment in reassembly list.
- As it is possible to have in-order data waiting in a disordered queue,
  removed a KASSERT that was triggered by this. Added debugging output in
  tcp_reass() to indicate when changes between ordered and disordered
  occurred.
- Fixed a race between mp_deferred_sbdrop() and mp_get_map() related to
  mp_ds_sndmaps list manipulation by acquiring the mpcb mutex in
  mp_get_map().
- Fixed a sanity checking KASSERT in mp_get_map() to account for overlapping
  maps in preparation for adding data-level retransmits.
- Added functionality to specify whether the log level verbosity is treated
  as "must match" or "cumulative".
- Fixed a race with a sanity checking KASSERT in mp_get_map() by holding the
  snd sockbuf lock in mp_deferred_sbdrop() while manipulating the socket
  buffer *and* ds_map_min.
- Cap the minimum map size for send DS maps to the MSS of the subflow calling
  mp_get_map() when the socket buffer is full. Resolves observed behaviour
  where a subflow sharing a send buffer gets progressively smaller maps down
  to values well below the MSS.
- Fixed a connection stall by making DS wrap detection more robust to handle
  DACKs arriving on multiple subflows and being processed out-of-order.
- Account for MP_CAPABLE in the DS space by incrementing idsn by 1 as per the
  spec.

--------------------------------------------------------------------------------

v0.2
Date: 15 March 2013

- Merged with FreeBSD 10-CURRENT svn revision 248226, which allows the patch
  to be more easily applied against recent revisions of 10-CURRENT.
- Added a new "REASS" mp_debug class that provides access to debugging
  information related to segment reassembly.
- Documented the existence of the "net.inet.tcp.mptcp.linux_compat" Linux
  compatibility sysctl after carrying out some interop testing with
  up-to-date Linux MPTCP Git sources and confirming some basic scenarios
  work. Setting the sysctl to 1 (the default) changes some stack behaviour to
  mimic that which is expected by the Linux implementation.
- Moved setting of D-ACK flag to prevent stalling connections due to lack of
  retransmit at the data-level.
- Fixed a connection stall issue triggered by send maps not being marked as
  ACKED before data-ACK processing took place, which in turn prevented send
  socket buffer bytes from being dropped.
- Plumbed in a replacement for the socket disconnect logic that used to be in
  tcp_input.c, which now waits for all subflows to be closed.

--------------------------------------------------------------------------------

v0.1
Date: 10 March 2013

- First public release of the FreeBSD MPTCP patch.
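
--------------------------------------------------------------------------------

Illustrative note: sequence number wrap handling

Several entries above (v0.3 and v0.4) concern sequence numbers that wrap while
maps are being looked up or bytes acked are being counted. The sketch below is
not taken from the patch. It only illustrates the general modular-arithmetic
technique, assuming 32-bit sequence numbers for simplicity. The demo_map
structure and the demo_* function names are hypothetical and do not correspond
to the patch's ds_map layout or macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified map descriptor (not the patch's ds_map). */
struct demo_map {
	uint32_t start;	/* first sequence number covered by the map */
	uint32_t len;	/* number of payload bytes covered by the map */
};

/*
 * Return true if seq lies in [start, start + len), including the case
 * where the map crosses the 2^32 boundary. Unsigned subtraction is
 * performed modulo 2^32, so the distance from start is the same on
 * either side of the wrap point.
 */
static bool
demo_seq_in_map(const struct demo_map *map, uint32_t seq)
{
	assert(map != NULL);
	return ((uint32_t)(seq - map->start) < map->len);
}

/*
 * Number of bytes newly acknowledged when the cumulative ACK point
 * moves from una to ack. Correct across a wrap as long as fewer than
 * 2^32 bytes are acknowledged at once.
 */
static uint32_t
demo_bytes_acked(uint32_t una, uint32_t ack)
{
	return ((uint32_t)(ack - una));
}
```

A naive range test of the form (seq >= start && seq < start + len) fails when
start + len overflows, which is the kind of missed lookup described in the
v0.4 entries. Comparing the unsigned distance from the start of the map avoids
treating the wrap point as a special case.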