Centre for Advanced Internet Architectures,
Swinburne University of Technology,
Melbourne, Australia

11th July, 2014

Multipath TCP For FreeBSD Kernel Patch v0.4

----------------------------------------------
OVERVIEW
----------------------------------------------

RFC6824 [1] proposes extensions to TCP [2] whereby multiple addresses (and potentially paths) can be used over a single TCP connection. This is referred to as 'Multipath TCP'. The extension is designed to maintain compatibility with existing TCP Socket APIs and is therefore backwards-compatible with existing TCP applications. We recommend that readers become familiar with the Multipath TCP RFC before attempting to apply the kernel patch.

At the time of writing, the Linux reference implementation is available from [3] as kernel sources, or as a pre-compiled package for Debian-based Linux distributions. Two commercial implementations also exist, in Citrix Netscaler [4] and Apple iOS7 [5].

This distribution contains the v0.4 implementation of Multipath TCP for FreeBSD. It is applied as a kernel patch against revision 265307 of FreeBSD-HEAD (FreeBSD-11). Instructions for acquiring the FreeBSD source and applying the patch are provided in the INSTALLATION section.

This release of the Multipath Kernel is a work-in-progress and should be considered for experimental use only. Additionally, this release is not fully compliant with the RFC (see KNOWN LIMITATIONS). We recommend reading the documentation supplied with the patch in full before use.
----------------------------------------------
CHANGES SINCE LAST RELEASE
----------------------------------------------

Please see change log at:
    http://caia.swin.edu.au/urp/newtcp/mptcp/tools/mptcp-changelog-v0.4.txt

Example topologies and configuration:
    http://caia.swin.edu.au/urp/newtcp/mptcp/tools/mptcp-examples-v0.4.txt

----------------------------------------------
NOTES ON -HEAD REVISION 265307
----------------------------------------------

We have patched against revision 265307 (committed 4th May 2014), as later revisions of -HEAD contain changes to memory management for which the implementation has not yet been fully adapted. Specifically, this relates to the change from using UMA memory (i.e. V_tcp_reass_zone) for TCP segments in reassembly to using mbufs. The next patch release will be consistent with the new memory management in -HEAD. For this reason we do not advise attempting to apply the patch against revisions later than this.

----------------------------------------------
UNDER DEVELOPMENT FOR UPCOMING RELEASES
----------------------------------------------

o Update memory management of TCP segments to be consistent with newer revisions of -HEAD
o Overhaul of connection shutdown to better suit multipath connections
o Dynamic path management
o Coupled congestion control (RFC 6356)
o Further improvements to increase RFC compliance

----------------------------------------------
LICENCE
----------------------------------------------

The FreeBSD multipath kernel patch is released under a BSD licence. Refer to licence headers in each source file for further details.

----------------------------------------------
INSTALLATION
----------------------------------------------

Prerequisites:

o The kernel patch has been developed and tested on amd64 hosts (though versions have been built for i386). The patch has also been tested on 64-bit virtual machines.
o FreeBSD-11.x - We recommend installing a snapshot ISO from:
    ftp://ftp.freebsd.org/pub/FreeBSD/snapshots/amd64/amd64/
o If using a virtual machine, a 15GB disk image is recommended as a minimum. This will be enough to hold the distribution, source and build files, with around 4-5GB of headroom.

To obtain the correct revision of the FreeBSD source tree that this patch applies to, and store it in the local directory "/path/to/src", run:

    svnlite co -r 265307 http://svn.freebsd.org/base/head /path/to/src

We have developed and tested the patch against this revision of FreeBSD-HEAD. We do not advise attempting to apply the patch to revisions after 265307.

Issuing the following commands will build and install the mptcp-enabled distribution:

    cd /path/to/src
    fetch http://caia.swin.edu.au/urp/newtcp/mptcp/tools/mptcp_v0.4_11.x.265307.patch
    patch -p1 < mptcp_v0.4_11.x.265307.patch
    make -j`sysctl -n hw.ncpu` buildworld buildkernel installkernel installworld
    mergemaster -iF -m /path/to/src
    shutdown -r now

Upon reboot MPTCP will be enabled by default, and the host will attempt to use MPTCP when setting up new connections. SACK and TSO should be disabled when attempting to use multipath connections. See KNOWN LIMITATIONS.

----------------------------------------------
CHANGES TO THE KERNEL
----------------------------------------------

Enabling MPTCP support in the FreeBSD kernel required substantial changes to the TCP stack, in particular the TCP connection setup, input and output paths and socket buffer access methods. The changes in brief (CAPABILITIES AND FEATURES provides some additional depth):

o Creation of a multipath Protocol Control Block (MPCB) and the redefinition of the existing TCP Control Block (TCPCB) to act as an MPTCP subflow.
o Changes to how control blocks are attached to and detached from a socket. A single socket can now support multiple IP and TCP control blocks.
o Changes to socket buffer access routines and accounting.
  Mechanisms for shared access to socket buffers from multiple TCP subflows.
o Option adding and parsing code for MPTCP in input and output paths.
o Locking mechanisms to handle concurrent access to data structures used in MPTCP connections.
o Significant changes to segment reassembly (receive buffer) and to acknowledgement handling and allocation of data for transmission (send buffer). New data structures and maps have been added to support multiple TCP subflows on a single socket.

----------------------------------------------
CAPABILITIES AND FEATURES
----------------------------------------------

o Compatible with standard TCP:
  The implementation can establish standard TCP connections with non MPTCP-enabled hosts.
o Multipath Capable:
  Can establish, add additional subflows to, and terminate a multipath session. Individual subflows can time out or temporarily stall without terminating the overall connection.
o Data-level retransmits:
  Subflows that trigger successive subflow-level retransmit timeouts will have their outstanding data re-injected on an alternate subflow.
o Basic Linux interoperability:
  Can establish and carry out a single-subflow MPTCP connection (without Data-level FIN handshake). Checksums should be disabled on the Linux host when creating a connection.
o MPTCP signalling:
  MP_CAPABLE, MP_ADD_ADDR, MP_JOIN and DSS exchanges are implemented and functional. Other options are currently parsed but not acted upon.
o Mediated Socket Buffer Access:
  Access to the socket and socket buffers is now restricted to the multipath control block. The multipath control block 'subdivides' the socket buffers and allocates portions to individual subflows.
o ADD_ADDR:
  ADD_ADDR is issued after a connection is established. Hosts receiving an ADD_ADDR will attempt an MP_JOIN to that address.
o Basic packet scheduling:
  Basic packet scheduling has been implemented to prevent subflows from being starved of data to send.
----------------------------------------------
KNOWN LIMITATIONS
----------------------------------------------

o TCP Segmentation Offload (TSO) disabled:
  The implementation has been tested and debugged without TSO enabled, so TSO is disabled by default (and segments are limited to 1420 bytes).
o Selective Acknowledgement (SACK):
  SACK is currently not supported, and should be disabled with the following command:

    sysctl net.inet.tcp.sack.enable=0

o Basic packet scheduler:
  The current packet scheduler features no 'intelligence', such as making decisions based on flow statistics.
o Restriction on the number of multi-homed hosts:
  Currently the implementation allows for one of the hosts to be multi-homed. For example, a connection may have a multi-homed host connected to a single-homed host. If both hosts are multi-homed, only one will use the additional interface.
o No buffering of out-of-map packets:
  Packets that arrive and do not belong to an existing map (e.g. the packet carrying the map was lost) are dropped rather than buffered. This can increase the number of retransmits required during a multipath session. Buffering code is present but not enabled as it requires further testing.
o Fall-back to 'infinite map' not handled:
  A fully established multipath connection will not fall back into standard TCP "infinite map" mode if an error is detected.
o Connection-level closing states not fully implemented:
  The Data-FIN closing sequence is not carried out in the current implementation. When an application calls close() on the socket, each of the subflows is disconnected (standard TCP close) and the connection is closed. We do not wait for outstanding data-level segments to be acknowledged (however, subflows are able to finish sending any data that has already been mapped).
o No dynamic subflow management during a connection:
  (a) The implementation will issue ADD_ADDR and JOIN signals at the start of a connection. It will not attempt to remove advertised addresses.
      However, stalled subflows are timed out and removed from the connection.
  (b) Misbehaving subflows are not RST during a multipath session. However, an application may RST a connection if corrupt data is received.
o No automated path discovery, basic path management:
  (a) Addresses are not automatically discovered. They are added via a sysctl variable (see RUN TIME CONFIGURATION). Setting this sysctl makes the address available to any multipath connection that becomes established.
  (b) Addresses learnt during a connection (via the ADD_ADDR option) are stored locally in the 'multipath layer', rather than in an independent, globally accessible path manager.
o No coupled congestion control:
  Coupled congestion control, as defined in [6], is not implemented.
o Security (HMACs, etc.) only at the most basic level for operation:
  Hashes and keys are generated and exchanged where required, but are not validated internally.
o Only 32-bit DSNs on the wire:
  Data sequence numbers are tracked as 64-bit values internally, but only the lower 32 bits are sent over the wire.
o Checksumming is disabled:
  Checksumming is not implemented in this version of the patch.
o IPv4 only:
  IPv6 code paths have not been fully implemented and tested as of this version.
o Performance is not optimised.
o Occasional panic/KASSERT failure:
  Some KASSERT and panic conditions will occasionally occur and break the system to gdb.
o Firewall-based routing:
  Currently 'pf' is used to re-route packets to the correct interface. In future releases, routing will involve using multiple FIBs and route management within the MP connection.
o Ceiling on single-map segment size:
  When using the "one-map-per-packet" option, TCP segment size is limited to 1420 bytes.

----------------------------------------------
RUN TIME CONFIGURATION
----------------------------------------------

Sysctl variables that provide configuration options:

net.inet.tcp.mptcp.mp_addresses

Additional addresses are made available using this variable.
A list of addresses is provided as input, and these will be advertised to the remote host when a multipath connection becomes established. This setting can be left empty if you only wish to use a single address on the local host (the default address, or master subflow address, is determined by the route table).

For example, on a host with two addresses, 192.168.0.10 (default gw) and 192.168.0.11, you can add the '.11' address to be used as an extra subflow in multipath connections with the following command:

    sysctl net.inet.tcp.mptcp.mp_addresses="192.168.0.11"

In this case '.10' will act as the primary subflow, while '.11' will be advertised with ADD_ADDR once multipath is established. By default the host receiving an ADD_ADDR will initiate the MP_JOIN.

Multiple addresses can be added as a space-delimited string:

    sysctl net.inet.tcp.mptcp.mp_addresses="192.168.0.11 10.0.0.20"

The list of addresses can be cleared by setting it to '0':

    sysctl net.inet.tcp.mptcp.mp_addresses=0

Currently the implementation is hard-coded to allow the addition of a single additional subflow (i.e. two interfaces). This limitation will be removed in a future release.

net.inet.tcp.mptcp.max_subflows

Specifies the maximum number of subflows that can be attached to a single multipath connection. The default value is 8; however, the implementation is currently internally limited to a maximum of two subflows.

net.inet.tcp.mptcp.single_packet_maps

Enabled by default. Restricts DSN mappings to cover a single segment only. Setting this to '0' will enable multi-packet DSN mappings (the DSN mapping being present only on the first packet of the map). Recent testing has been performed using single-packet maps, so we highly recommend keeping the default setting.

net.inet.tcp.override_isn

Manually sets the TCP initial sequence number (ISN). The ISN can be set to any number greater than 0. The default value of 0 will result in normal randomisation. Useful for debugging sequence number issues.
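The settings covered so far can be gathered into a small helper script. The following is a minimal sketch and is not part of the patch: the function names is_ipv4 and mptcp_sysctl_cmds are hypothetical, and the script only prints the sysctl commands (a dry run) so that they can be reviewed before being applied. It uses only the sysctl variables named in this document.

```sh
#!/bin/sh
# Hypothetical helper (not shipped with the patch): prints the sysctl
# commands for an MPTCP test host instead of running them.

# Succeeds if $1 looks like a dotted-quad IPv4 address (the patch is IPv4 only).
is_ipv4() {
    case "$1" in
        *[!0-9.]*|*..*|.*|*.) return 1 ;;   # stray characters or empty fields
    esac
    old_ifs=$IFS
    IFS=.
    set -- $1                               # split the address on dots
    IFS=$old_ifs
    [ $# -eq 4 ] || return 1
    for octet in "$@"; do
        [ "$octet" -le 255 ] || return 1
    done
    return 0
}

# Arguments: the additional local addresses to advertise with ADD_ADDR.
# The primary (master subflow) address is chosen by the route table.
mptcp_sysctl_cmds() {
    addrs=""
    for a in "$@"; do
        is_ipv4 "$a" || { echo "not an IPv4 address: $a" >&2; return 1; }
        addrs="${addrs:+$addrs }$a"         # space-delimited list
    done
    # SACK is not supported in this release (see KNOWN LIMITATIONS).
    echo "sysctl net.inet.tcp.sack.enable=0"
    # With no addresses given, '0' clears the list.
    echo "sysctl net.inet.tcp.mptcp.mp_addresses=\"${addrs:-0}\""
}

# Example: advertise one additional address.
mptcp_sysctl_cmds 192.168.0.11
```

Assuming the script is saved under a name such as mptcp-sysctl.sh (also hypothetical), its output can be checked and then piped to sh as root to apply the settings. Note that although several addresses can be listed, this release will only use one additional subflow.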
net.inet.tcp.mptcp.mp_debug

The kernel features multi-level debugging info, the depth and class of which is set using this sysctl variable. There are currently four classes of debug info that can be displayed, plus a shorthand for all of them:

    MPSESSION - General session information (such as hashes and keys)
    DSMAP     - Data-sequence map info (e.g. map lengths)
    SBSTATUS  - The status of the socket buffers
    REASS     - Reassembly-related information
    ALL       - Apply settings to all of the above classes at once

Each of these classes has a level of verbosity, which ranges from 0 (no output) to 5 (fully verbose). An example of usage is shown below (enables full verbosity for DSMAP):

    sysctl net.inet.tcp.mptcp.mp_debug="DSMAP:5"

In this case we use the format <class>:<level> to enable debugging. This notation causes all debugging levels up to <level> to be printed, i.e. "DSMAP:5" causes debugging at levels 1-5 of DSMAP to be printed.

To print a single debug level exclusively, use the notation <class>:=<level>:

    sysctl net.inet.tcp.mptcp.mp_debug="DSMAP:=5"

This prints out ONLY level five debug statements. To turn off debugging, issue the following command:

    sysctl net.inet.tcp.mptcp.mp_debug="DSMAP:0"

Entering the following will print a string with the current debug configuration:

    sysctl net.inet.tcp.mptcp.mp_debug

Note that enabling mp_debug will result in many lines being printed to the console, which will slow down the connection dramatically. However, it may be useful when testing to have at least MPSESSION:1 information displayed.

----------------------------------------------
ROUTING CONFIGURATION - pf
----------------------------------------------

For instances where a host has multiple addresses (and a single routing table), pf (packet filter) rules can be added to ensure packets are transmitted via the correct interface. These rules override the decision of the route table lookup.
For detailed examples, see:
    http://caia.swin.edu.au/urp/newtcp/mptcp/tools/mptcp-examples-v0.4.txt

The use of pf is a temporary measure until route management is added to the implementation.

To enable pf, add the following to /etc/rc.conf:

    pf_enable="YES"

Two pf actions allow the redirection of packets to a particular interface:

(1) "route-to"

For packets originating internally, route to (interface, gateway) based on the source IP address of the packet.

Example: A host has a single routing table but two interfaces on different subnets: {em0, 192.168.5.5}, {em1, 192.168.0.5}. The default route is "192.168.5.1", accessed via em0. The routing table might look like:

    Destination        Gateway            Netif
    default            192.168.5.1        em0
    192.168.5.0/24     link#2             em0
    192.168.0.0/24     link#3             em1

The host wants to send a packet with {src: 192.168.0.5, dst: 10.0.0.5}. As network "10.0.0.0/24" is not in the table, the packet will be forwarded to the default gateway via em0 (even though the source address is in the network "192.168.0.0/24"). The "route-to" filter can re-direct this packet out of the interface em1. The rule would look like the following (see below for an actual example):

    pass out on em0 route-to {(em1 192.168.0.1)} from 192.168.0.0/24 to any

(2) "reply-to"

When receiving packets on an interface, ensure that any responses are returned via the same path, specified by (interface, gateway).

Example: The host described above receives a packet at the interface {em1, 192.168.0.5} from "10.0.0.5". All packets in response to this packet (e.g. ACKs) should be sent back via {em1, 192.168.0.5}. A rule can also be specified for {em0, 192.168.5.5}.

    pass in on em1 reply-to {(em1 192.168.0.1)} from any to any keep state
    pass in on em0 reply-to {(em0 192.168.5.1)} from any to any keep state

Example multi-homed host:

The following host has two interfaces on separate networks. A1 (em0) has the default route and belongs to the network "192.168.5.0/24".
A2 (em1) is a secondary interface on network "192.168.0.0/24".

     Host
    +----+
    | A1 | <-------> gateway 192.168.5.1
    |    |
    |    |
    | A2 | <-------> gateway 192.168.0.1
    +----+

The pf configuration below will ensure that packets with a source address of A2 are sent towards the gateway 192.168.0.1, and that any response to packets arriving on either interface occurs via that same interface.

Add the following pf rules to /etc/pf.conf:

    # The default interface, and the interface to be added to "mp_addresses"
    def_if = "em0"
    mp_if = "em1"

    # Default gateway of the first (default) interface
    def_gw = "192.168.5.1"

    # Network of the "mp_address" interface, and the gateway
    mp_net = "192.168.0.0/24"
    mp_gw = "192.168.0.1"

    # Any packets originating from this host and leaving "def_if" with source
    # network "mp_net" will be forwarded out of the "mp_if" interface instead.
    pass out on $def_if route-to {($mp_if $mp_gw)} from $mp_net to any

    # Responses to any packets received on an interface are sent via that same
    # interface, irrespective of the source/destination networks.
    pass in on $mp_if reply-to {($mp_if $mp_gw)} from any to any keep state
    pass in on $def_if reply-to {($def_if $def_gw)} from any to any keep state

----------------------------------------------
ACKNOWLEDGEMENTS
----------------------------------------------

This project has been made possible in part by a gift from The Cisco University Research Program Fund, a corporate advised fund of Silicon Valley Community Foundation.

----------------------------------------------
RELATED READING
----------------------------------------------

This software was developed at Swinburne University's Centre for Advanced Internet Architectures, under the umbrella of the NewTCP research project. More information on the project is available at:
    http://caia.swin.edu.au/urp/newtcp/

The FreeBSD MPTCP implementation homepage can be found at:
    http://caia.swin.edu.au/urp/newtcp/mptcp

An overview of the process of designing the MPTCP protocol, written by the RFC authors, is given in [7].
----------------------------------------------
REFERENCES
----------------------------------------------

[1] Ford, A. et al, "TCP Extensions for Multipath Operation with Multiple Addresses", RFC 6824, January 2013.
[2] Postel, J., "Transmission Control Protocol", RFC 793, September 1981.
[3] "MultiPath TCP - Linux Kernel implementation", Homepage, http://multipath-tcp.org/, March 2013.
[4] Eardley, P., "Survey of MPTCP Implementations", http://datatracker.ietf.org/doc/draft-eardley-mptcp-implementations-survey/
[5] "Apple seems to also believe in Multipath TCP", Blog Entry, http://perso.uclouvain.be/olivier.bonaventure/blog/html/2013/09/18/mptcp.html
[6] Raiciu, C. et al, "Coupled Congestion Control for Multipath Transport Protocols", RFC 6356, October 2011.
[7] Raiciu, C. et al, "How Hard Can It Be? Designing and Implementing a Deployable Multipath TCP", USENIX Symposium of Networked Systems Design and Implementation (NSDI'12), San Jose (CA), 2012.

----------------------------------------------
DEVELOPMENT TEAM
----------------------------------------------

This FreeBSD MPTCP implementation was first released in 2013 by the Multipath TCP research project at Swinburne University of Technology's Centre for Advanced Internet Architectures (CAIA), Melbourne, Australia. The members of this project team are:

Lead developer: Nigel Williams (njwilliams@swin.edu.au)
Technical advisor/developer: Lawrence Stewart (lastewart@swin.edu.au)
Project leader: Grenville Armitage (garmitage@swin.edu.au)