
Enabling High Performance Data Transfers

System Specific Notes for System Administrators (and Privileged Users)


Please Note: This information was originally compiled by Jamshid Mahdavi (Thanks a bunch, Jamshid!).

Many, many people have helped us in compiling the information on this website. We want to thank all of them for their help in sending us updates, and encourage people to continue to report any errors or additions to us at nettune@ncne.org so that the information herein will be as up-to-date as possible.

We are especially interested in information on the latest OS releases. If the instructions here apply to newer releases please let us know so we can update the version information.



Introduction

Network hardware components are getting faster and cheaper day by day. On the LAN front, Gigabit Ethernet cards have become very affordable. On the WAN front, academic research institutions can connect to very high speed backbones such as Abilene via OC-48 links, capable of transferring data at over 2 Gbps. Typical users are no longer concerned with whether the data is coming from a machine across the room or across the continent!

In order to take full advantage of these high speed networks however, users and system administrators have to pay attention to some of the configuration and tuning issues. This is particularly important on high speed paths that involve large round trip times (RTT), sometimes referred to as "Long Fat Networks" (LFNs, pronounced "elefan(t)s"; please see "TCP/IP Illustrated, Vol 1" by Richard W. Stevens).

Please Note: Some applications (ssh and scp, for example) implement flow control using an application-level window mechanism that defeats TCP tuning.

Our objective here is to summarize the issues involved and maintain a repository of hardware/OS specific information that is required for getting the best possible network performance on these platforms.

In the following section, titled "Overview," we briefly explain the issues and define some terms. A later section, "Detailed Procedures," provides step-by-step directions for making the necessary changes under various operating systems.

Overview

TCP, the dominant protocol used on the Internet today, is a "reliable", "window-based" protocol. Under ideal conditions, the best possible network performance is achieved when the data pipe between the sender and the receiver is kept full.

Bandwidth*Delay Products (BDP)

The amount of data that can be in transit in the network, termed the "Bandwidth-Delay-Product," or BDP for short, is simply the product of the bottleneck link bandwidth and the Round Trip Time (RTT). BDP is a simple but important concept in a window-based protocol such as TCP. Some of the issues discussed below arise because the BDP of today's networks has increased far beyond what it was when the TCP/IP protocols were initially designed. To accommodate these large increases in BDP, some high performance extensions have been proposed and implemented in the TCP protocol. But these high performance options are sometimes not enabled by default and have to be explicitly turned on by the system administrator.

Buffers

In a "reliable" protocol such as TCP, the importance of BDP described above is that this is the amount of buffering that will be required in the end hosts (sender and receiver). The largest buffer the original TCP (without the high performance options) supports is limited to 64K Bytes. If the BDP is small either because the link is slow or because the RTT is small (in a LAN, for example), the default configuration is usually adequate. But for a paths that have a large BDP, and hence require large buffers, it is necessary to have the high performance options discussed in the next section be enabled.

Computing the BDP

To compute the BDP, we need to know the speed of the slowest link in the path and the Round Trip Time (RTT). The peak bandwidth of a link is typically expressed in Mbit/s (or more recently in Gbit/s). The round-trip delay (RTT) for a link can be measured with ping or traceroute, which for WAN links is typically between 10 msec and 100 msec.

As an example, for two hosts with GigE cards, communicating across a coast-to-coast link over an Abilene connection (assuming a 2.4 Gbps OC-48 link), the bottleneck link will be the GigE card itself. The actual round trip time (RTT) can be measured using ping, but we will use 70 msec in this example. Knowing the bottleneck link speed and the RTT, the BDP can be calculated as follows:

   1,000,000,000 bits    1 byte   70 seconds
   ------------------- * ------ * ---------- = 8,750,000 bytes = 8.75 MBytes
   1 second              8 bits   1,000

Based on these calculations, it is easy to see why the typical default buffer size of 64 KBytes would be wholly inadequate for this connection.
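
The same arithmetic can be checked quickly from a shell. This is a trivial sketch using the bandwidth and RTT assumed above (1 Gbit/s and 70 msec):

	# BDP = bandwidth (bits/sec) / 8 * RTT (sec)
	awk 'BEGIN { printf "%.0f bytes\n", 1e9 / 8 * 0.070 }'
	# prints: 8750000 bytes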

The next section presents a brief overview of the high performance options. Specific details on how to enable these options in various operating systems are provided in a later section.

High Performance Networking Options

  1. TCP Selective Acknowledgments (SACK, RFC2018): SACKs allow a receiver to acknowledge non-consecutive data. This is particularly helpful on paths with a large Bandwidth-Delay-Product (BDP). While SACK is now supported by most operating systems, it may have to be explicitly turned on by the system administrator. (A quick way to inspect this and the other options below on a Linux host is sketched after this list.)

    Additional information on commercial and experimental implementations of SACK is available at http://www.psc.edu/networking/all_sack.html.

  2. Large Windows (RFC1323): Without the support of this TCP enhancement, the buffer sizes that can be used by the application are limited to 64 KBytes. As we have seen in the BDP section above, this would be inadequate for today's high speed WANs.

    On most systems, RFC1323 extensions are included but may require the system administrator to explicitly turn them on.

  3. Maximum Buffer Sizes on the host: Typically operating systems limit the amount of memory that can be used by an application for buffering network data. The host system must be configured to support large enough socket buffers for reading and writing data to the network. Typical Unix systems include a default maximum value for the socket buffer size between 128 kB and 1 MB. For many paths, this is not enough, and must be increased.

    Please note that without the RFC1323 "Large Windows" extension indicated above, TCP/IP does not allow applications to buffer more than 64 kB in the network, irrespective of the maximum buffer size configured.

  4. Default Buffer Sizes: The "Maximum Buffer Size" mentioned above sets a maximum limit for the buffers (as you may have guessed by the name!). In addition, most operating systems have a system-wide "default buffer size" that is configured in. Unless an application explicitly requests a specific buffer size, it gets a buffer of the default size. Most system administrators set this value such that it is appropriate for a LAN, but it would not necessarily be sufficient for a WAN path with a large BDP.

    System administrators usually have to make a judicious choice of default and maximum values for the buffers.

  5. Application Buffers: If the default buffer size is not large enough, the application must set its send and receive socket buffer sizes (at both ends) to at least the BDP of the link. Some network applications support options for the user to set the socket buffer size (for example, Cray UNICOS FTP); many do not. There are several modified versions of applications available which support large socket buffer sizes.

    Alternatively, the system-wide default socket buffer size can be raised, causing all applications to utilize large socket buffers. This is not generally recommended, as many network applications then consume system memory which they do not require.

    New: The best solution would be for the operating system to automatically tune socket buffers to the appropriate size. Linux users may want to investigate Web100, which includes autotuning and TCP instrumentation to enable better TCP diagnosis. This follows earlier work by Jeff Semke at PSC where he developed an experimental Autotuning Implementation for NetBSD. In the future, we hope to see such automatic tuning as a part of all TCP implementations, making this entire website obsolete.

    For socket applications, the programmer can choose the socket buffer sizes using a setsockopt() system call. A Detailed Users Guide describing how to set socket buffer sizes within socket based applications has been put together by Von Welch at NCSA.

  6. Path MTU: The host system must use the largest possible MTU for the path. This may require enabling Path MTU Discovery (RFC1191). Because RFC1191 has known flaws, Path MTU Discovery is not always enabled by default and may need to be explicitly enabled by the system administrator. Watch for pMTU blackholes. If Path MTU Discovery is unavailable or undesired, it is sometimes possible to trick the system into using large packets, but this may have undesirable side effects.

    The Path MTU Discovery server (described in a little more detail in the next section) may be useful in checking out the largest MTU supported by some paths:

    http://www.ncne.org/jumbogram/mtu_discovery.php
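
As a quick illustration of where these options live, here is a sketch for a Linux 2.4 host (the sysctl names are the standard Linux ones; see the Linux section later in this document for details). It simply inspects the current settings:

	sysctl net.ipv4.tcp_sack net.ipv4.tcp_window_scaling net.ipv4.tcp_timestamps
	sysctl net.core.rmem_max net.core.wmem_max net.core.rmem_default net.core.wmem_default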

Using Web Based Network Diagnostic Servers

There are some Web100 based diagnostic servers that are useful for troubleshooting some of the networking problems. We point out a couple of them here.

Support for these features under various operating systems

Operating System (alphabetical) | RFC1191 Path MTU Discovery | RFC1323 Support | Default maximum socket buffer size | Default TCP socket buffer size | Default UDP socket buffer size | User-tunable applications | RFC2018 SACK Support | More info
BSD/OS 2.0 | No | Yes | 256kB | 8kB | 9216 snd / 41600 rcv | None |  | Hari Balakrishnan's BSD/OS 2.1 implementation
BSD/OS 3.0 | Yes | Yes | 256kB | 8kB | 9216 snd / 41600 rcv | None |  |
CRI Unicos 8.0 | Yes | Yes |  |  |  | FTP |  |
(Compaq) Digital Unix 3.2 | Yes | Winscale, no Timestamps | 128kB | 32kB |  | None |  |
(Compaq) Digital Unix 4.0 | Yes | Yes (Winscale, no Timestamps) | 128kB | 32kB | 9216 snd / 41600 rcv | None |  | PSC Research version
FreeBSD 2.1.5 | Yes | Yes | 256kB | 16kB | 40kB | None |  | Luigi Rizzo's FreeBSD 2.1R version; Eliot Yan of UCLA also has one
HPUX 9.X | No | 9.05 and 9.07 provide patches for RFC1323 | 1MB (?) | 8kB | 9216 | FTP (with patches) |  |
HPUX 10.{00,01,10,20,30} | Yes | Yes | 256kB | 32kB | 9216 | FTP |  |
HPUX 11 | Yes | Yes | >31MB? | 32kB | 65535 | FTP |  |
IBM AIX 3.2 & 4.1 | No | Yes | 64kB | 16kB | 41600 bytes receive / 9216 bytes send | None |  |
IBM MVS TCP stack by Interlink, v2.0 or greater | No | Yes | 1MB |  |  |  |  |
Linux 2.4.00 or later | Yes | Yes | 64kB | 32kB (see notes) | 32kB (?) | None | Yes |
MacOS (Open Transport) | Yes | Yes | Limited only by available system RAM | 32kB | 64kB (send and receive) | Fetch (ftp client) | Not in versions up to Open Transport 2.7.x; will be in OT 3.0 |
NetBSD 1.1/1.2 | No | Yes | 256kB | 16kB |  | None |  | PSC Research version
FTP Software (NetManage) OnNet Kernel 4.0 for Win95/98 | Yes | Yes | 963.75 MB | 8K (146K for satellite tuning) | 8K send / 48K recv | FTP server | Yes |
Novell Netware 5 | Yes | No | 64kB | 31kB |  | None |  |
SGI IRIX 6.5 | Yes | Yes | Unlimited | 60kB | 60kB | None | Yes, as of 6.5.7; on by default |
Sun Solaris 2.5 | Yes | No (available as a Sun Consulting Special; will be in Solaris 2.6) | 256kB | 8kB | 8kB | None |  |
Sun Solaris 2.6 | Yes | Yes | 1MB TCP, 256kB UDP | 8kB | 8kB | None | Yes, experimental patch from Sun |
Sun Solaris 7 | Yes | Yes | 1MB TCP, 256kB UDP | 8kB | 8kB | None | Yes; default is "passive" (see below) |
Microsoft Windows NT 3.5/4.0 | Yes | No | 64kB | max(~8kB, min(4*MSS, 64kB)) |  |  | No |
Microsoft Windows NT 5.0 Beta | Yes | Yes |  |  |  |  |  |
Microsoft Windows 98 | Yes |  | 1GB (?!) | 8kB |  |  | Yes (on by default) |
Microsoft Windows 2000 and Windows XP | Yes |  | 1GB (?!) | 8kB |  |  | Yes (on by default) |


Detailed procedures for system tuning under various operating systems


Procedure for raising network limits under BSD/OS 2.1 and 3.0 (BSDi)

MTU discovery is now supported in BSD/OS 3.0. RFC1323 is also supported, and the procedure for setting the relevant kernel variable uses the "sysctl" interface described for FreeBSD. See sysctl(1) and sysctl(3) for more information.


Procedure for raising network limits on CRI systems under Unicos 8.0

System configuration parameters are tunable via the command "/etc/netvar". Running "/etc/netvar" with no arguments shows all configurable variables:
% /etc/netvar
Network configuration variables
        tcp send space is 32678
        tcp recv space is 32678
        tcp time to live is 60
        tcp keepalive delay is 14400
        udp send space is 65536
        udp recv space is 68096
        udp time to live is 60
        ipforwarding is on
        ipsendredirects is on
        subnetsarelocal is on
        dynamic MTU discovery is on
        adminstrator mtu override is on
        maximum number of allocated sockets is 3750
        maximum socket buffer space is 409600
        operator message delay interval is 5
        per-session sockbuf space limit is 0
Any of the variables shown above can be set with /etc/netvar. Once variables have been changed, they take effect immediately for new processes. Processes which are already running with open sockets are not modified.



Procedure for raising network limits on (Compaq) DEC Alpha systems under Digital Unix 3.2c




Procedure for raising network limits on (Compaq) DEC Alpha systems under Digital Unix 4.0




Procedure for raising network limits under FreeBSD 2.1.5

MTU discovery is on by default in FreeBSD past 2.1.0-RELEASE. If you wish to disable MTU discovery, the only way that we know of is to lock an interface's MTU, which disables MTU discovery on that interface.

You can't modify the maximum socket buffer size in FreeBSD 2.1.0-RELEASE, but in 2.2-CURRENT you can use

	sysctl -w kern.maxsockbuf=524288
to make it 512kB (for example). You can also set the TCP and UDP default buffer sizes using the variables
	net.inet.tcp.sendspace
	net.inet.tcp.recvspace
	net.inet.udp.recvspace
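
For example (a sketch; the 128kB value is purely illustrative), the default TCP buffer sizes could be raised with:

	sysctl -w net.inet.tcp.sendspace=131072
	sysctl -w net.inet.tcp.recvspace=131072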



Procedure for raising network limits under HPUX 9.X

HP-UX 9.X does not support Path MTU discovery.

There are patches for 9.05 and 9.07 that provide 1323 support. To enable it, one must poke the kernel variables tcp_dont_tsecho and tcp_dont_winscale to 0 with adb (the patch includes a script, but I don't recall the patch number).

Without the 9.05/9.07 patch, the maximum socket buffer size is somewhere around 58254 bytes. With the patch it is somewhere around 1MB (there is a small chance it is as much as 4MB).

The FTP provided with the up to date patches should offer an option to change the socket buffer size. The default socket buffer size for this could be 32KB or 56KB.

There is no support for SACK in 9.X.

Procedure for raising network limits under HPUX 10.X

HP-UX 10.00, 10.01, 10.10, 10.20, and 10.30 support Path MTU discovery. It is on by default for TCP, and off by default for UDP. It can be toggled on or off with nettune.

Up through 10.20, RFC 1323 support is like the 9.05 patch, except the maximum socket buffer size is somewhere between 240 and 256KB. In other words, you need to do the same adb "pokes" as described above.

10.30 does not require adb "pokes" to enable RFC1323. 10.30 also replaces nettune with ndd. The 10.X default TCP socket buffer size is 32768; the default UDP size remains unchanged from 9.X. Both can be tweaked with nettune.

FTP should be as it is in patched 9.X.

There is no support for SACK in 10.X up through 10.20.

Procedure for raising network limits under HPUX 11

HP-UX 11 supports PMTU discovery and enables it by default. This is controlled through the ndd setting ip_pmtu_strategy.
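
For example (a sketch; ip_pmtu_strategy is an IP-level parameter, so it is set against /dev/ip):

	ndd -get /dev/ip ip_pmtu_strategy
	ndd -set /dev/ip ip_pmtu_strategy 1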

Note: Additional (extensive) information is available at ftp://ftp.cup.hp.com/dist/networking/briefs/annotated_ndd.txt

RFC 1323 support is enabled automagically in HP-UX 11. If an application requests a window/socket buffer size greater than 64 KB, window scaling and timestamps will be used automatically.

The default TCP window size in HP-UX 11 remains 32768 bytes and can be altered through ndd and the settings:

    tcp_recv_hiwater_def
    tcp_recv_hiwater_lfp
    tcp_recv_hiwater_lnp
    tcp_xmit_hiwater_def
    tcp_xmit_hiwater_lfp
    tcp_xmit_hiwater_lnp
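
For example (a sketch; the 256kB value is illustrative), the default send and receive windows could be raised with:

	ndd -set /dev/tcp tcp_recv_hiwater_def 262144
	ndd -set /dev/tcp tcp_xmit_hiwater_def 262144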

FTP in HP-UX 11 uses the new sendfile() system call. This allows data to be sent directly from the filesystem buffer cache through the network without intervening data copies.

HP-UX 11 (patches) and 11i (patches or base depending on the revision) have commercial support for SACK (based on feedback from HP - Thanks!)

Here is some ndd -h output for a few of the settings mentioned above. For those not mentioned, use ndd -h on an HP-UX 11 system, or consult the online manuals at http://docs.hp.com/

# ndd -h ip_pmtu_strategy

ip_pmtu_strategy:

    Set the Path MTU Discovery strategy: 0 disables Path MTU
    Discovery; 1 enables Strategy 1; 2 enables Strategy 2.

    Because of problems encountered with some firewalls, hosts,
    and low-end routers, IP provides for selection of either
    of two discovery strategies, or for completely disabling the
    algorithm. The tunable parameter ip_pmtu_strategy controls
    the selection.

    Strategy 1: All outbound datagrams have the "Don't Fragment"
    bit set. This should result in notification from any intervening
    gateway that needs to forward a datagram down a path that would
    require additional fragmentation. When the ICMP "Fragmentation
    Needed" message is received, IP updates its MTU for the remote
    host. If the responding gateway implements the recommendations
    for gateways in RFC 1191, then the next hop MTU will be included
    in the "Fragmentation Needed" message, and IP will use it.
    If the gateway does not provide next hop information, then IP
    will reduce the MTU to the next lower value taken from a table
    of "popular" media MTUs.

    Strategy 2: When a new routing table entry is created for a
    destination on a locally connected subnet, the "Don't Fragment"
    bit is never turned on. When a new routing table entry for a
    non-local destination is created, the "Don't Fragment" bit is
    not immediately turned on. Instead,

    o  An ICMP "Echo Request" of full MTU size is generated and
       sent out with the "Don't Fragment" bit on.

    o  The datagram that initiated creation of the routing table
       entry is sent out immediately, without the "Don't Fragment"
       bit. Traffic is not held up waiting for a response to the
       "Echo Request".

    o  If no response to the "Echo Request" is received, the
       "Don't Fragment" bit is never turned on for that route;
       IP won't time-out or retry the ping. If an ICMP "Fragmentation
       Needed" message is received in response to the "Echo Request",
       the Path MTU is reduced accordingly, and a new "Echo Request"
       is sent out using the updated Path MTU. This step repeats as
       needed.

    o  If a response to the "Echo Request" is received, the
       "Don't Fragment" bit is turned on for all further packets
       for the destination, and Path MTU discovery proceeds as for
       Strategy 1.

    Assuming that all routers properly implement Path MTU Discovery,
    Strategy 1 is generally better - there is no extra overhead for the
    ICMP "Echo Request" and response. Strategy 2 is available
    only because some routers, or firewalls, or end hosts have been
    observed simply to drop packets that have the DF bit on without
    issuing the "Fragmentation Needed" message. Strategy 2 is more
    conservative in that IP will never fail to communicate when using
    it. [0,2] Default: Strategy 2

# ndd -h tcp_recv_hiwater_def | more

tcp_recv_hiwater_def:

    The maximum size for the receive window. [4096,-]
    Default: 32768 bytes

# ndd -h tcp_xmit_hiwater_def

tcp_xmit_hiwater_def:

    The amount of unsent data that triggers write-side flow control.
    [4096,-] Default: 32768 bytes

HP has detailed networking performance information online, including information about the "netperf" tool and a large database of system performance results obtained with netperf:

http://www.netperf.org/netperf/NetperfPage.html



Procedure for raising network limits on IBM RS/6000 systems under AIX 3.2 or AIX 4.1

RFC1323 options and defaults are tunable via the "no" command.

See the "no" man page for options; additional information is available in the IBM manual AIX Versions 3.2 and 4.1 Performance Tuning Guide, which is available on AIX machines through the InfoExplorer hypertext interface.



Procedure for raising network limits on IBM MVS systems under the Interlink TCP stack

The default send and receive buffer sizes are specified at startup, through a configuration file. The range is from 4K to 1MByte. The syntax is as follows:

FTP and user programs can be configured to use Window Scaling and Timestamps. This is done through the use of SITE commands:


Tuning a Linux 2.4 system

Enabling and disabling some of the advanced features of TCP:
(Usually it is a good idea for these to be enabled)
	/proc/sys/net/ipv4/tcp_timestamps
	/proc/sys/net/ipv4/tcp_window_scaling
	/proc/sys/net/ipv4/tcp_sack 
To enable all these features, for example, do the following as root:

	echo 1 > /proc/sys/net/ipv4/tcp_timestamps
	echo 1 > /proc/sys/net/ipv4/tcp_window_scaling
        echo 1 > /proc/sys/net/ipv4/tcp_sack 

Path MTU discovery can be enabled and disabled using the following boolean sysctl variable:

	/proc/sys/net/ipv4/ip_no_pmtu_disc
Setting this variable to 1 disables Path MTU discovery. When Path MTU discovery is in use, be on the lookout for blackholes, as indicated at the beginning of the document. Please see the following URL for additional documentation on sysctl variables:
http://www.linuxhq.com/kernel/v2.4/doc/networking/ip-sysctl.txt.html
    

Tuning the default and maximum window sizes:

	/proc/sys/net/core/rmem_default   - default receive window
	/proc/sys/net/core/rmem_max       - maximum receive window
	/proc/sys/net/core/wmem_default   - default send window 
	/proc/sys/net/core/wmem_max       - maximum send window

	/proc/sys/net/ipv4/tcp_rmem       - min/default/max TCP receive buffer (3 values)
	/proc/sys/net/ipv4/tcp_wmem       - min/default/max TCP send buffer (3 values)

The following values would be reasonable for a path with a large BDP:


       	echo 8388608 > /proc/sys/net/core/wmem_max 
       	echo 8388608 > /proc/sys/net/core/rmem_max 
       	echo "4096 87380 4194304" > /proc/sys/net/ipv4/tcp_rmem
       	echo "4096 65536 4194304" > /proc/sys/net/ipv4/tcp_wmem

You will find a short description in LINUX-SOURCE_DIR/Documentation/networking/ip-sysctl.txt

If you would like these changes to be preserved across reboots, it may be a good idea to add these commands to your /etc/rc.d/rc.local file.
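
Alternatively (a sketch equivalent to the echo commands above), the same settings can be placed in /etc/sysctl.conf so that they are applied at boot time (or by running "sysctl -p"):

	# /etc/sysctl.conf -- values taken from the example above
	net.core.wmem_max = 8388608
	net.core.rmem_max = 8388608
	net.ipv4.tcp_rmem = 4096 87380 4194304
	net.ipv4.tcp_wmem = 4096 65536 4194304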

Linux users may want to look into the Web100 project. In addition to providing complete instrumentation of the TCP stack, it has features for doing auto tuning of send and receive buffers.

For paths with a very large BDP, it also has some features that will allow you to get better transfers.

User testimonial: With the tuned TCP stacks it was possible to get a maximum throughput between 1.5 - 1.8 Mbit/s via a 2Mbit/s satellite link, measured with netperf.


Information about tuning for MacOS

I don't have detailed information; however, someone pointed me to a good website with useful information. The URL is http://www.sustworks.com/products/prod_ottuner.html. I don't endorse the product they are selling (since I've never tried it). However, it is available for a free trial, and they appear to do an excellent job of describing perf-tune issues for Macs.



Procedure for raising network limits under NetBSD

RFC1323 is on by default in NetBSD 1.1 and above. Under NetBSD 1.2, it can be verified to be on by typing:
      sysctl net.inet.tcp.rfc1323

The maximum socket buffer size can be modified by changing SB_MAX in /usr/src/sys/sys/socketvar.h.

The default socket buffer sizes can be modified by changing TCP_SENDSPACE and TCP_RECVSPACE in /usr/src/sys/netinet/tcp_usrreq.c.

It may also be necessary to increase the number of mbufs, NMBCLUSTERS in /usr/src/sys/arch/*/include/param.h.

Update: It is also possible to set these parameters in the kernel configuration file.

options		SB_MAX=1048576		# maximum socket buffer size
options		TCP_SENDSPACE=65536	# default send socket buffer size
options		TCP_RECVSPACE=65536	# default recv socket buffer size
options		NMBCLUSTERS=1024	# maximum number of mbuf clusters



Procedure for raising network limits under FTP Software (NetManage) OnNet 4.0 for Win95/98

OnNet Kernel has a check box "Enable Satellite tuning" which was intended and tested for a 2Mb satellite link with 600ms delay. This sets the TCP window to 146K.

Many default settings, all of the above and more, may be overridden with registry entries. We plan to make tuning guidelines available at "some future time". The default TCP window may also be set with the Statistics app, which is installed with OnNet Kernel.

The product "readme" discusses changing TCP window size and Initial slow start threshold with the Windows registry.

Statistics also has interesting graphs of TCP/UDP/IP/ICMP traffic. An IPtrace app is also shipped with OnNet Kernel to view unicast/multicast/broadcast traffic (it does not show unicast traffic for other hosts - it does not run in promiscuous mode).



Procedure for raising network limits on SGI systems under IRIX 6.5

Under this version, there are two locations where configuration is done. Although I list the BSD information first, SGI recommends using systune which is described below.

The BSD values are now stored in /var/sysgen/mtune/bsd.

For instance from the file:

* name                  default         minimum   maximum
*
* TCP window sizes/socket space reservation; limited to 1Gbyte by RFC 1323
*
tcp_sendspace                   61440   2048    1073741824
tcp_recvspace                   61440   2048    1073741824

These variables are used similarly to earlier IRIX 5 and 6 versions.

There is now a systune command, which allows you to configure other networking variables. systune keeps track of the changes you make in a file called stune so that you can see them all in one place. Also note that changes made using systune are permanent. Here is a sample of things which can be tuned using systune:

/usr/sbin/systune (which is like sysctl for BSD) is what you use for tuneable values.

 group: net_stp (statically changeable)
        stp_ttl = 60 (0x3c)
        stp_ipsupport = 0 (0x0)
        stp_oldapi = 0 (0x0)

 group: net_udp (dynamically changeable)
        soreceive_alt = 1 (0x1)
        arpreq_alias = 0 (0x0)
        udp_recvgrams = 2 (0x2)
        udp_sendspace = 61440 (0xf000)
        udp_ttl = 60 (0x3c)

 group: net_tcp (dynamically changeable)
        tcp_gofast = 0 (0x0)
        tcp_recvspace = 61440 (0xf000)
        tcp_sendspace = 61440 (0xf000)
        tcprexmtthresh = 3 (0x3)
        tcp_2msl = 60 (0x3c)
        tcp_mtudisc = 1 (0x1)
        tcp_maxpersistidle = 7200 (0x1c20)
        tcp_keepintvl = 75 (0x4b)
        tcp_keepidle = 7200 (0x1c20)
        tcp_ttl = 60 (0x3c)

 group: net_rsvp (statically changeable)
        ps_num_batch_pkts = 0 (0x0)
        ps_rsvp_bandwidth = 50 (0x32)
        ps_enabled = 1 (0x1)

 group: net_mbuf (statically changeable)
        mbretain = 20 (0x14)
        mbmaxpages = 16383 (0x3fff)

 group: net_ip (dynamically changeable)
        tcpiss_md5 = 0 (0x0)
        subnetsarelocal = 1 (0x1)
        allow_brdaddr_srcaddr = 0 (0x0)
        ipdirected_broadcast = 0 (0x0)
        ipsendredirects = 1 (0x1)
        ipforwarding = 1 (0x1)
        ipfilterd_inactive_behavior = 1 (0x1)
        icmp_dropredirects = 0 (0x0)

 group: network (statically changeable)
        netthread_float = 0 (0x0)

 group: inpcb (statically changeable)
        udp_hashtablesz = 2048 (0x800)
        tcp_hashtablesz = 8184 (0x1ff8)

Changes made using systune may or may not require a reboot. This can be easily determined by looking at the 'group' heading for each section of tunables. If the group heading says dynamic, changes can be made on the fly. Group headings labelled static require a reboot.

Finally, the tcp_sendspace and tcp_recvspace can be tuned on a per-interface basis using the rspace and sspace options to ifconfig.

SACK: As of 6.5.7, SACK is included in the IRIX operating system and is on by default.


Procedure for raising network limits under Solaris 2.5

The ndd variable tcp_xmit_hiwat is used to determine the default SO_SNDBUF size.
The ndd variable tcp_recv_hiwat is used to determine the default SO_RCVBUF size.

The ndd variable tcp_max_buf specifies the maximum socket buffer size.

To change them, use:

	ndd -set /dev/tcp tcp_max_buf xxx
	ndd -set /dev/tcp tcp_xmit_hiwat xxx
	ndd -set /dev/tcp tcp_recv_hiwat xxx

(Note: I believe xxx should be specified in bytes)
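
For example (a sketch; the 1MB maximum and 256kB defaults are illustrative values, in bytes):

	ndd -set /dev/tcp tcp_max_buf 1048576
	ndd -set /dev/tcp tcp_xmit_hiwat 262144
	ndd -set /dev/tcp tcp_recv_hiwat 262144
	ndd -get /dev/tcp tcp_max_buf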

The ndd variable ip_path_mtu_discovery controls the use of path MTU discovery. The default value is 1, which means on.

Note that ndd can also be used to increase the number of TCP connections available to a machine.

	ndd -set /dev/tcp tcp_conn_req_max <value>
(where <value> is greater than 32 (the default) but less than or equal to 1024). This may help if your network traffic is comprised of many small streams rather than just a few large streams.

In Solaris 2.6, and also in 2.5 and 2.5.1 with newer TCP patches, the tcp_conn_req_max ndd setting has been removed and split into two new settings:

  tcp_conn_req_max_q 
    default value = 128
    number of connections in ESTABLISHED state
    (3-way handshake completed; not yet accepted)

  tcp_conn_req_max_q0 
    default value =  1024
    number of connections in SYN_RCVD state
SACK is now available in an experimental release for Solaris 2.6. To obtain it, see ftp://playground.sun.com/pub/sack/tcp.sack.tar.Z

Additional info about recent versions of Solaris can be found at http://www.rvs.uni-hannover.de/people/voeckler/tune/EN/tune.html#thp


Details about SACK under Solaris 7

Solaris 7 includes SACK, which is on in "passive" mode by default. That means it is enabled only if the other side sends sackok in the initial SYN. To make it active, set tcp_sack_permitted to 2. The default is 1. To completely disable SACK, set tcp_sack_permitted to 0. The tcp_sack_permitted variable can be set using the ndd command in the same way as the variables described above. Other kernel variables remain the same under Solaris 7 as they were in 2.5.
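
For example (a sketch following the ndd usage shown above):

	ndd -set /dev/tcp tcp_sack_permitted 2
	ndd -get /dev/tcp tcp_sack_permitted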

Misc Info about Windows NT

Editor's note: See Windows 98 below for a detailed description of how this all works. In NT land, the Registry Editor is called regedt32.

Any Registry Values listed appear in:
	HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters

Receive Window
	maximum value = 64kB, since window scaling is not supported
	default value = min( max( 4 x MSS, 
				  8kB rounded up to nearest multiple of MSS),
			     64kB) 
	Registry Value: 
		TcpWindowSize

Path MTU Discovery Variables:
		EnablePMTUDiscovery	(default = enabled)
			turn on/off path MTU discovery
		EnablePMTUBHDetect	(default = disabled)
			turn on/off Black Hole detection

Using Path MTU Discovery:
        EnablePMTUDiscovery     REG_DWORD
	Range: 0 (false) or 1 (true)
	Default: 1

    Determines whether TCP uses a fixed, default maximum transmission unit
    (MTU) or attempts to find the actual MTU. If the value of this entry is
    0, TCP uses an MTU of 576 bytes for all connections to computers outside
    of the local subnet. If the value of this entry is 1, TCP attempts to
    discover the MTU (largest packet size) over the path to a remote host.

Using Path MTU Discovery's "Blackhole Detection" algorithm:
        EnablePMTUBHDetect     REG_DWORD
	Range: 0 (false) or 1 (true)
	Default: 0 

    If the value of this entry is 1, TCP tries to detect black hole routers
    while doing Path MTU Discovery. TCP will try to send segments without
    the Don't Fragment bit set if several retransmissions of a segment go
    unacknowledged. If the segment is acknowledged as a result, the MSS will
    be decreased and the Don't Fragment bit will be set in future packets on
    the connection.

I received the following additional notes about the Windows TCP implementation.

PMTU Discovery. If PMTU is turned on, NT 3.1 cannot cope with routers that have the BSD 4.2 bug (see RFC 1191, section 5). It loops resending the same packet. Only confirmed on NT 3.1.


Procedure for raising network limits under Microsoft Windows 98

New: Some folks at NLANR/MOAT in SDSC have written a tool to guide you through some of this stuff. It can be found at http://moat.nlanr.net/Software/TCPtune/.

Even newer: I've updated some sending window information which was inaccurate. See below.

Several folks have recently helped me to figure out how to accomplish the necessary tuning under Windows98, and the features do appear to exist and work. Thanks to everyone for the assistance! The new description below should be useful to even the complete Windows novice (such as me :-).

Windows98 includes implementations of RFC1323 and RFC2018. Both are on by default. (However, with a default buffer size of only about 8kB, window scaling doesn't do much.)

Windows stores the tuning parameters in the Windows Registry. In the registry are settings to toggle on/off Large Windows, Timestamps, and SACK. In addition, default socket buffer sizes can be specified in the registry.

In order to modify registry variables, do the following steps:

  1. Click on Start -> Run and then type in "regedit". This will fire up the Registry Editor.
  2. In the Registry Editor, double click on the appropriate folders to walk the tree to the parameter you wish to modify. For the parameters below, this means clicking on HKEY_LOCAL_MACHINE -> System -> CurrentControlSet -> Services -> VxD -> MSTCP.
  3. Once there, you should see a list of parameters in the right half of your screen, and MSTCP should be highlighted in the left half. The parameters you wish to modify will probably not appear in the right half of your screen; this is OK.
  4. In the menu bar, Click on "Edit -> New -> String Value". It is important to create the parameter with the correct type. All of the parameters listed below are strings.
  5. A box will appear with "New Value #1"; change the name to the name listed below, exactly as shown. Hit return.
  6. On the menu, click on "Edit -> Modify" (your new entry should still be selected). Then type in the value you wish to assign to the parameter.
  7. Exit the registry editor, and reboot windows. (The rebooting is important, *sigh*.)
  8. When your system comes back up, you should have access to the features you have just turned on. The only real way to verify this is through packet traces (or by noticing a significant performance improvement).

TCP/IP Stack Variables

Support for TCP Large Windows (TCPLW)

Win98 TCP/IP supports TCP large windows as documented in RFC 1323. TCP large windows can be used for networks that have large bandwidth delay products such as high-speed trans-continental connections or satellite links. Large windows support is controlled by a registry key value in:

HKLM\system\currentcontrolset\services\VXD\MSTCP

The registry key Tcp1323Opts is a string value type. The values for Tcp1323Opts are:

Value   Meaning
0       No Window Scaling and no Timestamp options
1       Window Scaling but no Timestamp options
3       Window Scaling and Timestamp options

The default value for Tcp1323Opts is 3: Window Scaling and Timestamp options. Large window support is enabled if an application requests a Winsock socket with buffer sizes greater than 64K. The current default value for the TCP receive window size in Memphis TCP is 8196 bytes. In previous implementations the TCP window size was limited to 64K; this limit is raised to 2**30 through the use of TCP large window support.

Support for Selective Acknowledgements (SACK)

Win98 TCP supports Selective Acknowledgements as documented in RFC 2018. Selective acknowledgements allow TCP to recover from IP packet loss without resending packets that were already received by the receiver. Selective Acknowledgements are most useful when employed with TCP large windows. SACK support is controlled by a registry key value in:

HKLM\system\currentcontrolset\services\VXD\MSTCP

The registry key SackOpts is a string value type. The values for SackOpts are:

Value   Meaning
0       No SACK options
1       SACK option enabled

Support for Fast Retransmission and Fast Recovery

Win98 TCP/IP supports Fast Retransmission and Fast Recovery of TCP connections that are encountering IP packet loss in the network. These mechanisms allow a TCP sender to quickly infer a single packet loss by reception of duplicate acknowledgements for a previously sent and acknowledged TCP/IP packet. This mechanism is useful when the network is intermittently congested. The reception of 3 (default value) successive duplicate acknowledgements indicates to the TCP sender that it can resend the last unacknowledged TCP/IP packet (fast retransmit) and not go into TCP slow start due to a single packet loss (fast recovery). Fast Retransmission and Recovery support is controlled by a registry key value in:

HKLM\system\currentcontrolset\services\VXD\MSTCP\Parameters

The registry key MaxDupAcks is a DWORD taking integer values from 2 to N. If MaxDupAcks is not defined, the default value is 3.

Update: If you wish to set the default receiver window for applications, you should set the following key:

DefaultRcvWindow

HKLM\system\currentcontrolset\services\VXD\MSTCP

DefaultRcvWindow is a string type and its value describes the default receive window size for the TCP stack. Otherwise the window size has to be programmed in applications with setsockopt.
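
Putting these registry values together, a .reg file along the following lines (a sketch; the DefaultRcvWindow value is illustrative) can be merged with regedit instead of creating the entries by hand:

	REGEDIT4

	[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\VxD\MSTCP]
	"Tcp1323Opts"="3"
	"SackOpts"="1"
	"DefaultRcvWindow"="262144"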

For a long time, this page listed a registry variable for setting the default send window. It turns out that there is not in fact such a variable. My limited experience has shown that, in some cases, it is possible to see very large send windows from Microsoft boxes. However, recent reports on the tcpsat mailing list have also stated that a number of applications under Windows severely limit the sending window. These applications appear to include FTP and possibly also the CIFS protocol which is used for file sharing. With these applications, it appears to be impossible to exceed the performance limit dictated by this sending window.

If anyone has any further information on these specific applications under Windows, I would be happy to include it here.


Procedure for raising network limits under Microsoft Windows 2000 and Windows XP

New: The following URL:


    http://rdweb.cns.vt.edu/public/notes/win2k-tcpip.htm

appears to be a pretty good summary of the procedure for TCP tuning under Windows 2000. It also has the URL for the Windows 2000 TCP tuning document from Microsoft.

We are not sure if it is still necessary to set DefaultReceiveWindow even after setting the parameters indicated in the URL above.

If your machine does a lot of large outbound transfers, it will be necessary to set DefaultSendWindow in addition to the suggestions mentioned above.


Matt Mathis <mathis@psc.edu> and Raghu Reddy <rreddy@psc.edu>
(with help from many others, especially Jamshid Mahdavi)

* *

This material is based in whole or in part on work supported by the National Science Foundation under Grant Nos. 9415552, 9870758, 9720674, or 9711091. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

© Pittsburgh Supercomputing Center (PSC), Carnegie Mellon University
URL: http://www.psc.edu/networking/perf_tune.html
Revised: Thursday, 04-Mar-2004 05:30:12 EST