WD-mux-961023

Simple MUX Protocol Specification

W3C Working Draft 23-October-96

This version: http://www.w3.org/pub/WWW/TR/WD-mux-961023
$Id: WD-mux-961023.html,v 1.5 1996/12/09 03:35:09 jigsaw Exp $
Latest version: http://www.w3.org/pub/WWW/TR/WD-mux
Authors: Jim Gettys, World Wide Web Consortium

Status of this document

This is a W3C Working Draft for review by W3C members and other interested parties. It is a draft document and may be updated, replaced or made obsolete by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress." A list of current W3C working drafts is also available.

Note: Since working drafts are subject to frequent change, you are advised to reference the above URL, rather than the URLs for working drafts themselves. This document was not modified for months between March and October of 1996 due to the author's involvement with HTTP/1.1; it is expected to evolve more rapidly until complete.

Abstract

The Internet is suffering from the effects of the HTTP/1.0 protocol, which was designed without an understanding of the underlying TCP transport protocol. HTTP/1.0 opens a TCP connection for each URL retrieved (at a cost of both packets and round trip delays), and then closes the connection. For small HTTP requests, these connections have poor performance due to TCP slow start [Jac88] as well as the round trips required to open the connection. Current TCP implementations discard the congestion information when connections are closed; therefore the slow start algorithm is invoked on each operation.

The widespread current practice of using multiple simultaneous TCP connections compounds HTTP/1.0's misdesign:

A client gains a significant perceived performance advantage from multiple connections, because it retrieves the meta-data (e.g. size) of embedded objects in a page early. This allows a client to format a page sooner without suffering annoying reformatting of the page. Clients which open multiple connections in parallel to the same server, however, can cause self-congestion on heavily congested links, since TCP opens and closes are not themselves congestion controlled.
To keep low bandwidth/high latency links busy, more than one connection has been necessary since slow start may cause the line to be partially idle.
The additional TCP opens cause performance problems in the network, but a client that opens multiple connections simultaneously to the same server may also receive an "unfair" bandwidth advantage in the network relative to clients that use a single connection. This problem is not solvable at the application level; only the network itself can enforce such "fairness".

Persistent connections, part of HTTP/1.1 [HTTP/1.1], will go a long way toward reducing the network traffic and some of the congestion problems caused by the HTTP/1.0 protocol; by themselves, however, they will not succeed, as they address neither the rendering nor the fairness problems described above.

The solution to these problems requires two actions; neither by itself will entirely discourage clients from opening multiple connections to the same server.

Internet service providers should enable the RED [RED] algorithm in their routers to ensure bandwidth fairness to clients when the network is congested. RED also addresses queue length problems observed in routers today.
A multiplexing protocol for use with HTTP (and eventually other protocols) should be developed and deployed, so that multiple objects from a web server can be fetched approximately simultaneously over a single TCP connection, and so that the meta-data for each object can reach the client without waiting behind the rest of the first object requested.

This document describes such an experimental multiplexing protocol. It is designed to multiplex a connection underneath HTTP so that HTTP itself does not have to change, and allow coexistence of multiple protocols (e.g. HTTP and HTTP-NG), which will ease transitions to future Web protocols, and communications of client applets using private protocols with servers over the same connection as the HTTP conversation.

Introduction

This document describes an experimental design for a multiplexing transport, intended for, but not restricted to, use with the World Wide Web. Use of this protocol is EXPERIMENTAL and the protocol is guaranteed to change. In particular, transition strategies to use of the Simple MUX protocol have not been worked out. You have been warned!

Ideas from this design come from Simon Spero's SCP [SCP] description and from experience from the X Window System's protocol design.

Goals:

Unconfirmed service without negotiation.
SCP allows data to be sent with the session establishment; the recipient does not confirm successful connection establishment, but may reject unsuccessful attempts. This simplifies the design of the protocol, and removes the latency required for a confirmed operation.
simple design
performance where critical

There are four issues that make Simon Spero's SCP inadequate for our use:

it has no provision for multiplexing multiple protocols over the same transport connection, essential for graceful transition without dependency on the currently incomplete NG design, and to allow other uses which could use the same multiplexed connection (e.g. applet communication with servlets).
SCP's 8 byte overhead is not reasonable most of the time. MUX uses four bytes in the default case. The design below permits an 8 byte header if you care to preserve 64 bit alignment at the cost of bytes. In practice, there seem to be few data formats or architectures that actually require more than 32 bit alignment.
Without some form of flow control, infinite buffering in clients would be required.
Alignment is preserved in the data stream. This allows compact, high speed (un)marshalling code in implementations of binary protocols, without extra data copies, which in such protocols can be significant overhead.

So far, Mux is similar to SCP. There are some important differences:

allow multiple protocols to be multiplexed over same connection (not available in SCP).
lower overhead than SCP, while preserving data alignment
ability to build a full function socket interface above this protocol.

Other comment on SCP:

SCP has 2²⁴ sessions, which seems highly excessive, and reserves 1024 of them for future use.

The X Window system protocol did not (on day zero) make provisions for objects larger than 2¹⁸ bytes; this became a problem for poly graphics operations and for 3D extensions, particularly in shared memory situations. Large images, and either very fast transport (e.g. FDDI/ATM) or transport via shared memory to a local server encourage essentially unlimited lengths. This multiplexor can be used under all circumstances, even though for low bandwidth, typical sizes will be quite small. If a client wants to always preserve 64 bit alignment, the client can choose to always use the extended form described hereafter.

Mux Header

Mux headers are always in little endian byte order. If people want, we could expand out the union below on a control message type basis (e.g. the way the C bindings to X events were written out...). For this draft, I'm not doing so.

#define MUX_LONG_LENGTH		0x80000000
#define MUX_CONTROL		0x40000000
#define MUX_SYN			0x20000000
#define MUX_FIN			0x10000000
#define MUX_RST 		0x08000000
#define MUX_PUSH		0x04000000
#define MUX_SESSION		0x03FC0000
#define MUX_LENGTH		0x0003FFFF

typedef unsigned int flagbit;
struct w3mux_hdr {
    union {
	struct {
	    flagbit long_length : 1;
	    flagbit control : 1;
	    flagbit syn : 1;
	    flagbit fin : 1;
	    flagbit rst : 1;
	    flagbit push : 1;
	    unsigned int session_id : 8;
	    unsigned int session_fragment_size : 18; 	/* used for other purposes if long_length is set */
	    int long_fragment_size : 32; 		/* only present if long_length is set */
	} data_hdr;
        struct {
	    flagbit long_length : 1;
	    flagbit control : 1;
	    unsigned int control_code : 4;
	    unsigned int session_id : 8;
	    unsigned int session_fragment_size : 18;	/* used for other purposes if long_length is set */
	    int long_fragment_size : 32;		/* only present if long_length is set */
	} control_message;
    } contents;
};
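
Since the layout of C bit-fields is implementation-defined, an implementation may prefer to build the header word with the masks above rather than through the struct. The following is a minimal sketch under two assumptions not stated in the draft: that the masks describe a single 32-bit word which is then stored in little endian order, and that the session field is the 8 bits between the flag bits and the 18-bit length. All smux_ and SMUX_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Flag bits, mirroring the draft's #defines. */
#define SMUX_LONG_LENGTH   0x80000000u
#define SMUX_CONTROL       0x40000000u
#define SMUX_SYN           0x20000000u
#define SMUX_FIN           0x10000000u
#define SMUX_RST           0x08000000u
#define SMUX_PUSH          0x04000000u
#define SMUX_FLAG_MASK     0xFC000000u
/* Assumed: 8 session bits directly above the 18 length bits. */
#define SMUX_SESSION_MASK  0x03FC0000u
#define SMUX_SESSION_SHIFT 18
#define SMUX_LENGTH_MASK   0x0003FFFFu

/* Compose the first 32-bit header word of a data fragment. */
static uint32_t smux_pack(uint32_t flags, unsigned session_id, uint32_t size)
{
    assert(session_id <= 0xFFu && size <= SMUX_LENGTH_MASK);
    return (flags & SMUX_FLAG_MASK)
         | ((uint32_t)session_id << SMUX_SESSION_SHIFT)
         | size;
}

static unsigned smux_session(uint32_t word)
{
    return (unsigned)((word & SMUX_SESSION_MASK) >> SMUX_SESSION_SHIFT);
}

static uint32_t smux_size(uint32_t word)
{
    return word & SMUX_LENGTH_MASK;
}

/* Store and load a header word in little endian order on any host. */
static void smux_store_le32(uint8_t out[4], uint32_t w)
{
    out[0] = (uint8_t)(w & 0xFFu);
    out[1] = (uint8_t)((w >> 8) & 0xFFu);
    out[2] = (uint8_t)((w >> 16) & 0xFFu);
    out[3] = (uint8_t)((w >> 24) & 0xFFu);
}

static uint32_t smux_load_le32(const uint8_t in[4])
{
    return (uint32_t)in[0] | ((uint32_t)in[1] << 8)
         | ((uint32_t)in[2] << 16) | ((uint32_t)in[3] << 24);
}
```

Shifting and masking this way keeps the wire format independent of the compiler's bit-field ordering and of host byte order.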

Alignment

Mux headers are always (at least) 32 bit aligned. To find the next mux header, take the session_fragment_size, and round up to the next 32 bit boundary. If the long_length bit was set, round up to the next 64 bit boundary.
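
The rounding rule can be sketched as follows. This assumes that the fragment size counts payload bytes only, and that the header itself occupies 4 bytes (8 when long_length is set); the draft does not say so explicitly, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to a multiple of align; align must be a power of two. */
static size_t smux_round_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Bytes from the start of one mux header to the start of the next.
 * Assumption: payload_len excludes the header itself. */
static size_t smux_frame_span(size_t payload_len, int long_length)
{
    size_t header = long_length ? 8 : 4;
    size_t align  = long_length ? 8 : 4;
    return header + smux_round_up(payload_len, align);
}
```

Under these assumptions a 5 byte payload with a short header spans 12 bytes (4 of header, 5 of data, 3 of padding), and the same payload with long_length set spans 16.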

Long Fragments

A mux header with the long_length bit set must use the 32 bits following the mux header for the length of this session fragment, rather than that specified in the session_fragment_size.

Clients can also use this bit to force 64 bit alignment of the protocol stream.
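
A receiver might determine the header size and fragment length along these lines. This is a sketch only: it treats long_fragment_size as an unsigned little endian value (the struct above declares it as int), and the smux_parsed type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct smux_parsed {
    uint32_t word;        /* first header word, in host order */
    size_t   header_len;  /* 4, or 8 when long_length is set */
    uint32_t payload_len; /* fragment length in bytes */
};

static uint32_t smux_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 on success, -1 if buf is too short to hold the header. */
static int smux_parse(const uint8_t *buf, size_t len, struct smux_parsed *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 4)
        return -1;
    out->word = smux_le32(buf);
    if (out->word & 0x80000000u) {          /* long_length set */
        if (len < 8)
            return -1;
        out->header_len  = 8;
        out->payload_len = smux_le32(buf + 4);
    } else {
        out->header_len  = 4;
        out->payload_len = out->word & 0x0003FFFFu;
    }
    return 0;
}
```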

Session ID Allocation

Each session is allocated a session identifier. Session identifiers below 2 are reserved for future use. Session IDs allocated by the initiator of the TCP connection are even; those allocated by the receiver of the TCP connection are odd. Proxies or re-multiplexors that do not understand messages on reserved session IDs should forward them unchanged.
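
One simple way to honor the even/odd rule is a counter per side. The sketch below is an assumption about how an implementation might do it; the draft defines no allocator, the names are hypothetical, and reuse of IDs from closed sessions is not handled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical allocator state; not part of the draft. */
struct smux_id_alloc {
    unsigned next; /* next candidate session ID */
};

static void smux_id_init(struct smux_id_alloc *a, int is_initiator)
{
    assert(a != NULL);
    /* IDs 0 and 1 are reserved; even IDs belong to the TCP initiator. */
    a->next = is_initiator ? 2u : 3u;
}

/* Returns the next ID for this side, or -1 once the 8-bit space is used up. */
static int smux_id_next(struct smux_id_alloc *a)
{
    int id;
    assert(a != NULL);
    if (a->next > 255u)
        return -1;
    id = (int)a->next;
    a->next += 2u;
    return id;
}
```

Because the two sides draw from disjoint sets, either end can open a session without first asking the other for an identifier.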

Session Establishment

A session is established by setting the syn and long_length bits in the first message sent on that session. The session_fragment_size field is interpreted as the protocol stack ID of the session, as discussed below. Data can be sent in this initial fragment; the payload is the number of bytes as specified by long_fragment_size.
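
As an illustration, the opening fragment of a session could be assembled as below. The sketch assumes the bit positions implied by the masks in the Mux Header section and little endian storage of both header words; it leaves any trailing padding to the caller. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void smux_put_le32(uint8_t *out, uint32_t w)
{
    out[0] = (uint8_t)(w & 0xFFu);
    out[1] = (uint8_t)((w >> 8) & 0xFFu);
    out[2] = (uint8_t)((w >> 16) & 0xFFu);
    out[3] = (uint8_t)((w >> 24) & 0xFFu);
}

/* Writes an 8 byte header (syn and long_length set, Stack ID in the
 * short length field) followed by the initial data.
 * Returns the bytes written, or 0 if out is too small. */
static size_t smux_open_session(uint8_t *out, size_t cap, unsigned session_id,
                                uint32_t stack_id,
                                const void *data, uint32_t data_len)
{
    uint32_t word;
    assert(out != NULL);
    assert(session_id >= 2 && session_id <= 255); /* 0 and 1 are reserved */
    assert(stack_id <= 0x3FFFFu);                 /* 18 bit field */
    assert(data != NULL || data_len == 0);
    if (cap < 8 || cap - 8 < data_len)
        return 0;
    word = 0x80000000u | 0x20000000u
         | ((uint32_t)session_id << 18) | stack_id;
    smux_put_le32(out, word);
    smux_put_le32(out + 4, data_len);
    if (data_len != 0)
        memcpy(out + 8, data, data_len);
    return 8 + (size_t)data_len;
}
```

For example, opening session 2 with Stack ID 80 and the three bytes "GET" as initial data yields an 11 byte fragment before padding.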

Protocol Stack ID's and Protocol Stack Numbering

Protocols and transports are "stacked" in MUX. This concept comes directly from Xerox's ILU system [ILU]. That is to say, one protocol is used (e.g. HTTP), and one or more transport protocols (e.g. TCP). Transports may be stacked one on top of the other, for example, GZIP compression on top of TCP. So a valid protocol stack might be "TCP|gzip|http", or "TCP|ssl3.0|gzip|http". In this usage, all strings are case insensitive. The string "TCP" is regarded identically with the string "TcP". Note that nonsensical stacks can easily be created.
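
The case insensitive comparison of stack strings might be implemented as in this small sketch, which assumes the strings are plain ASCII; smux_stack_equal is a hypothetical name.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if the two stack strings are equal ignoring case, else 0. */
static int smux_stack_equal(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b; /* equal only if both strings ended together */
}
```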

Protocols can be used over different transports, and transport layers can be stacked. For example, you might want to allow for using HTTP as a protocol, which is then GZIP compressed, and then sent over TCP.

A remaining major problem is the string associated with a protocol or transport. By using the IANA registry for such strings and protocol stack ID's as defined in the list below, we can immediately define most currently used protocols and transports, and automatically pick up any future registrations made via IANA. MUX therefore defines:

Stack ID's in the range of 0x0-0xFFFF are interpreted as "Keyword|TCP" well known port numbers, as registered by IANA in the port number registry [Ports]; Keyword is as defined in the registry.
Stack ID's in the range of 0x10000-0x1FFFF are interpreted as "Keyword|UDP" well known port numbers, as registered by IANA in the port number registry [Ports]; Keyword is as defined in the registry.
Stack ID's in the range of 0x20000-0x200FF are interpreted as "Keyword", as registered by IANA in the protocol numbers registry [Protocols]; Keyword is as defined in the registry.

Stack ID's in the range of 0x20100-0x30000 are assigned as follows:

Address	Keyword	Protocol
0x20100	SMUX	This protocol (MUX is already a used keyword in the IANA registry).
0x20101	GZIP	The GZIP compression algorithm, as defined in [GZIP]
0x20102 - 0x30000	TBD	What additional ones do we want to define now???

Stack ID's in the range of 0x20100-0x3FFFF are allocated by this protocol as new protocols, transports, and stacks are defined.

When Stack ID's from the predefined range are used to define a new protocol/transport stack (see DefineStack control message below), they are merely used to indicate the Keyword to be used in that position of the stack, not the full stack itself.
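
The ranges above translate into a simple classification. The sketch below takes 0x3FFFF, the largest value of the 18 bit field, as the upper bound of the range allocated by this protocol; the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum smux_stack_class {
    SMUX_STACK_TCP_PORT,    /* 0x00000-0x0FFFF: IANA port number over TCP */
    SMUX_STACK_UDP_PORT,    /* 0x10000-0x1FFFF: IANA port number over UDP */
    SMUX_STACK_IP_PROTO,    /* 0x20000-0x200FF: IANA protocol number */
    SMUX_STACK_MUX_DEFINED, /* 0x20100-0x3FFFF: allocated by this protocol */
    SMUX_STACK_INVALID      /* does not fit in the 18 bit field */
};

/* Classifies a Stack ID; *number receives the port or protocol number
 * for the IANA ranges, and the Stack ID itself for MUX-defined IDs. */
static enum smux_stack_class smux_classify(uint32_t id, uint32_t *number)
{
    assert(number != NULL);
    if (id <= 0xFFFFu)  { *number = id;            return SMUX_STACK_TCP_PORT; }
    if (id <= 0x1FFFFu) { *number = id - 0x10000u; return SMUX_STACK_UDP_PORT; }
    if (id <= 0x200FFu) { *number = id - 0x20000u; return SMUX_STACK_IP_PROTO; }
    if (id <= 0x3FFFFu) { *number = id;            return SMUX_STACK_MUX_DEFINED; }
    *number = 0;
    return SMUX_STACK_INVALID;
}
```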

Graceful Release

A session is ended by sending a message with the FIN bit set. Each end of a connection may be closed independently.
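
Independent closing of the two directions can be tracked with one flag per direction. This is a sketch of one possible bookkeeping scheme; the draft does not prescribe any session state, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct smux_half_close {
    int local_fin;  /* we have sent a fragment with fin set */
    int remote_fin; /* the peer has sent a fragment with fin set */
};

static void smux_sent_fin(struct smux_half_close *s)
{
    assert(s != NULL);
    s->local_fin = 1;
}

static void smux_received_fin(struct smux_half_close *s)
{
    assert(s != NULL);
    s->remote_fin = 1;
}

/* We may keep sending until we have sent our own fin. */
static int smux_may_send(const struct smux_half_close *s)
{
    assert(s != NULL);
    return !s->local_fin;
}

/* The session is finished only when both directions are closed. */
static int smux_fully_closed(const struct smux_half_close *s)
{
    assert(s != NULL);
    return s->local_fin && s->remote_fin;
}
```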

Disgraceful Release

A session may be terminated by sending a message with the rst bit set. All pending data for that session should be discarded. "No such protocol" errors detected by the receiver of a new session are signaled to the originator on session creation by sending a message with the RST bit set. (Same as in TCP).

Message Boundaries

A message boundary is marked by sending a message with the PUSH bit set. The boundary is set between the last octet in this message, including that octet, and the first byte of a subsequent message.

Flow Control

Flow control is determined by a simple credit scheme described by the SendCredits control message below. Fragments transmitted must never exceed the outstanding credit for that session. The initial outstanding credit for a session is TBD.
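
On the sending side, the credit scheme could be tracked roughly as follows. The sketch assumes that successive SendCredit grants accumulate, which the draft does not specify, and it follows the SendCredit description in treating a grant of zero as "no limit". Since the initial credit is TBD, the caller supplies it. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct smux_credit {
    int      unlimited; /* set once a grant of zero has been received */
    uint64_t available; /* bytes we may still send on this session */
};

/* Apply a received SendCredit. Assumption: grants accumulate. */
static void smux_credit_grant(struct smux_credit *c, uint32_t granted)
{
    assert(c != NULL);
    if (granted == 0)
        c->unlimited = 1;
    else
        c->available += granted;
}

/* Returns how many of the 'want' bytes may be sent now, and deducts
 * them from the outstanding credit. */
static uint32_t smux_credit_take(struct smux_credit *c, uint32_t want)
{
    uint32_t n;
    assert(c != NULL);
    if (c->unlimited)
        return want;
    n = (want < c->available) ? want : (uint32_t)c->available;
    c->available -= n;
    return n;
}
```

A sender that receives less than it asked for must hold the remaining data until more credit arrives, which is where the deadlock concerns arise.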

We need to think through the possible deadlock scenarios here carefully, and add warnings to implementors to avoid implementation traps, particularly on memory constrained systems...

Control Messages

The control bit and long_length bit of the mux header are always set in a control message. The control_code of the control message determines the control message type. The long_fragment_size determines how much additional data may be contained in the control message. Any unused data in the control message must be ignored. If the long_length bit is set, some control messages may reuse the session_fragment_size field for different purposes than the length.

(There might be more efficient bit packings if the long_length bit is allowed to be zero, at the cost of complexity; is it worth it? I decided to KISS in this draft, but I worry a bit about the length of SendCredit messages. I'll think about this next week when I return. Also, this table is the last thing I tried to define, and contains errors of having been composed at 11:00PM on a day I started work at 9:00AM. Please give opinions and sanity checks.)

The individual control message types are listed below.

control_code	Name	Description
0	DefineString	The session_id is ignored. The session_fragment_size is interpreted as the Stack ID. The fragment contains the string to be defined as an ID. The long_fragment_size contains the length of the string.
1	DefineStack	The session_id is ignored. The session_fragment_size is interpreted as the Stack ID. The fragment contains an array of little endian Stack ID's, the first of which is the new ID being allocated, and the following being the lowest point of the stack first, as in the examples above. (Need example using numeric values)
2	MuxControl	The session_id specifies the session. A session_fragment_size of zero means no limit on the fragment size allowed for this session. This sets a limit on fragment sizes below the outstanding credit limit.
3	SendCredit	The session_id specifies the session. The long_fragment_size specifies the flow control credit granted. A value of zero indicates no limit on how much data may be sent on this session (fragments will only be limited by the MuxControl size).
4-15	-	undefined Reserved for future use.
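
As a concrete case, a SendCredit message might be encoded as in the sketch below. It assumes the control_code occupies the four bits used by syn, fin, rst and push in a data header (bits 29 to 26 of the header word), that the unused session_fragment_size is sent as zero, and that the message carries no data beyond its 8 byte header. It follows the table in using long_fragment_size for the credit. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SMUX_CC_SEND_CREDIT 3u /* control_code from the table above */

static void smux_cc_put_le32(uint8_t *out, uint32_t w)
{
    out[0] = (uint8_t)(w & 0xFFu);
    out[1] = (uint8_t)((w >> 8) & 0xFFu);
    out[2] = (uint8_t)((w >> 16) & 0xFFu);
    out[3] = (uint8_t)((w >> 24) & 0xFFu);
}

/* Encodes an 8 byte SendCredit control message for one session. */
static void smux_build_send_credit(uint8_t out[8], unsigned session_id,
                                   uint32_t credit)
{
    uint32_t word;
    assert(session_id <= 255u);
    word = 0x80000000u                  /* long_length, always set */
         | 0x40000000u                  /* control */
         | (SMUX_CC_SEND_CREDIT << 26)  /* assumed control_code position */
         | ((uint32_t)session_id << 18);
    smux_cc_put_le32(out, word);
    smux_cc_put_le32(out + 4, credit);  /* long_fragment_size = credit */
}
```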

Remaining Issues for Discussion

StackHints control message: It occurs to me that defining a control message to inform the other end of the connection which protocols you are able to handle may save a round trip in the HTTP related cases. For example, if a connection starts out using HTTP, then we might know that we can upgrade to NG as soon as the first reply is back from the server, if it tells the client that it is also NG capable. Then we could start a new session speaking NG, without waiting for an UPGRADE and upgrade acknowledgement round trip. Is this worth defining immediately, or should we wait for experience?
ListStacks control message: For debugging purposes, it might be useful to allow one end to request the other to tell it what stacks it supports (note that this may be involved, as new strings and stacks may need to be defined before it would mean anything to the other end). There are also serious security issues; while a request might always be able to be made, there are circumstances under which it should not be honored. Is this worth defining immediately, or should we wait for experience?
When can MUX be used???: What are the appropriate strategies for determining if the simple multiplexing protocol can be used? Name server hack? UPGRADE in HTTP? Remember that previous UPGRADE to use MUX worked?

Closed Issues from Discussion and Mail

Stacking Protocols and Transports (Stacks)

ILU style protocol stacks are a GOOD THING. But there have been too many worries about the birthday problem for people to be comfortable with Bill Janssen's hashing schemes (see Henrik Frystyk Nielsen and Robert Thau's mail on this topic).

Answer: Resolution (reflected in above draft) is to go with a modification of my original scheme, in concert with control messages allowing new protocols and transport names to be defined and stacked per the ILU scheme. By using the IANA registry idea (modified slightly to depend on the protocol names defined by IANA as strings) we still get the minimum number of bytes my design had for what is believed the current most common case, while allowing for arbitrary stacks of protocols and transports.

Parameters, options and versions on protocol names are left to whoever interprets the protocol or transport name; the complexity of predefining syntax (which might conflict with syntax required by the individual protocol or transport itself) seems to me to be a mistake, so this idea is NOT codified in the specification, but left to upper layers to implement or not at their option.

Byte Usage

Wasting bytes in general, and in particular at connection establishment, for a multiplexing transport must be avoided. There are several reasons for this:

If the initial segment is too long, a network round trip will be lost to TCP slow start, so bytes near the beginning of a conversation MAY BE much more precious than bytes later in the conversation, once slow start overhead has been paid. If the first segment is too long, you fall off a cliff.
Wasted bytes directly affect user-perceived response; no cleverness in later packing and batching of requests can get the time back; each byte adds directly to perceived latency when a user talks to the server for the first time.

So there is more than the usual tension between generality and performance.

Performance analysis

The threshold of human perception is about 30 milliseconds; much beyond this, the user perceives delay. At 14.4 Kbaud, one uncompressed byte costs .55 milliseconds (ignoring modem latencies). On an airplane via telephone today, you get a munificent 4800 baud, which is 3X slower. Cellular modems transmitting data (CDPD), as I understand it, will give us around 20 Kbaud, when deployed.

So basic multiplexing @ 4 byte overhead costs ~ 2 milliseconds on common modems. This means basic overhead is small vs. human perception, for most low speed situations, a good position to be in.

On connection open, with above protocol we send 4 bytes in the setup message, and then must open a session, requiring at least 8 bytes more. 12 bytes == 7 milliseconds at 14.4K. Not 64 bit aligned, and 4 bytes costs of order 2 milliseconds. Ugh... Maybe a setup message isn't a good idea; other uses (e.g. security) can be dealt with by a control message.
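
The figures above follow from simple arithmetic, sketched here under the same simplifications as the text: 8 bits per byte, no compression, and no framing or modem latency. The function name is hypothetical.

```c
#include <assert.h>

/* Milliseconds needed to send 'bytes' over a link of 'bits_per_second',
 * counting 8 bits per byte and ignoring framing and latency. */
static double smux_wire_ms(unsigned bytes, unsigned bits_per_second)
{
    assert(bits_per_second > 0);
    return (double)bytes * 8.0 * 1000.0 / (double)bits_per_second;
}
```

At 14400 bits per second this gives roughly 0.56 ms for one byte, 2.2 ms for a 4 byte header and 6.7 ms for 12 bytes, in line with the rounded values quoted in the text.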

Multiple protocols over one mux

We want to mux multiple protocols simultaneously over the same transport connection, so we need to know what protocol is in use with each session, so the demultiplexor can hand the data to the right recipient (e.g. SUNRPC and DCERPC simultaneously).

There are two obvious ways I can see to do this:

a): Send a control message when a session is first used, indicating the protocol. Disadvantage: costs probably 8 bytes to do so (4 mux overhead, and 4 byte message), and destroys potential 64 bit alignment.
b): If syn is set indicating new session, then steal mux_length field to indicate protocol in use on that session. (overhead; 4 bytes for the mux header used just to establish the session.)

Opinions? Mine is that b) is better than a). Answer: b) is the adopted strategy.

Priority...

For a given stream, priority will affect which session is handled when multiplexing data; sending the priority on every block is unneeded, and would waste bytes. There is one case in which priority might be useful: at an intermediate proxy relaying sessions (and maybe remultiplexing them).

If so, it should be sent only when sessions are established or changed. Changes can be handled by a control message. Opinions?

A priority field can be hacked into the length field with the protocol field using b) above.

So the question is: is it important to send priority at all in this mux protocol? Or should priority control, if needed, be a control message?

Answer: Not in this protocol. Opens Pandora's box with remultiplexors, which could have denial of service attacks.

Setup message

Is any setup message needed? I don't think it is; initial bytes are precious (see performance discussion above), and it complicates trivial use. If we move the byte order flag to the mux header, and use control messages if other information needs to be sent, we can dispense with it, and the layer is simpler. This is my current position, and unless someone objects with reasons, I'll nuke it in the next version of this document.

Answer: Not needed. Nuked.

Byte order flags

While higher layer protocols using host dependent byte order can be a performance win (when sending larger objects such as arrays of data), the overhead at this layer isn't much, and may not be worth bothering with. Worst case (naive code) would be four memory reads and 3 shift overhead/payload. Smart code is one load and appropriate shifts etc.

Opinions? I'm still leaning toward swapping bytes here, but there are other examples of byte load and shift (particularly slow on Alpha, but not much of an issue on other systems).

Answer: Not sufficient performance gain at mux level to be worth doing. Defined as LE byte order for mux headers.

Error handling

There are several error conditions, probably best reported via control messages from server:

No such protocol. Some sort of serial number should be reported, I suppose; this serial number can be implicit as in X
bad message.
Some combinations of flag bits are not legal.
Priority if it exists?

Any others? Any twists to worry about?

Answer: Only error that can occur is no such protocol, given no priority in the base protocol. May still be some unresolved issues here around "Christmas Tree" message (all bits turned on).

Length Field

Any reason to believe that the 32 bit length field for a single payload is inadequate? I don't think so, and I live on an Alpha.

Answer: 32 bit extended length field for a single fragment is sufficient.

Compression

Does there need to be a bit saying the payload is compressed to avoid explosion of protocol types?

Answer: Yes; introduction of control message to allow specification of transport stacks achieves this.

Stacks

I think that we should be able to multiplex any TCP, UDP, or IP protocol. Internet protocol numbers are 8 bit fields.

So we need 16 bits for TCP, one bit to distinguish TCP and UDP, and one bit more we can use for IP protocol numbers and address space we can allocate privately. This argues for an 18 bit length field to allow for this reuse.

18 bit length field
8 bit session field
4 control bits
1 long length bit

The last bit is used to define control messages, which reuse the syn, fin, rst, and push bits as a control_code to define the control message. There are escapes, both by undefined control codes, and by the reservation of two sessions for further use if there needs to be further extensions. The spec above reflects this.

Alignment

Back to alignment. If we demand 4 byte alignment, we waste bytes on every request that does not end up naturally aligned. Two bytes are wasted on average. At 14.4Kbaud the overhead for protocols that do not pad up would on average be 6 bytes or ~3ms, rather than 4 bytes or ~ 2 ms (presuming even distributions of length). Note that this DOES NOT affect initial request latency (time to get first URL), and is therefore less critical than elsewhere.

I have one related worry; it can sometimes be painful to get padding bytes at the end of a buffer; I've heard of people losing by having data right up to the end of a page, so implementations are living slightly dangerously if they presume they can send the padding bytes by sending the 1, 2 or 3 bytes after the buffer (rather than an independent write to the OS for padding bytes).

Alternatively, the buffer alignment requirement can be satisfied by implementations remembering how many pad bytes have to be sent, and adjusting the beginning address of the subsequent write by that many bytes before the buffer where the mux header has been put. Am I being unnecessarily paranoid?
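
A conservative way to avoid reading past the end of the caller's buffer is to supply the pad bytes explicitly. The sketch below copies the payload into an output buffer and appends zero padding up to the next 32 bit boundary; it trades a data copy for safety, and a gather write such as POSIX writev with a separate pad buffer would avoid that copy. The draft does not say what value pad bytes should have, so zero is an assumption. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies len payload bytes to out and pads with zeros to a multiple of 4.
 * Never reads beyond payload[len - 1].
 * Returns the bytes written, or 0 if out is too small. */
static size_t smux_copy_padded(uint8_t *out, size_t cap,
                               const void *payload, size_t len)
{
    size_t pad = (4 - (len & 3)) & 3; /* 0 to 3 pad bytes */
    assert(out != NULL);
    assert(payload != NULL || len == 0);
    if (cap < len || cap - len < pad)
        return 0;
    if (len != 0)
        memcpy(out, payload, len);
    memset(out + len, 0, pad);
    return len + pad;
}
```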

Opinion: I believe alignment of fragments in general is a GOOD THING, and will simplify both the mux transport and protocols at higher levels if they can make this presumption in their implementations. So I believe this overhead is worth the cost; if you want to do better and save these bytes, then start building an application specific compression scheme. If not, please make your case.

Control bits

Are the four bits defined in Simon's flags field what we need? Are there any others?

Answer: no. More bits than we need. Current protocol doesn't use as many. I've ended back at the original bits specified, rather than the smaller set suggested by Bill Janssen. This enables full emulation of all the details of a socket interface, which would not otherwise be possible. See details around TCP and socket handling, discussed in books like "TCP/IP Illustrated," by W. Richard Stevens.

Am I all wet?

Opinion: I believe that we should do this.

Control Messages

Question: do we want/need a short control message? Right now, the out for extensibility is control messages sent in the reserved (and as yet unspecified) control session. This requires a minimum of 8 bytes on the wire. We could steal the last available bit, and allow for a 4 byte short control message, that would have 18 bits of payload.

Opinion: Flow control needs it; protocol/transport stacks need it. Document above now defines some control messages.

Simplicity of default Behavior

The above specification allows for someone who just wants to mux a single protocol to entirely ignore protocol ID's.

Everyone happy?

??????

Glossary

To be supplied

References

[Jac88]: V. Jacobson, "Congestion Avoidance and Control", Proceedings of SIGCOMM '88.
[HTTP/1.0]: T. Berners-Lee, R. Fielding, and H. Frystyk, May 1996. "Hypertext Transfer Protocol - 1.0". Other, more readable versions can be found in Roy Fielding's archives.
[HTTP/1.1]: R. Fielding, J. Gettys, J.C. Mogul, H. Frystyk, T. Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1". HTTP/1.1 is now an IETF proposed standard, but has not yet issued in RFC form due to RFC editor backlog.
[GZIP]: P. Deutsch, "GZIP file format specification version 4.3", RFC 1952, Aladdin Enterprises, May, 1996.
[ILU]: B. Janssen, M. Spreitzer, "Inter-Language Unification"; in particular see the manual section on Protocols and Transports.
[Ports]: Keywords and Port numbers are maintained by IANA in the port-numbers registry.
[Protocols]: Keywords and Protocol numbers are maintained by IANA in the protocol-numbers registry.
[RED]: S. Floyd and V. Jacobson, "Random Early Detection Gateways for Congestion Avoidance," IEEE/ACM Trans. on Networking, vol. 1, no. 4, Aug. 1993.
[SCP]: S. Spero, "Session Control Protocol"
