IEEE 1355 (ISO/IEC 14575) Rationale

Summary

This rationale starts by identifying the need for a new standard. Aspects of parallel interconnect are discussed under the headings parallel interconnect, routing, wormhole routing and flow control. Comparisons are drawn with telecommunication systems and local area networks. This rationale concludes with the description of the constraints and consequent trade-offs considered appropriate for parallel systems interconnect, the scope of this standard.


1 Need for a New Standard

It is widely recognized today that the most economical way to build high performance systems is by using parallelism. Parallel systems can provide very high computational power (in machines such as the CM-5 and the Parsytec GC machine), fast response (for transaction processing or distributed control), very large I/O throughput (in RAID systems) and extremely high reliability (in redundant fault tolerant systems). They also have the potential to be more maintainable and more expandable than conventional, monolithic systems.

The construction of high performance systems with parallel processing and/or parallel I/O demands a fast, low cost, low latency interconnect. It must be fast and low latency, otherwise it will be the limiting factor in system performance, and it must be low cost, or else it will dominate the system cost. It must also scale well in both performance and cost relative to the system size, otherwise highly parallel systems will be limited in performance or too expensive. Existing standards do not meet these criteria, either because they are designed for communication over long distances (which incurs high costs), or because they aim at the extreme of currently achievable performance (which again increases costs), or because they are based on a restricted model such as a bus, which limits overall performance and scalability.

The purpose of this new interconnect standard is to enable high performance, scalable, modular, parallel systems to be constructed at low cost, where 'cost' includes not only the price of components but also the engineering effort required to use them successfully. This international standard specifies the physical connectors and cables to be used, the electrical properties of the interconnect, and a cleanly separated set of logical protocols to perform the interconnection in the simplest possible way.

2 Aspects of a Parallel Systems Interconnect

2.1 Parallel Interconnect

How can the requirements of performance, scalability and low cost be met? Since these are no different from the requirements of the whole parallel system, they can be satisfied in the same way: by allowing many instances of a cheap component to operate concurrently. Thus the interconnect should consist of many separate connections operating simultaneously to give a high aggregate performance. Provided each connection can be utilised at a reasonable level, its raw performance need not be very high, allowing both component and engineering costs to be kept down. For maximum simplicity, modularity and fault tolerance, each connection should be point to point.

The requirement of low cost implies that, at the very least, a connection to the interconnect can be implemented with a relatively small amount of circuitry in a non-exotic technology. Ideally such a connection could be integrated onto a chip with a processor or other device to minimize costs. The requirement of a small amount of circuitry implies that protocols must be simple and require minimal buffering. At the system level, cost limits the number of signal connections which can be used. Too large a number would, for example, make integration with a processor impractical, limit the maximum number of connections available on a device, and lead to skew control problems at the system level.

2.2 Routing

To connect many devices together, it is not possible to provide a direct, physical connection between every pair. Nor is it acceptable to connect only certain pairs unless the connection pattern happens to match that required by the application. To enable every pair of devices to communicate without necessarily having a direct connection, data must be routed in some way through intermediate nodes.

In order for every pair of devices to be able to communicate, data must be routed in different ways at different times. This can be achieved either by configuring intermediate nodes to make each connection before the data is sent, or by making the data self-routing so that it contains within it information which determines which way it should be routed. Configuring the nodes in advance of the data requires that the destination of the data must be supplied to a central controller, which increases latency, and creates the danger that the controller will become a system bottleneck. Self-routed systems enable the control function to be distributed, and hence to scale in performance with the size of the system.
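Self-routing can be illustrated with a minimal sketch. The four-node network, the port numbering, and the convention that each node consumes one header byte to select its output port are all assumptions made for illustration; none of them is taken from the standard.

```python
# Hypothetical network: each node maps an output port number to a neighbour.
# The topology and names are illustrative assumptions only.
NETWORK = {
    "A": {0: "B", 1: "C"},
    "B": {0: "D", 1: "C"},
    "C": {0: "D"},
    "D": {},
}


def route(start, header, payload):
    """Follow a self-routing packet through the network.

    Each node reads one header byte, treats it as an output port number,
    and forwards the remainder. No central controller is consulted.
    Returns the list of nodes visited and the delivered payload.
    """
    node = start
    path = [node]
    remaining = list(header)
    while remaining:
        port = remaining.pop(0)      # local decision taken from the packet itself
        node = NETWORK[node][port]   # raises KeyError if the port does not exist
        path.append(node)
    return path, payload


path, data = route("A", [0, 0], b"hello")
print(path, data)
```

Because every routing decision uses only information carried in the packet, adding nodes adds routing capacity instead of loading a shared controller.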

A piece of data together with its associated routing information is called a packet. If a packet is sent directly from sender to receiver it is not necessary for the length of the packet to be represented within it; it is only necessary to ensure that the sender and receiver agree on the length. However, where packets pass through intermediate nodes, these nodes must be able to determine the length of each packet passing through, so that they can tell when the task of routing it is complete. The requirement of low latency implies that packets must be limited in length, since otherwise connections could be occupied indefinitely by long packets. Since for many purposes short messages are required, the overhead on each packet must be small, and the packet size must be variable. This requires that the protocol provides an indication of the packet length, such as a termination marker or an initial length count.
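The termination-marker option can be sketched as follows. The `EOP` token and the function names are hypothetical; the sketch only shows that a node can find packet boundaries in a stream without knowing any length in advance, at a fixed overhead of one token per packet.

```python
EOP = object()  # hypothetical end-of-packet token, distinct from any data byte


def frame(header, payload):
    """Build a packet as a token stream: header bytes, payload bytes, then EOP."""
    return list(header) + list(payload) + [EOP]


def split_packets(stream):
    """Split a token stream into packets without knowing lengths in advance."""
    packets, current = [], []
    for token in stream:
        if token is EOP:
            # The marker tells the node that this packet is complete.
            packets.append(bytes(current))
            current = []
        else:
            current.append(token)
    return packets


stream = frame(b"\x01", b"hi") + frame(b"\x02", b"a much longer message")
print(split_packets(stream))
```

An initial length count would serve the same purpose, but it requires the sender to know the length before transmission begins and fixes a maximum packet size through the width of the count field.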

2.3 Wormhole Routing

In most serial packet switching networks each intermediate node inputs a packet, decodes the header, and then forwards the packet to the next node. This is undesirable for a parallel systems interconnect for two reasons:

a. It requires storage in each node for transmitted packets, which either limits the capacity of the node or requires a separate memory (which increases costs).

b. It causes potentially long delays between the output of a packet and its reception, because each node waits for the whole packet to be received before starting re-transmission, thereby increasing latency.

A more suitable approach is wormhole routing, in which only the header of the packet is initially read in by the node. The routing decision is taken, the header is output, and the rest of the packet is sent directly from the input to the output without being stored in the node. This means that a packet can be passing through several nodes at the same time, and the head of the packet may be received by the destination before the whole packet has been transmitted by the source. Thus this method can be thought of as a form of dynamic circuit switching, in which the header of the packet, in passing through the network, creates a temporary circuit (the 'wormhole') through which the data flows. As the tail of the packet is pulled through, the circuit vanishes. The transmission of a single packet may thus be pipelined through a series of devices.

Note that, as far as the senders and receivers of packets are concerned, the wormhole routing is invisible. Its only effect is to minimize the latency in the message transmission. If one or more intermediate nodes were to store and forward the packet it would still be delivered correctly. Wormhole routing has the further advantage that it is independent of the packet length. In a store and forward system, the maximum packet size must be determined in advance so that buffering can be provided; if few packets are of this size, then the extra buffering is largely unused. In addition, independence from the packet length is desirable because it achieves a clean separation of layers of the protocol.
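The latency difference between the two schemes can be estimated with a simple model. The sketch below assumes an uncontended path, identical links, a negligible routing decision time, and lengths measured in tokens; the function names and the example figures are illustrative, not values from the standard.

```python
def store_and_forward_latency(packet_len, hops, token_time):
    """Each of the `hops` links carries the whole packet before the next starts."""
    return hops * packet_len * token_time


def wormhole_latency(packet_len, header_len, hops, token_time):
    """Each intermediate node delays the packet only by the header time.

    With `hops` links there are `hops - 1` intermediate nodes; once the
    header has passed, the body is pipelined across all links at once.
    """
    return (hops - 1) * header_len * token_time + packet_len * token_time


# Example: a 1000-token packet with a 1-token header crossing 4 links.
print(store_and_forward_latency(1000, 4, 1))
print(wormhole_latency(1000, 1, 4, 1))
```

In this model the store and forward latency grows with the product of path length and packet length, whereas the wormhole latency grows with their sum.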

2.4 Flow Control

Guaranteeing that physical connections will be available for each stage of a packet's journey requires global information about the state of the system. Accumulating global information is time consuming and inherently non-scalable. Thus in a low latency, scalable system, the possibility exists that the header of a packet will be input by a node but be unable to proceed because the required output is already in use. However, the body of the packet will still be passing through previous nodes. There are then three possibilities:

a. The incoming packet body is buffered in the node where the packet is stalled;

b. The incoming data is discarded;

c. A flow control mechanism stops the flow of data.

The first of these options is a return to store and forward routing, with the disadvantage of requiring buffer resources in each routing node, which increases cost and destroys the packet length independence of the routing. The second option is undesirable because it forces the end equipment to engage in complex protocols to deal with the possibility of data loss. In addition, once part of a packet has been lost the packet is probably useless and may as well be entirely destroyed. A system using this scheme could then easily degenerate into a state where most of the connections were carrying packets to the point of their destruction! Clearly it would be preferable to propagate information about a stall back along the path of the packet, which brings us to the third option, to implement a system of flow control.

Since we do not wish to provide buffering for an entire packet, the flow control system must be capable of stalling the flow of data part way through a packet. This implies that it operates on a level below that of packets. It is the flow of the sub-units of which packets are composed which must be controlled. With this scheme, when a packet is unable to proceed, data may continue to move until all buffering along the path of the packet is filled. The flow control mechanism must then ensure that data movement ceases so that no buffers are overwritten. This means that all links which are still occupied by the packet will be idle; however, this is an improvement on the previous scenario in which they would be busy moving data to the point at which it is discarded.
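The behaviour described above can be sketched with a small simulation. The three-stage path, the two-token buffers, and the rule that a token advances only when the next buffer has free space are assumptions chosen for illustration; the standard's actual flow control mechanism is not modelled here.

```python
from collections import deque


def step(source, buffers, capacity, sink, sink_blocked):
    """Advance the path by one time step with token-level flow control.

    A token moves forward only if the next stage has free space, so no
    buffer is ever overwritten and no token is discarded.
    """
    # Work from the sink end backwards so a token moves one stage per step.
    if buffers[-1] and not sink_blocked:
        sink.append(buffers[-1].popleft())
    for i in range(len(buffers) - 1, 0, -1):
        if buffers[i - 1] and len(buffers[i]) < capacity:
            buffers[i].append(buffers[i - 1].popleft())
    if source and len(buffers[0]) < capacity:
        buffers[0].append(source.popleft())


def run(packet, stages=3, capacity=2, blocked_steps=10, total_steps=40):
    """Send `packet` along the path; the final output is busy at first."""
    source, sink = deque(packet), []
    buffers = [deque() for _ in range(stages)]
    for t in range(total_steps):
        step(source, buffers, capacity, sink, sink_blocked=(t < blocked_steps))
    return sink, source, buffers


# While the output is blocked, the buffers fill and then all movement stops.
sink, source, buffers = run(list(range(10)), total_steps=10)
print(sink, list(source), [list(b) for b in buffers])

# Once the output is released, the whole packet arrives intact and in order.
sink, source, buffers = run(list(range(10)))
print(sink)
```

The stalled links sit idle during the blockage, but the sender never has to retransmit anything and the buffering needed in each node is independent of the packet length.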

3 Comparison with Other Types of System

In this section we consider how the requirements for a parallel system interconnect compare with those of other types of communications system.

3.1 Telecommunications Systems

The principal feature of digital telecommunications systems is that they operate over long distances. This means that the actual, physical medium of communication is large and expensive; most of the attributes of such systems are a consequence of this. Because the medium (be it an ordinary cable, optical fibre or satellite) is expensive, it is worthwhile using expensive end equipment to extract the most from it. It is justifiable to push the basic operating frequency to a level at which the bit-error rate (BER) is non-negligible and to compensate for the errors with very sophisticated protocols, if this results in a net improvement in data rate. Protocols may be further complicated by the need to perform a number of functions via the same medium, since in general there will not be an alternative connection (and if there were it would not be economic to dedicate it only to functions which are used infrequently, however important they may be).

The extended nature of the medium makes it vulnerable to noise and transient interruption. Thus information may be lost; if fine-grained flow control were employed, for example, it would frequently fail because the flow control information itself was lost. This implies that flow control must be coarse-grained, and so large buffers must be provided. Since packets may be lost, they must be buffered for re-transmission until positive confirmation of reception is received. This may in turn cause further packets to be lost if the buffers of a forwarding node are full. Thus telecommunications systems are intrinsically 'lossy' and higher level protocols must compensate for this.

3.2 Local Area Networks

LANs share some of the attributes of telecommunications systems, but the balance of cost between the physical medium and the end equipment is quite different. Because the cost of the medium is still significant, however, LANs work on the principle that a number of pieces of end equipment effectively share the same medium (and hence share its bandwidth). BERs are lower, but the protocols used can still be quite complicated in order to organize the sharing of the medium.

To economize on wiring, the most popular physical configuration of LANs is a ring. Rings may be linked together via 'bridges'; this improves the bandwidth because the bandwidth of each ring is shared between the active users connected to it, whilst separate rings joined by bridges may operate concurrently. It also improves the fault tolerance, since the failure of one ring can be isolated from the others. However the overall level of concurrency in the interconnect is low.

4 Parallel Systems Interconnect

For communications in a parallel system, the constraints are rather different. When connecting devices on a board or in a cardcage, the actual cost of the wires involved is negligible. However, the pins and board area consumed by wide, parallel connections and the difficulty of correctly routing them are not negligible, and this leads to a preference for serial communications.

Since the cost of the copper tracks is not significant, it is not necessary to maximize the data rate on each one. Nor is it essential to use a complex protocol to perform a variety of different functions on a single connection, since the cost of providing separate connections for, for example, configuration, control or monitoring purposes need not be prohibitive.

Within a system with limited physical dimensions and reasonably clean electrical properties it is entirely feasible to make connections with extremely low BERs (10^-15 or less). Thus connections can be assumed reliable, and complicated systems of acknowledgements, negative acknowledgements and retries can be avoided. Moreover, since the system can be designed and controlled as a whole, each part can be relied upon to obey strict rules, and need not behave defensively with respect to other parts of the system, allowing further simplifications. Although in such systems it is possible to make communications synchronous with a common clock, this can create serious problems of high speed clock distribution and should be avoided to keep costs down.
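To give a feel for what a BER of this order means in practice, the mean time between bit errors on a single link can be estimated. The 100 Mbit/s link rate used below is an assumed figure chosen only for the arithmetic, and the function name is hypothetical.

```python
def mean_seconds_between_errors(bit_error_rate, bits_per_second):
    """Average time between bit errors on one continuously busy link."""
    return 1.0 / (bit_error_rate * bits_per_second)


# Assumed example: BER of 1e-15 on a link running at 100 Mbit/s.
seconds = mean_seconds_between_errors(1e-15, 100e6)
print(f"{seconds:.3g} seconds, or about {seconds / 86400:.0f} days")
```

Under these assumptions a fully loaded link sees one bit error roughly every ten million seconds, which is why simple error detection is enough and retry protocols can be left out of the lower layers.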

In such systems, the cost of the end equipment for each connection is very significant. This is because at least one such end must be provided for each device, and hence contributes to the cost of the node. For reasons of both cost and performance it is very advantageous to be able to integrate the end equipment with the device chip(s). Thus it is preferable not to require sensitive analogue circuitry, exotic technology (GaAs etc.), or large amounts of logic. Replicating a simple piece of end equipment is more cost effective than providing a complex one with maximum performance.

High bandwidth and low latency are essential for the communications to be matched with the speed of the devices. In particular it is important to provide a high total bandwidth when many devices need to communicate simultaneously, and for the latency of each communication to be moderate even under high load. Fault tolerance is highly desirable, and these requirements together imply that the communication should be performed via a number of independent, concurrently operating connections rather than one higher speed connection.
