Gentlefolk, As promised, here is the material regarding methods for allocating and reallocating isochronous resources. Much of this will be a re-hash for many of you, but please bear with me ... There are some additional rantings, and this is rather more clearly stated than my remarks at the last meeting, too. In discussing the reallocation of isochronous resources following a bus reset, the IP1394 folks covered some interesting ground, learning a bit about how things work, and how they should work. Regrettably, these are not the same, so Mike Teener and I have exchanged some ideas regarding how to deal with the situation. First, some analysis of the situation. Clauses 8.4.3.1, "Bandwidth allocation," and 8.4.3.2, "Channel Allocation," of the 1394-1995 standard describe the procedure for allocating isoch bandwidth and channels. To allocate bandwidth a node must use a quadlet read to obtain the current contents of the BANDWIDTH_AVAILABLE register, determine whether sufficient allocation units are available for the application, subtract the number of allocation units required (including gaps, arbitration, data prefix and end, etc.), then perform a compare-swap lock passing the contents of BANDWIDTH_AVAILABLE previous read as the arg_value, and the updated number of allocation units remaining as the data_value. If the lock transaction fails, the procedure is retried from the start, while if it completes but arg_value does not equal the returned old_value (meaning the allocation failed) the procedure may be retried using the returned old_value, without re-reading the register. Channels are allocated in like manner, with similar retry procedures. This procedure really doesn't even resemble a conventional semaphore. For a conventional semaphore, a requestor reserves a resource by means of a test-and-set lock operation (or something similar) on the semaphore. If the test shows the resource wasn't reserved, the requestor now "owns" the resource, performs any needed operations on or with it, then releases the resource by clearing the semaphore. Otherwise, the requestor "spins" on the semaphore until it gains control of it. Instead, 1394 requires a node to examine the resource, then try to change it without gaining control of the resource first. In other words, a semaphore gates access to a resource before operations on the resource can be performed, while 1394 gates the operation on the actual final manipulation of the resource. Further, a semaphore operation is atomic, while the 1394 mechanism is a two-part operation, where the separate phases may be widely separated in time. If two nodes attempt to allocate either a channel or bandwidth at the same time, a successful allocation by one node may cause the allocation attempt of the other to fail, since the contents of the register in question may have changed, and thus not match the arg_value passed in the second nodes lock transaction request. Further, if the node containing the affected register is very busy and cannot adequately queue the requests, both the quadlet read and compare_swap lock transactions may fail due to outstanding requests in process (ack_busy). In the worst possible case, with several nodes attempting allocations concurrently, such behavior could lead to a complete failure to allocate any resources, as each attempt at allocation fails and is reinitiated with a quadlet read, keeping the IRM eternally busy with a storm of reads preventing the compare_swap lock transactions from successfully completing. Implementing these registers in hardware (or requiring that transactions to be responded to via concatenated responses/unified transactions) greatly diminishes the risk of this occurring, but does not eliminate that risk, as other activities involving the target node may result in its being busy. Since both the bandwidth and channel allocation processes are subject to this behavior, the risk to quality of service is increased. OHCI requires the registers in question be in hardware, but we can't rely on these registers being in hardware on all nodes, just as we can't count on a PC being attached to all buses. Clauses 8.3.2.3.7, "BANDWIDTH_AVAILABLE register," and 8.3.2.3.8, "CHANNELS_AVAILABLE register," require only that these registers be implemented on IRM and BM capable nodes (and their capabilities), not the method of their implementation. If the IRM does not implement these registers such that requests to these registers are completed as unified transactions, the potential for such degenerate behavior exists, even if it is unlikely. During initial allocation of isoch resources, under normal operating conditions, there is VERY little chance of such problems developing. Channel and bandwidth allocations by different nodes and applications are generally well separated in time, and what allocation "collisions" may occur are unlikely to involve more than two nodes. However, following a bus reset, there is a one second protected period during which nodes that had previously owned resources may continue to use those resources and reclaim them, while new allocations may not be made. (See clauses 8.4.2.2, "Prior isochronous traffic (cable environment)," 8.4.2.4, "Reallocation of prior isochronous resources (cable environment)," and 8.4.2.8, "Allocation of new isochronous resources (cable environment).") Having all nodes that owned resources attempt reallocations in a short period of time can lead to problems of the kind described, as the likelihood of having several nodes concurrently attempting allocations is dramatically increased. Worsening the problem is the fact that heavy isochronous traffic will reduce the asynchronous bandwidth available for resource reallocations during that one second interval, potentially by as much as 80%. Even assuming that the real implementations of these registers eliminate the worst-case scenarios described, we can still have some atrocious behavior. Suppose N nodes try to allocate C channels, where C is greater than or equal to N. If each node tries to allocate all the channels it needs at once, but all nodes attempt this at once, we may have N reads of CHANNELS_AVAILABLE, followed by N compare_swap lock transactions, only the first of which can succeed in allocating channels. The remaining allocating nodes then use the returned old_value for the next pass, and again attempt to allocate channels concurrently, with only one succeeding. Repeating the process until all nodes have allocated the channels they need, we end up with N + N(N+1)/2 = N(N+3)/2 transactions. For N = 8, this is 44. For N = 16, it is 152. Now, if the channels are allocated one at a time (as a "large OS vendor" has indicated will be their standard method), with nodes having more than one channel to allocate succeeding in the allocation of these "extra" channels first, we have C + (C-N)N + N(N+1)/2 = C(N+1) + N(1-N)/2 transactions. For N = 8 and C = 16, that is 116 transactions. For N = 12 and C = 20, 194 transactions, and for N = 16 and C = 32, 424 transactions. Now, assume that all the nodes allocating are "simple," in that they treat both failures to allocate and failures of transactions the same, and always read the CHANNELS_AVAILABLE register before each attempt to allocate a channel. (This is left to the implementer by clauses 8.4.3.1 and 8.4.3.2, which state that rereading the register is "unnecessary," but don't direct that it not be performed.) We have instead 2(C-N)N + N(N+1) = (2C+1)N - N**2, transactions. For N = 8 and C = 16, that's 200 transactions, while for N = 12 and C = 20, 348 transactions, and N = 16 and C = 32, 784 transactions. Now, what happens if we merge two buses with isochronous traffic? Assuming that neither IRM retains that role (which is something of a stretch, since both should have their contender bits set), all isochronous streams will be halted, and neither channels nor bandwidth will be allocated for a period of one second following the reset. (How the nodes of one bus will know that the IRM of their bus retained that role in the merged bus is another problem.) Then, everyone tries to allocate their needed resources at once, and we can have the same worst-case scenarios as for the reallocations described above. If one of the IRMs retains that role, we may have as many channel collisions as the number of channels on the "lesser" bus, and potentially an isochronous interval that's too long. Depending on precisely how these conditions are handled, we may see all or most of the isochronous traffic stopped until the 1 second protected period has expired, and again see the possibility of a worst-case scenario. Even if these behaviors are unlikely, they are possible. I believe we should eliminate them, or at least push them farther into the corner cases, and thereby decrease some threats to 1394's quality of service. As we move to p1394b long-haul based home networks, where a reset on the home "backbone" can affect quality of service for all 1394 activities in the home, I think doing so verges on the imperative. So, how do we do it? When allocating bandwidth, a node really isn't concerned with how much bandwidth remains after the allocation, or even with how much there was before the allocation, except that it is sufficient for the application's purposes. Likewise, when allocating a channel, changes in the availability of other channels are of no interest to the node trying to allocate a specific channel. If, then, we can practically eliminate the number of retries, we should do much towards eliminating these problems. In the case of allocating channels, the individual bits in the CHANNELS_AVAILABLE register behave exactly like one bit semaphores, and we could, if permitted, reliably allocate and deallocate individual and groups of channels using the mask_swap lock transaction. The new procedure might be as follows. 1. Read the CHANNELS_AVAILABLE register to determine which channels are presently available. 2. Perform a mask_swap lock transaction on the CHANNELS_AVAILABLE register, passing an arg_value with bit(s) corresponding to the channel(s) to be allocated set to one, and all others bits cleared to zero, and a data_value of zero for all bits. 3. If the transaction fails, go to step 2 to retry the allocation. 4. If the transaction completes, check the bit(s) in old_value corresponding to the channel(s) to be allocated. If the bit(s) is(are) clear (zero), the allocation of the channel(s) failed, since the channel had been previously allocated. For any unallocated channels, use the returned old_value as the contents of CHANNELS_AVAILABLE, and go to step 2. If we had a reasonably good "hashing" and sequencing function to select "first candidate" and following channels for allocation by each node, we could potentially dispense with the initial read, with a good statistical expectation of completing a channel allocation in a single transaction. When reallocating a previously owned channel (following a bus reset) we know which channel to attempt to reallocate ... We're already using it, provided the IRM did not change due to the reset. (Failure to reallocate the same channel forces the old resource owner to disable channel use by the talker and listeners, and it must wait until after the one second protected period following the reset expires before allocating a different channel.) Thus, worst-case behavior for reallocation of channels after a reset would take only N transactions (reallocating all a node's channels at once), or C transactions (allocating one channel at a time), provided we are not merging two buses. For new channel allocations, there are at least two reasonably good mechanisms for selecting a first-candidate channel: use a number based on the Physical ID of the node, or on the node's GUID. We don't know how random the GUIDs will really be, but these would seem to be better distributed within the 64 available channel numbers than the Physical IDs, which are always sequential starting at zero. Any of a number of mechanisms for sequencing through channel numbers might be used to select candidate channels after the first allocation, but simply incrementing the six bit first candidate channel number should work. In any case, such a "blind grab" will serve just as well as a read to obtain the contents of the CHANNELS_AVAILABLE register, and has a real chance of actually succeeding ... So, scratch the quadlet read. The resulting worst case (where all allocating nodes start out trying to allocate the same channel, all at the same time, and increment through the nodes serially in subsequent attempts) is exceedingly unlikely (and only exists when using the GUID based candidate selection method, or following multiple allocations by all involved nodes), and requires at worst N(N+1)/2 or (C-N)N + N(N+1)/2 = (C+1/2)N - N**2/2 transactions, depending on per-node or per-channel allocation. Either way, this is both better than was previously the case, and less likely to occur. As for the process of allocating bandwidth, if we had a "bounded_allocate" lock transaction that behaved as (new_value = (data_value <= old_value) ? (old_value - data_value) : old_value), a node attempting to allocate bandwidth could simply try to allocate what it needs, and compare the returned old_value with the passed data_value to determine whether the allocation succeeded or not. (If old_value < data_value, the allocation failed.) Both initial allocation of bandwidth and reallocation following a reset become much simpler, and take N or C transactions, depending on how channel bandwidth is allocated. Unfortunately, clauses 8.3.2.3.7 and 8.3.2.3.8 both contain text stating, "Special transaction limitations: Only the quadlet read and lock compare swap transactions shall be supported," and we don't even have a "bounded_allocate" lock transaction. We cannot just change the way we do things and declare victory. Also, existing hardware, and that conforming to p1394a, is somewhat problematic, since they cannot support the new methods. Further, in 'b' nodes we would be adding some new complexities. So, is it worth the effort to make the changes needed to enable these new methods? It is really only necessary to show one important application where the methods described are of significant benefit and likely to be used, while not incurring excessive costs, in order to justify providing them. A pivotal application where quality of service is critical, and we are likely to find a significant number of isochronous and asynchronous streams, is the well-developed 1394 home "network" of the future. In such a domain, with numerous audio/video and simple audio streams, as well as IP broadcast and multicast streams, a significant interruption of service is undesirable. As the home becomes more "1394 centered" for communications, entertainment, security, etc., tolerance of significant periods of connectivity "chaos" declines, and efficient restoration of services following a reset becomes essential. This is, then, a good candidate for an application where the methods presented provide real benefits. I am confident other environments may be easily found. As for the added cost of the described methods, others will need to determine this, but I believe it to be quite limited. The compare-swap operations currently used require some compare logic, and the logic needed to support the assorted bounded subtract can be used to provide much of the compare functionality. Procedurally, the state machine changes are additive but simple, since a direct method that may be used by 'b' nodes to perform allocations would be to attempt allocation using the new methods, and switch to the old methods if an error is encountered. A minor point to consider while dealing with changes of this nature, is the deallocation of channels and bandwidth is that it is possible to do these deallocations a bit more efficiently, since we know with absolute certainty what (as node) hold as resources, and thus are even less concerned with the activities of other nodes relative to our deallocations. The mask_swap lock transaction works just as well here, and requires no prior read, and a simple add lock transaction of appropriate type is more than adequate to deallocate bandwidth without a prior read. This doesn't patch over any significant flaw I can think of off-hand, but it still represents a savings, and should at least be considered. So, I propose several things: 1. That p1394b require all IRM and BM capable nodes respond to all request transactions to the BANDWIDTH_AVAILABLE and CHANNELS_AVAILABLE registers as concatenated responses (unified transactions), 2. That p1212r define a "bounded_allocate" or "bounded_subtract" lock transaction that behaves as "new_value = (data_value <= old_value) ? (old_value - data_value) : old_value," and allocate the presently reserved extended tCode 0 (zero) for that purpose (which will require some liaison work), 3. That p1394b revise clause 8.3.2.3.7, "BANDWIDTH_AVAILABLE," to require support of the proposed bounded_allocate lock transaction, in addition to the quadlet read and compare_swap lock transactions, 4. That p1394b revise clause 8.3.2.3.7 to require support of fetch_add lock transactions, in addition to the other required transactions types, 5. That p1394b revise clause 8.3.2.3.8, "CHANNELS_AVAILABLE," to require support of mask_swap lock transactions, in addition to the quadlet read and compare_swap lock transactions, 6. That p1394b define a candidate-channel selection process based upon the physical ID, GUID or other suitable characteristic of an allocating node, with suitable next-candidate sequencing algorithm, 7. That p1394b permit and recommend that nodes allocate and deallocate channels by means of a mask_swap lock transaction, provided the method is applicable (meaning the target IRM/BM supports it), 8. That p1394b permit and recommend that nodes allocating channels by the above method do so without a prior read of the CHANNELS_AVAILABLE register, provided the proper candidate-channel selection and sequencing method is used, as well as permit and recommend deallocating channels without prior read, using mask_swap lock transactions, where applicable, 9. That p1394b permit and recommend that nodes allocate bandwidth by use of the bounded_allocate lock transaction, without a prior read of the BANDWIDTH_AVAILABLE register, provided the method is applicable, and 10. That p1394b permit and recommend that nodes deallocating bandwidth do so using the fetch_add lock transaction, provided the method is applicable. Which is at least a starting point for some discussion. Sincerely, Richard Churchill, Adv. Portable PC Arch., Compaq Computer Corporation, P.O. Box 692000, ms120303, Houston, TX, 77269-2000, 20555 St. Hwy. 249, ms120303, Houston, TX, 77070, richard.churchill@compaq.com, phone: (281)514-6984, fax: (281)518-5324. [ To receive daily digests of this list send the one line message: ] [ "subscribe digest p1394b" to requests@zayante.com ]