An Introduction to Distributed Processing
…which emerge when the computations that take place on those
distributed-memory systems are looked at from different perspectives and at different levels
of abstraction.
The remainder of the book is devoted exclusively to message-passing computations over
point-to-point connections. Such computations will be described at the task level, which
clearly can be regarded as encompassing message-passing computations at the processor
level as well. This is so because the latter can be regarded as message-passing
computations at the task level when there is exactly one task per processor and two tasks
only communicate with each other if they run on processors directly interconnected by a
communication link. However, before leaving aside the processor level completely, we find it
convenient to have some understanding of how a group of processors interconnected by
point-to-point connections can support intertask message passing even among tasks that
run on processors not directly connected by a communication link. This is the subject of the
following two sections.
1.2 Communication processors
When two tasks that need to communicate with each other run on processors which are not
directly interconnected by a communication link, the only way to perform that intertask
communication is to somehow rely on processors other than the two running the tasks to
relay the communication traffic as needed. Clearly, then, each processor in the system must,
in addition to executing the tasks that run on it, also act as a relayer of the communication
traffic that neither originates from nor is destined to any of the tasks that run on it.
Performing this additional function is quite burdensome, so it appears natural to somehow
provide the processor with specific capabilities that allow it to do the relaying of
communication traffic without interfering with its local computation. In this way, each
processor in the system can be viewed as actually a pair of processors that run
independently of each other. One of them is the processor that runs the tasks (called the
host processor) and the other is the communication processor. Unless confusion may arise,
in the remainder of the book the denomination simply as a processor will be used to indicate
either the host processor or, as it has been so far, the pair comprising the host processor
and the communication processor.
In the context of computer networks (and, in a similar fashion, networks of workstations as
well), the importance of communication processors was recognized at the very beginning,
not only for the performance-related reasons we indicated, but mainly because, by the very
nature of the services provided by such networks, each communication processor was to
provide services to various users at its site. The first generation of distributed-memory
multiprocessors, however, was conceived without any concern for this issue, but very soon
afterwards it became clear that the communication traffic would be an insurmountable
bottleneck unless special hardware was provided to handle that traffic. The use of
communication processors has been the rule since.
There is a great variety of approaches to the design of a communication processor, and that
depends of course on the programming model to be provided at the task level. If message
passing is all that needs to be provided, then the communication processor has to at least be
able to function as an efficient communication relayer. If, on the other hand, a shared-
memory programming model is intended, either by itself or in a hybrid form that also allows
message passing, then the communication processor must also be able to handle memory-
management functions.
Let us concentrate a little more on the message-passing aspects of communication
processors. The most essential function to be performed by a communication processor is in
this case to handle the reception of messages, which may come either from the host
processor attached to it or from another communication processor, and then to decide where
to send each message next, which again may be the local host processor or another communication
processor. This function per se involves very complex issues, which are the subject of our
discussion in Section 1.3.
Another very important aspect in the design of such communication processors comes from
viewing them as processors with an instruction set of their own, and then the additional issue
comes up of designing such an instruction set so as to provide communication services not only
to the local host processor but in general to the entire system. The enhanced flexibility that
comes from viewing a communication processor in this way is very attractive indeed, and
has motivated a few very interesting approaches to the design of those processors. So, for
example, in order to send a message to another (remote) task, a task running on the local
host processor has to issue an instruction to the communication processor that will tell it to
do so. This instruction is the same that the communication processors exchange among
themselves in order to have messages passed on as needed until a destination is reached.
In addition to simplifying the view of how a communication processor handles the traffic of
point-to-point messages, regarding the communication processor as an
instruction-driven entity has many other advantages. For example, a host processor may
direct its associated communication processor to perform complex group communication
functions and do something else until that function has been completed system-wide. Some
very natural candidate functions are discussed in this book, especially in Chapters 4 and 5
(although algorithms presented elsewhere in the book may also be regarded as such, only at
a higher level of complexity).
1.3 Routing and flow control
As we remarked in the previous section, one of the most basic and important functions to be
performed by a communication processor is to act as a relayer of the messages it receives
by either sending them on to its associated host processor or by passing them along to
another communication processor. This function is known as routing, and has various
important aspects that deserve our attention.
For the remainder of this chapter, we shall let our distributed-memory system be represented
by the connected undirected graph G_P = (N_P, E_P), where the set of nodes N_P is the set of
processors (each processor viewed as the pair comprising a host processor and a
communication processor) and the set E_P of undirected edges is the set of point-to-point
bidirectional communication links. A message is normally received at a communication
processor as a pair (q, Msg), meaning that Msg is to be delivered to processor q. Here Msg
is the message as it is first issued by the task that sends it, and can be regarded as
comprising a pair of fields as well, say Msg = (u, msg), where u denotes the task running on
processor q to which the message is to be delivered and msg is the message as u must
receive it. This implies that at each processor the information of which task runs on which
processor must be available, so that intertask messages can be addressed properly when
they are first issued. Section 1.6 is devoted to a discussion of how this information can be
obtained.
When a processor r receives the message (q, Msg), it checks whether q = r and in the
affirmative case forwards Msg to the host processor at r. Otherwise, the message must be
destined to another processor, and is then forwarded by the communication processor for
eventual delivery to that other processor. At processor r, this forwarding takes place
according to the function next_r(q), which indicates the processor directly connected to r to
which the message must be sent next for eventual delivery to q (that is, (r, next_r(q)) ∊ E_P).
The function next is a routing function, and ultimately indicates the set of links a message
must traverse in order to be transported between any two processors in the system. For
processors p and q, we denote by R(p,q) ⊆ E_P the set of links to be traversed by a message
originally sent by a task running on p to a task running on q. Clearly, R(p,p) = Ø and in
general R(p,q) and R(q,p) are different sets.
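
To make the forwarding rule concrete, here is a minimal Python sketch (the book itself gives no code; the dictionary next_hop and all other names are illustrative assumptions): next_hop[r][q] plays the role of next_r(q), and a message is the pair (q, Msg) with Msg = (u, msg).

```python
def forward(r, message, next_hop):
    """Process a message (q, (u, msg)) arriving at processor r:
    deliver Msg = (u, msg) to the local host processor if q == r,
    otherwise relay it to the neighbor next_r(q)."""
    q, (u, msg) = message
    if q == r:
        print(f"processor {r}: deliver '{msg}' to local task {u}")
        return None                  # delivered; nothing left to relay
    s = next_hop[r][q]               # next_r(q): neighbor of r on the route to q
    print(f"processor {r}: relay toward {q} via link ({r}, {s})")
    return s

# Example: three processors on a line, 0 -- 1 -- 2, with fixed,
# deterministic routing (all values below are illustrative).
next_hop = {0: {1: 1, 2: 1}, 1: {0: 0, 2: 2}, 2: {0: 1, 1: 1}}
r, message = 0, (2, ("u", "hello"))
while r is not None:
    r = forward(r, message, next_hop)
```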
Routing can be fixed or adaptive, depending on how the function next is handled. In the fixed
case, the function next is time-invariant, whereas in the adaptive case it may be time-
varying. Routing can also be deterministic or nondeterministic, depending on how many
processors next can be chosen from at a processor. In the deterministic case there is only
one choice, whereas the nondeterministic case allows multiple choices in the determination
of next. Pairwise combinations of these types of routing are also allowed, with adaptivity and
nondeterminism being usually advocated for increased performance and fault-tolerance.
Advantageous as some of these enhancements to routing may be, not many adaptive or
nondeterministic schemes have made it into practice, and the reason is that many difficulties
accompany those enhancements at various levels. For example, the FIFO (First In, First
Out) order of message delivery at the processor level cannot be trivially guaranteed in the
adaptive or nondeterministic cases, and then neither can it at the task level; that is,
messages sent from one task to another may end up delivered in an order different from the
order they were sent. For some applications, as we discuss for example in Section 5.2.1, this
would complicate the treatment at the task level and most likely do away with whatever
improvement in efficiency one might have obtained with the adaptive or nondeterministic
approaches to routing. (We return to the question of ensuring FIFO message delivery among
tasks in Section 1.6.2, but in a different context.)
Let us then concentrate on fixed, deterministic routing for the remainder of the chapter. In this
case, and given a destination processor q, the routing function next_r(q) does not lead to any
loops (i.e., by successively moving from processor to processor as dictated by next until q is
reached it is not possible to return to an already visited processor). This is so because the
existence of such a loop would either require at least two possibilities for the determination
of next_r(q) for some r, which is ruled out by the assumption of deterministic routing, or
require that next be allowed to change with time, which cannot be under the assumption of
fixed routing. If routing is deterministic, then another way of arriving at this loop-free property
of next is to recognize that, for fixed routing, the sets R of links are such that R(r,q) ⊆ R(p,q)
for every processor r that can be obtained from p by successively applying next given q. The
absence of loops comes as a consequence. Under this alternative view, it becomes clear
that, by building the sets R to contain shortest paths (i.e., paths with the least possible
numbers of links) in the fixed, deterministic case, the containments for those sets appear
naturally, and then one immediately obtains a routing function with no loops.
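
As a hedged illustration of this last observation, the sketch below builds such a fixed, deterministic routing function from shortest paths by breadth-first search; the adjacency-dict representation of G_P is an assumption of the sketch, not something prescribed by the text.

```python
from collections import deque

def shortest_path_routing(adj):
    """Build next_hop[r][q] from shortest paths in the undirected
    graph G_P, given as an adjacency dict {node: [neighbors]}.
    Because every r on a shortest path from p to q then satisfies
    R(r, q) ⊆ R(p, q), the resulting routing function has no loops."""
    next_hop = {r: {} for r in adj}
    for q in adj:                        # one BFS per destination q
        dist = {q: 0}
        frontier = deque([q])
        while frontier:
            v = frontier.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    next_hop[w][q] = v   # first shortest hop from w toward q
                    frontier.append(w)
    return next_hop

# Example: a 4-cycle 0-1-2-3-0.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(shortest_path_routing(adj)[0][2])  # neighbor of 0 on a shortest path to 2
```

Taking the first parent found by the search makes the choice of next unique, so the routing obtained this way is deterministic as well as fixed.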
Loops in a routing function refer to one single end-to-end directed path (i.e., a sequence of
processors obtained by following next_r(q) from r = p for some p and fixed q), and clearly
should be avoided. Another related concept, that of a directed cycle in a routing function, can
also lead to undesirable behavior in some situations (to be discussed shortly), but cannot be
altogether avoided. A directed cycle exists in a routing function when two or more end-to-end
directed paths share at least two processors (and sometimes links as well), say p and q, in
such a way that q can be reached from p by following next_r(q) at the intermediate r's, and so
can p from q by following next_r(p). Every routing function contains at least the directed cycles
implied by the sharing of processors p and q by the sets R(p,q) and R(q,p) for all p, q ∈ N_P. A
routing function containing only these directed cycles does not have any end-to-end directed
paths sharing links in the same direction, and is referred to as a quasi-acyclic routing
function.
Another function that is normally performed by communication processors, and goes closely
along with that of routing, is the function of flow control. Once the routing function next has been
established and the system begins to transport messages among the various pairs of
processors, the storage and communication resources that the interconnected
communication processors possess must be shared not only by the messages already on
their way to destination processors but also by other messages that continue to be admitted
from the host processors. Flow control strategies aim at optimizing the use of the system's
resources under such circumstances. We discuss three such strategies in the remainder of
this section.
The first mechanism we investigate for flow control is the store-and-forward mechanism.
This mechanism requires a message (q,Msg) to be divided into packets of fixed size. Each
packet carries the same addressing information as the original message (i.e., q), and can
therefore be transmitted independently. If these packets cannot be guaranteed to be
delivered to q in the FIFO order, then they must also carry a sequence number, to be used
at q for the re-assembly of the message. (However, guaranteeing the FIFO order is a
straightforward matter under the assumption of fixed, deterministic routing, so long as the
communication links themselves are FIFO links.) At intermediate communication processors,
packets are stored in buffers for later transmission when the required link becomes available
(a queue of packets is kept for each link).
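
A small sketch of this packetization step (packet size and field layout are invented for illustration): each packet repeats the destination q and carries a sequence number so that q can reassemble the message even when delivery is not FIFO.

```python
def packetize(q, msg, size=4):
    """Split Msg into fixed-size packets, each carrying the
    destination q and a sequence number for reassembly at q."""
    return [(q, seq, msg[i:i + size])
            for seq, i in enumerate(range(0, len(msg), size))]

def reassemble(packets):
    """Reassemble the message at q, tolerating out-of-order arrival."""
    return "".join(body for _, _, body in sorted(packets, key=lambda p: p[1]))

packets = packetize(7, "store-and-forward", size=4)
assert reassemble(reversed(packets)) == "store-and-forward"
```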
Store-and-forward flow control is prone to the occurrence of deadlocks, as the packets
compete for shared resources (buffering space at the communication processors, in this
case). One simple situation in which this may happen is the following. Consider a cycle of
processors in G_P, and suppose that one task running on each of the processors in the cycle
has a message to send to another task running on another processor on the cycle that is
more than one link away. Suppose in addition that the routing function next is such that all
the corresponding communication processors, after having received such messages from
their associated host processors, attempt to send them in the same direction (clockwise or
counterclockwise) on the cycle of processors. If buffering space is no longer available at any
of the communication processors on the cycle, then deadlock is certain to occur.
This type of deadlock can be prevented by employing what is called a structured buffer pool.
This is a mechanism whereby the buffers at all communication processors are divided into
classes, and whenever a packet is sent between two directly interconnected communication
processors, it can only be accepted for storage at the receiving processor if there is buffering
space in a specific buffer class, which is normally a function of some of the packet's
addressing parameters. If this function allows no cyclic dependency to be formed among the
various buffer classes, then deadlock is ensured never to occur. Even with this issue of
deadlock resolved, the store-and-forward mechanism suffers from two main drawbacks. One
of them is the latency for the delivery of messages, as the packets have to be stored at all
intermediate communication processors. The other drawback is the need to use memory
bandwidth, which seldom can be provided entirely by the communication processor and has
then to be shared with the tasks that run on the associated host processor.
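
One classical choice for the buffer-class function, shown in the sketch below, indexes classes by the number of hops a packet has already taken; since this number only grows along a route, no cyclic dependency among classes can form. The hop-count scheme is an assumption made here for illustration; the text itself leaves the function unspecified.

```python
class BufferPool:
    """Buffers partitioned into classes indexed by hop count. A packet
    that has taken k hops may only occupy a buffer of class k, so
    dependencies among classes go strictly upward and no cycle
    (hence no buffer deadlock) can form."""
    def __init__(self, classes, slots_per_class):
        self.free = [slots_per_class] * classes

    def try_accept(self, hops):
        if hops < len(self.free) and self.free[hops] > 0:
            self.free[hops] -= 1     # reserve a slot in class 'hops'
            return True
        return False                 # refuse; the sender keeps the packet

    def release(self, hops):
        self.free[hops] += 1

pool = BufferPool(classes=5, slots_per_class=2)
print(pool.try_accept(hops=0))       # True: slot reserved in class 0
```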
The potentially excessive latency of store-and-forward flow control is partially remedied by
the second flow-control mechanism we describe. This mechanism is known as circuit
switching, and requires an end-to-end directed path to be entirely reserved in one direction
for a message before it is transmitted. Once all the links on the path have been secured for
that particular transmission, the message is then sent and at the intermediate processors
incurs no additional delay waiting for links to become available. The reservation process
employed by circuit switching is also prone to the occurrence of deadlocks, as links may
participate in several paths in the same direction. Portions of those paths may form directed
cycles that may in turn deadlock the reservation of links. Circuit switching should, for this
reason, be restricted to those routing functions that are quasi-acyclic, which by definition
pose no deadlock threat to the reservation process.
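
The sketch below illustrates the reservation step of circuit switching in its simplest all-or-nothing form (an assumption made here for clarity): every link on the end-to-end path must be secured before the message is sent, and nothing is held if any link is unavailable.

```python
def reserve_path(path, busy):
    """Try to reserve every directed link on 'path' (a list of
    processors); 'busy' is the set of links already reserved.
    All-or-nothing: either the whole circuit is secured or nothing is."""
    links = list(zip(path, path[1:]))
    if any(link in busy for link in links):
        return False                 # some link unavailable; retry later
    busy.update(links)               # circuit established
    return True

busy = set()
print(reserve_path([0, 1, 2, 3], busy))  # True: circuit 0->1->2->3 reserved
print(reserve_path([2, 3, 0], busy))     # False: link (2, 3) already in use
```

Securing the links atomically, rather than one at a time while holding earlier ones, is one way to sidestep the reservation deadlocks just mentioned.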
Circuit switching is obviously inefficient for the transmission of short messages, as the time
for the entire path to be reserved becomes then prominent. Even for long messages,
however, its advantages may not be too pronounced, depending primarily on how the
message is transmitted once the links are reserved. If the message is divided into packets
that have to be stored at the intermediate communication processors, then the gain with
circuit switching may be only marginal, as a packet is only sent on the next link after it has
been completely received (all that is saved is then the wait time on outgoing packet queues).
It is possible, however, to pipeline the transmission of the message so that only very small
portions have to be stored at the intermediate processors, as in the third flow-control
strategy we describe next.
The last strategy we describe for flow control employs packet blocking (as opposed to
packet buffering or link reservation) as one of its basic paradigms. The resulting mechanism
is known as wormhole routing (a misleading denomination, because it really is a flow-control
strategy), and, in contrast with the previous two strategies, the basic unit on which flow
control is performed is not a packet but a flit (flow-control digit). A flit contains no routing
information, so every flit in a packet must follow the leading flit, where the routing information
is kept when the packet is subdivided. With wormhole routing, the inherent latency of store-
and-forward flow control due to the constraint that a packet can only be sent forward after it
has been received in its entirety is eliminated. All that needs to be stored is a flit, significantly
smaller than a packet, so the transmission of the packet is pipelined, as portions of it may be
flowing on different links and portions may be stored. When the leading flit needs access to a
resource (memory space or link) that it cannot have immediately, the entire packet is
blocked and only proceeds when that flit can advance. As with the previous two
mechanisms, deadlock can also arise in wormhole routing. The strategy for dealing with this
is to break the directed cycles in the routing function (thereby possibly making pairs of
processors inaccessible to each other), then add virtual links to the already existing links in
the network, and then finally fix the routing function by the use of the virtual links. Directed
cycles in the routing function then become "spirals", and deadlocks can no longer occur.
(Virtual links are in the literature referred to as virtual channels, but channels will have in this
book a different connotation—cf. Section 1.4.)
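
A toy sketch of the flit mechanics (flit size, field names, and the single-link pipeline are all invented for illustration): only the leading flit carries the destination, and when the next resource is unavailable the whole train of flits simply stops in place.

```python
from collections import deque

def to_flits(q, msg, flit_size=2):
    """Subdivide a packet into flits: only the leading flit carries
    the routing information (the destination q); body flits carry
    none and must follow the leader."""
    return deque([("HEAD", q)] + [("BODY", msg[i:i + flit_size])
                                  for i in range(0, len(msg), flit_size)])

def step(link_free, in_transit, arrived):
    """One pipeline step over a single link: if the link is free one
    flit crosses; otherwise the whole packet blocks in place."""
    if link_free and in_transit:
        arrived.append(in_transit.popleft())

packet, received = to_flits(9, "wormhole"), []
for link_free in [False, True, True, True, True, True]:
    step(link_free, packet, received)
print(received)   # header flit first, then the body flits that crossed
```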
In the case of multiprocessors, the use of communication processors employing wormhole
routing for flow control tends to be such that the time to transport a message between nodes
directly connected by a link in G_P is only marginally smaller than the time spent when no
direct connection exists. In such circumstances, G_P can often be regarded as being a
complete graph (cf. Section 2.1, where we discuss details of the example given in Section
1.6.2).
To finalize this section, we mention that yet another flow-control strategy has been proposed
that can be regarded as a hybrid strategy combining store-and-forward flow control and
wormhole routing. It is called virtual cut-through, and is characterized by pipelining the
transmission of packets as in wormhole routing, and by requiring entire packets to be stored
when an outgoing link cannot be immediately used, as in store-and-forward. Virtual cut-
through can then be regarded as a variation of wormhole routing in which the pipelining in
packet transmission is retained but packet blocking is replaced with packet buffering.
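
A one-function sketch of this behavioral contrast (names and representation are purely illustrative): on a blocked outgoing link, wormhole routing leaves the flit train in place across the network, whereas virtual cut-through stores the whole packet locally and frees the resources behind it.

```python
def on_block(strategy, packet_flits, local_buffer):
    """React to an unavailable outgoing link under each strategy.
    Wormhole: the packet blocks in place, still occupying buffers
    along its route. Virtual cut-through: the entire packet is
    buffered at the current processor, as in store-and-forward."""
    if strategy == "wormhole":
        return "blocked in place"
    if strategy == "virtual cut-through":
        local_buffer.extend(packet_flits)   # whole packet stored locally
        return "buffered locally"
    raise ValueError(strategy)

buffer = []
print(on_block("virtual cut-through", ["HEAD", "BODY1", "BODY2"], buffer))
```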
1.4 Reactive message-passing programs
So far in this chapter we have discussed how message-passing systems relate to
distributed-memory systems, and have outlined some important characteristics at the
processor level that allow tasks to communicate with one another by message passing over
point-to-point communication channels. Our goal in this section is to introduce, in the form of
a template algorithm, our understanding of what a distributed algorithm is and of how it
should be described. This template and some of the notation associated with it will in Section
2.1 evolve into the more compact notation that we use throughout the book.
We represent a distributed algorithm by the connected directed graph G_T = (N_T, D_T), where
the node set N_T is a set of tasks and the set of directed edges D_T is a set of unidirectional
communication channels…