Infiniband is a fairly recent development: the specification was finished in
June 2001, and from 2002 on a number of vendors have started to offer products
based on the Infiniband standard. A very complete description (1200 pages) can
be found in [29]. Like cLAN [12], Infiniband can be used to connect various
system components within a system. Via Host Channel Adapters (HCAs) the
Infiniband fabric can be used for interprocessor networks, for attaching I/O
subsystems, or for connecting to multi-protocol switches such as Gbit Ethernet
switches.
Because of this versatility, the market is not limited to the interprocessor
network segment, and so Infiniband is expected to become relatively inexpensive
as higher sales volumes are realised. The characteristics of Infiniband are
attractive: there are product definitions for both copper and glass fiber
connections, switch and router properties are defined, and multiple connections
can be combined for higher bandwidth. The way messages are broken up into
packets and reassembled, as well as routing, prioritising, and error handling,
are also all described in the standard. This makes Infiniband independent of a
particular technology and, because of its completeness, a good basis on which
to implement a communication library (like MPI).
Conceptually, Infiniband knows two types of connectors to the system
components: the Host Channel Adapters (HCAs) already mentioned, and Target
Channel Adapters (TCAs). The latter are typically used to connect to I/O
subsystems, while HCAs concern us more as these are the connectors used in
interprocessor communication. Infiniband defines a basic link speed of 2.5 Gb/s
(312.5 MB/s) but also 4× and 12× speeds of 1.25 GB/s and 3.75 GB/s,
respectively. In addition, HCAs and TCAs can have multiple independent ports,
which allows for higher reliability and speed.
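
The relation between the three quoted link speeds can be checked with a few
lines of arithmetic. The sketch below assumes, as the figures in the text
suggest, that a 4× or 12× link is simply that many basic 2.5 Gb/s links taken
together and that 1 MB = 10^6 bytes; the function name is illustrative only.

```python
BASE_LINK_GBIT_PER_S = 2.5  # basic (1x) link speed quoted in the text


def raw_bandwidth_mbyte_per_s(width: int) -> float:
    """Raw bandwidth in MB/s of a link made up of `width` basic links."""
    if width not in (1, 4, 12):
        raise ValueError("the text mentions only 1x, 4x and 12x links")
    # Gb/s -> Mb/s (times 1000) -> MB/s (8 bits per byte)
    return BASE_LINK_GBIT_PER_S * width * 1000 / 8


for width in (1, 4, 12):
    print(f"{width:2d}x link: {raw_bandwidth_mbyte_per_s(width):7.1f} MB/s")
```

These are raw signalling rates; the bandwidth seen by an application is lower.
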
Messages can be sent on the basis of Remote Direct Memory Access (RDMA) from
one HCA/TCA to another: an HCA/TCA is permitted to read/write the memory of
another HCA/TCA. This enables very fast transfer once permission and a
write/read location are given. A port together with its HCA/TCA provides a
message with a 128-bit header which is IPv6 compliant and that is used to
direct it to its destination via cut-through wormhole routing: in each
switching stage the routing to the next stage is decoded and the message is
sent on. Short messages of 32 B can be embedded in control messages, which cuts
down on the negotiation time.
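
The idea that a transfer can proceed directly once permission and a location
have been given can be illustrated with a toy model. This is a conceptual
sketch only: the class `ToyAdapter`, its methods, and the access key are
hypothetical and do not correspond to any real Infiniband interface.

```python
import secrets


class ToyAdapter:
    """Hypothetical stand-in for an HCA/TCA; not a real Infiniband API."""

    def __init__(self, size: int) -> None:
        self.memory = bytearray(size)
        self._windows = {}  # access key -> (start, length) of exposed window

    def grant(self, start: int, length: int) -> int:
        """Give permission for memory[start:start+length]; return the key."""
        if start < 0 or length < 0 or start + length > len(self.memory):
            raise ValueError("window lies outside the adapter memory")
        key = secrets.randbits(64)
        self._windows[key] = (start, length)
        return key

    def _check(self, key: int, offset: int, nbytes: int) -> None:
        """Reject accesses without a valid key or outside the window."""
        if key not in self._windows:
            raise PermissionError("no permission was given for this key")
        start, length = self._windows[key]
        if offset < start or offset + nbytes > start + length:
            raise PermissionError("access outside the permitted window")

    def remote_write(self, key: int, offset: int, data: bytes) -> None:
        """Write into this adapter's memory on behalf of a remote adapter."""
        self._check(key, offset, len(data))
        self.memory[offset:offset + len(data)] = data

    def remote_read(self, key: int, offset: int, nbytes: int) -> bytes:
        """Read from this adapter's memory on behalf of a remote adapter."""
        self._check(key, offset, nbytes)
        return bytes(self.memory[offset:offset + nbytes])


# The target gives permission once; afterwards the initiator reads and
# writes the exposed window without further negotiation.
target = ToyAdapter(64)
key = target.grant(start=16, length=8)
target.remote_write(key, 16, b"payload!")
print(target.remote_read(key, 16, 8))
```

In the model, the one-time `grant` call plays the role of the negotiation,
after which every access only needs a cheap bounds and key check.
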
Infiniband switches for HPC are offered with 8 to 128 ports and always at a
speed of 1.25 GB/s. The switches can be configured in any desired topology, but
in practice a fat tree topology is almost always preferred. How much of the raw
speed can be realised obviously depends on the quality of the MPI
implementation put on top of the Infiniband specifications. A Ping-Pong
experiment on an Infiniband-based cluster has shown a bandwidth of around
850 MB/s and an MPI latency of < 7 µs for small messages. The in-switch latency
is typically about 200 ns.
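
To see how the two Ping-Pong figures interact, one can use a simple
first-order model in which the transfer time is a fixed latency plus the
message size divided by the asymptotic bandwidth. This model is an assumption
made here for illustration, not part of the measurement; it takes 7 µs (the
quoted upper bound) as the latency and 1 MB = 10^6 bytes.

```python
LATENCY_S = 7e-6            # upper bound on the MPI latency quoted in the text
BANDWIDTH_B_PER_S = 850e6   # Ping-Pong bandwidth quoted in the text


def transfer_time_s(nbytes: int) -> float:
    """Modelled one-way transfer time: latency plus size over bandwidth."""
    return LATENCY_S + nbytes / BANDWIDTH_B_PER_S


def effective_bandwidth_mb_per_s(nbytes: int) -> float:
    """Bandwidth actually seen for a message of `nbytes` bytes, in MB/s."""
    return nbytes / transfer_time_s(nbytes) / 1e6


def half_performance_length_bytes() -> float:
    """Message size at which half the asymptotic bandwidth is reached."""
    return LATENCY_S * BANDWIDTH_B_PER_S


for nbytes in (32, 1024, 65536, 4 * 1024 * 1024):
    print(f"{nbytes:8d} B: {effective_bandwidth_mb_per_s(nbytes):7.1f} MB/s")
print(f"half-performance length: {half_performance_length_bytes():.0f} B")
```

Under these assumptions messages of a few kilobytes reach only about half of
the quoted bandwidth, so for small messages the latency dominates.
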
Presently, the price per port is still somewhat higher than that of Myrinet,
the market leader. However, should Infiniband catch on, the price could drop
significantly and Infiniband could become a serious competitor for Myrinet.
Because other component connection standards like PCI-X and PCI Express are
also making their mark, it is not clear at the moment what the impact of
Infiniband will be in the long run.