Chip Watson [*]
Thomas Jefferson National Accelerator Facility
12000 Jefferson Avenue, Newport News, VA 23606
Many computational science problems require thousands of processors to achieve an acceptable time to solution. The cost of these computers is a constraint, so solutions such as clusters, which lower cost while achieving high performance, are of great interest. Many applications, among them lattice quantum chromodynamics, can exploit communications on multiple links in a regular grid while tolerating modest link latency. Nearest neighbor communications predominate, so a fully switched (full bisection bandwidth) network is not required.
Many science applications require solving complex equations describing a system under study. These equations are discretized onto a regular or irregular grid and solved by iterative or Monte Carlo techniques. Time to solution is reduced by distributing the problem across a large number of processors, with clusters being a very cost-effective platform. This distribution necessitates interprocessor communication, and this communication limits the scaling of the problem to arbitrarily large clusters.
Quantum chromodynamics (QCD) is a particularly challenging theory describing the strong force. QCD calculations today are done on computers at the scale of hundreds of gigaflop/s, with many calculations requiring months of running time. The immediate demand is to support calculations at the scale of teraflop/s-years, moving to 10 teraflop/s-years within the next few years. In practical terms, it is essential to move to multiple machines of more than 1,000 processors each. At this scale, cost is a major factor, and commodity or semi-commodity clusters are an attractive solution.
The discretized problem, Lattice QCD, is dominated by the inversion of large, sparse matrices distributed across a regular 4-dimensional grid representing space-time. The sparse nature of the matrices implies that the inversions involve primarily nearest neighbor communication. This problem is ideally suited to implementation on a grid-based machine in which each grid node is responsible for a sub-grid of space-time.
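The sparse, nearest-neighbor structure can be illustrated with a minimal sketch (Python with NumPy, purely illustrative and not any production Lattice QCD code): a simplified hopping-term operator in which each site of a 4D lattice couples only to its 8 nearest neighbors, so a matrix-vector product needs only neighbor data. The operator form and the coupling constant `kappa` are assumptions for the example.

```python
# Illustrative sketch: the nearest-neighbor coupling that makes
# Lattice QCD matrices sparse.  Each 4D lattice site couples only to
# its 8 neighbors (+/- in each of 4 dimensions), so applying the
# matrix requires only adjacent-site data.
import numpy as np

def hopping_apply(x, kappa=0.1):
    """Apply a simplified nearest-neighbor operator D = 1 - kappa*H,
    where H sums the 8 neighboring sites (periodic boundaries).
    In a distributed implementation, the np.roll shifts are exactly
    where 3D hypersurface (face) data would be exchanged between
    nodes."""
    h = np.zeros_like(x)
    for axis in range(4):               # the four space-time axes
        h += np.roll(x, +1, axis=axis)  # forward neighbor
        h += np.roll(x, -1, axis=axis)  # backward neighbor
    return x - kappa * h

lat = np.ones((4, 4, 4, 4))             # tiny 4^4 test lattice
out = hopping_apply(lat)
# For a constant field every site sees 8 identical neighbors,
# so each output value is 1 - 8*kappa (approximately 0.2 here).
```

Because each output site depends only on adjacent sites, the inter-node data volume per iteration is the surface of the local sub-grid, not its volume, which is what makes a mesh interconnect a good match.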
Each processor in the cluster operates on a regular 4D sub-grid (hypercube), communicating 3D hypersurface data (a 3D face) to adjacent processors. One or more dimensions can be collapsed onto a single processor or onto multiple processors in a node, reducing the number of adjacent neighbors for external communications. For example, if 3 dimensions are distributed onto the cluster, then each computer node has 6 nearest neighbors.
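The decomposition above can be made concrete with a short arithmetic sketch (Python, with assumed example sizes not taken from the paper): distributing d of the 4 dimensions gives 2d nearest neighbors, and each exchanged face is the 3D hypersurface of the local volume in that direction.

```python
# Illustrative arithmetic for the 4D decomposition (example sizes and
# the bytes-per-site value are assumptions for this sketch).

def mesh_comm(local_dims, distributed_axes, bytes_per_site=24 * 8):
    """Neighbor count and per-direction face size in bytes.
    bytes_per_site assumes a spinor of 24 real numbers in double
    precision (a common Lattice QCD choice; an assumption here)."""
    neighbors = 2 * len(distributed_axes)
    face_bytes = {}
    for ax in distributed_axes:
        sites = 1
        for i, n in enumerate(local_dims):
            if i != ax:
                sites *= n          # product of the other 3 extents
        face_bytes[ax] = sites * bytes_per_site
    return neighbors, face_bytes

# Example: an 8^4 local lattice with 3 dimensions distributed gives
# 6 neighbors, and each face is 8^3 = 512 sites.
n, faces = mesh_comm((8, 8, 8, 8), (1, 2, 3))
print(n, faces[1])                  # 6 neighbors, 512*192 = 98304 bytes
```

At these assumed sizes a single face is roughly 96 KBytes, which gives a feeling for the per-link buffering scale discussed below.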
For many other science applications which model systems in 3 dimensions, similar 3D mesh requirements exist. Updates to one subvolume must be communicated to adjacent processors working on adjacent subvolumes.
Modern processors achieve such high performance that parallel applications running on clusters are nearly always constrained by the cluster interconnect. A next-generation interconnect must therefore support the full bandwidth of the I/O bus. Today that implies PCI-X at 1 GByte/sec, and within a year PCI-Express 4x at a similar performance level. Two or three years out it should support PCI-Express 16x.
The network interface card (NIC) must support full duplex communications to 6 nearest neighbors in a 3D mesh. For communications to non-adjacent nodes, it must be capable of forwarding the message to another NIC with no host intervention.
Links in the first version should run at a 2 Gbit/s data rate (2.5 Gbit/s link rate with 8B/10B encoding). Within 2 years, links should run at 10 Gbit/s.
The NIC should be capable of sustaining full host bandwidth on 3 full duplex links or on 6 links used half-duplex.
Host overhead is an important issue, as is synchronization of the communications. To eliminate receive-copy overhead on the host, each of the 6 links should be capable of buffering at least 64 KBytes of inbound data and holding it until the host posts a receive buffer to the card for that link. In this way, the use of host kernel buffers (and a later copy to user space) can be avoided.
Host-to-host latency in the 10 microsecond range is acceptable, but host overhead per send+receive must be low: 2 microseconds or better for repetitive communication patterns (send from the same buffer to each of 6 neighbors repetitively, receive into the same buffer from each of 6 neighbors repetitively).
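A back-of-envelope cost model (Python, using the target figures above as assumed parameters; real performance depends on the NIC and software stack) shows why a 10 microsecond latency is tolerable at these message sizes: serialization at the link data rate dominates the transfer time.

```python
# Simple latency/bandwidth model (parameter defaults taken from the
# stated targets; illustrative only).

def xfer_time_us(msg_bytes, overhead_us=2.0, latency_us=10.0,
                 link_gbit_per_s=2.0):
    """Time for one send+receive: host overhead + wire latency
    + serialization at the link data rate."""
    wire_us = msg_bytes * 8 / (link_gbit_per_s * 1e3)  # bits/(Gb/s) -> us
    return overhead_us + latency_us + wire_us

# A 64 KByte face at 2 Gb/s: 2 + 10 + 262.1 us, i.e. roughly 274 us,
# so serialization time dwarfs both latency and host overhead.
t = xfer_time_us(64 * 1024)
```

The model also makes clear that for much smaller faces (a few KBytes) the fixed per-message costs become the dominant term, which is why the 2 microsecond host-overhead target matters.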
A cluster with this topology is being tested at Jefferson Lab for Lattice QCD. In this prototype, each node has 3 dual-port gigabit Ethernet cards to implement the necessary 6 links. Compared to the desired network, the latency is somewhat high (18 microseconds with current software), while the bandwidth is acceptable (limited by the gigabit Ethernet specification). The biggest performance constraint is the lack of buffering in the NIC, and hence the need for kernel receive buffers and a copy on receive, or alternatively the need for synchronization prior to transmission. Also, this mesh network has no routing capability except with host assistance. Network cost for this prototype is acceptably low (25% of total cluster cost).
[*] Correspondence to Chip Watson, Jefferson Lab MS 16A, 12000 Jefferson Av, Newport News, VA 23606, U.S.A. Email: firstname.lastname@example.org