Jefferson Lab in the News

Chip Watson (from left), head of the High-Performance Computing Group, watches Walt Akers, computer engineer, and Jie Chen, computer scientist, install a Myrinet card into a computer node.
Cluster Supercomputing at Jefferson Lab:
Programmers Tackle Software, Speed Issues
In terms of sheer computational power, a human brain isn't much to brag about. But the brain's ability to parcel out tasks, be situationally flexible, handle ambiguity and otherwise deal with the unexpected attests to a powerful kind of architecture that has so far eluded its would-be electronic rivals. Until, that is, the advent of parallel processing.
In a primitive approximation of what occurs in the brain, programmers at the Department of Energy's Jefferson Lab, in Newport News, Virginia, and elsewhere are figuring out how to divvy up computational tasks. Unlike conventional computation, where instructions are executed one by one, or serially, parallel programming identifies the crucial components of a given task, assigns those components to particular processors - called "nodes" by insiders - and then ensures that, although the processors are working independently, they are sufficiently interconnected to communicate the results of their labors to every other node.
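The divide, assign and combine pattern described above can be sketched in a few lines of Python. This is only an illustration, not the Lab's software: the function names (split_work, node_task, parallel_sum_of_squares) are hypothetical, the "nodes" are stand-in worker threads on a single machine rather than separate computers, and the task (a sum of squares) is chosen purely because it splits cleanly.

```python
from concurrent.futures import ThreadPoolExecutor


def split_work(items, num_nodes):
    """Divide items into num_nodes nearly equal, contiguous chunks."""
    base, extra = divmod(len(items), num_nodes)
    chunks, start = [], 0
    for node in range(num_nodes):
        # The first `extra` nodes each take one additional item.
        size = base + (1 if node < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def node_task(chunk):
    """Work one node does independently: a partial sum of squares."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(items, num_nodes=4):
    """Assign one chunk per worker, then combine the partial results."""
    chunks = split_work(items, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        # Each worker stands in for a node (an assumption of this sketch).
        partials = list(pool.map(node_task, chunks))
    # Combining the partial results is the "communication" step.
    return sum(partials)
```

In this toy case the workers only report back once, at the end. The harder problem the article goes on to describe arises when nodes must trade intermediate results repeatedly while the calculation is still under way.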
That approach is crucial as physicists attempt to simulate and then describe the interactions of quarks - thought by many scientists to be one of the basic building blocks of matter - within particles like protons and neutrons. Advanced calculations, such as those required by theorists working in "lattice quantum chromodynamics," or LQCD, simply can't run effectively on the serial processors found in run-of-the-mill personal computers.
"Even if you have a dozen processors in one computer, your job, your software, will still only use one processor," says Walt Akers, a computer engineer with JLab's Chief Information Office, High-Performance Computing Group. "Most run from the top to the bottom [of a task]. It's one long stream of operations, with each one depending on the one before. Going to a parallel computing system potentially offers vast increases in performance capabilities."
As part of its participation in the Scientific Discovery through Advanced Computing project, or SciDAC, administered by the Department of Energy's Office of Science, the Lab's computer engineers are developing hardware and software innovations that will allow the emerging generation of "cluster supercomputers" to operate at maximum potential. Aside from continuing to solve basic deployment issues - such as the interconnections and communication among the computer nodes that make up each cluster - JLab specialists have also developed or refined several collaborative software packages that monitor node performance and ensure speedy replacement of defective components. In particular, the Lab-developed "Cluster in a Can" program has been posted to the World Wide Web for anytime-anywhere use by software developers.
"We're in a national collaboration to make LQCD a reality," says Akers' colleague Jie Chen, a computer scientist in JLab's High-Performance Computing Group. "It's a big job. Parallel programming is more difficult than serial programming code. Coordination and communication are the big issues. Each node has to 'talk' to its neighboring node, even while the nodes are calculating."
Chen, Akers and others in the Group continue to wrestle with "latency," the delay incurred each time the nodes exchange data with one another. In building 128-node clusters, latency was manageable thanks to the installation of dedicated Myrinet hardware interconnects. But these devices remain expensive, and so are not realistic options for the 256-node cluster the Group is now preparing to build, which will use billion-bit-per-second, or gigabit, Ethernet interconnects. By nature, these gigabit interconnects operate more slowly, with latencies ranging from a barely acceptable 20 microseconds up to an unworkable 60. Latency must improve dramatically, to something on the order of 5 to 10 microseconds, if bigger successor clusters are to work effectively.
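A back-of-the-envelope model suggests why those few microseconds matter. The sketch below is an illustration under stated assumptions, not a measurement from the Lab's clusters: it supposes each node computes for a fixed time per step and then waits for a fixed number of messages, each costing one latency, with no overlap between calculating and communicating and no bandwidth limit. The function name and the example figures (100 microseconds of computation, 8 messages per step) are hypothetical; only the 5, 20 and 60 microsecond latencies come from the text.

```python
def parallel_efficiency(compute_us, latency_us, messages_per_step):
    """Fraction of each step a node spends on useful computation.

    Assumes (for illustration only) that every message costs one full
    latency and that communication never overlaps with computation.
    """
    if compute_us <= 0:
        raise ValueError("compute_us must be positive")
    wait_us = latency_us * messages_per_step
    return compute_us / (compute_us + wait_us)


# Hypothetical workload: 100 microseconds of computation per step and
# 8 messages per step. The latencies are the figures quoted above.
for latency_us in (5, 20, 60):
    share = parallel_efficiency(100, latency_us, 8)
    print(f"{latency_us:>2} microseconds latency: {share:.0%} of time computing")
```

Under these assumptions the share of time spent calculating drops steeply as latency climbs from 5 to 60 microseconds. Real codes recover some of that loss by letting nodes "talk" while they calculate, which this simple model deliberately ignores.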
These problems can be solved, the Group believes. The solution likely lies with a combination of the next generation of cheaper hardware and custom-designed software, perhaps written at the Lab. "As long as the problems are large and ugly, there will be bigger computers to solve them," Akers says. "But the bandwidth and latency issues will remain the biggest obstacles. Those are major issues that aren't going away."

