Research on Petascale Technologies

A critical component of ITAPS technology development is the drive toward peta- and exascale computing for unstructured, adaptive computations, particularly for applications that use implicit solution technologies. Parallel adaptive solution procedures require the ability to adaptively modify meshes on distributed parallel computers and to regain load balance before the next set of solution steps. The ITAPS MeshAdapt service has supported the execution of such processes for some time and provides acceptable performance on moderately sized computers. However, we have found that a number of new technical developments are required to handle the very large core counts of petascale computers. In particular, while achieving equal load balance across all processors has been important for machines with thousands of processors, it becomes both more challenging and more impactful as the number of processors grows to 100,000 and beyond. Because every processor must wait for the most heavily loaded one before the next step can begin, even a small imbalance can leave the equivalent of tens of thousands of processors sitting idle, a significant waste of resources. We are developing new algorithms to improve both load balance and partition quality for complex simulations. In addition, we are developing predictive load balancing tools to ensure that local operations such as mesh refinement do not exceed the physical memory available to a processor.
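As a rough illustration of why imbalance matters at this scale, the sketch below computes the peak-to-mean load ratio and the number of processor-equivalents left waiting for the most heavily loaded processor. It also shows a simple form of predictive check: estimating each part's element count after refinement and flagging parts that would exceed a memory budget. The function names, the load values, the capacity, and the assumption that each marked element is replaced by 8 children (as in a uniform subdivision of a tetrahedron) are all hypothetical choices for this example, not the ITAPS tools themselves.

```python
def imbalance_stats(loads):
    """Return (peak/mean ratio, idle processor-equivalents) for per-processor loads.

    In a synchronized step every processor waits for the slowest one, so the
    fraction of aggregate processor time spent idle is 1 - mean/peak.
    """
    count = len(loads)
    peak = max(loads)
    mean = sum(loads) / count
    ratio = peak / mean  # 1.0 means perfectly balanced
    idle_equivalents = (1.0 - mean / peak) * count
    return ratio, idle_equivalents


def parts_over_memory(parts, capacity, children_per_marked=8):
    """Return ids of parts whose predicted post-refinement size exceeds capacity.

    parts maps part id -> (unmarked_elements, marked_elements).
    children_per_marked is an assumed refinement factor, not a measured one.
    """
    over = []
    for part_id, (unmarked, marked) in sorted(parts.items()):
        predicted = unmarked + children_per_marked * marked
        if predicted > capacity:
            over.append(part_id)
    return over


# Hypothetical case: 100,000 processors, one of which carries 20% extra work.
loads = [1.0] * 99_999 + [1.2]
ratio, idle = imbalance_stats(loads)
print(f"peak/mean = {ratio:.3f}, idle processor-equivalents = {idle:.0f}")

# Hypothetical parts: (elements left alone, elements marked for refinement).
parts = {0: (900, 10), 1: (500, 100), 2: (1000, 0)}
print(parts_over_memory(parts, capacity=1200))
```

In this made-up case a single overloaded processor idles roughly one sixth of the machine, and only part 1 is predicted to outgrow its budget, so it would need to shed elements before refinement rather than after.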

Another significant challenge with large numbers of processors is ensuring that operations such as passing information among all processors do not begin to dominate the computational cost of the simulation. To address this challenge, we are developing new communication libraries that focus on eliminating MPI all-reduce calls by operating on local neighborhoods.
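The idea of operating on local neighborhoods can be sketched without MPI by simulating the message pattern in plain Python. In the example below, each part holds partial values for the mesh vertices it touches, and values on shared vertices are summed by exchanging messages only between parts that share a vertex. The function name, the data layout, and the four-part chain are assumptions made for illustration; this is not the interface of the ITAPS communication libraries. The `holders` map is built globally here only for convenience in the simulation; in a distributed mesh each part would already know its neighbors from its partition-boundary data.

```python
from collections import defaultdict


def neighbor_exchange(partials):
    """Sum shared-vertex values using neighbor-to-neighbor messages only.

    partials maps part id -> {vertex id: partial value}.
    Returns (summed values per part, number of directed neighbor links used).
    """
    # Record which parts hold a copy of each vertex.
    holders = defaultdict(set)
    for part, values in partials.items():
        for vertex in values:
            holders[vertex].add(part)

    # Pack one outgoing message per (source, destination) neighbor pair.
    outbox = defaultdict(list)
    for part, values in partials.items():
        for vertex, value in values.items():
            for other in holders[vertex] - {part}:
                outbox[(part, other)].append((vertex, value))

    # Deliver the messages and accumulate on the receiving part.
    result = {part: dict(values) for part, values in partials.items()}
    for (_src, dst), messages in outbox.items():
        for vertex, value in messages:
            result[dst][vertex] += value
    return result, len(outbox)


# Hypothetical chain of four parts; adjacent parts share one vertex.
partials = {
    0: {10: 1.0, 11: 2.0},
    1: {11: 3.0, 12: 4.0},
    2: {12: 5.0, 13: 6.0},
    3: {13: 7.0, 14: 8.0},
}
summed, links = neighbor_exchange(partials)
print(summed[1], links)
```

Here only 6 directed links carry data, whereas a pattern in which all four parts talk to each other would use 12. The gap widens as the part count grows, because the number of neighbors per part stays roughly constant while the number of parts does not.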

Finally, we are exploring novel mechanisms to address fault tolerance. With so many processing elements, one of the major issues facing the community is the recovery of user application codes after CPU-, communication-, and I/O-related failures. Over the past twenty years there has been considerable effort put into the design of "state recovery" mechanisms at both the system and user-code levels that allow applications to roll back and restart from a point before the failure. Since the Cray machines of the mid-1980s, these attempts have not proven effective, and the problem becomes more difficult as systems get larger. To address this, the ITAPS team is researching the redesign of mesh-based computational algorithms so that they are inherently tolerant of component failure. Our approach leverages transaction-based parallel computing techniques using a MapReduce formulation within a Hadoop framework.
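To make the MapReduce idea concrete, the following is a minimal pure-Python sketch of a mesh computation written as map and reduce steps. It does not use the Hadoop API, and the element data, the equal-share "assembly" rule, and the simulated failure mechanism are all assumptions for illustration. The point it tries to show is that when a map task is a pure function of its input element, a failed task can simply be run again without rolling back any global state.

```python
from collections import defaultdict


def map_element(element):
    """Map task: emit (vertex, contribution) pairs for one element.

    Each element spreads its load equally over its vertices, a stand-in
    for a real element-level assembly computation.
    """
    vertices, load = element
    share = load / len(vertices)
    return [(vertex, share) for vertex in vertices]


def run_with_retries(task, arg, fail_first_attempt, max_attempts=3):
    """Run a task, re-executing it if a (simulated) component failure occurs."""
    for attempt in range(max_attempts):
        try:
            if attempt == 0 and fail_first_attempt:
                raise RuntimeError("simulated component failure")
            return task(arg)
        except RuntimeError:
            continue  # the task has no side effects, so retrying is safe
    raise RuntimeError("task failed on every attempt")


def map_reduce(elements, failing=frozenset()):
    """Group mapped contributions by vertex, then reduce each group by summing."""
    groups = defaultdict(list)
    for index, element in enumerate(elements):
        pairs = run_with_retries(map_element, element, index in failing)
        for vertex, value in pairs:
            groups[vertex].append(value)
    return {vertex: sum(values) for vertex, values in groups.items()}


# Hypothetical two-triangle mesh: (vertex ids, element load).
elements = [((0, 1, 2), 3.0), ((1, 2, 3), 6.0)]
clean = map_reduce(elements)
recovered = map_reduce(elements, failing={1})  # element 1 fails once, then reruns
print(clean == recovered, clean)
```

The result is identical with and without the simulated failure, which is the property a transaction-style formulation aims for. A production framework would additionally have to handle failures in the reduce stage and in data storage, which this sketch does not model.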