
Current trends in parallel computation and the implications for modeling and optimization

Siirola, John D.

Process Systems Engineering (PSE) is built on the application of computational tools to the solution of physical engineering problems. Over the course of its nearly five-decade history, advances in PSE have relied roughly equally on advancements in desktop computing technology and on the development of new tools and approaches for representing and solving problems (Westerberg, 2004). Just as desktop computing development over that period focused on increasing the net serial instruction rate, tool development in PSE has emphasized creating faster general-purpose serial algorithms. However, in recent years the increase in net serial instruction rate has slowed dramatically, with processors first reaching an effective upper limit for clock speed and now approaching apparent limits for microarchitecture efficiency. Current trends in desktop processor development suggest that future performance gains will occur primarily through the exploitation of parallelism. For PSE to continue to leverage the "free" advancements from desktop computing technology in the future, the PSE toolset will need to embrace parallelization.

Unfortunately, "parallelization" is more than just identifying multiple things to do at once. Parallel algorithm design has two fundamental challenges: first, to match the characteristics of the parallelizable problem workload to the capabilities of the hardware platform, and second, to properly balance parallel computation against the overhead of communication and synchronization on that platform. The performance of any parallel algorithm is thus a strong function of how well the characteristics of the problem and algorithm match those of the hardware platform on which it will run. This has led to a proliferation of highly specialized parallel hardware platforms, each designed around specific problems or problem classes. While every platform has its own unique characteristics, current approaches can be grouped into six basic classes: symmetric multiprocessing (SMP), networks of workstations (NOW), massively parallel processing (MPP), specialized coprocessors, multi-threaded shared memory, and hybrids that combine components of the first five classes.

Perhaps the most familiar of these is the SMP architecture, which forms the bulk of the current desktop and workstation market. These systems have multiple processing units (processors and/or cores) controlled by a single operating system image and sharing a single common memory space. While SMP systems provide only a modest level of parallelism (typically 2-16 processing units), the existence of shared memory and full-featured processing units makes them perhaps the most straightforward development platform. A challenge of SMP platforms is the discrepancy between the speed of the processor and that of the memory system: both latency and overall memory bandwidth limitations can leave processors idle while waiting for data.

Clusters, a generic term for coordinated groups of independent computers (nodes) connected by high-speed networks, provide the opportunity for a radically different level of parallelism, with the largest clusters having over 25,000 nodes and 100,000 processing units. The challenge with clusters is that memory is distributed across independent nodes: communication and coordination among nodes must be explicitly managed and occur over a relatively high-latency network interconnect. Efficient use of this architecture requires applications that decompose into pseudo-independent components that run with high computation-to-communication ratios. The level to which systems utilize commodity components distinguishes the two main types of cluster architectures, with NOW nodes running commodity network interconnects and operating systems and MPP nodes using specialized or proprietary network layers or microkernels.
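To make the contrast concrete, the sketch below expresses the SMP shared-memory model in OpenMP (a minimal, illustrative example; the problem size, loop body, and compile command are arbitrary assumptions, not taken from the work described here). Every thread reads and combines data from the same address space with no explicit communication; performing the same reduction on a cluster would instead require partitioning the array across nodes and exchanging partial sums over the interconnect.

    /* Illustrative SMP example: one process, several threads, one shared
     * address space. Compile with, e.g., cc -fopenmp smp_sum.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(void) {
        const int n = 10000000;              /* arbitrary problem size */
        double *x = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) x[i] = 1.0 / (i + 1);

        /* Each core works on a slice of the same shared array; the
         * reduction clause supplies the synchronization needed to combine
         * the per-thread partial sums. With only 2-16 processing units,
         * memory bandwidth, not core count, often limits loops like this. */
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += x[i];

        printf("sum = %f (max threads = %d)\n", sum, omp_get_max_threads());
        free(x);
        return 0;
    }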
Specialized coprocessors, including graphics processing units (GPU) and the Cell Broadband Engine (Cell), are gaining popularity as scientific computing platforms. These platforms employ dependent, non-general-purpose processing units to speed fine-grained, repetitive processing. Architecturally, they are reminiscent of vector computing, combining very fast access to a small amount of local memory with processing elements implementing either a single-instruction-multiple-data (SIMD) model (GPU) or a pipelined model (Cell). As application developers must explicitly manage both the parallelism on the coprocessor and the movement of data to and from the coprocessor memory space, these architectures can be some of the most challenging to program; a brief sketch of this data staging follows this overview.

Finally, multi-threaded shared memory (MTSM) systems represent a fundamental departure from traditional distributed memory systems like NOW and MPP. Instead of a collection of independent nodes and memory spaces, an MTSM system runs a single system image across all nodes, combining all node memory into a single coherent shared memory space. To a developer, an MTSM appears to be a single, very large SMP. However, unlike an SMP, which uses caches to reduce the latency of a memory access, the MTSM tolerates latency by using a large number of concurrent threads. While this architecture lends itself to problems that are not readily decomposable, effective utilization of MTSM systems requires applications to run hundreds or thousands of concurrent threads.
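As an illustration of the explicit data management that makes the coprocessor platforms above challenging to program, the sketch below uses OpenMP's target and map directives as a portable, more recent stand-in for the vendor-specific toolchains (e.g., CUDA or the Cell SDK) that such platforms actually required; the array size and kernel are arbitrary assumptions. The map clauses make the staging of data into and out of the coprocessor's separate memory space visible in the source, and that staging is the cost the on-device computation must amortize.

    /* Illustrative host/coprocessor offload pattern (OpenMP target). */
    #include <stdio.h>

    #define N 1048576                        /* arbitrary array length */

    int main(void) {
        static float a[N], b[N];
        for (int i = 0; i < N; i++) a[i] = (float)i;

        /* Copy 'a' into the device memory space, run the fine-grained,
         * SIMD-style loop there, then copy 'b' back to the host. Each
         * map() crosses the host-coprocessor interconnect, so the work per
         * element must be large enough to pay for the data movement.
         * (If no offload device is present, the loop runs on the host.) */
        #pragma omp target teams distribute parallel for \
                map(to: a[0:N]) map(from: b[0:N])
        for (int i = 0; i < N; i++)
            b[i] = 2.0f * a[i] + 1.0f;

        printf("b[42] = %f\n", b[42]);
        return 0;
    }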
The proliferation of specialized parallel computing architectures presents several significant challenges for developers of parallel modeling and optimization applications. Foremost is the challenge of selecting the "appropriate" platform to target when developing the application. While it is clear that architectural characteristics can significantly affect the performance of an algorithm, relatively few rules or heuristics exist for selecting a platform based solely on application characteristics. A contributing challenge is that different architectures employ fundamentally different programming paradigms, libraries, and tools: knowledge and experience on one platform do not necessarily translate to other platforms. This also complicates the process of directly comparing platform performance, as applications are rarely portable: software designed for one platform rarely compiles on another without modification, and the modifications may require a redesign of the fundamental parallelization approach. A final challenge is effectively communicating parallel results. While the relatively homogeneous environment of serial desktop computing permitted extremely terse descriptions of a test platform, often limited to processor make and clock speed, reporting results for parallel architectures must include not only processor information but also, depending on the architecture, the operating system, network interconnect, coprocessor make, model, and interconnect, and node configuration.

There are numerous examples of algorithms and applications designed explicitly to leverage specific architectural features of parallel systems. While by no means comprehensive, three current representative efforts are the development of parallel branch and bound algorithms, distributed collaborative optimization algorithms, and multithreaded parallel discrete event simulation.

PICO, the Parallel Integer and Combinatorial Optimizer (Eckstein et al., 2001), is a scalable parallel mixed-integer linear optimizer. Designed explicitly for cluster environments (both NOW and MPP), PICO leverages the synergy between the inherently decomposable branch and bound tree search and the independent nature of the nodes within a cluster by distributing the independent sub-problems of the tree search across the nodes of the cluster. In contrast, agent-based collaborative optimization (Siirola et al., 2004, 2007) matches traditionally non-decomposable nonlinear programming algorithms to high-latency clusters (e.g., NOWs or Grids) by replicating serial search algorithms, intact and unmodified, across the independent nodes of the cluster. The system then enforces collaboration by sharing intermediate "solutions" to the common problem. This creates a decomposable artificial meta-algorithm with a high computation-to-communication ratio that can scale efficiently on large, high-latency, low-bandwidth cluster environments (a minimal sketch of this collaboration pattern appears at the end of this section).

For modeling applications, efficiently parallelizing discrete event simulations has presented a longstanding challenge, with several decades of study and literature (Perumalla, 2006). The central challenge in parallelizing discrete event simulations on traditional distributed memory clusters is efficiently synchronizing the simulation time across the processing nodes during a simulation. A promising alternative approach leverages the Cray XMT (formerly called Eldorado; Feo et al., 2005). The XMT implements an MTSM architecture and provides a single shared memory space across all nodes, greatly simplifying the time synchronization challenge. Further, the fine-grained parallelism provided by the architecture opens new opportunities for additional parallelism beyond simple event parallelization, for example, parallelizing the event queue management.

While these three examples are a small subset of current parallel algorithm design, they demonstrate the impact that parallel architectures have had and will continue to have on future developments for modeling and optimization in PSE.
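As a concrete illustration of the collaborative-optimization pattern referenced above, the following sketch replicates an unmodified serial search on every node of a cluster (here, every MPI rank) and periodically shares intermediate results. It is a minimal illustration only: the objective function, the random search, the round structure, and the decision to exchange just the best objective value (rather than full candidate solutions) are assumptions made for brevity, not the algorithms of Siirola et al. or PICO.

    /* Illustrative collaborative search: independent serial searches with
     * occasional exchange of intermediate results over MPI.
     * Build/run with, e.g.: mpicc collab.c -lm && mpirun -np 4 ./a.out */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    /* Arbitrary multimodal objective used only for illustration. */
    static double objective(double x) {
        return x * x + 10.0 * sin(3.0 * x);
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        srand(1234u + (unsigned)rank);       /* distinct stream per agent */

        double best = 1e30;
        for (int round = 0; round < 10; round++) {
            /* Long stretch of independent serial search: the computation. */
            for (int i = 0; i < 100000; i++) {
                double x = 20.0 * rand() / RAND_MAX - 10.0;
                double f = objective(x);
                if (f < best) best = f;
            }
            /* Brief collaboration step: the communication. Every agent
             * adopts the best intermediate value found by any agent. */
            double global_best;
            MPI_Allreduce(&best, &global_best, 1, MPI_DOUBLE, MPI_MIN,
                          MPI_COMM_WORLD);
            best = global_best;
        }
        if (rank == 0) printf("best objective found: %f\n", best);
        MPI_Finalize();
        return 0;
    }

The structure makes the high computation-to-communication ratio explicit: each rank performs many objective evaluations between single collective exchanges, which is what lets this style of meta-algorithm tolerate a high-latency, low-bandwidth interconnect.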