Publications / Conference Poster

Towards workload-adaptive scheduling for HPC clusters

Goponenko, Alexander V.; Izadpanah, Ramin; Brandt, James M.; Dechev, Damian

The performance of HPC clusters depends on efficient job scheduling. However, modern schedulers generally lack real-time information about resource utilization and rely on users to specify job requirements, which are seldom accurate. The problem is exacerbated as HPC systems become increasingly complicated and heterogeneous, which gives rise to new resource constraints (GPU, parallel file system, network bandwidth, burst buffers, etc.). In this work, we integrated data from LDMS, the Lightweight Distributed Metric Service, with Slurm, a popular job scheduler. To demonstrate the capabilities of such integration, we enabled scheduling based on Lustre file system throughput. In a preliminary evaluation of scheduling on a cluster of virtual machines, we demonstrated the benefits of measuring real-time utilization, predicting application requirements from historical data, and exercising finer control over resources. We also identified the possibility of further improving scheduling efficiency through workload-adaptive scheduling, that is, adjusting the scheduling based on the characteristics of the pending jobs. We validated the feasibility of this strategy by simulating job executions in our custom-made HPC cluster simulator.
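
To illustrate the idea of throughput-aware scheduling described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes a measured file system throughput is available (for example, from a monitoring service), predicts each job's throughput as the mean of its application's past runs, and greedily starts jobs that fit both the node budget and the throughput budget. All names here (`Job`, `predict_throughput`, `select_jobs`, the `history` mapping, and the GB/s units) are hypothetical and do not correspond to any Slurm or LDMS API.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Job:
    name: str
    nodes: int
    app: str  # hypothetical application identifier used to look up past runs


def predict_throughput(app, history, default=0.0):
    """Estimate a job's file system throughput (GB/s) as the mean of past runs."""
    past = history.get(app)
    return mean(past) if past else default


def select_jobs(queue, free_nodes, measured_throughput, throughput_limit, history):
    """Greedily start queued jobs that fit both the node and throughput budgets."""
    started = []
    # Headroom is what remains of the file system limit after current usage.
    headroom = throughput_limit - measured_throughput
    for job in queue:
        need = predict_throughput(job.app, history)
        if job.nodes <= free_nodes and need <= headroom:
            started.append(job.name)
            free_nodes -= job.nodes
            headroom -= need
    return started


# Illustrative data only: throughput of past runs per application, in GB/s.
history = {"io_heavy": [4.0, 6.0], "compute": [0.5, 0.5]}
queue = [Job("a", 2, "io_heavy"), Job("b", 2, "io_heavy"), Job("c", 2, "compute")]
print(select_jobs(queue, free_nodes=6, measured_throughput=2.0,
                  throughput_limit=10.0, history=history))
```

In this sketch the second I/O-heavy job is held back because starting it would exceed the assumed throughput limit, while the compute-bound job behind it can still start. A production scheduler would also have to handle priorities, reservations, and prediction error, which are omitted here.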