Problem Set 1 --- CS 442

Due: Tuesday, September 30th

(1) You are tired of having your programs write output to the screen in random jumbled order. Write a few lines of pseudocode that are executed by all processors, produce this style of output, and that require no more than 2P sends/receives, where P is the number of processors. It should work for any value of P.

Proc 0: value is 4
Proc 1: value is 2
Proc 2: value is 7
Proc 3: value is 15
...

If you were writing to a file instead of the screen, typically only processor 0 would open the file and do the ``writes''. Write a section of pseudocode that satisfies this new constraint. How many sends/receives are now required?

(2) Consider a finite difference calculation performed on an NxN grid, similar to the parallel game-of-life simulations we have been discussing. Assume that updating the value for each grid cell requires 8 floating-point operations (same 9-point stencil as the GOL), and communication is needed at the start of each timestep to update ghost-cell values. Analyze the two parallelization strategies discussed in class for programming assignment 3: (1) a one-dimensional decomposition of the grid over P processors, and (2) a two-dimensional decomposition across Q*Q=P processors. For each strategy, write down a formula for the total CPU time per timestep in terms of communication startup (alpha), inverse bandwidth per byte (beta) and time per flop (tau), as well as N and P (or Q). You can assume that N is an exact multiple of P, and that each communicated ghost-cell value requires eight bytes of memory. For each strategy, derive formulas for the parallel speedup and efficiency as a function of the same parameters. For the 2-d strategy, assuming the startup cost is negligible, how quickly must N grow as a function of P to keep the parallel efficiency constant?

On Sandia's Intel Tflops machine, the following values are approximately correct (in nanoseconds): alpha = 20000, beta = 3, and tau = 10. For each of the two strategies running on 1024 processors, how large must N be to keep the cost of communication below 20% of the total run time?

(3) Consider the recursive halving and recursive doubling operations described in class. For P a power-of-two, write some pseudocode using these operations to perform a broadcast of a N-length vector with 2*log(P) message startups and only 2N total communication volume. Note: Although it requires more startups, for large N this is better than the N*log(P) communication volume required by the tree-structured (cascade) broadcast you programmed for the first assignment.