** Due:** Tuesday, September 30th

(1) You are tired of having your programs write output to the screen
in random jumbled order. Write a few lines of pseudocode that are
executed by all processors, produce this style of output, and that
require no more than **2P** sends/receives, where **P** is the number of
processors. It should work for any value of **P**.

Proc 0: value is 4

Proc 1: value is 2

Proc 2: value is 7

Proc 3: value is 15

...

If you were writing to a file instead of the screen, typically only
processor 0 would open the file and do the ``writes''. Write a
section of pseudocode that satisfies this new constraint. How many
sends/receives are now required?

(2) Consider a finite difference calculation performed on an NxN grid, similar
to the parallel game-of-life simulations we have
been discussing. Assume that updating the value for each grid cell
requires 8 floating-point operations (same 9-point stencil as the
GOL), and communication is needed at the start of each timestep to
update ghost-cell values. Analyze the two parallelization strategies
discussed in class for programming assignment 3: (1) a
one-dimensional decomposition of the grid over **P** processors, and
(2) a two-dimensional decomposition across Q*Q=P processors. For
each strategy, write down a formula for the total CPU time per
timestep in terms of communication startup (alpha), inverse
bandwidth per byte (beta) and time per flop (tau), as well as
**N** and **P** (or **Q**). You can assume that **N** is an exact multiple of **P**, and
that each communicated ghost-cell value requires eight bytes of
memory. For each strategy, derive formulas for the parallel speedup
and efficiency as a function of the same parameters.
For the 2-d strategy, assuming the startup cost is negligible, how
quickly must **N** grow as a function of **P** to keep the parallel
efficiency constant?

On Sandia's Intel Tflops machine, the following values are
approximately correct (in nanoseconds): alpha = 20000,
beta = 3, and tau = 10. For each of
the two strategies running on 1024 processors, how large must **N** be
to keep the cost of communication below 20% of the total run time?

(3) Consider the recursive halving and recursive doubling operations
described in class. For **P** a power-of-two, write some pseudocode
using these operations to perform a broadcast of a **N**-length vector
with 2*log(P) message startups and only **2N** total communication
volume. Note: Although it requires more startups, for large **N** this
is * better* than the N*log(P) communication volume required by the
tree-structured (cascade) broadcast you programmed for the first
assignment.