MPI Bandwidth Benchmarks
I need to see if the wall times reported by mpiP for the MPI calls underneath FFTW 3.3alpha1's parallel transposes make sense. The TACC folks were excellent, as usual, and pointed me to the MVAPICH benchmarks.
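For the comparison itself, here is a minimal sketch of the sanity check I have in mind, assuming a simple latency-plus-bandwidth model for a single point-to-point message. The function names and every number in it are hypothetical placeholders, not measurements from either machine, and a real transpose involves many messages at once, so this only gives a rough lower bound.

```python
def estimated_transfer_seconds(message_bytes, bandwidth_bytes_per_s, latency_s=0.0):
    """Estimate the time for one message as latency + size / bandwidth.

    Hypothetical helper: the model ignores contention, message
    aggregation, and any overlap with computation.
    """
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return latency_s + message_bytes / bandwidth_bytes_per_s


def slowdown_vs_estimate(measured_s, message_bytes, bandwidth_bytes_per_s, latency_s=0.0):
    """Return measured time divided by the model estimate.

    A ratio near 1 suggests the call is bandwidth-bound; a much larger
    ratio suggests something else (e.g. waiting on other ranks).
    """
    estimate = estimated_transfer_seconds(message_bytes, bandwidth_bytes_per_s, latency_s)
    return measured_s / estimate


if __name__ == "__main__":
    # Placeholder inputs only: substitute the benchmark bandwidth and the
    # per-call wall time and message size reported by the profiler.
    ratio = slowdown_vs_estimate(
        measured_s=0.004,
        message_bytes=1_000_000,
        bandwidth_bytes_per_s=1e9,
    )
    print(f"measured / estimated = {ratio:.2f}")
```

If the ratio comes out far above 1 for the calls in question, raw link bandwidth is probably not what limits them.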
One quick run of the bandwidth test on Lonestar and Ranger shows the following:
Unfortunately, this information does not explain my bottleneck...