Because I believe they provide a nice way to perform P3DFFT- and 2DECOMP&FFT-like MPI-parallel pencil decompositions, I have have spent some time staring at FFTW's 3.3 alpha and beta releases. In particular, poring over their Distributed-memory FFTW with MPI capabilities.
Using 3.3-alpha1 I wrote a small library (called underling) which mimicked P3DFFT 2.3's data movement capabilities. I wholly isolated the data movement from computing the FFTs. This, unlike P3DFFT and 2DECOMP, provides well-defined memory layout at any stage during the pencil transposes. Functionally my first approach was quite sound. However, it performs suboptimal on-node memory reshuffling (a design mistake on my part) and can stand to be improved.
I am revisiting the assumption of separating the parallel FFTs from the parallel data movement. Doing so allows using the higher-level 2D r2c/c2r planning APIs freshly documented for 3.3-beta. It should require less needless memory reshuffling when combined with appropriate FFTW_MPI_TRANPOSED_IN and FFTW_MPI_TRANSPOSED_OUT flag choices.
Because, though wonderfully written, the relevant sections of the FFTW MPI documentation do not provide a cheat sheet, I have cooked my own for the various piece parts I may use in a second attempt. Perhaps someone will find it useful. Be sure to check the FFTW MPI reference for full details, especially the
local_size calls necessary to obtain data distribution information.
Please tell me if you catch any mistakes. Many thanks to Steven G. Johnson for correcting my earlier misunderstanding of the real-to-complex and complex-to-real 2D DFT semantics.
|transposed in||transposed out||transposed in|out|
Notation: Real-valued directions are denoted n0 and n1 while complex-valued directions are ñ0 and ñ1. Half-complex storage is denoted ñ/2+1 and its padded, real-valued counterpart is 2(n/2+1). Directions decomposed along a communicator with P processes are denoted n/P.
ncstands for "number of components" and corresponds to the advanced planning API's