FFTW 3.3 MPI Cheat Sheet
Because I believe they provide a nice way to perform P3DFFT- and 2DECOMP&FFT-like MPI-parallel pencil decompositions, I have have spent some time staring at FFTW's 3.3 alpha and beta releases. In particular, poring over their Distributed-memory FFTW with MPI capabilities.
Using 3.3-alpha1 I wrote a small library (called underling) which mimicked P3DFFT 2.3's data movement capabilities. I wholly isolated the data movement from computing the FFTs. This, unlike P3DFFT and 2DECOMP, provides well-defined memory layout at any stage during the pencil transposes. Functionally my first approach was quite sound. However, it performs suboptimal on-node memory reshuffling (a design mistake on my part) and can stand to be improved.
I am revisiting the assumption of separating the parallel FFTs from the parallel data movement. Doing so allows using the higher-level 2D r2c/c2r planning APIs freshly documented for 3.3-beta. It should require less needless memory reshuffling when combined with appropriate FFTW_MPI_TRANPOSED_IN and FFTW_MPI_TRANSPOSED_OUT flag choices.
Because, though wonderfully written, the relevant sections of the FFTW MPI documentation do not provide a cheat sheet, I have cooked my own for the various piece parts I may use in a second attempt. Perhaps someone will find it useful. Be sure to check the FFTW MPI reference for full details, especially the local_size
calls necessary to obtain data distribution information.
Please tell me if you catch any mistakes. Many thanks to Steven G. Johnson for correcting my earlier misunderstanding of the real-to-complex and complex-to-real 2D DFT semantics.
transposed in | transposed out | transposed in|out | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
transpose | in | n0/P | × | n1 | × | nc | n1 | × | n0/P | × | nc | n0/P | × | n1 | × | nc | n1 | × | n0/P | × | nc |
out | n1/P | × | n0 | × | nc | n1/P | × | n0 | × | nc | n0 | × | n1/P | × | nc | n0 | × | n1/P | × | nc | |
c2c 2D | in | ñ0/P | × | ñ1 | × | ñc | ñ1/P | × | ñ0 | × | ñc | ñ0/P | × | ñ1 | × | ñc | ñ1/P | × | ñ0 | × | ñc |
out | ñ0/P | × | ñ1 | × | ñc | ñ0/P | × | ñ1 | × | ñc | ñ1/P | × | ñ0 | × | ñc | ñ1/P | × | ñ0 | × | ñc | |
r2c 2D | in | n0/P | × | 2(n1/2+1) | × | nc | 2(n1/2+1)/P | × | n0 | × | nc | n0/P | × | 2(n1/2+1) | × | nc | 2(n1/2+1)/P | × | n0 | × | nc |
out | ñ0/P | × | (ñ1/2+1) | × | ñc | ñ0/P | × | (ñ1/2+1) | × | ñc | (ñ1/2+1)/P | × | ñ0 | × | ñc | (ñ1/2+1)/P | × | ñ0 | × | ñc | |
c2r 2D | in | ñ0/P | × | ñ1/2+1 | × | ñc | (ñ1/2+1)/P | × | ñ0 | × | ñc | ñ0/P | × | ñ1/2+1 | × | ñc | (ñ1/2+1)/P | × | ñ0 | × | ñc |
out | n0/P | × | 2(n1/2+1) | × | nc | n0/P | × | 2(n1/2+1) | × | nc | 2(n1/2+1)/P | × | n0 | × | nc | 2(n1/2+1)/P | × | n0 | × | nc |
Notation: Real-valued directions are denoted n0 and n1 while complex-valued directions are ñ0 and ñ1. Half-complex storage is denoted ñ/2+1 and its padded, real-valued counterpart is 2(n/2+1). Directions decomposed along a communicator with P processes are denoted n/P.
nc
stands for "number of components" and corresponds to the advanced planning API's howmany
arguments.