09 July 2011

FFTW 3.3 MPI Cheat Sheet

Because I believe they provide a nice way to perform P3DFFT- and 2DECOMP&FFT-like MPI-parallel pencil decompositions, I have have spent some time staring at FFTW's 3.3 alpha and beta releases. In particular, poring over their Distributed-memory FFTW with MPI capabilities.

Using 3.3-alpha1 I wrote a small library (called underling) which mimicked P3DFFT 2.3's data movement capabilities. I wholly isolated the data movement from computing the FFTs. This, unlike P3DFFT and 2DECOMP, provides well-defined memory layout at any stage during the pencil transposes. Functionally my first approach was quite sound. However, it performs suboptimal on-node memory reshuffling (a design mistake on my part) and can stand to be improved.

I am revisiting the assumption of separating the parallel FFTs from the parallel data movement. Doing so allows using the higher-level 2D r2c/c2r planning APIs freshly documented for 3.3-beta.  It should require less needless memory reshuffling when combined with appropriate FFTW_MPI_TRANPOSED_IN and FFTW_MPI_TRANSPOSED_OUT flag choices.

Because, though wonderfully written, the relevant sections of the FFTW MPI documentation do not provide a cheat sheet, I have cooked my own for the various piece parts I may use in a second attempt. Perhaps someone will find it useful. Be sure to check the FFTW MPI reference for full details, especially the local_size calls necessary to obtain data distribution information.

Please tell me if you catch any mistakes. Many thanks to Steven G. Johnson for correcting my earlier misunderstanding of the real-to-complex and complex-to-real 2D DFT semantics.





transposed in transposed out transposed in|out
transpose in n0/P × n1 × nc n1 × n0/P × nc n0/P × n1 × nc n1 × n0/P × nc
out n1/P × n0 × nc n1/P × n0 × nc n0 × n1/P × nc n0 × n1/P × nc
c2c 2D in ñ0/P × ñ1 × ñc ñ1/P × ñ0 × ñc ñ0/P × ñ1 × ñc ñ1/P × ñ0 × ñc
out ñ0/P × ñ1 × ñc ñ0/P × ñ1 × ñc ñ1/P × ñ0 × ñc ñ1/P × ñ0 × ñc
r2c 2D in n0/P × 2(n1/2+1) × nc 2(n1/2+1)/P × n0 × nc n0/P × 2(n1/2+1) × nc 2(n1/2+1)/P × n0 × nc
out ñ0/P × (ñ1/2+1) × ñc ñ0/P × (ñ1/2+1) × ñc (ñ1/2+1)/P × ñ0 × ñc (ñ1/2+1)/P × ñ0 × ñc
c2r 2D in ñ0/P × ñ1/2+1 × ñc (ñ1/2+1)/P × ñ0 × ñc ñ0/P × ñ1/2+1 × ñc (ñ1/2+1)/P × ñ0 × ñc
out n0/P × 2(n1/2+1) × nc n0/P × 2(n1/2+1) × nc 2(n1/2+1)/P × n0 × nc 2(n1/2+1)/P × n0 × nc


Notation: Real-valued directions are denoted n0 and n1 while complex-valued directions are ñ0 and ñ1. Half-complex storage is denoted ñ/2+1 and its padded, real-valued counterpart is 2(n/2+1). Directions decomposed along a communicator with P processes are denoted n/P. nc stands for "number of components" and corresponds to the advanced planning API's howmany arguments.

No comments:

Subscribe Subscribe to The Return of Agent Zlerich