The Return of Agent Zlerich: FFTW 3.3 MPI Cheat Sheet

Because I believe they provide a nice way to perform P3DFFT- and 2DECOMP&FFT-like MPI-parallel pencil decompositions, I have have spent some time staring at FFTW's 3.3 alpha and beta releases. In particular, poring over their Distributed-memory FFTW with MPI capabilities.

Using 3.3-alpha1 I wrote a small library (called underling) which mimicked P3DFFT 2.3's data movement capabilities. I wholly isolated the data movement from computing the FFTs. This, unlike P3DFFT and 2DECOMP, provides well-defined memory layout at any stage during the pencil transposes. Functionally my first approach was quite sound. However, it performs suboptimal on-node memory reshuffling (a design mistake on my part) and can stand to be improved.

I am revisiting the assumption of separating the parallel FFTs from the parallel data movement. Doing so allows using the higher-level 2D r2c/c2r planning APIs freshly documented for 3.3-beta. It should require less needless memory reshuffling when combined with appropriate FFTW_MPI_TRANPOSED_IN and FFTW_MPI_TRANSPOSED_OUT flag choices.

Because, though wonderfully written, the relevant sections of the FFTW MPI documentation do not provide a cheat sheet, I have cooked my own for the various piece parts I may use in a second attempt. Perhaps someone will find it useful. Be sure to check the FFTW MPI reference for full details, especially the local_size calls necessary to obtain data distribution information.

Please tell me if you catch any mistakes. Many thanks to Steven G. Johnson for correcting my earlier misunderstanding of the real-to-complex and complex-to-real 2D DFT semantics.

							transposed in					transposed out					transposed in\|out
transpose	in	n0/P	×	n1	×	nc	n1	×	n0/P	×	nc	n0/P	×	n1	×	nc	n1	×	n0/P	×	nc
transpose	out	n1/P	×	n0	×	nc	n1/P	×	n0	×	nc	n0	×	n1/P	×	nc	n0	×	n1/P	×	nc
c2c 2D	in	ñ0/P	×	ñ1	×	ñc	ñ1/P	×	ñ0	×	ñc	ñ0/P	×	ñ1	×	ñc	ñ1/P	×	ñ0	×	ñc
c2c 2D	out	ñ0/P	×	ñ1	×	ñc	ñ0/P	×	ñ1	×	ñc	ñ1/P	×	ñ0	×	ñc	ñ1/P	×	ñ0	×	ñc
r2c 2D	in	n0/P	×	2(n1/2+1)	×	nc	2(n1/2+1)/P	×	n0	×	nc	n0/P	×	2(n1/2+1)	×	nc	2(n1/2+1)/P	×	n0	×	nc
r2c 2D	out	ñ0/P	×	(ñ1/2+1)	×	ñc	ñ0/P	×	(ñ1/2+1)	×	ñc	(ñ1/2+1)/P	×	ñ0	×	ñc	(ñ1/2+1)/P	×	ñ0	×	ñc
c2r 2D	in	ñ0/P	×	ñ1/2+1	×	ñc	(ñ1/2+1)/P	×	ñ0	×	ñc	ñ0/P	×	ñ1/2+1	×	ñc	(ñ1/2+1)/P	×	ñ0	×	ñc
c2r 2D	out	n0/P	×	2(n1/2+1)	×	nc	n0/P	×	2(n1/2+1)	×	nc	2(n1/2+1)/P	×	n0	×	nc	2(n1/2+1)/P	×	n0	×	nc

Notation: Real-valued directions are denoted n0 and n1 while complex-valued directions are ñ0 and ñ1. Half-complex storage is denoted ñ/2+1 and its padded, real-valued counterpart is 2(n/2+1). Directions decomposed along a communicator with P processes are denoted n/P. nc stands for "number of components" and corresponds to the advanced planning API's howmany arguments.

The Return of Agent Zlerich

09 July 2011

FFTW 3.3 MPI Cheat Sheet

No comments:

Popular Posts

Archived Posts