The Return of Agent Zlerich: 201107

Because I believe they provide a nice way to perform P3DFFT- and 2DECOMP&FFT-like MPI-parallel pencil decompositions, I have spent some time staring at FFTW's 3.3 alpha and beta releases, in particular poring over their "Distributed-memory FFTW with MPI" capabilities.

Using 3.3-alpha1 I wrote a small library (called underling) which mimicked P3DFFT 2.3's data movement capabilities. I wholly isolated the data movement from computing the FFTs. This, unlike P3DFFT and 2DECOMP&FFT, provides a well-defined memory layout at every stage of the pencil transposes. Functionally, my first approach was quite sound. However, it reshuffles on-node memory suboptimally (a design mistake on my part) and can stand to be improved.

I am revisiting my assumption that the parallel FFTs should be kept separate from the parallel data movement. Combining them allows using the higher-level 2D r2c/c2r planning APIs freshly documented for 3.3-beta, which should require less needless memory reshuffling when combined with appropriate FFTW_MPI_TRANSPOSED_IN and FFTW_MPI_TRANSPOSED_OUT flag choices.
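To make "on-node memory reshuffling" concrete, here is a minimal sketch in plain C of the kind of local reordering I have in mind: turning a block stored as rows × cols × nc into one stored as cols × rows × nc, with the nc components of each point kept contiguous. The helper name local_transpose is hypothetical and this is not an FFTW routine, nor a claim about what FFTW does internally. It only illustrates the extra pass over memory that a well-chosen transposed flag might let one skip.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of FFTW): reorder a local block stored
 * row-major as rows x cols x nc (nc varies fastest) into cols x rows x nc.
 * in and out must not overlap and must each hold rows*cols*nc doubles. */
static void local_transpose(const double *in, double *out,
                            ptrdiff_t rows, ptrdiff_t cols, ptrdiff_t nc)
{
    assert(in != NULL && out != NULL && in != out);
    assert(rows > 0 && cols > 0 && nc > 0);
    for (ptrdiff_t i = 0; i < rows; ++i) {
        for (ptrdiff_t j = 0; j < cols; ++j) {
            /* Each (i, j) point carries nc contiguous components. */
            for (ptrdiff_t c = 0; c < nc; ++c) {
                out[(j * rows + i) * nc + c] = in[(i * cols + j) * nc + c];
            }
        }
    }
}
```

For example, with rows = n0/P, cols = n1, and nc components per point, this converts an n0/P × n1 × nc block into an n1 × n0/P × nc one at the cost of touching every element once more.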

Though wonderfully written, the relevant sections of the FFTW MPI documentation do not provide a cheat sheet, so I have cooked up my own covering the various pieces I may use in a second attempt. Perhaps someone will find it useful. Be sure to check the FFTW MPI reference for full details, especially the local_size calls necessary to obtain data distribution information.

Please tell me if you catch any mistakes. Many thanks to Steven G. Johnson for correcting my earlier misunderstanding of the real-to-complex and complex-to-real 2D DFT semantics.

| operation | buffer | default | transposed in | transposed out | transposed in\|out |
|---|---|---|---|---|---|
| transpose | in | n0/P × n1 × nc | n1 × n0/P × nc | n0/P × n1 × nc | n1 × n0/P × nc |
| transpose | out | n1/P × n0 × nc | n1/P × n0 × nc | n0 × n1/P × nc | n0 × n1/P × nc |
| c2c 2D | in | ñ0/P × ñ1 × ñc | ñ1/P × ñ0 × ñc | ñ0/P × ñ1 × ñc | ñ1/P × ñ0 × ñc |
| c2c 2D | out | ñ0/P × ñ1 × ñc | ñ0/P × ñ1 × ñc | ñ1/P × ñ0 × ñc | ñ1/P × ñ0 × ñc |
| r2c 2D | in | n0/P × 2(n1/2+1) × nc | 2(n1/2+1)/P × n0 × nc | n0/P × 2(n1/2+1) × nc | 2(n1/2+1)/P × n0 × nc |
| r2c 2D | out | ñ0/P × (ñ1/2+1) × ñc | ñ0/P × (ñ1/2+1) × ñc | (ñ1/2+1)/P × ñ0 × ñc | (ñ1/2+1)/P × ñ0 × ñc |
| c2r 2D | in | ñ0/P × (ñ1/2+1) × ñc | (ñ1/2+1)/P × ñ0 × ñc | ñ0/P × (ñ1/2+1) × ñc | (ñ1/2+1)/P × ñ0 × ñc |
| c2r 2D | out | n0/P × 2(n1/2+1) × nc | n0/P × 2(n1/2+1) × nc | 2(n1/2+1)/P × n0 × nc | 2(n1/2+1)/P × n0 × nc |

Notation: Real-valued directions are denoted n0 and n1 while complex-valued directions are ñ0 and ñ1. Half-complex storage is denoted ñ/2+1 and its padded, real-valued counterpart is 2(n/2+1). Directions decomposed along a communicator with P processes are denoted n/P. nc stands for "number of components" and corresponds to the advanced planning API's howmany arguments.
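The arithmetic behind this notation can be sketched in a few lines of plain C. The function names below are hypothetical, and local_extent assumes a block size of ceil(n/P) with any shortfall landing on the trailing processes, which is my understanding of FFTW's default block distribution. Treat its results as an estimate only: the authoritative extents come from the local_size calls, which may also ask for more storage than the product of the extents shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Complex entries in half-complex storage of a real direction of length n,
 * i.e. the "n/2+1" of the notation (integer division rounds down). */
static ptrdiff_t halfcomplex_len(ptrdiff_t n)
{
    assert(n > 0);
    return n / 2 + 1;
}

/* Real entries in the padded real-valued counterpart, i.e. "2(n/2+1)". */
static ptrdiff_t padded_real_len(ptrdiff_t n)
{
    return 2 * halfcomplex_len(n);
}

/* Estimated "n/P" on process `rank`, ASSUMING block = ceil(n/P):
 * leading processes hold a full block, one holds the remainder,
 * and any processes after that hold nothing. */
static ptrdiff_t local_extent(ptrdiff_t n, ptrdiff_t P, ptrdiff_t rank)
{
    assert(n > 0 && P > 0 && rank >= 0 && rank < P);
    const ptrdiff_t block = (n + P - 1) / P;
    const ptrdiff_t start = rank * block;
    if (start >= n) return 0;
    return (n - start < block) ? (n - start) : block;
}

/* Real numbers held locally for the default r2c 2D input layout,
 * n0/P x 2(n1/2+1) x nc, under the same block-size assumption. */
static ptrdiff_t r2c_2d_input_reals(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t nc,
                                    ptrdiff_t P, ptrdiff_t rank)
{
    assert(nc > 0);
    return local_extent(n0, P, rank) * padded_real_len(n1) * nc;
}
```

Under that assumption, n0 = 100 over P = 3 processes gives local extents 34, 34, and 32, and a direction with n1 = 8 is stored as 5 complex or 10 padded real entries.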


09 July 2011

FFTW 3.3 MPI Cheat Sheet
