Using FFTW for in-place matrix transposition
The Wikipedia article on in-place matrix transposition tantalizingly hints that FFTW can compute rectangular in-place matrix transposes efficiently (or as efficiently as one can). FFTW's manual, however, is silent. At least as far as section headings, function names, and the index are concerned.
Frigo and Johnson's paper, on page 6, spells out how to tap into the in-place transpose routines: create a rank-0 transform of vector rank 2 using transposed vector strides. I had to bug Matteo Frigo twice (nice guy!) to figure out how to accomplish this using the guru interface.
A C99 solution capable of handling the double
-valued either row major or column major case looks like:
#include <ctype.h>
#include <fftw3.h>
fftw_plan plan_transpose(char storage_type,
int rows,
int cols,
double *in,
double *out)
{
const unsigned flags = FFTW_ESTIMATE; /* Do not destroy input */
fftw_iodim howmany_dims[2];
switch (toupper(storage_type)) {
case 'R':
howmany_dims[0].n = rows;
howmany_dims[0].is = cols;
howmany_dims[0].os = 1;
howmany_dims[1].n = cols;
howmany_dims[1].is = 1;
howmany_dims[1].os = rows;
break;
case 'C':
howmany_dims[0].n = rows;
howmany_dims[0].is = 1;
howmany_dims[0].os = cols;
howmany_dims[1].n = cols;
howmany_dims[1].is = rows;
howmany_dims[1].os = 1;
break;
default:
return NULL;
}
const int howmany_rank = sizeof(howmany_dims)/sizeof(howmany_dims[0]);
return fftw_plan_guru_r2r(/*rank*/0, /*dims*/NULL,
howmany_rank, howmany_dims,
in, out, /*kind*/NULL, flags);
}
flags = FFTW_MEASURE
or its big brothers. I chose to use FFTW_ESTIMATE
in the sample above because I didn't want the planning process to destroy the input array.
You use the solution like:
{
char storage_type = /* One of 'R', 'r', 'C', 'c' */
int rows = /* Number of rows */
int cols = /* Number of columns */
double *in = /* ... */
double *out = /* ... */
/* Plan the transpose once; transpose is in-place iff in == out */
fftw_plan transpose = plan_transpose(storage_type, rows, cols, in, out);
assert(transpose);
/* Execute the plan potentially many times */
fftw_execute(transpose);
/* FFTW New-array Execute functions should be callable, too */
/* Beware of mixing in-place and out-of-place planning and usage */
double *another_in = /* ... */
double *another_out = /* ... */
fftw_execute_r2r(transpose, another_in, another_out);
/* Destroy the plan when completely done */
fftw_destroy_plan(transpose);
}
I wrote myself a toy to test it out. Transposing a Row major, 3-by-2 matrix In-place gives:
[499 rhys@mentes misc]$ ./fftw_transpose I R 3 2
viewing extents [ 3 2 ] with strides [ 2 1 ] {
0 1
100 101
200 201
}
stored contiguously as {
0 1 100 101 200 201
}
Transposing in-place with
(rdft-transpose-cut-3x2
(rdft-rank0-ip-sq-tiledbuf/1-x2-x2)
(rdft-rank0-iter-ci/1-x2))
viewing extents [ 2 3 ] with strides [ 3 1 ] {
0 100 200
1 101 201
}
stored contiguously as {
0 100 200 1 101 201
}
Transposing a Column major, 2-by-3 matrix Out-of-place gives:
[500 rhys@mentes misc]$ ./fftw_transpose O C 2 3
viewing extents [ 2 3 ] with strides [ 1 2 ] {
0 1 2
100 101 102
}
stored contiguously as {
0 100 1 101 2 102
}
Transposing out-of-place with
(rdft-rank0-tiledbuf/1-x3-x2)
viewing extents [ 3 2 ] with strides [ 1 3 ] {
0 100
1 101
2 102
}
stored contiguously as {
0 1 2 100 101 102
}
From burning lots of time playing with FFTW's transpose capabilities, it seems that the planner returns NULL
(i.e. bombs) whenever you attempt to transpose a non-contiguous matrix. That rules out specifying leading dimensions and/or arbitrary strides
as far as I can tell. Someone please correct me if I'm wrong. I'd love to be wrong so I can use FFTW to emulate Intel MKL's mkl_trans.h functionality for systems where a newer version of MKL is not available.
7 comments:
Instead of differentiating between row major and column major, couldn't one just swap the roles of "rows" and "cols"?
Assuming I had a transpose(rows, cols) function written for row major, then to transpose a column major matrix, I could just call transpose(cols, rows)?
Although the guru interface allows us to specify dimensions a la Matlab, that is to say, 1st dimension is smallest stride, rather than the usual C convention of last dimension being the smallest stride, I found that doing the latter ended up with slower fft execution.
Not sure if it makes any difference with fftw transpose. I tried using fftw transpose some time ago but found that in-place rectangular transpose could be much slower than out-of-place.
I called out the row-vs-column major behavior with the 'R' and 'C' values because I was mimicking the mkl_trans.h method signatures. There are definitely other ways, like you suggest, to handle that detail.
Performing any computation across the shortest possible stride is usually a win; the FFT performance you describe is consistent with that.
In-place transposes will generally be much, much slower than out-of-place transposes. For some example numbers, check out Dow's 1995
Transposing a Matrix on a Vector Computer where computing the in-place transpose by first transposing out-of-place and then copying the result ("V1") consistently wins over all in-place algorithms.
Noticed a typo in my comment. Should have been following C convention gives faster fft execution.
That is useful! What if we want to do the FFT in 1 dimension and then store the transposed of that transformed matrix? Could this method be useful for doing so?
Did you see http://www.catanzaro.name/papers/PPoPP-2014.pdf?
No Jeff, I hadn't. Thank you for the pointer.
Greatt blog you have
Post a Comment