20 January 2010

Using FFTW for in-place matrix transposition

The Wikipedia article on in-place matrix transposition tantalizingly hints that FFTW can compute rectangular in-place matrix transposes efficiently (or as efficiently as one can). FFTW's manual, however, is silent. At least as far as section headings, function names, and the index are concerned.

Frigo and Johnson's paper, on page 6, spells out how to tap into the in-place transpose routines: create a rank-0 transform of vector rank 2 using transposed vector strides. I had to bug Matteo Frigo twice (nice guy!) to figure out how to accomplish this using the guru interface.

A C99 solution capable of handling the double-valued either row major or column major case looks like:

#include <assert.h>
#include <ctype.h>
#include <fftw3.h>

fftw_plan plan_transpose(char storage_type,
                         int rows,
                         int cols,
                         double *in,
                         double *out)
{
    const unsigned flags = FFTW_ESTIMATE; /* Do not destroy input */

    fftw_iodim howmany_dims[2];
    switch (toupper(storage_type)) {
        case 'R':
            howmany_dims[0].n  = rows;
            howmany_dims[0].is = cols;
            howmany_dims[0].os = 1;
            howmany_dims[1].n  = cols;
            howmany_dims[1].is = 1;
            howmany_dims[1].os = rows;
            break;
        case 'C':
            howmany_dims[0].n  = rows;
            howmany_dims[0].is = 1;
            howmany_dims[0].os = cols;
            howmany_dims[1].n  = cols;
            howmany_dims[1].is = rows;
            howmany_dims[1].os = 1;
            break;
        default:
            return NULL;
    }
    const int howmany_rank = sizeof(howmany_dims)/sizeof(howmany_dims[0]);

    return fftw_plan_guru_r2r(/*rank*/0, /*dims*/NULL,
                              howmany_rank, howmany_dims,
                              in, out, /*kind*/NULL, flags);
}
You can get a better performing plan if you're willing to use flags = FFTW_MEASURE or its big brothers. I chose to use FFTW_ESTIMATE in the sample above because I didn't want the planning process to destroy the input array.

You use the solution like:

void somelogic()
{
    char storage_type = /* One of 'R', 'r', 'C', 'c' */
    int rows          = /* Number of rows */
    int cols          = /* Number of columns */
    double *in        = /* ... */
    double *out       = /* ... */

    /* Plan the transpose once; transpose is in-place iff in == out */
    fftw_plan transpose = plan_transpose(storage_type, rows, cols, in, out);
    assert(transpose);

    /* Execute the plan potentially many times */
    fftw_execute(transpose);

    /* FFTW New-array Execute functions should be callable, too */
    /* Beware of mixing in-place and out-of-place planning and usage */
    double *another_in  = /* ... */
    double *another_out = /* ... */
    fftw_execute_r2r(transpose, another_in, another_out);

    /* Destroy the plan when completely done */
    fftw_destroy_plan(transpose);
}
There are many restrictions around the the New-array execute interface, so be careful if you go that route.

One very cool thing is that you don't need to know a priori that you're computing an in-place transpose. The planner works for out-of-place transposes as well. Another cool thing is that the plans are useful for rectangular transposes and square transposes alike with more efficient algorithms being used for the latter case.

I wrote myself a toy to test it out. Transposing a Row major, 3-by-2 matrix In-place gives:

[499 rhys@mentes misc]$ ./fftw_transpose I R 3 2

viewing extents [ 3 2 ] with strides [ 2 1 ] {
   0    1 
 100  101 
 200  201 
}
stored contiguously as {
   0    1  100  101  200  201 
}

Transposing in-place with 
(rdft-transpose-cut-3x2
  (rdft-rank0-ip-sq-tiledbuf/1-x2-x2)
  (rdft-rank0-iter-ci/1-x2))

viewing extents [ 2 3 ] with strides [ 3 1 ] {
   0  100  200 
   1  101  201 
}
stored contiguously as {
   0  100  200    1  101  201 
}
Transposing a Column major, 2-by-3 matrix Out-of-place gives:
[500 rhys@mentes misc]$ ./fftw_transpose O C 2 3

viewing extents [ 2 3 ] with strides [ 1 2 ] {
   0    1    2 
 100  101  102 
}
stored contiguously as {
   0  100    1  101    2  102 
}

Transposing out-of-place with 
(rdft-rank0-tiledbuf/1-x3-x2)

viewing extents [ 3 2 ] with strides [ 1 3 ] {
   0  100 
   1  101 
   2  102 
}
stored contiguously as {
   0    1    2  100  101  102 
}

From burning lots of time playing with FFTW's transpose capabilities, it seems that the planner returns NULL (i.e. bombs) whenever you attempt to transpose a non-contiguous matrix. That rules out specifying leading dimensions and/or arbitrary strides as far as I can tell. Someone please correct me if I'm wrong. I'd love to be wrong so I can use FFTW to emulate Intel MKL's mkl_trans.h functionality for systems where a newer version of MKL is not available.

6 comments:

pijyoi said...

Instead of differentiating between row major and column major, couldn't one just swap the roles of "rows" and "cols"?

Assuming I had a transpose(rows, cols) function written for row major, then to transpose a column major matrix, I could just call transpose(cols, rows)?

Although the guru interface allows us to specify dimensions a la Matlab, that is to say, 1st dimension is smallest stride, rather than the usual C convention of last dimension being the smallest stride, I found that doing the latter ended up with slower fft execution.

Not sure if it makes any difference with fftw transpose. I tried using fftw transpose some time ago but found that in-place rectangular transpose could be much slower than out-of-place.

Rhys Ulerich said...

I called out the row-vs-column major behavior with the 'R' and 'C' values because I was mimicking the mkl_trans.h method signatures. There are definitely other ways, like you suggest, to handle that detail.

Performing any computation across the shortest possible stride is usually a win; the FFT performance you describe is consistent with that.

In-place transposes will generally be much, much slower than out-of-place transposes. For some example numbers, check out Dow's 1995
Transposing a Matrix on a Vector Computer
where computing the in-place transpose by first transposing out-of-place and then copying the result ("V1") consistently wins over all in-place algorithms.

pijyoi said...

Noticed a typo in my comment. Should have been following C convention gives faster fft execution.

Luotto said...

That is useful! What if we want to do the FFT in 1 dimension and then store the transposed of that transformed matrix? Could this method be useful for doing so?

Jeff Hammond said...

Did you see http://www.catanzaro.name/papers/PPoPP-2014.pdf?

Rhys Ulerich said...

No Jeff, I hadn't. Thank you for the pointer.

Subscribe Subscribe to The Return of Agent Zlerich