Matrix-Matrix Multiplication

1. General Matrix Multiplication

For an n \times k matrix \mathbf{A} and a k \times m matrix \mathbf{B}, where \mathbf{C} = \mathbf{AB}:

C_{ij} = \sum_{w=1}^k A_{iw} B_{wj}

This is straightforward to implement in C, but can also be optimized with some simple techniques to increase performance by an order of magnitude. Also, a note: instead of one-indexing as is common in mathematics, C uses zero-indexing, which is what will be used for the remainder of this document.

Initially I implemented the multiplication in the order used in the provided mm.c: the variables for the for loops, from outer to inner, went i, j, w. One advantage of having the w loop be the innermost is that the output matrix does not need to be zeroed before entering the main loop, as each element of the output is full computed by the w loop before moving on to the next element. However, zeroing the output matrix is only O(n^2) with a relatively small coefficient, and as such makes little impact on the overall performance.

Moving the j loop to be the innermost enables several compiler optimizations that, at -O3, interact to give an order of magnitude improvement in performance. I did enough testing to determine that it is neither a single mode of optimization (as represented by GCC's -f flags), nor a linear combination of optimization modes, but an interaction between several. -O3 enables 103 -f flags, so I am not currently going to write a script to test all 10 nonillion (10^{31}) combinations.

{Multiply 1}
for (i = 0; i < N; i++) {
    for (j = 0; j < M; j++) {
        C[i*M + j] = 0.0f;
    }
}
for (i = 0; i < N; i++) {
    for (w = 0; w < K; w++) {
        for (j = 0; j < M; j++) {
            C[i*M + j] += ACCESS_A(i,w) * BT[w*M + j];
        }
    }
}

Used in sections 4, 4, 4 and 4

Note that accessing A uses ACCESS_A, but B is not accessed. Instead, there is BT. The optimization described above only works if B is row-major, and so if B is column-major, BT is its transpose. As with zeroing C, this is only O(n^2) with a relatively small coefficient, and as such does not itself add much time to the calculation, even though it requires dynamic memory allocation. The performance improvement from the GCC optimization is just so great that it makes tradeoffs like this worth it.

\pagebreak

{Transpose B 1}
REAL *BT = malloc(sizeof(REAL)*K*M);
for (w = 0; w < K; w++) {
    for (j = 0; j < M; j++) {
        BT[w*M + j] = ACCESS_B(w,j);
    }
}

Used in sections 4 and 4

Otherwise BT is just a pointer to B. The ACCESS_A and ACCESS_B macros are set based on the element order of A and B, either row-major or column-major.

\vspace{-2.25em}

2. Access to Row-Major Matrices

For row-major matrices, the row number i needs to be multiplied by the row length (that is, the number of columns), as the row number is essentially the number of rows that need to be "skipped" to reach the correct row. Then the column number j is added to index into that row. For example, the index for element (2,3) in the following 4 \times 6 matrix is \color{red}{2 \cdot 6}\color{black}{} + \color{blue}{3}\color{black}{} = \color{violet}{15}\color{black}{}.

\begin{pmatrix}\color{red}{0}\color{black}{}&\color{red}{1}\color{black}{}&\color{red}{2}\color{black}{}&\color{red}{3}\color{black}{}&\color{red}{4}\color{black}{}&\color{red}{5}\color{black}{}\\\color{red}{6}\color{black}{}&\color{red}{7}\color{black}{}&\color{red}{8}\color{black}{}&\color{red}{9}\color{black}{}&\color{red}{10}\color{black}{}&\color{red}{11}\color{black}{}\\\color{blue}{12}\color{black}{}&\color{blue}{13}\color{black}{}&\color{blue}{14}\color{black}{}&\color{violet}{15}\color{black}{}&\color{gray}{16}\color{black}{}&\color{gray}{17}\color{black}{}\\\color{gray}{18}\color{black}{}&\color{gray}{19}\color{black}{}&\color{gray}{20}\color{black}{}&\color{gray}{21}\color{black}{}&\color{gray}{22}\color{black}{}&\color{gray}{23}\color{black}{}\end{pmatrix}

So the index of element (i,j) in an n \times m row-major matrix is i \cdot m + j. In C:

{Row-Major A 2}
#define ACCESS_A(I,J) (A[(I)*K + (J)])

Used in sections 4 and 4

{Row-Major B 2}
#define ACCESS_B(I,J) (B[(I)*M + (J)])

Used in sections 4 and 4

\vspace{-2.25em}

3. Access to Column-Major Matrices

For column-major matrices, the column number j needs to be multiplied by the column length (that is, the number of rows), as the column number is essentially the number of columns that need to be "skipped" to reach the correct column. Then the row number i is added to index into that column. For example, the index for element (2,3) in the following 4 \times 6 matrix is \color{red}{2}\color{black}{} + \color{blue}{3 \cdot 4}\color{black}{} = \color{violet}{14}\color{black}{}.

\begin{pmatrix}\color{blue}{0}\color{black}{}&\color{blue}{4}\color{black}{}&\color{blue}{8}\color{black}{}&\color{red}{12}\color{black}{}&\color{gray}{16}\color{black}{}&\color{gray}{20}\color{black}{}\\\color{blue}{1}\color{black}{}&\color{blue}{5}\color{black}{}&\color{blue}{9}\color{black}{}&\color{red}{13}\color{black}{}&\color{gray}{17}\color{black}{}&\color{gray}{21}\color{black}{}\\\color{blue}{2}\color{black}{}&\color{blue}{6}\color{black}{}&\color{blue}{10}\color{black}{}&\color{violet}{14}\color{black}{}&\color{gray}{18}\color{black}{}&\color{gray}{22}\color{black}{}\\\color{blue}{3}\color{black}{}&\color{blue}{7}\color{black}{}&\color{blue}{11}\color{black}{}&\color{gray}{15}\color{black}{}&\color{gray}{19}\color{black}{}&\color{gray}{23}\color{black}{}\end{pmatrix}

So the index of element (i,j) in an n \times m column-major matrix is i + j \cdot n. In C:

{Col-Major A 3}
#define ACCESS_A(I,J) (A[(I) + (J)*N])

Used in sections 4 and 4

\pagebreak

{Col-Major B 3}
#define ACCESS_B(I,J) (B[(I) + (J)*K])

Used in sections 4 and 4

4. Bringing It Together

The function itself basically just switches between copies of the multiplication routine with different definitions for ACCESS_A and ACCESS_B. After each copy of the multiplication routine, it is probably prudent to undefine the macros:

{Undefine Macros 4}
#undef ACCESS_A
#undef ACCESS_B

Used in sections 4, 4 and 4

Here is the function. As you can see, if just switches on the orders of the input matrices, transposing B if necessary and multiplying:

{MM Implementation 4}
// C[N][M] = A[N][K] * B[K][M]
void mm(int N, int K, int M, REAL *A, REAL *B, REAL *C,
        int A_rowMajor, int B_rowMajor) {
    int i, j, w;
    if (A_rowMajor && B_rowMajor) {
        {Row-Major A, 2}
        {Row-Major B, 2}
        REAL *BT = B;
        {Multiply, 1}
        {Undefine Macros, 4}
    } else if (A_rowMajor && !B_rowMajor) {
        {Row-Major A, 2}
        {Col-Major B, 3}
        {Transpose B, 1}
        {Multiply, 1}
        free(BT);
        {Undefine Macros, 4}
    } else if (!A_rowMajor && B_rowMajor) {
        {Col-Major A, 3}
        {Row-Major B, 2}
        REAL *BT = B;
        {Multiply, 1}
        {Undefine Macros, 4}
    } else {
        {Col-Major A, 3}
        {Col-Major B, 3}
        {Transpose B, 1}
        {Multiply, 1}
        free(BT);
        {Undefine Macros, 4}
    }
}

Used in section 6

5. Testing

This program produces a file with the generated inputs and outputs, so that the calculation's validity can be tested by an outside program. The file is three ints: N, K, and M, stored in the native format, and then the two input matrices, also in native format, and then all the output matrices in native format. The file is meant for immediate checking and debugging on the same machine, not results storage.

{Testing 5}
// Start the debug file
FILE *debug_file = fopen(".mm.debug", "w");
if (debug_file == NULL) {
    fprintf(stderr, "Could not open file for writing: .mm.debug");
} else {
    int buf[3] = { N, K, M };
    fwrite(buf, sizeof(int), 3, debug_file);
    fwrite(A, sizeof(REAL), N*K, debug_file);
    fwrite(B, sizeof(REAL), K*M, debug_file);
    fflush(debug_file);
}

Added to in section 5

Used in section 6

The program uses the provided read_timer to test the performance of mm for each of the permutations of matrix orders.

{Testing 5} +=
// Note: neested functions are a GNU C extension.
double time_mm(int A_rowMajor, int B_rowMajor) {
    // Time mm
    double elapsed_mm = read_timer();
    mm(N, K, M, A, B, C, A_rowMajor, B_rowMajor);
    elapsed_mm  = (read_timer() - elapsed_mm);
    // Output matrix to debug file
    if (debug_file != NULL) {
        fwrite(C, sizeof(REAL), N*M, debug_file);
        fflush(debug_file);
    }
    // Print time
    printf("mm (%d, %d):\t\t\t%4f\t%4f\n", A_rowMajor, B_rowMajor,
            elapsed_mm * 1.0e3, (double)M*N*K / (1.0e6 * elapsed_mm));
    fflush(stdout);
    return elapsed_mm;
}

// Do the timing
time_mm(1, 1);
time_mm(1, 0);
time_mm(0, 1);
time_mm(0, 0);

Added to in section 5

Used in section 6

\pagebreak

Finally, if there was an error with the debug file, set the exit code to 5.

{Testing 5} +=
if (debug_file == NULL || ferror(debug_file)) {
    status = 5;
}

Added to in section 5

Used in section 6

6. Base Code

This is roughly based on the provided mm.c, but a lot has been removed.

{mm.c 6}
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include <sys/timeb.h>
#include <omp.h>

/* read timer in second */
double read_timer() {
    struct timeb tm;
    ftime(&tm);
    return (double)tm.time + (double)tm.millitm / 1000.0;
}

/* read timer in ms */
double read_timer_ms() {
    struct timeb tm;
    ftime(&tm);
    return (double)tm.time * 1000.0 + (double)tm.millitm;
}

#define REAL float
#define VECTOR_LENGTH 512

/* initialize a vector with random floating point numbers */
void init(REAL *A, int N) {
    int i;
    for (i = 0; i < N; i++) {
        A[i] = (double)drand48();
    }
}

{MM Implementation, 4}

{Main, 6}

\pagebreak

{Main 6}
int main(int argc, char **argv) {
    int N = VECTOR_LENGTH;
    int M = N;
    int K = N;
    double elapsed; /* for timing */
    if (argc < 4) {
        fprintf(stderr, "Usage: mm [<N(%d)> <K(%d)> <M(%d)>]\n", N, K, M);
        fprintf(stderr, "\t Example: ./mm %d %d %d (default)\n", N, K, M);
    } else {
    	N = atoi(argv[1]);
    	K = atoi(argv[2]);
    	M = atoi(argv[3]);
    }
    REAL *A = malloc(sizeof(REAL)*N*K);
    REAL *B = malloc(sizeof(REAL)*K*M);
    REAL *C = malloc(sizeof(REAL)*N*M);

    srand48((1 << 12));
    init(A, N*K);
    init(B, K*M);

    /* you should add the call to each function and time the execution */
    printf( "==================================================="
            "===================================================\n");
    printf("\tC[%d][%d] = A[%d][%d] * B[%d][%d] without OpenMP\n", N, M, N, K, K, M);
    printf( "---------------------------------------------------"
            "---------------------------------------------------\n");
    printf("Performance:\t\t\tRuntime (ms)\t MFLOPS \n");
    printf( "---------------------------------------------------"
            "---------------------------------------------------\n"); 
    fflush(stdout);

    int status = 0;

    {Testing, 5}
    
    free(A);
    free(B);
    free(C);
    return status;
}

Next Chapter