oneDNN Adventures

Tilmann E. Bartsch, 04.03.2023

Integer matrix-matrix multiplication on the CPU using numpy is catastrophically slow, and other solutions are only slightly better. For x86_64 CPUs, Intel's C++ library oneDNN provides a remedy. See numpy-int-ops for a working implementation.

 


The neglected son: Integer Matrix Multiplication

Large matrix-matrix multiplications of two int8 matrices, accumulated into an int32 matrix, are painfully slow in numpy. The computation actually takes more than 10x longer than for float32 values, even though the latter require four times more memory per value!
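
To see the effect for yourself, a small timing sketch along the following lines suffices (the matrix size and the crude single-run measurement are illustrative, not the exact benchmark behind the plot below):

import time
import numpy as np

N = 2048
A8 = np.random.randint(-128, 128, size=(N, N), dtype=np.int8)
A32 = A8.astype(np.float32)

t0 = time.perf_counter()
np.matmul(A8, A8, dtype=np.int32)  # int8 inputs, accumulated in int32
t1 = time.perf_counter()
np.matmul(A32, A32)                # the same data as float32
t2 = time.perf_counter()

print(f"int8 (int32 accumulation): {t1 - t0:.2f} s, float32: {t2 - t1:.2f} s")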

Figure: Runtime analysis of the matrix-matrix multiplication of two identical matrices in numpy for differing values of N, comparing int8 and float32; for int8 inputs the accumulation is performed in int32.

What the fuck! I just wanted to do some simple operations and now everything turns out to be an order of magnitude slower? It turns out that numpy is based on BLAS (Basic Linear Algebra Subprograms), which is heavily optimized for floating-point computations but not at all for integer calculations.

I assume that until the dawn of quantization methods for neural networks, nobody cared about this specific integer operation. If that's correct, the next best place to look is neural network libraries that support some kind of quantization:

Figure: Runtime analysis of the int8 matrix-matrix multiplication of two identical matrices for differing values of N, comparing multiple int8 matrix multiplication methods with numpy's float32 matrix multiplication as baseline; the int8 accumulation is performed in int32.

Well, they are all better than numpy, but one would think that int8 matrix-matrix multiplication should actually be faster than float32 matrix-matrix multiplication.

A convenient problem solver: oneDNN

It seems like there's no pip install-ready solution available. Since my laptop has an Intel CPU, I decided to give Intel's neural network acceleration library oneDNN a try.

The approach immediately proved itself. Starting from oneDNN's provided matrix multiplication example, 34 lines of C++ do the job:

#include "dnnl.hpp"
#include "matmul.h"

void int8_matmul(int8_t* A, int8_t* B, int32_t* C, int64_t D1, int64_t D2, int64_t M, int64_t K, int64_t N) {
    dnnl::engine eng(dnnl::engine::kind::cpu, 0);

    // Create a MatMul primitive descriptor for the following op: C_s32 = A_s8 * B_s8
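    // DNNL_RUNTIME_DIM_VAL marks the dimensions as known only at execution time,
    // so this single primitive can be reused for arbitrary matrix sizes.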
    const int64_t d_D1 = DNNL_RUNTIME_DIM_VAL;
    const int64_t d_D2 = DNNL_RUNTIME_DIM_VAL;
    const int64_t d_M = DNNL_RUNTIME_DIM_VAL;
    const int64_t d_K = DNNL_RUNTIME_DIM_VAL;
    const int64_t d_N = DNNL_RUNTIME_DIM_VAL;
    dnnl::memory::desc a_md({d_D1, d_D2, d_M, d_K}, dnnl::memory::data_type::s8, dnnl::memory::format_tag::abcd); 
    dnnl::memory::desc b_md({d_D1, d_D2, d_K, d_N}, dnnl::memory::data_type::s8, dnnl::memory::format_tag::abcd);
    dnnl::memory::desc c_md({d_D1, d_D2, d_M, d_N}, dnnl::memory::data_type::s32, dnnl::memory::format_tag::abcd);
    dnnl::matmul::primitive_desc matmul_pd = dnnl::matmul::primitive_desc(eng, a_md, b_md, c_md); 

    // Create Matmul Primitive
    dnnl::matmul matmul_p(matmul_pd);

    // Create memory objects at the locations given by A, B and C
    dnnl::memory A_s8_mem({{D1, D2, M, K}, dnnl::memory::data_type::s8, dnnl::memory::format_tag::abcd}, eng, A);
    dnnl::memory B_s8_mem({{D1, D2, K, N}, dnnl::memory::data_type::s8, dnnl::memory::format_tag::abcd}, eng, B);
    dnnl::memory C_s32_mem({{D1, D2, M, N}, dnnl::memory::data_type::s32, dnnl::memory::format_tag::abcd}, eng, C);

    // Execute the MatMul primitive
    dnnl::stream s(eng);
    matmul_p.execute(s,
        {{DNNL_ARG_SRC, A_s8_mem},
         {DNNL_ARG_WEIGHTS, B_s8_mem},
         {DNNL_ARG_DST, C_s32_mem}});
    s.wait();
}

Adding some manageable boilerplate makes this available in Python. Simply compile int8_matmul with C linkage:

#ifndef MATMUL_H
#define MATMUL_H

#include <stdint.h>

extern "C" {
    void int8_matmul(int8_t* A, int8_t* B, int32_t* C, int64_t D1, int64_t D2, int64_t M, int64_t K, int64_t N);
} //end extern "C"

#endif // MATMUL_H

And call it in Python via the foreign function interface library cffi:

from cffi import FFI
import numpy as np

import numpy_int_ops_cpu

# Instantiate FFI
dllURI = f"{numpy_int_ops_cpu.__path__[0]}/libIntOps.so"
funDef = """
void int8_matmul(int8_t* A, int8_t* B, int32_t* C, int64_t D1, int64_t D2, int64_t M, int64_t K, int64_t N);
"""
ffi = FFI()
cpp_lib = ffi.dlopen(dllURI)
ffi.cdef(funDef)

def get_pointer_to_np_arr(arr, ctype, ffi):
    # Interpret the ndarray's underlying data buffer as a raw C pointer (no copy is made).
    return ffi.cast(ctype, arr.ctypes.data)

def int8_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Multiply two 4D int8 tensors (batch dims D1, D2) with int32 accumulation via oneDNN."""
    assert A.dtype == np.int8, "A must be int8"
    assert B.dtype == np.int8, "B must be int8"
    assert len(A.shape) == 4, "A must be a 4D tensor"
    assert len(B.shape) == 4, "B must be a 4D tensor"
    assert A.shape[:2] == B.shape[:2], f"Cannot matmul A (shape={A.shape}) and B (shape={B.shape})"
    assert A.shape[-1] == B.shape[-2], f"Cannot matmul A (shape={A.shape}) and B (shape={B.shape})"
    # The raw data pointers are handed to C++, so the arrays must be C-contiguous.
    A = np.ascontiguousarray(A)
    B = np.ascontiguousarray(B)
    D1, D2 = A.shape[0], A.shape[1]
    M, K, N = A.shape[-2], A.shape[-1], B.shape[-1]
    C = np.empty(shape=(D1, D2, M, N), dtype=np.int32)
    A_pointer = get_pointer_to_np_arr(A, "int8_t*", ffi)
    B_pointer = get_pointer_to_np_arr(B, "int8_t*", ffi)
    C_pointer = get_pointer_to_np_arr(C, "int32_t*", ffi)
    cpp_lib.int8_matmul(A_pointer, B_pointer, C_pointer, D1, D2, M, K, N)
    return C

That's it!
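
A quick sanity check is to compare the binding against numpy's own result on small random inputs. A minimal sketch, assuming int8_matmul is imported from wherever the wrapper above lives:

import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(-128, 128, size=(2, 3, 64, 48), dtype=np.int8)
B = rng.integers(-128, 128, size=(2, 3, 48, 32), dtype=np.int8)

C = int8_matmul(A, B)                    # oneDNN path
C_ref = np.matmul(A, B, dtype=np.int32)  # numpy reference with int32 accumulation
assert C.dtype == np.int32
assert np.array_equal(C, C_ref)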

Conclusion

Rerunning the runtime analysis from above shows:

Figure: Runtime analysis of different int8 matrix multiplication methods, including the custom oneDNN-based implementation with int32 accumulation; numpy's float32 matrix multiplication is included for comparison.

int8 matrix multiplication (with int32 accumulation) is now a lot faster than float32 matrix multiplication. Neat.

For a complete implementation of the code shown here, see my GitHub repo numpy-int-ops.