oneDNN Adventures
Tilmann E. Bartsch, 04.03.2023
Integer matrix-matrix multiplication on the CPU using numpy is catastrophically slow, and other solutions are only slightly better. For x86_64 CPUs, Intel's C++ library oneDNN provides a remedy. See numpy-int-ops for a working implementation.
The neglected son: Integer Matrix Multiplication
Large matrix-matrix multiplications of two int8 matrices accumulated into an int32 matrix are painfully slow in numpy. The computation actually takes more than 10x longer than for float32 values, although the latter require four times more memory per value!
Figure: Runtime analysis of the matrix-matrix multiplication of two identical matrices for differing values of N. In the case of int8, the accumulation is performed with int32 integers.
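For reference, a minimal sketch of such a benchmark in plain numpy might look like this (the matrix size and timing method are illustrative, not the exact setup behind the plot):

import time
import numpy as np

N = 2048  # illustrative size; the plot above sweeps over several values of N
A_int8 = np.random.randint(-128, 128, size=(N, N), dtype=np.int8)
A_f32 = A_int8.astype(np.float32)

t0 = time.perf_counter()
C_int = np.matmul(A_int8, A_int8, dtype=np.int32)  # int8 inputs, int32 accumulation
t_int8 = time.perf_counter() - t0

t0 = time.perf_counter()
C_f32 = np.matmul(A_f32, A_f32)  # float32 path, backed by BLAS
t_f32 = time.perf_counter() - t0

print(f"int8 (int32 accumulation): {t_int8:.3f} s, float32: {t_f32:.3f} s")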
What the fuck! I just wanted to do some simple operations, and now everything turns out to be an order of magnitude slower? It turns out that numpy is based on BLAS (Basic Linear Algebra Subprograms), which is heavily optimized for floating-point computations but offers nothing at all for integer matrix multiplication, so numpy falls back to a slow generic implementation.
I assume that until the dawn of quantization methods for neural networks, nobody cared about this specific integer operation. If that's correct, the next best place to look is neural network libraries that support some kind of quantization:
Figure: Runtime analysis of the int8 matrix-matrix multiplication of two identical matrices for differing values of N. The accumulation is performed with int32 integers.
Well, they are all better than numpy, but one would think that int8 matrix-matrix multiplication should actually be faster than float32 matrix-matrix multiplication, since each int8 value needs only a quarter of the memory bandwidth.
A convenient problem solver: oneDNN
It seems like there's no pip-install-ready solution available. Since my laptop has an Intel CPU, I decided to give Intel's neural network acceleration framework oneDNN a try.
The approach immediately proved itself. Starting from the provided example for matrix multiplication, 34 lines of C++ do the job:
#include "dnnl.hpp"
#include "matmul.h"
void int8_matmul(int8_t* A, int8_t* B, int32_t* C, int64_t D1, int64_t D2, int64_t M, int64_t K, int64_t N) {
    dnnl::engine eng(dnnl::engine::kind::cpu, 0);

    // Create a MatMul primitive descriptor for the following op: C_s32 = A_s8 * B_s8
    // (all dimensions are only known at runtime)
    const int64_t d_D1 = DNNL_RUNTIME_DIM_VAL;
    const int64_t d_D2 = DNNL_RUNTIME_DIM_VAL;
    const int64_t d_M = DNNL_RUNTIME_DIM_VAL;
    const int64_t d_K = DNNL_RUNTIME_DIM_VAL;
    const int64_t d_N = DNNL_RUNTIME_DIM_VAL;
    dnnl::memory::desc a_md({d_D1, d_D2, d_M, d_K}, dnnl::memory::data_type::s8, dnnl::memory::format_tag::abcd);
    dnnl::memory::desc b_md({d_D1, d_D2, d_K, d_N}, dnnl::memory::data_type::s8, dnnl::memory::format_tag::abcd);
    dnnl::memory::desc c_md({d_D1, d_D2, d_M, d_N}, dnnl::memory::data_type::s32, dnnl::memory::format_tag::abcd);
    dnnl::matmul::primitive_desc matmul_pd = dnnl::matmul::primitive_desc(eng, a_md, b_md, c_md);

    // Create MatMul primitive
    dnnl::matmul matmul_p(matmul_pd);

    // Create memory objects at the locations given by A, B and C
    dnnl::memory A_s8_mem({{D1, D2, M, K}, dnnl::memory::data_type::s8, dnnl::memory::format_tag::abcd}, eng, A);
    dnnl::memory B_s8_mem({{D1, D2, K, N}, dnnl::memory::data_type::s8, dnnl::memory::format_tag::abcd}, eng, B);
    dnnl::memory C_s32_mem({{D1, D2, M, N}, dnnl::memory::data_type::s32, dnnl::memory::format_tag::abcd}, eng, C);

    // Execute the MatMul primitive
    dnnl::stream s(eng);
    matmul_p.execute(s,
                     {{DNNL_ARG_SRC, A_s8_mem},
                      {DNNL_ARG_WEIGHTS, B_s8_mem},
                      {DNNL_ARG_DST, C_s32_mem}});
    s.wait();
}
Adding some manageable boilerplate makes this available in Python. Simply compile int8_matmul with C linkage:
#ifndef MATMUL_H
#define MATMUL_H
#include <stdint.h>
int main(int argc, char **argv);
extern "C" {
void int8_matmul(int8_t* A, int8_t* B, int32_t* C, int64_t D1, int64_t D2, int64_t M, int64_t K, int64_t N);
} //end extern "C"
#endif // MATMUL_H
And call it in Python via cffi's foreign function interface:
from cffi import FFI
import numpy as np
import numpy_int_ops_cpu
# Instantiate FFI
dllURI = f"{numpy_int_ops_cpu.__path__[0]}/libIntOps.so"
funDef = """
void int8_matmul(int8_t* A, int8_t* B, int32_t* C, int64_t D1, int64_t D2, int64_t M, int64_t K, int64_t N);
"""
ffi = FFI()
cpp_lib = ffi.dlopen(dllURI)
ffi.cdef(funDef)
def get_pointer_to_np_arr(arr, ctype, ffi):
    return ffi.cast(ctype, arr.ctypes.data)


def int8_matmul(A: np.ndarray[None, np.int8], B: np.ndarray[None, np.int8]) -> np.ndarray[None, np.int32]:
    # A has shape (D1, D2, M, K), B has shape (D1, D2, K, N); both must be C-contiguous
    # since raw pointers into their buffers are handed to the C++ library.
    assert A.dtype == np.int8, "A must be int8"
    assert B.dtype == np.int8, "B must be int8"
    assert len(A.shape) == 4, "A must be 4D tensor"
    assert len(B.shape) == 4, "B must be 4D tensor"
    assert A.shape[-4] == B.shape[-4], f"Cannot matmul A (shape={A.shape}) and B (shape={B.shape})"
    assert A.shape[-3] == B.shape[-3], f"Cannot matmul A (shape={A.shape}) and B (shape={B.shape})"
    assert A.shape[-1] == B.shape[-2], f"Cannot matmul A (shape={A.shape}) and B (shape={B.shape})"
    D1 = A.shape[-4]
    D2 = A.shape[-3]
    M, K, N = A.shape[-2], A.shape[-1], B.shape[-1]
    # The result C has shape (D1, D2, M, N); the C++ side writes directly into this buffer.
    C = np.empty(shape=(D1, D2, M, N), dtype=np.int32)
    A_pointer = get_pointer_to_np_arr(A, "int8_t*", ffi)
    B_pointer = get_pointer_to_np_arr(B, "int8_t*", ffi)
    C_pointer = get_pointer_to_np_arr(C, "int32_t*", ffi)
    cpp_lib.int8_matmul(A_pointer, B_pointer, C_pointer, D1, D2, M, K, N)
    return C
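A quick sanity check of the wrapper might look like this (the shapes are arbitrary; the comparison against numpy's slow integer matmul is only there to verify the result):

rng = np.random.default_rng(0)
A = rng.integers(-128, 128, size=(1, 2, 512, 256), dtype=np.int8)
B = rng.integers(-128, 128, size=(1, 2, 256, 384), dtype=np.int8)

C = int8_matmul(A, B)  # oneDNN path
C_ref = np.matmul(A.astype(np.int32), B.astype(np.int32))  # numpy reference

assert C.dtype == np.int32
assert np.array_equal(C, C_ref)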
That's it!
Conclusion
Rerunning the runtime analysis from above shows:
Figure: Runtime analysis for different int8 matrix multiplication methods, including the custom oneDNN-based implementation. Performance for float32 is included for comparison.
int8 matrix-matrix multiplication (with int32 accumulation) is now a lot faster than float32 matrix-matrix multiplication. Neat.
For a complete implementation of the shown code, see my GitHub Repo numpy-int-ops.