KMM: Kernel Memory Manager

The Kernel Memory Manager (KMM) is a lightweight, high-performance framework designed for parallel dataflow execution and efficient memory management on multi-GPU platforms.

KMM automatically manages GPU memory, partitions workloads across multiple GPUs, and schedules tasks efficiently. Unlike frameworks that impose a specific programming model, KMM integrates existing GPU kernels and functions without requiring a full rewrite of your code.

Highlights of KMM:

  • Efficient Memory Management: automatically allocates memory and transfers data between GPU and host only when necessary.

  • Scalable Computing: seamlessly spills data from GPU to host memory, enabling computation on datasets that exceed GPU memory.

  • Optimized Scheduling: a DAG scheduler automatically tracks dependencies and executes kernels in a sequentially consistent order.

  • Flexible Work Partitioning: splits workloads and data according to user-defined distributions to make use of all available resources.

  • Portable Execution: supports existing CUDA, HIP, and CPU-based functions, integrating them with minimal changes.

  • Multi-Dimensional Arrays: handles ND-arrays of any shape, dimensionality, and data type.
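
The partitioning idea from the highlights above can be sketched in plain C++. This is only an illustration of splitting a 1D index space into fixed-size chunks and assigning them round-robin to devices. The names `ChunkRange`, `split_range`, and `device_for_chunk` are hypothetical and not part of KMM's API, and KMM's actual distribution policies may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical half-open index range [begin, end); not KMM's own type.
struct ChunkRange {
    std::int64_t begin;
    std::int64_t end;
};

// Split [0, n) into consecutive chunks of at most chunk_size elements.
// The last chunk is shorter when chunk_size does not divide n evenly.
std::vector<ChunkRange> split_range(std::int64_t n, std::int64_t chunk_size) {
    std::vector<ChunkRange> chunks;
    if (n <= 0 || chunk_size <= 0) {
        return chunks;  // nothing to partition
    }
    for (std::int64_t begin = 0; begin < n; begin += chunk_size) {
        chunks.push_back({begin, std::min(begin + chunk_size, n)});
    }
    return chunks;
}

// Round-robin placement: chunk i goes to device i % num_devices.
// This is one simple policy chosen for illustration.
int device_for_chunk(std::size_t chunk_index, int num_devices) {
    assert(num_devices > 0);
    return static_cast<int>(chunk_index % static_cast<std::size_t>(num_devices));
}
```

For instance, `split_range(2'000'000'000, 200'000'000)` yields ten chunks of 200 million elements each. Assuming 4-byte floats, one such chunk occupies 800 MB per array, which is the granularity at which data could be moved to or spilled from a GPU.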

Basic Example

This example shows how to run a CUDA kernel implementing a vector add operation with KMM.

    #include "kmm/kmm.hpp"

    // Note: `_x`, `write`, and `initialize_inputs` are not defined in this
    // snippet; they are assumed to be in scope (for example through a
    // using-directive and a user-provided helper).

    // Each launch processes one chunk; `range` holds that chunk's global bounds.
    __global__ void vector_add(
        kmm::Range<int64_t> range,
        kmm::GPUSubviewMut<float> output,
        kmm::GPUSubview<float> left,
        kmm::GPUSubview<float> right
    ) {
        // Global element index: position within the launch plus the chunk offset
        int64_t i = blockIdx.x * blockDim.x + threadIdx.x + range.begin;
        if (i >= range.end) return;

        output[i] = left[i] + right[i];
    }

    int main() {
        // 2B items, 10 chunks, 256 threads per block
        long n = 2'000'000'000;
        long chunk_size = n / 10;
        dim3 block_size = 256;

        // Initialize runtime
        auto rt = kmm::make_runtime();

        // Create arrays
        auto A = kmm::Array<float> {n};
        auto B = kmm::Array<float> {n};
        auto C = kmm::Array<float> {n};

        // Initialize input arrays
        initialize_inputs(A, B);

        // Launch the kernel once per chunk: C is written, A and B are read
        rt.parallel_submit(
            n, chunk_size,
            kmm::GPUKernel(vector_add, block_size),
            _x,
            write(C[_x]),
            A[_x],
            B[_x]
        );

        // Wait for completion
        rt.synchronize();

        return 0;
    }
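
To make the kernel's indexing easier to follow, here is a host-only sketch that emulates one chunk of the launch on the CPU. It assumes one thread per element, a grid size obtained by ceiling division, and arrays indexed by global position (which is what the `+ range.begin` offset in the kernel suggests). The functions `blocks_for_chunk` and `vector_add_chunk_reference` are hypothetical helpers written for this illustration; how KMM itself derives the grid size is not stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Thread blocks needed to cover chunk_len elements with block_size threads
// each (ceiling division). Assumes one thread per element.
std::int64_t blocks_for_chunk(std::int64_t chunk_len, std::int64_t block_size) {
    assert(chunk_len >= 0 && block_size > 0);
    return (chunk_len + block_size - 1) / block_size;
}

// CPU emulation of the vector_add kernel for the chunk [begin, end).
// The two loops stand in for blockIdx.x and threadIdx.x.
void vector_add_chunk_reference(
    std::int64_t begin,
    std::int64_t end,
    std::int64_t block_size,
    std::vector<float>& output,
    const std::vector<float>& left,
    const std::vector<float>& right
) {
    const std::int64_t num_blocks = blocks_for_chunk(end - begin, block_size);
    for (std::int64_t block = 0; block < num_blocks; ++block) {
        for (std::int64_t thread = 0; thread < block_size; ++thread) {
            // Same formula as the kernel: local position plus chunk offset
            const std::int64_t i = block * block_size + thread + begin;
            if (i >= end) continue;  // same bounds guard as the kernel
            const auto idx = static_cast<std::size_t>(i);
            output[idx] = left[idx] + right[idx];
        }
    }
}
```

Under these assumptions, a chunk of 200 million elements with 256 threads per block needs `blocks_for_chunk(200'000'000, 256)` blocks, and the bounds guard only matters when the chunk length is not a multiple of the block size.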
