Optimizing Low-Density Matrix Multiplication
Matrix multiplication is a cornerstone operation in scientific computing, yet its efficiency varies drastically with a matrix's sparsity and the distribution of its non-zero elements. During a volunteer research project, I addressed this challenge by developing a framework built around the Coordinate-Only Sparsity Scheme (COOS), aimed at accelerating computations for sparse matrices whose non-zero entries cluster in specific regions.
The Challenge
Traditional matrix multiplication algorithms, optimized for dense or uniformly sparse matrices, often struggle with low-density configurations—matrices with scattered non-zero elements forming clusters amid vast areas of zeros. This inefficiency stems from unnecessary memory allocation and operations on zero entries, leading to suboptimal performance.
The Solution
The COOS format represents a tailored approach that stores only the coordinates and values of non-zero elements in clusters, bypassing redundant operations. By focusing exclusively on the significant regions, this method reduces computational overhead and memory usage. I implemented this framework and benchmarked it against standard dense and CSR-based sparse matrix multiplication techniques.
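To make the storage idea concrete, the following sketch converts a dense row-major matrix into a triple-based layout of the kind described above: a [count, rows, cols] header followed by (row, col, value) triples. The function name and exact layout are illustrative assumptions, not the project's actual API.

```c
#include <stdlib.h>

/* Hypothetical helper: scan a dense row-major matrix once and keep only
 * (row, col, value) triples for the non-zero entries, preceded by a
 * three-float header [count, rows, cols]. Caller frees the result. */
float *dense_to_coos(const float *dense, int rows, int cols) {
    int nnz = 0;
    for (int i = 0; i < rows * cols; i++)
        if (dense[i] != 0.0f) nnz++;

    float *coos = malloc((3 + 3 * (size_t)nnz) * sizeof(float));
    if (!coos) return NULL;
    coos[0] = (float)nnz;
    coos[1] = (float)rows;
    coos[2] = (float)cols;

    float *t = coos + 3;
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            if (dense[r * cols + c] != 0.0f) {
                *t++ = (float)r;
                *t++ = (float)c;
                *t++ = dense[r * cols + c];
            }
    return coos;
}
```

For a matrix with density d, this layout needs roughly 3·d·rows·cols floats instead of rows·cols, which is where the memory savings for clustered, low-density inputs come from.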
Results
The COOS implementation demonstrated significant performance improvements, achieving up to 5x faster computation times for matrices with density below 10% and clustered non-zero elements. For very low-density matrices (density below 1%), COOS excelled further by eliminating redundant computations over vast zero regions, delivering even greater gains over conventional methods. These results highlight COOS's adaptability and effectiveness for sparse datasets with irregular or highly scattered non-zero distributions.
Technical Contributions
- Designed and implemented the COOS framework to efficiently handle low-density matrices with clustered sparsity.
- Conducted comparative benchmarks to identify performance advantages over dense and CSR-based multiplication methods.
- Optimized memory allocation strategies to ensure scalability for large-scale sparse matrices.
Key Insights
The efficiency of COOS is most pronounced in low-density matrices where non-zero elements form clusters, making it ideal for applications like finite element analysis, graph processing, and sparse machine learning tasks. This project highlighted the importance of aligning computational strategies with the unique properties of the data being processed.
Through this work, I honed my ability to optimize low-level data structures and develop high-performance algorithms, laying the groundwork for future computational challenges in diverse domains.
Code
.intel_syntax noprefix
.global _matr_mult_coos
.text
# For this implementation, both matrices must store their triples in row-major (row-by-row) order
_matr_mult_coos:
# rax: General purpose for conversion/calculation
# rdi: Address of matrix A
# rsi: Address of matrix B
# rdx: Address of result array -> moved to rcx (mul/div clobber rdx)
# r8: Length of matrix A in storage
# r9: Iterator for matrix B
# r10: Number of entries in result array
xor r10, r10
# r11: Length of matrix B
# xmm0: General purpose (e.g. used as mask for AND-operation)
# xmm1: General purpose (e.g. fetch + store result-triple)
# xmm2: Current triple of matrix A
# xmm3: Current triple of matrix B
# xmm4: Register for some calculations/comparison
# xmm5: Register for some calculations/comparison
# xmm6: Bit mask for comparison
mov rcx, rdx # (moved to avoid problems with mul and div)
# prefetch data into caches closer to the processor
# does not affect program behavior, but might be nice to have (even if it's completely unnecessary :D)
prefetcht0 [rdi]
prefetcht0 [rsi]
prefetcht0 [rcx]
.LsetDimensionsOfResult:
movups xmm0, [rdi]
cvtss2si rax, xmm0
imul r8, rax, 0x03 # length of A's triple data in floats (3 per entry)
movups xmm1, [rsi]
cvtss2si rax, xmm1
imul r11, rax, 0x03 # length of B's triple data in floats (3 per entry)
psrldq xmm1, 4
psrldq xmm0, 4 # shift both right by one float (4 bytes) to drop the entry count
movss xmm1, xmm0
movups [rcx + 4], xmm1 # store it at beginning of result
# might send some junk too, but this gets erased after first entry is written
.LresetB:
mov r9, r11
.LfetchA:
sub r8, 3
cmp r8, -3 # r8 is length and iterator of array A
je .LexitCoos
movups xmm2, [rdi + r8 * 4 + 12] # triple of A
.LfetchB:
sub r9, 3
cmp r9, -3 # r9 length and iterator for array B
je .LresetB # array B completely searched through
movups xmm3, [rsi + r9 * 4 + 12] # triple of B
.LtestForMul:
insertps xmm4, xmm2, 78 # move 1 into 0 and zero out rest
insertps xmm5, xmm3, 14 # move 0 into 0 and zero out rest
comiss xmm5, xmm4 # column of A needs to be equal to row of B
jc .LresetB # fetches new A, because we know that nothing left to multiply in B
jne .LfetchB
.LactualMult:
insertps xmm4, xmm2, 142 # move 2 into 0 and zero out rest
insertps xmm6, xmm3, 142 # -- || --
mulss xmm4, xmm6 # content of both triples gets multiplied result in xmm4
pslldq xmm4, 8 # move content to the 3rd position (8 bytes)
insertps xmm6, xmm2, 14 # row of A is row of result
insertps xmm6, xmm3, 80 # column of B is column of result
# xmm6 is now mask for comparison
.LcheckForExistingResult:
imul rax, r10, 3
cmp rax, 0 # iterator for result array
je .LaddNewTriple
.LloopFindMatchingTriple:
sub rax, 3
cmp rax, -3
je .LaddNewTriple
movups xmm1, [rcx + rax * 4 + 12] # fetch result triple
movq xmm5, xmm1
pcmpeqq xmm0, xmm0 # set all bits of xmm0 to 1 (mask for ptest)
pxor xmm5, xmm6 # xmm5 is zeroed out if equal
ptest xmm5, xmm0 # xmm5 AND xmm0 -> sets zero flag if triples match
jnz .LloopFindMatchingTriple
.LeditExistingResultTriple:
addps xmm1, xmm4 # add current result to existing result
movups [rcx + rax * 4 + 12], xmm1
jmp .LfetchB
.LaddNewTriple:
movaps xmm1, xmm4
addps xmm1, xmm6 # add mask with content -> full triple again
imul rax, r10, 3
movups [rcx + rax * 4 + 12], xmm1
inc r10 # increment number of entries in result array
jmp .LfetchB
.LexitCoos:
cvtsi2ss xmm1, r10
movss [rcx], xmm1 # set number of entries in result array
ret
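For readers who prefer C, here is a functionally equivalent sketch of the routine above, under the layout the listing's comments imply: each matrix is a float array with a [count, rows, cols] header followed by (row, col, value) triples. It mirrors the assembly's behavior (including accumulating into an existing result triple when the same (row, col) appears again) but is a readable reference, not the project's tuned code; the function name is illustrative.

```c
#include <stddef.h>

/* C reference of the COOS multiply above. Layout assumption: each matrix
 * is floats [numTriples, numRows, numCols, r0, c0, v0, r1, c1, v1, ...].
 * The caller must size `res` for the worst case (one triple per product). */
void matr_mult_coos_ref(const float *a, const float *b, float *res) {
    size_t na = (size_t)a[0], nb = (size_t)b[0];
    res[1] = a[1];            /* rows of result = rows of A */
    res[2] = b[2];            /* cols of result = cols of B */
    size_t nr = 0;            /* triples written to res so far */

    for (size_t i = 0; i < na; i++) {
        const float *ta = a + 3 + 3 * i;
        for (size_t j = 0; j < nb; j++) {
            const float *tb = b + 3 + 3 * j;
            if (ta[1] != tb[0])    /* column of A must equal row of B */
                continue;
            float prod = ta[2] * tb[2];
            /* accumulate into an existing (row, col) triple if present */
            size_t k = 0;
            for (; k < nr; k++) {
                float *tr = res + 3 + 3 * k;
                if (tr[0] == ta[0] && tr[1] == tb[1]) {
                    tr[2] += prod;
                    break;
                }
            }
            if (k == nr) {         /* otherwise append a new triple */
                float *tr = res + 3 + 3 * nr++;
                tr[0] = ta[0];
                tr[1] = tb[1];
                tr[2] = prod;
            }
        }
    }
    res[0] = (float)nr;           /* number of entries in result */
}
```

Note the inner search over already-written result triples makes this O(n_A · n_B · n_res) in the worst case, which is exactly why the scheme pays off only when the non-zero count is small relative to the full matrix.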