Optimizing Low-Density Matrix Multiplication
Matrix multiplication is a cornerstone operation in scientific computing, yet its efficiency varies drastically with a matrix's sparsity and the distribution of its non-zero elements. During a volunteer research project, I addressed this challenge by developing a framework built around the Coordinate-Only Sparsity Scheme (COOS), aimed at accelerating computations for sparse matrices whose non-zero entries cluster in specific regions.
The Challenge
Traditional matrix multiplication algorithms, optimized for dense or uniformly sparse matrices, often struggle with low-density configurations—matrices with scattered non-zero elements forming clusters amid vast areas of zeros. This inefficiency stems from unnecessary memory allocation and operations on zero entries, leading to suboptimal performance.
The Solution
The COOS format represents a tailored approach that stores only the coordinates and values of non-zero elements in clusters, bypassing redundant operations. By focusing exclusively on the significant regions, this method reduces computational overhead and memory usage. I implemented this framework and benchmarked it against standard dense and CSR-based sparse matrix multiplication techniques.
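To make the storage idea concrete, the following sketch converts a dense row-major matrix into a triple-based layout of the kind described above: a [count, rows, cols] header followed by (row, col, value) triples. The function name and exact layout are illustrative assumptions, not the project's actual API.

```c
#include <stdlib.h>

/* Hypothetical helper: scan a dense row-major matrix once and keep only
 * (row, col, value) triples for the non-zero entries, preceded by a
 * three-float header [count, rows, cols]. Caller frees the result. */
float *dense_to_coos(const float *dense, int rows, int cols) {
    int nnz = 0;
    for (int i = 0; i < rows * cols; i++)
        if (dense[i] != 0.0f) nnz++;

    float *coos = malloc((3 + 3 * (size_t)nnz) * sizeof(float));
    if (!coos) return NULL;
    coos[0] = (float)nnz;
    coos[1] = (float)rows;
    coos[2] = (float)cols;

    float *t = coos + 3;
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            if (dense[r * cols + c] != 0.0f) {
                *t++ = (float)r;
                *t++ = (float)c;
                *t++ = dense[r * cols + c];
            }
    return coos;
}
```

For a matrix with density d, this layout needs roughly 3·d·rows·cols floats instead of rows·cols, which is where the memory savings for clustered, low-density inputs come from.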
Results
The COOS implementation demonstrated significant performance improvements, achieving up to 5x faster computation times for matrices with density below 10% and clustered non-zero elements. For very low-density matrices (density below 1%), COOS excelled further by eliminating redundant computations over vast zero regions, delivering even greater gains over conventional methods. These results highlight COOS's adaptability and effectiveness for sparse datasets with irregular or highly scattered non-zero distributions.
Technical Contributions
- Designed and implemented the COOS framework to efficiently handle low-density matrices with clustered sparsity.
- Conducted comparative benchmarks to identify performance advantages over dense and CSR-based multiplication methods.
- Optimized memory allocation strategies to ensure scalability for large-scale sparse matrices.
Key Insights
The efficiency of COOS is most pronounced in low-density matrices where non-zero elements form clusters, making it ideal for applications like finite element analysis, graph processing, and sparse machine learning tasks. This project highlighted the importance of aligning computational strategies with the unique properties of the data being processed.
Through this work, I honed my ability to optimize low-level data structures and develop high-performance algorithms, laying the groundwork for future computational challenges in diverse domains.
Code
.intel_syntax noprefix
.global _matr_mult_coos
.text
# For this implementation, both matrices must store their triples in row-major (row-by-row) order
_matr_mult_coos:
# rax: General purpose for conversion/calculation
# rdi: Address of matrix A
# rsi: Address of matrix B
# rdx: Address of result array -> moved to rcx (mul/div clobber rdx)
# r8: Length of matrix A in storage
# r9: Iterator for matrix B
# r10: Number of entries in result array
xor r10, r10
# r11: Length of matrix B
# xmm0: General purpose (e.g. used as mask for AND-operation)
# xmm1: General purpose (e.g. fetch + store result-triple)
# xmm2: Current triple of matrix A
# xmm3: Current triple of matrix B
# xmm4: Register for some calculations/comparison
# xmm5: Register for some calculations/comparison
# xmm6: Bit mask for comparison
mov rcx, rdx # (moved to avoid problems with mul and div)
# prefetch data into caches closer to the processor
# does not affect program behavior, but might be nice to have (even if it's completely unnecessary :D)
prefetcht0 [rdi]
prefetcht0 [rsi]
prefetcht0 [rcx]
.LsetDimensionsOfResult:
movups xmm0, [rdi]
cvtss2si rax, xmm0
imul r8, rax, 0x03 # length of A's triple data in floats (3 per entry)
movups xmm1, [rsi]
cvtss2si rax, xmm1
imul r11, rax, 0x03 # length of B's triple data in floats (3 per entry)
psrldq xmm1, 4
psrldq xmm0, 4 # shift both right by one float (4 bytes) to drop the entry count
movss xmm1, xmm0
movups [rcx + 4], xmm1 # store it at beginning of result
# might send some junk too, but this gets erased after first entry is written
.LresetB:
mov r9, r11
.LfetchA:
sub r8, 3
cmp r8, -3 # r8 is length and iterator of array A
je .LexitCoos
movups xmm2, [rdi + r8 * 4 + 12] # triple of A
.LfetchB:
sub r9, 3
cmp r9, -3 # r9 length and iterator for array B
je .LresetB # array B completely searched through
movups xmm3, [rsi + r9 * 4 + 12] # triple of B
.LtestForMul:
insertps xmm4, xmm2, 78 # move 1 into 0 and zero out rest
insertps xmm5, xmm3, 14 # move 0 into 0 and zero out rest
comiss xmm5, xmm4 # column of A needs to be equal to row of B
jc .LresetB # fetches new A, because we know that nothing left to multiply in B
jne .LfetchB
.LactualMult:
insertps xmm4, xmm2, 142 # move 2 into 0 and zero out rest
insertps xmm6, xmm3, 142 # -- || --
mulss xmm4, xmm6 # content of both triples gets multiplied result in xmm4
pslldq xmm4, 8 # move content to the 3rd position (8 bytes)
insertps xmm6, xmm2, 14 # row of A is row of result
insertps xmm6, xmm3, 80 # column of B is column of result
# xmm6 is now mask for comparison
.LcheckForExistingResult:
imul rax, r10, 3
cmp rax, 0 # iterator for result array
je .LaddNewTriple
.LloopFindMatchingTriple:
sub rax, 3
cmp rax, -3
je .LaddNewTriple
movups xmm1, [rcx + rax * 4 + 12] # fetch result triple
movq xmm5, xmm1
pcmpeqq xmm0, xmm0 # set all bits of xmm0 to 1 (mask for ptest)
pxor xmm5, xmm6 # xmm5 is zeroed out if equal
ptest xmm5, xmm0 # xmm5 AND xmm0 -> sets zero flag if triples match
jnz .LloopFindMatchingTriple
.LeditExistingResultTriple:
addps xmm1, xmm4 # add current result to existing result
movups [rcx + rax * 4 + 12], xmm1
jmp .LfetchB
.LaddNewTriple:
movaps xmm1, xmm4
addps xmm1, xmm6 # add mask with content -> full triple again
imul rax, r10, 3
movups [rcx + rax * 4 + 12], xmm1
inc r10 # increment number of entries in result array
jmp .LfetchB
.LexitCoos:
cvtsi2ss xmm1, r10
movss [rcx], xmm1 # set number of entries in result array
ret
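For readers who prefer C, here is a functionally equivalent sketch of the routine above, under the layout the listing's comments imply: each matrix is a float array with a [count, rows, cols] header followed by (row, col, value) triples. It mirrors the assembly's behavior (including accumulating into an existing result triple when the same (row, col) appears again) but is a readable reference, not the project's tuned code; the function name is illustrative.

```c
#include <stddef.h>

/* C reference of the COOS multiply above. Layout assumption: each matrix
 * is floats [numTriples, numRows, numCols, r0, c0, v0, r1, c1, v1, ...].
 * The caller must size `res` for the worst case (one triple per product). */
void matr_mult_coos_ref(const float *a, const float *b, float *res) {
    size_t na = (size_t)a[0], nb = (size_t)b[0];
    res[1] = a[1];            /* rows of result = rows of A */
    res[2] = b[2];            /* cols of result = cols of B */
    size_t nr = 0;            /* triples written to res so far */

    for (size_t i = 0; i < na; i++) {
        const float *ta = a + 3 + 3 * i;
        for (size_t j = 0; j < nb; j++) {
            const float *tb = b + 3 + 3 * j;
            if (ta[1] != tb[0])    /* column of A must equal row of B */
                continue;
            float prod = ta[2] * tb[2];
            /* accumulate into an existing (row, col) triple if present */
            size_t k = 0;
            for (; k < nr; k++) {
                float *tr = res + 3 + 3 * k;
                if (tr[0] == ta[0] && tr[1] == tb[1]) {
                    tr[2] += prod;
                    break;
                }
            }
            if (k == nr) {         /* otherwise append a new triple */
                float *tr = res + 3 + 3 * nr++;
                tr[0] = ta[0];
                tr[1] = tb[1];
                tr[2] = prod;
            }
        }
    }
    res[0] = (float)nr;           /* number of entries in result */
}
```

Note the inner search over already-written result triples makes this O(n_A · n_B · n_res) in the worst case, which is exactly why the scheme pays off only when the non-zero count is small relative to the full matrix.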