Blocking Optimization for Sparse MTTKRP

Abstract

The matricized tensor times Khatri-Rao product (MTTKRP) kernel is the key performance bottleneck in sparse CP-ALS, where it can take up to 95% of the total execution time. We first use a simple roofline-based performance model to demonstrate that the kernel is severely memory-bound even on systems with large caches, and then apply different blocking techniques to achieve significant speedup. In particular, our rank blocking technique, combined with traditional multi-dimensional blocking, achieves high speedup in both shared-memory and distributed settings on real-world datasets.
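
For readers unfamiliar with the kernel, the following is a minimal sketch of a mode-0 MTTKRP on a three-way sparse tensor in coordinate (COO) format, with the rank dimension processed in strips. It is an illustration only, not the implementation discussed in the talk: the abstract does not define rank blocking, so the sketch assumes it means partitioning the columns of the factor matrices so that only a narrow strip of each is touched per pass. The function name `mttkrp_coo_rank_blocked` and the parameter `rank_block` are hypothetical.

```python
import numpy as np

def mttkrp_coo_rank_blocked(inds, vals, B, C, num_rows, rank_block=16):
    """Mode-0 MTTKRP: M[i, r] = sum over nonzeros X[i, j, k] * B[j, r] * C[k, r].

    inds       : tuple (i, j, k) of integer index arrays, one entry per nonzero
    vals       : array of nonzero values, same length as the index arrays
    B, C       : dense factor matrices of shape (J, R) and (K, R)
    rank_block : number of factor columns handled per pass (assumed meaning
                 of "rank blocking"; hypothetical parameter)
    """
    i, j, k = inds
    R = B.shape[1]
    M = np.zeros((num_rows, R))
    for r0 in range(0, R, rank_block):
        r1 = min(r0 + rank_block, R)
        # Gather only a narrow strip of columns from each factor matrix,
        # so the working set per pass is smaller than the full rank R.
        contrib = vals[:, None] * B[j, r0:r1] * C[k, r0:r1]
        # Accumulate into a per-strip buffer; np.add.at handles repeated
        # row indices correctly (unbuffered scatter-add).
        M_strip = np.zeros((num_rows, r1 - r0))
        np.add.at(M_strip, i, contrib)
        M[:, r0:r1] = M_strip
    return M
```

Each nonzero performs only a few floating-point operations per factor-matrix element it loads, which is consistent with the memory-bound behavior the roofline model indicates. Multi-dimensional blocking, which this sketch does not show, would additionally partition the nonzeros by index ranges.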

Date
Location
Cambridge, MA