We discuss the challenges and approaches to optimizing various tensor decomposition algorithms on HPC systesm. We show that data locality is the dominant bottleneck - both in shared- and distributed-memory settings - and optimizing this leads to significant speedup over state-of-the-art.