Keywords: benchmarks; communication; computer architecture; supercomputers
Abstract: Dense LU factorization is a prominent benchmark used to rank the performance of supercomputers. Many implementations use block-cyclic distributions of matrix blocks onto a two-dimensional process grid. The process grid dimensions drive a trade-off between communication and computation and are architecture- and implementation-sensitive. The critical panel factorization steps can be made less communication-bound by overlapping asynchronous collectives for pivoting with the computation of rank-k updates. By shifting the computation-communication trade-off, a modified block-cyclic distribution can beneficially exploit more available parallelism on the critical path, and reduce the panel factorization's memory hierarchy contention on now-ubiquitous multicore architectures.
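As a minimal sketch of the standard two-dimensional block-cyclic mapping referred to above (not the modified distribution this paper proposes), the following assumes zero-based block indices and a process grid of `pr` rows by `pc` columns. The function names `block_cyclic_owner`, `blocks_owned`, and `panel_participants` are hypothetical and chosen only for illustration.

```python
def block_cyclic_owner(i, j, pr, pc):
    """Return the (grid_row, grid_col) of the process owning block (i, j).

    Assumes the conventional block-cyclic rule: block rows are dealt out
    round-robin over pr process rows, block columns over pc process columns.
    """
    return (i % pr, j % pc)


def blocks_owned(nblocks, pr, pc):
    """Count how many blocks of an nblocks x nblocks block matrix each process owns."""
    counts = {}
    for i in range(nblocks):
        for j in range(nblocks):
            owner = block_cyclic_owner(i, j, pr, pc)
            counts[owner] = counts.get(owner, 0) + 1
    return counts


def panel_participants(k, nblocks, pr, pc):
    """List the processes holding blocks of column panel k on or below the diagonal.

    These are the processes that take part in factorizing panel k.
    """
    return sorted({block_cyclic_owner(i, k, pr, pc) for i in range(k, nblocks)})


if __name__ == "__main__":
    # Same six processes, two different grid shapes.
    print(panel_participants(0, 6, 2, 3))  # short, wide grid
    print(panel_participants(0, 6, 6, 1))  # tall, narrow grid
```

Under these assumptions, the grid shape determines how many processes share each panel: a taller grid spreads the panel over more processes, which exposes more parallelism on the critical path but involves more processes in the pivoting communication.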