问题描述
我正在研究并行编程概念并尝试在单核上优化矩阵乘法示例.到目前为止,我提出的最快的实现如下:
结果如下.如何减少循环并提高性能
解决方案
CPU 上最先进的矩阵乘法实现使用
上图(原文来自这篇论文,直接在本教程) 说明了在 BLIS.缓存阻塞参数{MC, NC, KC}确定Bp (KC × NC) 和 Ai (MC × KC) 的子矩阵大小,以便它们适合各种缓存.在计算过程中,行面板 Bp连续打包到缓冲区 Bp 中以适合 L3 缓存.块 Ai 类似地打包到缓冲区 Ai以适合 L2 缓存.寄存器块大小 {MR, NR} 与寄存器中对 C 有贡献的子矩阵有关.在微内核(最内部的循环)中,C 的一个小的 MR × NR 微瓦片由一对 MR × KC 和 KC 更新× Ai 和 Bp 的 NR 条带.
对于复杂度为 O(N^2.87) 的 Strassen 算法,您可能有兴趣阅读 这篇论文.其他渐近复杂度小于 O(N^3) 的快速矩阵乘法算法可以在 this 中轻松扩展纸.有一篇关于实用快速矩阵乘法算法的最近的论文.
如果您想了解有关如何在 CPU 上优化矩阵乘法的更多信息,以下教程可能会有所帮助:
如何优化 GEMM Wiki
GEMM:从纯 C 到 SSE 优化的微内核
BLISlab:针对 CPU 和 ARM 优化 GEMM 的沙箱
关于如何逐步优化 CPU 上的 GEMM(使用 AVX2/FMA)的最新文档可以在这里下载:https://github.com/ULAFF/LAFF-On-HPC/blob/master/LAFF-On-PfHP.pdf
将于 2019 年 6 月开始在 edX 上提供大规模开放在线课程(LAFF-On Programming for High Performance):https://github.com/ULAFF/LAFF-On-HPChttp://www.cs.utexas.edu/users/flame/laff/pfhp/LAFF-On-PfHP.html
I am working on parallel programming concepts and trying to optimize matrix multiplication example on single core. The fastest implementation I came up so far is the following:
The results are like below. how to reduce the loops and increase the performance
解决方案
The state-of-the-art implementation of matrix multiplication on CPUs uses GotoBLAS algorithm. Basically the loops are organized in the following order:
A key insight underlying modern high-performance implementations of matrix multiplication is to organize the computations by partitioning the operands into blocks for temporal locality (3 outer most loops), and to pack (copy) such blocks
into contiguous buffers that fit into various levels of memory for spatial locality (3 inner most loops).
The above figure (originally from this paper, directly used in this tutorial) illustrates the GotoBLAS algorithm as implemented in BLIS. Cache blocking parameters {MC, NC, KC} determine
the submatrix sizes of Bp (KC × NC) and Ai (MC × KC), such that they fit in various caches. During the computation, row panels Bp
are contiguously packed into buffer Bp to fit in the L3 cache. Blocks Ai are similarly packed into buffer Ai
to fit in the L2 cache. Register block sizes {MR, NR} relate to submatrices in registers that contribute to C. In the micro-kernel (the inner most loop), a small MR × NR micro-tile of C is updated by pair of MR × KC and KC × NR slivers of Ai and Bp.
For the Strassen's algorithm with O(N^2.87) complexity, you might be interested in reading this paper. Other fast matrix multiplication algorithms with asymptotic complexity less than O(N^3) can be easily extended in this paper. There is a recent thesis about the practical fast matrix multiplication algorithms.
The following tutorials might be helpful if you want to learn more about how to optimize matrix multiplication on CPUs:
How to Optimize GEMM Wiki
GEMM: From Pure C to SSE Optimized Micro Kernels
BLISlab: A sandbox for optimizing GEMM for CPU and ARM
A most updated document about how to optimize GEMM on CPUs (with AVX2/FMA) step by step can be downloaded here:
https://github.com/ULAFF/LAFF-On-HPC/blob/master/LAFF-On-PfHP.pdf
A Massive Open Online Course to be offered on edX starting in June 2019 (LAFF-On Programming for High Performance):
https://github.com/ULAFF/LAFF-On-HPC
http://www.cs.utexas.edu/users/flame/laff/pfhp/LAFF-On-PfHP.html
这篇关于如何优化矩阵乘法 (matmul) 代码以在单个处理器内核上快速运行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持跟版网!
本站部分内容来源互联网,如果有图片或者内容侵犯了您的权益,请联系我们,我们会在确认后第一时间进行删除!