Problem description
The introductory links I found while searching:
- 6.59.14 Loop-Specific Pragmas
- 2.100 Pragma Loop_Optimize
- How to give a hint to gcc about the loop count
- Tell gcc to specifically unroll a loop
- How to enforce vectorization in C++
As you can see, most of them are for C, but I thought they might work in C++ as well. Here is my code:
I used all the hints that can be seen commented out above, but I did not get any speedup, as a sample output shows (the first run has #pragma GCC ivdep Unroll Vector uncommented):
Is there any hope? Or does the optimization flag O3 already do the trick? Any suggestions to speed up this code (the foo function) are welcome!
My version of g++:
Notice that the body of the loop is random. I am not interested in rewriting it in some other form.
EDIT
An answer saying that there is nothing more that can be done is also acceptable!
Recommended answer
The -O3 flag turns on -ftree-vectorize automatically: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
-O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-loop-vectorize, -ftree-loop-distribute-patterns, -ftree-slp-vectorize, -fvect-cost-model, -ftree-partial-pre and -fipa-cp-clone options.
So in both cases the compiler is trying to do loop vectorization.
Using g++ 4.8.2 to compile with:
Gives this:
Compiling without the -ftree-vectorize flag:
Only returns this:
Line 16 is the start of the loop function, so the compiler is definitely vectorizing it. Checking the generated assembly confirms this too.
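If you want to reproduce this kind of check yourself, here is a minimal sketch. The file name saxpy.cpp, the function saxpy, and its loop body are made up for illustration, and the -fopt-info-vec-optimized flag in the comments is what newer GCC releases offer for vectorization reports; it is not necessarily the command used above.

```cpp
// saxpy.cpp: a small loop that GCC can usually auto-vectorize at -O3.
//
// Ask for a vectorization report (flag available in newer GCC releases):
//   g++ -O3 -fopt-info-vec-optimized -c saxpy.cpp
// Turn the vectorizer off to compare the report:
//   g++ -O3 -fno-tree-vectorize -fopt-info-vec-optimized -c saxpy.cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example loop: y[i] += a * x[i].
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const float* xp = x.data();
    float* yp = y.data();
    const std::size_t n = x.size();
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] += a * xp[i];
    }
}
```

With the vectorizer enabled, the report should mention the source line of the loop; with -fno-tree-vectorize that line should no longer appear.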
I seem to be getting some aggressive caching on the laptop I'm currently using, which makes it very hard to accurately measure how long the function takes to run.
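One common way to reduce this kind of noise is to run the function several times and keep the fastest run. The sketch below assumes that approach; best_time_seconds is a hypothetical helper, not something from the question's code, and it does not guarantee stable numbers on every machine.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper: call `work` `repeats` times and return the fastest
// wall-clock time in seconds. Keeping the minimum filters out runs that
// were slowed down by cold caches or other processes.
template <typename F>
double best_time_seconds(F&& work, int repeats) {
    assert(repeats > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        const std::chrono::duration<double> elapsed = t1 - t0;
        best = std::min(best, elapsed.count());
    }
    return best;
}
```

You would pass a lambda that calls foo with your vectors. Make sure the result of foo is actually used afterwards (printed, for example), otherwise the optimizer may drop the work you are trying to time.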
But here are a couple of other things you can try too:
- Use the __restrict__ qualifier to tell the compiler that there is no overlap between the arrays.
- Tell the compiler the arrays are aligned with __builtin_assume_aligned (not portable).
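To illustrate the two hints, here is a minimal sketch. The function add_arrays, its loop body, and the 32-byte alignment are assumptions made for this example and are not taken from the question's code; pick the alignment that matches your data type and allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical example: out[i] = a[i] + b[i].
// __restrict__ promises the compiler that the three arrays do not overlap.
// __builtin_assume_aligned promises that each pointer is 32-byte aligned.
// Both are GCC/Clang extensions; breaking either promise is undefined behavior.
void add_arrays(float* __restrict__ out,
                const float* __restrict__ a,
                const float* __restrict__ b,
                std::size_t n) {
    // Debug-only sanity check of the alignment promise for the output array.
    assert(reinterpret_cast<std::uintptr_t>(out) % 32 == 0);

    out = static_cast<float*>(__builtin_assume_aligned(out, 32));
    a = static_cast<const float*>(__builtin_assume_aligned(a, 32));
    b = static_cast<const float*>(__builtin_assume_aligned(b, 32));

    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

The caller has to provide storage that really is aligned, for example arrays declared with alignas(32). A plain std::vector does not promise 32-byte alignment with the default allocator.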
Here's my resulting code (I removed the template since you will want to use different alignment for different data types):
Like I said, I've had trouble getting consistent time measurements, so I can't confirm whether this will give you a performance increase (or maybe even a decrease!).