c++ - Memory alignment for SSE in C++, _aligned_malloc equivalent?

coder 2024-02-21 original post

I would like to know how to convert this C code to C++ in order to get aligned memory.

float *pResult = (float*) _aligned_malloc(length * sizeof(float), 16);

I looked here and then tried this:

float *pResult = (float*) __attribute__((aligned(16)));

and also this:

float *pResult = __attribute__((aligned(16)));

But both give similar errors.

error: expected primary-expression before '__attribute__'|
error: expected ',' or ';' before '__attribute__'|
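
For context, both errors have the same cause: `__attribute__((aligned(16)))` is a GCC/Clang attribute that modifies a declaration. It is not an expression that yields memory, so it cannot appear on the right-hand side of an assignment or be cast to a pointer. The sketch below shows where the attribute (and its standard C++11 counterpart, `alignas`) is accepted. The buffer names and the helper functions are hypothetical, and this only covers fixed-size buffers, not the dynamic allocation the question asks about.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__GNUC__)
// GCC/Clang extension: the attribute is attached to the array declaration.
static float gnuBuffer[8] __attribute__((aligned(16)));
#endif

// Standard C++11 spelling of the same request.
alignas(16) static float stdBuffer[8];

// Hypothetical helper: true if `p` lies on a 16-byte boundary.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: asserts that the buffers declared above are aligned.
inline void check_buffers() {
    assert(is_aligned16(stdBuffer));
#if defined(__GNUC__)
    assert(is_aligned16(gnuBuffer));
#endif
}
```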

Full code

#include "stdafx.h"
#include <xmmintrin.h>  // Need this for SSE compiler intrinsics
#include <math.h>       // Needed for sqrt in CPU-only version
#include "stdio.h"

int main(int argc, char* argv[])
{
    printf("Starting calculation...\n");

    const int length = 64000;

    // We will be calculating Y = sqrt(x) / x, for x = 1->64000

    // If you do not properly align your data for SSE instructions, you may take a huge performance hit.
    float *pResult = (float*) __attribute__((aligned(16))); // align to 16-byte for SSE
    __m128 x;
    __m128 xDelta = _mm_set1_ps(4.0f);      // Set the xDelta to (4,4,4,4)
    __m128 *pResultSSE = (__m128*) pResult;


    const int SSELength = length / 4;

    for (int stress = 0; stress < 100000; stress++) // lots of stress loops so we can easily use a stopwatch
    {
#define TIME_SSE    // Define this if you want to run with SSE
#ifdef TIME_SSE
        x = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f); // Set the initial values of x to (4,3,2,1)
        for (int i=0; i < SSELength; i++)
        {
            __m128 xSqrt = _mm_sqrt_ps(x);
            // Note! Division is slow. It's actually faster to take the reciprocal of a number and multiply
            // Also note that Division is more accurate than taking the reciprocal and multiplying

#define USE_DIVISION_METHOD
#ifdef USE_FAST_METHOD
            __m128 xRecip = _mm_rcp_ps(x);
            pResultSSE[i] = _mm_mul_ps(xRecip, xSqrt);
#endif //USE_FAST_METHOD
#ifdef USE_DIVISION_METHOD
            pResultSSE[i] = _mm_div_ps(xSqrt, x);
#endif  // USE_DIVISION_METHOD

            // NOTE! Sometimes, the order in which things are done in SSE may seem reversed.
            // When the command above executes, the four floating elements are actually flipped around
            // We have already compensated for that flipping by setting the initial x vector to (4,3,2,1) instead of (1,2,3,4)

            x = _mm_add_ps(x, xDelta);  // Advance x to the next set of numbers
        }
#endif  // TIME_SSE
#ifndef TIME_SSE
        float xFloat = 1.0f;
        for (int i=0 ; i < length; i++)
        {
            pResult[i] = sqrt(xFloat) / xFloat; // Even though division is slow, there are no intrinsic functions like there are in SSE
            xFloat += 1.0f;
        }
#endif  // !TIME_SSE
    }

    // To prove that the program actually worked
    for (int i=0; i < 20; i++)
    {
        printf("Result[%d] = %f\n", i, pResult[i]);
    }

    // Results for my particular system
    // 23.75 seconds for SSE with reciprocal/multiplication method
    // 38.5 seconds for SSE with division method
    // 301.5 seconds for CPU

    return 0;
}

Best answer

With C++11, you can use something like this:

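#include <vector>

struct aligned_float
{
    alignas(16) float f[4];  // four floats, one SSE register's worth, 16-byte aligned
};

static_assert(sizeof(aligned_float) == 4 * sizeof(float), "padding issue");

int main()
{
    const int length = 64000;  // number of floats, as in the question
    // Each aligned_float holds 4 floats, so length / 4 elements are needed
    // (dividing by sizeof(aligned_float), i.e. 16, would allocate too few).
    std::vector<aligned_float> pResult(length / 4);

    return 0;
}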
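
If a raw `float*` buffer is preferred, closer in spirit to the `_aligned_malloc` call in the question, one option is the aligned form of `operator new` added in C++17. The following is a minimal sketch under that assumption (it needs a C++17 compiler). `AlignedDeleter`, `AlignedFloatBuffer`, `make_aligned_floats` and `is_aligned` are hypothetical names introduced here for illustration; they are not part of the original answer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

// Deleter that pairs with the aligned operator new used below.
struct AlignedDeleter {
    std::size_t alignment;
    void operator()(float* p) const noexcept {
        ::operator delete(p, std::align_val_t(alignment));
    }
};

using AlignedFloatBuffer = std::unique_ptr<float[], AlignedDeleter>;

// Hypothetical helper: allocate `count` uninitialized floats on an
// `alignment`-byte boundary (alignment must be a power of two).
inline AlignedFloatBuffer make_aligned_floats(std::size_t count,
                                              std::size_t alignment = 16) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    void* raw = ::operator new(count * sizeof(float),
                               std::align_val_t(alignment));
    return AlignedFloatBuffer(static_cast<float*>(raw),
                              AlignedDeleter{alignment});
}

// Hypothetical helper: true if `p` lies on an `alignment`-byte boundary.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

With this in place, `auto pResult = make_aligned_floats(length);` gives a buffer whose `pResult.get()` can be cast to `__m128*` as in the question, and the memory is released automatically when `pResult` goes out of scope.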

Regarding c++ - Memory alignment for SSE in C++, _aligned_malloc equivalent?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/23183628/
