## Vectorization with SSE

Using OpenCL, CUDA, or OpenACC to take advantage of the computing power offered by GPGPUs is one of the most effective ways to accelerate computationally expensive code nowadays. However, not all machines have powerful GPGPUs. On the other hand, all modern x86 processors from Intel and AMD support vector instructions in the form of Streaming SIMD Extensions (SSE through SSE4), and most new processors support Advanced Vector Extensions (AVX). Utilizing these instructions can improve the performance of single-precision code by as much as a factor of four with SSE and a factor of eight with AVX (AVX-512 will increase this to a factor of 16, but at the time of writing it is only available on Intel MIC cards). In some cases, compilers can automatically vectorize pieces of code to take advantage of the CPU’s vector units. Furthermore, OpenMP 4.0 lets you mark certain loops for vectorization, but writing code with vector operations in mind will generally yield better results. In this post, I will briefly explain how to use SSE to vectorize C and C++ code using GCC 4.8. Most of the instructions should also apply to LLVM/Clang and Microsoft Visual Studio.
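As a rough illustration of the OpenMP 4.0 route, here is a minimal sketch. The function name `saxpy` and its parameters are hypothetical, and the pragma only takes effect with a compiler that supports OpenMP 4.0 and has it enabled (for GCC, `-fopenmp` or `-fopenmp-simd`). Without that, the pragma is ignored and the loop runs as ordinary scalar code.

```cpp
#include <cstddef>

// Hypothetical helper (not from the post): y[i] = a * x[i] + y[i].
// The pragma asks the compiler to vectorize the loop. It assumes that
// x and y do not overlap.
void saxpy(std::size_t n, float a, const float* x, float* y)
{
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
    {
        y[i] = a * x[i] + y[i];
    }
}
```

Calling `saxpy(n, 2.0f, x, y)` gives the same result whether or not the loop was vectorized; only the speed differs.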

SSE registers are 128 bits (16 bytes) wide. This means that each register can store 4 single-precision floating point numbers or 2 double-precision floating point numbers (double precision requires SSE2), as well as other types that add up to 16 bytes. Using SSE, we can perform 4 single-precision or 2 double-precision floating point operations simultaneously on one CPU core using one instruction. Note: A superscalar processor, containing multiple floating point units, can also perform multiple floating point operations simultaneously. However, a separate instruction is issued for each operation, and the programmer has little, if any, control over which operations are performed simultaneously. Using SSE, we also have more control over the cache and prefetching. It’s even possible to bypass the cache entirely, which can be useful in some cases where cache pollution would cause a performance hit. With SSE, we can even eliminate some branches in the code, thereby reducing the chance of a branch misprediction, which would necessitate a pipeline flush. This can potentially improve performance further.
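To make the "four operations per instruction" idea concrete, here is a minimal sketch that adds two float arrays. The helper name `add_arrays` is hypothetical. It uses the unaligned load and store intrinsics so that it works for any pointer, and it handles lengths that are not a multiple of four with a scalar tail loop.

```cpp
#include <cstddef>
#include <xmmintrin.h>

// Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n).
void add_arrays(const float* a, const float* b, float* out, std::size_t n)
{
    std::size_t i = 0;

    // main loop: one _mm_add_ps performs four single-precision additions
    for (; i + 4 <= n; i += 4)
    {
        __m128 va = _mm_loadu_ps(a + i);   // unaligned load of 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }

    // scalar tail for the remaining 0 to 3 elements
    for (; i < n; ++i)
    {
        out[i] = a[i] + b[i];
    }
}
```

If the arrays are known to be 16-byte aligned, `_mm_load_ps()` and `_mm_store_ps()` can be used instead of the unaligned variants.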

First-generation SSE instructions can be accessed using the header xmmintrin.h. SSE2 and SSE3 instructions can be used by including emmintrin.h and pmmintrin.h, respectively. To access all vector extensions, including SSE4 and AVX, use immintrin.h. You also need to consult your compiler’s documentation to learn which flags are required to enable the instructions (for GCC, these include -msse, -msse2, -msse3 and -mavx). In the code example below, I demonstrate:
• initialization of a vector at compile time
• initialization of a vector from an array
• storing vector data in an array
• vector multiplication
• vector division
• multiplication by a scalar
• using element-wise comparison to create a mask
• bitwise AND, OR and ANDNOT (ANDNOT computes (NOT a) AND b, which is not the same as NAND)
• how to replace a conditional statement using the mask
• the vector square root
• the approximate inverse square root
• the approximate reciprocal instruction
• how to use the shuffle instruction and the SHUFFLE macro.
The code sample compiles on g++ 4.8. Some minor changes may be needed for other compilers. Some of these changes are mentioned in the comments. Click here for a version of the example that compiles in g++ 4.7 and clang++ 3.0 (and probably other compilers).


/*
vector.cpp
compile with g++ vector.cpp -o vector -O2 -msse
*/

#include <iostream>
#include <xmmintrin.h>

#define PRINT(var) print((var), #var)

using namespace std;

// v4sf: vector of four single-precision floats
typedef float v4sf __attribute__ ((vector_size (16)));

// these are used for printing the results
void print(const v4sf& vec, const char* name);
void print(const float* vec, const char* name);

int main()
{
    // one way to initialize a vector
    v4sf vecA = {0, 1, 2, 3};

    float arrayA[] = {4, 2, 3, 6};

    // another way: load the vector from an array. _mm_loadu_ps() works
    // for any address; _mm_load_ps() requires 16-byte alignment.
    v4sf vecAA = _mm_loadu_ps(arrayA);

    /* also refer to _mm_setzero_ps(), _mm_set1_ps(), _mm_load_ss(),
       and _mm_setr_ps() in xmmintrin.h for vector initialization. */

    // one way of adding vectors. The other method explicitly uses
    // _mm_add_ps(vecA, vecA).
    v4sf vecB = vecA + vecA;

    float arrayB[4];

    // store vector data in an array (unaligned store; _mm_store_ps()
    // requires 16-byte alignment)
    _mm_storeu_ps(arrayB, vecB);

    /* NOTE: g++ 4.8 automatically converts scalars to vectors. Other
       compilers (such as g++ 4.7) require you to set up the vectors
       explicitly. An easy way to do this is to use _mm_set1_ps().
    */

    // multiplication by a scalar
    v4sf vecC = 2.0 * vecB; // _mm_set1_ps(2.0) * vecB;

    // more complicated expressions
    v4sf vecD = vecC - 0.5 * vecA;

    v4sf vecE = 1.0 + 5.0 * vecC / (vecA + 3.0);

    /* for logic operations, TRUE = 0xFFFFFFFF, FALSE = 0x00000000
       Thus, the result is a mask which can be bitwise ANDed using
       _mm_and_ps() */

    // element-wise comparison. The GCC vector-extension form
    // (vecA == vecB) yields an integer vector rather than a v4sf,
    // so the intrinsic is used here.
    v4sf mask = _mm_cmpeq_ps(vecA, vecB);

    // bitwise AND.
    // This is equivalent to vecF[i] = (vecA[i] == vecB[i]) ? vecE[i] : 0
    v4sf vecF = _mm_and_ps(mask, vecE);

    // bitwise OR and ANDNOT.
    // equivalent to vecG[i] = (vecA[i] == vecB[i]) ? vecE[i] : vecD[i];
    v4sf vecG = _mm_or_ps(_mm_and_ps(mask, vecE), _mm_andnot_ps(mask, vecD));

    // vecH[i] = sqrt(vecE[i]);
    v4sf vecH = _mm_sqrt_ps(vecE);

    // vecI[i] approx 1.0 / sqrt(vecE[i]);
    v4sf vecI = _mm_rsqrt_ps(vecE);

    // vecJ[i] approx 1.0 / vecAA[i];
    v4sf vecJ = _mm_rcp_ps(vecAA);

    // exact division, for comparison with vecJ
    v4sf vecK = 1.0 / vecAA;

    // vecL is a permutation of vecA.
    v4sf vecL = _mm_shuffle_ps(vecA, vecA, _MM_SHUFFLE(1,0,3,2) );

    ////////////////////////////// output ///////////////////////////////

    PRINT(vecA);
    PRINT(vecAA);
    PRINT(vecB);
    PRINT(arrayB);
    PRINT(vecC);
    PRINT(vecD);
    PRINT(vecE);
    PRINT(vecF);
    PRINT(vecG);
    PRINT(vecH);
    PRINT(vecI);
    PRINT(vecJ);
    PRINT(vecK);
    PRINT(vecL);

    return 0;
}

void print(const v4sf& vec, const char* name)
{
    // reading the float array after writing the vector member relies on
    // GCC's documented support for type punning through unions
    union { v4sf vector; float array[4]; } value;

    value.vector = vec;

    print(value.array, name);
}

void print(const float* vec, const char* name)
{
    cout << "\n" << name << ":" << endl;

    for (int i = 0; i < 4; ++i) { cout << "\t" << vec[i] << endl; }
}



Output:

vecA:
0
1
2
3

vecAA:
4
2
3
6

vecB:
0
2
4
6

arrayB:
0
2
4
6

vecC:
0
4
8
12

vecD:
0
3.5
7
10.5

vecE:
1
6
9
11

vecF:
1
0
0
0

vecG:
1
3.5
7
10.5

vecH:
1
2.44949
3
3.31662

vecI:
0.999878
0.408264
0.333313
0.301514

vecJ:
0.249939
0.499878
0.333313
0.166656

vecK:
0.25
0.5
0.333333
0.166667

vecL:
2
3
0
1
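The conditional replacement described in the example, vecG[i] = (vecA[i] == vecB[i]) ? vecE[i] : vecD[i], can be wrapped in a small helper. The sketch below is illustrative only: the names `select_ps` and `clamp_negative_to_zero` are hypothetical, and the helper assumes that every lane of the mask is either all ones or all zeros, as produced by the SSE comparison intrinsics.

```cpp
#include <xmmintrin.h>

// Hypothetical helper: result[i] = mask[i] ? a[i] : b[i].
// Each mask lane must be all ones (true) or all zeros (false).
inline __m128 select_ps(__m128 mask, __m128 a, __m128 b)
{
    // (mask AND a) OR ((NOT mask) AND b)
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}

// Example use: out[i] = (in[i] < 0) ? 0 : in[i], without an if statement.
void clamp_negative_to_zero(const float* in, float* out)
{
    __m128 v    = _mm_loadu_ps(in);
    __m128 zero = _mm_setzero_ps();
    __m128 mask = _mm_cmplt_ps(v, zero);   // all ones where in[i] < 0
    _mm_storeu_ps(out, select_ps(mask, zero, v));
}
```

Both branches are always evaluated, so this only pays off when each side is cheap to compute and the branch would otherwise be hard to predict.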



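To understand the shuffle in the vecL part of the example, refer to the documentation of _mm_shuffle_ps() for the general usage of the shuffle operation. In short, the lower two elements of the result are selected from the first argument and the upper two from the second argument. Here, _MM_SHUFFLE(1,0,3,2) places elements 2 and 3 of vecA in the lower half of the result and elements 0 and 1 in the upper half.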
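The following sketch shows two common permutations built with the shuffle intrinsic. The function names are hypothetical; the selector rule in the comment is the documented behaviour of `_mm_shuffle_ps()` and `_MM_SHUFFLE`.

```cpp
#include <xmmintrin.h>

// For r = _mm_shuffle_ps(a, b, _MM_SHUFFLE(z, y, x, w)):
//   r[0] = a[w], r[1] = a[x], r[2] = b[y], r[3] = b[z]

// {v0, v1, v2, v3} -> {v2, v3, v0, v1}
void swap_halves(const float* in, float* out)
{
    __m128 v = _mm_loadu_ps(in);
    _mm_storeu_ps(out, _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 3, 2)));
}

// {v0, v1, v2, v3} -> {v3, v2, v1, v0}
void reverse_lanes(const float* in, float* out)
{
    __m128 v = _mm_loadu_ps(in);
    _mm_storeu_ps(out, _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3)));
}
```

The selector must be a compile-time constant, which is why the permutation is written directly in the call rather than passed in as a variable.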

Studying the example above should give you a basic sense of how to use some of the SSE instructions. Notice the definition of the v4sf data type, which is a vector of size 16 bytes. Also note that each intrinsic used above on vectors has the form _mm_operation_ps(), where the ps suffix stands for "packed single-precision". Other suffixes exist for other kinds of operations, such as ss (scalar single-precision) and, in SSE2, pd (packed double-precision). If you read through the xmmintrin.h header, you will see that there are many more operations available. Reading through the header and doing a few web searches for specific intrinsic functions is an easy way to learn about the details of SSE.
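To illustrate how the suffix of an intrinsic changes its meaning, here is a small sketch with three variants of addition. The wrapper function names are hypothetical; the intrinsics themselves come from xmmintrin.h (SSE) and emmintrin.h (SSE2).

```cpp
#include <xmmintrin.h>  // SSE:  __m128,  _ps and _ss operations
#include <emmintrin.h>  // SSE2: __m128d, _pd operations

// _ps (packed single): adds all four float lanes
void add_packed_single(const float* a, const float* b, float* out)
{
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

// _ss (scalar single): adds only lane 0; lanes 1 to 3 are copied from a
void add_scalar_single(const float* a, const float* b, float* out)
{
    _mm_storeu_ps(out, _mm_add_ss(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

// _pd (packed double, SSE2): adds both double lanes
void add_packed_double(const double* a, const double* b, double* out)
{
    _mm_storeu_pd(out, _mm_add_pd(_mm_loadu_pd(a), _mm_loadu_pd(b)));
}
```

On 32-bit targets, the double-precision variant needs `-msse2` with GCC; on x86-64, SSE2 is part of the baseline instruction set.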

For an idea of how SSE can be used to solve a real problem, consider the following partial example. Suppose we decide to step some particles forward in space using $$\mathbf{r}_{i+1} = \mathbf{r}_{i} + \mathbf{v}_{i}\,\delta t + \frac{1}{2}\mathbf{a}_{i}\,\delta t^2.$$ Ignore the fact that this isn’t the best expression to use, in general. We could create particle packets containing four particles, like this:
// each member holds one coordinate for four particles
struct particle4
{
    v4sf x;
    v4sf y;
    v4sf z;
    v4sf vx;
    v4sf vy;
    v4sf vz;
    v4sf ax;
    v4sf ay;
    v4sf az;
};


and then step four particles forward during each iteration of a for loop using
// particles[] is an array of particle4 objects; N is the number of packets
// the step size, delta_t, is already defined

const v4sf dt = _mm_set1_ps(delta_t);

const v4sf dt2 = _mm_set1_ps(0.5f * delta_t * delta_t);

for (int i = 0; i < N; ++i)
{
    particles[i].x = particles[i].x + dt * particles[i].vx + dt2 * particles[i].ax;
    particles[i].y = particles[i].y + dt * particles[i].vy + dt2 * particles[i].ay;
    particles[i].z = particles[i].z + dt * particles[i].vz + dt2 * particles[i].az;
}

// then update vx, vy, vz, ax, ay, az (not shown)



Using OpenMP to spread the work over several threads would further improve performance.
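As a rough sketch of how the two ideas combine, the version below writes the position update with explicit intrinsics and adds an OpenMP pragma to the loop over packets. It is self-contained, so it redefines the packet struct with the `__m128` type. The helper names `advance` and `step_positions` are hypothetical, and the pragma only has an effect when OpenMP is enabled (`-fopenmp` with GCC).

```cpp
#include <xmmintrin.h>

// Packet of four particles; each member holds one coordinate of all four.
struct particle4
{
    __m128 x, y, z;
    __m128 vx, vy, vz;
    __m128 ax, ay, az;
};

// r + v * dt + a * dt2 for one coordinate of four particles,
// where dt2 holds 0.5 * delta_t^2
static inline __m128 advance(__m128 r, __m128 v, __m128 a,
                             __m128 dt, __m128 dt2)
{
    return _mm_add_ps(r, _mm_add_ps(_mm_mul_ps(dt, v), _mm_mul_ps(dt2, a)));
}

// Hypothetical driver: step the positions of n packets forward by delta_t.
void step_positions(particle4* particles, long n, float delta_t)
{
    const __m128 dt  = _mm_set1_ps(delta_t);
    const __m128 dt2 = _mm_set1_ps(0.5f * delta_t * delta_t);

    // Packets are independent of each other, so the iterations can be
    // divided among threads.
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
    {
        particle4& p = particles[i];
        p.x = advance(p.x, p.vx, p.ax, dt, dt2);
        p.y = advance(p.y, p.vy, p.ay, dt, dt2);
        p.z = advance(p.z, p.vz, p.az, dt, dt2);
    }
}
```

Threading is only likely to help when the number of packets is large enough to outweigh the cost of starting the threads.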

Note: SSE2 and SSE3 merely add more instructions on top of SSE. On the other hand, the newer AVX instructions use 256-bit registers, so they are able to operate on 8 single-precision floats or 4 double-precision floats simultaneously. AVX2 keeps the 256-bit registers and extends most integer instructions to that width, while AVX-512 widens the registers to 512 bits. AVX is quite similar to SSE in terms of usage. For more information, see avxintrin.h (which is included through immintrin.h).
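To show how similar the usage is, here is a minimal sketch that uses the same GCC vector-extension syntax as the v4sf type, but with a 32-byte vector. The type name `v8sf` and the helper `multiply_add8` are hypothetical. The assumption is that when this is compiled with `-mavx`, GCC can map the type to a 256-bit AVX register, and that without the flag it splits each operation into SSE-sized pieces. The intrinsic equivalents are the `__m256` type with `_mm256_mul_ps()` and `_mm256_add_ps()` from immintrin.h.

```cpp
#include <cstring>

// Eight single-precision floats (32 bytes); GCC/Clang vector extension.
typedef float v8sf __attribute__ ((vector_size (32)));

// Hypothetical helper: out[i] = a[i] * b[i] + c[i] for eight floats.
void multiply_add8(const float* a, const float* b, const float* c, float* out)
{
    v8sf va, vb, vc;

    // memcpy is used for loads and stores so that no alignment is assumed
    std::memcpy(&va, a, sizeof va);
    std::memcpy(&vb, b, sizeof vb);
    std::memcpy(&vc, c, sizeof vc);

    v8sf result = va * vb + vc;   // element-wise on all eight lanes

    std::memcpy(out, &result, sizeof result);
}
```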
