Vectorization with SSE
Using OpenCL, CUDA, or OpenACC to take advantage of the computing power offered by GPGPUs is the most effective way to accelerate computationally expensive code nowadays. However, not all machines have powerful GPGPUs. On the other hand, all modern x86 processors from Intel and AMD support vector instructions in the form of Streaming SIMD Extensions (SSE through SSE4), and most new processors support Advanced Vector Extensions (AVX). Using these instructions can improve the performance of single-precision code by as much as a factor of four with SSE or eight with AVX (AVX-512 will increase this to a factor of 16, but at the time of writing it is only available on Intel MIC cards). In some cases, compilers can automatically vectorize pieces of code to take advantage of the CPU’s vector units. Furthermore, OpenMP 4.0 provides the simd directive for requesting vectorization of specific loops, but writing code with vector operations in mind will generally yield better results. In this post, I will briefly explain how to use SSE to vectorize C and C++ code using GCC 4.8. The intrinsics should also work with LLVM/Clang and Microsoft Visual Studio, although the vector_size attribute and the arithmetic operators on vectors used below are GCC extensions (also supported by Clang).
SSE registers are 128 bits (16 bytes) wide. This means that each register can store 4 single-precision floating point numbers or 2 double-precision floating point numbers, as well as other types that add up to 16 bytes. Using SSE, we can perform 4 single-precision or 2 double-precision floating point operations simultaneously on one CPU core using one instruction. Note: A superscalar processor, containing multiple floating point units, can also perform multiple floating point operations simultaneously. However, a separate instruction is issued for each operation, and the programmer has little, if any, control over which operations are performed simultaneously. Using SSE, we also have more control over the cache and prefetching. It’s even possible to bypass the cache entirely, which can be useful in cases where cache pollution would cause a performance hit. With SSE, we can also eliminate some branches in the code, thereby reducing the chance of a branch misprediction, which would necessitate a pipeline flush. This can potentially improve performance further.
First-generation SSE instructions can be accessed using the header xmmintrin.h. SSE2 and SSE3 instructions can be used by including emmintrin.h and pmmintrin.h, respectively. To access all vector extensions, including SSE4 and AVX, use immintrin.h. You also need to consult your compiler’s documentation to learn which flags are required to enable the instructions (with GCC, options such as -msse, -msse2, and -mavx). In the code example below, I demonstrate:
- initialization of a vector at compile time
- initialization of a vector from an array
- storing vector data in an array
- vector addition
- vector multiplication
- vector division
- multiplication by a scalar
- using element-wise comparison to create a mask
- bitwise AND, OR and ANDNOT (which computes (NOT a) AND b, not NAND)
- how to replace a conditional statement using the mask
- the vector square root
- the approximate inverse square root
- the approximate reciprocal instruction
- how to use the shuffle instruction and the _MM_SHUFFLE macro.
/*
  vector.cpp
  compile with g++ vector.cpp -o vector -O2 -msse
*/
#include <iostream>
#include <xmmintrin.h>

#define PRINT(var) print((var), #var)

using namespace std;

// v4sf: vector of four single-precision floats
typedef float v4sf __attribute__ ((vector_size (16)));

// these are used for printing the results
void print(const v4sf& vec, const char* name);
void print(const float* vec, const char* name);

int main()
{
    // one way to initialize a vector
    v4sf vecA = {0, 1, 2, 3};

    // another way: loading the data from a 16-byte aligned array
    // (_mm_load_ps requires the alignment, so it is requested explicitly)
    float arrayA[4] __attribute__ ((aligned (16))) = {4, 2, 3, 6};
    v4sf vecAA = _mm_load_ps(arrayA);

    /* also refer to _mm_setzero_ps(), _mm_set1_ps(), _mm_load_ss(),
       _mm_load1_ps(), _mm_loadu_ps(), _mm_loadr_ps(), _mm_set_ps(),
       and _mm_setr_ps() in xmmintrin.h for vector initialization. */

    // one way of adding vectors. The other method explicitly uses
    // _mm_add_ps()
    v4sf vecB = vecA + vecA;

    // store vector data in an array (also 16-byte aligned)
    float arrayB[4] __attribute__ ((aligned (16)));
    _mm_store_ps(arrayB, vecB);

    /* NOTE: g++ 4.8 automatically converts scalars to vectors.
       Other compilers (such as g++ 4.7) require you to set up the
       vectors explicitly. An easy way to do this is to use
       _mm_set1_ps(). */

    // multiplication by a scalar
    v4sf vecC = 2.0f * vecB; // _mm_set1_ps(2.0f) * vecB;

    // more complicated expressions
    v4sf vecD = vecC - 0.5f * vecA;
    v4sf vecE = 1.0f + 5.0f * vecC / (vecA + 3.0f);

    /* for logic operations, TRUE = 0xFFFFFFFF, FALSE = 0x00000000
       Thus, the result is a mask which can be bitwise ANDed using
       _mm_and_ps(). The intrinsic is used here because in GCC the
       expression (vecA == vecB) yields an integer vector, which does
       not convert implicitly to v4sf. */
    v4sf mask = _mm_cmpeq_ps(vecA, vecB);

    // bitwise AND.
    // This is equivalent to vecF[i] = (vecA[i] == vecB[i]) ? vecE[i] : 0
    v4sf vecF = _mm_and_ps(mask, vecE);

    // bitwise OR and ANDNOT.
    // equivalent to vecG[i] = (vecA[i] == vecB[i]) ? vecE[i] : vecD[i];
    v4sf vecG = _mm_or_ps(_mm_and_ps(mask, vecE),
                          _mm_andnot_ps(mask, vecD));

    // vecH[i] = sqrt(vecE[i]);
    v4sf vecH = _mm_sqrt_ps(vecE);

    // vecI[i] approx 1.0 / sqrt(vecE[i]);
    v4sf vecI = _mm_rsqrt_ps(vecE);

    // vecJ[i] approx 1.0 / vecAA[i];
    v4sf vecJ = _mm_rcp_ps(vecAA);
    v4sf vecK = 1.0f / vecAA;

    // vecL is a permutation of vecA.
    v4sf vecL = _mm_shuffle_ps(vecA, vecA, _MM_SHUFFLE(1,0,3,2));

    ////////////////////////////// output ///////////////////////////////
    PRINT(vecA);
    PRINT(vecAA);
    PRINT(vecB);
    PRINT(arrayB);
    PRINT(vecC);
    PRINT(vecD);
    PRINT(vecE);
    PRINT(vecF);
    PRINT(vecG);
    PRINT(vecH);
    PRINT(vecI);
    PRINT(vecJ);
    PRINT(vecK);
    PRINT(vecL);

    return 0;
}

void print(const v4sf& vec, const char* name)
{
    // copy the vector into a union so its elements can be read as floats
    union
    {
        v4sf vector;
        float array[4];
    } value;

    value.vector = vec;
    print(value.array, name);
}

void print(const float* vec, const char* name)
{
    cout << "\n" << name << endl;

    for (int i = 0; i < 4; ++i)
    {
        cout << "\t" << vec[i] << endl;
    }
}
Output:
vecA: 0 1 2 3
vecAA: 4 2 3 6
vecB: 0 2 4 6
arrayB: 0 2 4 6
vecC: 0 4 8 12
vecD: 0 3.5 7 10.5
vecE: 1 6 9 11
vecF: 1 0 0 0
vecG: 1 3.5 7 10.5
vecH: 1 2.44949 3 3.31662
vecI: 0.999878 0.408264 0.333313 0.301514
vecJ: 0.249939 0.499878 0.333313 0.166656
vecK: 0.25 0.5 0.333333 0.166667
vecL: 2 3 0 1
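To understand the shuffle command in part L of the example, note that _mm_shuffle_ps(a, b, _MM_SHUFFLE(z, y, x, w)) returns the vector {a[w], a[x], b[y], b[z]}: the two low elements of the result are selected from the first argument and the two high elements from the second. With both arguments equal to vecA and _MM_SHUFFLE(1,0,3,2), this gives {vecA[2], vecA[3], vecA[0], vecA[1]}. Refer to this documentation for a more general usage of the shuffle operation.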
Studying the example above should give you a basic sense of how to use some of the SSE instructions. Notice the definition of the v4sf data type, which is a vector of size 16 bytes. Also note that each operation performed above on vectors has the form _mm_operation_ps(), where the ps suffix stands for "packed single-precision". Other suffixes exist, such as ss (scalar single-precision, which operates on the lowest element only) and, with SSE2, pd (packed double-precision). If you read through the xmmintrin.h header, you will see that there are many more operations available. Reading through the header and doing a few web searches for specific intrinsic functions is an easy way to learn about the details of SSE.
For an idea of how SSE can be used to solve a real problem, consider the following partial example. Suppose we decide to step some particles forward in space using
$$\mathbf{r}_{i+1} = \mathbf{r}_{i} + \mathbf{v}_{i}\,\delta t + \frac{1}{2}\mathbf{a}_{i}\,\delta t^2.$$
Ignore the fact that this isn’t the best expression to use, in general. We could create particle packets containing four particles, like this:
struct particle4
{
    v4sf x;
    v4sf y;
    v4sf z;
    v4sf vx;
    v4sf vy;
    v4sf vz;
    v4sf ax;
    v4sf ay;
    v4sf az;
};
and then step four particles forward during each iteration of a for loop using
// particles[] is an array of N particle4 objects
// the step size, delta_t, is already defined
const v4sf dt = _mm_set1_ps(delta_t);
const v4sf dt2 = _mm_set1_ps(0.5f * delta_t * delta_t);

for (int i = 0; i < N; ++i)
{
    particles[i].x = particles[i].x + dt * particles[i].vx + dt2 * particles[i].ax;
    particles[i].y = particles[i].y + dt * particles[i].vy + dt2 * particles[i].ay;
    particles[i].z = particles[i].z + dt * particles[i].vz + dt2 * particles[i].az;
}

// then update vx, vy, vz, ax, ay, az (not shown)
Using OpenMP to spread the work over several threads would further improve performance.
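Note: SSE2 and SSE3 merely add more instructions on top of SSE. On the other hand, the newer AVX instructions use 256-bit registers, so they are able to operate on 8 single-precision floats or 4 double-precision floats simultaneously. AVX2 extends most integer instructions to 256 bits, while AVX-512 widens the registers to 512 bits! AVX is quite similar to SSE in terms of usage: the intrinsics are named _mm256_operation_ps() instead of _mm_operation_ps(). For more information, see avxintrin.h (include it through immintrin.h rather than directly).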