# Intel SIMD architecture

Computer Organization and Assembly Languages Yung-Yu Chuang

### Overview



- SIMD
- MMX architectures
- MMX instructions
- examples
- SSE/SSE2
- SIMD instructions are probably the best place to use assembly since compilers usually do not do a good job on using these instructions



 Increasing clock rate is not fast enough for boosting performance





In his 1965 paper, Intel co-founder Gordon Moore observed that "the number of transistors per square inch had doubled every 18 months.



- Architecture improvements (such as pipeline/cache/SIMD) are more significant
- Intel analyzed multimedia applications and found they share the following characteristics:
  - Small native data types (8-bit pixel, 16-bit audio)
  - Recurring operations
  - Inherent parallelism



- SIMD (single instruction multiple data) architecture performs the same operation on multiple data elements in parallel
- PADDW MM0, MM1



### SISD/SIMD/Streaming







- MMX (<u>Multimedia Extension</u>) was introduced in 1996 (Pentium with MMX and Pentium II).
- SSE (<u>Streaming SIMD Extension</u>) was introduced with Pentium III.
- SSE2 was introduced with Pentium 4.
- SSE3 was introduced with Pentium 4 supporting hyper-threading technology. SSE3 adds 13 more instructions.



- After analyzing a lot of existing applications such as graphics, MPEG, music, speech recognition, game, image processing, they found that many multimedia algorithms execute the same instructions on many pieces of data in a large data set.
- Typical elements are small, 8 bits for pixels, 16 bits for audio, 32 bits for graphics and general computing.
- New data type: 64-bit packed data type. Why 64 bits?
  - Good enough
  - Practical









8 MMX Registers MM0~MM7

NaN or infinity as real because bits 79-64 are ones.

Even if MMX registers are 64-bit, they don't extend Pentium to a 64-bit CPU since only logic instructions are provided for 64-bit data.



- To be fully compatible with existing IA, no new mode or state was created. Hence, for context switching, no extra state needs to be saved.
- To reach the goal, MMX is hidden behind FPU. When floating-point state is saved or restored, MMX is saved or restored.
- It allows existing OS to perform context switching on the processes executing MMX instruction without be aware of MMX.
- However, it means MMX and FPU can not be used at the same time. Big overhead to switch.



- Although Intel defenses their decision on aliasing MMX to FPU for compatibility. It is actually a bad decision. OS can just provide a service pack or get updated.
- It is why Intel introduced SSE later without any aliasing



- 57 MMX instructions are defined to perform the parallel operations on multiple data elements packed into 64-bit data types.
- These include add, subtract, multiply, compare, and shift, data conversion, 64-bit data move, 64-bit logical operation and multiply-add for multiplyaccumulate operations.
- All instructions except for data move use MMX registers as operands.
- Most complete support for 16-bit operations.



- Useful in graphics applications.
- When an operation overflows or underflows, the result becomes the largest or smallest possible representable number.
- Two types: signed and unsigned saturation



### **MMX instructions**



| Category   |                                    | Wraparound                            | Signed Saturation     | Unsigned<br>Saturation |
|------------|------------------------------------|---------------------------------------|-----------------------|------------------------|
| Arithmetic | Addition                           | PADDB, PADDW,<br>PADDD                | PADDSB,<br>PADDSW     | PADDUSB,<br>PADDUSW    |
|            | Subtraction                        | PSUBB, PSUBW,<br>PSUBD                | PSUBSB,<br>PSUBSW     | PSUBUSB,<br>PSUBUSW    |
|            | Multiplication<br>Multiply and Add | PMULL, PMULH<br>PMADD                 |                       |                        |
| Comparison | Compare for Equal                  | PCMPEQB,<br>PCMPEQW,<br>PCMPEQD       |                       |                        |
|            | Compare for Greater<br>Than        | PCMPGTPB,<br>PCMPGTPW,<br>PCMPGTPD    |                       |                        |
| Conversion | Pack                               |                                       | PACKSSWB,<br>PACKSSDW | PACKUSWB               |
| Unpack     | Unpack High                        | PUNPCKHBW,<br>PUNPCKHWD,<br>PUNPCKHDQ |                       |                        |
|            | Unpack Low                         | PUNPCKLBW,<br>PUNPCKLWD,<br>PUNPCKLDQ |                       |                        |



|                    |                                                                     | Packed                                       | Full Quadword                |
|--------------------|---------------------------------------------------------------------|----------------------------------------------|------------------------------|
| Logical            | And<br>And Not<br>Or<br>Exclusive OR                                |                                              | PAND<br>PANDN<br>POR<br>PXOR |
| Shift              | Shift Left Logical<br>Shift Right Logical<br>Shift Right Arithmetic | PSLLW, PSLLD<br>PSRLW, PSRLD<br>PSRAW, PSRAD | PSLLQ<br>PSRLQ               |
|                    |                                                                     | Doubleword Transfers                         | Quadword Transfers           |
| Data Transfer      | Register to Register<br>Load from Memory<br>Store to Memory         | MOVD<br>MOVD<br>MOVD                         | MOVQ<br>MOVQ<br>MOVQ         |
| Empty MMX<br>State |                                                                     | EMMS                                         |                              |

Call it before you switch to FPU from MMX; Expensive operation



- **PADDB/PADDW/PADDD**: add two packed numbers, no EFLAGS is set, ensure overflow never occurs by yourself
- Multiplication: two steps
- **PMULLW**: multiplies four words and stores the four lo words of the four double word results
- **PMULHW/PMULHUW**: multiplies four words and stores the four hi words of the four double word results. **PMULHUW** for unsigned.



#### • PMADDWD

 $DEST[31:0] \leftarrow (DEST[15:0] * SRC[15:0]) + (DEST[31:16] * SRC[31:16]);$  $DEST[63:32] \leftarrow (DEST[47:32] * SRC[47:32]) + (DEST[63:48] * SRC[63:48]);$ 



#### Detect MMX/SSE



- mov eax, 1 ; request version info
- cpuid ; supported since Pentium
- test edx, 00800000h ;bit 23
  - ; 0200000h (bit 25) SSE
  - ; 0400000h (bit 26) SSE2
- jnz HasMMX

### cpuid



| Initial EAX<br>Value | Information Provided about the Processor Basic CPUID Information                                                                 |                                                                                                                                                |  |
|----------------------|----------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|--|
|                      |                                                                                                                                  |                                                                                                                                                |  |
| он                   | EAX Maximum Input Value for Ba<br>EBX "Genu"<br>ECX "ntel"<br>EDX "inel"                                                         | asic CPUID Information (see Table 3-13)                                                                                                        |  |
| 01H                  | EBX Bits 7-0: Brand Index<br>Bits 15-8: CLFLUSH line siz                                                                         | amily, Model, and Stepping ID (see Figure 3-5<br>e (Value * 8 = cache line size in bytes)<br>er of logical processors in this physical package |  |
|                      | ECX Extended Feature Information<br>EDX Feature Information (see Fig                                                             | on (see Figure 3-6 and Table 3-15)<br>gure 3-7 and Table 3-16)                                                                                 |  |
| 02H                  | EAX Cache and TLB Information<br>EBX Cache and TLB Information<br>ECX Cache and TLB Information<br>EDX Cache and TLB Information | (see Table 3-17)                                                                                                                               |  |

:



Example: add a constant to a vector



char d[]={5, 5, 5, 5, 5, 5, 5, 5}; char clr[]={65,66,68,...,87,88}; // 24 bytes asm{ movq mm1, d mov cx, 3 mov esi, 0 L1: movq mm0, clr[esi] paddb mm0, mm1 movq clr[esi], mm0 add esi, 8 loop L1 emms



- No CFLAGS, how many flags will you need? Results are stored in destination.
- EQ/GT, no LT



PCMPEQB/PCMPGTB Operation



- Pack: converts a larger data type to the next smaller data type.
- Unpack: takes two operands and interleave them. It can be used for expand data type for immediate calculation.

Unpack low-order words into doublewords



# Pack with signed saturation









# **Unpack low portion**





# **Unpack low portion**













# Keys to SIMD programming



- Efficient data layout
- Elimination of branches

# Application: frame difference







Application: frame difference









# Application: frame difference



| MOVQ   | mm1,  | A //move 8 pixels of image A |
|--------|-------|------------------------------|
| MOVQ   | mm2 , | B //move 8 pixels of image B |
| MOVQ   | mm3,  | mm1 // mm3=A                 |
| PSUBSB | mm1,  | mm2 // mm1=A-B               |
| PSUBSB | mm2,  | mm3 // mm2=B-A               |
| POR    | mm1,  | mm2 // mm1= A-B              |





 $A^*\alpha + B^*(1-\alpha) = B + \alpha(A-B)$ 

$$\alpha = 0.75$$















- Two formats: planar and chunky
- In Chunky format, 16 bits of 64 bits are wasted
- So, we use planar in the following example





| Image                                                           | А            |         | Imag    | еВ        |                 |
|-----------------------------------------------------------------|--------------|---------|---------|-----------|-----------------|
| Ar                                                              | 3 Ar2 Ar     | 1 Ar0   |         |           | Br3 Br2 Br1 Br0 |
| 1. Unpack byte R pixel components                               | $\backslash$ |         | >       | $\langle$ |                 |
| from image A & B                                                | Ar3          | Ar2     | Ar1     | Ar0       |                 |
| 2. Subtract image B from image A                                | Br3          | Br2     | Br1     | Br0       |                 |
|                                                                 | r3           | r2      | r1      | r0        | Ī               |
| 3. Multiply subtract result by fade                             | *            | *       | *       | *         | I               |
| value                                                           | fade         | fade    | fade    | fade      | <u> </u>        |
|                                                                 | fade*r3      | fade*r2 | fade*r1 | fade*r0   |                 |
|                                                                 | +            | +       | +       | +         |                 |
| 4. Add image B pixels                                           | Br3          | Br2     | Br1     | Br0       |                 |
| <ol> <li>Pack new composite pixels back<br/>to bytes</li> </ol> | new r3       | new r2  | new r1  | new r0    | Í               |
|                                                                 |              |         | r3 r2   | r1 r0     |                 |



| MOVQ       | mm0,   | alpha//4 16-b zero-padding $lpha$        |
|------------|--------|------------------------------------------|
| MOVD       | mm1,   | A //move 4 pixels of image A             |
| MOVD       | mm2 ,  | B //move 4 pixels of image B             |
| PXOR       | mm3,   | mm3 //clear mm3 to all zeroes            |
| //unpack   | 4 pixe | els to 4 words                           |
| PUNPCKLBW  | mm1,   | mm3 // Because B-A could be              |
| PUNPCKLBW  | mm2 ,  | <pre>mm3 // negative, need 16 bits</pre> |
| PSUBW      | mm1,   | mm2 //(B-A)                              |
| PMULHW     | mm1,   | mm0 //(B-A)*fade/256                     |
| PADDW      | mm1,   | mm2 //(B-A)*fade + B                     |
| //pack for | ur woi | rds back to four bytes                   |
| PACKUSWB   | mm1,   | mm3                                      |

## Data-independent computation



- Each operation can execute without needing to know the results of a previous operation.
- Example, sprite overlay
- for i=1 to sprite\_Size
  - if sprite[i]=clr

then out\_color[i]=bg[i]

else out\_color[i]=sprite[i]



• How to execute data-dependent calculations on several pixels in parallel.

### Application: sprite overlay



| Phase    | e 1         | a3                | a2          | a1                              | a0          |            |          |  |  |  |  |
|----------|-------------|-------------------|-------------|---------------------------------|-------------|------------|----------|--|--|--|--|
|          |             | =                 | =           | =                               | =           |            |          |  |  |  |  |
|          |             | clear_color       | clear_color | clear_color                     | clear_color |            |          |  |  |  |  |
|          |             |                   |             |                                 |             |            |          |  |  |  |  |
|          |             | 11111111          | 00000000    | 11111111                        | 00000000    |            |          |  |  |  |  |
| Phase    | e 2         |                   |             |                                 |             |            |          |  |  |  |  |
| a3       | a2          | a1                | a0          | c3                              | c2          | c1         | c0       |  |  |  |  |
| A a      | und (Comple | ement of <b>M</b> | ask)        |                                 | C and       | C and Mask |          |  |  |  |  |
| 00000000 | 11111111    | 00000000          | 11111111    | 11111111                        | 00000000    | 11111111   | 00000000 |  |  |  |  |
|          |             |                   |             |                                 |             |            |          |  |  |  |  |
|          |             |                   |             |                                 |             |            |          |  |  |  |  |
| 0        | a2          | 0                 | a0          | c3                              | 0           | c1         | 0        |  |  |  |  |
| 0        | a2          | 0                 | OR the t    | c3<br>wo results<br>the overlay | 0           | c1         | 0        |  |  |  |  |
| 0        | a2          | 0                 | OR the t    | wo results                      | 0           | c1         | 0        |  |  |  |  |

### Application: sprite overlay



| MOVQ    | mm0, | sprite |
|---------|------|--------|
| MOVQ    | mm2, | mm0    |
| MOVQ    | mm4, | bg     |
| MOVQ    | mm1, | clr    |
| PCMPEQW | mm0, | mm1    |
| PAND    | mm4, | mm0    |
| PANDN   | mm0, | mm2    |
| POR     | mm0, | mm4    |





Note: Repeat for the other rows to generate ([d<sub>3</sub>, c<sub>3</sub>, b<sub>3</sub>, a<sub>3</sub>] and [d<sub>2</sub>, c<sub>2</sub>, b<sub>2</sub>, a<sub>2</sub>]).

#### MMX code sequence operation:

| movq      | mm1, row1            | ; load pixels from first row of matrix                          |
|-----------|----------------------|-----------------------------------------------------------------|
| movq      | mm2, row2            | ; load pixels from second row of matrix                         |
| movq      | mm3, row3            | ; load pixels from third row of matrix                          |
| movq      | mm4, row4            | ; load pixels from fourth row of matrix                         |
| punpcklwd | mm1, mm2             | ; unpack low order words of rows 1 & 2, mm 1 = [b1, a1, b0, a0] |
| punpcklwd | mm3, mm <del>4</del> | ; unpack low order words of rows 3 & 4, mm3 = [d1, c1, d0, c0]  |
| movq      | mm5, mm1             | ; copy mm1 to mm5                                               |
| punpckldq | mm1, mm3             | ; unpack low order doublewords -> mm2 = [d0, c0, b0, a0]        |
| punpckhdq | mm5, mm3             | ; unpack high order doublewords -> mm5 = [d1, c1, b1, a1]       |
|           |                      |                                                                 |



```
char M1[4][8];// matrix to be transposed
char M2[8][4];// transposed matrix
int n=0;
for (int i=0;i<4;i++)
  for (int j=0;j<8;j++)</pre>
    { M1[i][j]=n; n++; }
 asm{
//move the 4 rows of M1 into MMX registers
movq mm1,M1
movq mm2, M1+8
movq mm3, M1+16
movq mm4, M1+24
```

Application: matrix transport



```
//generate rows 1 to 4 of M2
punpcklbw mm1, mm2
punpcklbw mm3, mm4
movq mm0, mm1
punpcklwd mm1, mm3 //mm1 has row 2 & row 1
punpckhwd mm0, mm3 //mm0 has row 4 & row 3
movq M2, mm1
movq M2, mm1
```

**Application: matrix transport** 



```
//generate rows 5 to 8 of M2
movq mm1, M1 //get row 1 of M1
movq mm3, M1+16 //get row 3 of M1
punpckhbw mm1, mm2
punpckhbw mm3, mm4
movq mm0, mm1
punpcklwd mm1, mm3 //mm1 has row 6 & row 5
punpckhwd mm0, mm3 //mm0 has row 8 & row 7
//save results to M2
movq M2+16, mm1
movq M2+24, mm0
emms
} //end
```





How to use assembly in projects



- Write the whole project in assembly
- Link with high-level languages
- Inline assembly
- Intrinsics



- Assembly is rarely used to develop the entire program.
- Use high-level language for overall project development
  - Relieves programmer from low-level details
- Use assembly language code
  - Speed up critical sections of code
  - Access nonstandard hardware devices
  - Write platform-specific code
  - Extend the HLL's capabilities



- Considerations when calling assembly language procedures from high-level languages:
  - Both must use the same naming convention (rules regarding the naming of variables and procedures)
  - Both must use the same memory model, with compatible segment names
  - Both must use the same calling convention



- Assembly language source code that is inserted directly into a HLL program.
- Compilers such as Microsoft Visual C++ and Borland C++ have compiler-specific directives that identify inline ASM code.
- Efficient inline code executes quickly because CALL and RET instructions are not required.
- Simple to code because there are no external names, memory models, or naming conventions involved.
- Decidedly not portable because it is written for a single platform.

## asm directive in Microsoft Visual C+

- Can be placed at the beginning of a single statement
- Or, It can mark the beginning of a block of assembly language statements
- Syntax:





- An *intrinsic* is a function known by the compiler that directly maps to a sequence of one or more assembly language instructions.
- The compiler manages things that the user would normally have to be concerned with, such as register names, register allocations, and memory locations of data.
- Intrinsic functions are inherently more efficient than called functions because no calling linkage is required. But, not necessarily as efficient as assembly.
- \_mm\_<opcode>\_<suffix> ps: packed single-precision ss: scalar single-precision

#### Intrinsics



#include <xmmintrin.h>

56



- Adds eight 128-bit registers
- Allows SIMD operations on packed singleprecision floating-point numbers
- Most SSE instructions require 16-aligned addresses



- Add eight 128-bit data registers (XMM registers) in non-64-bit modes; sixteen XMM registers are available in 64-bit mode.
- 32-bit MXCSR register (control and status)
- Add a new data type: 128-bit packed singleprecision floating-point (4 FP numbers.)
- Instruction to perform SIMD operations on 128bit packed single-precision FP and additional 64-bit SIMD integer operations.
- Instructions that explicitly prefetch data, control data cacheability and ordering of store

## SSE programming environment





## MXCSR control and status register



|                                                                                                                                                                    | 31                                                                                                                                | 16   | 15     | 14 13  | 12     | 11     | 10     | 9      | 8      | 7      | 6           | 5      | 4      | 3      | 2      | 1      | 0      |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------|------|--------|--------|--------|--------|--------|--------|--------|--------|-------------|--------|--------|--------|--------|--------|--------|
|                                                                                                                                                                    | Reserved                                                                                                                          |      | F<br>Z | R<br>C | P<br>M | U<br>M | 0<br>M | Z<br>M | D<br>M | I<br>M | D<br>A<br>Z | P<br>E | U<br>E | 0<br>E | Z<br>E | D<br>E | I<br>E |
| Overflow Mask<br>Divide-by-Zero<br>Denormal Operation<br>Invalid Operation<br>Denormals Are<br>Precision Flag<br>Underflow Flag<br>Overflow Flag<br>Divide-by-Zero | trol<br>k<br>sk<br>mask<br>and Mask<br>eration Mask<br>eration Mask<br>e Zeros <sup>*</sup><br>Generally faster, but<br>g<br>Flag | t no | ot     | com    | ١p     | at     |        |        |        | /it    |             |        |        |        | 75     | 4      |        |

#### Exception



MM ALIGN16 float test1[4] = { 0, 0, 0, 1 }; MM ALIGN16 float test2[4] = { 1, 2, 3, 0 }; MM ALIGN16 float out[4]; MM SET EXCEPTION MASK(0);//enable exception Without this, result is 1.#INF try { m128 a = mm load ps(test1); m128 b = mm load ps(test2); a = mm div ps(a, b);mm store ps(out, a); except(EXCEPTION EXECUTE HANDLER) { if ( mm getcsr() & MM EXCEPT DIV ZERO) cout << "Divide by zero" << endl; return;





• ADDPS/SUBPS: packed single-precision FP





• ADDSS/SUBSS: scalar single-precision FP used as FPU?



- Provides ability to perform SIMD operations on double-precision FP, allowing advanced graphics such as ray tracing
- Provides greater throughput by operating on 128-bit packed integers, useful for RSA and RC5



• Add data types and instructions for them



Programming environment unchanged

#### Example



```
void add(float *a, float *b, float *c) {
  for (int i = 0; i < 4; i++)
    c[i] = a[i] + b[i];
               movaps: move aligned packed single-
  asm {
                       precision FP
    eax, a addps: add packed single-precision FP
mov
    edx, b
mov
mov ecx, c
movaps xmm0, XMMWORD PTR [eax]
addps xmm0, XMMWORD PTR [edx]
movaps XMMWORD PTR [ecx], xmm0
}
```



#### SHUFPS xmm1, xmm2, imm8

# Select[1..0] decides which DW of DEST to be copied to the 1st DW of DEST



ESAC;

- 3: DEST[63:32]  $\leftarrow$  DEST[127:96];
- 2: DEST[63:32]  $\leftarrow$  DEST[95:64];
- 1: DEST[63:32]  $\leftarrow$  DEST[63:32];
- CASE (SELECT[3:2]) OF 0: DEST[63:32]  $\leftarrow$  DEST[31:0];

ESAC;

- DEST[31:0]  $\leftarrow$  DEST[127:96]; 3:
- 2: DEST[31:0]  $\leftarrow$  DEST[95:64];
- 1: DEST[31:0]  $\leftarrow$  DEST[63:32];
- 0: DEST[31:0]  $\leftarrow$  DEST[31:0];

CASE (SELECT[1:0]) OF

ESAC:

- 3: DEST[127:96]  $\leftarrow$  SRC[127:96];
- 2: DEST[127:96]  $\leftarrow$  SRC[95:64];
- 1: DEST[127:96]  $\leftarrow$  SRC[63:32];
- CASE (SELECT[7:6]) OF 0: DEST[127:96]  $\leftarrow$  SRC[31:0];

ESAC;

- 3: DEST[95:64]  $\leftarrow$  SRC[127:96];
- 2: DEST[95:64]  $\leftarrow$  SRC[95:64];
- 1: DEST[95:64]  $\leftarrow$  SRC[63:32];
- 0: DEST[95:64]  $\leftarrow$  SRC[31:0];

CASE (SELECT[5:4]) OF





```
Vector cross(const Vector& a , const Vector& b ) {
    return Vector(
        ( a[1] * b[2] - a[2] * b[1] ) ,
        ( a[2] * b[0] - a[0] * b[2] ) ,
        ( a[0] * b[1] - a[1] * b[0] ) );
}
```



```
/* cross */
 m128 mm cross ps( m128 a , m128 b ) {
 m128 ea , eb;
 // set to a[1][2][0][3] , b[2][0][1][3]
 ea = mm shuffle ps( a, a, MM SHUFFLE(3,0,2,1) );
 eb = mm shuffle ps(b, b, MM SHUFFLE(3,1,0,2));
 // multiply
 m128 xa = mm_mul_ps(ea, eb);
 // set to a[2][0][1][3] , b[1][2][0][3]
 a = mm shuffle ps(a, a, MM SHUFFLE(3,1,0,2));
 b = mm shuffle ps(b, b, MM SHUFFLE(3,0,2,1));
 // multiply
  m128 xb = mm mul_ps(a, b);
 // subtract
 return mm_sub_ps( xa , xb );
}
```



- Given a set of vectors  $\{v_1, v_2, ..., v_n\} = \{(x_1, y_1, z_1), (x_2, y_2, z_2), ..., (x_n, y_n, z_n)\}$  and a vector  $v_c = (x_c, y_c, z_c),$  calculate  $\{v_c \cdot v_i\}$
- Two options for memory layout
- Array of structure (AoS)
   typedef struct { float dc, x, y, z; } Vertex;
   Vertex v[n];
- Structure of array (SoA)
   typedef struct { float x[n], y[n], z[n]; }
   VerticesList;
   VerticesList v;



movaps xmm0, v; xmm0 = DC, x0, y0, z0movaps xmm1, vc; xmm1 = DC, xc, yc, zcmulps xmm0, xmm1; xmm0=DC, x0\*xc, y0\*yc, z0\*zcmovhlps xmm1, xmm0 ; xmm1= DC, DC, DC, x0\*xc addps xmm1, xmm0; xmm1 = DC, DC, DC, x0\*xc+z0\*zc; movaps xmm2, xmm0 shufps xmm2, xmm2, 55h ; xmm2=DC,DC,DC,y0\*yc addps xmm1, xmm2; xmm1 = DC, DC, DC, x0\*xc+y0\*yc+z0\*zc;

movhlps:DEST[63..0] := SRC[127..64]



| ; $X = x1, x2,, x3$                                    |
|--------------------------------------------------------|
| ; $Y = y1, y2, \dots, y3$                              |
| ; $z = z1, z2,, z3$                                    |
| ; $A = xc, xc, xc, xc$                                 |
| ; $B = yc, yc, yc, yc$                                 |
| ; $C = zc, zc, zc, zc$                                 |
| movaps $xmm0$ , X ; $xmm0 = x1, x2, x3, x4$            |
| movaps $xmm1$ , Y ; $xmm1 = y1, y2, y3, y4$            |
| movaps $xmm2$ , Z ; $xmm2 = z1, z2, z3, z4$            |
| <pre>mulps xmm0, A ;xmm0=x1*xc,x2*xc,x3*xc,x4*xc</pre> |
| <pre>mulps xmm1, B ;xmm1=y1*yc,y2*yc,y3*xc,y4*yc</pre> |
| <pre>mulps xmm2, C ;xmm2=z1*zc,z2*zc,z3*zc,z4*zc</pre> |
| addps xmm0, xmm1                                       |
| addps xmm0, xmm2; xmm0=( $x0*xc+y0*yc+z0*zc$ )         |



• Graphics Processing Unit (GPU): nVidia 7800, 24 pipelines (8 vector/16 fragment)





- Each GeForce 8800 GPU stream processor is a fully generalized, fully decoupled, scalar, processor that supports IEEE 754 floating point precision.
- Up to 128 stream processors





- Cell Processor (IBM/Toshiba/Sony): 1 PPE (Power Processing Unit) +8 SPEs (Synergistic Processing Unit)
- An SPE is a RISC processor with 128-bit SIMD for single/double precision instructions, 128 128-bit registers, 256K local cache
- used in PS3.

#### **Cell processor**





# GPUs keep track to Moore's law better

#### Table 1. Tale of the tape: Throughput architectures.

| Туре | Processor                      | Cores/Chip | ALUs/Core <sup>3</sup> | SIMD width | Max T <sup>4</sup> |
|------|--------------------------------|------------|------------------------|------------|--------------------|
| GPUs | AMD Radeon HD<br>4870          | 10         | 80                     | 64         | 25                 |
|      | NVIDIA GeForce<br>GTX 280      | 30         | 8                      | 32         | 128                |
| CPUs | Intel Core 2 Quad <sup>1</sup> | 4          | 8                      | 4          | 1                  |
|      | STI Cell BE <sup>2</sup>       | 8          | 4                      | 4          | 1                  |
|      | Sun UltraSPARC T2              | 8          | 1                      | 1          | 4                  |

<sup>1</sup> SSE processing only, does not account for traditional FPU

<sup>2</sup> Stream processing (SPE) cores only, does not account for PPU cores.

<sup>3</sup> 32-bit floating point operations

<sup>4</sup> Max T is defined as the maximum ratio of hardware-managed thread execution contexts to simultaneously executable threads (not an absolute count of hardware-managed execution contexts). This ratio is a measure of a processor's ability to automatically hide thread stalls using hardware multithreading.

## Different programming paradigms



```
Computing y \_ ax + y with a serial loop:
void saxpy serial(int n, float alpha, float *x, float *y)
{
   for(int i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
// Invoke serial SAXPY kernel
saxpy serial(n, 2.0, x, y);
Computing y _ ax + y in paraddel using CUDA:
 global
void saxpy parallel(int n, float alpha, float *x, float *y)
ł
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if( i < n ) y[i] = alpha * x[i] + y[i];
// Invoke parallel SAXPY kernel (256 threads per block)
int nblocks = (n + 255) / 256;
saxpy parallel<<<nblocks, 256>>>(n, 2.0, x, y);
```



- Intel MMX for Multimedia PCs, CACM, Jan. 1997
- Chapter 11 *The MMX Instruction Set*, The Art of Assembly
- Chap. 9, 10, 11 of IA-32 Intel Architecture Software Developer's Manual: Volume 1: Basic Architecture
- http://www.csie.ntu.edu.tw/~r89004/hive/sse/page\_1.html