Vector, SIMD, and GPUs

March 5, 2024, 10:30 a.m. ยท 10 min read ยท ๐ŸŒ๏ธŽ en

architecture

ํ•„์ž๊ฐ€ Princeton University๊ฐ€ Coursera์—์„œ ์ œ๊ณตํ•˜๋Š” Computer Architecture (Instructor: David Wentzlaff)๋ฅผ ๋“ฃ๊ณ  ์ผ๋ถ€๋ฅผ ์š”์•ฝํ•ด ์ •๋ฆฌํ•œ ๊ธ€์ž…๋‹ˆ๋‹ค.

์ด ๊ธ€์—์„œ๋Š” ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์œ„ํ•œ ์•„ํ‚คํ…์ฒ˜์ธ Vector, SIMD(Single Instruction, Multiple Data), ๊ทธ๋ฆฌ๊ณ  GPU๋ฅผ ์†Œ๊ฐœํ•œ๋‹ค.

Vector

๋จผ์ € ๋ฒกํ„ฐ ํ”„๋กœ์„ธ์„œ์˜ ์ •์˜๋ฅผ ์•Œ์•„๋ณด์ž.

In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set where its instructions are designed to operate efficiently and effectively on large one-dimensional arrays of data called vectors. (Wikipedia)

์ฆ‰, ๋ฒกํ„ฐ ํ”„๋กœ์„ธ์„œ๋Š” vector๋ผ๊ณ  ๋ถˆ๋ฆฌ๋Š” 1์ฐจ์› ๋ฐฐ์—ด์— ๋Œ€ํ•œ ์ผ๊ด„์ ์ธ ์—ฐ์‚ฐ์„ ์ง€์›ํ•˜๋Š” ์•„ํ‚คํ…์ฒ˜๋ฅผ ๊ฐ–์ถ”๊ณ  ์žˆ์–ด์•ผ ํ•œ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฒกํ„ฐ ์•„ํ‚คํ…์ฒ˜๋Š” ์–ด๋–ป๊ฒŒ ๊ตฌํ˜„๋˜๊ณ , ์–ด๋–ป๊ฒŒ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š”์ง€ ๊ฐ„๋žตํ•˜๊ฒŒ ์•Œ์•„๋ณธ๋‹ค.

Vector Programming Model

์†Œํ”„ํŠธ์›จ์–ด์˜ ์ž…์žฅ์—์„œ ๋ฐ”๋ผ๋ณธ Vector Programming Model์„ ๋จผ์ € ์•Œ์•„๋ณด์ž. ๋จผ์ € ๋ฒกํ„ฐ ํ”„๋กœ์„ธ์„œ์—๋Š” ๋ฒกํ„ฐ, ์ฆ‰ 1์ฐจ์› ๋ฐฐ์—ด์„ ์ €์žฅํ•˜๊ธฐ ์œ„ํ•œ vector register๊ฐ€ ์žˆ๋‹ค. ๋˜ํ•œ VLR(Vector Length Register)์ด๋ผ๊ณ  ๋ถˆ๋ฆฌ๋Š” ์Šค์นผ๋ผ ๋ ˆ์ง€์Šคํ„ฐ๊ฐ€ ์žˆ์–ด ๋ฒกํ„ฐ์˜ ์ตœ๋Œ€ ๊ธธ์ด๋ฅผ ์ •ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋˜์–ด ์žˆ๋‹ค.

Vector ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์ง€์›ํ•˜๋Š” ๋Œ€ํ‘œ์ ์ธ ์•„ํ‚คํ…์ฒ˜๋กœ๋Š” VMIPS๊ฐ€ ์žˆ๋Š”๋ฐ, ์ด๋ฆ„์—์„œ๋„ ์•Œ ์ˆ˜ ์žˆ๋“ฏ MIPS์—์„œ ํŒŒ์ƒ๋œ ์•„ํ‚คํ…์ฒ˜์ด๋‹ค. VMIPS์—์„œ๋Š” MIPS์˜ ์Šค์นผ๋ผ ์—ฐ์‚ฐ instruction๋ช…์— V์™€ S๋ฅผ ๋ถ™์—ฌ ๋ฒกํ„ฐ ์—ฐ์‚ฐ์ž„์„ ๋ช…์‹œํ•œ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด,

์ด๋ ‡๊ฒŒ ๋ฒกํ„ฐ ์—ฐ์‚ฐ์„ ํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ๋จผ์ € ๋ฒกํ„ฐ ๋ ˆ์ง€์Šคํ„ฐ์— ๊ฐ’๋“ค์ด ์ €์žฅ๋˜์–ด ์žˆ์–ด์•ผ ํ•  ๊ฒƒ์ด๋‹ค. VMIPS์—์„œ๋Š” ๋ฉ”์ธ ๋ฉ”๋ชจ๋ฆฌ๋กœ๋ถ€ํ„ฐ ๋ฒกํ„ฐ ๋ ˆ์ง€์Šคํ„ฐ๋กœ ๊ฐ’์„ ๋ถˆ๋Ÿฌ์˜ค๊ณ  ์ €์žฅํ•˜๊ธฐ ์œ„ํ•ด vector load/store ์—ฐ์‚ฐ๋“ค์„ ์ง€์›ํ•œ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด,

Example: Element-wise Multiplication
๋‹ค์Œ์˜ C ์ฝ”๋“œ๋ฅผ ์Šค์นผ๋ผ์™€ ๋ฒกํ„ฐ๋ฅผ ์‚ฌ์šฉํ•ด ๊ฐ๊ฐ ์ปดํŒŒ์ผํ•œ๋‹ค๊ณ  ํ•ด๋ณด์ž.

for(i = 0; i < 64; i++){
    C[i] = A[i] * B[i];
}
# Scalar Assembly Code

    LI R4, 64
loop:
    L.D F0, 0(R1)
    L.D F2, 0(R2)
    MUL.D F4, F2, F0
    S.D F4, 0(R3)
    DADDIU R1, 8
    DADDIU R2, 8
    DADDIU R3, 8
    DADDIU R4, 1
    DADDIU R1, 8
    BNEZ R4, loop

์Šค์นผ๋ผ ํ”„๋กœ์„ธ์„œ์—์„œ๋Š” ์‹ค์ œ๋กœ i์˜ ๊ฐ’์„ ์ฆ๊ฐ€์‹œ์ผœ ๊ฐ€๋ฉด์„œ &A[i]์™€ &B[i]์— ์ ‘๊ทผํ•ด ์ด๋“ค ๊ฐ’์„ ๋ ˆ์ง€์Šคํ„ฐ๋กœ ๋ถˆ๋Ÿฌ์˜ค๊ณ , ์ด๋ฅผ ๊ณฑํ•œ ํ›„ &C[i]์˜ ์ฃผ์†Œ๋กœ ์ €์žฅํ•˜๋Š” ๊ณผ์ •์„ ์ˆ˜ํ–‰ํ•ด์•ผ ํ•œ๋‹ค. ๋ฐ˜๋ฉด ๋ฒกํ„ฐ๋ฅผ ์‚ฌ์šฉํ•  ๋•Œ๋ฅผ ๋ณด์ž.

# Vector Assembly Code

L1 VLR, 64
LV V1, R1
LV V2, R2
MULVV.D V3, V1, V2
SV V3, R3

๋ฒกํ„ฐ์˜ ๊ฒฝ์šฐ vector length register์— item์˜ ๊ฐœ์ˆ˜์ธ 64๋ฅผ ๋Œ€์ž…ํ•œ ํ›„, A[i]๊ณผ B[i] (i=0, 1, ... 63)๋ฅผ ๊ฐ๊ฐ LV๋ฅผ ์‚ฌ์šฉํ•ด ๋ฒกํ„ฐ ๋ ˆ์ง€์Šคํ„ฐ์— ๋กœ๋”ฉํ•ด์˜จ ํ›„ MULVV.D๋ฅผ ์‚ฌ์šฉํ•ด ํ•œ๋ฒˆ์— ๊ณฑ์…ˆ์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋‹ค.

Vector Arithmetic Execution

๋ฒกํ„ฐ ํ”„๋กœ์„ธ์„œ๋“ค์€ ๋งค์šฐ ๊นŠ์€ pipeline์„ ์‚ฌ์šฉํ•ด vector register์— ์ €์žฅ๋œ element๋“ค์„ ํ•˜๋‚˜์”ฉ ์—ฐ์‚ฐ์„ ํ•œ๋‹ค. pipeline์ด ๊นŠ๋‹ค๋Š” ๊ฒƒ์€ ์—ฐ์‚ฐ ๋กœ์ง์„ ์ž˜๊ฒŒ ์ชผ๊ฐœ ๋†“์•˜๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•˜๋ฏ€๋กœ clock์„ ๋†’๊ฒŒ ์„ค์ •ํ•  ์ˆ˜ ์žˆ๊ณ , ์—ฐ์‚ฐ์„ ๋น ๋ฅด๊ฒŒ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ฒƒ์ด ๊ฐ€๋Šฅํ•˜๋‹ค. ์ด๋•Œ ๋ฒกํ„ฐ์˜ ๊ฐ ์›์†Œ๋“ค์€ ์„œ๋กœ ๋…๋ฆฝ์ ์ด๋ฏ€๋กœ data hazard์ด ๋ฐœ์ƒํ•˜์ง€ ์•Š์•„ bypassing ๋“ฑ์€ ๋ถˆํ•„์š”ํ•˜๋‹ค.

Interleaved Vector Memory System

ํ•œํŽธ, vector๊ฐ€ ํƒ‘์žฌ๋œ system์˜ ๊ฒฝ์šฐ ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์— ๋ณ€๊ฒฝ์ด ํ•„์š”ํ•˜๋‹ค. Vector load ์—ฐ์‚ฐ์„ ํ•  ๊ฒฝ์šฐ ๋ฉ”๋ชจ๋ฆฌ read์˜ ํšŸ์ˆ˜๊ฐ€ ๋„ˆ๋ฌด ๋งŽ๊ธฐ ๋•Œ๋ฌธ์—, memory load ๊ณผ์ •์„ interleavingํ•  ์ˆ˜ ์žˆ์–ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

์ด๋ฅผ ์œ„ํ•ด ๋งŽ์€ ๋ฒกํ„ฐ ์‹œ์Šคํ…œ์€ ๋ฉ”์ธ ๋ฉ”๋ชจ๋ฆฌ์— banking์„ ๋„์ž…ํ•œ๋‹ค. ์˜ˆ์‹œ๋กœ Cray-1์˜ ๋ฉ”์ธ ๋ฉ”๋ชจ๋ฆฌ๋Š” 16๊ฐœ์˜ memory bank๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ์–ด, ๊ฐ๊ฐ์ด ์ „์ฒด ์ฃผ์†Œ ๊ณต๊ฐ„์˜ $\frac{1}{16}$์— ํ•ด๋‹นํ•˜๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ์ €์žฅํ•˜๋Š” ๊ฒƒ์ด๋‹ค. Cray-1์—์„œ ๊ฐ๊ฐ์˜ bank๋Š” busy time (bank๊ฐ€ ๋‹ค์Œ request๋ฅผ ๋ฐ›์„ ์ค€๋น„๊ฐ€ ๋˜๊ธฐ๊นŒ์ง€ ๊ฑธ๋ฆฌ๋Š” ์‹œ๊ฐ„)์ด 4 ์‚ฌ์ดํด, latency๊ฐ€ 16 ์‚ฌ์ดํด์œผ๋กœ ์„ค๊ณ„๋˜์–ด ์žˆ๋‹ค. ์ฆ‰ (๋ฒกํ„ฐ์˜ ์ตœ๋Œ€ ๊ธธ์ด์ธ) 64๊ฐœ์˜ ์›์†Œ๋ฅผ ๋กœ๋”ฉํ•œ๋‹ค๋ฉด, ๊ฐ bank์— ํƒ€๊ฒŸ ์ฃผ์†Œ๊ฐ€ ๊ณ ๋ฅด๊ฒŒ ๋ถ„ํฌํ•œ๋‹ค๋Š” ๊ฐ€์ • ํ•˜์— $4 \times 3 + 16 = 28$ ์‚ฌ์ดํด๋งŒ์— 64๊ฐœ์˜ ์›์†Œ๋ฅผ ๋ฒกํ„ฐ ๋ ˆ์ง€์Šคํ„ฐ์— ๋กœ๋“œํ•  ์ˆ˜ ์žˆ๋‹ค.

Vector Processing Optimization

Example: Pipeline Diagram

for(i = 0; i < 4; i++){
    C[i] = A[i] * B[i];
}

๋ผ๋Š” ์ฝ”๋“œ๊ฐ€ ์ฃผ์–ด์ ธ ์žˆ์„ ๋•Œ, ์ปดํŒŒ์ผ๋œ ์–ด์…ˆ๋ธ”๋ฆฌ๊ฐ€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค๊ณ  ๊ฐ€์ •ํ•˜๊ณ  ๋ฒกํ„ฐ ํ”„๋กœ์„ธ์„œ๊ฐ€ ์ด๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” pipeline diagram์„ ๊ทธ๋ ค๋ณด์ž.

ํŒŒ์ดํ”„๋ผ์ธ์€ ์œ„์™€ ๊ฐ™์€ ๊ตฌ์กฐ๋ผ๊ณ  ๊ฐ€์ •ํ•œ๋‹ค. ์—ฌ๊ธฐ์„œ F, D, R, W์€ ๊ฐ๊ฐ fetch, decode, register access, write-back์ด๊ณ  X, L, S, Y๋Š” ๊ฐ๊ฐ ๋ง์…ˆ, memory load/store, ๊ณฑ์…ˆ์ด๋ผ๊ณ  ๊ฐ€์ •ํ•œ๋‹ค.

LI VLR, 4
LV V1, R1
LV V2, R2
MULVV.D V3, V1, V2
SV V3, R3
LV V2..FDRL0L1W                   
    L0L1W                  
      RL0 L1                  
       RL0 L1W                
MULVV.D FDDDDDDDRY0Y1Y2Y3W          
           RY0Y1Y2Y3         
            Y0Y1Y2Y3W        
             RY0Y1Y2Y3W       
SV  DDDDDDDRS0S1W   
                    RS0S1W  
                     RS0S1W 
                      RS0S1W

๋ฒกํ„ฐ instruction์˜ ์‹คํ–‰์—์„œ chime์ด๋ผ๋Š” ๊ฐœ๋…์„ ์ •์˜ํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด๋Š” instruction์„ ์‹คํ–‰ํ•˜๊ธฐ ์‹œ์ž‘ํ•˜๋Š” ๋ฐ ๊ฑธ๋ฆฌ๋Š” overhead๋ฅผ ๋ชจ๋‘ ์ œ์™ธํ•˜๊ณ , ํ•ด๋‹น instruction์œผ๋กœ vector sequence๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐ ๊ฑธ๋ฆฌ๋Š” ์‹œ๊ฐ„์„ ์˜๋ฏธํ•œ๋‹ค. ์œ„์˜ ๊ฒฝ์šฐ, vector length๊ฐ€ 4์ผ ๋•Œ chime์ด 4๋ผ๊ณ  ๋งํ•  ์ˆ˜ ์žˆ๋‹ค.

diagram์„ ๋ณด๋ฉด, LV์—์„œ 4๊ฐœ์˜ load๊ฐ€ ๋ชจ๋‘ ์™„๋ฃŒ๋  ๋•Œ๊นŒ์ง€ ๋‹ค์Œ ๋‹จ๊ณ„๊ฐ€ ์ง„ํ–‰๋˜์ง€ ์•Š๋Š” ๋น„ํšจ์œจ์ด ๋ฐœ์ƒํ•˜๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

ํŠนํžˆ ์œ„์˜ pipeline diagram์—์„œ๋Š” ๊ฐ cycle์— ํ•˜๋‚˜์˜ functional unit๋งŒ์ด ์‹คํ–‰๋˜์—ˆ๋‹ค. ์ด๋ฅผ ๊ฐœ์„ ํ•˜์—ฌ, superscalar processor์ฒ˜๋Ÿผ ์—ฌ๋Ÿฌ FU๊ฐ€ ๋™์‹œ์— ์‹คํ–‰๋  ์ˆ˜ ์žˆ๋„๋ก ํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด๋ ‡๊ฒŒ vector instruction parallelism์„ ์ตœ๋Œ€ํ•œ exploitํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋“œ์›จ์–ด๋ฅผ ๊ฐœ์„ ํ•ด๋ณด์ž.

Vector Chaining

๋ฒกํ„ฐ ํ”„๋กœ์„ธ์„œ์—์„œ๋„ ๊ธฐ์กด ์Šค์นผ๋ผ ํ”„๋กœ์„ธ์„œ์˜ bypassing๊ณผ ์œ ์‚ฌํ•œ ํ…Œํฌ๋‹‰์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด๋ฅผ Vector Chaining์ด๋ผ ํ•œ๋‹ค.

LV V1
MULVV V3, V1, V2
ADDVV V5, V3, V4

์˜ˆ๋ฅผ ๋“ค์–ด ์œ„์˜ ์ฝ”๋“œ์—์„œ๋Š” LV V1์˜ ๊ฒฐ๊ณผ๋กœ ๋‚˜์˜จ V1 ๋ ˆ์ง€์Šคํ„ฐ์˜ ๊ฐ’์ด ๊ณง๋ฐ”๋กœ MULVV์˜ ์ž…๋ ฅ์œผ๋กœ ๋“ค์–ด๊ฐ€๊ฒŒ ๋˜๊ณ , MULVV์˜ ๊ฒฐ๊ณผ๋กœ ๋‚˜์˜จ V3๋Š” ๋‹ค์‹œ ADDVV์˜ ์ž…๋ ฅ์œผ๋กœ ๋“ค์–ด๊ฐ€๊ฒŒ ๋œ๋‹ค.

์ด๋•Œ chaining(ํ˜น์€ ๊ธฐ์กด ์šฉ์–ด๋กœ bypassing)์€ register file์„ ๊ฑฐ์ณ์„œ ์ผ์–ด๋‚œ๋‹ค. ์ฆ‰, ์Šค์นผ๋ผ ํ”„๋กœ์„ธ์„œ์—์„œ ํ”ํžˆ ๋ณด๋˜ ๊ฒƒ์ฒ˜๋Ÿผ execution stage๊ฐ€ ๋๋‚˜ ๊ฐ’์ด ์ƒ์„ฑ๋˜์ž๋งˆ์ž ๋‹ค์Œ ์‚ฌ์ดํด์— ์ด๋ฅผ ์‚ฌ์šฉํ•ด ๋‹ค๋ฅธ instruction์ด execution์„ ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋˜๋Š” ๊ฒƒ์€ ์•„๋‹ˆ๊ณ , register์— write๋ฅผ ํ•˜๊ณ  ๋‹ค์‹œ read๋ฅผ ํ•˜๋Š” ๊ณผ์ •์ด ํ•„์š”ํ•˜๋‹ค. ๋”ฐ๋ผ์„œ ๋‘ ์‚ฌ์ดํด(์•ž์„  instruction์˜ register write, ์ด์— ์˜์กดํ•˜๋Š” instruction์˜ register read)์˜ ๋‚ญ๋น„๊ฐ€ ๋ฐœ์ƒํ•˜๊ฒŒ ๋œ๋‹ค.

Vector chaining์ด ์ ์šฉ๋œ ๊ฒฝ์šฐ๋ฅผ ๊ฐ€์ •ํ•˜๊ณ  pipeline diagram์„ ๋‹ค์‹œ ๊ทธ๋ ค๋ณด์ž.

LV V2..DRL0L1W             
    L0L1W            
      RL0 L1            
       RL0 L1W          
MULVV.D DDDRY0Y1Y2Y3W       
        Y0 Y1Y2Y3      
         Y0 Y1Y2Y3W     
          Y0 Y1 Y2Y3W     
SV  DDRS0 S1 W   
              RS0 S1 W  
               RS0 S1  
                RS0S1W

์ด๋•Œ MULVV.D์˜ ์ฒซ ์‚ฌ์ดํด์„ ๋ณด๋ฉด LV V2, R2์˜ ์ฒซ ๋ฒˆ์งธ item์ด ๋ ˆ์ง€์Šคํ„ฐ ํŒŒ์ผ์— write๋  ๊นŒ์ง€ ๊ธฐ๋‹ค๋ฆฐ ํ›„ ์ด๋ฅผ ์ฝ์–ด์˜ค๊ธฐ ๋•Œ๋ฌธ์— ์•ž์„œ ์„ค๋ช…ํ•œ ๊ฒƒ๊ณผ ๊ฐ™์ด 2์‚ฌ์ดํด์ด ๋‚ญ๋น„๋˜๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. ๋ ˆ์ง€์Šคํ„ฐ ํŒŒ์ผ์„ ๊ฑฐ์น˜์ง€ ์•Š๊ณ  ์Šค์นผ๋ผ ํ”„๋กœ์„ธ์„œ์—์„œ ํ•˜๋Š” ๊ฒƒ์ฒ˜๋Ÿผ bypassing network๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ์ด ๋‚ญ๋น„๋ฅผ ์ค„์ผ ์ˆ˜ ์žˆ์œผ๋‚˜, ์ด ๊ฒฝ์šฐ ๋ ˆ์ง€์Šคํ„ฐ ํŒŒ์ผ์„ multiported๋กœ ๋ฐ”๊พธ์ง€ ์•Š๋Š” ํ•œ register read์—์„œ structural hazard๊ฐ€ ๋ฐœ์ƒํ•˜๊ฒŒ ๋œ๋‹ค.

๋งŒ์•ฝ ๋ ˆ์ง€์Šคํ„ฐ ํŒŒ์ผ์ด multiported๋ผ vector chaining์ด ๋ ˆ์ง€์Šคํ„ฐ๋ฅผ ๊ฑฐ์น˜์ง€ ์•Š์•„๋„ ๋˜๋Š” ๊ฒฝ์šฐ, pipeline diagram์€ ๋‹ค์Œ๊ณผ ๊ฐ™์„ ๊ฒƒ์ด๋‹ค.

LV V2..DRL0L1W         
    L0L1W        
     RL0L1W       
      RL0L1W      
MULVV.D DRY0Y1Y2Y3W     
      Y0Y1Y2Y3W    
       Y0Y1Y2Y3W   
        Y0Y1Y2Y3W   
SV  DS0S1W   
          RS0S1  
           RS0S1W 
            S0S1W

Vector Instruction Execution with Multiple Lanes

Chime์„ ๊ฐ์†Œ์‹œํ‚ค๊ธฐ ์œ„ํ•ด์„œ๋Š” execution step์˜ ๋ณ‘๋ ฌํ™”๊ฐ€ ํ•„์š”ํ•˜๋‹ค. ์ด๋ฅผ ์œ„ํ•ด์„œ๋Š” ์œ„ ๊ทธ๋ฆผ๊ณผ ๊ฐ™์ด ๋™์ผํ•œ functional unit ๋“ค์ด ๋‘ ๊ฐœ ์ด์ƒ์”ฉ ์žˆ์–ด์•ผ ํ•œ๋‹ค.

์œ„์˜ ๊ทธ๋ฆผ๊ณผ ๊ฐ™์€ 2-way vector stripping์ด ์ ์šฉ๋œ ๊ฒฝ์šฐ, pipeline diagram์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋ฐ”๋€Œ๊ฒŒ ๋œ๋‹ค.

LV V2...DRL0L1W       
     L0L1W      
    RL0L1W       
     L0L1W      
MULVV.D  DY0Y1Y2Y3W   
       RY0Y1Y2Y3W  
      Y0Y1Y2Y3W   
       RY0Y1Y2Y3W  
SV   FFDDDDRS0S1W 
           RS0S1W
          RS0S1W 
           RS0S1W

๋‹ค์ด์–ด๊ทธ๋žจ์„ ๋ณด๋ฉด, ๊ฐ๊ฐ์˜ ๋ช…๋ น์–ด(LV, MULVV.D, SV)๋ฅผ ์ฒ˜๋ฆฌํ•  ๋•Œ ๋‘ ๊ฐœ์”ฉ์˜ operation์ด ์ง์ง€์–ด์„œ ๋™์‹œ์— ์ผ์–ด๋‚œ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด์„œ, ๋‹ค์ด์–ด๊ทธ๋žจ์˜ ์ฒซ ๋‘ ์ค„์€ A[0]์™€ A[2]๊ฐ€ ๋กœ๋”ฉ๋˜๋Š” ๊ฒƒ์œผ๋กœ, ๊ทธ ๋‹ค์Œ ๋‘ ์ค„์€ A[1]๊ณผ A[3]๊ฐ€ ๋กœ๋”ฉ๋˜๋Š” ๊ฒƒ์œผ๋กœ ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด chime์€(VLR=4์ผ ๋•Œ) 4์—์„œ 2๋กœ ๊ฐ์†Œํ•œ๋‹ค. ์ด๊ฒƒ์€ ๋™์ผ functional unit์ด ๋‘ ๊ฐœ์”ฉ ์žˆ๊ธฐ์— ๊ฐ€๋Šฅํ•œ ๊ฒƒ์ด๋‹ค.

Vector Stripmining

์ง€๊ธˆ๊นŒ์ง€๋Š” ๋ฒกํ„ฐ ๋ ˆ์ง€์Šคํ„ฐ์— ๋“ค์–ด๊ฐ€๋Š” ์›์†Œ์˜ ์ˆ˜๊ฐ€ ๋งŽ์ง€ ์•Š์€ ๊ฒฝ์šฐ๋งŒ์„ ๋‹ค๋ฃจ์—ˆ์œผ๋‚˜, ๋ฒกํ„ฐ ๋ ˆ์ง€์Šคํ„ฐ์— ๋“ค์–ด๊ฐˆ ์ˆ˜ ์žˆ๋Š” ์›์†Œ์˜ ๊ฐœ์ˆ˜๋Š” ์œ ํ•œํ•˜๋‹ค. ๋”ฐ๋ผ์„œ ์ด๋ฅผ ๋„˜๋Š” ๊ธธ์ด์˜ array์— ๋Œ€ํ•ด์„œ๋Š” vector instruction์„ ์—ฌ๋Ÿฌ ๋ฒˆ ์ชผ๊ฐœ์–ด ์‹คํ–‰ํ•ด์ฃผ์–ด์•ผ ํ•œ๋‹ค. ์ด๋ฅผ vector stripmining์ด๋ผ๊ณ  ํ•œ๋‹ค.

๊ทธ๋ฆผ์˜ ์ฝ”๋“œ์˜ ๊ฒฝ์šฐ์—๋Š” ์ตœ๋Œ€ ๋ฒกํ„ฐ ๊ธธ์ด๊ฐ€ 64์ธ ์•„ํ‚คํ…์ฒ˜๋กœ, $N\mod 64$๊ฐœ์˜ ์›์†Œ์— ๋Œ€ํ•ด ๋จผ์ € ๊ณ„์‚ฐ์„ ๋งˆ์นœ ํ›„ 64๊ฐœ์”ฉ ๋ฐ˜๋ณตํ•˜์—ฌ ๊ณ„์‚ฐ์„ ํ•˜๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. Vector stripmining์„ ํ•˜๋Š” ๊ฒฝ์šฐ ์ด์ฒ˜๋Ÿผ ๋ฒกํ„ฐ ์—ฐ์‚ฐ ์ž์ฒด๋ฅผ ์ œ์™ธํ•˜๋”๋ผ๋„ ์ถ”๊ฐ€์ ์ธ ์ฒ˜๋ฆฌ๋ฅผ ํ•˜๋Š” ์ฝ”๋“œ์˜ ๋น„์ค‘์ด ๋งŽ์•„์ง€๋ฏ€๋กœ, $N$์ด ์ƒ๋Œ€์ ์œผ๋กœ ์ž‘๋‹ค๋ฉด ์ด๋Ÿฌํ•œ ์ฒ˜๋ฆฌ์˜ overhead์˜ ๋น„์ค‘์ด ์ปค์ ธ ์„ฑ๋Šฅ์ด ๊ธฐ๋Œ€์— ๋ฏธ์น˜์ง€ ๋ชปํ•  ์ˆ˜ ์žˆ๋‹ค.

Vector ISA์˜ ์žฅ์ 

Vector ํ”„๋กœ์„ธ์„œ๋Š” ์ผ๋ฐ˜์ ์ธ ์Šค์นผ๋ผ ํ”„๋กœ์„ธ์„œ์— ๋น„ํ•ด ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์žฅ์ ์„ ๊ฐ€์ง„๋‹ค.

Code Vectorization

์ฝ”๋“œ๋ฅผ ์ปดํŒŒ์ผ ํ•  ๋•Œ, sequentialํ•œ ์ฝ”๋“œ๋ฅผ ์žฌ๋ฐฐ์—ดํ•˜์—ฌ ๋ฒกํ„ฐ๋ฅผ ํ™œ์šฉํ•œ ์ฝ”๋“œ๋ฅผ ๋งŒ๋“œ๋Š” ๊ฒƒ์„ code vectorization(์ฝ”๋“œ ๋ฒกํ„ฐํ™”)์ด๋ผ๊ณ  ํ•œ๋‹ค. ์ด๋Š” ๋ฐ˜๋ณต๋ฌธ์˜ ๊ฐ iteration๊ฐ„ ์˜์กด์„ฑ์— ๋Œ€ํ•œ ๊นŠ์€ ๋ถ„์„์ด ํ•„์š”ํ•˜๋‹ค. ์ด ๋‹จ๋ฝ์—์„œ๋Š” ๋ฐ˜๋ณต๋ฌธ์—์„œ ํ”ํžˆ ๋ฐœ์ƒํ•˜๋Š” ์—ฌ๋Ÿฌ ํŒจํ„ด๋“ค์„ vectorizeํ•˜๋Š” ์—ฌ๋Ÿฌ ์˜ˆ์‹œ๋ฅผ ์†Œ๊ฐœํ•œ๋‹ค.

Conditional Execution

for(i = 0; i < N; i++){
    if(A[i] > 0) A[i] = B[i];
}

์œ„ ์ฝ”๋“œ์˜ ๊ฒฝ์šฐ ๋ฐฐ์—ด์˜ ๋ชจ๋“  ์›์†Œ์— ๋Œ€ํ•ด์„œ ๋™์ผํ•œ ์—ฐ์‚ฐ์ด ์ ์šฉ๋˜์ง€ ์•Š๊ณ  ์žˆ๋‹ค. A[i]๊ฐ€ ์–‘์ˆ˜์ธ ๊ฒฝ์šฐ์—๋งŒ ๋Œ€์ž… ์—ฐ์‚ฐ์„ ์‹คํ–‰ํ•˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ์ด๋Ÿฌํ•œ ๊ฒฝ์šฐ, predicated register์˜ ๋ฒกํ„ฐ ๋ฒ„์ „์ด๋ผ๊ณ ๋„ ํ•  ์ˆ˜ ์žˆ๋Š” vector mask(๋˜๋Š” vector flag)๋ฅผ ์ถ”๊ฐ€ํ•˜์—ฌ ์ฝ”๋“œ๋ฅผ ๋ฒกํ„ฐํ™”ํ•  ์ˆ˜ ์žˆ๋‹ค. Vector mask๋Š” ๊ฐ ์›์†Œ๋‹น 1bit๋ฅผ ์ฐจ์ง€ํ•˜๋„๋ก ๋˜์–ด ์žˆ๋Š” ๋ฐฐ์—ด์œผ๋กœ, ๊ฐ๊ฐ์˜ ์›์†Œ์— ๋Œ€ํ•ด ์—ฐ์‚ฐ์„ ์ ์šฉํ• ์ง€ ๋ง์ง€๋ฅผ ๋ช…์‹œํ•ด์ฃผ๋Š” ์—ญํ• ์„ ํ•œ๋‹ค.

Conditional execution์„ ๊ตฌํ˜„ํ•˜๋Š” ๋ฐฉ๋ฒ•์—๋Š” ํฌ๊ฒŒ ๋‘ ๊ฐ€์ง€๊ฐ€ ์žˆ์„ ๊ฒƒ์ด๋‹ค.

Vector Reductions

Vector reduction์ด๋ž€ ๋ฒกํ„ฐ๋ฅผ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›์•„ ์Šค์นผ๋ผ๋ฅผ ๋ฐ˜ํ™˜ํ•˜๋Š” ์—ฐ์‚ฐ์„ ์˜๋ฏธํ•œ๋‹ค. ๊ฐ„๋‹จํ•œ ์˜ˆ์‹œ๋กœ ๋ฒกํ„ฐ ์ „์ฒด์˜ ํ•ฉ์„ ๊ตฌํ•˜๋Š” ๊ฒƒ์ด ์žˆ์„ ๊ฒƒ์ด๋‹ค.

sum = 0;
for(i = 0; i < N; i++)
    sum += A[i];

์œ„ ์ฝ”๋“œ์˜ ๊ฒฝ์šฐ ํ‘œ๋ฉด์ ์œผ๋กœ๋Š” sum์ด๋ผ๋Š” ๋ณ€์ˆ˜์— inter-loop dependence๊ฐ€ ์žˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ฝ”๋“œ๋ฅผ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋ณ€ํ˜•ํ•˜๋ฉด dependency๋ฅผ ์—†์• ๊ณ  ๋ฒกํ„ฐ ์—ฐ์‚ฐ๋“ค์„ ํšจ์œจ์ ์œผ๋กœ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.

sum[0:VL-1]
for(i = 0; i < N; i+= VL)
    sum[0:VL-1] += A[i:i+VL-1]       // accumulate A[VL * k + i]'s to sum[i]
do{
    VL = VL / 2;
    sum[0:VL-1] += sum[VL:2*VL-1]
} while (VL > 1)

์œ„์ฒ˜๋Ÿผ ์ด์ง„ํŠธ๋ฆฌ์™€ ๊ฐ™์ด ๋ฒกํ„ฐ๊ฐ„์˜ ๋ง์…ˆ์„ ๋ฐ˜๋ณต์ ์œผ๋กœ ์‹คํ–‰ํ•˜๋ฉด $\lceil\lg N\rceil$๋ฒˆ ๋งŒ์— ํ•ฉ์„ ๊ตฌํ•  ์ˆ˜ ์žˆ๋‹ค.

Vector Scatter/Gather

for(i = 0; i < N; i++)
    A[i] = B[i] + C[D[i]];

์œ„ ์ฝ”๋“œ์˜ ๊ฒฝ์šฐ loop์— indirect access๊ฐ€ ์กด์žฌํ•œ๋‹ค. ์ ‘๊ทผํ•ด์•ผ ํ•  ๋ฉ”๋ชจ๋ฆฌ ์ฃผ์†Œ๊ฐ€ ์—ฐ์†์ ์ด๊ฑฐ๋‚˜ strided pattern์œผ๋กœ ์ฃผ์–ด์ง€๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ, ๋‹ค๋ฅธ ๋ฒกํ„ฐ ๋ ˆ์ง€์Šคํ„ฐ์— ์ €์žฅ๋˜์–ด ์žˆ๋Š” ๊ฒƒ์ด๋‹ค.

์ด๋Ÿฌํ•œ ๊ฒฝ์šฐ๋ฅผ ์œ„ํ•ด vector scatter*์™€ vector gather**๋ผ๋Š” instruction์ด ์žˆ๋‹ค. Gather๋Š” ๋ฒกํ„ฐ์— ์ €์žฅ๋˜์–ด ์žˆ๋Š” ์ฃผ์†Œ๋กœ ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•ด loadํ•˜๋Š” ๊ฒƒ์„, scatter๋Š” ๋ฒกํ„ฐ์— ์ €์žฅ๋˜์–ด ์žˆ๋Š” ์ฃผ์†Œ๋กœ ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•ด storeํ•˜๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•œ๋‹ค.

์œ„์˜ ์ฝ”๋“œ๋ฅผ ๋ฒกํ„ฐ ํ”„๋กœ์„ธ์„œ๋กœ ์ปดํŒŒ์ผํ•˜๋ฉด vector scatter/gather๋ฅผ ํ™œ์šฉํ•ด ์•„๋ž˜์™€ ๊ฐ™์€ ์–ด์…ˆ๋ธ”๋ฆฌ ์ฝ”๋“œ๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ๋‹ค.

LV vD, rD
LVI vC, rC, vD
LV vB, rB
ADDV.D vA, vB, vC
SV vA, rA

SIMD

Single Instruction, Multiple Data(SIMD) ๋˜๋Š” Multimedia Extension์€ ๋ฒกํ„ฐ์˜ ์นœ์ฒ™๊ณผ๋„ ๊ฐ™์€ ์•„ํ‚คํ…์ฒ˜๋กœ, ์˜ค๋Š˜๋‚ ์˜ ๋ฐ์Šคํฌํƒ‘ ์ปดํ“จํ„ฐ๊ฐ€ ๋ฒกํ„ฐ ์—ฐ์‚ฐ์„ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐฉ์‹์ด๋‹ค. ์ผ๋ฐ˜์ ์ธ instruction์ด ํ•œ instruction์œผ๋กœ ํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์–ป๋Š” ๋ฐ ๋น„ํ•ด, SIMD instruction๋“ค์€ ์—ฌ๋Ÿฌ ๊ฐœ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ํ•œ๋ฒˆ์— ์ฒ˜๋ฆฌํ•œ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, 64๋น„ํŠธ์งœ๋ฆฌ wide register์— 32๋น„ํŠธ์งœ๋ฆฌ ๋ฐ์ดํ„ฐ 2๊ฐœ๋ฅผ ์ €์žฅํ•˜๋Š” ์‹์ด๋‹ค. ์ด๋Ÿฌํ•œ wide register 2๊ฐœ๋กœ ๋ง์…ˆ์„ ์ˆ˜ํ–‰ํ•˜๋ฉด 32๋น„ํŠธ์งœ๋ฆฌ ๋ฐ์ดํ„ฐ ๋‘ ์Œ์— ๋Œ€ํ•ด ๋™์‹œ์— ๋ง์…ˆ์ด ์ˆ˜ํ–‰๋œ๋‹ค.

SIMD๋Š” x86์—์„œ MMX(MultiMedia eXtension)๊ฐ€ ์ฒ˜์Œ ๋„์ž…๋œ ์ดํ›„, SSE, SSE2, SSE3๋ฅผ ๊ฑฐ์ณ AVX(Advanced Vector eXtension)๊ฐ€ ์ž๋ฆฌ์žก๊ฒŒ ๋˜๋ฉด์„œ ๋Œ€์ค‘์ ์œผ๋กœ ์‚ฌ์šฉ๋˜๊ฒŒ ๋˜์—ˆ๋‹ค. ์—ฌ๊ธฐ์„œ ๊ฐ๊ฐ์˜ ์ฐจ์ด๋Š” ๋ ˆ์ง€์Šคํ„ฐ์˜ ํฌ๊ธฐ์™€ ๊ฐœ์ˆ˜, instruction์˜ ์ข…๋ฅ˜ ๋“ฑ์ด๋‹ค.

SIMD๋ฅผ ์‚ฌ์šฉํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” datapath ์ž์ฒด์— ๋ณ€ํ˜•์ด ํ•„์š”ํ•˜๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด 32๋น„ํŠธ ๊ฐ€์‚ฐ๊ธฐ๋ฅผ 8๋น„ํŠธ ์ •์ˆ˜ 4์Œ์˜ ๋ง์…ˆ์„ ์œ„ํ•ด ์‚ฌ์šฉํ•œ๋‹ค๋ฉด, carry chain์„ ๋Š์–ด์•ผ ํ•œ๋‹ค. ๋ฐ˜๋ฉด bitwise OR, AND์™€ ๊ฐ™์€ ์—ฐ์‚ฐ์˜ ๊ฒฝ์šฐ์—๋Š” ์•„๋ฌด ๋ณ€๊ฒฝ ์—†์ด๋„ ๊ทธ๋ƒฅ ์‚ฌ์šฉ์ด ๊ฐ€๋Šฅํ•  ๊ฒƒ์ด๋‹ค.

SIMD vs Vectors

SIMD์™€ ๋ฒกํ„ฐ์˜ ์ฐจ์ด์ ์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

ํ•œํŽธ, SIMD๋Š” ๋ฒ„์ „์ด ์˜ฌ๋ผ๊ฐ€๋ฉฐ ์ ์  ๋ฒกํ„ฐ์™€ ์œ ์‚ฌํ•˜๊ฒŒ ๋ฐœ์ „ํ•˜๊ณ  ์žˆ๋‹ค. ์ตœ๊ทผ ๋ฒ„์ „์—์„œ๋Š” misaligned memory access๋‚˜ double precision์„ ์ง€์›ํ•˜๋ฉฐ, New AVX(2008๋…„)์˜ ๊ฒฝ์šฐ 1024๋น„ํŠธ๊นŒ์ง€ ํ™•์žฅ์ด ๊ฐ€๋Šฅํ•˜์—ฌ ๋” ๋งŽ์€ ๋ฐ์ดํ„ฐ๋ฅผ ํ•œ๊บผ๋ฒˆ์— ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋„๋ก ์„ค๊ณ„๋˜์–ด ์žˆ๋‹ค.

GPUs

Graphic Processing Unit(GPU)๋Š” ์›๋ž˜๋Š” general computing์ด ์•„๋‹Œ 3D ๊ทธ๋ž˜ํ”ฝ ๋ Œ๋”๋ง์„ ์šฉ๋„๋กœ ํƒ„์ƒํ•œ ํ”„๋กœ์„ธ์„œ์ด๋‹ค. ์ด๋Š” pixel shader ๋“ฑ์˜ ๋ช‡ ๊ฐ€์ง€ ๋ชฉ์ ์œผ๋กœ๋งŒ ์‚ฌ์šฉ๋˜์—ˆ๊ณ , ์ด ๋•Œ๋ฌธ์— ์ดˆ๊ธฐ GPU์˜ ๊ฒฝ์šฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐํ•  ์ˆ˜ ์žˆ๋Š” ์ •๋„๊ฐ€ ๋งค์šฐ ์ œํ•œ์ ์ด์—ˆ๋‹ค. ์ด ๋•Œ๋ฌธ์— ์ผ๋ถ€ ์‚ฌ์šฉ์ž๋“ค์€ GPU์˜ ํผํฌ๋จผ์Šค๋ฅผ ๋‹ค๋ฅธ ์ž‘์—…์—๋„ ํ™œ์šฉํ•˜๊ธฐ ์œ„ํ•ด ๋‹ค๋ฅธ ์ž‘์—…๋“ค์„ pixel shading๊ณผ ๊ฐ™์ด GPU๊ฐ€ ์ง€์›ํ•˜๋Š” ์ž‘์—…๋“ค๋กœ ๋งคํ•‘ํ•œ ํ›„, ๊ทธ ์ถœ๋ ฅ๊ฐ’์„ ๋‹ค์‹œ ์›๋ž˜์˜ ๋„๋ฉ”์ธ์œผ๋กœ ๋งคํ•‘ํ•˜๋Š” ์‹์œผ๋กœ GPU๋ฅผ ์‚ฌ์šฉํ•˜์˜€๋‹ค.

์ดํ›„ NVIDIA์—์„œ CUDA(Compute Unified Device Architecture)๊ฐ€ ์ถœ์‹œ๋˜๋ฉฐ, GPGPU(General-Purpose GPU)๊ฐ€ ํƒ„์ƒํ•˜๊ฒŒ ๋œ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๋งŽ์€ ํ”„๋กœ๊ทธ๋ž˜๋จธ๋“ค์ด GPU์˜ ๊ณ„์‚ฐ ์„ฑ๋Šฅ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์„ general computing์„ ๊ฐ€์†ํ™”์‹œํ‚ค๋Š” ๋ฐ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋˜์—ˆ๋‹ค.

GPGPU์—์„œ๋Š” attached processor model์ด๋ผ๋Š” ๊ฒƒ์„ ๊ฐ€์ •ํ•˜๋Š”๋ฐ, ์ด๋Š” host CPU๊ฐ€ GPGPU์— data-parallel kernel์„ issueํ•˜์—ฌ ์‹คํ–‰์‹œํ‚ค๋„๋ก ํ•˜๋Š” ๋ฐฉ์‹์ด๋‹ค.

CUDA Programming Model

CUDA๋Š” ๊ณ„์‚ฐ์„ thread ๋‹จ์œ„๋กœ ์ˆ˜ํ–‰ํ•œ๋‹ค. ์ด๋•Œ ๊ฐ๊ฐ์˜ thread๋Š” CUDA thread ๋˜๋Š” microthread๋ผ๊ณ  ๋ถˆ๋ฆฌ๋ฉฐ, thread block์ด๋ผ๋Š” ๋‹จ์œ„๋กœ ๋ฌถ์ธ๋‹ค. ๊ฐ๊ฐ์˜ thread๊ฐ€ ์„œ๋กœ ๋…๋ฆฝ์ ์ธ scalar ์—ฐ์‚ฐ์„ ๋ณ‘๋ ฌ์ ์œผ๋กœ ์ˆ˜ํ–‰ํ•˜๋Š” ์‹์ด๋‹ค.

์˜ˆ์‹œ๋กœ DAXPY๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” CUDA ์ฝ”๋“œ๋ฅผ ์‚ดํŽด๋ณด์ž. DAXPY๋ž€ double-precision ax plux y์˜ ์•ฝ์ž๋กœ, ๋‹ค์Œ์˜ C ์ฝ”๋“œ์™€ ๊ฐ™์€ ์ผ์„ ์ˆ˜ํ–‰ํ•˜๋Š” ์ž‘์—…์ด๋‹ค.

void daxpy(int n, double a, double *x, double *y){
    for(int i = 0; i < n; i++){
        y[i] = a * x[i] + y[i];
    }
}

์ด๋ฅผ CUDA๋กœ ๋ณ‘๋ ฌํ™”์‹œํ‚จ ์ฝ”๋“œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

__host__   // Piece run on host processor
int nblocks = (n+255) / 256;    // 256 CUDA threads/block
daxpy<<<nblocks,256>>>(n, 2.0, x, y)

__device__  // Piece run on GPGPU
void daxpy(int n, double a, double *x, double *y){
    int i = blockIdx.x * blockDim.x + threadId.x;
    if (i < n) 
        y[i] = a * x[i] + y[i];
}

CUDA๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ์œ„์™€ ๊ฐ™์ด C์—์„œ ํ™•์žฅํ•œ ์ฝ”๋“œ๋ฅผ ์ด์šฉํ•ด, GPU์—์„œ ์ž‘์—…์„ thread ๋‹จ์œ„๋กœ ๋ณ‘๋ ฌํ™”ํ•˜์—ฌ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ฒƒ์ด ๊ฐ€๋Šฅํ•˜๋‹ค. CUDA์˜ ์ด๋Ÿฌํ•œ ํŠน์„ฑ์„ Single Instruction, Multiple Thread(SIMT)๋ผ๊ณ  ํ•œ๋‹ค. Scalar instruction ํ•˜๋‚˜๋งŒ์œผ๋กœ (CUDA์˜ ๊ฒฝ์šฐ) 32๊ฐœ์˜ thread๋ฅผ ๋ณ‘๋ ฌ๋กœ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

SIMT์˜ ํŠน์ง•์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์„ค๋ช…ํ•  ์ˆ˜ ์žˆ๋‹ค.