Other Forms of SIMD Architecture

Slides by O. Mutlu, Carnegie Mellon University

Intel Pentium MMX Operations
One instruction operates on multiple data elements at the same time. This resembles array processing, but with more restrictions. MMX was designed specifically for multimedia workloads.
The opcode of the instruction determines the data type:
8 8-bit bytes
4 16-bit words
2 32-bit doublewords
1 64-bit quadword
The stride is always 1.
Related paper: Peleg and Weiser, “MMX Technology Extension to the Intel Architecture,” IEEE Micro, 1996.
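The packed data types listed on this slide can be illustrated with a small sketch in plain C. It is not MMX code and uses no Intel intrinsics: `packed_add` is a hypothetical helper that treats a 64-bit word as 8, 4, 2, or 1 lanes and adds lane by lane with wraparound, which is roughly what a wraparound packed add does in hardware. The `lane_bits` parameter stands in for the data type that, on MMX, the opcode would select.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not an Intel API): adds two 64-bit words lane by lane.
 * lane_bits plays the role of the data type chosen by the opcode:
 * 8 -> eight bytes, 16 -> four words, 32 -> two doublewords, 64 -> one quadword. */
uint64_t packed_add(uint64_t a, uint64_t b, int lane_bits) {
    assert(lane_bits == 8 || lane_bits == 16 || lane_bits == 32 || lane_bits == 64);
    if (lane_bits == 64) {
        return a + b;                          /* a single quadword lane */
    }
    uint64_t lane_mask = (1ULL << lane_bits) - 1;
    uint64_t result = 0;
    for (int shift = 0; shift < 64; shift += lane_bits) {
        uint64_t sum = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask);
        result |= (sum & lane_mask) << shift;  /* carry out of the lane is dropped */
    }
    return result;
}
```

With 8-bit lanes, adding 1 to a lane holding 0xFF wraps that lane to 0x00 without touching its neighbour; with 16-bit lanes the same inputs produce 0x0100, because the carry now stays inside the wider lane.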

Example of processing with MMX

Graphics Processing Units (GPU)

GPU overview

The thread warp concept and SIMT
A warp is a group of threads that execute the same instruction stream, each on different data. Nvidia calls this execution model SIMT (Single Instruction, Multiple Thread). All threads in a warp run the same kernel.
[Figure: a thread warp made of scalar threads W, X, Y, Z that share a common PC and feed a SIMD pipeline; thread warps 3, 7, and 8 are shown alongside.]
In SIMD, you need to specify the data array, an instruction to apply to the data, and the instruction width. For example, to add two integer arrays of length 16, a SIMD instruction might look like this (the instruction is made up for the demo): add.16 arr1 arr2
SIMT, however, does not care about the instruction width. You could write the example above as arr1[i] + arr2[i] and then launch as many threads as the array has elements. Note that if the array size were, say, 32, SIMD expects you to explicitly issue two such add.16 instructions, whereas this is not the case with SIMT.
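The contrast between a fixed instruction width and a per-thread body can be sketched in plain C. Everything here is illustrative: `add16` is a stand-in for the made-up add.16 instruction, and `simt_launch` is a sequential loop that only imitates launching one thread per element; none of these names come from a real instruction set or API.

```c
#include <assert.h>

enum { SIMD_WIDTH = 16 };

/* Stand-in for the made-up add.16 instruction: always handles exactly 16 elements. */
void add16(int *dst, const int *a, const int *b) {
    for (int lane = 0; lane < SIMD_WIDTH; lane++) {
        dst[lane] = a[lane] + b[lane];
    }
}

/* SIMD style: the programmer splits the arrays into width-sized chunks.
 * Returns how many add16 "instructions" had to be issued. */
int simd_add(int *dst, const int *a, const int *b, int n) {
    assert(n % SIMD_WIDTH == 0);   /* simplifying assumption for this sketch */
    int issued = 0;
    for (int base = 0; base < n; base += SIMD_WIDTH) {
        add16(dst + base, a + base, b + base);
        issued++;
    }
    return issued;
}

/* SIMT style: the programmer writes only the scalar work of one thread. */
void simt_thread(int *dst, const int *a, const int *b, int i) {
    dst[i] = a[i] + b[i];
}

/* Sequential imitation of launching n threads, one per element. */
void simt_launch(int *dst, const int *a, const int *b, int n) {
    for (int i = 0; i < n; i++) {
        simt_thread(dst, a, b, i);
    }
}
```

For arrays of length 32, `simd_add` issues two `add16` calls, while the SIMT version never mentions a width at all; both produce the same result.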

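Viewing loop iterations as threads
for (i=0; i < N; i++) C[i] = A[i] + B[i];
[Figure: two timelines. In the scalar sequential code, iteration 1 performs its load, add, and store before iteration 2 starts. In the vectorized code, each vector instruction (load, add, store) covers iteration 1 and iteration 2 together.]
Slide credit: Krste Asanovic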

Memory access in SIMT
All threads run the same instruction stream but use the thread id as an index to reach different data.
Let’s assume N=16 and blockDim=4, which gives 4 blocks.
[Figure: two arrays with elements numbered up to 15 are added element by element, with the additions split across the 4 blocks.]
Slide credit: Hyesoon Kim
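The index calculation for the N=16, blockDim=4 case can be checked with a short sequential emulation in C. This is only a sketch: `global_thread_id` and `emulate_grid_add` are hypothetical names, and the nested loops stand in for blocks and threads that a GPU would run in parallel.

```c
#include <assert.h>

/* Global index of a thread, following blockDim * blockIdx + threadIdx. */
int global_thread_id(int block_dim, int block_idx, int thread_idx) {
    assert(thread_idx >= 0 && thread_idx < block_dim);
    return block_dim * block_idx + thread_idx;
}

/* Sequentially emulates a grid of blocks adding two arrays.
 * Returns the number of blocks needed to cover n elements. */
int emulate_grid_add(int *c, const int *a, const int *b, int n, int block_dim) {
    int num_blocks = (n + block_dim - 1) / block_dim;   /* round up */
    for (int block_idx = 0; block_idx < num_blocks; block_idx++) {
        for (int thread_idx = 0; thread_idx < block_dim; thread_idx++) {
            int tid = global_thread_id(block_dim, block_idx, thread_idx);
            if (tid < n) {                               /* guard the last block */
                c[tid] = a[tid] + b[tid];
            }
        }
    }
    return num_blocks;
}
```

With n = 16 and block_dim = 4 the emulation uses 4 blocks, and thread 3 of block 2 handles element 4 * 2 + 3 = 11.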

A simple GPU code example

CPU code:
    for (ii = 0; ii < 100; ++ii) {
        C[ii] = A[ii] + B[ii];
    }

CUDA code:
    // there are 100 threads
    __global__ void KernelFunction(…) {
        // each thread derives its own element index from its block and thread ids
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        int varA = aa[tid];
        int varB = bb[tid];
        C[tid] = varA + varB;
    }

Slide credit: Hyesoon Kim

Example of a real-world GPU program
Slide credit: Hyesoon Kim

Hiding latency with thread warps
The approach is the same as fine-grained multithreading. Each thread has one instruction in the pipeline at a time (there is no branch prediction). Latency is hidden by scheduling other warps in between (interleaved warp execution). The register file has room for the threads of the warps, so there is no context switching by the OS. A warp that misses in the D-cache is taken out of the pipeline and set aside.
[Figure: a SIMD pipeline with the stages I-Fetch, Decode, RF, ALU, D-Cache, and Writeback. Thread warps 3, 7, and 8 are available for scheduling. At the D-Cache stage, a warp whose accesses all hit continues to Writeback; on a miss it joins the waiting warps 1, 2, and 6.]
With a large number of shader threads multiplexed on the same execution resources, our architecture employs fine-grained multithreading where individual threads are interleaved by the fetch unit to proactively hide the potential latency of stalls before they occur. As illustrated in the figure, warps are issued fairly in a round-robin queue. When a thread is blocked by a memory request, the shader core simply removes that thread’s warp from the pool of “ready” warps and thereby allows other threads to proceed while the memory system processes its request. With a large number of threads (1024 per shader core) interleaved on the same pipeline, FGMT effectively hides the latency of most memory operations since the pipeline is occupied with instructions from other threads while memory operations complete. FGMT also hides the pipeline latency so that data bypassing logic can potentially be omitted to save area with minimal impact on performance. The dependency check logic design is simplified by restricting each thread to have at most one instruction running in the pipeline at any time.
Slide credit: Tor Aamodt
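The effect of interleaving warps can be illustrated with a toy scheduler in C. This is a deliberately simplified model, not the simulator behind the slide: it assumes every instruction behaves like a memory access with a fixed `latency`, that a warp cannot issue again until that latency has passed, and that ready warps are picked round-robin. The function name and the `MAX_WARPS` limit are made up for the sketch.

```c
#include <assert.h>

enum { MAX_WARPS = 64 };

/* Counts the cycles in which no warp could issue.
 * Assumed model: one issue slot per cycle, round-robin among ready warps,
 * and each issued instruction blocks its warp for `latency` cycles. */
int idle_issue_cycles(int num_warps, int instrs_per_warp, int latency) {
    assert(num_warps > 0 && num_warps <= MAX_WARPS);
    assert(instrs_per_warp >= 0 && latency >= 1);

    int remaining[MAX_WARPS];   /* instructions left per warp */
    int ready_at[MAX_WARPS];    /* first cycle at which the warp may issue again */
    for (int w = 0; w < num_warps; w++) {
        remaining[w] = instrs_per_warp;
        ready_at[w] = 0;
    }

    int left = num_warps * instrs_per_warp;
    int last = num_warps - 1;   /* so that the first search starts at warp 0 */
    int idle = 0;
    for (int cycle = 0; left > 0; cycle++) {
        int issued = 0;
        for (int k = 1; k <= num_warps && !issued; k++) {
            int w = (last + k) % num_warps;
            if (remaining[w] > 0 && ready_at[w] <= cycle) {
                remaining[w]--;
                left--;
                ready_at[w] = cycle + latency;   /* warp leaves the ready pool */
                last = w;
                issued = 1;
            }
        }
        if (!issued) {
            idle++;                              /* nothing ready: pipeline bubble */
        }
    }
    return idle;
}
```

Under these assumptions, a latency of 4 cycles and 4 instructions per warp leaves 9 idle cycles with a single warp, 6 with two warps, and none with four warps, since there is always another ready warp to issue from.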

Warp-based SIMD versus traditional SIMD
Traditional SIMD has a single thread that runs in lock step. It is fairly hard to program: the programmer has to know how to use the control registers (such as VLEN) and the details of the pipeline.
Warp-based SIMD has multiple threads, each running the same instruction stream, and they do not have to run in lock step. It is fairly easy to program: the code looks like a single-threaded program, and the compiler and the GPU hardware turn it into multiple threads.

Handling branches in warp-based SIMD
A warp has 4 threads that run the same code, blocks A through G of the control flow graph on the slide, but each thread takes a different path through it.
[Figure: a control flow graph with blocks A, B, C, D, E, F, G, next to a thread warp of threads 1 to 4 with a common PC.]
Slide credit: Tor Aamodt

Handling branches in GPUs
Branch divergence occurs when threads in the same warp follow different execution paths (even though they share the same instruction stream).
[Figure: a branch that splits into Path A and Path B.]
Slide credit: Tor Aamodt

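Handling branch divergence (I)
[Figure: a control flow graph with blocks A, B, C, D, E, F, G, executed by a thread warp of threads 1 to 4 with a common PC, together with the per-warp stack.]
Each stack entry holds a Reconv. PC, a Next PC, and an Active Mask. Stack contents over time, with the top of stack (TOS) listed last:
    at A:             (-, A, 1111)
    at B:             (-, B, 1111)
    after the branch: (-, E, 1111), (E, D, 0110), (E, C, 1001)
    after C:          (-, E, 1111), (E, D, 0110)
    after D:          (-, E, 1111)
    at G:             (-, G, 1111)
Blocks and active masks: A/1111, B/1111, C/1001, D/0110, E/1111, G/1111.
Execution order over time: A, B, C, D, E, G, A.
Slide credit: Tor Aamodt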

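Handling branch divergence (II)
    if (some condition) {
        B;
    } else {
        C;
    }
    D;
There is one control flow stack per warp. Each entry holds a Next PC, a Recv PC, and an Amask. Stack contents, with the top of stack (TOS) listed last:
    before the branch: (A, --, 1111)
    after the branch:  (D, --, 1111), (B, D, 1110), (C, D, 0001)
[Figure: the control flow graph A, B, C, D and the execution sequence over time, A, C, B, D.]
Slide credit: Tor Aamodt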
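The per-warp control flow stack for this if/else can be sketched as a tiny C simulation. It is a simplified model under stated assumptions: the warp has 4 threads, thread 1 is the most significant mask bit, each of B and C is a single block that falls through to the reconvergence point D, and the not-taken path is pushed last so that it runs first. The struct and function names are made up for the sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    char next_pc;     /* block to execute next */
    char reconv_pc;   /* block where the paths rejoin ('-' if none) */
    unsigned mask;    /* active mask, thread 1 in the most significant of 4 bits */
} StackEntry;

/* Appends "X/mmmm" (block and active mask) to the trace string. */
static void append_step(char *trace, size_t size, char block, unsigned mask) {
    size_t len = strlen(trace);
    assert(len + 8 <= size);   /* room for " X/mmmm" plus the terminator */
    snprintf(trace + len, size - len, "%s%c/%u%u%u%u", len ? " " : "", block,
             (mask >> 3) & 1u, (mask >> 2) & 1u, (mask >> 1) & 1u, mask & 1u);
}

/* Runs A; if (cond) B else C; D for one warp. taken_mask marks the threads
 * whose condition is true. Returns the number of issue steps used. */
int run_if_else(unsigned taken_mask, char *trace, size_t size) {
    const unsigned full = 0xFu;
    unsigned taken = taken_mask & full;
    unsigned not_taken = full & ~taken;
    StackEntry stack[3];
    int top = 0;
    int steps = 0;

    assert(size > 0);
    trace[0] = '\0';
    append_step(trace, size, 'A', full);   /* A runs with every thread active */
    steps++;

    stack[top++] = (StackEntry){'D', '-', full};   /* reconvergence entry */
    if (taken)     stack[top++] = (StackEntry){'B', 'D', taken};
    if (not_taken) stack[top++] = (StackEntry){'C', 'D', not_taken};

    while (top > 0) {                       /* execute from the top of the stack */
        StackEntry entry = stack[--top];
        append_step(trace, size, entry.next_pc, entry.mask);
        steps++;
    }
    return steps;
}
```

With `taken_mask` set to binary 1110 the trace reads `A/1111 C/0001 B/1110 D/1111` and takes 4 steps; when all four threads agree (mask 1111) no divergence entry is pushed for C and the warp needs only 3 steps.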

Merging warps (I)
Main idea: merge threads that have diverged at a branch, forming new warps. This improves utilization of the SIMD pipeline and shortens execution time (that is, it reduces the number of execution cycles).

Merging warps (II)
[Figure: branches that split into Path A and Path B.]
Related paper: Fung et al., “Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow,” MICRO 2007

Example of merging warps
Active masks of warps x and y in each basic block:
    A: x/1111  y/1111
    B: x/1110  y/0011
    C: x/1000  y/0010
    D: x/0110  y/0001
    E: x/1110  y/0011
    F: x/0001  y/1100
    G: x/1111  y/1111
Legend: execution of Warp x at Basic Block A; execution of Warp y at Basic Block A; a new warp created from scalar threads of both Warp x and y executing at Basic Block D.
Baseline, over time: A A B B C C D D E E F F G G A A
Dynamic warp formation, over time: A A B B C D E E F G G A A
Slide credit: Tor Aamodt
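The slot counts in this example can be reproduced with a few lines of C. The sketch assumes that two partial warps may be merged only when their active masks use disjoint lanes, that is, each thread stays in its own SIMD lane; that restriction is an assumption of this model, and the function names are made up.

```c
#include <assert.h>

/* Returns the combined mask of two partial warps, or 0 when some lane is
 * active in both (assumed rule: a thread cannot move to another lane). */
unsigned merged_mask(unsigned mask_x, unsigned mask_y) {
    if (mask_x & mask_y) {
        return 0;
    }
    return mask_x | mask_y;
}

/* Counts issue slots over a sequence of basic blocks for two 4-thread warps.
 * x[i] and y[i] are the active masks of warps x and y in block i. */
int total_issue_slots(const unsigned *x, const unsigned *y, int n, int allow_merge) {
    int slots = 0;
    for (int i = 0; i < n; i++) {
        assert((x[i] | y[i]) <= 0xFu);              /* 4-thread warps only */
        int needed = (x[i] != 0) + (y[i] != 0);     /* slots without merging */
        if (allow_merge && needed == 2 && merged_mask(x[i], y[i]) != 0) {
            needed = 1;                             /* both fit in one new warp */
        }
        slots += needed;
    }
    return slots;
}
```

Fed with the masks for blocks A to G, this gives 14 slots without merging and 11 with merging: blocks C, D, and F each collapse into a single warp, while B and E cannot because both warps have an active thread in the same lane.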

NVIDIA GeForce GTX 285
NVIDIA's marketing figures: 240 stream processors, “SIMT execution”.
The actual technical details: 30 cores, with 8 SIMD functional units in each core.
Slide credit: Kayvon Fatahalian

NVIDIA GeForce GTX 285 “core”
64 KB of storage for fragment contexts (registers).
[Figure: block diagram of one core. The legend identifies the SIMD functional units (control shared across 8 units), instruction stream decode, multiply-add units, multiply units, and execution context storage.]
30 * 32 * 32 = 30 * 1024 = 30K fragments
64KB register file = 16 32-bit registers per thread = 64B (1/32 that of LRB)
16KB of shared scratch
80KB / core available to software
Slide credit: Kayvon Fatahalian

NVIDIA GeForce GTX 285 “core”
64 KB of storage for thread contexts (registers).
To get maximal latency hiding: run 1/32 of the time, with 16 words per thread = 64B.
A warp has 32 threads, and 32 warps can be scheduled at the same time, which gives up to 1024 thread contexts at any given moment.
Slide credit: Kayvon Fatahalian

NVIDIA GeForce GTX 285
In total, the 30 cores can hold 30,720 threads.
[Figure: the full chip, drawn as an array of cores with texture (Tex) units.]
If you’re running a CUDA program and you’re not launching 30K threads, you are certainly not getting full latency hiding, and you might not be using the GPU well.
Slide credit: Kayvon Fatahalian
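The GTX 285 figures quoted on these slides follow from a few multiplications, which the C sketch below spells out. The `GpuConfig` struct and the function names are made up for illustration; the input numbers (30 cores, 8 SIMD functional units per core, 32 warps of 32 threads per core, a 64 KB register file) are the ones given on the slides.

```c
#include <assert.h>

/* Hypothetical description of a GPU, filled in from the slide figures. */
typedef struct {
    int cores;
    int simd_units_per_core;
    int warps_per_core;
    int threads_per_warp;
    int register_file_bytes;   /* per core */
} GpuConfig;

/* The marketing "stream processor" count: SIMD functional units on the chip. */
int stream_processors(GpuConfig g) {
    return g.cores * g.simd_units_per_core;
}

/* Thread contexts that one core can keep resident at once. */
int thread_contexts_per_core(GpuConfig g) {
    return g.warps_per_core * g.threads_per_warp;
}

/* Thread contexts across the whole chip. */
int total_thread_contexts(GpuConfig g) {
    return g.cores * thread_contexts_per_core(g);
}

/* 32-bit registers available to each thread when the core is fully occupied. */
int registers_per_thread(GpuConfig g) {
    int contexts = thread_contexts_per_core(g);
    assert(contexts > 0);
    return g.register_file_bytes / contexts / 4;   /* 4 bytes per 32-bit register */
}
```

Initializing the struct with `{30, 8, 32, 32, 64 * 1024}` yields 240 stream processors, 1024 thread contexts per core, 30,720 thread contexts in total, and 16 registers (64 bytes) per thread.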