Chapter 12 Microprocessor without Interlocked Pipeline Stages (MIPS)

Not to be confused with Millions of instructions per second

Computer Architecture

Only the instruction set is published, so other companies can build competing implementations; the internal architecture is not disclosed.

A quick skim of the MIPS instruction set is enough.

Ford Assembly Line

There can be more than one piece of hardware, in which case several loads of laundry can be accepted at a time.
The work can be pipelined.

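[Figure: pipelined evaluation of (A + B) * (C + D) / E for three successive input sets (A1..E1, A2..E2, A3..E3). While the first set's result (A1 + B1) * (C1 + D1) / E1 leaves the last stage, a LATCH holds (A2 + B2), (C2 + D2) and E2 for the second set, and A3..E3 enter the first stage.]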
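The latch idea in the figure can be sketched in a few lines of Python. This is only an illustration under assumptions: the three-stage split (additions, multiply, divide), the function name `run_pipeline`, and the sample numbers are all made up for the example and do not come from the slides.

```python
def run_pipeline(inputs):
    """Simulate a 3-stage pipeline computing (a + b) * (c + d) / e.

    Assumed stage split: stage 1 does both additions, stage 2 the multiply,
    stage 3 the divide. latch1 and latch2 hold values between stages.
    """
    latch1 = None  # (a + b, c + d, e) produced by stage 1
    latch2 = None  # ((a + b) * (c + d), e) produced by stage 2
    results = []
    feed = list(inputs) + [None, None]  # two empty cycles drain the pipeline
    for cycle, item in enumerate(feed, start=1):
        # Evaluate stages back to front so each stage reads the latch value
        # written in the previous cycle, before that latch is overwritten.
        if latch2 is not None:
            product, e = latch2
            results.append((cycle, product / e))
        latch2 = None if latch1 is None else (latch1[0] * latch1[1], latch1[2])
        latch1 = None if item is None else (
            item[0] + item[1], item[2] + item[3], item[4])
    return results


# Hypothetical input sets (a, b, c, d, e).
sets = [(1, 2, 3, 4, 7), (2, 2, 2, 2, 4), (5, 5, 5, 5, 10)]
for cycle, value in run_pipeline(sets):
    print(f"cycle {cycle}: result {value}")
```

With three input sets the last result appears in cycle 5, whereas running each set through all three steps one after another would take 9 step-times.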

Register reads start in the second half of the cycle.
The data to be written arrives in the first half (but is written when the cycle ends).

IM / IF = instruction memory / fetch
Reg = register (read)
ALU = arithmetic logic unit
DM = data memory
Reg = register (write)

[Figure labels: LATCH; data dependency OK]

This case needs forwarding as well!
Forwarding

Forwarding

The value would have to be forwarded from Reg, not from the LATCH.
This cannot be done.
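To make the forwarding cases concrete, here is a small sketch that classifies a read-after-write dependency. It assumes the textbook five-stage MIPS pipeline (IF, ID, EX, MEM, WB) in which the register file is written and read within the same cycle, ALU results can be forwarded from the pipeline latches, and a load followed immediately by a dependent instruction needs one stall cycle. The `Instr` tuple and `classify_hazard` are hypothetical names for this example, and the slides' figures may show a different case.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str                 # mnemonic, e.g. "add", "sub", "lw"
    dest: Optional[str]     # register written, or None
    srcs: Tuple[str, ...]   # registers read


def classify_hazard(producer: Instr, consumer: Instr, distance: int) -> str:
    """Classify the hazard when `consumer` follows `producer` by `distance` slots."""
    if producer.dest is None or producer.dest not in consumer.srcs:
        return "no dependency"
    if distance >= 3:
        # Assumption: the producer's write-back is visible to the consumer's read.
        return "no hazard: value already in the register file"
    if producer.op == "lw" and distance == 1:
        # The loaded value only exists after MEM, one cycle too late for EX.
        return "stall one cycle, then forward"
    return "forward from pipeline latch"


add = Instr("add", "r1", ("r2", "r3"))
load = Instr("lw", "r1", ("r2",))
use = Instr("sub", "r4", ("r1", "r5"))
print(classify_hazard(add, use, 1))
print(classify_hazard(load, use, 1))
```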

Unlike the previous figure, this one gives no detail about where the value is forwarded from.

Stall, because it is not yet known whether the branch will be taken.
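The cost of such stalls can be estimated with a simple formula: effective CPI = base CPI + (fraction of branches) x (stall cycles per branch). The sketch below just evaluates that formula; the numbers are hypothetical and are not taken from the slides.

```python
def effective_cpi(base_cpi, branch_fraction, stall_cycles):
    """Average cycles per instruction when every branch stalls the pipeline."""
    return base_cpi + branch_fraction * stall_cycles


# Hypothetical workload: 20% branches, 1 stall cycle per branch.
print(effective_cpi(1.0, 0.20, 1))
```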

Out-of-Order Execution
Instructions in a pipeline do not overtake one another; they cannot.
Executing several instructions at the same time inside a single processor (core) is called "superscalar". It needs several sets of hardware for IF, ID, EX, MEM, and WB.
Scoreboard, Tomasulo's algorithm: hardware finds the instructions that do not depend on one another. Any instruction that can execute goes ahead, so instructions can overtake. Hardware like this is very hard to design.
In practice, instructions wait a very long time for cache and I/O, so instructions that come later may overtake them.
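The "whichever instruction can execute goes ahead" idea can be sketched as follows. This is a deliberately simplified model, not the Scoreboard or Tomasulo's algorithm themselves: it assumes unlimited issue width, unique instruction names, latencies of at least 1 cycle, and it ignores write-after-read and write-after-write hazards. The function name and the sample program are hypothetical.

```python
def out_of_order_issue(program):
    """program: list of (name, dest, srcs, latency) in program order.

    Returns {name: cycle in which the instruction starts executing}.
    """
    # For each instruction, find the earlier instructions that produce its sources.
    last_writer = {}
    deps = []
    for i, (_, dest, srcs, _) in enumerate(program):
        deps.append([last_writer[s] for s in srcs if s in last_writer])
        last_writer[dest] = i

    done_at = {}  # instruction index -> cycle when its result is available
    issued = {}
    cycle = 0
    while len(issued) < len(program):
        for i, (name, _, _, latency) in enumerate(program):
            if name in issued:
                continue
            # Issue as soon as every producer has finished, regardless of order.
            if all(d in done_at and done_at[d] <= cycle for d in deps[i]):
                issued[name] = cycle
                done_at[i] = cycle + latency
        cycle += 1
    return issued


# Hypothetical program: a slow load, an add that needs it, an independent mul.
program = [
    ("load", "r1", ("r0",), 3),
    ("add", "r2", ("r1", "r1"), 1),
    ("mul", "r3", ("r4", "r5"), 1),
]
print(out_of_order_issue(program))
```

Here `mul` starts in cycle 0 alongside `load`, while `add` has to wait until cycle 3 for the loaded value, so the later instruction overtakes the earlier one.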

Very Long Instruction Word (VLIW)
The compiler finds the instructions that do not depend on one another so that they execute together. The compiler has to be very smart; the hardware is not complex.
It uses an OS compiled for Itanium.
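What the compiler has to do can be sketched as a greedy, in-order packing of independent instructions into fixed-width bundles. This is a toy model under assumptions: real VLIW compilers also reorder code, and the bundle width, function name, and sample program are hypothetical.

```python
def pack_bundles(program, width=3):
    """program: list of (name, dest, srcs) in program order.

    Returns a list of bundles; instructions inside one bundle do not depend
    on each other and are meant to execute together.
    """
    bundles, current, written = [], [], set()
    for name, dest, srcs in program:
        # Start a new bundle if this instruction reads or rewrites a register
        # produced inside the current bundle, or if the bundle is full.
        depends = any(s in written for s in srcs) or dest in written
        if depends or len(current) == width:
            bundles.append(current)
            current, written = [], set()
        current.append(name)
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles


program = [
    ("i1", "r1", ("r2", "r3")),
    ("i2", "r4", ("r5", "r6")),
    ("i3", "r7", ("r1", "r4")),  # needs the results of i1 and i2
    ("i4", "r8", ("r9",)),
]
print(pack_bundles(program))
```

The schedule is fixed before the program runs, which is why the hardware can stay simple while the compiler carries the burden.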

A high clock rate does not mean good performance
Intel Itanium vs. i7. i7: x86 (32-bit), x64 (64-bit).
Benchmark for integer and floating-point computation; it probably uses only 1 core.
Workloads: apps + games; math + science.
Scores shown: 8.73, 10.42, 10.76, 11.64. They must be normalized by clock rate.
Itanium hits a bottleneck at memory and needs a very large cache.
It cannot find many instructions to execute at the same time, even though the compiler already does the best it can.
Its performance cannot match the i7, so Itanium is no longer being developed; going multicore is the better option.
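Normalizing by clock rate simply means dividing a benchmark score by the clock frequency, which gives the work done per clock. The sketch below uses invented scores and clock rates for two hypothetical processors; it does not reproduce the slide's data.

```python
def score_per_ghz(score, clock_ghz):
    """Benchmark score normalized by clock rate (score per GHz)."""
    return score / clock_ghz


# Hypothetical processors: values are made up for illustration only.
cpus = {"cpu_a": (10.0, 1.6), "cpu_b": (12.0, 3.2)}
for name, (score, ghz) in cpus.items():
    print(f"{name}: raw {score}, per GHz {score_per_ghz(score, ghz):.2f}")
```

In this made-up case `cpu_b` has the higher raw score, but `cpu_a` does more work per clock.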

Why VLIW uses a very large cache (the instructor's own reasoning; not confirmed to be correct)
VLIW cannot trade off cache size, because the compiler has already fixed which instructions must execute. If the cache is too small, performance drops quickly, so a large cache has to be provided.
Superscalar is more flexible because it executes whenever hardware resources are free (which instructions execute together is not fixed in advance). When cache demand grows, it stalls and automatically slows execution down (it stops fetching new instructions).
Put another way, the compiler does not take cache effects into account.
VLIW therefore gets the same performance as superscalar (after normalizing clock rate), but VLIW uses up to 4 times as much cache!

The "16 cores" shown is really the number of hardware threads, that is, logical processors.
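The same distinction shows up in software. As an illustration, Python's standard `os.cpu_count()` reports logical CPUs (hardware threads), not physical cores, so on a machine with multi-threading enabled it can return a larger number than the core count.

```python
import os

# Number of logical CPUs (hardware threads), or None if it cannot be determined.
logical = os.cpu_count()
print(f"logical processors: {logical}")
```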

Grey Scale Image: 0 = Black, 255 = White, 1 – 254 = Grey
B&W Image: 0 = Black, 1 = White (for dot-matrix printers)
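Converting between the two encodings can be sketched with a simple threshold. The cutoff of 128 and the function name are assumptions for this example; the slides do not say how the conversion is done.

```python
def to_black_and_white(gray_rows, threshold=128):
    """Map 0..255 grey levels to 0 (black) or 1 (white).

    threshold is an assumed cutoff: pixels at or above it become white.
    """
    return [[1 if pixel >= threshold else 0 for pixel in row] for row in gray_rows]


image = [[0, 127, 128, 255],
         [200, 30, 90, 140]]
print(to_black_and_white(image))
```

Each pixel is processed independently of the others, so the rows could be divided among several threads.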

Thread n + 1 must not overtake Thread n (tools such as locks are available for this).
In brief: 1 thread means there is 1 main function, 2 threads means there are 2 main functions, and so on. All the threads execute at the same time.

The total elapsed time for everything involved in running the program (sometimes only CPU time is measured).
To benefit from multi-core, the programmer has to put in the effort when writing the program; the CPU does not do it automatically. Applications must be written for multi-core to get better performance. Today, OSes and apps are already written for multi-core. Measured on a notebook with 2 cores.
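The difference between total elapsed time and CPU time can be seen with Python's standard timers: `time.perf_counter()` measures wall-clock time and `time.process_time()` measures CPU time used by the process. The workload below is a made-up example that mixes computation with waiting.

```python
import time


def measure(work):
    """Return (wall-clock seconds, CPU seconds) spent running work()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


def sample_work():
    sum(i * i for i in range(200_000))  # CPU-bound part
    time.sleep(0.1)                     # waiting: uses almost no CPU time


wall, cpu = measure(sample_work)
print(f"wall-clock: {wall:.3f} s, CPU: {cpu:.3f} s")
```

The wall-clock figure includes the 0.1 s of waiting, while the CPU figure mostly reflects the computation alone.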

Billion Transistors on a Chip
2010

The Knights Ferry die, Aubrey Isle. Die size on KNF, at 45 nm, was rumored to be roughly 700 mm². 32 cores.

Multi-threading (in one page)
[Figure: 1 physical processor (core) presented as 2 logical processors (cores). Labels: registers, 1 set; PC, 1 (shown once for each logical processor); "waiting for I/O" gaps on the timelines.]
Hardware helps switch between programs or threads within 1 clock cycle.
2 cores (no multi-thread) give better performance: 2 times faster than 1 core.
1 core (multi-thread, 2 threads): 30% faster than with 1 thread.
Only 2 threads are supported; even if another thread is ready, it cannot run.
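As a quick arithmetic check of the two figures above, the sketch below turns them into run times. It assumes "30% faster" means a speedup factor of 1.3 and uses a made-up baseline of 100 seconds on one single-threaded core.

```python
def run_time(base_seconds, speedup):
    """Run time after applying a speedup factor to a baseline time."""
    return base_seconds / speedup


base = 100.0  # hypothetical baseline: 1 core, 1 thread
print(f"2 cores, no multi-thread: {run_time(base, 2.0):.1f} s")
print(f"1 core, 2 hardware threads: {run_time(base, 1.3):.1f} s")
```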

There are 2 threads, Main() and RunMe(). With 1 core, one thread runs at a time; with 2 cores, both threads can run at the same time.
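The slide's program is not reproduced here, so the following is only a sketch of the same two-thread structure in Python, with `main` and `run_me` standing in for Main() and RunMe(). One caveat: in standard CPython the global interpreter lock keeps two threads from running Python code on two cores at the same moment, so this shows the structure rather than a speedup.

```python
import threading


def run_me(log):
    # Work done by the second thread.
    for i in range(3):
        log.append(f"run_me step {i}")


def main():
    log = []
    worker = threading.Thread(target=run_me, args=(log,))
    worker.start()                 # run_me now runs alongside main
    for i in range(3):
        log.append(f"main step {i}")
    worker.join()                  # wait for the second thread to finish
    return log


log = main()
print(log)
```

The order in which the two threads' entries appear in `log` can differ from run to run.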

Java uses the synchronized keyword.
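For comparison, a lock in Python plays a role similar to Java's synchronized: only one thread at a time may run the protected section. The sketch below is an illustration with hypothetical names (`Counter`, `count_with_threads`), not code from the slides.

```python
import threading


class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time may execute this block.
        with self._lock:
            self.value += 1


def count_with_threads(num_threads, increments):
    counter = Counter()

    def worker():
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


print(count_with_threads(4, 10_000))
```

Because every update happens while holding the lock, no increments are lost and the total is always the number of threads times the number of increments.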
