程序员深度学习·神经网络·计算机视觉

NVIDIA CUDA Learning Note 1

2018-02-08  本文已影响0人  戬杨Jason

1)CPU architecture

What is CPU?

What is instruction

For example:
arithmetic:add r3,r4 > r4
visit and save:load[r4] > r7
control:jz end

Optimize Objective:

cycles for instruction * seconds/cycle

CPI(clock cycle per instruction) & clock cycle
The two factors are not independent, some time the increase of CPI can cause the decrease of the number of instruction.

Desktop Programs

Lightly threaded
Lots of branches
Lots of memory accesses
Most desktop program deals with data transfer instead of numeric computation.

Moore's Law

The complexity for minimum component costs has increased at a rate of roughly a factor of two per year.
What do we do with our transistor budget?


image.png

8 Core processor contains 2,2 billion transistor, most part of the cpu is about I/O and saving instead of computation.

Pipelining

Several steps involved in executing an instruction:
Fetch -> Decode -> Execute -> Memory -> Writeback
This process can be separate to different parts of pipeline


image.png

Pros

Cons

Bypassing

image.png

If two instructions are dependent, for example, the ADD instruction has to wait for SUB instruction to finish pipeline and return R7, bypassing can pass R7 to latter instruction without waiting.

Stalls

image.png

If load is not finished, the pipeline must stop to wait.

Branch

image.png

Branch Prediction

Guess what instruction comes next
Based off branch history
Example: two-level predictor with global history

Modern predictors > 90%

Pros:
Raise performance and energy efficiency
Cons;
Area increase
Potential fetch stage latency increase

Predication

Replace branches with conditional instructions
Avoids branch predictor

GPU also use prediction

Increase IPC

Superscalar

Peak IPC is N (for N-way superscalar)


image.png

Scheduling

xor r1,r2 -> r3
add r3,r4 -> r4

sub r5,r2 ->r3
addi r3,1->r1

xor and add : Read-After-Write,RAW
sub and addi: RAW
xor and sub: WAW

Register Renaming

xor r1,r2 -> r6
add r6,r4 -> r7

sub r5,r2 ->r8
addi r8,1->r9
xor and sub can parallel compute

Out-of-Order(OoO) Execution

Reordering the order
Fetch -> Decode -> Rename -> Dispatch -> Issue ->
Register-Read - > Execute -> Memory -> Writeback ->
Commit

Reorder Buffer
Issue Queue/Scheduler

Pros:
IPC near to the ideal state
Cons:
Area increase
Power cost

Modern Desktop/ Mobile In-order CPUs
Modern Desktop/Mobile OoO CPUs

Memory Hierarchy

image.png

Caching

Put the data in a position as close as possible。

Cpu parallel

Vectors Motivation

for(int i = 0;i<N;i++)
A[i] = B[i] + c[i]

Single instruction multiple Data
//in parallel
A[i] = B[i] + c[i]
A[i+!] = B[i+!] + c[i+!]
A[i+2] = B[i+2] + c[i+2]
A[i+3] = B[i+3] + c[i+3]
A[i+4] = B[i+4] + c[i+4]

X86 Vector Motivation

Thread-Level Parallelism

Programmers can destroy and create.
Programmers or OS can dispatch.

Multicore

Locks, Coherence and Consistency

Power Wall

The increase of the main frequency of CPU leads to the increase of power consumption, so that the density can not be increased unrestricted.

CPU provides optimization for series program

上一篇 下一篇

猜你喜欢

热点阅读