Computer Graphics

A Classic GPU: The GeForce 6800 Architecture

2020-07-01  Cesium4Unreal

Overview

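Taking NVIDIA's GeForce 6800 as an example, we will take a closer look at the architecture of a modern GPU. Since its founding in 1993, NVIDIA has become one of the largest GPU manufacturers (alongside ATI) and has released important chips such as the GeForce 256 and the GeForce 3. The GeForce 6800 was released in 2004 and belongs to the GeForce 6 series, NVIDIA's sixth generation of graphics chipsets and the fourth to feature programmability (more on that later). The following figure shows a schematic view of the GeForce 6800 and its functional units.

Figure 13: Schematic view of the GeForce 6800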

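You can already see how each functional unit corresponds to a stage of the graphics pipeline. We start with six parallel vertex processors, which receive data from the host (CPU) and perform operations such as transformation and lighting.
Next, the output enters the triangle setup stage, which handles primitive assembly, culling, and clipping, and then the rasterizer, which generates fragments. The GeForce 6800 has an additional z-cull unit that performs early fragment visibility checks based on depth, further improving efficiency. We then move on to the 16 fragment processors, which operate in 4 parallel units and compute the output color of each fragment. The fragment crossbar is a linking element that directs output pixels to any available pixel engine (also called ROP, short for Raster Operator), thereby avoiding pipeline stalls. The 16 pixel engines are the final stage of processing; they perform operations such as alpha blending and depth testing before sending the final pixel to the frame buffer.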

How the GPU Fits into the Overall Computer System

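In modern computer systems, the CPU communicates with the GPU through a graphics connector such as a PCI Express or AGP slot on the motherboard. Because the graphics connector is responsible for transferring all command, texture, and vertex data from the CPU to the GPU, the bus technology has evolved alongside GPUs over the past few years. The original AGP slot ran at 66 MHz and was 32 bits wide, giving a transfer rate of 264 MB/sec. AGP 2x, 4x, and 8x followed, each doubling the available bandwidth, until the PCI Express standard was finally introduced in 2004, with a maximum theoretical bandwidth of 4 GB/sec simultaneously available to and from the GPU. (Your mileage may vary; currently available motherboard chipsets fall somewhat below this limit, at around 3.2 GB/sec or less.) It is important to note the vast difference between the GPU's memory interface bandwidth and the bandwidth in other parts of the system, as shown in the table below.
Available memory bandwidth in different parts of the computer system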

Component                                       Bandwidth
GPU Memory Interface                            35 GB/sec
PCI Express Bus (x16)                           8 GB/sec
CPU Memory Interface (800 MHz Front-Side Bus)   6.4 GB/sec

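The bandwidth available inside the GPU is very high, so algorithms that run on the GPU can take advantage of it to achieve performance improvements.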
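The bus figures above follow from simple arithmetic: clock rate times bus width times transfers per clock. The sketch below reproduces the 264 MB/sec AGP figure and the doubling for AGP 2x, 4x, and 8x, then compares two entries of the table. The function name `bus_bandwidth_mb_s` and the dictionary layout are illustrative choices, not part of any real API.

```python
def bus_bandwidth_mb_s(clock_mhz, width_bits, transfers_per_clock=1):
    """Peak rate in MB/s: clock (MHz) x bytes per transfer x transfers per clock."""
    return clock_mhz * (width_bits // 8) * transfers_per_clock


# Original AGP: 66 MHz, 32 bits wide; each later mode doubles the rate.
agp_rates = {f"AGP {m}x": bus_bandwidth_mb_s(66, 32, m) for m in (1, 2, 4, 8)}

# Values copied from the table, in GB/s.
table_gb_s = {"gpu_memory": 35.0, "pci_express_x16": 8.0, "cpu_memory": 6.4}
gpu_vs_pcie = table_gb_s["gpu_memory"] / table_gb_s["pci_express_x16"]

for name, rate in agp_rates.items():
    print(f"{name}: {rate} MB/s")
print(f"GPU memory is about {gpu_vs_pcie:.1f}x faster than PCI Express x16")
```

The ratio shows why data that already lives in GPU memory is much cheaper to touch than data that has to cross the bus.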

The Overall System Architecture of a PC

In Detail

A more detailed view of the GeForce 6800

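While most parts of the GPU are fixed-function units, the vertex and fragment processors of the GeForce 6800 offer programmability, which was first introduced to the GeForce chipset line with the GeForce 3 (2001).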

Vertex Processor

A vertex processor

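The vertex processors are the programmable units responsible for all vertex transformations and attribute calculations. They operate on 4-dimensional data vectors corresponding to the aforementioned homogeneous coordinates of a vertex, using 32 bits per coordinate (hence the 128 bits of a register). Instructions are 123 bits long and are stored in the instruction RAM.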
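As a rough illustration of the kind of work a vertex processor does, the following sketch multiplies a homogeneous 4-component position by a 4x4 matrix and packs the result into four 32-bit floats, which gives the 128-bit register size mentioned above. The translation matrix and the helper name `transform` are assumptions made for this example; real vertex programs run on the GPU, not in Python.

```python
import struct


def transform(matrix, vertex):
    """Multiply a 4x4 row-major matrix by a 4-component column vector."""
    return tuple(sum(m * v for m, v in zip(row, vertex)) for row in matrix)


# Hypothetical transformation: translate by (2, 3, 4).
translate = (
    (1.0, 0.0, 0.0, 2.0),
    (0.0, 1.0, 0.0, 3.0),
    (0.0, 0.0, 1.0, 4.0),
    (0.0, 0.0, 0.0, 1.0),
)

position = (1.0, 1.0, 1.0, 1.0)  # w = 1 marks a point in homogeneous coordinates
moved = transform(translate, position)

# Four 32-bit floats occupy one 128-bit register.
register = struct.pack("<4f", *moved)
register_bits = len(register) * 8

print(moved, register_bits)
```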
The data path of a vertex processor consists of:

Instruction set

The main instructions include:


Instruction set

Registers

Fragment Processor

A fragment processor

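The GeForce 6800 has 16 fragment processors. They are grouped into 4 larger units, each of which operates on 4 fragments simultaneously (a so-called quad). They can take position, color, depth, fog, and other arbitrary 4-dimensional attributes as input.
The data path consists of: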

Superscalarity

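A fragment processor works with 4-vectors (a vector-oriented instruction set), and sometimes the components of a vector need to be treated separately (for example color and alpha). The fragment processor therefore supports co-issue, which means splitting a vector into two parts and executing a different operation on each part in the same clock. It supports 3-1 and 2-2 splits (2-2 co-issue was not possible before). It also features dual issue, which means executing different operations on the 2 vector math units in the same clock.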
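The co-issue idea can be sketched in a few lines: one 4-vector is split 3-1 or 2-2, and each part gets its own operation in what stands for a single clock. This is only a functional model; the name `co_issue` and the sample operations are hypothetical and say nothing about how the hardware schedules instructions.

```python
def co_issue(vec, split, op_a, op_b):
    """Apply op_a to the first `split` components and op_b to the rest.

    Only the 3-1 and 2-2 splits described in the text are modeled.
    """
    if len(vec) != 4 or split not in (2, 3):
        raise ValueError("expected a 4-vector and a split of 2 or 3")
    head = tuple(op_a(x) for x in vec[:split])
    tail = tuple(op_b(x) for x in vec[split:])
    return head + tail


# 3-1 split: scale the RGB part, invert the alpha part.
rgba = (0.25, 0.5, 0.75, 0.5)
split_3_1 = co_issue(rgba, 3, lambda c: c * 2.0, lambda a: 1.0 - a)

# 2-2 split: two independent 2-component operations.
split_2_2 = co_issue((1.0, 2.0, 3.0, 4.0), 2, lambda x: x + 1.0, lambda x: x * x)

print(split_3_1, split_2_2)
```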

Texture Unit

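The texture unit is a floating-point texture processor that fetches and filters texture data. It is connected to a level 1 texture cache (which stores parts of the textures in use).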
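The text only says that the unit fetches and filters texture data. As one common example of such filtering, the sketch below performs a bilinear lookup on a tiny single-channel texture. Texel-space coordinates, clamp-to-edge addressing, and the name `bilinear_sample` are assumptions of this example rather than details taken from the hardware.

```python
def bilinear_sample(texture, u, v):
    """Bilinearly filter a 2D texture (a list of rows) at texel coordinates (u, v)."""
    height, width = len(texture), len(texture[0])
    # Clamp-to-edge addressing (an assumption of this sketch).
    u = min(max(u, 0.0), width - 1.0)
    v = min(max(v, 0.0), height - 1.0)
    x0, y0 = int(u), int(v)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = u - x0, v - y0
    # Fetch four neighboring texels and blend them by distance.
    top = texture[y0][x0] * (1.0 - fx) + texture[y0][x1] * fx
    bottom = texture[y1][x0] * (1.0 - fx) + texture[y1][x1] * fx
    return top * (1.0 - fy) + bottom * fy


texture = [
    [0.0, 1.0],
    [2.0, 3.0],
]
center = bilinear_sample(texture, 0.5, 0.5)
corner = bilinear_sample(texture, 1.0, 1.0)
print(center, corner)
```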

Shader units 1 and 2

Each shader unit is limited in its capabilities; used together, the two provide full functionality.

Block diagram of Shader Unit 1 and 2
Shader Unit 1

Green: A crossbar which distributes the input coming either from the rasterizer or from the loopback
Red: Interpolators
Yellow: A special function unit (for functions such as Reciprocal, Reciprocal Square Root, etc.)
Cyan: MUL channels
Orange: A unit for texture operations (not the fragment texture unit)

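The shader unit can perform 2 operations per clock: a MUL on a 3-dimensional vector and a special function, a special function and a texture operation, or 2 MULs. The output of the special function unit can be fed into the MUL channels. The texture unit gets its input from the MUL unit and performs LOD (level of detail) calculations before passing the data to the actual fragment texture unit. The fragment texture unit then performs the actual sampling and writes the data into registers for the second shader unit. The shader unit can also simply pass data through.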

Shader Unit 2

Red: A crossbar
Cyan: 4 MUL channels
Gray: 4 ADD channels
Yellow: 1 special function unit

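The crossbar splits the input into 5 channels (4 components; 1 channel stays free). The ADD units are additionally connected to each other, which allows advanced operations such as a dot product in one clock. Again, the shader unit can handle 2 independent operations per cycle, or it can simply pass data through. If no special function is used, the MAD unit can perform up to 2 operations from this list: MUL, ADD, MAD, DP, or any other instruction based on these operations.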
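To see why chained ADD channels make a dot product cheap, the sketch below builds MAD and DP out of per-channel MUL and ADD steps. It models only the arithmetic; the function names are made up for this example, and the pairing of the additions is just one plausible arrangement.

```python
def mul4(a, b):
    """Four MUL channels: component-wise product."""
    return tuple(x * y for x, y in zip(a, b))


def add4(a, b):
    """Four ADD channels: component-wise sum."""
    return tuple(x + y for x, y in zip(a, b))


def mad4(a, b, c):
    """MAD: multiply, then add, per component."""
    return add4(mul4(a, b), c)


def dp4(a, b):
    """DP: four products reduced by linked adders (pairing is an assumption)."""
    p = mul4(a, b)
    return (p[0] + p[1]) + (p[2] + p[3])


normal = (0.0, 1.0, 0.0, 0.0)
light = (0.0, 0.5, 0.5, 0.0)
diffuse = dp4(normal, light)
scaled = mad4((1.0, 2.0, 3.0, 4.0), (2.0, 2.0, 2.0, 2.0), (1.0, 1.0, 1.0, 1.0))
print(diffuse, scaled)
```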

Instruction set
Some notable instructions for the vertex processor include:

image.png

Registers in fragment processor instructions can be modified

image.png

Pixel Engine

A pixel engine

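Last in the pipeline are the 16 pixel engines (raster operators). Each pixel engine is connected to a specific memory partition of the GPU. After the lossless color and depth compression, the depth and color units perform depth, color, and stencil operations before writing the final pixel. When activated, the pixel engines also perform multisample antialiasing.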
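A minimal sketch of two of the operations a pixel engine performs, depth testing and alpha blending, might look as follows. It assumes a less-than depth comparison, standard source-over blending, and a depth write for every surviving fragment; stencil operations, compression, and multisampling are left out, and the name `pixel_engine` is hypothetical.

```python
def pixel_engine(src_rgba, src_depth, dst_rgb, dst_depth):
    """Return the (rgb, depth) stored after a depth test and alpha blend."""
    # Depth test: keep the existing pixel if the fragment is not closer.
    if src_depth >= dst_depth:
        return dst_rgb, dst_depth
    r, g, b, a = src_rgba
    # Source-over blend: src * alpha + dst * (1 - alpha).
    blended = tuple(s * a + d * (1.0 - a) for s, d in zip((r, g, b), dst_rgb))
    return blended, src_depth


background = ((0.0, 0.0, 1.0), 0.8)
visible = pixel_engine((1.0, 0.0, 0.0, 0.5), 0.3, *background)
hidden = pixel_engine((1.0, 0.0, 0.0, 0.5), 0.9, *background)
print(visible, hidden)
```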

Memory

"The memory system is partitioned into up to four independent memory partitions, each with its own dynamic random-access memories (DRAMs). GPUs use standard DRAM modules rather than custom RAM technologies to take advantage of market economies and thereby reduce cost. Having smaller, independent memory partitions allows the memory subsystem to operate efficiently regardless of whether large or small blocks of data are transferred. All rendered surfaces are stored in the DRAMs, while textures and input data can be stored in the DRAMs or in system memory. The four independent memory partitions give the GPU a wide (256 bits), flexible memory subsystem, allowing for streaming of relatively small (32-byte) memory accesses at near the 35 GB/sec physical limit."

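The numbers in the passage above can be related with a short calculation. The bus width and partition count come from the text; the 550 MHz double-data-rate memory clock is an assumption that is not stated in this article and is used here only to show that a figure close to 35 GB/sec is plausible.

```python
BUS_WIDTH_BITS = 256  # from the text
PARTITIONS = 4        # from the text

bits_per_partition = BUS_WIDTH_BITS // PARTITIONS
bytes_per_transfer = BUS_WIDTH_BITS // 8


def peak_bandwidth_gb_s(memory_clock_mhz, transfers_per_clock=2):
    """Peak bandwidth in GB/s for the full bus at the given memory clock."""
    return memory_clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9


# Assumed clock: 550 MHz with two transfers per clock (not given in the text).
assumed_peak = peak_bandwidth_gb_s(550)
print(bits_per_partition, bytes_per_transfer, assumed_peak)
```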

Performance

GPU Features

Fixed-Function Features

Geometry Instancing

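With Shader Model 3.0, the ability to send multiple batches of geometry with a single Direct3D call has been added, greatly reducing driver overhead in these cases. The hardware feature that enables instancing is vertex stream frequency: the ability to read vertex attributes at a frequency of less than once per output vertex, or to loop over a subset of vertices multiple times. Instancing is most useful when the same object is drawn many times at different positions, for example when rendering an army of soldiers or a field of grass.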
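The two stream frequencies can be modeled with a pair of nested loops: the per-instance stream advances once per instance, while the mesh stream is looped over again for every instance. This is a CPU-side model written for illustration only; `draw_instanced` is a hypothetical name and not a Direct3D call.

```python
def draw_instanced(mesh_vertices, instance_offsets):
    """Emit one output vertex per (instance, mesh vertex) pair."""
    output = []
    for offset in instance_offsets:      # read once per instance
        for vertex in mesh_vertices:     # looped over for every instance
            output.append(tuple(v + o for v, o in zip(vertex, offset)))
    return output


# One blade of grass (3 vertices) drawn at three positions with one "call".
blade = [(0.0, 0.0, 0.0), (0.25, 0.0, 0.0), (0.125, 1.0, 0.0)]
offsets = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
field = draw_instanced(blade, offsets)
print(len(field), field[3], field[8])
```

The mesh data is stored once, and only the small per-instance stream grows with the number of copies.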

Early Culling/Clipping

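GeForce 6 Series GPUs are able to cull nonvisible primitives before shading at a higher rate and to clip partially visible primitives at full speed. Previous NVIDIA products would cull nonvisible primitives at primitive-setup rates and clip all partially visible primitives at full speed.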

Rasterization

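Like previous NVIDIA products, GeForce 6 Series GPUs are capable of rendering the following objects:

Multisample antialiasing is also supported, allowing accurate antialiased polygon rendering. Multisample antialiasing supports all rasterization primitives. Multisampling was already supported in previous NVIDIA products, and the 4x multisample mode has been improved on GeForce 6 Series GPUs.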

Z-Cull

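Since the GeForce3, NVIDIA GPUs have had a technology called z-cull, which removes hidden surfaces much faster than conventional rendering. The GeForce 6 Series z-cull unit is the third generation of this technology and increases efficiency for a wider range of cases. In addition, in cases where the stencil is not being updated, early stencil reject can be used to discard rendering early when the stencil test fails.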

Occlusion Query

Occlusion query is the ability to collect statistics on how many fragments passed or failed the depth test and to report the result back to the host CPU. Occlusion query can be used either while rendering objects or with color and z-write masks turned off, returning depth test status for the objects that would have been rendered, without modifying the contents of the frame buffer. This feature has been available since the GeForce3 was introduced.
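The behavior with color and z-write masks turned off can be modeled as a pure counting pass over a depth buffer. The sketch assumes a less-than depth test and a list-of-rows depth buffer; `occlusion_query` is an illustrative name, not a graphics API call.

```python
def occlusion_query(depth_buffer, fragments):
    """Count fragments that would pass the depth test; the buffer is left untouched."""
    passed = 0
    for x, y, depth in fragments:
        if depth < depth_buffer[y][x]:
            passed += 1
    return passed


# The top row is already covered by a near object; the bottom row is empty.
depth_buffer = [
    [0.5, 0.5],
    [1.0, 1.0],
]
# Fragments of a hypothetical bounding box drawn with all writes masked off.
bounding_box = [(0, 0, 0.7), (1, 0, 0.6), (0, 1, 0.7), (1, 1, 0.6)]
visible_fragments = occlusion_query(depth_buffer, bounding_box)
print(visible_fragments)
```

An application could skip drawing the detailed object whenever the count for its bounding box comes back as zero.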

Texturing

Like previous GPUs, GeForce 6 Series GPUs support bilinear, trilinear, and anisotropic filtering on 2D and cube-map textures of various formats. Three-dimensional textures support bilinear, trilinear, and quad-linear filtering, with and without mipmapping. Here are the new texturing features on GeForce 6 Series GPUs:

Shadow Buffer Support

NVIDIA GPUs support shadow buffering directly. The application first renders the scene from the light source into a separate z-buffer. Then during the lighting phase, it fetches the shadow buffer as a projective texture and performs z-compares of the shadow buffer data against a value corresponding to the distance from the light. If the distance passes the test, it's in light; if not, it's in shadow. NVIDIA GPUs have dedicated transistors to perform four z-compares per pixel (on four neighboring z-values) per clock, and to perform bilinear filtering of the pass/fail data. This more advanced variation of percentage-closer filtering saves many shader instructions compared to GPUs that don't have direct shadow buffer support.
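The key point of the passage is the order of operations: the four depth comparisons happen first, and the pass/fail results are filtered afterwards. A software sketch of that idea is shown below. It assumes non-negative texel-space coordinates inside the map and a less-or-equal comparison; the function name `shadow_lookup` is hypothetical.

```python
def shadow_lookup(shadow_map, u, v, distance_to_light):
    """Return a light factor in [0, 1] from four z-compares, bilinearly filtered."""
    height, width = len(shadow_map), len(shadow_map[0])
    x0, y0 = int(u), int(v)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = u - x0, v - y0

    def lit(x, y):
        # One z-compare: 1.0 if the point is not behind the stored occluder depth.
        return 1.0 if distance_to_light <= shadow_map[y][x] else 0.0

    # Filter the pass/fail results, not the depth values themselves.
    top = lit(x0, y0) * (1.0 - fx) + lit(x1, y0) * fx
    bottom = lit(x0, y1) * (1.0 - fx) + lit(x1, y1) * fx
    return top * (1.0 - fy) + bottom * fy


# Left column: a nearby occluder (depth 0.3). Right column: a far one (depth 0.9).
shadow_map = [
    [0.3, 0.9],
    [0.3, 0.9],
]
edge = shadow_lookup(shadow_map, 0.25, 0.0, 0.5)
fully_lit = shadow_lookup(shadow_map, 1.0, 0.0, 0.5)
print(edge, fully_lit)
```

Filtering the comparison results yields a soft transition at the shadow edge, whereas filtering the depths first would produce a single hard pass or fail.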

High-Dynamic-Range Blending Using fp16 Surfaces, Texture Filtering, and Blending(HDR)

GeForce 6 Series GPUs allow for fp16x4 (four components, each represented by a 16-bit float) filtered textures in the pixel shaders; they also allow performing all alpha-blending operations on fp16x4 filtered surfaces. This permits intermediate rendered buffers at a much higher precision and range, enabling high-dynamic-range rendering, motion blur, and many other effects. In addition, it is possible to specify a separate blending function for color and alpha values. (The lowest-end member of the GeForce 6 Series family, the GeForce 6200 TC, does not support floating-point blending or floating-point texture filtering because of its lower memory bandwidth, as well as to save area on the chip.)
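The practical difference between an fp16 surface and a conventional 8-bit one can be shown with a small round-trip experiment. The sketch uses the half-precision format code `e` of Python's `struct` module to emulate 16-bit storage; the helper names and the sample light intensities are assumptions of this example.

```python
import struct


def to_fp16(value):
    """Round a float to the nearest 16-bit float and convert it back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def to_unorm8(value):
    """Clamp to [0, 1] and quantize to 8 bits, as a fixed-point color buffer would."""
    clamped = min(max(value, 0.0), 1.0)
    return round(clamped * 255) / 255


# Additive blending of two lights that together exceed the displayable range.
light_a, light_b = 0.75, 0.75
hdr_sum = to_fp16(to_fp16(light_a) + to_fp16(light_b))
ldr_sum = to_unorm8(to_unorm8(light_a) + to_unorm8(light_b))

# One fp16x4 texel: four components of 16 bits each.
fp16x4_bytes = struct.calcsize("<4e")
print(hdr_sum, ldr_sum, fp16x4_bytes)
```

The fp16 buffer keeps the value above 1.0 for later tone mapping, while the 8-bit buffer has already lost it to clamping.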

Vertex Processor

Increased instruction count. The total instruction count is now 512 static instructions and 65,536 dynamic instructions. The static instruction count represents the number of instructions in a program as it is compiled. The dynamic instruction count represents the number of instructions actually executed. In practice, the dynamic count can be much higher than the static count due to looping and subroutine calls.
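A quick way to see how the two counts diverge is to model a program as some setup instructions followed by a single loop. The limits come from the text; the cost model (no extra instructions for loop control) and the function names are simplifying assumptions.

```python
STATIC_LIMIT = 512       # instructions in the compiled program
DYNAMIC_LIMIT = 65_536   # instructions actually executed


def instruction_counts(setup, loop_body, iterations):
    """Return (static, dynamic) counts for setup code followed by one loop."""
    static = setup + loop_body
    dynamic = setup + loop_body * iterations
    return static, dynamic


def fits(static, dynamic):
    """Check both counts against the limits quoted in the text."""
    return static <= STATIC_LIMIT and dynamic <= DYNAMIC_LIMIT


small = instruction_counts(setup=10, loop_body=20, iterations=100)
large = instruction_counts(setup=10, loop_body=400, iterations=200)
print(small, fits(*small))
print(large, fits(*large))
```

The second program stays under the static limit but exceeds the dynamic one, which is exactly the situation that looping makes possible.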

Fragment Processor

Achieving Optimal Performance

