Game Engine Architecture 6

2023年5月31日上午5:26 • 技术杂谈 • 阅读 75

【 Game Engine Architecture 6】

1、Data-Parallel Computations

A GPU is a specialized coprocessor designed specifically to accelerate those computations that involve a high degree of data parallelism. It does so by combining SIMD parallelism (vectorized ALUs) with MIMD parallelism (by employing a form of preemptive multithreading). NVIDIA coined the term single instruction multiple thread (SIMT) to describe this SIMD/MIMD hybrid design, all GPUs employ the general principles of SIMT parallelism in their designs.

In order for a computational task to be well-suited to execution _on a GPU,_the computations performed on any one element of the dataset must be independentof the results of computations on other elements.

void DotArrays_ref(unsigned count,
                    float r[],
                    const float a[],
                    const float b[])
{
    for (unsigned i = 0; i < count; ++i)
    {
        // treat each block of four floats as a
        // single four-element vector
        const unsigned j = i * 4;
        r[i] = a[j+0]*b[j+0] // ax*bx
                + a[j+1]*b[j+1] // ay*by
                + a[j+2]*b[j+2] // az*bz
                + a[j+3]*b[j+3]; // aw*bw
    }
}

View Code

Upper code, the computation performed by each iteration of this loop is independent of the computations performed by the other iterations.

2、Compute Kernels

GPGPU compute kernels are typically written in a special shading language which can be compiled into machine codethat’s understandable by the GPU. Some examples of shading languages include DirectX’s HLSL (high-level shader language), OpenGL’s GLSL, NVIDIA’s Cg and CUDA C languages, and OpenCL.

Some shading languages require you to move your kernel code into special source files, separate from your C++ application code. OpenCL and CUDA C, however, are extensions to the C++ language itself.

OpenCL、CUDA C 扩展了 C++，使得可以在C++文件中编写 kernel函数。

As a concrete example, here’s our DotKernel() function written in CUDA C:

__global__ void DotKernel_CUDA(int count,
                                float* r,
                                const float* a,
                                const float* b)
{
    // CUDA provides a magic "thread index" to each
    // invocation of the kernel -- this serves as
    // our loop index i
    size_t i = threadIdx.x;

    // make sure the index is valid
    if (i < count)
    {
        // treat each block of four floats as a
        // single four-element vector
        const unsigned j = i * 4;
        r[i] = a[j+0]*b[j+0] // ax*bx
        + a[j+1]*b[j+1] // ay*by
        + a[j+2]*b[j+2] // az*bz
        + a[j+3]*b[j+3]; // aw*bw
    }
}

View Code

You’ll notice that the loop index i is taken from a variable called threadIdx within the kernel function itself. The thread index is a “magic” input provided by the compiler, in much the same way that the this pointer “magically” points to the current instance within a C++ class member function.

3、Executing a Kernel

here’s how we’d kick off our compute kernel in CUDA C:

void DotArrays_gpgpu2(unsigned count, float r[], const float a[], const float b[]) { // allocate "managed" buffers that are visible // to both CPU and GPU int cr, ca, *cb; cudaMallocManaged(&cr, count * sizeof(float)); cudaMallocManaged(&ca, count * sizeof(float) * 4); cudaMallocManaged(&cb, count * sizeof(float) * 4);

</span><span>//</span><span> transfer the data into GPU-visible memory</span>
memcpy(ca, a, count * <span>sizeof</span>(<span>float</span>) * <span>4</span><span>);
memcpy(cb, b, count </span>* <span>sizeof</span>(<span>float</span>) * <span>4</span><span>);

</span><span>//</span><span> run the kernel on the GPU</span>
DotKernel_CUDA &lt;&lt;&lt;<span>1</span>, count&gt;&gt;&gt;<span> (cr, ca, cb, count);

</span><span>//</span><span> wait for GPU to finish</span>

cudaDeviceSynchronize();

</span><span>//</span><span> return results and clean up</span>
memcpy(r, cr, count * <span>sizeof</span>(<span>float</span><span>));

cudaFree(cr);
cudaFree(ca);
cudaFree(cb);

}

View Code

cudaDeviceSynchronize()：// wait for gpu to finish

cudaMallocManaged()：//allocate “managed” buffers that are visible to both CPU and GPU

cudaFree()：// free memory from cudaMallocManaged

The first argument G in the angled brackets tells the driver how many thread groups(known as thread blocks in NVIDIA terminology) to use when running this compute kernel. A compute job with a single thread group is constrained to run on a single compute uniton the GPU. A compute unit is esssentially a core within the GPU—a hardware component that is capable of executing an instruction stream. Passing a larger number for G allows the driver to divide the workload up across more than one compute unit.

4、GPU Threads and Thread Groups

_a thread group is comprised of some number of GPU threads._a GPU is comprised of multiple compute units, each of which contains some number of SIMD units.

GPU 由 compute unit 组成，每个 compute unit 包含一些 SIMD units。compute unit 可以想象成 cpu core，拥有一些vpu(SIMD)。

For example， SIMDs may 16 lanes wide.

The CU is not capable of speculative or out-of-order execution—it simply reads a stream of instructions and executes them one by one, using the SIMD units to apply each instruction to 16 elements of input data in parallel.

To execute a compute kernel on a CU, the driver first subdivides the input databuffers into blocks consisting of 64 elements each. For each of these 64-element blocks of data, the compute kernel is invoked on one CU. Such an invocation is called a wavefront(also known as a warp in NVIDIA speak). When executing a wavefront, the CU fetches and decodes the instructions of the kernel one by one. Each instruction is applied to 16 data elements in lock step using a SIMD unit. Internally, the SIMD unit consists of a four-stage pipeline, so it takes four clock cycles to complete. So rather than allow the stages of this pipeline to sit idle for three clock cycles out of every four, the CU applies the same instruction three more times, to three more blocks of 16 data elements. This is why a wavefront consists of 64 data elements, even though the SIMDs in the CU are only 16 lanes wide.

the term “GPU thread” actually refers to a single SIMD lane. You can think of a GPU thread, therefore, as a single iteration of the original nonvectorized loop that we started with, before we converted its body into a compute kernel. Alternatively, you can think of a GPU thread as a single invocation of the kernel function, operating on a single input datum and producing a single output datum.

a GPU thread 可想象成迭代次数或 kernel 调用次数。

5、From SIMD to SIMT

SIMT a form of preemptive multithreading to time-slice between wavefronts. any one instruction in the program might end up having to access memory, and that introduces a large stall while the SIMD unit waits for the memory controller to respond.

To eliminate the cost of saving registers during context switches, each SIMD unit contains a very large register file. The number of physical registers in this register file is many times larger than the number of logical registers available to any one wavefront (typically on the order of ten timeslarger).

This means that the contents of the logical registers for up to ten wavefronts can be maintained at all times in these physical registers. And that in turn implies that context switches can be performed between wavefronts without saving or restoring any registers whatsoever.

每个 SIMD 的 register file 拥有十倍于 logic register 的 physical register，所以 SIMD 在切 context 时，根本不需要 saving or restoring registers。

6、Cylindrical coordinates

This system employs a vertical “height” axis h, a radial axis r emanating out from the vertical, and a yaw angle theta (q). In cylindrical coordinates, a point P is represented by the triple of numbers (Ph, Pr, Pq).

7、Left-Handed versus Right-Handed Coordinate Systems

It is easy to convert from left-handed to right-handed coordinates and vice versa. We simply flip the direction of any one axis, leaving the other two axes alone.

8、Dot Product and Projection

• the dot product (a.k.a. scalar product or inner product), and
• the cross product (a.k.a. vector product or outer product).

The dot product is commutative(i.e., the order of the two vectors can be reversed) and distributiveover addition:

For any two arbitrary vectors a and b, game programmers often use the following tests, as shown in Figure 5.11:

• Collinear. (a b) = |a| |b|= ab (i.e., the angle between them is exactly 0 degrees—this dot product equals +1 when a and b are unit vectors).

• Collinear but opposite. (a b) = -ab (i.e., the angle between them is 180 degrees—this dot product equals +1 when a and b are unit vectors).

• Perpendicular. (a b) = 0 (i.e., the angle between them is 90 degrees).

• Same direction. (a b) > 0 (i.e., the angle between them is less than 90 degrees).

• Opposite directions. (a b) < 0 (i.e., the angle between them is greater than 90 degrees).

点到平面的高度：

9、Cross Product

When converting from a right-handed system to a left-handed system or vice versa, the numerical representations of all the points and vectors stay the same, but one axis flips.

左右手坐标系下，cross product 的数值相同，只是表现上会轴相反。

Properties of the Cross Product：

The Cartesian basis vectors are related by cross products as follows:

10、 Pseudovectorsand Exterior Algebra ，伪矢量， 外代数

cross productdoesn’t actually produce _a vector—it produces a special kind of mathematical object known as a pseudovector._The surface normalof a triangle (which is calculated using a cross product) is also a pseudovector

exterior algebra or Grassman algebra, which describe how vectors and pseudovectors work and allow us to calculate areas of parallelograms (in 2D), volumes of parallelepipeds (in 3D), and so on in higher dimensions.

Grassman algebra is to introduce a special kind of vector product known as the wedge product, denoted A^ B. A pairwise wedge product yields a pseudovector and is equivalent to a cross product, which also represents the signed area of the parallelogram formed by the two vectors.

_Doing two wedge products in a row, A ^ B ^ C,_is equivalent to the scalar triple product A (B xC) and produces another _strange mathematical object known as a pseudoscalar._which can be interpreted as the signed volume of the parallelepiped

formed by the three vectors.

11、Linear Interpolation of Points and Vectors

12、Matrices

An affine matrix is a 4×4 transformation matrix that preserves parallelism of lines and relative distance ratios, but not necessarily absolute lengths and angles. An affine matrix is any combination of the following operations: rotation,

translation, scale and/or shear.

13、Matrix Multiplication

Matrix multiplication is not commutative. In other words, the order in which matrix multiplication is done matters:

14、Representing Points and Vectors as Matrices

15、Matrix Inversion

The inverse of a matrix A is another matrix (denoted A^-1) that undoesthe effects of matrix A.

all affine matrices(combinations of pure rotations, translations, scales and shears) do have inverses.

16、Transposition

the inverse of an orthonormal (pure rotation) matrix is exactly equal to its transpose.

17、Homogeneous Coordinates

When a point or vector is extendedfrom three dimensions to four in this manner, we say that it has been written in homogeneouscoordinates. A point in homogeneous coordinates always has w = 1.

18、Transforming Direction Vectors

homogeneous (four-dimensional) coordinates can be converted into non-homogeneous (three-dimensional) coordinates by dividing the x, y and z components by the w component:

19、 Atomic Transformation Matrices

20、Translation

To invert a pure translation matrix, simply negate the vector t (i.e., negate tx, ty and tz).

21、Rotation

上面三个 xyz 旋转，对角线都是 cos，cos，1，另一个对角线是 sin，-sin，旋转方向后面轴的 sin为正，旋转方向前面轴的sin为-。

22、Scale

Scale 的逆矩阵不能通过 transpose 来求，因为 scale后，每一个轴向量不再是 unit vector，不满足 orthonormal 性质，所以无法通过 transpose 来求逆矩阵。

• When a uniform scale matrix Su and a rotation matrix R are concatenated, the orderof multiplication is unimportant (i.e., SuR = RSu). This only works for uniform scale!

23、4×3 Matrices

As such, game programmers often omit the fourth column to save memory. You’ll encounter 4 x 3 affine matrices frequently in game.

24、 Coordinate Spaces

We said above that a point is a vector whose tail is fixed to the origin of some coordinate system. This is another way of saying that a point (position vector) is always expressed relative to a set of coordinate axes.

25、Model Space

• pitch is rotation about L or R,
• yaw is rotation about U, and
• roll is rotation about F.

26、Building a Change of Basis Matrix

• iC is the unit basis vector along the child space x-axis, expressed in _ parent-space coordinates;_

• jC is the unit basis vector along the child space y-axis, in parent space;

• kC is the unit basis vector along the child space z-axis, in parent space; and

• tC is the translation of the child coordinate system relative to parent space.

27、Transforming Normal Vectors

Special care must be taken when transforming a normal vector to ensure that both its lengthand perpendicularityproperties are maintained.

In general, if a point or (non-normal) vector can be rotated from space A to space B via the 3×3 matrixMA->B, then a normal vector n will be transformed from space A to space B via the inverse transpose of that matrix.

if the matrix MA->B contains only _ uniform scale and no shear, then the angles between all surfaces and vectors in_ space B will be the same as they were in space A.

28、Quaternions

a matrix is not always an ideal representation of a rotation, for a number of reasons:

1） We need nine floating-point values to represent a rotation, which seems excessiveconsidering that we only have three degrees of freedom— pitch, yaw and roll.

费空间。

2）Rotating a vector requires a vector-matrix multiplication, which involves three dot products, or a total of nine multiplications and six additions. We would like to find a rotational representation that is less expensive to calculate, if possible

费CPU，计算量大。

3） to find rotations that are some percentage of the way between two known rotations is difficult.

不便插值。

29、Unit Quaternions as 3D Rotations

So the unit quaternion q can be written as：

a is a unit vector along the axis of rotation, and q is the angle of rotation.

30、Quaternion Multiplication

Given two quaternions p and q representing two rotations P and Q, respectively, the product pq represents the composite rotation (i.e., rotation Q followed by rotation P).

四元数乘积 PQ 的含意为，先应用P，再应用Q。

31、Conjugate and Inverse.

Conjugate:

negate the vector part but leave the scalar part unchanged.

Inverse quaternion：

32、 Conjugate and Inverse of a Product

33、Rotating Vectors with Quaternions

Therefore, the rotated vector v′ can be found as follows:

when quaternion is unit, we can use conjugate：

34、Quaternion Concatenation

35、Quaternion-Matrix Equivalence

If we let q = [qV qS] = [qVx qVy qVz qS] = [ x y z w ] , then we can find R as follows:

36、Rotational Linear Interpolation

The resultant interpolated quaternion had to be renormalized.

37、Spherical Linear Interpolation

LERP 的 beta 平滑插值，但并不代表 quaternion 平滑插值。实际上，beta 的平滑插值，会导致 quaternion 两头变化慢，中间变化快的问题。
SLERP 可以解决上面的问题。

38、To SLERP or Not to SLERP

Jonathan Blow wrote a great article positing that SLERP is too expensive, and LERP’s quality is not really that bad—therefore, he suggests, we should understand SLERP but avoid it in our game engines.

On the other hand, some of my colleagues at Naughty Dog have found that a good SLERP implementation performs nearly as well as LERP. (For example, on the PS3’s SPUs, Naughty Dog’s Ice team’s implementation of SLERP takes 20 cycles per joint, while its LERP implementation takes 16.25 cycles per joint.)

39、Euler Angles

Euler angles are prone to a condition known as gimbal lock.

Another problem with Euler angles is that the order in which the rotations are performed around each axis matters. The order could be PYR, YPR, RYP and so on, and each ordering may produce a different composite rotation. So the rotation angles [ qY qP qR ] do not uniquely define a particular rotation—you need to know the rotation order to interpret these numbers properly

40、3 x 3 Matrices

It does not suffer from gimbal lock.

However, rotation matrices are not particularly intuitive.

41、Axis + Angle

The benefits of the axis+angle representation are that it is reasonably intuitive and also compact.

42、Quaternions

it permits rotations to be concatenated and applied directly to points and vectors via quaternion multiplication.

43、

44、

Original: https://www.cnblogs.com/tekkaman/p/10559507.html
Author: Tekkaman
Title: Game Engine Architecture 6

原创文章受到原创版权保护。转载请注明出处：https://www.johngo689.com/549010/

转载文章受原作者版权保护。转载请注明原作者出处！

技术杂谈

【自取】最近整理的，有需要可以领取学习：

Linux核心资料大放送~

全栈面试题汇总（持续更新&可下载）

一个提高学习100%效率的工具！

【超详细】深度学习面试题目！

LeetCode Python刷题答案下载！

LeetCode Java版刷题答案下载！

LeetCode C++ 版本，抓紧保存！

LeetCode GO语言刷题答案下载！

Deepin 15.4 挂载分区

硬盘是 500G，先前安装 Windows7 时，分了3个区，大小分别为 90G、100G、310G，现在用 Deepin 重装系统时，挂载情况如下： /dev/sda1 分区作为…

技术杂谈 2023年5月30日
0093
MySQL的FIND_IN_SET函数

语法：FIND_IN_SET(str,strlist) https://www.cnblogs.com/xiaoxi/p/5889486.html Original: https:…

技术杂谈 2023年7月25日
0079
CentOS 压缩解压

打包：将多个文件合成一个总的文件，这个总的文件通常称为 “归档”。压缩：将一个大文件通过某些压缩算法变成一个小文件。 1.1、tar 压缩格式： tar …

技术杂谈 2023年7月10日
0061
我的企业微信

服务项目技术咨询微信图书微信视频微信代码定制开发其他福利服务入口 QQ群有问必答查看详情;一本书解决90%问题查看详情微信开发视频小程序开发视频免费代码￥188…

技术杂谈 2023年5月31日
0071
TOGAF有哪些学习要点？

TOG已为学员定义了详细的学习要点，你可以下载大纲（英文）阅读，或者获得完全依照学习要点编写的认证授课讲义。 http://ea.zhoujingen.cn/5041.h tml …

技术杂谈 2023年5月31日
00109
js中常用的语法

一、注释二、输出输出有三种：三、变量概述：变量是在内存中生成一个空间用来存储数据。 1.声明变量 var age； 2.同时声明多个变量，使用逗号隔开。 Original:…

技术杂谈 2023年5月31日
00111
iOS 使用第三方字体

在iOS的项目开发中经常遇到需要使用一些自定义的字体文件，比如 仿宋_GB2312、 华康&…

技术杂谈 2023年5月30日
0063
【转】SVG动态图标是如何实现的

这篇文章给大家分享的内容是SVG动态图标是如何实现的，有需要的朋友可以参考一下。在 loading.io 上能看到好多效果惊艳的loading图标。它们都是用svg写成的，寥寥几…

技术杂谈 2023年6月1日
0077
WINDOWS下JMETER压测端口占用问题

2 压测环境配置压测发起设备：windows 10网络：局域网压测软件：Jmeter 5.0压测系统内存：Ubuntu 4核8GPHP 版本：php7.1.23Nginx 版本：…

技术杂谈 2023年5月31日
0087
Sentinel 系统规则

https://github.com/alibaba/Sentinel/wiki/%E7%B3%BB%E7%BB%9F%E8%87%AA%E9%80%82%E5%BA%94%E9%…

技术杂谈 2023年5月31日
0088
8086汇编 jmp 指令

8086 汇编 jmp 指令功能：修改CS、IP的指令、可以通过改变CS、IP中的内容，来控制CPU要执行的目标指令。背景：一般情况下指令是顺序地逐条执行的，而在实际中，常需要…

技术杂谈 2023年6月1日
0086
解决 IdentityServer 授权与登录分离的问题

园子 open api (api.cnblogs.com) 的授权服务器(authorization server，oauth.cnblogs.com)基于 IdentyServe…

技术杂谈 2023年5月31日
00104
find 命令常用解释

背景色是：orange #### find命令 find * path: 所有搜索的目录以及其所有子目录。默认为当前目录 * expression: 所有搜索的文件的特征 * cm…

技术杂谈 2023年7月10日
0076
59_Scala 中的 Queue

1. mutable.Queue API object ScalaQueue extends App { //创建 mutable 队列 var queue: mutable.Qu…

技术杂谈 2023年7月10日
0078
Sololearn All In One

Sololearn All In One web skills playground https://www.sololearn.com/profile/3476348 Learn…

技术杂谈 2023年6月1日
00108
Java数据类型

Java数据类型 Java是一种强类型语言，这意味着必须为每一个变量声明一种类型。 Java的数据类型可分为两大类：基本数据类型（primitive data type）和引用数据…

技术杂谈 2023年7月11日
0068

2024 年 5 月
一	二	三	四	五	六	日
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Game Engine Architecture 6

大家都在看