Game Engine Architecture 5

Game Engine Architecture 5

1、Memory Ordering Semantics

These mysterious and vexing problems can only occur on a multicore machine with a multilevel cache.

Game Engine Architecture 5

A cache coherency protocolis a communication mechanism that permits cores to share data between their local L1 caches in this way. Most CPUs use either the MESI or MOESI protocol.

2、The MESI Protocol

• Modified. This cache line has been modified (written to) locally.

• Exclusive. The main RAM memory block corresponding to this cache line exists only in this core’s L1 cache—no other core has a copy of it.

• Shared. The main RAM memory block corresponding to this cache line exists in more than one core’s L1 cache, and all cores have an identical copy of it.

• Invalid. This cache line no longer contains valid data—the next read will need to obtain the line either from another core’s L1 cache, or from main RAM.

The MOESI protocol adds another state named Owned, which allows cores to share modified data without writing it back to main RAM first.

Under the MESI protocol, all cores’ L1 caches are connected via a special bus called the interconnect bus (ICB).

2.1、How MESI Can Go Wrong

As with compiler optimizations and CPU out-of-order execution optimizations, MESI optimizations are carefully crafted so as to be undetectable by a single thread.

和 compiler 优化、CPU OOO优化一样,MESI精心设计以使单线程程序无法察觉。

Under certain circumstances, optimizations in the MESI protocol can cause the new value of g_ready to become visible to other cores within the cache coherency domain before the updated value of g_data becomes visible.

MESI的优化,可能导致某值先写的值,更晚传递到其他Cache,从而导致其他Core的线程提前运行,而读取到错误数据的问题。

3、Memory Fences / memory barriers

To prevent the memory effects of a read or write instruction passing other reads and/or writes.

There are four ways in which one instruction can pass another:

  1. A read can pass another read,
  2. a read can pass a write,
  3. a write can pass another write, or
  4. a write can pass a read.

We could imagine a CPU that provides twelvedistinct fence instructions—a bidirectional, forward, and reversevariant of each of the four basic fence types listed above.

All fence instructions have two very useful side-effects:

1)They serve as compiler barriers, and

_2)_they prevent the CPU’s out-of-order logic from reordering instructions across the fence.

4、Acquire and Release Semantics

Memory ordering semantics are really properties of read or write instructions:

• Release semantics. This semantic guarantees that a write to shared memory can never be passed by any other read or write that precedesit in program order.

_write 前面的 rw 保证在前面, write 后面的 rw 不管。
_

• Acquire semantics. This semantic guarantees that a read from shared memory can never be passed by any other read or write that occurs after it in program order.

read 后面的 rw 保证在后面,read 前面的 rw不管。

• Full fence semantics.

5、When to Use Acquire and Release Semantics

A write-release is most often used in a producer scenario,performs two consecutive writes (e.g., writing to g_data and then g_ready), and we need to ensure that all other threads will see the two writes in the correct _order._We can enforce this ordering by making the second of these two writes a write-release.

A read-acquire is typically used in a consumer scenario—in which a thread performs two consecutive reads in which the second is conditional on the first (e.g., only reading g_data after a read of the flag g_ready comes back true). We enforce this ordering by making sure that the first read is a read-acquire.

Game Engine Architecture 5Game Engine Architecture 5
int32_t g_data = 0;
int32_t g_ready = 0;
void ProducerThread() // running on Core 1
{
    g_data = 42;

    // make the write to g_ready into a write-release
    // by placing a release fence *before* it
    RELEASE_FENCE();

    g_ready = 1;
}
void ConsumerThread() // running on Core 2
{
    // make the read of g_ready into a read-acquire
    // by placing an acquire fence *after* it
    while (!g_ready)
        PAUSE();

    ACQUIRE_FENCE();

    // we can now read g_data safely...

    ASSERT(g_data == 42);
}

View Code

6、CPU Memory Models

the DEC Alpha has notoriously weak semantics by default, requiring careful fencing in almost all situations. At the other end of the spectrum, an Intel x86 CPU has quite strong memory ordering semantics by default.

7、Fence Instructions on Real CPUs

The Intel x86 ISA specifies three fence instructions: sfenceprovides release semantics, lfenceprovides acquire semantics, and mfenceacts as a full fence.

Certain x86 instructions may also be prefixed by a lock modifierto make them behave atomically, and to provide a memory fence prior to execution of the instruction.

The x86 ISA is strongly ordered by default, meaning that fences aren’t actually required in many cases where they would be on CPUs with weaker default ordering semantics.

8、Atomic Variables

C++11, the class template std::atomic

the std::atomic family of templates provides its variables with ” full fence” memory ordering semantics by default (although weaker semantics can be specified if desired). This frees us to write lock-free code withouthaving to worry about any of the three causes of data race bugs.

Game Engine Architecture 5Game Engine Architecture 5
std::atomic<float> g_data;
std::atomic_flag g_ready = false;

void ProducerThread(){ // produce some data g_data = 42;
</span><span>//</span><span> inform the consumer</span>
g_ready = <span>true</span><span>;

}

void ConsumerThread()
{
// wait for the data to be ready
while (!g_ready)
PAUSE();
</span><span>//</span><span> consume the data</span>
ASSERT(g_data == <span>42</span><span>);

}

View Code

atomic<> 变量,不仅仅保证了原子,也保证了 compiler、cpu、memory order 导致的bug。

the implementations of std::atomic

atomic<> 根据 target platform 的不同,而被编译成不同的 ML instruction。

atomic

your target hardware.

9、C++ Memory Order

By default, C++ atomic variables make use of full memory barriers, to ensure that they will work correctly in all possible use cases. However, it is possible to relax these guarantees, by passing a memory order semantic.

1) Relaxed. guarantees only atomicity. No barrier or fence is used.

_2)Consume._no other read or write within the same thread can be reordered before this read. In other words, this semantic only serves to prevent compiler optimizations and out-of-order execution from reordering the instructions—it does nothing to ensure any particular memory ordering semantics within the cache coherency domain.

_类似于 acquire,但不负责 memory-order 问题。(acquire负责)
_

_3)Release._A write performed with release semantics guarantees that no other read or write in this thread can be reordered after it, and the write is guaranteed to be visible by other threads reading the same address. Itemploys a release fence in the CPU’s cache coherency domain to accomplish this.

等同于 Release semantics。

4)Acquire.

完全等同于 acquire。

_5)Acquire/Release. full memory fence.

_

Using these memory order specifiers requires switching fromstd:: atomic’s overloaded assignmentand cast operators to explicit calls to store() and load().

Game Engine Architecture 5Game Engine Architecture 5
std::atomic<float> g_data;
std::atomic<bool> g_ready = false;

void ProducerThread(){ // produce some data g_data.store(42, std::memory_order_relaxed);
</span><span>//</span><span> inform the consumer</span>
g_ready.store(<span>true</span><span>, std::memory_order_release);

}

void ConsumerThread()
{
// wait for the data to be ready
while (!g_ready.load(std::memory_order_acquire))
PAUSE();
</span><span>//</span><span> consume the data</span>
ASSERT(g_data.load(std::memory_order_relaxed) == <span>42</span><span>);

}

View Code

10、Concurrency in Interpreted Programming Languages

C++ compile or assemble down to raw machine code that is executed directly by the CPU. As such, atomic operations and locks must be implemented with the aid of special machine language instructions that provide atomic operations and cache coherent memory barriers, with additional help from the kernel (which makes sure threads are put to sleep and awoken appropriately) and the compiler (which respects barrier instructionswhen optimizing the code).

In C/C++ a volatile variable is not atomic. However, in Java and C#, the volatile type qualifier does guarantee atomicity.

11、Basic Spin Lock

std::memory_order_acquire:保证后面的读不会发生在本次读的前面。

std::memory_order_release:保证所有前面的写都已完成。

Game Engine Architecture 5Game Engine Architecture 5
class SpinLock
{
    std::atomic_flag m_atomic;
    public:
    SpinLock() : m_atomic(false) { }

    bool TryAcquire()
    {
        // use an acquire fence to ensure all subsequent
        // reads by this thread will be valid
        bool alreadyLocked = m_atomic.test_and_set(std::memory_order_acquire);
        return !alreadyLocked;
    }

    void Acquire()
    {
        // spin until successful acquire
        while (!TryAcquire())
        {
            // reduce power consumption on Intel CPUs
            // (can substitute with std::this_thread::yield()
            // on platforms that don't support CPU pause, if
            // thread contention is expected to be high)
            PAUSE();
        }
    }

    void Release()
    {
        // use release semantics to ensure that all prior
        // writes have been fully committed before we unlock
        m_atomic.clear(std::memory_order_release);
    }
};

View Code

12、Scoped Locks

类似智能指针。自动给 lock 加锁、解锁。

13、Reentrant Locks

We can relax this reentrancy restriction if we can arrange for our spin lock class to cache the id of the thread that has locked it.

Game Engine Architecture 5Game Engine Architecture 5
class ReentrantLock32
{
    std::atomic m_atomic;
    std::int32_t m_refCount;

public:
    ReentrantLock32() : m_atomic(0), m_refCount(0) { }

    void Acquire()
    {
        std::hash hasher;
        std::size_t tid = hasher(std::this_thread::get_id());

        // if this thread doesn't already hold the lock...

        if (m_atomic.load(std::memory_order_relaxed) != tid)
        {
            // ... spin wait until we do hold it
            std::size_t unlockValue = 0;
            while (!m_atomic.compare_exchange_weak(
                                unlockValue,
                                tid,
                                std::memory_order_relaxed, // fence below!

                                std::memory_order_relaxed))
            {
                unlockValue = 0;
                PAUSE();
            }
        }

        // increment reference count so we can verify that
        // Acquire() and Release() are called in pairs
        ++m_refCount;

        // use an acquire fence to ensure all subsequent
        // reads by this thread will be valid
        std::atomic_thread_fence(std::memory_order_acquire);
    }

    void Release()
    {
        // use release semantics to ensure that all prior
        // writes have been fully committed before we unlock
        std::atomic_thread_fence(std::memory_order_release);

        std::hash hasher;
        std::size_t tid = hasher(std::this_thread::get_id());
        std::size_t actual = m_atomic.load(std::memory_order_relaxed);
        assert(actual == tid);

        --m_refCount;

        if (m_refCount == 0)
        {
            // release lock, which is safe because we own it
            m_atomic.store(0, std::memory_order_relaxed);
        }
    }

    bool TryAcquire()
    {
        std::hash hasher;
        std::size_t tid = hasher(std::this_thread::get_id());
        bool acquired = false;
        if (m_atomic.load(std::memory_order_relaxed) == tid)
        {
            acquired = true;
        }
        else
        {
            std::size_t unlockValue = 0;
            acquired = m_atomic.compare_exchange_strong(
            unlockValue,
            tid,
            std::memory_order_relaxed, // fence below!

            std::memory_order_relaxed);
        }
        if (acquired)
        {
            ++m_refCount;
            std::atomic_thread_fence(std::memory_order_acquire);
        }

        return acquired;
    }

};

View Code

14、Readers-Writer Locks

Whenever a writer thread attempts to acquire the lock, it should wait until all readers are finished.

we’ll store a reference count indicating how many readerscurrently hold the lock.

A readers-writer lock suffers from starvationproblems: A writer that holds the lock too long can cause all of the readers to starve, and likewise a lot of readers can cause the writers to starve.

15、Lock-Not-Needed Assertions

通过 Assert 来 Check 多个线程是否会互相 overlap。

Game Engine Architecture 5Game Engine Architecture 5
class UnnecessaryLock
{
    volatile bool m_locked;

public:
    void Acquire()
    {
        // assert no one already has the lock
        assert(!m_locked);

        // now lock (so we can detect overlapping
        // critical operations if they happen)
        m_locked = true;
    }

    void Release()
    {
        // assert correct usage (that Release()
        // is only called after Acquire())
        assert(m_locked);

        // unlock
        m_locked = false;
    }

};

#if ASSERTIONS_ENABLED
#define BEGIN_ASSERT_LOCK_NOT_NECESSARY(L) (L).Acquire()
#define END_ASSERT_LOCK_NOT_NECESSARY(L) (L).Release()
#else
#define BEGIN_ASSERT_LOCK_NOT_NECESSARY(L)
#define END_ASSERT_LOCK_NOT_NECESSARY(L)
#endif

// Example usage...

UnnecessaryLock g_lock;
void EveryCriticalOperation()
{
    BEGIN_ASSERT_LOCK_NOT_NECESSARY(g_lock);
    printf("perform critical op...\n");
    END_ASSERT_LOCK_NOT_NECESSARY(g_lock);
}

View Code

We could also wrap the locks in a janitor.

Game Engine Architecture 5Game Engine Architecture 5
class UnnecessaryLockJanitor
{
    UnnecessaryLock* m_pLock;
public:
    explicit
    UnnecessaryLockJanitor(UnnecessaryLock& lock)
    : m_pLock(&lock) { m_pLock->Acquire(); }

    ~UnnecessaryLockJanitor() { m_pLock->Release(); }
};

#if ASSERTIONS_ENABLED
    #define ASSERT_LOCK_NOT_NECESSARY(J,L) \
    UnnecessaryLockJanitor J(L)
#else
    #define ASSERT_LOCK_NOT_NECESSARY(J,L)
#endif

// Example usage...

UnnecessaryLock g_lock;
void EveryCriticalOperation()
{
    ASSERT_LOCK_NOT_NECESSARY(janitor, g_lock);
    printf("perform critical op...\n");
}

View Code

We implemented this at Naughty Dog and it successfully caught a number of cases of critical operations overlapping when the programmers had assumed they never could do so.

16、Lock-Free Transactions

To perform a critical operation in a lock-free manner, we need to think of each such operation as a transaction that can either succeed in its entirety, or fail in its entirety. If it fails, the transaction is simply retried until it does succeed.

To implement any transaction, no matter how complex, we perform the majority of the work locally(i.e., using data that’s visible only to the current thread, rather than operating directly on the shared data). When all of our ducks are in a row and the transaction is ready to commit, we execute a single atomic instruction, such as CAS or LL/SC. If this atomic instruction succeeds, we have successfully “published” our transaction globally—it becomes a permanent part of the shared data structure on which we’re operating. But if the atomic instruction fails, that means some other thread was attempting to commit a transaction at the same time we were.

This fail-and-retry approach works because whenever one thread fails to commit its transaction, we know that it was because some other thread managed to succeed. As a result, one thread in the system is always making forward progress (just maybe not us). And that is the definition of lock-free.

17、A Lock-Free Linked List

Game Engine Architecture 5
Game Engine Architecture 5Game Engine Architecture 5
template< class T >
class SList
{
    struct Node
    {
        T m_data;
        Node* m_pNext;
    };
std::atomic</span>&lt; Node* &gt;<span> m_head { nullptr };

public:
</span><span>void</span><span> push_front(T data)
{
    </span><span>//</span><span> prep the transaction locally</span>
    auto pNode = <span>new</span><span> Node();
    pNode</span>-&gt;m_data =<span> data;
    pNode</span>-&gt;m_pNext =<span> m_head;

    </span><span>//</span><span> commit the transaction atomically
    </span><span>//</span><span> (retrying until it succeeds)</span>
    <span>while</span> (!<span>m_head.compare_exchange_weak(
            pNode</span>-&gt;<span>m_pNext, pNode))
    { }
}

};

View Code

注意,compare_exchange_weak 并不等于同 CAS。CAS为false时,并不会修改 expectedValue,但是 compare_exchange_weak会修改 expectedValue。所以上述代码 pNode->m_pNext 会自动更新。

Game Engine Architecture 5

18、SIMD/Vector Processing

SIMD and multithreading are combined into a form of parallelsm known single instruction multiple thread (SIMT), which forms the basis of all modern GPUs.

Intel first introduced its MMX instruction set with their Pentium line _of CPUs in 1994._These instructions permitted SIMD calculations to be performed on eight 8-bit integers, four 16-bit integers, or two 32-bit integers packed into special 64-bitMMX registers.

MMX 为64位。

streaming SIMD extensions, or SSE, the first version of which appeared in the Pentium III processor. The SSE instruction set utilizes 128-bit registers that can contain integer or IEEE floating-point data.

SSE 为 128位。

The SSE mode most commonly used by game engines is called packed 32-bit floating-point mode. In this mode, four 32-bit float values are packed into a single 128-bit register. An operation such as addition or multiplication can thus be performed on four pairs of floats in parallel by taking two of these 128-bit registers as its inputs. Intel has since introduced various upgrades to the SSE instruction set, named SSE2, SSE3, SSSE3 and SSE4.

In 2011, Intel introduced a new, wider SIMD register file and accompanying instruction set named advanced vector extensions (AVX). AVX registers are 256 bits wide, permitting a single instruction to operate on pairs of up to eight 32-bit floating-point operands in parallel. The AVX2 instruction set is an extension to AVX. Some Intel CPUs now support AVX-512, an extension to AVX permitting 16 32-bit floats to be packed into a 512-bit register.

Avx 为256位, Avx-512为 512位。

Game Engine Architecture 5

19、The SSE Instruction Set and Its Registers

These instructions are denoted with a ps suffix, indicating that we’re dealing with packed data (p), The SSE registers are named XMMi, where i is an integer between 0 and 15. In packed 32-bit floating-point mode, each 128-bit XMMi register contains four 32-bit floats.

In AVX, the registers are 256 bits wide and are named YMMi; in AVX-512, they are 512 bits in width and are named ZMMi.

20、The __m128 Data Type

Declaring automatic variables and function arguments to be of type __m128 often results in the compiler treating those values as direct _ proxies for SSE registers._

自动变量、函数参数的 __m126 会编译成对 SSE register 的使用。

But using the __m128 type to declare global variables, structure/class members, and sometimes local variables results in the data being stored as a 16-byte aligned array of float in memory.

全局变量、成员变量会编译成128 bit 对齐的内存数据。

Using a memory-based __m128 variable in an SSE calculation will cause the compiler to implicitly emit instructions for loading the data from memory into an SSE register prior to performing the calculation on it, and likewise to emit instructions for storing the results of the calculation back into the memory that “backs” each such variable.

21、Alignment of SSE Data

Whenever data that is intended for use in an XMMi register is stored in memory, it must be 16-byte (128-bit) aligned. Likewise.

The compiler ensures that global and local variables of type __m128 are aligned automatically. It also pads struct and class members so that any ___m128 members are aligned properly relative to the start of the object, and_ ensures that the alignment of the entire struct or class is equal to the worstcase alignment of its members.

22、SSE Compiler Intrinsics

modern compilers provide intrinsics—special syntax that looks and behaves like a regular C function, but is actually boiled down to inline assembly code by the compiler.

In order to use SSE and AVX intrinsics, your .cpp file must #include _ ****, or when compiling with_ Clang or gcc.

23、Some Useful SSE Intrinsics

• __m128 _mm_set_ps(float w, float z, float y, float x);

• __m128 _mm_load_ps(const float* pData);

• void _mm_store_ps(float* pData, __m128 v);

• __m128 _mm_add_ps(__m128 a, __m128 b);

• __m128 _mm_mul_ps(__m128 a, __m128 b);

Game Engine Architecture 5Game Engine Architecture 5
#include

void TestAddSSE()
{
    alignas(16) float A[4];
    alignas(16) float B[4] = { 2.0f, 4.0f, 6.0f, 8.0f };

    // set a = (1, 2, 3, 4) from literal values, and
    // load b = (2, 4, 6, 8) from a floating-point array
    // just to illustrate the two ways of doing this
    // (remember that _mm_set_ps() is backwards!)
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 b = _mm_load_ps(&B[0]);

    // add the vectors
    __m128 r = _mm_add_ps(a, b);

    // store '__m128 a' into a float array for printing
    _mm_store_ps(&A[0], a);

    // store result into a float array for printing
    alignas(16) float R[4];
    _mm_store_ps(&R[0], r);

    // inspect results
    printf("a = %.1f %.1f %.1f %.1f\n", A[0], A[1], A[2], A[3]);
    printf("b = %.1f %.1f %.1f %.1f\n", B[0], B[1], B[2], B[3]);
    printf("r = %.1f %.1f %.1f %.1f\n", R[0], R[1], R[2], R[3]);
}

View Code

24、Using SSE to Vectorize a Loop

Game Engine Architecture 5Game Engine Architecture 5
void AddArrays_ref(int count,
                    float* results,
                    const float* dataA,
                    const float* dataB)
{
    for (int i = 0; i < count; ++i)
    {
        results[i] = dataA[i] + dataB[i];
    }
}

View Code

上面的代码,使用SSE优化后如下。

Game Engine Architecture 5Game Engine Architecture 5
void AddArrays_sse(int count,
                    float* results,
                    const float* dataA,
                    const float* dataB)
{
    // NOTE: the caller needs to ensure that the size of
    // all 3 arrays are equal, and a multiple of four!

    assert(count % 4 == 0);
    for (int i = 0; i < count; i += 4)
    {
        __m128 a = _mm_load_ps(&dataA[i]);
        __m128 b = _mm_load_ps(&dataB[i]);
        __m128 r = _mm_add_ps(a, b);
        _mm_store_ps(&results[i], r);
    }
}

View Code

This is called vectorizing _ our loop._

This particular example isn’t quite four times faster, due to the overhead of having to load the values in groups of four and store the results on each iteration; but when I measured these functions running on very large arrays of floats, the nonvectorized loop took roughly 3.8 times as long to do its work as the vectorized loop.

25、Vectorized Dot Products

Game Engine Architecture 5Game Engine Architecture 5
void DotArrays_ref(int count,
                    float r[],
                    const float a[],
                    const float b[])
{
    for (int i = 0; i < count; ++i)
    {
        // treat each block of four floats as a
        // single four-element vector
        const int j = i * 4;
        r[i] = a[j+0]*b[j+0] // ax*bx
             + a[j+1]*b[j+1] // ay*by
             + a[j+2]*b[j+2] // az*bz
             + a[j+3]*b[j+3]; // aw*bw
    }
}

View Code

_mm_hadd_ps() (horizontal add). This intrinsic operates on a single input register (x, y, z,w) and calculates two sums: s = x +y and t = z+w. It stores these two sums into the four slots of the destination register as (t, s, t, s). Performing this operation twice allows us to calculate the sum d = x +y+z+w. This is called adding across a register.

first attempt at using SSE intrinsics:

Game Engine Architecture 5Game Engine Architecture 5
void DotArrays_sse_horizontal(int count,
                                float r[],
                                const float a[],
                                const float b[])
{
    for (int i = 0; i < count; ++i)
    {
        // treat each block of four floats as a
        // single four-element vector
        const int j = i * 4;
        __m128 va = _mm_load_ps(&a[j]); // ax,ay,az,aw
        __m128 vb = _mm_load_ps(&b[j]); // bx,by,bz,bw
        __m128 v0 = _mm_mul_ps(va, vb);

        // add across the register...

        __m128 v1 = _mm_hadd_ps(v0, v0);

        // (v0w+v0z, v0y+v0x, v0w+v0z, v0y+v0x)
        __m128 vr = _mm_hadd_ps(v1, v1);

        // (v0w+v0z+v0y+v0x, v0w+v0z+v0y+v0x,
        // v0w+v0z+v0y+v0x, v0w+v0z+v0y+v0x)
        _mm_store_ss(&r[i], vr); // extract vr.x as a float
    }
}

View Code

上面的第一个优化代码. Adding across a register is not usually a good idea because it’s a very slow operation. Profiling the DotArrays_sse() implementation shows that it actually takes a little bit more time than the reference implementation. Using SSE here has actually slowed us down!12

A better approach. 优化掉耗时的 _mm_hadd_ps().注意输入是 :By storing them in transposed order。

Game Engine Architecture 5Game Engine Architecture 5
void DotArrays_sse(int count,
                    float r[],
                    const float a[],
                    const float b[])
{
    for (int i = 0; i < count; i += 4)
    {
        __m128 vaX = _mm_load_ps(&a[(i+0)*4]); // a[0,4,8,12]
        __m128 vaY = _mm_load_ps(&a[(i+1)*4]); // a[1,5,9,13]
        __m128 vaZ = _mm_load_ps(&a[(i+2)*4]); // a[2,6,10,14]
        __m128 vaW = _mm_load_ps(&a[(i+3)*4]); // a[3,7,11,15]
        __m128 vbX = _mm_load_ps(&b[(i+0)*4]); // b[0,4,8,12]
        __m128 vbY = _mm_load_ps(&b[(i+1)*4]); // b[1,5,9,13]
        __m128 vbZ = _mm_load_ps(&b[(i+2)*4]); // b[2,6,10,14]
        __m128 vbW = _mm_load_ps(&b[(i+3)*4]); // b[3,7,11,15]

        __m128 result;
        result = _mm_mul_ps(vaX, vbX);
        result = _mm_add_ps(result, _mm_mul_ps(vaY, vbY));
        result = _mm_add_ps(result, _mm_mul_ps(vaZ, vbZ));
        result = _mm_add_ps(result, _mm_mul_ps(vaW, vbW));

        _mm_store_ps(&r[i], result);
    }
}

View Code

26、The MADD Instruction

It’s interesting to note that a multiply followed by an add is such a common operation that it has its own name—madd.

前面代码中的 _mm_add_ps(result, _mm_mul_ps(vaY, vbY)); 可以用另一个函数来替代 madd。

27、Transpose as We Go

25中的最后一个示例,要求输入的 a、b 是 transpose的。但通常我们面对的是正常存储的数据,为了transpose,可以使用 TRANSPOSE4() 方法。

Game Engine Architecture 5Game Engine Architecture 5
void DotArrays_sse_transpose(int count,
                                float r[],
                                const float a[],
                                const float b[])
{
    for (int i = 0; i < count; i += 4)
    {
        __m128 vaX = _mm_load_ps(&a[(i+0)*4]); // a[0,1,2,3]
        __m128 vaY = _mm_load_ps(&a[(i+1)*4]); // a[4,5,6,7]
        __m128 vaZ = _mm_load_ps(&a[(i+2)*4]); // a[8,9,10,11]
        __m128 vaW = _mm_load_ps(&a[(i+3)*4]); // a[12,13,14,15]
        __m128 vbX = _mm_load_ps(&b[(i+0)*4]); // b[0,1,2,3]
        __m128 vbY = _mm_load_ps(&b[(i+1)*4]); // b[4,5,6,7]
        __m128 vbZ = _mm_load_ps(&b[(i+2)*4]); // b[8,9,10,11]
        __m128 vbW = _mm_load_ps(&b[(i+3)*4]); // b[12,13,14,15]

        _MM_TRANSPOSE4_PS(vaX, vaY, vaZ, vaW);
        // vaX = a[0,4,8,12]
        // vaY = a[1,5,9,13]
        // ...

        _MM_TRANSPOSE4_PS(vbX, vbY, vbZ, vbW);
        // vbX = b[0,4,8,12]
        // vbY = b[1,5,9,13]
        // ...

        __m128 result;
        result = _mm_mul_ps(vaX, vbX);
        result = _mm_add_ps(result, _mm_mul_ps(vaY, vbY));
        result = _mm_add_ps(result, _mm_mul_ps(vaZ, vbZ));
        result = _mm_add_ps(result, _mm_mul_ps(vaW, vbW));
        _mm_store_ps(&r[i], result);
    }
}

View Code

Profiling all three implementations of DotArrays() shows that our final version (the one that transposes the vectors as it goes) is roughly 3.5 times faster than the reference implementation.

28、Shuffle and Transpose

here’s what the _MM_TRANSPOSE() macro looks like:

Game Engine Architecture 5Game Engine Architecture 5
#define _MM_TRANSPOSE4_PS(row0, row1, row2, row3) \
{ __m128 tmp3, tmp2, tmp1, tmp0; \
    \
    tmp0 = _mm_shuffle_ps((row0), (row1), 0x44); \
    tmp2 = _mm_shuffle_ps((row0), (row1), 0xEE); \
    tmp1 = _mm_shuffle_ps((row2), (row3), 0x44); \
    tmp3 = _mm_shuffle_ps((row2), (row3), 0xEE); \
    \
    (row0) = _mm_shuffle_ps(tmp0, tmp1, 0x88); \
    (row1) = _mm_shuffle_ps(tmp0, tmp1, 0xDD); \
    (row2) = _mm_shuffle_ps(tmp2, tmp3, 0x88); \
    (row3) = _mm_shuffle_ps(tmp2, tmp3, 0xDD); }

View Code

Mask 的来源为:

Game Engine Architecture 5

_mm_shuffle_ps() 的完整涵义为:

Game Engine Architecture 5

29、Vector-Matrix Multiplication with SSE

Vector-Matrix 本质上与 VectorDot 的计算过程是一样的。所以实现原理类似。

Game Engine Architecture 5Game Engine Architecture 5
union Mat44
{
    float c[4][4]; // components
    __m128 row[4]; // rows
};

__m128 MulVecMat_sse(const __m128& v, const Mat44& M)
{
    // first transpose v
    __m128 vX = _mm_shuffle_ps(v, v, 0x00); // (vx,vx,vx,vx)
    __m128 vY = _mm_shuffle_ps(v, v, 0x55); // (vy,vy,vy,vy)
    __m128 vZ = _mm_shuffle_ps(v, v, 0xAA); // (vz,vz,vz,vz)
    __m128 vW = _mm_shuffle_ps(v, v, 0xFF); // (vw,vw,vw,vw)

    __m128 r = _mm_mul_ps(vX, M.row[0]);

    r = _mm_add_ps(r, _mm_mul_ps(vY, M.row[1]));
    r = _mm_add_ps(r, _mm_mul_ps(vZ, M.row[2]));
    r = _mm_add_ps(r, _mm_mul_ps(vW, M.row[3]));\

    return r;
}

View Code

30、Matrix-Matrix Multiplication with SSE

矩阵-矩阵相乘,分别计算前一矩阵的每一行,让前一矩阵的每一行分别与后面的矩阵进行 Vector-Matrix Multiplication运算。

Game Engine Architecture 5Game Engine Architecture 5
void MulMatMat_sse(Mat44& R, const Mat44& A, const Mat44& B)
{
    R.row[0] = MulVecMat_sse(A.row[0], B);
    R.row[1] = MulVecMat_sse(A.row[1], B);
    R.row[2] = MulVecMat_sse(A.row[2], B);
    R.row[3] = MulVecMat_sse(A.row[3], B);
}

View Code

31、Generalized Vectorizaton

Interestingly, most optimizing compilers can vectorize some kinds of singlelane loops automatically. In fact, when writing the above examples, it took some doing to force the compiler not to vectorize my single-lane code, so that I could compare its performance to my SIMD implementation!

32、Vector Predication

下面是一个普通函数,用于计算平方根。

Game Engine Architecture 5Game Engine Architecture 5
#include 

void SqrtArray_ref(float restrict r,const float restrict a,

int count){ for (unsigned i = 0; i < count; ++i) { if (a[i] >= 0.0f) r[i] = std::sqrt(a[i]); else r[i] = 0.0f; }}

View Code

下面是第一个SSE优化,但是在除0的部分有问题。

Game Engine Architecture 5Game Engine Architecture 5
#include

void SqrtArray_sse_broken(float* __restrict__ r,
const float* __restrict__ a,

int count)
{
    assert(count % 4 == 0);
    __m128 vz = _mm_set1_ps(0.0f); // all zeros
    for (int i = 0; i < count; i += 4)
    {
        __m128 va = _mm_load_ps(a + i);
        __m128 vr;
        if (_mm_cmpge_ps(va, vz)) // ???

            vr = _mm_sqrt_ps(va);
        else
            vr = vz;
        _mm_store_ps(r + i, vr);
    }
}

View Code

_mm_cmpge_ps() produce a four component result, stored in an SSE register. the result consists of four 32-bit masks. Each mask contains all binary 1s (0xFFFFFFFFU) if the corresponding component in the input register passed the test, and all binary 0s (0x0U) if that component failed the test.

将 _mm_cmpge_ps 的返回值与 sqrt 的结果相 & 就得到正确的结果。注意,下面取结果的代码略显麻烦。

Game Engine Architecture 5Game Engine Architecture 5
#include

void SqrtArray_sse(float* __restrict__ r,
const float* __restrict__ a,

int count)
{
    assert(count % 4 == 0);
    __m128 vz = _mm_set1_ps(0.0f);
    for (int i = 0; i < count; i += 4)
    {
        __m128 va = _mm_load_ps(a + i);

        // always do the quotient, but it may end
        // up producing QNaN in some or all lanes
        __m128 vq = _mm_sqrt_ps(va);

        // now select between vq and vz, depending
        // on whether the input was greater than
        // or equal to zero or not
        __m128 mask = _mm_cmpge_ps(va, zero);

        // (vq & mask) | (vz & ~mask)
        __m128 qmask = _mm_and_ps(mask, vq);
        __m128 znotmask = _mm_andnot_ps(mask, vz);
        __m128 vr = _mm_or_ps(qmask, znotmask);
        _mm_store_ps(r + i, vr);
    }
}

View Code

SSE4—it is emitted by the intrinsic _mm_blendv_ps(). Let’s take a look at how we might implement a vector select operation ourselves. We can write it like this:

Game Engine Architecture 5Game Engine Architecture 5
__m128 _mm_select_ps(const __m128 a,
                    const __m128 b,
                    const __m128 mask)
{
    // (b & mask) | (a & ~mask)
    __m128 bmask = _mm_and_ps(mask, b);
    __m128 anotmask = _mm_andnot_ps(mask, a);
    return _mm_or_ps(bmask, anotmask);
}

View Code

也可以使用异或来实现 select函数。

Game Engine Architecture 5Game Engine Architecture 5
__m128 _mm_select_ps(const __m128 a,
                        const __m128 b,
                        const __m128 mask)
{
    // (((a ^ b) & mask) ^ a)
    __m128 diff = _mm_xor_ps(a, b);
    return _mm_xor_ps(a, _mm_and_ps(mask, diff));
}

View Code

33、

34、

35、

Original: https://www.cnblogs.com/tekkaman/p/10530747.html
Author: Tekkaman
Title: Game Engine Architecture 5

原创文章受到原创版权保护。转载请注明出处:https://www.johngo689.com/549008/

转载文章受原作者版权保护。转载请注明原作者出处!

(0)

大家都在看

亲爱的 Coder【最近整理,可免费获取】👉 最新必读书单  | 👏 面试题下载  | 🌎 免费的AI知识星球