The KISS Game Engine.

Main
Parallelism
Runtime Types
Memory Management
Graphics (in process)

Graphics

Alright so what is a graphics engine?

In general were going to author some data in some art package, run it through some tool and use some API to draw it on the screen. Now realistically it's more than that, what we're really going to want to do is take these data packets and add them to a scene and then have them drawn. The scene manager however will sit on top of the core renderer, in this way it can be replaced by the client application as necessary.

At this point I just want to discuss static geometry, we'll deal with hierarchies later, and it will sit on top of the static solution.

 

Things to think about

Graphics Chips are just big state machines, when we draw something we change their state, so
Do we save and restore the state?
Do we make every draw call responsible for setting all appropriate state?
Can we force a state override for a draw call?

I think the right answer here is State is "saved" automatically. All state can be explicitly overridden outside the scope of the draw call, this last part is important because it lets us potentially use the same "Model" for things like Z pre-pass and Shadow rendering.

The reason we will save state automatically is because otherwise every material would have to explicitly set every state including every texture stage which would be a pain in the ass. We expect to have potentially many "materials" and making them simpler to write at the expense of core complexity

We can defer state setting to the point of the draw call and maintain a shadow states to remove redundant changes.  This way the application just needs to worry about the states it sets and changes to material implementations do not randomly break things in unpredictable ways.

Transparent objects are not generally resolved through Z Buffer sorting
How do we deal with transparency sorting?

This is generally done by making several passes over the Model, drawing opaque and transparent layers in separate passes. It's actually useful to allow for more than 2 passes, to resolve some trivial static sorting issues between transparencies in an object. Our initial implementation will ignore this, but we'll add the functionality later.

Do we allow double sided polygons?

We don't need engine side support for this, just allowing materials to set the Cull flag (making the polygons double-sided) and assuming that they do the right thing with lighting is sufficient.

The ability to draw triangles and lines without having to worry about high level graphics concepts.
Do we need immediate mode rendering?

Yes we need this! In general it's use should be avoided in shipping game code, but the ability to draw debug primitives is utterly invaluable. It's interface should be a simple as possible. We will actually Implement this first since it doesn't require any tools work.

Threading
How are we dealing with threading?
Do we need to submit primitives from multiple threads?

We will want to be able to put the thing that actually makes D3D or whatever API calls on a separate thread. This is a trivial way to exploit parallelism
We may want to be able to submit primitives to the scene manager from multiple threads. Even if it's only debug primitives this is fundamentally useful in a multithreaded environment.
We may want to be able to put Scene Management on a separate thread (or threads) than the render calls. This last one is heavily dependant on the complexity of the Scene Management algorithm but it's worth considering.

Fundamentally what we'll end up with is some data structure of rendering commands built in one frame and consumed in subsequent ones. We want multiple threads to be able to write to this data structure and a single reader, we'll need some degree of determinism in the ordering for the same view on the same scene, and ideally the data structure will be lockless cause mutices hurt (see bit on parallelism).

The determinism requirement is the hard part, and it's actually more than simple determinism, we're going to want to have the application determine ordering, although there are actually few "hard" ordering constraints. We can approach this by sorting submissions based on some metric, and sorting metric collisions further using some thread Id, for complete determinism this would still require that a particular command packet always come from the same thread for a given view. In practice I don't believe that this will be a real issue, cameras tend to move around and change the viewpoint continually anyway, for the most part what we will actually care about is:

  • Sort order is approximately correct according to some metric.
  • Material Passes are rendered in the correct order.
  • Layers are rendered in the correct order.

A single metric should trivially allow us to do this.

As for the lock free part, since we don't read from the data structure while we are writing to it, we could trivially have each thread build separate structures and merge them in a synchronization step. The alternative is using some sort of lock free priority based structure, probably something like an array of queues, with the array used for some initial partitioning in the metrics space. Which is the better solution may vary based on the way that client code partitions the work. For right now the exact solution doesn't matter, since we can trivially define an interface that supports either, and for that matter other possible solutions. For our initial implementation we'll probably ignore the complexities of the parallel implementation anyway.

Should the Renderer be a Singleton?

I think this one is an emphatic yes. The more interesting question is does it contain none static data, i.e. does it allocate memory at startup. Because if the answer is yes then we have to implement some sort of singleton, if it's no we can either just have a bunch of functions in a namespace or declare a class with nothing but static members. The answer is no, simply because we have to have an Init function anyway.

Should the Renderer allocate memory, or should it be passed in?

This is a somewhat tricky question, for some none obvious reasons. There are significant performance issues on multiprocessor machines if a cache line is used by more than one processor, without knowing how many threads will be submitting commands, we can't trivially pre-divide any memory block inside the renderer on cache-line boundaries, so it makes some sense to have the clients allocate the data. However if we do this where do we free the memory? The consumer would have no clear way to know which allocator was used.

Given that, we likely have to allocate the command packets inside the renderer.

A second option is to rely in the allocator to resolve the issue, allocate some larger aligned block of memory on demand, and assume the allocating thread controls it. This makes allocations more expensive.

The question then comes down to whether the potential cache contention is more expensive than the additional overhead/allocation. My best guess at this point is that most clients will actually submit on a single thread for the most part and the cache collisions will be minimal, so I'm going with the simpler allocator although our above requirements still require thread safety.

In the end I created a simple pool type allocator, and wrote a single threaded version and a thread-safe non-locking version. The latter I ended up writing in assembly language, because the compiler refused to inline the version using intrinsics.


#ifndef KISS_RENDERER_POOL_ALLOCATOR
#define KISS_RENDERER_POOL_ALLOCATOR

namespace KISS {



class PoolAllocator
{
public:
    PoolAllocator(void *mem, uint32_t size, uint32_t blockSize);
    inline void *Alloc();
    inline void Free(void * addr);

private:
    void *mFreeHead;
};

#if KISS_RENDERER_THREAD_SAFE
// Lock free multithreaded version
__forceinline void * PoolAllocator::Alloc()
{
    __asm mov edx, mFreeHead[this]
    __asm retry:
    __asm cmp edx, 0
    __asm je fail             // No Memory!
    __asm mov eax, [edx]
    __asm mov ecx, [eax]
    __asm lock cmpxchg [edx], ecx
    __asm jne retry
    __asm fail:
}

__forceinline void PoolAllocator::Free(void *addr)
{
    __asm mov edx, mFreeHead[this]
    __asm retry:
    __asm mov eax, [edx]
    __asm mov ecx, addr
    __asm mov [ecx], eax
    __asm lock cmpxchg [edx], ecx
    __asm jne retry
}

#else
// Trivial None threadsafe version
inline void * PoolAllocator::Alloc()
{
    if (!mFreeHead) return 0; // Fail - OO Memory
    void *temp = mFreeHead;
    mFreeHead = *(void **)mFreeHead;
    return temp;
}

inline void PoolAllocator::Free(void *addr)
{
    *(void **)addr = mFreeHead;
    mFreeHead = addr;
}
#endif


}

#endif


The lock-free multithreaded version uses the cmpxchg instruction to update the head pointer, basically it's an atomic instruction that only succeeds if the head pointer hasn't been modified since the original read. If there is a race condition between two threads, the head pointer will be changed by one and the other will fail, it then re-attempts the operation. It's worth noting that the head pointer will cause cache contention on real multiprocessor environments, so reading it is potentially expensive 100 or so cycles assuming a flush to L2. It may be worth revisiting this allocator later, we'll be able to measure the overall contention in the system by looking at how much of a win we get on say two processors when we use separate heaps. But for now this will be fine, since the interface abstracts the implementation issues.

Thinking about this more, this is actually more complicated than we need, rendering clients only allocate and our renderer which is not multithreaded only frees, because we are potentially multithreaded we can't guarantee the order of allocation and free are the same. However we can wait until we finish consuming a frame and simply free all of the items. So I'm going to change the implementation to be a simple stack (see below), this also makes variable sized allocations easier to deal with.


#ifndef KISS_RENDERER_POOL_ALLOCATOR
#define KISS_RENDERER_POOL_ALLOCATOR

#include "assert.h"

namespace KISS {



class PoolAllocator
{
public:
    PoolAllocator();
    PoolAllocator(void *mem, uint32_t size);
    void Init(void *mem, uint32_t size);

    inline void *Alloc(size_t size);
    inline void Reset();

private:
    void *mMem;
    void *mNextMem;
    uint32_t mSize;
};

#if KISS_RENDERER_THREAD_SAFE
inline void * PoolAllocator::Alloc(size_t lsize)
{
    __asm {
        mov edx, this
        mov eax, [lsize]
        lock xadd mNextMem[edx], eax
    }
}

inline void PoolAllocator::Reset()
{
    mNextMem = mMem;
}

#else
// Trivial None threadsafe version
inline void * PoolAllocator::Alloc(size_t size)
{
    void *retVal = mNextMem;
    assert((intptr_t)mNextMem + size < mSize);
    mNextMem = (void *)((intptr_t)mNextMem+size);
    return retVal;
}

inline void PoolAllocator::Reset()
{
    mNextMem = mMem;
}
#endif


}

#endif


The implementation is of course simpler, but more importantly the interface changed.

 

 

 

Back to the issue of State. Because we have potentially parallel submission of "primitives", from an API stand point we cannot have concept of a single global state. The general assumption is that every "primitive" has an associated material, which has some sort of state block describing some difference from some default state. However earlier I discussed the concept of being to override state, that implies a second state block that is created when the "primitive" is submitted to the queue.

The question then is what does the state override API look like semantically, this really comes down to a single question. What is the lifetime of a state override? Is it one "primitive" or is it until that state is explicitly cleared?

I personally think the latter is more useful, if only because it's useful to be able to override state in a whole host of primitives (say force Aniso filtering) without having to dig down into the state itself.

What happens if we override an override?

This is a more difficult question but if we assume we want to be able to explicitly override large sets of state inside code we do not control, then you should have to explicitly cancel an override before allowing another one. If however you wanted to follow the standard state machine pattern, then it would simply change the override value from that point on.

It's somewhat hard to say which is the better solution at this point, my best guess is the former. but the latter is somewhat easier to implement, and I don't believe nested overrides will be a common pattern.

 

 

 

 

 

Contact me