The KISS Game Engine.

Main
Parallelism
Runtime Types
Memory Management
Graphics (in process)

Parallelism

General Thoughts

It's worth looking at the basic issues with working in a multithreaded environment. I'm not really talking about Synchronization and inter thread communication here just some of the less obvious issues.

In a multithreaded environment we have contention for shared resources, the most common of these resources is memory. It's relatively obvious that if two threads want to access the same memory location you need some way to arbitrate the resource (say a mutex). What's somewhat less obvious is the issue with two processors accessing different memory locations in the same cache line. Without you doing anything on most modern processors it just works, however processors deal with memory on a per cache line basis, so if the owner of the cache- line changes, it needs to be flushed to some coherent memory location (L2 cache if your lucky, main memory if you're not) and then read back by the new owner. If the flush is to L2 cache then were only talking about perhaps 100 cy, if it's to main memory it'll be more like 1000 cy.

So we don't want to allocate memory from two different locations in the same cache line to two different processors. There are two easy ways to do this, use a different heap for each processor, or align every allocation to cache lines.  Unfortunately modern cache lines are large (128bytes plus), so we can end up with a lot of wasted space doing the latter. Doing the former efficiently requires significant understanding of your heap requirements. There are a number of multithreaded allocators around that would deal with this for us, basically they parcel allocations from some central thread-safe allocator to individual thread heaps on demand, they are probably more complicated than we need. So for now we'll just try and parcel out memory for our core threaded systems cache aligned in large enough blocks that the overhead doesn't kill us.

Mutexs are slow

There is one other thing it's important to realize about parallelism acquiring a mutex is very slow, and any piece of code inside a mutex is executed entirely serially. The common case for excessive serialization is the memory allocator, in server apps running on large numbers of CPU's that do significant memory allocation from an allocator that is not designed for multi processor, it's actually not uncommon to see adding processors resulting in the application slowing down. This is a result of the relatively large lock times on most allocators and the overhead of the mutexs themselves.

While memory allocators are the common case in applications, any data-structure with multiple writers will exhibit the same effect, the frequency of collisions dictating how much of your potential CPU resources you will loose. One technique we can use to control the serialization to some extent is to have each thread produce independent result sets then combine the results at the end as a separate task. This is only really useful when the combine operation is cheap relative to the compute operation.

In the end Mutexs are the best solution for some problem sets, just don't use them unless you have to.

Lock Free data structures

Another approach to dealing with synchronization is to use what are termed lock free data structures. Basically this involves using atomic operations that fail in the case of a collision (usually compare and exchange), to ensure the structure is always valid, and retrying the operation when it fails. In principal this is potentially really fast, since in most cases we're talking about re-executing very few instructions, and the chances of more than one collision are low, unfortunately it still has to deal with the memory contention issues from above.

Either way a well written lock free data structure will likely be many times faster than using a mutex.

The disadvantage is that they are relatively complex for even simple structures, and they often have some memory overhead. On the plus side though we only really care about FIFO's, Stacks and Lists and those are "solved" problems (google is our friend here).

Slower can be faster in a multiprocessor environment

Optimizing in a single threaded environment is simple, you simple optimize each piece in isolation. In a multiprocessor environment rewriting something so it runs more slowly to offload to another processor can often improve overall performance.

 

Contact me