Moving towards a GPU implementation

SDL-1.2 is already 3.5 years old and no longer receives any updates. Therefore I’ve taken the effort to port the voxel-engine to SDL-2. Also, I figured out how to connect my wii-u pro controller on linux (which involved modifying and recompiling bluez-4). So I’ve also added some joystick support. Two weeks ago I pushed those changes to github.

I experimented a bit with the engine and noticed that areas that don’t face the camera traverse more octree nodes than the areas that do face the camera. This can be made visible by counting the number of octnodes that are traversed per quadnode and color a pixel pink when this is above 16. The result is shown in the picture below, with some of these pixels  highlighted with a red oval. For these pixels, it would be better to traverse the octree less deep, such that we get some cheap kind of anisotropic filtering. So, in most areas the number of traversed octnodes is usually rather limited and in those areas where it is not, we shouldn’t be traversing the octnodes that much.

Screenshot from the Sibenik dataset. Red highlight: pink pixels, which indicate a high octree node traversal count. Blue highlight: a rendering bug, caused by improper ordering of octree nodes.

The blue area highlights a bug in my rendering algorithm. If you look closely, you see two ‘lines’ of white dots. The problem is that, while within an octnode the children are traversed from front to back, each child node is rendered entirely before the next child node is rendered. Hence octnodes that are further away can end up obscuring octnodes that are nearer to viewer. Which is what we see happening here. To fix this, we should sort the octnodes in each quadnode, rather than traversing the octnode children in front to back order. This is possible as we can almost safely limit the number of octnodes per quadnode. The octnode limit can be configured for a trade-off between rendering speed and accuracy.

With this new algorithm the recursive rendering function no longer needs to return a value. Hence we are no longer required to traverse the recursion tree using depth-first search and instead can use breadth-first search (BFS). BFS is much easier to parallelize and therefore much easier to implement efficiently for the GPU.

Next steps

Its been a while since my last post. I’ve been working on changing the octree storage format, to obtain better compression. The current format stores 8 child pointers per internal node (leaf nodes are not stored), using the following 64 byte struct:

struct octree { 
    uint32_t child[8];
    int32_t avgcolor[8];

However, this is a rather wasteful format if nodes have 4 children on average or less. Therefore I considered changing it to the variable length struct:

struct octree {
    uint32_t bitmask: 8;
    uint32_t color  :24;
    uint32_t child[0];

This struct is only 4 bytes per node + 4 bytes per child pointer. With this, the data file size decreases with about 75% 60%. The drawback is that the traversal code becomes more complicated, as the children still need to be traversed in front to back order, which depends on the camera orientation.

Currently my voxel renderer can output at 5 FPS reliably. This is nowhere near the 30 FPS that I would require for an interactive game. And changing the octree file format is not going to change that. Even worse, it might actually reduce the rendering speed. I might be able to further optimize my algorithm, but without conceptual changes, I would be able to reach 10 FPS tops (rumor has it that this is the frame rate that UD runs at). As I lack hope that I can get my renderer running at the required 30 FPS, changing the storage format does not seem worth the effort.

One conceptual change that might be an option is point caching. However, I don’t like it, as it seems to much of a dirty hack to me and I have little hope that I can get it working properly. The conceptual change that I’m currently considering is to port my algorithm to the GPU. Due to the amount of recursion, that is not a trivial task. However, I’ve done some GPU programming before (using GLSL) and I think I might be able to change my algorithm such that it can run on the GPU. OpenCL seemed like the most obvious language to program it in, however, it is bulky and I have no experience with it. Fortunately, there happen to be some OpenGL extensions that make OpenGL a viable choice, such as ‘compute shaders‘, which makes a lot of GPGPU functionality directly available in OpenGL; ‘indirect rendering‘, which eliminates costly round trips from GPU to CPU and back; and ‘atomic counters‘, which I can use to fill a vertex buffer with a variable amount of points or triangles.