Moving towards a GPU implementation

SDL 1.2 is already 3.5 years old and no longer receives any updates, so I’ve taken the effort to port the voxel engine to SDL 2. I also figured out how to connect my Wii U Pro Controller on Linux (which involved modifying and recompiling bluez-4), so I’ve added some joystick support as well. Two weeks ago I pushed those changes to GitHub.

I experimented a bit with the engine and noticed that areas that don’t face the camera traverse more octree nodes than areas that do. This can be made visible by counting the number of octnodes traversed per quadnode and coloring a pixel pink when this count exceeds 16 (a sketch of this measurement is given below the figure). The result is shown in the picture below, with some of these pixels highlighted with a red oval. For these pixels it would be better to traverse the octree less deeply, which would also give us a cheap kind of anisotropic filtering. In other words: in most areas the number of traversed octnodes is rather limited, and in the areas where it is not, we shouldn’t be traversing that many octnodes in the first place.

Screenshot from the Sibenik dataset. Red highlight: pink pixels, which indicate a high octree node traversal count. Blue highlight: a rendering bug, caused by improper ordering of octree nodes.
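
A minimal sketch of this measurement (the names and the 0xAARRGGBB pixel layout are my assumptions, not the engine’s actual code):

#include <cstdint>

// 'traversal_count' holds, per pixel (i.e. per leaf quadnode), the number of
// octnodes that were visited while rendering it; pixels above the threshold
// are overwritten with pink.
static const int OCTNODE_THRESHOLD = 16;

void mark_expensive_pixels(uint32_t * pixels, const int * traversal_count, int n) {
    for (int i = 0; i < n; i++) {
        if (traversal_count[i] > OCTNODE_THRESHOLD)
            pixels[i] = 0xffff00ffu; // pink
    }
}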

The blue area highlights a bug in my rendering algorithm: if you look closely, you see two ‘lines’ of white dots. The problem is that, while the children within an octnode are traversed from front to back, each child node is rendered entirely before the next one. Hence octnodes that are further away can end up obscuring octnodes that are nearer to the viewer, which is what we see happening here. To fix this, we should sort the octnodes within each quadnode, rather than only traversing the octnode children in front-to-back order. This is feasible because we can fairly safely limit the number of octnodes per quadnode; this limit can be configured to trade rendering speed against accuracy. A sketch of the idea is given below.
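
A minimal sketch of the proposed fix, with hypothetical types (the real renderer tracks more state per node):

#include <algorithm>
#include <cstdint>
#include <vector>

// Rather than recursing into octnode children strictly front to back, each
// quadnode keeps the (limited) list of octnodes that intersect it, sorted by
// distance to the viewer, so nearer nodes are always rendered first.
struct OctnodeRef {
    int node;       // index of the octnode
    int64_t depth;  // distance from the viewer to the octnode
};

static const size_t OCTNODE_LIMIT = 8; // rendering speed vs. accuracy trade-off

void sort_octnodes(std::vector<OctnodeRef> & nodes) {
    std::sort(nodes.begin(), nodes.end(),
        [](const OctnodeRef & a, const OctnodeRef & b) { return a.depth < b.depth; });
    if (nodes.size() > OCTNODE_LIMIT)
        nodes.resize(OCTNODE_LIMIT); // keep only the nearest octnodes
}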

With this new algorithm the recursive rendering function no longer needs to return a value. Hence we are no longer forced to traverse the recursion tree using depth-first search and can use breadth-first search (BFS) instead. BFS is much easier to parallelize and therefore much easier to implement efficiently on a GPU.
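
To illustrate (the types are illustrative; a GPU version would process the queue level by level, one dispatch per level):

#include <queue>

// Each work item pairs a screen region (quadnode) with an octree node. Since
// no value is passed back up the recursion tree, all items in the frontier
// are independent of each other and can be processed in parallel.
struct WorkItem { int quadnode; int octnode; };

void render_bfs(std::queue<WorkItem> & frontier) {
    while (!frontier.empty()) {
        WorkItem item = frontier.front();
        frontier.pop();
        // ... test item.octnode against item.quadnode; if it is visible and
        // not a leaf, push the child combinations onto the queue:
        // frontier.push({child_quadnode, child_octnode});
    }
}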

Depth information and screen space ambient occlusion

The voxel rendering algorithm already computed depth information. A while ago I added some code that collects this information and stores it in a depth buffer, so that it can be used in post-processing steps, such as the screen space ambient occlusion (SSAO) lighting technique. I implemented the SSAO filter using this tutorial. However, my rendering algorithm does not produce normal information, so I had to change it slightly.
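
A minimal sketch of what a normal-free occlusion estimate can look like, assuming a single-channel depth buffer (this is a guess at the kind of change involved, not the actual filter):

#include <algorithm>

// For each pixel, count how many nearby depth samples are closer to the
// camera than the pixel itself; many closer neighbours suggest the pixel is
// occluded. A small bias avoids self-occlusion due to depth noise.
float depth_only_ssao(const float * depth, int w, int h, int x, int y) {
    const int RADIUS = 4;
    const float BIAS = 0.01f;
    float occluded = 0.0f, samples = 0.0f;
    for (int dy = -RADIUS; dy <= RADIUS; dy++) {
        for (int dx = -RADIUS; dx <= RADIUS; dx++) {
            int sx = std::min(std::max(x + dx, 0), w - 1);
            int sy = std::min(std::max(y + dy, 0), h - 1);
            if (depth[sy * w + sx] + BIAS < depth[y * w + x]) occluded += 1.0f;
            samples += 1.0f;
        }
    }
    return 1.0f - occluded / samples; // 1.0 = fully lit, lower = more occluded
}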

I also added some code to allow supersampling anti-aliasing (SSAA), which basically renders the image at a much higher resolution and then downscales it to the output resolution. This is a prohibitively expensive anti-aliasing technique, but still, my renderer takes only 2-4 seconds to render a 7680×4320 image and downscale it to 1920×1080, which amounts to 16x SSAA. Combined with the SSAO this resulted in the pictures attached below. To emphasize the effect of the SSAO, I exaggerated the resulting shadows.
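
The downscale step boils down to a 4×4 box filter, sketched below under the assumption of tightly packed 24-bit RGB buffers (the engine’s actual buffers may differ):

#include <cstdint>

// Average each 4x4 block of the 7680x4320 render into one pixel of the
// 1920x1080 output, i.e. 16 samples per output pixel.
void downscale_4x4(const uint8_t * src, uint8_t * dst, int dst_w, int dst_h) {
    int src_w = dst_w * 4;
    for (int y = 0; y < dst_h; y++)
    for (int x = 0; x < dst_w; x++)
    for (int c = 0; c < 3; c++) { // r, g, b
        int sum = 0;
        for (int sy = 0; sy < 4; sy++)
        for (int sx = 0; sx < 4; sx++)
            sum += src[((y * 4 + sy) * src_w + (x * 4 + sx)) * 3 + c];
        dst[(y * dst_w + x) * 3 + c] = (uint8_t)(sum / 16);
    }
}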

Benchmarks

As requested, here are some benchmarks. The dataset used is available at https://www.dropbox.com/s/zdy53f7ywfs3mls/sibenik.7z. The points are centered around (0x8000,0x8000,0x8000). The data format used is:

struct point {
    uint16_t x, y, z; // position within the cube [0x0000, 0xFFFF]^3
    uint8_t r, g, b;  // color
    uint8_t dummy;    // explicit padding; keeps the record size even
};
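
A hypothetical loader for these fixed-size records (this assumes the file is a raw little-endian array of point structs; the repository’s actual tooling may differ):

#include <cstdio>
#include <vector>

std::vector<point> load_points(const char * filename) {
    std::vector<point> points;
    FILE * f = std::fopen(filename, "rb");
    if (!f) return points; // could not open the file: return an empty set
    point p;
    while (std::fread(&p, sizeof(point), 1, f) == 1)
        points.push_back(p);
    std::fclose(f);
    return points;
}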

I’ve chosen 3 camera positions (using glm.hpp):

struct Scene {
    const char * filename;  // dataset name
    glm::dvec3 position;    // camera position, in normalized coordinates
    glm::dmat3 orientation; // camera orientation matrix
};
static const Scene scene [] = {
    {"sibenik",  glm::dvec3( 0.0000,  0.0000,  0.0000),  glm::dmat3(-0.119, -0.430, -0.895,   0.249,  0.860, -0.446,   0.961, -0.275,  0.005)},
    {"sibenik",  glm::dvec3(-0.0462, -0.0302,  0.0088),  glm::dmat3(-0.275,  0.188,  0.943,   0.091,  0.981, -0.170,  -0.957,  0.039, -0.286)},
    {"sibenik",  glm::dvec3( 0.0500, -0.0320,  0.0027),  glm::dmat3(-0.231, -0.166, -0.959,  -0.815,  0.572,  0.097,   0.532,  0.803, -0.267)},
};

The positions are normalized, with -1.0 referring to 0x0 and 1.0 referring to 0x10000.
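
Concretely, the mapping between the two coordinate systems works out as follows (trivial helpers, not part of the benchmark code):

#include <cstdint>

double to_normalized(uint16_t v) { return v / 32768.0 - 1.0; }      // 0x8000 = 32768
double to_raw(double normalized) { return (normalized + 1.0) * 32768.0; }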

The view frustum is defined as in the code below, but without the near and far clipping (see also the documentation for glFrustum).

static const int SCREEN_WIDTH  = 1024;
static const int SCREEN_HEIGHT = 768;
glFrustum(
  -SCREEN_WIDTH/2,  SCREEN_WIDTH/2,  // left, right
  -SCREEN_HEIGHT/2, SCREEN_HEIGHT/2, // bottom, top
  SCREEN_HEIGHT, SCREEN_WIDTH*2      // near, far (clipping not applied)
);

Results:

Hardware: Intel(R) Core(TM) i7-2670QM CPU @ 2.20GHz.

Making some progress

I’ve finally managed to implement the algorithm I described in my previous post. It is still rather slow (2 FPS at 1024×768 on average), with some frames taking up to 5 seconds. Also, my implementation does not yet always traverse nodes in the correct order, which causes some ordering errors.

Sibenik dataset

A proposal of a UD-like algorithm

This is a repost of my guest post originally published on chorasimilarity’s blog.

I designed an algorithm that, given an octree and a location within that octree, can compute the cubemap at that location. The algorithm essentially raytraces each face of the cubemap, but does so in a way that traces multiple rays simultaneously (a.k.a. mass connected processing).

The faces of the cubemap are axis-aligned, which allows us to use linear interpolation rather than perspective-correct interpolation (which involves 2 multiplications and a division). Thanks to this linear interpolation, the algorithm only needs integer additions, subtractions, and multiplications by a power of two (i.e. bitshifts).
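
A minimal sketch of the interpolation trick, assuming the face is 2^k pixels wide (the names are illustrative):

#include <cstdint>

// Walking across an axis-aligned face, the traced ray endpoint changes by a
// constant amount per pixel, so it can be carried in fixed point with a
// single addition; the power-of-two face width turns the division needed to
// compute that step into a bitshift.
void trace_face_row(int32_t left, int32_t right, int width_log2) {
    int32_t step = (right - left) >> width_log2; // per-pixel increment
    int32_t pos = left;
    for (int i = 0; i < (1 << width_log2); i++) {
        // ... trace the ray through 'pos' into the octree ...
        pos += step; // linear interpolation: one integer addition per pixel
    }
}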

For a more thorough explanation of the algorithm, please read the article.

I’ve also written a partial implementation of the algorithm, available on GitHub, which renders the positive z face of the cubemap. It currently does so at 2 fps, which seems promising, considering that I have not yet put any effort into optimization and screen space occlusion is not yet implemented. The data is still fully loaded into memory at startup. As a background I render a fixed cubemap.

Due to space considerations, not all models have been uploaded to GitHub, but a small low-resolution model is available, so that after compiling the code you can at least see something. The code is rather messy, as it is still in an experimental state.