Moving towards a GPU implementation

SDL-1.2 is already 3.5 years old and no longer receives any updates, so I’ve ported the voxel engine to SDL-2. I also figured out how to connect my Wii U Pro Controller on Linux (which involved modifying and recompiling bluez-4), so I’ve added some joystick support as well. Two weeks ago I pushed those changes to GitHub.

I experimented a bit with the engine and noticed that areas that don’t face the camera traverse more octree nodes than areas that do. This can be made visible by counting the number of octnodes traversed per quadnode and coloring a pixel pink when this count exceeds 16. The result is shown in the picture below, with some of these pixels highlighted with a red oval. For these pixels it would be better to traverse the octree less deeply, which would give us a cheap kind of anisotropic filtering. In other words: in most areas the number of traversed octnodes is already rather limited, and in the areas where it is not, we shouldn’t be traversing that many octnodes in the first place.

Screenshot from the Sibenik dataset. Red highlight: pink pixels, which indicate a high octree node traversal count. Blue highlight: a rendering bug, caused by improper ordering of octree nodes.

The blue area highlights a bug in my rendering algorithm. If you look closely, you see two ‘lines’ of white dots. The problem is that, while the children within an octnode are traversed from front to back, each child node is rendered entirely before the next child node is rendered. Hence octnodes that are further away can end up obscuring octnodes that are nearer to the viewer, which is what we see happening here. To fix this, we should sort the octnodes within each quadnode, rather than only traversing the octnode children in front-to-back order. This is feasible because we can, almost always safely, limit the number of octnodes per quadnode. This octnode limit can be configured to trade off rendering speed against accuracy.
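A minimal sketch of such per-quadnode sorting, assuming each collected octnode carries a distance-derived sort key; the types and names are illustrative, not the engine’s actual code:

#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical reference to an octnode, carrying a distance-derived sort key.
struct OctRef { uint32_t depth_key; /* ... payload ... */ };

// Sort the octnodes collected in one quadnode front to back and enforce a
// per-quadnode limit, trading accuracy for rendering speed.
void sort_quadnode(std::vector<OctRef>& nodes, std::size_t max_nodes) {
    std::sort(nodes.begin(), nodes.end(),
              [](const OctRef& a, const OctRef& b) { return a.depth_key < b.depth_key; });
    if (nodes.size() > max_nodes)
        nodes.resize(max_nodes);  // drop the furthest octnodes
}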

With this new algorithm the recursive rendering function no longer needs to return a value. Hence we are no longer required to traverse the recursion tree using depth-first search and can use breadth-first search (BFS) instead. BFS is much easier to parallelize and therefore much easier to implement efficiently on the GPU.
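A sketch of the breadth-first variant, with an explicit work queue instead of recursion; the Work fields are illustrative. On a GPU, every item of a layer could be processed by its own thread:

#include <vector>

// Each BFS pass processes one layer of the recursion tree.
struct Work { int quadnode, octnode; };

void traverse_bfs(std::vector<Work> layer) {
    while (!layer.empty()) {
        std::vector<Work> next;
        for (const Work& w : layer) {
            (void)w;
            // cull w; either color a pixel or append child work items to `next`
        }
        layer = std::move(next);  // next layer of the recursion tree
    }
}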

Depth information and screen space ambient occlusion

The voxel rendering algorithm already computed depth information. A while ago I added some code that collects this information and stores it in a depth buffer, so it can be used in post-processing steps such as the screen space ambient occlusion (SSAO) lighting technique. I implemented the SSAO filter using this tutorial. However, my rendering algorithm does not produce normal information, so I had to modify the filter slightly.
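For illustration, a depth-only SSAO filter can be as simple as the sketch below, which darkens a pixel based on how much nearer its neighbours are; the sample offsets and falloff are made-up values, not those from the tutorial or from my implementation:

#include <algorithm>

// Depth-only SSAO sketch (no normals), over a float depth buffer of size w*h.
float ssao_factor(const float* depth, int w, int h, int x, int y) {
    static const int off[8][2] = {{1,0},{-1,0},{0,1},{0,-1},{2,2},{-2,2},{2,-2},{-2,-2}};
    float d = depth[y * w + x];
    float occlusion = 0.0f;
    for (const auto& o : off) {
        int sx = std::min(std::max(x + o[0], 0), w - 1);
        int sy = std::min(std::max(y + o[1], 0), h - 1);
        float diff = d - depth[sy * w + sx];                       // > 0: neighbour is closer
        occlusion += std::max(0.0f, diff) / (1.0f + diff * diff);  // range falloff
    }
    return 1.0f - occlusion / 8.0f;  // multiply the pixel color by this factor
}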

I also added some code to allow supersampling anti-aliasing (SSAA), which basically renders the image at a much higher resolution and then downscales it to the output resolution. This is a prohibitively expensive anti-aliasing technique, but still, my renderer takes only 2-4 seconds to render a 7680×4320 image and downscale it to 1920×1080, which amounts to 16x SSAA. Combined with the SSAO this resulted in the pictures attached below. To emphasize the effect of the SSAO, I exaggerated the resulting shadows.
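The downscale step is essentially a box filter. A sketch, assuming 0xAARRGGBB pixels and a scale factor k (k = 4 gives the 16x SSAA mentioned above):

#include <cstdint>

// Average each k x k block of the high-resolution image into one output
// pixel; w and h are the *output* dimensions, src has width w*k.
void downscale(const uint32_t* src, uint32_t* dst, int w, int h, int k) {
    for (int y = 0; y < h; y++)
    for (int x = 0; x < w; x++) {
        unsigned r = 0, g = 0, b = 0;
        for (int sy = 0; sy < k; sy++)
        for (int sx = 0; sx < k; sx++) {
            uint32_t c = src[(y * k + sy) * (w * k) + (x * k + sx)];
            r += (c >> 16) & 0xff; g += (c >> 8) & 0xff; b += c & 0xff;
        }
        unsigned n = k * k;
        dst[y * w + x] = 0xff000000u | ((r / n) << 16) | ((g / n) << 8) | (b / n);
    }
}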

Now running ~10fps at 1024×768

I noticed that GCC’s vector extensions do not always translate into efficient machine code, while some machine operations are not available through the vector extensions at all. Therefore I decided to replace the vector extensions with Intel intrinsics, optimizing the code along the way. This led to a quite impressive increase in rendering speed. The Intrinsics Guide was an invaluable resource for working with the arcane syntax of the intrinsics.

I’ve changed the sign of x1 and y1 in the bound, dx, dy and dz parameters, which allowed me to merge the dltz and dgtz parameters into a single frustum parameter and, furthermore, to perform the frustum occlusion check using only 2 operations (_mm_cmplt_epi32 and _mm_movemask_ps). The movemask operation was also useful for reducing the number of operations needed to compute the furthest octant.
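One possible reading of that 2-operation check, as an illustration rather than the engine’s exact code: with the signs of x1 and y1 flipped, ‘inside the frustum’ means all four packed bounds are negative, which a single compare against zero plus a movemask of the sign bits can verify:

#include <emmintrin.h>  // SSE2

static inline bool in_frustum(__m128i frustum) {
    __m128i lt = _mm_cmplt_epi32(frustum, _mm_setzero_si128());   // lane < 0 ?
    return _mm_movemask_ps(_mm_castsi128_ps(lt)) == 0xf;          // all four negative
}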

I wanted to use _mm_blend_ps for computing the new values of bound, dx, etc. However, its mask parameter has to be a compile-time constant, which forced me to unroll the quadtree traversal loop. This turned out to give another significant speed-up. Unrolling the loops for the octree traversal improved the rendering speed as well.
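Because _mm_blend_ps takes an immediate, each unrolled iteration needs its own compile-time mask. A sketch of how that can be expressed (the masks and names are illustrative):

#include <smmintrin.h>  // SSE4.1, for _mm_blend_ps

template<int MASK>
static inline __m128 blend_bounds(__m128 a, __m128 b) {
    return _mm_blend_ps(a, b, MASK);  // MASK is a compile-time constant here
}

// Unrolled call sites, one per traversal case:
// __m128 b0 = blend_bounds<0x3>(bound, new_bound);
// __m128 b1 = blend_bounds<0x5>(bound, new_bound);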

While the graphical output has not changed since I posted the first benchmarks, the rendering time has decreased quite a bit. Currently it runs the benchmarks at:
Test 1: 109.49ms, 3188432 traversals.
Test 2: 101.27ms, 2959452 traversals.
Test 3: 133.44ms, 3736954 traversals.

Using intrinsics instead of vector extensions is also an important step towards porting my rendering engine to the MSVC compiler.

The rendering engine is now detached from the SDL viewer

As I mentioned in a previous post, I wanted to detach the sparse voxel octree rendering code from the SDL-based viewer. With commit f36fc4ea that is now realized. The rendering engine now only requires the GLM library, which is used to set up the initial call to traverse. The engine has libpng as an optional dependency: if you have it, the engine provides a method with which you can easily save the resulting image as PNG; otherwise that method is a no-op. The benchmark and viewer still require SDL, though, and are not built if it is not available.

Voxel editor

I’ve been looking for voxel editors and found a few that are gratis:

  • MagicaVoxel: looks very nice, however, it is closed source and no Linux build is available;
  • Zoxel: open-source, but rather limited;
  • VoxelShop: closed-source, seems to work fine.

However, creating voxels manually is rather tedious work. So, I would like to generate them using a script. Furthermore, my voxel renderer supports some features that are not available elsewhere. For example, one can use extruded triangles as ‘leaf-nodes’ instead of tiny cubes, because nodes in a voxel octree can refer to themselves.

Hence, I decided to make my own voxel editor, one that is purely script-based: if you want to make a single change, you change the script that generated the model and re-execute that script. As scripting language I’ve chosen Lua, because it is almost trivial to create C++ bindings for Lua. I have also decided to use Qt4, as I don’t want to write yet another widget toolkit for SDL. However, I still want to use my voxel renderer inside the voxel editor, which means that I have to decouple the rendering engine from the SDL-based viewer.
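To illustrate how little glue such a binding needs, here is a minimal sketch; the add_voxel function and the script name are hypothetical, not the editor’s actual API:

#include <lua.hpp>

static int l_add_voxel(lua_State* L) {
    int x = (int)luaL_checkinteger(L, 1);
    int y = (int)luaL_checkinteger(L, 2);
    int z = (int)luaL_checkinteger(L, 3);
    // ... insert a voxel at (x, y, z) into the octree being built ...
    (void)x; (void)y; (void)z;
    return 0;  // number of values returned to Lua
}

int main() {
    lua_State* L = luaL_newstate();
    luaL_openlibs(L);
    lua_register(L, "add_voxel", l_add_voxel);  // expose the C++ function
    luaL_dofile(L, "model.lua");                // run the model-generating script
    lua_close(L);
}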

CMake

As a first step, I switched from GNU makefiles to CMake. I noticed that I’m starting to gather quite a few dependencies, most of which are not very important and only used for some minor feature. For example, the benchmark tool uses libpng to allow saving the generated images. Now, if you don’t have the development files for libpng installed, the benchmark tool will simply be built without the image saving functionality. There is also a program that converts a texture and heightmap into a point cloud and uses SDL_image to read the input images. If you don’t have SDL_image, this program won’t be built, though other targets still might be.

The voxel rendering engine is not yet detached from the SDL viewer, but hopefully will be soon. I might then also update the benchmark tool to no longer use SDL, or make its SDL usage optional.

Video capture

I also took the effort to get the video capture working again. I had some issues with Libav, so I switched to using FFmpeg. Those libraries are very similar, but not fully compatible. You will need to install FFmpeg if you want to use the video capture feature. The capture feature is disabled by default, because CMake cannot tell the difference between FFmpeg and Libav, FFmpeg is not that easy to obtain, and trying to use Libav will break the build. See README.md for how to enable the video capture feature.

New data format

I finally got around to implementing the new data format, which decreased the file sizes by 69-78%. For example, the Sibenik dataset went from 1.6GiB to 483MiB. The average rendering time increased from 153ms to 160ms, though it varies per scene, ranging from 93ms to 210ms.

The format has become the following:

struct octree {
    uint32_t avgcolor : 24;  // average color of the node (RGB)
    uint32_t bitmask  :  8;  // which of the 8 children exist
    uint32_t child[0];       // variable-length array of child indices
};

The child array with length 0 denotes a variable-length array. As it has size 0, the size of the struct is 4 bytes, plus 4 bytes for every child that exists. Which child nodes exist is stored in the bitmask. Usually the child array stores the indices of the child nodes, such that given an array octree[N], a child node is found at, for example, octree[node.child[2]]. In the last layer I don’t store pointers to the leaf nodes, but instead store the colors of the leaf nodes directly (or’ed with 0xff000000, so they act as a kind of hard-coded leaf node).
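For illustration, a common way to index such a bitmask-compressed child array is by popcount: the slot of octant i is the number of existing children before it. Whether my engine stores the children exactly this compactly is not spelled out above, so treat this as an assumption:

#include <cstdint>

static inline uint32_t child_slot(uint32_t bitmask, int octant) {
    // count the set bits below bit `octant`
    return (uint32_t)__builtin_popcount(bitmask & ((1u << octant) - 1u));
}

// Usage sketch, given an array `octree nodes[N]` and a node:
//   if (node.bitmask & (1u << octant))
//       uint32_t next = node.child[child_slot(node.bitmask, octant)];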

For those who want to test the new file format, I’ve pushed the source code to GitHub. It still includes the two small test files ‘sponge’ and ‘sign’, which have been converted to the new format.

Edit: I’ve updated the git-repo, such that the master branch points to the version with the new storage format. I’ve also updated the readme file.

Next steps

It’s been a while since my last post. I’ve been working on changing the octree storage format to obtain better compression. The current format stores 8 child pointers per internal node (leaf nodes are not stored), using the following 64-byte struct:

struct octree { 
    uint32_t child[8];
    int32_t avgcolor[8];
};

However, this is a rather wasteful format if nodes have 4 children or fewer on average. Therefore I considered changing it to the variable-length struct:

struct octree {
    uint32_t bitmask: 8;
    uint32_t color  :24;
    uint32_t child[0];
};

This struct is only 4 bytes per node plus 4 bytes per child pointer. With this, the data file size decreases by roughly 60-75%. The drawback is that the traversal code becomes more complicated, as the children still need to be traversed in front-to-back order, which depends on the camera orientation.

Currently my voxel renderer reliably outputs 5 FPS, which is nowhere near the 30 FPS that I would require for an interactive game, and changing the octree file format is not going to change that. Even worse, it might actually reduce the rendering speed. I might be able to further optimize my algorithm, but without conceptual changes I would reach 10 FPS at most (rumor has it that this is the frame rate that UD runs at). As I have little hope of getting my renderer to the required 30 FPS, changing the storage format does not seem worth the effort.

One conceptual change that might be an option is point caching. However, I don’t like it, as it seems too much of a dirty hack to me and I have little hope that I can get it working properly. The conceptual change that I’m currently considering is to port my algorithm to the GPU. Due to the amount of recursion, that is not a trivial task. However, I’ve done some GPU programming before (using GLSL) and I think I might be able to change my algorithm such that it can run on the GPU. OpenCL seemed like the most obvious language to program it in; however, it is bulky and I have no experience with it. Fortunately, there happen to be some OpenGL extensions that make OpenGL a viable choice, such as ‘compute shaders’, which make a lot of GPGPU functionality directly available in OpenGL; ‘indirect rendering’, which eliminates costly round trips from GPU to CPU and back; and ‘atomic counters’, which I can use to fill a vertex buffer with a variable number of points or triangles.
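A rough sketch of how these pieces could fit together on the OpenGL side, assuming a compute shader program prog that appends points to a vertex buffer and increments an atomic counter; all shader and buffer setup is omitted, and an OpenGL 4.2+ context and loader are assumed:

void dispatch_and_draw(GLuint prog, GLuint indirect_cmd, GLuint num_groups) {
    // Atomic counter: how many points the compute shader has emitted.
    GLuint counter;
    glGenBuffers(1, &counter);
    glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, counter);
    glBufferData(GL_ATOMIC_COUNTER_BUFFER, sizeof(GLuint), nullptr, GL_DYNAMIC_DRAW);
    glBindBufferBase(GL_ATOMIC_COUNTER_BUFFER, 0, counter);

    glUseProgram(prog);
    glDispatchCompute(num_groups, 1, 1);  // fill the vertex buffer on the GPU
    glMemoryBarrier(GL_ATOMIC_COUNTER_BARRIER_BIT | GL_COMMAND_BARRIER_BIT);

    // Copy the counter into the `count` field of the indirect draw command,
    // so the variable number of points is drawn without a CPU round trip.
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirect_cmd);
    glCopyBufferSubData(GL_ATOMIC_COUNTER_BUFFER, GL_DRAW_INDIRECT_BUFFER,
                        0, 0, sizeof(GLuint));
    glDrawArraysIndirect(GL_POINTS, nullptr);
}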

Benchmarks

As requested, here are some benchmarks. The dataset used is available at https://www.dropbox.com/s/zdy53f7ywfs3mls/sibenik.7z. The points are centered around (0x8000,0x8000,0x8000). The data format used is:

struct point {
    uint16_t x,y,z;  // position, centered around (0x8000, 0x8000, 0x8000)
    uint8_t r,g,b;   // color
    uint8_t dummy;   // padding, to align the struct to 8 bytes
};

I’ve chosen 3 camera positions (uses glm.hpp):

struct Scene {
    const char * filename;
    glm::dvec3 position;
    glm::dmat3 orientation;
};
static const Scene scene [] = {
    {"sibenik",  glm::dvec3( 0.0000,  0.0000,  0.0000),  glm::dmat3(-0.119, -0.430, -0.895,   0.249,  0.860, -0.446,   0.961, -0.275,  0.005)},
    {"sibenik",  glm::dvec3(-0.0462, -0.0302,  0.0088),  glm::dmat3(-0.275,  0.188,  0.943,   0.091,  0.981, -0.170,  -0.957,  0.039, -0.286)},
    {"sibenik",  glm::dvec3( 0.0500, -0.0320,  0.0027),  glm::dmat3(-0.231, -0.166, -0.959,  -0.815,  0.572,  0.097,   0.532,  0.803, -0.267)},
};


The positions are normalized, with -1.0 referring to 0x0 and 1.0 referring to 0x10000.
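In code, that mapping is simply (the helper name is mine):

// 0x0 maps to -1.0 and 0x10000 maps to 1.0.
double normalized(unsigned c) { return c / 32768.0 - 1.0; }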

The view frustum is defined as in the code below, but without the near and far clipping. (See also the documentation for glFrustum.)

static const int SCREEN_WIDTH = 1024;
static const int SCREEN_HEIGHT = 768;
glFrustum(
  -SCREEN_WIDTH/2, SCREEN_WIDTH/2, 
  -SCREEN_HEIGHT/2, SCREEN_HEIGHT/2, 
  SCREEN_HEIGHT, SCREEN_WIDTH*2
); 

Results:

Hardware: Intel(R) Core(TM) i7-2670QM CPU @ 2.20GHz.

Making some progress

I’ve finally managed to implement the algorithm I described in my previous post. It is still rather slow (2 FPS at 1024×768 on average), with some frames taking up to 5 seconds. Also, my implementation does not yet always traverse nodes in the correct order, which causes some ordering errors.

Two renderings of the Sibenik dataset.

A sorting-based rendering method

Rasterization, ray tracing, ray casting and splatting are well-known rendering methods. At Euclideon they claim to have found a new method, which they call ‘searching’ or ‘sorting’. I think sorting is a good name and believe that it works as described below.

Scene graph

First you need a scene that you want to render. We want to be able to break this scene down into smaller pieces indefinitely, hence the scene needs to be represented as a tree of infinite size and height, without leaf nodes. It is stored in memory as a directed graph of finite size. Each node models a part of the scene by describing how it is made out of other nodes; as such, the graph effectively describes a fractal. One of the nodes in the graph is the root node, which contains the entire scene. We will refer to this graph as the scene graph.

A few trivial examples of scene graphs:

The line segment is made of two scaled-down versions of itself. The cube is made similarly. The triangle graph consists of two nodes (marked a and b), as this eliminates the need for rotation. Based on the cube graph, we can convert an octree into a valid scene graph by letting its leaf nodes point to themselves. Sets of line segments and triangles can be stored in k-d trees or BSP trees.
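A sketch of how such a graph could be stored in memory; the Node layout is illustrative. Letting a leaf point to itself is what makes the tree conceptually infinite:

#include <cstdint>

struct Node {
    Node*    child[8];  // the parts this node is made of; may point back to this node
    uint32_t color;     // average color of this part
};

// Turning an octree leaf into a valid scene graph node:
void make_self_referential(Node& leaf) {
    for (Node*& c : leaf.child) c = &leaf;
}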

Culling

Culling is the process of removing parts from the scene that are not visible. This is generally used to speed up rasterization and splatting algorithms and to ensure that parts near the camera are not obscured by parts further away. The sorting rendering method is a combination of frustum culling and occlusion culling.

With frustum culling, the parts that lie outside the camera’s frustum are removed. This is usually done using bounding spheres and a dot product; we use a different approach. First we choose an origin and choose the x- and y-axes to be parallel to the sides of the viewing window. Then the bounds of the viewing window projected onto the xy-plane are computed; in the image below these are denoted x1 and x2 for the x-axis. Furthermore, we compute how these bounds change when the origin is moved. Those bounds and their changes are then used to determine whether the model (shown in orange) is within the frustum.

With occlusion culling, the parts that are fully obscured by other parts are removed. The best-known method for this is the z-buffer, which works at the pixel level. It can be made to work at the part level by computing a bounding rectangle and the lowest z-value of the part, and then checking for each pixel in the rectangle whether its z-buffer value is lower than the part’s lowest z-value, in which case the part is fully obscured and can be culled. This is obviously very slow, but it can be sped up by using a quadtree as z-buffer, in which the leaf nodes form the original z-buffer and each internal node holds the maximum value of its children. This quadtree is known as a hierarchical z-buffer.
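A sketch of the resulting part-level occlusion test; the types and names are illustrative:

struct ZNode {
    float  max_z;     // maximum z-buffer value within this screen region
    ZNode* child[4];  // nullptr at leaf level (single pixels)
};

// If the part's nearest depth lies behind the maximum z-buffer value of the
// region, nothing in the part can be visible there: cull the whole part.
bool occluded(const ZNode& region, float part_min_z) {
    return part_min_z >= region.max_z;
}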

Another way to speed up the occlusion culling is to sort the scene parts in front to back order, which maximizes the number of parts that are obscured and thus do not need to be rendered. If the parts are ordered perfectly, such that later parts can never obscure earlier parts, then we can use a bitmap, which describes whether a pixel is rendered, instead of a z-buffer.

Performing culling tests for each individual part of your scene can be almost as slow as the rendering itself. Therefore these tests are generally performed on groups of parts. This is where our scene graph comes into play, which defines how the entire scene is constructed out of some groups, which are recursively constructed out of smaller groups.

The new rendering method

The rendering method can be described as searching (for the model parts that need rendering), sorting (putting those parts into buckets, known as pixels) and mass connected processing (doing this for many parts simultaneously). It is a combination of the frustum and occlusion culling described above that entirely eliminates the need for a separate rendering step.

The rendering starts by throwing the root node of your scene graph into the root node of the occlusion quadtree. Then the following recursive process starts (a code sketch follows the list):

  • If the scene part is larger than the quadtree node it is in, then it is broken into smaller parts (by traversing the part’s outgoing edges). The parts that pass the frustum and occlusion culling tests for the current quadtree node are reinserted in front-to-back order.
  • If the scene part is smaller than the quadtree node and the quadtree node is not a leaf node, then it is inserted in those quadtree child nodes for which the part passes the frustum and occlusion culling tests.
  • If the scene part is smaller than the quadtree node and the quadtree node is a leaf node, then the pixel corresponding to that leaf node is colored using the average color of the part.
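A compile-ready sketch of this recursion; every helper function is a stand-in for the steps described above, not actual renderer code:

#include <vector>

struct Part {};      // a scene graph node
struct QuadNode {};  // a node of the occlusion quadtree

bool passes_culling(const Part&, const QuadNode&);   // frustum + occlusion tests
bool larger_than(const Part&, const QuadNode&);
bool is_leaf(const QuadNode&);
std::vector<Part> split_front_to_back(const Part&);  // traverse outgoing edges
std::vector<QuadNode*> children(QuadNode&);
void plot(QuadNode&, const Part&);                   // color pixel with the part's average color

void render(const Part& part, QuadNode& q) {
    if (!passes_culling(part, q)) return;
    if (larger_than(part, q)) {
        for (const Part& p : split_front_to_back(part))
            render(p, q);          // reinsert the pieces into the same quadtree node
    } else if (!is_leaf(q)) {
        for (QuadNode* c : children(q))
            render(part, *c);      // push the part down the quadtree
    } else {
        plot(q, part);             // leaf node: color the pixel
    }
}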

The occlusion culling test is fast, because it only needs to check a single boolean (or, if the parts are not sorted perfectly, compare two integers). The frustum culling is fast as it only needs a small number of comparisons (and is usually easy to implement). Furthermore, during the recursion, the new bounds of the projected view window can also be computed in constant time.

One nice thing about this rendering method is that it entirely removes the need for a perspective projection. You only need to implement the frustum culling test and how the projected view window bounds change during the recursion. For some scene graphs, including the octree, these operations can be performed with only additions, subtractions and bit shifts. (Incidentally, this method effectively performs a long division, which is what produces the perspective projection.)

I suspect that the Atomontage Engine is also using this sorting approach.

(The images were made using ipe and pdftocairo.)