Moving towards a GPU implementation

SDL-1.2 is already 3.5 years old and no longer receives any updates. Therefore I’ve taken the effort to port the voxel-engine to SDL-2. Also, I figured out how to connect my wii-u pro controller on linux (which involved modifying and recompiling bluez-4). So I’ve also added some joystick support. Two weeks ago I pushed those changes to github.

I experimented a bit with the engine and noticed that areas that don’t face the camera traverse more octree nodes than the areas that do face the camera. This can be made visible by counting the number of octnodes that are traversed per quadnode and color a pixel pink when this is above 16. The result is shown in the picture below, with some of these pixels  highlighted with a red oval. For these pixels, it would be better to traverse the octree less deep, such that we get some cheap kind of anisotropic filtering. So, in most areas the number of traversed octnodes is usually rather limited and in those areas where it is not, we shouldn’t be traversing the octnodes that much.

Screenshot from the Sibenik dataset. Red highlight: pink pixels, which indicate a high octree node traversal count. Blue highlight: a rendering bug, caused by improper ordering of octree nodes.

The blue area highlights a bug in my rendering algorithm. If you look closely, you see two ‘lines’ of white dots. The problem is that, while within an octnode the children are traversed from front to back, each child node is rendered entirely before the next child node is rendered. Hence octnodes that are further away can end up obscuring octnodes that are nearer to viewer. Which is what we see happening here. To fix this, we should sort the octnodes in each quadnode, rather than traversing the octnode children in front to back order. This is possible as we can almost safely limit the number of octnodes per quadnode. The octnode limit can be configured for a trade-off between rendering speed and accuracy.

With this new algorithm the recursive rendering function no longer needs to return a value. Hence we are no longer required to traverse the recursion tree using depth-first search and instead can use breadth-first search (BFS). BFS is much easier to parallelize and therefore much easier to implement efficiently for the GPU.

Depth information and screen space ambient occlusion

The voxel rendering algorithm already collected depth information. A while ago I added some code that collects this information and stores it in a depth buffer. Hence it can be used in post-processing steps, such as the screen space ambient occlusion (SSAO) lighting technique. I implemented the SSAO filter using this tutorial. However, my rendering algorithm does not produce normal information, so I had to change it slightly.

I also added some code to allow super sampling anti aliasing (SSAA), which basically renders the image at a much higher resolution and then downscales this to the output resolution. This a prohibitively expensive anti-aliasing technique, but still, my renderer takes only 2-4 seconds to render a 7680×4320 image and downscale it to 1920×1080, which results in 16x SSAA. Combined with the SSAO this resulted in the pictures attached below. To emphasize the effect of SSAO, I exaggerated the resulting shadows.

A sorting based rendering method

Rasterization, ray tracing, ray casting and splatting are well-known rendering methods. At Euclideon they claim to have found a new method, which they call ‘searching’ or ‘sorting’. I think sorting is a good name and believe that it works as described below.

Scene graph

First you need a scene that you want to render. We want to be able to break down this scene into smaller pieces indefinitely, hence this scene needs to be represented as a tree of infinite size and height and no leaf nodes. It is stored in memory as a directed graph of finite size. Each node models a part of the scene by describing how it is made out of other nodes. As such, the graph effectively describes a fractal. One of the nodes in the graph is the root node and contains the entire scene. We will refer to this graph as the scene graph.

A few trivial examples of scene graphs:

The line segment is made of two scaled down versions of itself. The cube is made similarly. The triangle graph consists of two nodes (marked with a and b) as this eliminates the need for rotation. Based on the cube graph, we can convert an octree into a valid scene graph by letting its leaf nodes point to themselves. Sets of line segments and triangles can be stored in K-D trees or BSP trees.


Culling is the process of removing parts from the scene that are not visible. This is generally used to speed up rasterization and splatting algorithms and to ensure that parts near the camera are not obscured by parts further away. The sorting rendering method is a combination of frustum culling and occlusion culling.

With frustum culling the parts that are outside the frustum of the camera are removed. This is usually done using bounding spheres and a dot product. We use a different approach. First we chose an origin and chose the xy-axes to be parallel to the sides of the viewing window. Then the bounds of the viewing window projected on the xy-plane are computed. In the image below these are denoted x1 and x2 for the x-axis. Furthermore we compute how these bounds change when the origin is moved. Those bounds and their changes are then used to determine whether the model (shown in orange) is within the frustum.

With occlusion culling, the parts that are fully obscured by other parts are removed. The best-known method for this is the z-buffer, which works on the pixel level. It can be changed to work on the part level by computing a bounding rectangle and the lowest z-value and then check for each pixel in the rectangle whether it is lower than the lowest z-value of the part, in which case it fully obscured and thus can be occluded. This is obviously very slow and can be sped-up by using a quadtree as z-buffer, in which the leaf nodes are the original z-buffer and internal nodes have the maximum value of its children. This quadtree is known as a hierarchical z-buffer.

Another way to speed up the occlusion culling is to sort the scene parts in front to back order, which maximizes the number of parts that are obscured and thus do not need to be rendered. If the parts are ordered perfectly, such that later parts can never obscure earlier parts, then we can use a bitmap, which describes whether a pixel is rendered, instead of a z-buffer.

Performing culling tests for each individual part of your scene can be almost as slow as the rendering itself. Therefore these tests are generally performed on groups of parts. This is where our scene graph comes into play, which defines how the entire scene is constructed out of some groups, which are recursively constructed out of smaller groups.

The new rendering method

The rendering method can be described as searching (for the model parts that need rendering), sorting (put those parts in buckets, known as pixels) and mass connected processing (doing this for many parts simultaneously). It is a combination of the above described frustum and occlusion culling, that entirely eliminates the need for rendering.

The rendering starts by throwing the root node of your scene graph into the root node of the occlusion quadtree. Then the following recursive process starts:

  • If the scene part is larger than the quadtree node it is in, then it is broken into smaller parts (by traversing the parts outgoing edges). The parts that pass the frustum and occlusion culling tests for the current quadtree node are reinserted in front to back order.
  • If the scene part is smaller than the quadtree node and the quadtree node is not a leaf node, then it is inserted in those quadtree child nodes for which the part passes the frustum and occlusion culling tests.
  • If the scene part is smaller than the quadtree node and the quadtree node is a leaf node, then the pixel corresponding to that leaf node is colored using the average color of the part.

The occlusion culling test is fast, because it only needs to check a single boolean (or, if the parts are not sorted perfectly, compare two integers). The frustum culling is fast as it only needs a small number of comparisons (and is usually easy to implement). Furthermore, during the recursion, the new bounds of the projected view window can also be computed in constant time.

One nice thing about this rendering method is that it entirely removes the need for perspective projection. You only need to implement the frustum culling test and how the projected view window bounds change during the recursion. For some scene graphs, including the octree, these operations can be performed with only additions, subtractions and bit shifts. (Btw. this method effectively performs a long tail division, which is used to create the perspective projection.)

I suspect that the Atomontage Engine is also using this sorting approach.

(The images were made using ipe and pdftocairo.)

A proposal of a UD like algorithm

This is a repost of my guestpost originally posted on chorasimilarity’s blog.

I designed an algorithm that, given an octree and a location within that octree can compute the cubemap at that location. The algorithm kind of raytraces each face of the cubemap, but does this in such a way that multiple rays are traced simultaneously (aka. mass connected processing).

The faces of the cubemap are axis-aligned, which allows us to use linear interpolation rather than perspective correct interpolation (which includes 2 multiplications and a division). Because of this linear interpolation, the algorithm only needs integer additions, subtractions, and multiplications by a power of two (aka. bitshifts).

For a more thorough explanation of the algorithm please read the Article.

I’ve also written a partial implementation of the algorithm, available at Git-hub, which renders the positive z plane of the cubemap. It currently does so at 2 fps, which, considering I did not yet put effort in optimization and screen space occlusion is not yet implemented, seems promising. The data is still fully loaded into memory at startup. As background I render a fixed cubemap.

Due to space considerations, not all models have been uploaded to git-hub, but a small low-resolution model is available, such that after compiling the code, you can at least see something. The code is rather messy as it is still in an experimental state.