Moving towards a GPU implementation

SDL-1.2 is already 3.5 years old and no longer receives any updates. Therefore I’ve taken the effort to port the voxel-engine to SDL-2. Also, I figured out how to connect my wii-u pro controller on linux (which involved modifying and recompiling bluez-4). So I’ve also added some joystick support. Two weeks ago I pushed those changes to github.

I experimented a bit with the engine and noticed that areas that don’t face the camera traverse more octree nodes than the areas that do face the camera. This can be made visible by counting the number of octnodes that are traversed per quadnode and color a pixel pink when this is above 16. The result is shown in the picture below, with some of these pixels  highlighted with a red oval. For these pixels, it would be better to traverse the octree less deep, such that we get some cheap kind of anisotropic filtering. So, in most areas the number of traversed octnodes is usually rather limited and in those areas where it is not, we shouldn’t be traversing the octnodes that much.

Screenshot from the Sibenik dataset. Red highlight: pink pixels, which indicate a high octree node traversal count. Blue highlight: a rendering bug, caused by improper ordering of octree nodes.

The blue area highlights a bug in my rendering algorithm. If you look closely, you see two ‘lines’ of white dots. The problem is that, while within an octnode the children are traversed from front to back, each child node is rendered entirely before the next child node is rendered. Hence octnodes that are further away can end up obscuring octnodes that are nearer to viewer. Which is what we see happening here. To fix this, we should sort the octnodes in each quadnode, rather than traversing the octnode children in front to back order. This is possible as we can almost safely limit the number of octnodes per quadnode. The octnode limit can be configured for a trade-off between rendering speed and accuracy.

With this new algorithm the recursive rendering function no longer needs to return a value. Hence we are no longer required to traverse the recursion tree using depth-first search and instead can use breadth-first search (BFS). BFS is much easier to parallelize and therefore much easier to implement efficiently for the GPU.


7 thoughts on “Moving towards a GPU implementation

    • Thanks. I wanted to have a nice teaser image on top of my blog. The increased column width is a nice bonus. And I’ve also taken the effort to improve the layout of some of the older posts.

      Yes, I do have ideas for porting to the GPU. As a test I modified my CPU implementation to use a fixed size buffer of octnodes per quadnode and to sort these. It is available in the algorithm3 branch. While the CPU implementation still uses DFS to traverses the quadtree, there is no reason for the GPU implementation to use BFS. Making it some kind of up-scaling algorithm. You start with a low-resolution rendering with lots of octree nodes per pixel and repeatedly perform up-scaling passes. Each up-scaling pass increases the resolution in each dimension by a factor 2, while the number of octree nodes is reduced by a factor two. The CPU will still be used to fill an initial buffer. But for the GPU I only need to implement an up-scaling kernel, which takes an w*h*2s buffer and transforms it into a 2w*2h*s buffer. The last buffer is then rendered pixel-by-pixel to the screen. It is not a precise algorithm, as the bounded buffer size can cause important octnodes to be dropped, however, in practice it is rarely visible.

      So initially the CPU fills the initial buffer, sized 8×6 with 1024 octnodes per ‘pixel’. This buffer is then send to the GPU, which performs several passes to up-scale it from:
      8*6*1024 to
      16*12*512 to
      32*24*256 to
      64*48*128 to
      128*96*64 to
      256*192*32 to
      512*384*16 to
      which is then rendered to the screen. If you look closely to these numbers, you might notice that this algorithm has a running time of O(pixels), which can be considered the holy grail of 3D rendering (though I still need to get it implemented). There are also some additional nice features. It can easily support anti-aliasing and transparency and the scene graph doesn’t need to be an octree, but can be almost any tree, as long as it has some nice scaling and balancing properties.

  1. Hey bcmpinc,

    I was wondering how you got the voxels to be rendered as perfect cubes. I looked through your code and I didn’t see any edge equations to only color pixels inside the cube. I assume that the actual coloring part of your code is in the second loop in the traverse function but I couldn’t figure out how it finds the outline. Also, when your traversing your octree node and check for occlusion, do you check the octree as a cube outline or only as a quad which encompases the octree node?

    • My algorithm only renders pixels, not rectangles. Hence, if the traversal ended at a leaf node, but the quadtree traversal isn’t at a leaf node yet, I need to make up some child nodes. I split the leaf node into 8 smaller versions, which each can be split recursively as necessary. This way my algorithm is always able to split an octree node into smaller nodes. Due to the frustum culling, the ‘sub’-voxels end up being rendered as perfect cubes.

      Yes, you guessed right, the colouring happens in the second loop and is performed by face.draw(quadnode*4+i, color, udepth);. However, this only happens if the quadtree is traversed to the pixel level. If it is still at a higher lever, which is checked by if (quadnode<quadtree::M) {, then traverse is called recursively, allowing the octree nodes to be split further. The frustum culling checks whether the quadtree node overlaps the bounding rectangle of the octree node.

      • Ahh that makes a lot of sense. But when you traverse and call the compute frustum, it looks like your only checking if the bounding box of the octree node is visible so although the result is a perfect square, in between when you traverse the octree your actually checking a quad for occlusion. Im asking because im trying to make my own renderer but i dont want to check the aabb of the octree nodes cube for occlusion because that makes a whole ton of fase positives on the edges of objects.

      • It is a trade-off. If I want more accurate pruning, the frustum check becomes much more expensive. The false positives only happen at the corners of the quad, so I assume that there aren’t too many and those will likely be pruned in the recursive call anyway.

        However, note that frustum culling is something different than occlusion culling. For the later I traverse the octree in front to back order and use the quadtree as a kind of hierarchical depth buffer.

  2. I guess thats true, ive been struggling with getting the renderer ive created to properly check cubes for occlussion. I tried to just get the 6 vertices that make the roots outline and then when rendering the children i would use the same 6 vertices but relative to the childs center. Problem is that depending where the root is positioned in screenspace, its children will use different vertices to make up their outline so you start getting major rendering errors. Ill give my way some more thought but i think eventually ill try just checking the cubes outlines instead. Thanks for your explanations!

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s