Vulkan has been released!

Yesterday, February 16th, the Khronos Group released Vulkan, a graphics and compute API for the GPU. To some extent it is the successor of OpenGL, though Vulkan will not replace OpenGL. The Vulkan API is designed to eliminate most of the complexity and driver overhead that implementing the OpenGL API requires. Furthermore, Vulkan has much better support for multi-threading, which allows you to distribute your rendering calls over multiple CPU cores. If you care a lot about performance, you should definitely check it out. Otherwise, stick to OpenGL, because Vulkan is (supposedly) much harder to use.

Update: I have a Fermi GPU, which is Vulkan capable, but it seems that NVIDIA will not be providing Vulkan drivers for Fermi GPUs. This means that I won’t be trying to use Vulkan for the time being.


Copy a texture to screen

For our sparse voxel octree renderer we’re going to bypass almost the entire GPU rendering pipeline and use a compute shader instead. ‘Almost’, because the pipeline is the only way to render onto the screen. In this post I’ll describe how to create a texture and copy it onto the screen. I also wanted to show how to use bindless textures, but unfortunately my GPU doesn’t support those.

Creating an OpenGL 4.5 context using SDL2 and Glad

In this post I’ll describe how to set up an OpenGL 4.5 context. In my previous post, I decided that I’m going to use GLM, SDL2 and Glad for that. Glad is a loader generator, so head over there and choose API: gl version 4.5; Profile: core; Extensions: GL_ARB_bindless_texture, GL_ARB_sparse_buffer, GL_ARB_sparse_texture; Options: uncheck ‘generate a loader’. Click generate, download the zip file and unpack glad.h and glad.c into a folder named glad in your source directory. The package also contains a file khrplatform.h, which you don’t need if you remove the line #include <KHR/khrplatform.h> from glad.h. Furthermore, because glad.c and glad.h now live in the same folder, edit glad.c to replace #include <glad/glad.h> with #include "glad.h".

Yes, I said I wanted to avoid extensions; however, these three are part of AZDO (approaching zero driver overhead) OpenGL, so I want to be able to explain their use as well. I also plan on using the sparse buffers in my voxel-engine.


Starting with OpenGL 4.5

When I started 3D programming, OpenGL 2 did not yet exist. I grew up using glBegin and glEnd, but quickly learned that you needed to use vertex arrays to get any decent performance. Since OpenGL 3.3 core profile, both are gone and you need to use buffer objects to send a list of vertices to your GPU. However, the biggest change is that the fixed pipeline is gone, which means that you’re forced to use shaders.

During my master’s in computer science and engineering, I did get some experience with shaders and their programming language, GLSL. However, I used Qt4, which provided a very nice class for dealing with shader programs, so I have no experience with compiling shader programs and binding data using the raw OpenGL API. Furthermore, the GLSL language itself has also changed quite a lot. Basically, I have to learn OpenGL from scratch again.

There are also big differences between minor OpenGL 4 versions. While no functionality is removed, the added functionality is sometimes a huge improvement over the functionality it seeks to replace, for example: the debug output added in 4.3. I will be using OpenGL 4.5, for the sole reason that my GPU claims to support it.

Moving towards a GPU implementation

SDL-1.2 is already 3.5 years old and no longer receives any updates, so I’ve taken the effort to port the voxel-engine to SDL-2. I also figured out how to connect my Wii U Pro Controller on Linux (which involved modifying and recompiling bluez-4), so I’ve added some joystick support as well. Two weeks ago I pushed those changes to GitHub.

I experimented a bit with the engine and noticed that areas that don’t face the camera traverse more octree nodes than areas that do. This can be made visible by counting the number of octnodes traversed per quadnode and coloring a pixel pink when this count exceeds 16. The result is shown in the picture below, with some of these pixels highlighted with a red oval. For these pixels it would be better to traverse the octree less deeply, which would give us a cheap kind of anisotropic filtering. So in most areas the number of traversed octnodes is rather limited, and in the areas where it is not, we shouldn’t be traversing that many octnodes in the first place.

Screenshot from the Sibenik dataset. Red highlight: pink pixels, which indicate a high octree node traversal count. Blue highlight: a rendering bug, caused by improper ordering of octree nodes.

The blue area highlights a bug in my rendering algorithm. If you look closely, you see two ‘lines’ of white dots. The problem is that, while within an octnode the children are traversed from front to back, each child node is rendered entirely before the next child node is rendered. Hence octnodes that are further away can end up obscuring octnodes that are nearer to the viewer, which is what we see happening here. To fix this, we should sort the octnodes in each quadnode, rather than traversing the octnode children in front-to-back order. This is possible because we can, almost always safely, limit the number of octnodes per quadnode. The octnode limit can be configured for a trade-off between rendering speed and accuracy.

With this new algorithm the recursive rendering function no longer needs to return a value. Hence we are no longer required to traverse the recursion tree using depth-first search and can instead use breadth-first search (BFS). BFS is much easier to parallelize and therefore much easier to implement efficiently on the GPU.

Depth information and screen space ambient occlusion

The voxel rendering algorithm already collected depth information. A while ago I added some code that gathers this information and stores it in a depth buffer, so it can be used in post-processing steps, such as the screen space ambient occlusion (SSAO) lighting technique. I implemented the SSAO filter using this tutorial. However, my rendering algorithm does not produce normal information, so I had to change the filter slightly.

I also added some code to allow super sampling anti-aliasing (SSAA), which basically renders the image at a much higher resolution and then downscales it to the output resolution. This is a prohibitively expensive anti-aliasing technique, but still, my renderer takes only 2-4 seconds to render a 7680×4320 image and downscale it to 1920×1080, which results in 16x SSAA. Combined with the SSAO this resulted in the pictures attached below. To emphasize the effect of SSAO, I exaggerated the resulting shadows.

Now running ~10fps at 1024×768

I noticed that GCC’s vector extensions do not always translate to efficient machine code, while the machine occasionally also has operations that the vector extensions do not expose. Therefore I decided to replace the vector extensions with Intel intrinsics, optimizing the code along the way. This led to a quite impressive increase in rendering speed. The Intrinsics Guide was an invaluable resource for working with the arcane syntax of the intrinsics.

I’ve changed the sign of x1 and y1 in the bound, dx, dy and dz parameters, which allowed me to merge the dltz and dgtz parameters into a single frustum parameter and, furthermore, to perform the frustum occlusion check using only two operations (_mm_cmplt_epi32 and _mm_movemask_ps). The movemask operation was also useful for reducing the number of operations needed to compute the furthest octant.

I wanted to use _mm_blend_ps for computing the new values of bound, dx, etc. However, its mask parameter has to be a compile-time constant, which forced me to unroll the quadtree traversal loop. This turned out to result in another significant speed-up. Unrolling the loops for the octree traversal also resulted in an improvement in rendering speed.

While the graphical output has not changed since I posted the first benchmarks, the rendering time has decreased quite a bit. Currently it runs the benchmarks at:
Test 1: 109.49ms, 3188432 traversals.
Test 2: 101.27ms, 2959452 traversals.
Test 3: 133.44ms, 3736954 traversals.

Using intrinsics instead of vector extensions is also an important step towards porting my rendering engine to the MSVC compiler.

The rendering engine is now detached from the SDL viewer

As I mentioned in a previous post, I wanted to detach the sparse voxel octree rendering code from the SDL-based viewer. With commit f36fc4ea that is now realized. The rendering engine now only requires the GLM library, which is used to set up the initial call to traverse. The engine has libpng as an optional dependency: if you have it, the engine will provide a method with which you can easily save the resulting image as a PNG; otherwise that method is a no-op. The benchmark and viewer still require SDL, though, and are not built if it is unavailable.

Voxel editor

I’ve been looking for some voxel-editors and found a few that are gratis:

  • MagicaVoxel: looks very nice, however, it is closed source and no Linux build is available;
  • Zoxel: open-source, but rather limited;
  • VoxelShop: closed-source, seems to work fine.

However, creating voxels manually is rather tedious work. So, I would like to generate them using a script. Furthermore, my voxel renderer supports some features that are not available elsewhere. For example, one can use extruded triangles as ‘leaf-nodes’ instead of tiny cubes, because nodes in a voxel octree can refer to themselves.

Hence, I decided that I want to make my own voxel editor, one that will be purely script based: if you want to make a single change, you change the script that generated the model and re-execute it. As scripting language I’ve chosen Lua, because it is almost trivial to create C++ bindings for Lua. I have also decided to use Qt4, as I don’t want to write yet another widget toolkit for SDL. However, I still want to use my voxel renderer inside my voxel editor, which means that I have to decouple the rendering engine from the SDL-based viewer.


As the first step, I switched from GNU makefiles to CMake. I noticed that I’m starting to gather quite a few dependencies, most of which are not very important and are only used for some minor feature. For example, the benchmark tool uses libpng to allow saving the generated images. If you don’t have the libpng development files installed, the benchmark tool will simply be built without the image-saving functionality. There is also a program that converts a texture and heightmap into a pointcloud and uses SDL_image to read the input images. If you don’t have SDL_image, this program won’t be built, though other targets still might be.
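An optional dependency like the libpng one can be expressed in CMake along these lines (a sketch; the target name and the HAVE_LIBPNG definition are made up, not the project’s actual CMakeLists.txt):

```cmake
# Look for libpng, but do not abort configuration if it is missing.
find_package(PNG)

add_executable(benchmark benchmark.cpp)
if(PNG_FOUND)
    # Only link libpng and compile in the image-saving code when available.
    target_include_directories(benchmark PRIVATE ${PNG_INCLUDE_DIRS})
    target_link_libraries(benchmark ${PNG_LIBRARIES})
    target_compile_definitions(benchmark PRIVATE HAVE_LIBPNG)
endif()
```

The C++ side would then guard the PNG-saving code with `#ifdef HAVE_LIBPNG`, so the same sources build with or without the library.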

The voxel rendering engine is not yet detached from the SDL viewer, but hopefully will be soon. I might then also update the benchmark tool to no longer use SDL, or make its SDL usage optional.

Video capture

I also took the effort to get the video capture working again. I had some issues with libAV, so I switched to ffmpeg. The two libraries are very similar, but not fully compatible. You will need to install ffmpeg if you want to use the video capture feature. The feature is disabled by default, because CMake cannot tell the difference between ffmpeg and libAV, ffmpeg is not that easy to obtain, and trying to use libAV will break the build. See for how to enable the video capture feature.