Let’s Play Fez

My girlfriend was watching a Let’s play of the game Fez. It is mostly a 2D platform game. Though, the twist is that it is actually in a 3D-world. And your character happens to receive the power to change his perspective. The game has really good graphics, but more importantly, these graphics are voxels. So it seems that they are using a cubeset (3D-tileset), rather than 2D tilesets. And while they still don’t have a fully seamless cubeset, they come pretty close.


Edit: This article used to be titled ‘cubesets’. However, as ‘bricks’ are a better 3D equivalent of ’tiles’ than ‘cubes’, I’ve renamed it to ‘bricksets’.

If we want to use the sparse voxel octree renderer to make games, we have to find a more efficient way to store game worlds, because having to download and install a 5TB large world file just isn’t feasible. Therefore we’re going to look at how this problem was already solved 30 years ago. We’re going to look at tilesets.

2D Tilesets

Back in the old days, memory was scarce. To get around this, some games used tilesets to create big worlds. First you create a set of tiles, and then you can create your world by painting with these tiles in a grid. The Tiles can be ‘intentional’, for example: this grid cell is water, that grid cell is dirt. This looks like:

Another option is to use a ‘seamless’ tileset, where additional tiles have been created that represent the transition from one type to another. This helps to hide the grid, as it is no longer visible where one tile ends and another begins:

When creating a seamless tileset, eventually the question “what tiles do I have to make?” will pop-up. The answer is based on the marching squares algorithm. For example, we want to create grass and dirt tiles. Each tile has 4 corners, and each corner can be either grass or dirt. Hence, we would need 2x2x2x2 = 16 different tiles. There is a nice explanation of the marching squares algorithm and how to combine it at: http://blog.project-retrograde.com/2013/05/marching-squares/.

3D Tilesets

The above can quite easily be translated to 3D. Using intentional tileset results in a minecraft like game. For example, the following screen shot is from minetest (my favorite open-source minecraft clone):

I don’t have any images that use the 3D equivalent of seamless tiles, because such games don’t exist yet (though the Euclideon island demo would almost qualify).

Again, the first step is to create a tileset. How many tiles do we need? Well, a cube has 8 corners, so if we are only having dirt & air tiles, this would result in 28 = 256 tiles. That is quite a lot, especially because creating 3D tiles is also a lot harder than drawing 2D tiles. So, instead of drawing these by hand, I want to generate them procedurally. I’ve not yet figured out how I want to do that. Obviously, I also want a third tile type, for example stone. If we combine three tile types in a single tile, we would end up with 38=6561 different possible tiles. Obviously, most of these will rarely be used. So here it would be nice to have a method that can create such tiles on demand.

The name ’tilesets’ isn’t really accurate, because in 3D, they’re not tiles, they’re bricks, hence we might want to call them bricksets.

New data format

I finally got around implementing the new data format, which decreased the file sizes by 69-78%. For example, the Sibenik dataset went from 1.6GiB to 483MiB. The rendering time increased from 153ms to 160ms, however, the variation between scenes varies from 93ms to 210ms.

The format has become the following:

struct octree {
    uint32_t avgcolor:24;
    uint32_t bitmask : 8;
    uint32_t child[0];

The child array with length 0 denotes a variable length array. As it has size 0, the size of the struct is 4 bytes, plus 4 bytes for every child. Which child nodes exist is stored in the bitmask. Usually, the child array stores the index to the child nodes, such that given an array octree[N] a child node is stored at, for example, octree[node.child[2]]. In the last layer I don’t store pointers to the leaf nodes, but just store the colors of the leaf nodes (or’ed with 0xff000000, so they are kind of hard coded leaf nodes).

For those who want to test the new file format, I’ve pushed the source code to github. It still includes the two small test files ‘sponge’ and ‘sign’, which have been converted to the new format.

Edit: I’ve updated the git-repo, such that the master branch points to the version with the new storage format. I’ve also updated the readme file.

Next steps

Its been a while since my last post. I’ve been working on changing the octree storage format, to obtain better compression. The current format stores 8 child pointers per internal node (leaf nodes are not stored), using the following 64 byte struct:

struct octree { 
    uint32_t child[8];
    int32_t avgcolor[8];

However, this is a rather wasteful format if nodes have 4 children on average or less. Therefore I considered changing it to the variable length struct:

struct octree {
    uint32_t bitmask: 8;
    uint32_t color  :24;
    uint32_t child[0];

This struct is only 4 bytes per node + 4 bytes per child pointer. With this, the data file size decreases with about 75% 60%. The drawback is that the traversal code becomes more complicated, as the children still need to be traversed in front to back order, which depends on the camera orientation.

Currently my voxel renderer can output at 5 FPS reliably. This is nowhere near the 30 FPS that I would require for an interactive game. And changing the octree file format is not going to change that. Even worse, it might actually reduce the rendering speed. I might be able to further optimize my algorithm, but without conceptual changes, I would be able to reach 10 FPS tops (rumor has it that this is the frame rate that UD runs at). As I lack hope that I can get my renderer running at the required 30 FPS, changing the storage format does not seem worth the effort.

One conceptual change that might be an option is point caching. However, I don’t like it, as it seems to much of a dirty hack to me and I have little hope that I can get it working properly. The conceptual change that I’m currently considering is to port my algorithm to the GPU. Due to the amount of recursion, that is not a trivial task. However, I’ve done some GPU programming before (using GLSL) and I think I might be able to change my algorithm such that it can run on the GPU. OpenCL seemed like the most obvious language to program it in, however, it is bulky and I have no experience with it. Fortunately, there happen to be some OpenGL extensions that make OpenGL a viable choice, such as ‘compute shaders‘, which makes a lot of GPGPU functionality directly available in OpenGL; ‘indirect rendering‘, which eliminates costly round trips from GPU to CPU and back; and ‘atomic counters‘, which I can use to fill a vertex buffer with a variable amount of points or triangles.


As requested, here are some benchmarks. The dataset used is available at https://www.dropbox.com/s/zdy53f7ywfs3mls/sibenik.7z. The points are centered around (0x8000,0x8000,0x8000). The data format used is:

struct point{
    uint16_t x,y,z;
    uint8_t r,g,b;
    uint8_t dummy;

I’ve chosen 3 camera positions (uses glm.hpp):

struct Scene {
    const char * filename;
    glm::dvec3 position;
    glm::dmat3 orientation;
static const Scene scene [] = {
    {"sibenik",  glm::dvec3( 0.0000,  0.0000,  0.0000),  glm::dmat3(-0.119, -0.430, -0.895,   0.249,  0.860, -0.446,   0.961, -0.275,  0.005)},
    {"sibenik",  glm::dvec3(-0.0462, -0.0302,  0.0088),  glm::dmat3(-0.275,  0.188,  0.943,   0.091,  0.981, -0.170,  -0.957,  0.039, -0.286)},
    {"sibenik",  glm::dvec3( 0.0500, -0.0320,  0.0027),  glm::dmat3(-0.231, -0.166, -0.959,  -0.815,  0.572,  0.097,   0.532,  0.803, -0.267)},


The positions are normalized, with -1.0 referring to 0x0 and 1.0 referring to 0x10000.

The view frustum is defined as in the code below, but without the near and far clipping. (See also the documentation for glFrustum.)

static const int SCREEN_WIDTH = 1024;
static const int SCREEN_HEIGHT = 768;


Hardware: CPU: Intel(R) Core(TM) i7-2670QM CPU @ 2.20GHz.

Making some progress

I’ve finally managed to implement the algorithm I described in my previous post. It is still rather slow (2FPS at 1024×768 on average), with some frames taking up to 5 seconds. Also, my implementation does not yet always traverse nodes in the correct order, which causes some ordering errors.

Sibenik dataset

Sibenik dataset

A sorting based rendering method

Rasterization, ray tracing, ray casting and splatting are well-known rendering methods. At Euclideon they claim to have found a new method, which they call ‘searching’ or ‘sorting’. I think sorting is a good name and believe that it works as described below.

Scene graph

First you need a scene that you want to render. We want to be able to break down this scene into smaller pieces indefinitely, hence this scene needs to be represented as a tree of infinite size and height and no leaf nodes. It is stored in memory as a directed graph of finite size. Each node models a part of the scene by describing how it is made out of other nodes. As such, the graph effectively describes a fractal. One of the nodes in the graph is the root node and contains the entire scene. We will refer to this graph as the scene graph.

A few trivial examples of scene graphs:

The line segment is made of two scaled down versions of itself. The cube is made similarly. The triangle graph consists of two nodes (marked with a and b) as this eliminates the need for rotation. Based on the cube graph, we can convert an octree into a valid scene graph by letting its leaf nodes point to themselves. Sets of line segments and triangles can be stored in K-D trees or BSP trees.


Culling is the process of removing parts from the scene that are not visible. This is generally used to speed up rasterization and splatting algorithms and to ensure that parts near the camera are not obscured by parts further away. The sorting rendering method is a combination of frustum culling and occlusion culling.

With frustum culling the parts that are outside the frustum of the camera are removed. This is usually done using bounding spheres and a dot product. We use a different approach. First we chose an origin and chose the xy-axes to be parallel to the sides of the viewing window. Then the bounds of the viewing window projected on the xy-plane are computed. In the image below these are denoted x1 and x2 for the x-axis. Furthermore we compute how these bounds change when the origin is moved. Those bounds and their changes are then used to determine whether the model (shown in orange) is within the frustum.

With occlusion culling, the parts that are fully obscured by other parts are removed. The best-known method for this is the z-buffer, which works on the pixel level. It can be changed to work on the part level by computing a bounding rectangle and the lowest z-value and then check for each pixel in the rectangle whether it is lower than the lowest z-value of the part, in which case it fully obscured and thus can be occluded. This is obviously very slow and can be sped-up by using a quadtree as z-buffer, in which the leaf nodes are the original z-buffer and internal nodes have the maximum value of its children. This quadtree is known as a hierarchical z-buffer.

Another way to speed up the occlusion culling is to sort the scene parts in front to back order, which maximizes the number of parts that are obscured and thus do not need to be rendered. If the parts are ordered perfectly, such that later parts can never obscure earlier parts, then we can use a bitmap, which describes whether a pixel is rendered, instead of a z-buffer.

Performing culling tests for each individual part of your scene can be almost as slow as the rendering itself. Therefore these tests are generally performed on groups of parts. This is where our scene graph comes into play, which defines how the entire scene is constructed out of some groups, which are recursively constructed out of smaller groups.

The new rendering method

The rendering method can be described as searching (for the model parts that need rendering), sorting (put those parts in buckets, known as pixels) and mass connected processing (doing this for many parts simultaneously). It is a combination of the above described frustum and occlusion culling, that entirely eliminates the need for rendering.

The rendering starts by throwing the root node of your scene graph into the root node of the occlusion quadtree. Then the following recursive process starts:

  • If the scene part is larger than the quadtree node it is in, then it is broken into smaller parts (by traversing the parts outgoing edges). The parts that pass the frustum and occlusion culling tests for the current quadtree node are reinserted in front to back order.
  • If the scene part is smaller than the quadtree node and the quadtree node is not a leaf node, then it is inserted in those quadtree child nodes for which the part passes the frustum and occlusion culling tests.
  • If the scene part is smaller than the quadtree node and the quadtree node is a leaf node, then the pixel corresponding to that leaf node is colored using the average color of the part.

The occlusion culling test is fast, because it only needs to check a single boolean (or, if the parts are not sorted perfectly, compare two integers). The frustum culling is fast as it only needs a small number of comparisons (and is usually easy to implement). Furthermore, during the recursion, the new bounds of the projected view window can also be computed in constant time.

One nice thing about this rendering method is that it entirely removes the need for perspective projection. You only need to implement the frustum culling test and how the projected view window bounds change during the recursion. For some scene graphs, including the octree, these operations can be performed with only additions, subtractions and bit shifts. (Btw. this method effectively performs a long tail division, which is used to create the perspective projection.)

I suspect that the Atomontage Engine is also using this sorting approach.

(The images were made using ipe and pdftocairo.)

A proposal of a UD like algorithm

This is a repost of my guestpost originally posted on chorasimilarity’s blog.

I designed an algorithm that, given an octree and a location within that octree can compute the cubemap at that location. The algorithm kind of raytraces each face of the cubemap, but does this in such a way that multiple rays are traced simultaneously (aka. mass connected processing).

The faces of the cubemap are axis-aligned, which allows us to use linear interpolation rather than perspective correct interpolation (which includes 2 multiplications and a division). Because of this linear interpolation, the algorithm only needs integer additions, subtractions, and multiplications by a power of two (aka. bitshifts).

For a more thorough explanation of the algorithm please read the Article.

I’ve also written a partial implementation of the algorithm, available at Git-hub, which renders the positive z plane of the cubemap. It currently does so at 2 fps, which, considering I did not yet put effort in optimization and screen space occlusion is not yet implemented, seems promising. The data is still fully loaded into memory at startup. As background I render a fixed cubemap.

Due to space considerations, not all models have been uploaded to git-hub, but a small low-resolution model is available, such that after compiling the code, you can at least see something. The code is rather messy as it is still in an experimental state.