Now running ~10fps at 1024×768

I noticed that GCC’s vector extensions do not always translate into efficient machine code, while the machine code occasionally has operations that are not available through the vector extensions. Therefore I decided to replace the vector extensions with Intel intrinsics, optimizing the code along the way. This led to a quite impressive increase in rendering speed. The Intrinsics Guide was an invaluable resource for working with the arcane syntax of the intrinsics.
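To illustrate the difference (a made-up example, not code from the renderer): with the vector extensions most operations look like ordinary arithmetic, but something like a movemask has no vector-extension operator and needs the intrinsic anyway.

#include <xmmintrin.h>

typedef int v4si __attribute__ ((vector_size (16))); // GCC vector extension type

// Arithmetic on vector-extension types usually maps to a single instruction.
v4si add4(v4si a, v4si b) { return a + b; }

// Collecting the sign bits of four lanes has no vector-extension operator,
// so the intrinsic is needed either way.
int signbits(__m128 v) { return _mm_movemask_ps(v); }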

I’ve changed the sign of x1 and y1 in the bound, dx, dy and dz parameters, which allowed me to merge the dltz and dgtz parameters into the frustum parameter and, furthermore, to perform the frustum occlusion check using only two operations (_mm_cmplt_epi32 and _mm_movemask_ps). The movemask operation was also useful for reducing the number of operations needed to compute the furthest octant.
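A minimal sketch of that two-operation check, assuming the bound and frustum values are packed so that all four comparisons point the same way after the sign flip (the layout and the interpretation of the resulting mask are illustrative, not my exact code):

#include <emmintrin.h> // SSE2

// Compare the four packed bound values against the frustum limits at once and
// collect the results as a 4-bit mask; which bit pattern means "node can be
// culled" depends on the chosen packing.
static inline int frustum_mask(__m128i bound, __m128i frustum) {
    __m128i lt = _mm_cmplt_epi32(bound, frustum);
    return _mm_movemask_ps(_mm_castsi128_ps(lt));
}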

I wanted to use _mm_blend_ps for computing the new values of bound, dx, etc. However, its mask parameter has to be a compile-time constant, which forced me to unroll the quadtree traversal loop. This turned out to result in another significant speed-up. Unrolling the loops for the octree traversal also improved the rendering speed.
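One way to satisfy the compile-time constant requirement without copy-pasting the blend everywhere is to pass the mask as a template parameter, so that each unrolled iteration gets its own instantiation (a sketch of the idea, with made-up names):

#include <smmintrin.h> // SSE4.1

// MASK must be known at compile time, so every unrolled iteration of the
// traversal loop calls its own instantiation of this helper.
template<int MASK>
static inline __m128 blend_bounds(__m128 old_bound, __m128 new_bound) {
    return _mm_blend_ps(old_bound, new_bound, MASK);
}

// e.g. bound = blend_bounds<0x3>(bound, child_bound); // one unrolled case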

While the graphical output has not changed since I posted the first benchmarks, the rendering time has decreased quite a bit. Currently it runs the benchmarks at:
Test 1: 109.49ms, 3188432 traversals.
Test 2: 101.27ms, 2959452 traversals.
Test 3: 133.44ms, 3736954 traversals.

Using intrinsics instead of vector extensions is also an important step towards porting my rendering engine to the MSVC compiler.

Next steps

It’s been a while since my last post. I’ve been working on changing the octree storage format to obtain better compression. The current format stores 8 child pointers per internal node (leaf nodes are not stored), using the following 64-byte struct:

struct octree {
    uint32_t child[8];    // one pointer per child
    int32_t avgcolor[8];  // average color of each child
};

However, this format is rather wasteful if nodes have 4 children or fewer on average. Therefore I considered changing it to the following variable-length struct:

struct octree {
    uint32_t bitmask: 8;  // marks which of the 8 children exist
    uint32_t color  :24;  // color of this node
    uint32_t child[0];    // one pointer per existing child
};

This struct is only 4 bytes per node plus 4 bytes per child pointer; a node with 4 children, for example, takes 20 bytes instead of 64. With this, the data file size decreases by about 60%. The drawback is that the traversal code becomes more complicated, as the children still need to be traversed in front-to-back order, which depends on the camera orientation.
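To give an impression of that extra complexity, here is a rough sketch (my own illustration, not the renderer’s code) of how the packed node could be walked front to back, assuming the index of the octant nearest to the camera has already been computed elsewhere:

#include <cstdint>

void traverse(const octree * node, uint32_t nearest /* octant nearest to the camera, 0..7 */) {
    for (uint32_t k = 0; k < 8; k++) {
        uint32_t c = k ^ nearest;                   // visits the octants front to back
        if (!(node->bitmask & (1u << c))) continue; // child c does not exist
        // Child pointers are stored compactly, so the position of child c is the
        // number of existing children with a lower octant index.
        uint32_t idx = __builtin_popcount(node->bitmask & ((1u << c) - 1u));
        uint32_t child = node->child[idx];
        // ... descend into 'child' here ...
        (void)child;
    }
}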

Currently my voxel renderer reliably outputs 5 FPS. This is nowhere near the 30 FPS that I would need for an interactive game, and changing the octree file format is not going to change that. Even worse, it might actually reduce the rendering speed. I might be able to optimize my algorithm further, but without conceptual changes I would reach 10 FPS at most (rumor has it that this is the frame rate that UD runs at). As I have little hope of getting my renderer to run at the required 30 FPS, changing the storage format does not seem worth the effort.

One conceptual change that might be an option is point caching. However, I don’t like it, as it seems too much of a dirty hack to me and I have little hope that I can get it working properly. The conceptual change that I’m currently considering is to port my algorithm to the GPU. Due to the amount of recursion, that is not a trivial task. However, I’ve done some GPU programming before (using GLSL) and I think I might be able to change my algorithm such that it can run on the GPU. OpenCL seemed like the most obvious language to program it in; however, it is bulky and I have no experience with it. Fortunately, there happen to be some OpenGL extensions that make OpenGL a viable choice, such as ‘compute shaders’, which make a lot of GPGPU functionality directly available in OpenGL; ‘indirect rendering’, which eliminates costly round trips from the GPU to the CPU and back; and ‘atomic counters’, which I can use to fill a vertex buffer with a variable number of points or triangles.
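To make that a bit more concrete, here is a hypothetical outline of how those three extensions could work together on the host side. It assumes an OpenGL 4.3 context with the functions loaded through GLEW; the program handles, group count and buffer setup are placeholders, and the compute shader itself is omitted. The compute shader appends points to a buffer and bumps an atomic counter, the counter value is copied GPU-side into an indirect draw command, and the points are drawn without reading anything back to the CPU.

#include <GL/glew.h>
#include <cstddef>

struct DrawArraysIndirectCommand {
    GLuint count, instanceCount, first, baseInstance;
};

void render_pass(GLuint computeProgram, GLuint drawProgram,
                 GLuint counterBuf, GLuint pointBuf, GLuint indirectBuf,
                 GLuint groupsX) {
    // Let the compute shader traverse the octree and append points to pointBuf,
    // counting them with an atomic counter in counterBuf.
    glBindBufferBase(GL_ATOMIC_COUNTER_BUFFER, 0, counterBuf);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, pointBuf);
    glUseProgram(computeProgram);
    glDispatchCompute(groupsX, 1, 1);
    glMemoryBarrier(GL_ALL_BARRIER_BITS);

    // Copy the point count into the 'count' field of the indirect draw command,
    // entirely on the GPU.
    glBindBuffer(GL_COPY_READ_BUFFER, counterBuf);
    glBindBuffer(GL_COPY_WRITE_BUFFER, indirectBuf);
    glCopyBufferSubData(GL_COPY_READ_BUFFER, GL_COPY_WRITE_BUFFER,
                        0, offsetof(DrawArraysIndirectCommand, count), sizeof(GLuint));

    // Draw the variable number of points without a CPU round trip.
    glUseProgram(drawProgram);
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuf);
    glDrawArraysIndirect(GL_POINTS, 0);
}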

Benchmarks

As requested, here are some benchmarks. The dataset used is available at https://www.dropbox.com/s/zdy53f7ywfs3mls/sibenik.7z. The points are centered around (0x8000,0x8000,0x8000). The data format used is:

struct point {
    uint16_t x,y,z;  // position
    uint8_t r,g,b;   // color
    uint8_t dummy;   // padding
};

I’ve chosen 3 camera positions (using glm.hpp):

struct Scene {
    const char * filename;
    glm::dvec3 position;
    glm::dmat3 orientation;
};
static const Scene scene [] = {
    {"sibenik",  glm::dvec3( 0.0000,  0.0000,  0.0000),  glm::dmat3(-0.119, -0.430, -0.895,   0.249,  0.860, -0.446,   0.961, -0.275,  0.005)},
    {"sibenik",  glm::dvec3(-0.0462, -0.0302,  0.0088),  glm::dmat3(-0.275,  0.188,  0.943,   0.091,  0.981, -0.170,  -0.957,  0.039, -0.286)},
    {"sibenik",  glm::dvec3( 0.0500, -0.0320,  0.0027),  glm::dmat3(-0.231, -0.166, -0.959,  -0.815,  0.572,  0.097,   0.532,  0.803, -0.267)},
};

The positions are normalized, with -1.0 referring to 0x0 and 1.0 referring to 0x10000.

The view frustum is defined as in the code below, but without the near and far clipping. (See also the documentation for glFrustum.)

static const int SCREEN_WIDTH = 1024;
static const int SCREEN_HEIGHT = 768;
glFrustum(
  -SCREEN_WIDTH/2, SCREEN_WIDTH/2,    // left, right
  -SCREEN_HEIGHT/2, SCREEN_HEIGHT/2,  // bottom, top
  SCREEN_HEIGHT, SCREEN_WIDTH*2       // near, far (clipping not applied)
);

Results:

Hardware: Intel(R) Core(TM) i7-2670QM CPU @ 2.20GHz.

Making some progress

I’ve finally managed to implement the algorithm I described in my previous post. It is still rather slow (2 FPS at 1024×768 on average), with some frames taking up to 5 seconds. Also, my implementation does not yet always traverse nodes in the correct order, which causes some ordering errors.

Sibenik dataset