Now running ~10fps at 1024×768

I noticed that GCC’s vector extensions do not always translate efficiently to machine code, and that some machine instructions have no counterpart in the vector extensions. I therefore replaced the vector extensions with Intel intrinsics, optimizing the code along the way. This led to a quite impressive increase in rendering speed. The Intrinsics Guide was an invaluable resource for working with the arcane syntax of the intrinsics.
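To illustrate the kind of gap meant here, the following is a minimal sketch rather than code from the engine. It collapses a lane-wise comparison into a 4-bit mask, once with GCC's vector extensions and once with SSE2 intrinsics. The type name v4si and both function names are assumptions made for this example.

```c
#include <emmintrin.h>  /* SSE2: _mm_cmplt_epi32, _mm_castsi128_ps */

/* GCC/Clang vector extension type: four 32-bit signed integers. */
typedef int v4si __attribute__((vector_size(16)));

/* Vector-extension version: the comparison yields -1 or 0 per lane,
 * and the bitmask has to be assembled lane by lane. */
static int lt_mask_ext(v4si a, v4si b) {
    v4si lt = a < b;
    return (lt[0] & 1) | (lt[1] & 2) | (lt[2] & 4) | (lt[3] & 8);
}

/* Intrinsics version: _mm_movemask_ps gathers the four sign bits
 * into the low bits of an int in a single operation. */
static int lt_mask_sse(v4si a, v4si b) {
    __m128i lt = _mm_cmplt_epi32((__m128i)a, (__m128i)b);
    return _mm_movemask_ps(_mm_castsi128_ps(lt));
}
```

Both functions return the same mask. Whether a compiler turns the first one into a single movemask instruction depends on the compiler and its version, which is the uncertainty the intrinsics remove.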

I’ve changed the sign of x1 and y1 in the bound, dx, dy and dz parameters, which allowed me to merge the dltz and dgtz parameters into the frustum parameter. It also let me perform frustum occlusion checking using only 2 operations (_mm_cmplt_epi32 and _mm_movemask_ps). The movemask operation was also useful for reducing the number of operations in computing the furthest octant.
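The sign trick can be sketched as follows. This is an illustration under assumed data layouts, not the engine's actual representation: a box is packed as (x0, -x1, y0, -y1) and the window it is tested against as (x1, -x0, y1, -y0), so that all four overlap conditions become "less than" comparisons in the same direction. The function names are hypothetical.

```c
#include <emmintrin.h>  /* SSE2: _mm_setr_epi32, _mm_cmplt_epi32 */

/* Assumed layout: box [x0,x1] x [y0,y1] stored as (x0, -x1, y0, -y1). */
static __m128i pack_box(int x0, int x1, int y0, int y1) {
    return _mm_setr_epi32(x0, -x1, y0, -y1);
}

/* Assumed layout: window [x0,x1] x [y0,y1] stored as (x1, -x0, y1, -y0). */
static __m128i pack_window(int x0, int x1, int y0, int y1) {
    return _mm_setr_epi32(x1, -x0, y1, -y0);
}

/* The box overlaps the window iff
 *   box.x0 < win.x1, -box.x1 < -win.x0, box.y0 < win.y1, -box.y1 < -win.y0,
 * which is one compare followed by one movemask. */
static int overlaps(__m128i box, __m128i window) {
    __m128i lt = _mm_cmplt_epi32(box, window);
    return _mm_movemask_ps(_mm_castsi128_ps(lt)) == 0xF;
}

/* movemask also turns the signs of (dx, dy, dz) into a 3-bit index,
 * which is one way an octant number could be derived from a direction. */
static int octant_from_signs(__m128 d) {
    return _mm_movemask_ps(d) & 7;
}
```

Without the negated components, two of the four conditions would need a "greater than" comparison, so a single compare could not cover all lanes.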

I wanted to use _mm_blend_ps for computing the new values of bound, dx, etc. However, its mask parameter has to be a compile-time constant, which forced me to unroll the quadtree traversal loop. This turned out to result in another significant speedup. Unrolling the loops for the octree traversal also improved the rendering speed.
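The constraint on _mm_blend_ps can be illustrated with a small sketch. It assumes a bound stored as (x0, x1, y0, y1) and computes the bounds of the four quadrants by blending the bound with its midpoint. The layout, the function name child_bounds and the SSE41 macro are assumptions for this example, and the engine's real parameters will differ.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_blend_ps */

/* GCC/Clang attribute so this function builds without -msse4.1. */
#define SSE41 __attribute__((target("sse4.1")))

/* bound is (x0, x1, y0, y1); out[q] receives the bound of quadrant q. */
SSE41 static void child_bounds(const float bound[4], float out[4][4]) {
    __m128 b = _mm_loadu_ps(bound);
    /* Swap lanes pairwise to get (x1, x0, y1, y0), then average. */
    __m128 swapped = _mm_shuffle_ps(b, b, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 mid = _mm_mul_ps(_mm_add_ps(b, swapped), _mm_set1_ps(0.5f));

    /* A set mask bit takes that lane from mid, a clear bit from b.
     * Each mask is a literal, so the loop over quadrants is unrolled;
     * a runtime mask[q] is not accepted as the third argument. */
    _mm_storeu_ps(out[0], _mm_blend_ps(b, mid, 0xA)); /* (x0, mx, y0, my) */
    _mm_storeu_ps(out[1], _mm_blend_ps(b, mid, 0x9)); /* (mx, x1, y0, my) */
    _mm_storeu_ps(out[2], _mm_blend_ps(b, mid, 0x6)); /* (x0, mx, my, y1) */
    _mm_storeu_ps(out[3], _mm_blend_ps(b, mid, 0x5)); /* (mx, x1, my, y1) */
}
```

Besides satisfying the immediate-operand requirement, the unrolled form gives the compiler four independent blends with no loop counter or mask lookup, which may be part of why unrolling helped.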

While the graphical output has not changed since I posted the first benchmarks, the rendering time has decreased quite a bit. Currently it runs the benchmarks at:
Test 1: 109.49ms, 3188432 traversals.
Test 2: 101.27ms, 2959452 traversals.
Test 3: 133.44ms, 3736954 traversals.

Using intrinsics instead of vector extensions is also an important step towards porting my rendering engine to the MSVC compiler.


2 thoughts on “Now running ~10fps at 1024×768”

    euclideon released a web demo of their software. check it out

    btw. nice work! i tried porting my algorithm to gpu with very poor results. my cpu renderer was a lot faster. but i have never done gpu programming before, so probably it's due to my lack of skill.

    • Thanks for the link. Unfortunately, they don’t support Linux.

      I would like to port my algorithm to the GPU as well. However, GPUs only partially support branching and don’t support recursion, so I have to find a way to linearize my algorithm. As the traversal has a finite depth, I could implement it with a fixed-size stack, though these are also known to perform poorly on GPUs.
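For reference, replacing recursion with a fixed-size stack could look roughly like the sketch below. It is plain C rather than GPU code, and the Node type, MAX_DEPTH and count_leaves are hypothetical names chosen for illustration. They are not taken from the renderer.

```c
#include <stdint.h>

#define MAX_DEPTH 16                    /* assumed maximum tree depth */
#define STACK_SIZE (7 * MAX_DEPTH + 1)  /* each pop pushes at most 8 nodes */

/* Octree node stored in a flat array; child index -1 means "no child". */
typedef struct { int32_t child[8]; } Node;

/* Depth-first traversal without recursion: counts nodes with no children. */
static int count_leaves(const Node *nodes, int32_t root) {
    int32_t stack[STACK_SIZE];
    int top = 0;
    int leaves = 0;
    stack[top++] = root;
    while (top > 0) {
        const Node *n = &nodes[stack[--top]];
        int has_child = 0;
        for (int i = 0; i < 8; i++) {
            if (n->child[i] >= 0) {
                stack[top++] = n->child[i];
                has_child = 1;
            }
        }
        leaves += !has_child;
    }
    return leaves;
}
```

Because every pop removes one entry and adds at most eight, the stack never holds more than 7 * depth + 1 entries, so its size can be fixed at compile time.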
