I noticed that GCC’s vector extensions do not always translate to efficient machine code, and that some machine instructions have no equivalent in the vector extensions at all. I therefore decided to replace the vector extensions with Intel intrinsics, optimizing the code along the way. This led to a quite impressive increase in rendering speed. The Intrinsics Guide was an invaluable resource for working with the arcane syntax of the intrinsics.
I’ve changed the sign of x1 and y1 in the dz parameters, which allowed me to merge the dgtz parameters into the frustum parameter. It also allowed me to perform frustum occlusion checking using only 2 operations (including _mm_movemask_ps). The movemask operation was also useful for reducing the number of operations in computing the furthest octant.
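To illustrate why flipping the sign of x1 and y1 helps, here is a minimal sketch of a two-operation rejection test. The lane layout, the function names, and the choice of _mm_cmplt_ps as the comparison are assumptions made for this example, not the engine's actual code. With the upper bounds stored negated, all four edge tests become the same "less than" comparison, so one compare plus one movemask covers them. The second function shows one possible way a sign mask could index an octant, again purely as an illustration.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_cmplt_ps, _mm_movemask_ps */

/* Hypothetical layout: the view rectangle is stored as (x0, y0, -x1, -y1). */
static __m128 make_frustum(float x0, float y0, float x1, float y1)
{
    return _mm_setr_ps(x0, y0, -x1, -y1);
}

/* Hypothetical layout: a node's rectangle is stored as (nx1, ny1, -nx0, -ny0),
   so each lane lines up with the frustum edge it has to be tested against. */
static __m128 make_node(float nx0, float ny0, float nx1, float ny1)
{
    return _mm_setr_ps(nx1, ny1, -nx0, -ny0);
}

/* Returns 1 when the node lies entirely beyond at least one frustum edge.
   Lane 0: nx1 < x0, lane 1: ny1 < y0, lane 2: -nx0 < -x1, lane 3: -ny0 < -y1. */
static int outside_frustum(__m128 node, __m128 frustum)
{
    return _mm_movemask_ps(_mm_cmplt_ps(node, frustum)) != 0;
}

/* Assumed octant numbering: bit 0, 1, 2 is set when the x, y, z component
   of dir is negative. movemask simply collects the three sign bits. */
static int octant_from_signs(__m128 dir)
{
    return _mm_movemask_ps(dir) & 7;
}
```

A node that only partially overlaps the view is not rejected, because none of its lanes compares as less than the corresponding frustum lane.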
I wanted to use _mm_blend_ps for computing the new values of dx, etc. However, its mask parameter has to be a compile-time constant, which forced me to unroll the quadtree traversal loop. This turned out to give another significant speedup. Unrolling the loops for the octree traversal also improved the rendering speed.
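The constraint on _mm_blend_ps can be sketched as follows. The bounds layout (x0, y0, x1, y1), the function split_quad, and the SSE41_FN macro are hypothetical and only serve the example. The point is that the third argument is an immediate, so a loop passing masks[i] would not compile. Writing out the four children by hand gives each call its own literal mask.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_blend_ps (also pulls in the SSE headers) */

/* Lets GCC and Clang compile this function without -msse4.1 on the command
   line. Other compilers are assumed to need no attribute. */
#if defined(__GNUC__)
#define SSE41_FN __attribute__((target("sse4.1")))
#else
#define SSE41_FN
#endif

/* Splits a rectangle stored as (x0, y0, x1, y1) into its four quadrants.
   Child index bit 0 selects the upper x half, bit 1 the upper y half. */
SSE41_FN static void split_quad(__m128 bounds, __m128 out[4])
{
    /* swapped = (x1, y1, x0, y0), so mid = (mx, my, mx, my). */
    __m128 swapped = _mm_shuffle_ps(bounds, bounds, _MM_SHUFFLE(1, 0, 3, 2));
    __m128 mid = _mm_mul_ps(_mm_add_ps(bounds, swapped), _mm_set1_ps(0.5f));

    /* Each set mask bit takes that lane from mid instead of bounds.
       The masks are literals because the instruction encodes them. */
    out[0] = _mm_blend_ps(bounds, mid, 0xC);  /* replace x1, y1 */
    out[1] = _mm_blend_ps(bounds, mid, 0x9);  /* replace x0, y1 */
    out[2] = _mm_blend_ps(bounds, mid, 0x6);  /* replace y0, x1 */
    out[3] = _mm_blend_ps(bounds, mid, 0x3);  /* replace x0, y0 */
}
```

Besides satisfying the immediate-operand requirement, the unrolled form removes the loop counter and the mask lookup, which may be part of why unrolling paid off.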
While the graphical output has not changed since I posted the first benchmarks, the rendering time has decreased quite a bit. Currently it runs the benchmarks at:
Test 1: 109.49ms, 3188432 traversals.
Test 2: 101.27ms, 2959452 traversals.
Test 3: 133.44ms, 3736954 traversals.
Using intrinsics instead of vector extensions is also an important step towards porting my rendering engine to the MSVC compiler.