Next steps

It's been a while since my last post. I've been working on changing the octree storage format to obtain better compression. The current format stores 8 child pointers per internal node (leaf nodes are not stored), using the following 64-byte struct:

struct octree {
    uint32_t child[8];    // child pointers, one per octant
    int32_t avgcolor[8];  // average color of each child
};

However, this is a rather wasteful format if nodes have on average four children or fewer. Therefore I considered changing it to the following variable-length struct:

struct octree {
    uint32_t bitmask: 8;  // bit i is set if a child exists in octant i
    uint32_t color  :24;  // 24-bit color of this node
    uint32_t child[0];    // one pointer per existing child
};

This struct uses only 4 bytes per node plus 4 bytes per child pointer. With this, the data file size decreases by about 60% (I originally wrote 75% here, but that was wrong; see the comments below). The drawback is that the traversal code becomes more complicated, as the children still need to be traversed in front-to-back order, which depends on the camera orientation.
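To illustrate the lookup (a sketch with made-up helper names, assuming a child pointer is a byte offset into a memory-mapped node pool, which is an assumption rather than the final format): a child's position in the array is the number of existing children in lower octants, i.e. a popcount over the low bits of the bitmask.

static const struct octree *get_child(const struct octree *node,
                                      const char *pool, unsigned octant) {
    if (!(node->bitmask & (1u << octant))) return 0;  // no child in this octant
    // Array index = number of set bitmask bits below this octant (GCC builtin).
    unsigned index = __builtin_popcount(node->bitmask & ((1u << octant) - 1u));
    return (const struct octree *)(pool + node->child[index]);
}

Front-to-back order can then be obtained by looping octant from 0 to 7 and visiting octant ^ mirror, where mirror has one bit per axis, set when the camera looks along that axis in the negative direction. Whether that is how the final traversal will work is still open.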

Currently my voxel renderer reliably outputs 5 FPS. This is nowhere near the 30 FPS that I would require for an interactive game, and changing the octree file format is not going to change that. Even worse, it might actually reduce the rendering speed. I might be able to further optimize my algorithm, but without conceptual changes I would reach 10 FPS at best (rumor has it that this is the frame rate UD runs at). As I have little hope of getting my renderer to the required 30 FPS, changing the storage format does not seem worth the effort.

One conceptual change that might be an option is point caching. However, I don't like it, as it seems too much of a dirty hack to me and I have little hope that I can get it working properly. The conceptual change that I'm currently considering is to port my algorithm to the GPU. Due to the amount of recursion, that is not a trivial task. However, I've done some GPU programming before (using GLSL) and I think I might be able to change my algorithm such that it can run on the GPU. OpenCL seemed like the most obvious language to program it in; however, it is bulky and I have no experience with it. Fortunately, there happen to be some OpenGL extensions that make OpenGL a viable choice: 'compute shaders', which make a lot of GPGPU functionality directly available in OpenGL; 'indirect rendering', which eliminates costly round trips from GPU to CPU and back; and 'atomic counters', which I can use to fill a vertex buffer with a variable number of points or triangles.
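As a rough sketch of how these three extensions could fit together (my own guess at the plumbing, assuming OpenGL 4.3 and a matching function loader; compute_prog, draw_prog and points_vao are hypothetical handles): one buffer serves both as the atomic counter that the compute shader increments and as the indirect draw command, so the point count never travels back to the CPU.

typedef struct { GLuint count, instanceCount, first, baseInstance; } DrawCmd;

void dispatch_and_draw(GLuint cmd_buf, GLuint compute_prog,
                       GLuint draw_prog, GLuint points_vao, GLuint groups) {
    // Reset the draw command: zero points, one instance, starting at vertex 0.
    DrawCmd init = {0, 1, 0, 0};
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, cmd_buf);
    glBufferData(GL_DRAW_INDIRECT_BUFFER, sizeof init, &init, GL_DYNAMIC_DRAW);

    // Compute pass: the shader calls atomicCounterIncrement() on binding 0
    // to claim a slot in the point vertex buffer (written through an SSBO).
    glBindBufferBase(GL_ATOMIC_COUNTER_BUFFER, 0, cmd_buf); // 'count' sits at offset 0
    glUseProgram(compute_prog);
    glDispatchCompute(groups, 1, 1);
    glMemoryBarrier(GL_COMMAND_BARRIER_BIT | GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT);

    // Draw pass: glDrawArraysIndirect reads the point count from the buffer.
    glUseProgram(draw_prog);
    glBindVertexArray(points_vao);
    glDrawArraysIndirect(GL_POINTS, 0);
}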


34 thoughts on “Next steps”

  1. Don't you think that a better data representation will result in better performance? Yes, there is a drawback if decompression makes the traversal take longer, but I think at the moment the data is simply too big. Making it smaller means more of it fits in cache, which should increase performance. By the way, I think your data is still big. Can you post a memory comparison between your old and new format for the cathedral from the benchmarks?

    • The Sibenik dataset (the Cathedral) has 75,232,796 points, stored using 10 bytes per point.
      With my original octree format that becomes 1640 MiB, i.e. 22.9 bytes per point (I originally misstated this as 92.4).
      The new format uses 769 MiB, i.e. 10.7 bytes per point (originally misstated as 45.4), a reduction of 53%.

      For the tower dataset (152 million points) the file size is reduced from 2354 MiB to 632 MiB, which is a reduction of 73%.

      For the mounaloa dataset (136 million points) the file size is reduced from 2084 MiB to 863 MiB, which is a reduction of 59%.

      Apparently I did not remember the 75% reduction correctly. It is more like 60%.

      • Yes, I meant the Sibenik data. For the tower it's close to 75%, and even 60% is awesome. Big thanks for sharing this. I have also thought about the data representation, and I think I can bring the Sibenik down to about 430 MB using a linkless octree. Could you please also post how deep your octree is? I mean, the more depth, the more data, right?

      • For the Sibenik dataset, the depth is 16, though effectively only 13 tree layers are used.

        Also, my conversion algorithm ensures that each layer has between 2 and 8 times as many nodes as the layer above. The layer sizes therefore form a geometric series (1 + 1/2 + 1/4 + … = 2), so the octree has at most twice as many nodes as there are points. Hence the octree file uses at most 16 bytes per point: half of this is used for storing pointers, and almost half for color.

        P.S. the ‘bytes per point’ in my previous comment were wrong.

  2. If you are interested in bounds on file size: I've compressed the Sibenik dataset into a LASzip file, which takes only 129 MiB. LASzip is a losslessly compressed LAS file format; find more information at http://www.laszip.org/

    • ine, I don't know if that is applicable here. bcmpinc and I are talking about the data representation in RAM, not about the required disk space. Of course, if you save the data on your hard drive, you can compress it. But if you want to do something with the data, you need to decompress it first (at least somehow). That process is probably slow, and again, that's not quite the problem. The moment your data is very small and usable (uncompressed), a lot of points can be cached in faster memory (L1, L3, registers, …) and are therefore accessed faster.
      At least that's how I see this.

      Even if we used LASzip somehow, it's probably just the plain point data, with no octree/hierarchy/data structure involved. If there were a special way to compress an octree with fast decompression, that would be helpful.

      • I understand the point. There were concerns previously about the storage size for such data as well; that's where LASzip is relevant. Also, it preserves the order of the points, so with clever ordering it may be possible to reconstruct an octree very quickly when loading data from LASzip. Indeed, it doesn't use any hierarchical structures internally, just clever compression. This also suggests that if one uses an octree and clever compression, the data size on disk may go down even more.

        As for the in-memory representation, the plain octree may be the best. I don't think that a linkless octree is useful, because it has significant traversal overhead if you want to visit nodes in arbitrary order, which is the case in front-to-back splatting. But consider also that you need to stream the data from disk. It may be faster to load highly compressed data and decompress it than to load uncompressed data.

  3. Yes, you are right. Especially for future streaming, compression is useful. Still, I don't know if it's viable to decompress smaller parts on the fly. The fact that the order is preserved is very interesting, especially when a linkless octree is used.

    Maybe one could add the hierarchy as simple points too and compress it together with the actual leaf points…

    • Interesting. It seems they lied about not using divisions for perspective transformations. They do use them, until they switch to orthogonal projection. My algorithm only uses divisions for the root node and from that point on uses only additions, subtractions and bitshifts, while still doing perspective projection. Someone suggested switching to orthogonal projection, but I believe it made my algorithm more complicated and slowed it down.
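      To sketch the division-free idea in 1D (an illustration of the principle, not the actual traversal code, and the names are made up): track the signed side s = x − u·z of a node corner (x, z) relative to the view ray through screen coordinate u. The products are computed once at the root; after that, halving the screen interval's half-width w also halves w·z, so each subdivision needs only an add and a shift. Descending the octree halves the corner offsets, which again costs only shifts and adds.

      typedef struct { int64_t s, d; } side_t; // s = x - u*z, d = w*z

      // Subdivide the screen interval [u-w, u+w]: the child intervals have
      // half-width w/2, so the new s follows with one add/sub and one shift.
      static side_t left_half (side_t p) { return (side_t){ p.s + (p.d >> 1), p.d >> 1 }; }
      static side_t right_half(side_t p) { return (side_t){ p.s - (p.d >> 1), p.d >> 1 }; }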

      For occlusion culling they seem to use a bitmask, and when checking whether the bounding box of the current octree node is occluded, they iterate over all the pixels in the bounding box to see if they've been rendered. I briefly considered this approach, but would have implemented it using a Fenwick tree, as that reduces the query time from O(n²) to O(log² n), though update times increase from O(1) to O(log² n). The occlusion method I've implemented uses a hierarchical bitmask (i.e. a quadtree), and my traversal function traverses the octree and quadtree simultaneously.
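      A minimal sketch of such a hierarchical bitmask (a guess at the general shape; the heap-style node indexing is an assumption, not my renderer's exact layout): each quadtree node holds a 4-bit mask of fully occluded quadrants, and a node that becomes full sets its own bit in its parent, so a covered region is rejected at the highest possible level.

      static uint8_t mask[1 << 20]; // room for a 10-level quadtree of 4-bit masks

      // Mark quadrant q (0-3) of node n as fully occluded. While a node's mask
      // becomes complete (0xF), propagate upward: children of node n are
      // 4n+1 .. 4n+4, so the parent of n is (n-1)/4.
      static void mark_occluded(uint32_t n, unsigned q) {
          for (;;) {
              mask[n] |= 1u << q;
              if (mask[n] != 0xF || n == 0) break;
              q = (n - 1) & 3;   // which quadrant n occupies in its parent
              n = (n - 1) >> 2;  // parent index
          }
      }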

      It seems that the octree, the octree traversal and the octree node bounding boxes are the only things my algorithm has in common with Euclideon's.

      • Yeah, I also think they left out the interesting parts; I don't believe that's all there is to it. My algorithm also does perspective projection only for the 8 root vertices. From there it's only adding and bitshifting. The occlusion culling could be faster than hi-Z; I have to think that through. My algorithm is still a lot slower…

        By the way, that patent is horribly written. I read a lot of scientific papers, but this one is very hard to read. Not because of what it says, but how it's written. At one point I thought there was something wrong with the document because of the repetitions.
        I wouldn't be surprised if this was a hoax, lol.

      • Don't forget that scientific papers are generally written to be reproducible, while the goal of a patent is to not be reproducible. With that in mind, I find the patent quite readable, especially the flowchart.

        Don't worry about the frame rate. If you get 10 fps at 640×480 using only a single CPU core, you are doing just as well as they do.

      • OK, I only read the description on Google. There were no images.

        Why is 10 fps at 640×480 as good as the UD algorithm?

      • Their algorithm can render at a 1024×768 resolution at slightly more than 10 frames per second per CPU core. It can make use of multiple CPU cores to improve rendering speed. However, their octree splatting algorithm renders the octree node bounding boxes as quads with a minimum size of 2×2 pixels. That allows them to reduce the octree traversal depth, at the cost of a much lower image quality. I estimate that the image quality is comparable to that of a properly rendered 640×480 scene. The 2×2 point size also causes very noticeable deformations.

  4. And it seems they limit the FOV as well. Instead of the more usual 90 degrees, it is more like 60 to 70 degrees.

  5. So with all the information we have now, what do you think about their demos (UD Island)? It's at a higher resolution with a lot more than 10 fps.

    What's their secret? It must have something to do with the tree data format.

  6. IMHO: it's upscaled from a lower resolution in the same way as Geoverse, which is the reason why they never showed a single lossless screenshot. In addition, they use multiple cores.

    • Sounds plausible. That would be a very cheap trick. Well, they never said the videos show the native resolution…
      But they always emphasized that it runs in a single thread, right? At least that's what I remember.
      It's sad that it's some kind of hoax after all.

  7. Glad to see people are still thinking about this. A session with IDA Pro could settle any doubts re: the finer points of Euclideon's algorithm 😉 It seems they may have misrepresented some claims, but no doubt we will continue to explore CPU-based approaches regardless.

  8. Hi Bauke, I found something of interest for you if you wish to port your recursive code to GLSL:
    http://thomasdiewald.com/blog/?p=1825
    His implementation is a kd-tree, GL ES compatible, and he has turned the recursive problem into an iterative one, and he is getting great throughput as a consequence.

  9. @batman Yes, I think pop-in is inevitable with any kind of streaming system, be it a fast bus like PCIe or a slow one like the internet. Personally, I think pop-in ruins immersion a little, but not as much as frame stalls or general lagginess would.

    @bcmpinc I'm going to carry on hacking on your original design; it's running like a treat on my machine with a few tweaks, a steady 30-40 fps, and if I move the Morton decode onto the GPU I will get another 5 fps out of it. I saw on GitHub that you made some changes to the quadtree and tried vectorizing too. Did it not work out as you anticipated? Switching to pixel buffer objects has made a big improvement speed-wise, with the benefit that the quadtree and final cubemap resolution are independent now. I'm going to try my linear octree code on it next; I'll let you know if it was worth splitting the hot/cold data. According to Jeroen Baert ( http://www.forceflow.be/ ) it was with his octree system.

    • I tried removing every other layer in the quadtree (such that each node has 16 children instead of 4), hoping that the reduced tree depth would improve performance. It did not. Neither did the vectorization.

      I keep being stuck at 10 fps, so how did you reach 30? Can you share your changes? And how did you move the Morton decode to the GPU?

      • I can share; I'll put it on GitHub. A quick summary of the changes:

        Changed the quadtree recursion order in build and render to 6754 instead of 5467.
        Moved the Morton decode one layer up in traverse (~1 ms saved, not a big win).
        Added some early-out logic in the octree recursion part of traverse, using bool masks. Saves a lot of traverse calls.
        Changed glTexImage2D to glTexSubImage2D, and used the top level of the quadtree for each face to send only the portions of each face that have changed (the top level splits each face into 4). Saves a little bandwidth; a sketch follows below.
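        Roughly like this (a sketch with hypothetical names, assuming a square RGBA face stored row-major on the CPU side): only quadrants flagged dirty are sent, using the unpack parameters to address a sub-rectangle of the source buffer.

        void upload_face(GLenum face, GLsizei size, const void *pixels, unsigned dirty) {
            GLsizei half = size / 2;
            glPixelStorei(GL_UNPACK_ROW_LENGTH, size);
            for (int q = 0; q < 4; q++) {
                if (!(dirty & (1u << q))) continue;       // quadrant unchanged, skip it
                GLint x = (q & 1) * half, y = (q >> 1) * half;
                glPixelStorei(GL_UNPACK_SKIP_PIXELS, x);  // sub-rectangle origin in
                glPixelStorei(GL_UNPACK_SKIP_ROWS, y);    // the source buffer
                glTexSubImage2D(face, 0, x, y, half, half,
                                GL_RGBA, GL_UNSIGNED_BYTE, pixels);
            }
            glPixelStorei(GL_UNPACK_ROW_LENGTH, 0);       // restore defaults
            glPixelStorei(GL_UNPACK_SKIP_PIXELS, 0);
            glPixelStorei(GL_UNPACK_SKIP_ROWS, 0);
        }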

        What I plan to do next with the Morton decode (yes, it is eating a lot of time) is to do the decode in the shader, for example: http://glsl.heroku.com/e#9262.1
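        For reference, the usual bit trick for that decode looks like this in C (the standard 'compact every third bit' sequence; not necessarily what the linked shader does):

        // Gather bits 0,3,6,... of a 30-bit Morton code into the low 10 bits.
        static uint32_t compact_bits_3(uint32_t m) {
            m &= 0x09249249;
            m = (m ^ (m >> 2))  & 0x030c30c3;
            m = (m ^ (m >> 4))  & 0x0300f00f;
            m = (m ^ (m >> 8))  & 0xff0000ff;
            m = (m ^ (m >> 16)) & 0x000003ff;
            return m;
        }

        // x, y and z occupy bits 0,3,6,..., bits 1,4,7,... and bits 2,5,8,...
        static void morton_decode(uint32_t code, uint32_t *x, uint32_t *y, uint32_t *z) {
            *x = compact_bits_3(code);
            *y = compact_bits_3(code >> 1);
            *z = compact_bits_3(code >> 2);
        }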

        Here are the results so far:
