Benchmarks

As requested, here are some benchmarks. The dataset used is available at https://www.dropbox.com/s/zdy53f7ywfs3mls/sibenik.7z. The points are centered around (0x8000,0x8000,0x8000). The data format used is:

#include <cstdint>

struct point {
    uint16_t x, y, z; // position; the dataset is centered around (0x8000, 0x8000, 0x8000)
    uint8_t r, g, b;  // color
    uint8_t dummy;    // padding
};
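
For those who want to parse it, the extracted file should just be a raw array of these records; a minimal loading sketch (assuming no header and matching struct padding):

#include <cstdio>
#include <vector>

std::vector<point> load_points(const char* filename) {
    std::vector<point> points;
    if (FILE* f = std::fopen(filename, "rb")) {
        point p;
        while (std::fread(&p, sizeof(point), 1, f) == 1)
            points.push_back(p);
        std::fclose(f);
    }
    return points;
}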

I’ve chosen 3 camera positions (uses glm.hpp):

struct Scene {
    const char * filename;
    glm::dvec3 position;
    glm::dmat3 orientation;
};
static const Scene scene [] = {
    {"sibenik",  glm::dvec3( 0.0000,  0.0000,  0.0000),  glm::dmat3(-0.119, -0.430, -0.895,   0.249,  0.860, -0.446,   0.961, -0.275,  0.005)},
    {"sibenik",  glm::dvec3(-0.0462, -0.0302,  0.0088),  glm::dmat3(-0.275,  0.188,  0.943,   0.091,  0.981, -0.170,  -0.957,  0.039, -0.286)},
    {"sibenik",  glm::dvec3( 0.0500, -0.0320,  0.0027),  glm::dmat3(-0.231, -0.166, -0.959,  -0.815,  0.572,  0.097,   0.532,  0.803, -0.267)},
};


The positions are normalized, with -1.0 referring to 0x0 and 1.0 referring to 0x10000.
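
In code, converting a scene position to dataset coordinates is a one-liner (a sketch):

glm::dvec3 world = (scene[0].position + 1.0) * double(0x8000);  // -1.0 maps to 0x0, 1.0 to 0x10000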

The view frustum is defined as in the code below, but without the near and far clipping. (See also the documentation for glFrustum.)

static const int SCREEN_WIDTH = 1024;
static const int SCREEN_HEIGHT = 768;
glFrustum(
  -SCREEN_WIDTH/2, SCREEN_WIDTH/2, 
  -SCREEN_HEIGHT/2, SCREEN_HEIGHT/2, 
  SCREEN_HEIGHT, SCREEN_WIDTH*2
); 
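
For reference, this frustum implies a horizontal FOV of roughly 67° and a vertical FOV of roughly 53°, which can be checked as follows (illustration only, not part of the renderer):

#include <cmath>

const double pi = 3.14159265358979;
double hfov = 2.0 * std::atan2(SCREEN_WIDTH / 2.0, (double)SCREEN_HEIGHT) * 180.0 / pi;   // ~67.4 degrees
double vfov = 2.0 * std::atan2(SCREEN_HEIGHT / 2.0, (double)SCREEN_HEIGHT) * 180.0 / pi;  // ~53.1 degrees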

Results:

Hardware: Intel(R) Core(TM) i7-2670QM CPU @ 2.20GHz.


53 thoughts on “Benchmarks”

  1. Yes, sure.
    For the moment I’ve got a problem: it seems that we have different (0, 0, 0) positions. Mine is somewhat higher than yours, a little above the level of the large painted window, near the ceiling. I don’t understand how this could have happened 🙂 Even if we use a totally different interpretation of x,y,z, the objects should stay at the same distance from the center… right?
    The octree itself is freshly rebuilt from the same file as in the Dropbox.
    In order to test if my render is ok, I’ve added a black point at (32768, 32768, 32768). You can see it in this picture, it’s very close to the camera: http://s22.postimg.org/3m94rn769/zero.jpg
    Could you check again if the positions are right please?

      • Ah, I see! Yes, then it’s fine, although it seems a FOV specification is also required. Unfortunately, my code is rigid in that respect; it only supports 90 degrees. My frustum is also a little shifted for technical reasons, but compared to the FOV difference that’s not so important.

        Anyway, here are the results, in the order of scenes in the code snippet:
        Test 1: 149.8ms, http://postimg.org/image/v3utpwy6l/
        Test 2: 145.8ms, http://postimg.org/image/cd4o5pjmr/
        Test 3: 174.5ms, http://postimg.org/image/x8tfflkw3/

        All done in a single thread of Core i7 3630QM @ 2.4 GHz, times measured all-inclusive, frame-to-frame. I don’t count the traversals currently; I would like to clarify what counts as a traversal. I have several steps of rejection for a node, some of which, like frustum culling, are very fast. So if I count every node visit as a traversal it wouldn’t give a complete picture. Perhaps a better checkpoint is when divisions are performed, or even when I decide to visit the node’s children?

        Overall, it seems that I go at least one level deeper in the tree. This is especially evident in the first screenshot: in your picture there is some color bleeding on the window from grayish voxels on the other side, while in my screenshot there is none, even though the window itself is smaller due to the larger FOV. I can tell you about some of the tricks I use – there is also some postprocessing involved to enhance the picture, for example.

      • For traversals, I count the number of times that my traversal function is called. The occlusion checks are done prior to calling. So I guess this is similar to when you decide to visit the node’s children.

        It seems that I’m stuck at 150ms. I can think of some possible optimizations, but I don’t expect an improvement of 10ms or more from any of them, while I’m aiming at 33ms. Your image quality is definitely better than mine, but as long as the render time is as high as it currently is, I’m not encouraged to improve it.

        While there are about 0.1% cache misses and 14% branch mispredictions, the greatest problem is simply the amount of traversals. However, I don’t see a way to reduce that significantly.

      • Here are traversal counts for my tests:
        1. 1784661
        2. 1768067
        3. 1911440

        I count a node as traversed after its occlusion check is passed.

  2. wow, awesome results! im still facing problems with my implementation, but i hope i can post my results soon. i tried to stick as close as i could to the opengl conventions (coordinate system, matrices, …).
    btw. thx ine, for the model.

  3. For those who are trying to set the matrices, here’s the code which finally worked for me (using glm’s gtx/transform.hpp for scale and translate):
    auto& testPos = scene[2];
    camera.position = (testPos.position + 1.0) * double(0x8000);
    camera.view = glm::mat4(glm::scale(glm::dvec3(-1.0, -1.0, 1.0))
                * glm::dmat4(-testPos.orientation)
                * glm::translate(-camera.position));

    I use unscaled coordinates. Note that the camera.view matrix already includes the translation, so camera.position is redundant.

    • I have various models, each with a different scale. Also, the scale is not stored in the model. Therefore the root node of each model is considered to have the same size.

      The coordinate system that I use is odd, as my camera looks down the positive z-axis, instead of the more common negative z-axis.

  4. 175ms: http://i.imgur.com/5zajAfv.jpg . Bookkeeping at that resolution is a real memory hog.
    45ms: http://i.imgur.com/Y8qMSE1.png . At this mip level things run really well, but image quality is poor.
    28ms: http://i.imgur.com/88c27v6.png . 1000-yard view; could maybe be used for impostors with a little tweaking.
    1 thread hashing, 1 thread rendering, on an AMD A8.
    The renderer is bottlenecked by the unordered map. I’m thinking about switching to cylinder mapping, so I can ditch the unordered map for a flat array and a cos() lookup table. Ditched the Bloom filter; it is faster to just rehash and use a conditional move.

      • Yes, 1 thread is doing the work that the server will be doing (finding candidate points and hashing them); the other thread does everything else. Is it cheating?

      • I don’t think it’s a question of cheating. We just need to realistically estimate the computational effort involved.

      • thats of course not cheating! but it would be interesting to see how your algorithm performs on a single thread; this way it could be compared better. it also would be nice to see the performance on something comparable to an i7 cpu.

  5. Lensman,

    Could you set the matrices as in Bauke’s post?
    Also, do you use unordered_map from the standard library? I suppose it’s somewhat inefficient; I think it uses linked lists to resolve collisions, at least in the MS implementation. Maybe it’s better to write your own specialized version.

    • It’s actually pretty good, and performs better than the boost version with default settings, but it has some strange quirks, such as adding records if you query it using operator[] instead of using the find semantics. I read a lot of comments about Morton codes and Z-curves; it sounds like a space-filling curve might be a better option. I’ll redo the benchmarks with proper matrices when I have time later.

      • That behavior of operator[] is intended; otherwise code like unordered_map map; map["key"]="value"; would not work. Furthermore, the function must return an object anyway. If you do not want an entry to be created, use find.
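
        A minimal illustration:

        #include <string>
        #include <unordered_map>

        int main() {
            std::unordered_map<std::string, std::string> map;
            map["key"] = "value";         // operator[] inserts a default-constructed entry if missing
            auto it = map.find("other");  // find() never inserts
            return it == map.end();       // here: "other" is not present
        }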

        The Z-curve is a space-filling curve. The Hilbert curve is slightly better; however, it is a lot more difficult to implement.
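
        For example, a straightforward (unoptimized) Morton encoding of the 16-bit coordinates used in this dataset could look like this (a sketch, not the code from my renderer):

        #include <cstdint>

        // Interleave the bits of x, y and z (16 bits each) into a 48-bit Morton code.
        uint64_t morton3(uint16_t x, uint16_t y, uint16_t z) {
            uint64_t code = 0;
            for (int i = 0; i < 16; i++) {
                code |= uint64_t((x >> i) & 1) << (3 * i);
                code |= uint64_t((y >> i) & 1) << (3 * i + 1);
                code |= uint64_t((z >> i) & 1) << (3 * i + 2);
            }
            return code;
        }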

  6. i think there is something fundamentally wrong. i also implemented this “octree to quadtree” approach. it is clever, but from what ive seen so far, it’s fairly common and known…
    (i get about the same results, ~6 fps, with nearly the same image quality).
    but it is nothing other than voxel rendering in my opinion. bruce dell claims his approach has nothing to do with voxels. also, i can’t think of an easy way to animate this, yet bruce says his system is capable of animations.

    we have 3 people who implemented this, with more or less the same results. i don’t think this can be improved further in terms of speed without special optimizations (according to bruce, no optimizations were made).

    • The no-optimization statement is vague. However, even if I were to do all the remaining (non-special) improvements that I can think of, I cannot get it faster than 10fps. A special improvement that I can think of is porting my algorithm to the GPU, which seems possible and might get it running at 30fps. There is also the option to change the scope of the algorithm and include point caching.

      Bruce’s algorithm has nothing to do with ray tracing. He is unclear about voxels, partially because voxels are ill-defined and partially because he wants to remain secretive. His algorithm is quite certainly based on octrees, in which case his renderer is a voxel renderer. Unless his algorithm does not render, but instead returns a selection of points (rectangles) that is then sent to the renderer; in some of the videos these rectangles are briefly visible.

      The details about the speed, resolution and render quality of Bruce’s algorithm are a bit fuzzy. Somewhere during development his rendering results changed from ‘thick like clay’ to their current quality. Due to the way my algorithm works, my results are also thick like clay. If I let my algorithm return a set of points, this can be solved.

      So, to get past 6fps, point caching seems to be required.

      Animation is easy, by replacing subtrees, but that no longer works when point caching is used. And note that it does not have to be octrees that are being rendered.

      • i read my post again and it sounds kinda harsh. sry.
        yes, there may be optimizations, but as you said, i don’t think they can improve a lot.
        i believe bruce dell meant he didn’t use any simd or whatnot… of course he probably uses some kind of caching/fast implementation.

        i guess you are right; i also thought about handing some rendering to the gpu. im convinced that’s what atomontage does, and it wouldn’t surprise me if the fps went a lot higher than 30.

        bruce explicitly says to “avoid that (voxels)”. there must be something that makes him think it’s not the common “voxels” he is using.
        if i use lower-resolution octrees (e.g. depth = 8) my images also look like “clay”, or at least like the blocky images Lensman posted before.

        i don’t know if octree/voxel animation is so easy. many people say it’s hard, so voxels are only for static content.

      • With ‘thick like clay’ I do not mean resolution, but how the octree nodes are rendered. Given a pixel p and an octree node n, we can render n in p if either:
        – the center of n is projected onto p; or
        – the projection of n intersects p.

        With the first option every node only renders a single pixel. With the second option a node can be rendered to multiple pixels, thereby making the model look thicker.
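
        In code, the two tests might look like this (an illustrative sketch with a made-up Box type, not my actual implementation):

        struct Box { double x0, y0, x1, y1; };  // screen-space bounding box of a projected node

        // Option 1: the center of n is projected onto pixel (px, py).
        bool center_hit(int px, int py, const Box& b) {
            double cx = (b.x0 + b.x1) / 2, cy = (b.y0 + b.y1) / 2;
            return int(cx) == px && int(cy) == py;
        }

        // Option 2: the projection of n intersects pixel (px, py).
        bool intersect_hit(int px, int py, const Box& b) {
            return b.x1 > px && b.x0 < px + 1 && b.y1 > py && b.y0 < py + 1;
        }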

  7. Bruce’s Algorithm:
    Linkless 1D Octree Traversal

    while (d++ < maxDepth)
      do
        if (map(i, _d) && !exclusion(i, _d))
          if (map(i, d))
            (_x, _y, _z) = pos(i, d)
            (x, y) = screen_pos(_x, _y, _z)
            if (!excluded(x, y))
              add_render_bucket(x, y, _z, i, d)
              add_exclusion(x, y, i, d)
            end
          end
        else
          i = i + size_exclusion(_d, d)
        end
      while (i < size(d))
      _d = d - 1
    end

    The above is extremely fast. exclusion is occlusion; map is a bit flag marking an octree node as empty/full (no leaf data).
    size_exclusion is the number of iterations at depth d per node at the current level; size is the number of nodes at depth d; add_render_bucket collects the points visible on screen ((i,d) is used to obtain the RGB from the HDD as temporary access). Keep in mind that the map/hit array is stored in RAM at load time; believe me, 1 bit per node is significantly small: about 30MB at a depth of 10. Past a depth of 10 the size grows exponentially, so I used RLE to decrease this at LOAD time (unpacked in RAM), though I haven't tried to implement the RLE traversal directly in my algorithm, which may be a practical solution depending on the complexity of the operations. Finally, excluded decides if (x,y) is occluded via the octree, using a similar traversal.
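
    A rough sketch of such a 1-bit-per-node map (illustrative only; assumes one flat bit array per depth level):

    #include <cstdint>
    #include <vector>

    // One bit per node, indexed by the node's position i within its depth level.
    struct NodeMap {
        std::vector<uint64_t> bits;  // sized to (node count + 63) / 64 words
        bool get(uint64_t i) const { return (bits[i >> 6] >> (i & 63)) & 1; }
        void set(uint64_t i)       { bits[i >> 6] |= uint64_t(1) << (i & 63); }
    };
    // A full octree level at depth d has 8^d nodes, i.e. 8^d / 8 bytes of flags.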

    don't need to take my word for it. just trying to help..

    • Also, for the last time: animation is done through color cycling. It really isn’t a mystery; animations drop performance by almost zero.

    • What FPS do you get with your algorithm? And can you give a screenshot, preferably of the Sibenik dataset (see my post)?

      Color cycling is cheap, but limited. It cannot animate the shape.

    • lol. it sounded like you tested it already.
      please accept my thoughts on what you posted:

      a linkless octree is a way to store multidimensional data efficiently, but it’s slower to traverse (at least that’s what i read, and it makes sense). even with a simple octree with links, “we” can’t achieve the performance of UD. so why even try a slower method?

      how do you handle projection? (x,y) = screen_pos(_x,_y,_z) ? project each point/node?

      how is the occlusion done? HiZ?

      it looks more or less like what bcmpinc is already doing. where exactly does it differ? only the (slower but smaller) octree representation?

      no criticism intended. i just want to understand how you think this works.

    • im a little confused. how does this work? i mean, how can you have a space-filling curve like morton/hilbert when you have sparse data? do you use the space-filling curve together with hashing?
      like, points are hashed with the morton code as the key?

      • A space-filling curve helps in this instance because it increases cache coherence when reading back the info that needs to be rendered. If the data is in order, then certain view directions will be reading data back-to-front, the worst case for cache coherence. Alternatively you can draw and scan in order for maximum throughput; this is what voxlap does by drawing floor, ceiling and walls separately.

    • lol. what a reaction to a few simple questions…
      you already posted pseudocode, so why don’t you implement it?
      because of your “other projects”? haha
      you said it’s simple. well, then do it! show us the results.
      i don’t think you thought this through. there are a few things you simply ignored.
      well, good luck…

  8. Hi all,

    I’m studying the code and have some questions.

    Why use a z-buffer? If the octree is traversed front to back, then when it “touches” a quadtree node it is the first one there, and no more calculation is needed for that region.

    Ine talked about fixing the FOV to 90° to accelerate the octree-in-quadtree check; can anyone show this approach?

    I tried to port the last version to Windows with gcc, but something with v4si and operator[] doesn’t work (tried with the latest version of gcc too). I’m also trying to port the octree format to one with fixed depth.

    I made an octree format of fixed depth with less than 8 bytes/voxel, where all nodes store a color/average color in 24 bits (8 bits wasted). But I use 32 bits for the Morton code, so 16 bits per dimension is not possible, and therefore I can’t convert Sibenik yet; I plan to do this conversion in the future.

    With this format the color parameter is unnecessary and can be removed; it is just the node’s index into another array.

    Batman, can you show us your implementation? If Gotham City is quiet, perhaps you have some free time for us.

    • The octree traversal is indeed from front to back, so it is sufficient to know whether a certain pixel has been rendered. When a pixel is rendered the corresponding quadtree leaf node is marked as rendered. Intermediary nodes are marked as rendered when all their child nodes are rendered.
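
      A minimal sketch of that marking logic (illustration only; it assumes a full quadtree stored as a flat array, which is not necessarily how my implementation stores it):

      #include <vector>

      // Quadtree occlusion mask: node 0 is the root, the children of n are 4n+1 .. 4n+4.
      struct OcclusionTree {
          std::vector<bool> rendered;  // sized to the total node count

          // Mark a leaf as rendered and propagate upwards:
          // a parent counts as rendered once all four of its children are.
          void mark(int node) {
              rendered[node] = true;
              while (node > 0) {
                  int parent = (node - 1) / 4;
                  int first = 4 * parent + 1;
                  if (!(rendered[first] && rendered[first + 1] &&
                        rendered[first + 2] && rendered[first + 3])) break;
                  rendered[parent] = true;
                  node = parent;
              }
          }
      };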

      I used to fix the camera to 90° angles and then used this to render the sides of a cubemap, which were then rendered to the screen. This has a few drawbacks: one has to compute which parts of the cubemap need to be rendered, and a pixel on the cubemap does not correspond to exactly one pixel on the screen. So far, my implementation with free camera rotation seems to be just as fast as the implementation with fixed 90° rotations.

      About the v4si error, can you tell the git-hash of the version you are compiling? And show the error?

      I’ve been working on reducing the octree file size to 8-16 bytes/voxel. However, this makes traversal a bit harder and I lost interest, because the rendering speed seems to be stuck at 150ms, while I’m aiming at 30ms.

      • Why not remove the code that marks the quadtree? It shouldn't be needed, because the nodes don't touch the same pixel.

        Yes, I know, but I set a 90-degree aperture while still allowing any look direction.

        I'll try again and report it on git more precisely; the old gcc complains about operator[], the new one has another problem.

        I think you have too many parameters in the function; this is why it is slow, as every call has to push and pop all the parameters.

      • The quadtree is needed, because different octree nodes can touch the same pixel.

        Reducing the number of function parameters does not improve the speed. The compiler is smart enough to push and pop as few parameters as possible.

    • the z-buffer is needed (here) because there is no other way to tell whether a node is occluded or not. the tree is traversed in front-to-back order, but how can you tell if a node is behind an already drawn node and can therefore be discarded?

      in my opinion, the fov of 90° mentioned by ine is because this way there is no need to do any float or tan operations. if you have a point in space at a distance z from the camera, you can look at the “frustum plane” at that distance. if the point is within that plane, it is visible. the moment you have a fov of 90°, no calculations are required for getting the plane: if the point is at z = -5, the plane is -5 to 5 (tan(45°) = 1). see: http://www.lighthouse3d.com/tutorials/view-frustum-culling/radar-approach-testing-points/
      (at least i think thats what ine meant)
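
      in code, such a 90° check needs only comparisons (a minimal sketch, assuming a camera looking down the negative z-axis):

      // With a 90° FOV, a point is inside the frustum when |x| <= -z and |y| <= -z.
      // No tan() calls or divisions are needed.
      bool inside_frustum(double x, double y, double z) {
          return z < 0 && x <= -z && -x <= -z && y <= -z && -y <= -z;
      }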

      i agree with kyle on one thing: you need something like a linkless octree. at all costs, you must make the octree as small as possible. this way most of the relevant data can be kept in cache and processed fast. my implementation uses a very simple octree with pointers etc., and the data requires a lot of memory; simply reading that amount of data takes long. if you perform any additional calculations, the framerate will drop. my implementation is a work in progress, so no showing yet 😉

      take a look at this video: http://www.youtube.com/watch?v=W-hOoBa1hxo
      you can find the source code in the description. i didn’t take a look at it yet, but it seems to do the job.

    • It is isometric, so it is no problem to render.
      Anyway, I have an algorithm that calculates a perspective projection without division… very beautiful.

      #include <stdio.h>

      /* Individual 3-D to 2-D conversion: a binary search, no division.
         s < 32; the screen is 2^s units wide. */
      int I3D2D3(int x, int y, int z, int s)
      {
        int L = -z, U = -z;   /* left/upper bounds of the view pyramid slice at depth z */
        int R = z,  D = z;    /* right/lower bounds */
        int X = 0,  Y = 0;    /* accumulated screen coordinates */
        int M1, M2;
        int bit = 1 << (s - 1);
        for (; bit > 0; bit >>= 1) {
          printf("L:%d U:%d R:%d D:%d   x:%d y:%d   %d\n", L, U, R, D, X, Y, s); /* debug */
          M1 = (L + R) >> 1;  /* midpoint in x */
          if (x < M1) { R = M1; X -= bit; } else { L = M1; X += bit; }
          M2 = (U + D) >> 1;  /* midpoint in y */
          if (y < M2) { D = M2; Y -= bit; } else { U = M2; Y += bit; }
        }
        /* mask Y so a negative Y does not smear into the X half */
        return (X >> 1) << 16 | ((Y >> 1) & 0xFFFF);
      }
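
      For example (illustrative values; s = 9 corresponds to a 512-wide screen):

      int packed = I3D2D3(100, 50, 200, 9);  /* project the point (100, 50, 200) */
      int sx = packed >> 16;                 /* roughly 256 * x/z = 128 */
      int sy = (short)(packed & 0xFFFF);     /* roughly 256 * y/z = 64 */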
      

      The idea is to compare in the z plane and translate in the screen plane.
      I deconstructed the algorithm from a post in many places, and reconstructed it again.
      It is a search algorithm!! I am trying to see how it can be integrated into a rendering process.
      Look carefully at how it works.
      A more advanced version is on my site, in my language. I can generate the Morton code directly, or replace L and R with Center and Size.

      • That is basically long division, the division technique that they used to teach in elementary school. It is a lot slower than the division implemented in your CPU, unless you use a trick to do it for many points simultaneously.

        My rendering algorithm basically does such a long division for multiple points simultaneously. For that I also keep track of variables dx, Ldx, Udx, Rdx, Ddx (and similar for y and z), such that if I want to change x into x+dx, then I have to change L into L+Ldx, U into U+Udx, etc.
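
        A tiny illustration of that bookkeeping (a paraphrase of the sentence above, not the actual renderer code):

        struct Bounds {
            int x, L, U, R, D;           // current value and frustum bounds
            int dx, Ldx, Udx, Rdx, Ddx;  // precomputed per-step deltas
            void step() {                // stepping x by dx shifts every bound by its own delta
                x += dx; L += Ldx; U += Udx; R += Rdx; D += Ddx;
            }
        };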

      • Lol, that code looks a lot like the code from that bible racist whose comments were deleted. That doesn’t mean it’s bad. It’s probably brilliant, because he was insane…

        I’ll post my solution here as soon as I find time and a proper keyboard. I think it’s not that good, but it worked for me.

      • I don’t get how this is working. It projects a point (x,y,z) by comparing its (x,y) with its z value? Then it translates the resulting coords according to the comparison?

        If this works, then it’s very nice; I’ll try it out. I wonder how you can modify the projection, I mean different FOV etc.
        But why do you think this is a search algorithm? It gives you the final position of a point, but not the point itself. I agree with bcmpinc: you have to find a way to project multiple points at once. If you apply this to every point on screen, it will be slow. Maybe you can project higher hierarchies and use that for further calculations for the children. Maybe this results in less looping.

      • Well, first note that the iteration count depends on the screen resolution: try s = 9 for a 512×512 screen. A 32-bit division would iterate 32 times (without the clz optimization).
        For a different FOV, multiply L and R by the scale.
        I don’t know the solution, but it is a path worth studying.
        One idea is to mix the point search with the octree traversal… something like: calculate the quadrant (one step) of the node and descend the octree; this shares the calculation among the children.
        Another: mark the nodes with a flag saying whether they have points in quadrant 0, 1, or both, then traverse the octree in the direction of the point to paint (this avoids the z-buffer).

      • i tried the code and im always getting the same number, no matter what input. only s seems to alter the final value. what am i doing wrong?

    • @batman: The world generation and lighting are quite impressive. However, since it is isometric, there is very little that changes from view to view. I suspect that each cube is rendered and cached, such that those can be rendered as tiles.

      • Yes, I thought so too. In the video he says he uses deferred rendering. He also says that there is not too much caching involved. But the tiles are probably cached and reused.

  9. I’m not completely sure if this relates, but I remember a paper called Inferring Transforms that talks about projecting points from 3D to 2D without divisions (not sure about multiplications). It was part of a big book by Jim Blinn, I believe.

  10. Pingback: Now running ~10fps at 1024×768 | Voxel-Engine
