A Short Guide to 3D Graphics Performance Testing

As with any optimization problem, you need to follow these steps:

Understand the structure of your system.
Do measurements to find the bottleneck.
Eliminate the bottleneck.
Rinse, repeat.

Understand the system

In most 3D applications - whether under OpenGL, OpenGL ES, WebGL or Direct3D - there are four principal places where speed bottlenecks can happen:

The CPU - you have to calculate what meshes you're going to draw and set them up for rendering. This tends to be more or less a fixed cost per mesh.
The transmission link between CPU and GPU. This cost depends on the number of vertices you send multiplied by the amount of per-vertex attribute data - plus the cost of updating any textures and shaders that you change during that frame.
The GPU's vertex processor. This is the per-vertex processing cost of transforming and lighting your vertex data. Without shaders, the cost roughly depends on the number of vertices times the number of lights you have turned on. With shaders, the cost roughly depends on the number of vertices times the complexity of your vertex shader.
The GPU's pixel/fragment processor. This is the per-pixel cost for pixels that pass clipping. The cost roughly depends on the number of pixels you draw onto the screen times the number of textures you use and/or the complexity of your fragment shader.

In general, these four parts of the process are happening in parallel. While the GPU's pixel processor is drawing the pixels of one set of triangles, the GPU vertex processor is transforming the vertices of the ones that came along a little later, the transmission system is shoving the next set of vertices after that into the input queue of the GPU - and the CPU is working on setting up the next mesh to draw.

When one of these four systems is processing meshes slower than the others, the ones before it will be stuck waiting for it to complete - and the ones after it will be idling while they wait for more polygons to process.

Since all four systems are working in parallel, you can't simply use the CPU clock to time how long a particular mesh takes to draw. If something other than the CPU is the bottleneck, you may actually be measuring the time it took to draw some previous mesh. In the end, you really can't disentangle the timings like that.

The best thing you can do is to measure the long-term average frame rate of your system. Render (say) 100 frames and calculate the average time per frame.
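As a rough illustration, the averaging could look like the sketch below. The `render_frame` callback is hypothetical: it stands in for whatever your application does to draw one complete frame and present it. The injectable `clock` parameter is only there to make the function easy to test.

```python
import time

def average_frame_time(render_frame, frames=100, clock=time.perf_counter):
    """Return the mean wall-clock seconds per frame over `frames` frames.

    `render_frame` is a hypothetical callback that draws and presents
    one complete frame. Only the total is timed, because per-mesh
    timings are meaningless when CPU and GPU run in parallel.
    """
    start = clock()
    for _ in range(frames):
        render_frame()
    return (clock() - start) / frames
```

Timing the whole run rather than individual frames lets the pipeline settle into its steady state, so the average reflects whichever stage is really the slowest.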

Find the bottleneck

If your application is running slower than you'd hoped - then you need to establish which of these four things is the biggest problem. However, since drawing fewer polygons will also reduce the number of pixels you're filling, it's essential to do these tests one at a time in the order I describe:

GPU Pixel processing

Pixel processing time is easy to test - reduce the size of the window you're rendering to (keeping everything else the same). If your frame time drops in rough proportion to the area of the window (height x width) - then pixel processing is the bottleneck.
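One way to put a number on "rough proportion" is to compare the time you actually saved with the time you would expect to save if pixel work were the whole story. This is only a sketch: the function name and the idea of scoring the result between 0 and 1 are my own assumptions, not a standard metric.

```python
def pixel_scaling_score(time_full, time_small, area_full, area_small):
    """Fraction of the expected saving that shrinking the window achieved.

    Near 1.0: frame time tracks window area, so pixel processing is the
    likely bottleneck. Near 0.0: window size barely matters.
    """
    if not 0 < area_small < area_full:
        raise ValueError("the small window must be smaller than the full one")
    expected_saving = 1.0 - area_small / area_full
    actual_saving = 1.0 - time_small / time_full
    return actual_saving / expected_saving

# Hypothetical measurements: 1920x1080 at 20 ms versus 960x540 at 6 ms.
score = pixel_scaling_score(20.0, 6.0, 1920 * 1080, 960 * 540)
```

The score is most meaningful when the window shrinks a lot (say to a quarter of its area or less), because small changes in area are easily swamped by measurement noise.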

CPU processing

If you have eliminated pixel processing as a bottleneck, the next test relies on the fact that CPU time generally doesn't depend on the number of vertices you draw. Just as a test, keep rendering to a tiny window to more or less eliminate pixel processing costs - but deliberately halve the number of triangles in each mesh, keeping the number of meshes constant.

If your application's performance increases by roughly a factor of two when you do that, then you were obviously not limited by the CPU's per-mesh costs. But if your performance hardly changes when you halve the overall vertex count, then you're probably drawing too many meshes or doing too much per-mesh work in the CPU, and you need to improve your code somehow.

Vertex transmission

Once you know that it's something to do with the number of vertices - then figuring out whether the transmission costs or the GPU's vertex processing costs are your problem is tricky - but since both depend mostly on number of vertices, you probably don't need to.

You may be able to get some handle on this by deliberately changing the type of your vertex attributes (e.g. from a 'float' to a 'byte' - or vice-versa) to alter the amount of data that has to be transmitted without significantly altering the amount of processing time in the GPU. If changing that makes a big difference to overall frame rates then you're most likely "transmission-limited" - if it makes almost zero difference - then your GPU vertex processing is probably the culprit.

GPU Vertex processing

If the other tests didn't make much difference to frame rate - then this is what remains. Hopefully you can play around with simplifying or uselessly complicating your vertex shader (or in a fixed-function pipeline, turning on more or fewer light sources) and actually confirm this theory.
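The whole sequence of tests can be summarised as a small decision procedure. The sketch below is only that: the function name, its parameters and especially the two thresholds are assumptions picked for illustration, and real measurements are rarely this clear-cut. All inputs are average frame times; the last two are assumed to be measured in the small window.

```python
def classify_bottleneck(t_base, t_small_window, area_ratio,
                        t_half_triangles, t_smaller_attributes,
                        big_change=0.5, small_change=0.1):
    """Apply the tests in order and name the most likely bottleneck.

    area_ratio is small-window area divided by full-window area.
    big_change and small_change are arbitrary illustrative thresholds.
    """
    def saving(before, after):
        return 1.0 - after / before

    # 1. Shrinking the window: does frame time follow the area?
    if saving(t_base, t_small_window) >= big_change * (1.0 - area_ratio):
        return "GPU pixel processing"
    # 2. Halving triangles per mesh: no change means per-mesh CPU cost.
    if saving(t_small_window, t_half_triangles) <= small_change:
        return "CPU"
    # 3. Shrinking attribute types: a change means transmission cost.
    if saving(t_small_window, t_smaller_attributes) > small_change:
        return "vertex transmission"
    # 4. Whatever remains.
    return "GPU vertex processing"
```

The order matters for the reason given earlier: drawing fewer polygons also fills fewer pixels, so pixel cost has to be ruled out before the vertex tests mean anything.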

Eliminate the bottleneck

If CPU time is the culprit

You either need to optimize your code so that other CPU time-sinks are reduced (e.g. make physics, collision, AI, etc. faster) - or you need to improve your field-of-view culling so you draw fewer meshes that are off-screen - or you need to reduce the number of meshes in your art (e.g. by combining multiple parts of an object into a single mesh using tricks like texture atlasing).

In my experience, the last of these is the first thing that most people should be looking at.

On a modern high-end graphics system, the GPU can probably draw around 200 to 800 vertices in the time it takes the CPU to get the next mesh set up ready to draw. So if your scene is full of meshes, each containing (say) an 8 vertex cuboid - then you are highly likely to be CPU bound. That's because it's taking the CPU a long time to get ready to draw each cube - but the GPU gets the cube processed in lightning quick time and within a gazillionth of a second it's just sitting around waiting for the CPU to send it the next one.

In a harmonious, balanced system, the meshes are complex enough that the GPU gets done with processing each one at almost the exact moment that the CPU is finished with setting up the next one. That way, neither CPU nor GPU is sitting around idle waiting for the other to finish working.

However, it takes careful attention to artwork so as not to have a bunch of 8 vertex-per-mesh crates and then an 8000 vertex-per-mesh space-alien! If that's where you're at - go to your art team and threaten them with extreme violence! They will be amazed to find that they can replace each 8 vertex cube with a 400 vertex model and have little or no effect on frame rates! On the other hand, if they can combine 500 eight vertex crates into eight 500 vertex multi-crate meshes - then the CPU has 8 meshes to set up instead of 500, and the scene may well render tens of times faster!
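A toy cost model makes the crate arithmetic concrete. It assumes the CPU and GPU overlap perfectly, so a frame costs whichever side is slower, and it assumes the GPU processes 400 vertices in the time the CPU takes to set up one mesh (a value picked from the middle of the 200 to 800 range mentioned above). Both assumptions, and the function name, are mine.

```python
def frame_cost(mesh_count, vertices_per_mesh, vertices_per_mesh_setup=400):
    """Frame cost in units of 'one CPU mesh set-up'.

    CPU and GPU are assumed to run fully in parallel, so the frame
    costs as much as the slower of the two.
    """
    cpu_cost = mesh_count
    gpu_cost = mesh_count * vertices_per_mesh / vertices_per_mesh_setup
    return max(cpu_cost, gpu_cost)

separate_crates = frame_cost(500, 8)     # CPU-bound: 500 set-ups
detailed_crates = frame_cost(500, 400)   # still 500: the extra vertices are free
merged_crates = frame_cost(8, 500)       # GPU-bound: 10 units
speedup = separate_crates / merged_crates
```

Under these assumptions the merged scene comes out 50 times cheaper, and the bottleneck moves from the CPU to the GPU along the way. The exact factor depends entirely on the assumed costs, so treat it as an order-of-magnitude guide.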

If GPU vertex processing is the culprit

Then your meshes are too complex or you have too many (fixed function) light sources or your vertex shader is too complex. Use level-of-detail to reduce the number of vertices in meshes that are further from the eyepoint. Consider doing occlusion queries to reduce the number of meshes you draw. Optimize light source culling.

If transmission of data is the culprit

Then the measures described above to reduce the number of vertices will help. You can also consider whether you can afford to send per-vertex attributes at lower precision - or whether you need them at all.

For example - lots of people make the mistake of sending vertex normals, binormals and tangents as floating point triplets. But those things are mostly only used for lighting - and lighting really only needs to be accurate to about one part in 256 because that's the precision of the display you are driving. So it's worth considering knocking colors and normal/binormal/tangent data down to 'byte' precision to avoid sending so much per-vertex data.

Other data might be similarly reduced. If your objects are smaller than (say) 2.5 meters - then you could consider making sure that the origin of the object coordinate system is in the middle of the object, converting the vertex positions to centimeters, rounding them to the nearest centimeter and sending them as bytes too! If not bytes, then maybe 16 bit 'short' data? How many objects do you have that need better-than-centimeter precision and are more than 655 meters across? Or need better-than-millimeter precision and are more than 65 meters across?
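The savings and the precision limits are easy to check with a little arithmetic. The sketch below uses Python's `struct` module purely to count bytes; the two layouts, the padding to 4-byte boundaries and the helper names are assumptions for illustration, not a layout any particular API requires.

```python
import struct

# Position, normal, binormal and tangent, all as 32-bit floats.
FLOAT_LAYOUT = "<3f3f3f3f"
# 16-bit position plus three signed-byte vectors, each padded to 4-byte multiples.
PACKED_LAYOUT = "<3h2x3bx3bx3bx"

def quantize_normal(n):
    """Map components in [-1, 1] to signed bytes in [-127, 127]."""
    return tuple(round(c * 127) for c in n)

def dequantize_normal(q):
    """Recover approximate float components from signed bytes."""
    return tuple(c / 127 for c in q)

def max_extent(bits, step):
    """Largest object size an integer of `bits` bits can span at `step` spacing."""
    return (2 ** bits) * step

bytes_as_floats = struct.calcsize(FLOAT_LAYOUT)   # 48 bytes per vertex
bytes_packed = struct.calcsize(PACKED_LAYOUT)     # 20 bytes per vertex
byte_range_cm = max_extent(8, 0.01)               # about 2.56 m
short_range_cm = max_extent(16, 0.01)             # about 655.36 m
short_range_mm = max_extent(16, 0.001)            # about 65.536 m
```

Rounding a normal to signed bytes changes each component by at most 0.5/127, which is roughly one part in 254 and so in line with the display precision argument above.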

Also, there are circumstances where you can trade transmission cost for GPU vertex processing costs. For example, in a terrain mesh - it's very likely that your surface normals all point roughly upwards. Since you know that x²+y²+z²==1.0, you can compute z (assuming z-is-up) from x and y because you know that the sign of z is always positive. Similarly, if you've sent the normal and the binormal then you can calculate the tangent vector from the cross-product of the other two and you only need to store one bit for its sign - and you can probably pack that into some other variable.
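In a real renderer this reconstruction would live in the vertex shader, but the arithmetic is the same in any language. Here is a sketch in Python. The function names are hypothetical, and the tangent formula assumes the convention normal = tangent x binormal, so that tangent = binormal x normal, with the stored sign bit flipping the result for mirrored geometry.

```python
import math

def reconstruct_up_normal(x, y):
    """Rebuild a unit normal from x and y, assuming z is up and positive."""
    # Clamp at zero in case rounding pushed x*x + y*y slightly above 1.
    z = math.sqrt(max(0.0, 1.0 - x * x - y * y))
    return (x, y, z)

def cross(a, b):
    """Cross product of two 3-component vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def reconstruct_tangent(normal, binormal, sign):
    """Rebuild the tangent from the other two vectors; sign is +1 or -1."""
    return tuple(sign * c for c in cross(binormal, normal))
```

Each reconstruction saves one transmitted component or vector at the price of a few extra arithmetic operations per vertex, so it only pays off when transmission really is the bottleneck.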

Do you need to send all of the data at all? For regularly gridded terrain, the X and Y coordinates can be converted into a single integer and the Z coordinate read from a 'height map' pre-stored as a vertex-texture. The normal data could be stored in the same manner.
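The index trick is sketched below in Python for clarity; in practice the same arithmetic would run in the vertex shader, with the height fetched from a vertex texture. The function name, the row-major numbering of vertices and the `spacing` parameter are assumptions made for the example.

```python
def grid_vertex(index, grid_width, spacing, height_map):
    """Turn a single integer vertex index into an (x, y, z) position.

    Vertices are assumed to be numbered row by row across a regular
    grid, and height_map[row][col] holds the pre-stored Z value.
    """
    col = index % grid_width
    row = index // grid_width
    return (col * spacing, row * spacing, height_map[row][col])

# A tiny 3 x 2 grid with 10-unit spacing.
heights = [[0.0, 1.0, 2.0],
           [3.0, 4.0, 5.0]]
position = grid_vertex(4, 3, 10.0, heights)
```

With this scheme the only per-vertex data that has to be transmitted is one integer; everything else is either computed or looked up on the GPU side.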

If GPU pixel processing is the culprit

Simplify your pixel shaders. Consider doing a depth-only pass before your 'beauty' pass. Can you reduce the resolution of your textures? Can you draw to a smaller window? Can you make better use of approximate front-to-back rendering order?
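Approximate front-to-back ordering can be as simple as sorting meshes by the distance from the eyepoint to each mesh's center before drawing. The sketch below assumes each mesh is a dictionary with a hypothetical "center" entry; a real engine would use its own mesh type and probably a bounding volume.

```python
def front_to_back(meshes, eye):
    """Return meshes sorted nearest-first by squared distance to the eye."""
    def distance_squared(mesh):
        cx, cy, cz = mesh["center"]
        return (cx - eye[0]) ** 2 + (cy - eye[1]) ** 2 + (cz - eye[2]) ** 2
    # Squared distance gives the same ordering and avoids a square root.
    return sorted(meshes, key=distance_squared)
```

Drawing near meshes first gives the depth test a chance to reject hidden pixels on farther meshes before they are shaded. The order only needs to be approximate, so sorting by mesh center is usually good enough.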

Rinse and repeat

When you've done some kind of optimization - re-do the testing phase to see if you've sped things up - and also (very important) to see if you've moved the bottleneck somewhere else. If the CPU was the limiting factor - and you improved that, then perhaps the GPU is now the limiting factor. If so, then speeding up your CPU code still further probably won't help - and (worse still) all the extra work you put in there won't result in better frame rates. So after each round of optimization, see which part of the system needs more work next.

Also, if you're getting good frame rates - then you can probably improve the quality of your graphics by drawing more meshes, more polygons, or having more complicated lighting or something. When things are humming along fast enough, you can use "reverse optimization" to understand where you could be making things look nicer at little to no cost to performance.

This is an over-simplification

There are times when (for example) your CPU is spending too long getting to the point in the frame cycle where it starts drawing meshes...then, once it starts drawing them, the meshes are too complicated and the CPU is held up waiting for the GPU to get done. In such cases, you have multiple bottlenecks, and reducing either the CPU work or the vertex count will improve performance.

Low-end hardware

On things like cellphones and very low-end Intel GPUs, it's likely that the "GPU vertex processing" stage is actually happening in software on the CPU. In this case, the cost per-vertex for "transmission" is essentially zero because everything is sitting in the CPU's main memory - and the per-vertex cost for "GPU vertex processing" is actually slowing the CPU down.

In such circumstances, you can still figure out whether it's per-pixel costs (by shrinking the window size down) - but halving the number of vertices in each mesh won't tell you much about whether you have too many vertices or too many meshes. In a sense, it doesn't matter because improving one will relieve the situation for the other - and whichever one is easiest to improve should be the one you attack first.
