Optimizing OpenGL Performance with VAOs
Eric Lengyel • June 27, 2014
I’ve been making some low-level optimizations in the C4 Engine’s rendering code lately with the goal of speeding up The 31st by reducing the amount of CPU work and driver overhead required to render each object in the scene. There’s a particular feature of OpenGL called vertex array objects (VAOs) that should theoretically help achieve this goal, but many game developers have discovered that it doesn’t perform as expected. This blog post is an account of my experiences with VAOs, and it includes more technical details and a small amount of ranting.
A vertex array object encapsulates all of the state associated with a set of vertex attribute arrays. For each array, this state includes references to the vertex buffer object (VBO) in which the raw data is located, the format of the data (given by the numerical type, component count, and a flag indicating whether integer types are normalized), and the stride that gives the number of bytes separating the starting locations of consecutive vertices. In the case of meshes that have indexed vertices, the VAO state also includes a reference to the VBO containing the index array. All of this data is ordinarily constant for any particular item that is rendered in the scene and does not change from frame to frame. In C4, scene geometry usually has four vertex attribute arrays (position, normal, tangent, and texture coordinates) plus an index array, but there can be as many as ten vertex attribute arrays in special cases. The issue that VAOs were intended to address, at least in theory, is that respecifying all of this information in OpenGL every time an item is drawn is kind of silly when it’s the same exact stuff being specified every single time. Doing so requires a bunch of calls into OpenGL, and revalidating the parameters to each of those calls wastes a lot of precious clock cycles while accomplishing absolutely nothing. By using a VAO, this information can be specified by the game once and validated by the OpenGL driver ahead of time. Then the game only has to tell OpenGL which VAO it wants to use through a single function call to set up the vertex arrays each time an item is rendered.
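As a mental model of the state described above, the contents of a VAO amount to a plain struct like the following. This is only an illustration with hypothetical field and type names, not any driver's actual layout:

```c
#include <stdbool.h>
#include <stdint.h>

#define kMaxVertexAttribs 16

/* Per-attribute array state (hypothetical layout, for illustration only). */
typedef struct
{
    bool        enabled;        /* set by glEnableVertexAttribArray */
    uint32_t    vertexBuffer;   /* VBO bound when glVertexAttribPointer was called */
    int         size;           /* component count, 1 through 4 */
    uint32_t    type;           /* numerical type, e.g. GL_FLOAT */
    bool        normalized;     /* whether integer types are normalized */
    int         stride;         /* bytes separating consecutive vertices */
    intptr_t    offset;         /* byte offset into the VBO */
} AttribArrayState;

/* Everything a VAO snapshot holds, including the index array binding. */
typedef struct
{
    AttribArrayState    attrib[kMaxVertexAttribs];
    uint32_t            indexBuffer;    /* GL_ELEMENT_ARRAY_BUFFER binding */
} VertexArrayState;
```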
Let’s look at this in a little more detail. Without using a VAO, the code necessary to specify the vertex arrays for some item in the scene would typically look like this:
```c
for (int i = 0; i < maxArrayCount; i++)
{
    if (newArrayEnabled[i])
    {
        if (!currentArrayEnabled[i])
        {
            currentArrayEnabled[i] = true;
            glEnableVertexAttribArray(i);
        }

        if (newVertexBuffer[i] != currentVertexBuffer)
        {
            currentVertexBuffer = newVertexBuffer[i];
            glBindBuffer(GL_ARRAY_BUFFER, newVertexBuffer[i]);
        }

        glVertexAttribPointer(i, size[i], type[i], norm[i], stride[i], offset[i]);
    }
    else
    {
        if (currentArrayEnabled[i])
        {
            currentArrayEnabled[i] = false;
            glDisableVertexAttribArray(i);
        }
    }
}

if (indexed)
{
    if (newIndexBuffer != currentIndexBuffer)
    {
        currentIndexBuffer = newIndexBuffer;
        glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, newIndexBuffer);
    }
}
```
Now consider what happens inside the driver for each call to the glVertexAttribPointer() function:
- Verify that the array index is in the allowable range.
- Verify that the size is between one and four.
- Verify that the type is a valid value by going through a bunch of if/else statements.
- Verify that the stride is not negative.
- Make sure that the currently bound VBO is valid.
- Run through a bunch of logic to combine the size, type, and norm parameters to form an internal format code, such as the 32-bit code recognized by the GCN architecture.
- Finally, store all this information in some kind of internal structure that will be used the next time something is drawn.
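To make that cost concrete, here is a sketch in plain C of what the per-call validation and format packing might look like inside a driver. The constants shown, the packed code layout, and the function name are all made up for illustration; real drivers differ:

```c
#include <stdbool.h>
#include <stdint.h>

#define GL_BYTE     0x1400
#define GL_SHORT    0x1402
#define GL_FLOAT    0x1406  /* only a few of the valid types shown */

#define kMaxAttribs 16

/* Returns a packed internal format code, or 0 on validation failure.
   A draw-time respecification pays for all of these checks on every call. */
static uint32_t ValidateAndPackAttribFormat(int index, int size, uint32_t type,
                                            bool normalized, int stride,
                                            uint32_t boundBuffer)
{
    if ((index < 0) || (index >= kMaxAttribs)) return 0;    /* index in range */
    if ((size < 1) || (size > 4)) return 0;                 /* 1-4 components */
    if (stride < 0) return 0;                               /* stride non-negative */
    if (boundBuffer == 0) return 0;                         /* a VBO must be bound */

    uint32_t typeCode;
    if (type == GL_FLOAT) typeCode = 1;                     /* if/else chain over valid types */
    else if (type == GL_SHORT) typeCode = 2;
    else if (type == GL_BYTE) typeCode = 3;
    else return 0;

    /* Combine size, type, and normalization into one internal code. */
    return ((uint32_t) size) | (typeCode << 4) | (((uint32_t) normalized) << 8);
}
```

With a VAO, this work would run once at specification time instead of once per draw.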
The other functions in the above code also require validation by the driver, but it’s relatively minor. In total, this is a ton of unnecessary, repetitive work done by the OpenGL driver for every single item rendered during every single frame. Using VAOs is supposed to eliminate all of this waste by performing all of these steps ahead of time so that the above code can be replaced, in its entirety, with the following line:
```c
glBindVertexArray(newVertexArray);
```
The OpenGL driver would only need to verify that the new VAO is actually a valid object, and that object could easily contain the exact data that needs to be dumped into the hardware command buffer to configure the vertex arrays the next time something is drawn. Using VAOs should be a huge win.
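For reference, the one-time setup that makes this single call possible looks roughly like the following. This is a sketch using the same GL calls as the earlier listing (variable names are illustrative, and error handling is omitted); note that the element array binding is captured by whichever VAO is bound at the time:

```c
/* One-time setup, at mesh creation. */
glGenVertexArrays(1, &vertexArray);
glBindVertexArray(vertexArray);

glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer);
for (int i = 0; i < arrayCount; i++)
{
    glEnableVertexAttribArray(i);
    glVertexAttribPointer(i, size[i], type[i], norm[i], stride[i], offset[i]);
}

/* Captured by the VAO while it is bound. */
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, indexBuffer);

glBindVertexArray(0);
```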
But for some reason, it isn't. I ran some tests with three intentionally extra-heavy, CPU-bound scenes from the Cemetery level in The 31st, and I found that using VAOs was slower, sometimes much slower, than simply going through all of the redundant array specification calls. It was slower on four different OpenGL implementations: those from Nvidia, AMD, Intel, and Apple. I wasn't the first person to make this observation, either. I already knew that Valve had obtained the same result when they converted the Source engine to OpenGL, and I had accepted that for a while. But then a blog post surfaced from the head OpenGL honcho at AMD in which VAOs were shown to be faster in all cases (as they should be) on both AMD and Nvidia hardware. So I wanted to try them out for myself and get to the bottom of the whole argument. My results support Valve's claim (and the claims of many other developers) that using VAOs in real-world situations is invariably a loss. The synthetic test from AMD obviously doesn't have the same CPU usage characteristics as a real game, and it would appear that this difference makes it impossible to obtain meaningful data from it.
The raw numbers from my tests are displayed in the following tables. Scenes 1, 2, and 3 make 10933, 12139, and 9789 draw calls per frame, respectively. In all cases, the latest final-release drivers were installed. (Numbers from different tables cannot be compared because they were run on different computers and/or at different resolutions.)
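For clarity on how the tables read, the relative-performance column appears to be the percentage change in frame time relative to the respecification time, with negative values meaning VAOs are slower. A small sketch (the helper name is mine):

```c
#include <assert.h>
#include <math.h>

/* Percentage change of the VAO frame time relative to the
   respecification frame time; negative means VAOs are slower. */
static double RelativeVAOPerformance(double respecifyMs, double vaoMs)
{
    return ((respecifyMs - vaoMs) / respecifyMs) * 100.0;
}
```

For example, the first Nvidia scene gives (20.1 − 25.5) / 20.1, or roughly −26.9%, matching the table.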
Test #1: Nvidia GeForce GTX 770, Windows 7 64-bit
| Scene | Respecifying Array State | Using VAOs | Relative VAO Performance |
|---|---|---|---|
| 1 | 20.1 ms | 25.5 ms | −26.9% |
| 2 | 21.6 ms | 27.9 ms | −29.2% |
| 3 | 18.5 ms | 23.7 ms | −28.1% |
Test #2: AMD Radeon HD 6800, Windows 8.1 64-bit
| Scene | Respecifying Array State | Using VAOs | Relative VAO Performance |
|---|---|---|---|
| 1 | 37.7 ms | 41.3 ms | −17.5% |
| 2 | 41.3 ms | 48.5 ms | −17.4% |
| 3 | 32.8 ms | 38.7 ms | −18.0% |
Test #3: Intel HD Graphics 4600 (Haswell), Windows 7 64-bit
| Scene | Respecifying Array State | Using VAOs | Relative VAO Performance |
|---|---|---|---|
| 1 | 48.6 ms | 50.5 ms | −3.9% |
| 2 | 57.7 ms | 58.6 ms | −1.6% |
| 3 | 42.7 ms | 43.2 ms | −1.2% |
Test #4: AMD Radeon HD 6770M, Mac OS 10.9.3
| Scene | Respecifying Array State | Using VAOs | Relative VAO Performance |
|---|---|---|---|
| 1 | 41.6 ms | 43.7 ms | −5.0% |
| 2 | 46.1 ms | 48.1 ms | −4.3% |
| 3 | 37.3 ms | 39.4 ms | −5.6% |
Something is wrong here. If the drivers implemented VAOs in the dumbest way possible, simply storing the parameters passed to the array specification functions and then running the code from the listing above when something was drawn, I'd expect VAO performance to match the respecification column of these tables. All I did was implement on the application side what I'd expect to be the least the driver could do on its end. In reality, though, the drivers fail to reach even that worst-case baseline. This means every one of these OpenGL implementations is doing something horribly inefficient, although Intel isn't far from being a wash. With prevalidation and internal data format optimizations, they should be able to do a lot better than I could on the application side.
Some people have suggested that the performance problem can be attributed to additional cache misses due to “pointer chasing” inside the driver. This sounds plausible, but my gut was telling me that it wouldn’t be nearly enough to explain the difference in speed that I was seeing, especially on Nvidia and AMD hardware. To test this theory, I tried allocating storage for the array specifications in a separate area of memory so there would be extra indirections when I ran the code in the above listing, but doing this did not make any measurable difference.
Hopefully, the driver engineers can shed some light on what’s happening here, or better yet, they can analyze the driver code and figure out what’s causing it to be so damn slow.