Introduction
GPU Computing
In this chapter, we will have a look at the compute shader and try to understand how it works and how we can create and run one.
While the graphics card (GPU) has traditionally been a rendering co-processor that handles graphics, it has become more and more common to use graphics cards for other (not necessarily graphics-related) computational tasks (General Purpose Computing on Graphics Processing Units; short: GPGPU).
A stream processor uses a function/
As stated above the most important (mandatory) aspect of programs running on GPUs is that they must be parallelizable. Sharing of memory is not easily possible and very limited for
(This operation can still be sped up on the GPU by using a kernel that accumulates sub-stream data in parallel, which reduces the number of serial accumulations for bigger streams. The results for the sub-streams then have to be combined in the host program afterwards.)
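The sub-stream idea from the parenthetical above can be illustrated on the CPU. This is only a sketch under the assumption that the accumulation is a plain sum; the names partialSums and combine are hypothetical, and the loop over chunks merely stands in for what parallel kernel invocations would do on a GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: sums each fixed-size chunk ("sub-stream") on its own.
// On a GPU every chunk could be processed by a separate kernel invocation;
// this sequential loop only simulates that step.
std::vector<double> partialSums(const std::vector<double>& stream, std::size_t chunkSize)
{
    assert(chunkSize > 0);
    std::vector<double> sums;
    for (std::size_t begin = 0; begin < stream.size(); begin += chunkSize) {
        const std::size_t end = std::min(begin + chunkSize, stream.size());
        const auto first = stream.begin() + static_cast<std::ptrdiff_t>(begin);
        const auto last = stream.begin() + static_cast<std::ptrdiff_t>(end);
        sums.push_back(std::accumulate(first, last, 0.0));
    }
    return sums;
}

// Host-side step: combine the per-chunk results serially.
double combine(const std::vector<double>& sums)
{
    return std::accumulate(sums.begin(), sums.end(), 0.0);
}
```

For a stream of seven values and a chunk size of three, only three partial results remain to be combined serially instead of seven inputs.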
It is important to keep this mandatory parallelism in mind when writing GPU
To summarize, compute shaders work great for many small parallel batches. Check out: Mythbusters Demo GPU versus CPU
Compute Shader Stage
To make GPU computing more easily accessible, especially for graphics applications, while sharing common memory mappings, the OpenGL standard introduced the
Compute shaders are
To pass data to the compute shader, the shader needs to fetch the data for example via
The following table shows the data each shader stage operates on. As shown below, the compute shader works on an "abstract work item".
Stage | Data Element |
---|---|
Vertex Shader | per vertex |
Tessellation Control Shader | per vertex (in a patch) |
Tessellation Evaluation Shader | per vertex (in a patch) |
Geometry Shader | per primitive |
Fragment Shader | per fragment |
Compute Shader | per (abstract) "work item" |
Compute space
The user can use a concept called
During execution of the
The
The image below shows how every
An example:
Given the
While it is possible to communicate using
Create your first compute shader
Now that we have a broad overview of compute shaders, let's put it into practice by creating a "Hello-World" program. The program should write (color) data to the pixels of an image/texture object in the compute shader. After the compute shader has finished executing, the program will display the texture on the screen using a second shader program, which uses a vertex shader to draw a simple screen-filling quad and a fragment shader.
Since compute shaders were introduced in OpenGL 4.3, we need to adjust the context version first:
glfwWindowHint(GLFW_CONTEXT_VERSION_MAJOR, 4);
glfwWindowHint(GLFW_CONTEXT_VERSION_MINOR, 3);
Compile the Compute Shader
To be able to compile a compute shader program we need to create a new shader class. We create a new ComputeShader class that is almost identical to the normal Shader class, but as we want to use it alongside the normal shader stages we have to give it a new, unique class name.
class ComputeShader
{
public:
    unsigned int ID;
    ComputeShader(const char* computePath)
    {
        ...
    }
};
Note: we could just as well add a second constructor with a single parameter to the Shader class and assume that it takes a compute shader, but for the sake of clarity we split them into two different classes. Additionally, it is not possible to bake compute shaders into an OpenGL program object alongside other shaders.
The code to create and compile the shader is also almost identical to the one for other shaders. But as the compute shader is not bound to the rest of the render pipeline, we create the shader with the type GL_COMPUTE_SHADER and attach it solely to the new program after creating the program itself.
unsigned int compute;
// compute shader
compute = glCreateShader(GL_COMPUTE_SHADER);
glShaderSource(compute, 1, &cShaderCode, NULL);
glCompileShader(compute);
checkCompileErrors(compute, "COMPUTE");
// shader Program
ID = glCreateProgram();
glAttachShader(ID, compute);
glLinkProgram(ID);
checkCompileErrors(ID, "PROGRAM");
Check out the chapter Getting Started - Shaders to get more information about the Shader class.
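The constructor body is elided above. One step any such constructor needs is reading the shader source from computePath into a string before it can be handed to glShaderSource. The following is a minimal sketch of that step only, assuming a hypothetical helper named readTextFile that is not part of the article's classes.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (an assumption, not the article's code): reads a whole
// text file, for example a compute shader source, into a string.
std::string readTextFile(const std::string& path)
{
    assert(!path.empty());
    std::ifstream file(path, std::ios::in | std::ios::binary);
    if (!file) {
        throw std::runtime_error("could not open file: " + path);
    }
    std::ostringstream contents;
    contents << file.rdbuf();
    return contents.str();
}
```

The returned string has to outlive the call to glShaderSource; its c_str() pointer could then play the role of cShaderCode in the snippet above.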
Create the Compute Shader
With the shader class updated, we can now write our compute shader. As always, we start by defining the version on top of the shader as well as defining the size of the local
This can be done using the special layout input declaration in the code below. By default, the local sizes are 1 so if you only want a 1D or 2D
#version 430 core
layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;
Since we will execute our shader for every pixel of an image, we will keep our local size at 1 in every dimension (1 pixel per
There is a limitation of
There is as well a limitation on the
As we define and divide the tasks and the compute shader group sizes ourselves, we have to keep these limitations in mind.
We will bind a 2D image in our shader as the object to write our data onto. The internal format (here rgba32f) needs to match the format of the texture in the host program.
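A small host-side check can make these limits explicit before dispatching. The sketch below assumes the minimum values guaranteed by the OpenGL 4.3 specification (65535 work groups per dimension, a local size of at most 1024 x 1024 x 64, and at most 1024 invocations per work group); actual hardware may report higher values through glGetIntegeri_v with GL_MAX_COMPUTE_WORK_GROUP_COUNT and GL_MAX_COMPUTE_WORK_GROUP_SIZE, and through glGetIntegerv with GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS. The names ComputeLimits and fitsLimits are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed limits: the minimums guaranteed by the OpenGL 4.3 specification.
// Replace them with the values queried from the driver in a real program.
struct ComputeLimits {
    std::array<std::uint32_t, 3> maxGroupCount{65535, 65535, 65535};
    std::array<std::uint32_t, 3> maxGroupSize{1024, 1024, 64};
    std::uint32_t maxInvocationsPerGroup = 1024;
};

// Returns true if a dispatch of groupCount work groups, each with the given
// local size, stays within the limits.
bool fitsLimits(const std::array<std::uint32_t, 3>& groupCount,
                const std::array<std::uint32_t, 3>& localSize,
                const ComputeLimits& limits = ComputeLimits{})
{
    std::uint64_t invocations = 1;
    for (std::size_t i = 0; i < 3; ++i) {
        assert(localSize[i] > 0); // a local size of 0 is not valid GLSL
        if (groupCount[i] > limits.maxGroupCount[i]) return false;
        if (localSize[i] > limits.maxGroupSize[i]) return false;
        invocations *= localSize[i];
    }
    return invocations <= limits.maxInvocationsPerGroup;
}
```

Note that the per-dimension size limit and the total invocation limit are independent: a local size of 64 x 64 x 1 passes the former but exceeds the assumed 1024 invocations per group.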
layout(rgba32f, binding = 0) uniform image2D imgOutput;
We have to use image2D as this represents a single image from a texture. While sampler variables use the entire texture including mipmap levels and array layers, images only have a single image from a texture. Note that while most texture sampling functions use normalized texture coordinates [0,1], for images we need the absolute integer
With this set up, we can now write our main function in the shader where we fill the imgOutput with color values. To determine on which pixel we are currently operating in our shader execution we can use the following GLSL Built-in variables shown in the table below:
Type | Built-in name | Description |
---|---|---|
uvec3 | gl_NumWorkGroups | number of work groups set by glDispatchCompute |
uvec3 | gl_WorkGroupSize | size of the work group (local size) defined with layout |
uvec3 | gl_WorkGroupID | index of the current work group |
uvec3 | gl_LocalInvocationID | index of the current work item in the work group |
uvec3 | gl_GlobalInvocationID | global index of the current work item (gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID) |
uint | gl_LocalInvocationIndex | 1D index representation of gl_LocalInvocationID (gl_LocalInvocationID.z * gl_WorkGroupSize.x * gl_WorkGroupSize.y + gl_LocalInvocationID.y * gl_WorkGroupSize.x + gl_LocalInvocationID.x) |
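The two formulas in the last rows of the table can be checked on the CPU. The following sketch mirrors them in C++; the UVec3 alias and the function names are hypothetical stand-ins for the GLSL built-ins.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for GLSL's uvec3 (an assumption for this sketch).
using UVec3 = std::array<std::uint32_t, 3>;

// Mirrors gl_GlobalInvocationID =
//     gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID
UVec3 globalInvocationID(const UVec3& workGroupID,
                         const UVec3& workGroupSize,
                         const UVec3& localInvocationID)
{
    UVec3 result{};
    for (std::size_t i = 0; i < 3; ++i) {
        // A local index always lies inside its work group.
        assert(localInvocationID[i] < workGroupSize[i]);
        result[i] = workGroupID[i] * workGroupSize[i] + localInvocationID[i];
    }
    return result;
}

// Mirrors gl_LocalInvocationIndex: the local ID flattened to one dimension.
std::uint32_t localInvocationIndex(const UVec3& localInvocationID,
                                   const UVec3& workGroupSize)
{
    return localInvocationID[2] * workGroupSize[0] * workGroupSize[1]
         + localInvocationID[1] * workGroupSize[0]
         + localInvocationID[0];
}
```

For example, with a work group size of (10, 10, 1), the work item with local ID (4, 7, 0) in work group (3, 2, 0) has the global ID (34, 27, 0) and the local index 74.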
Using the built-in variables from the table above we will create a simple color gradient (st-map) on our image.
void main() {
    vec4 value = vec4(0.0, 0.0, 0.0, 1.0);
    ivec2 texelCoord = ivec2(gl_GlobalInvocationID.xy);
    value.x = float(texelCoord.x)/(gl_NumWorkGroups.x);
    value.y = float(texelCoord.y)/(gl_NumWorkGroups.y);
    imageStore(imgOutput, texelCoord, value);
}
We will setup the execution of the compute shader that every
We can then write our calculated pixel data to the image using the
Create the Image Object
In the host program, we can now create the actual image to write onto. We will create a 512x512 pixel texture.
// texture size
const unsigned int TEXTURE_WIDTH = 512, TEXTURE_HEIGHT = 512;
...
unsigned int texture;
glGenTextures(1, &texture);
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, texture);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F, TEXTURE_WIDTH, TEXTURE_HEIGHT, 0, GL_RGBA,
             GL_FLOAT, NULL);
// the access parameter must be GL_READ_ONLY, GL_WRITE_ONLY or GL_READ_WRITE
glBindImageTexture(0, texture, 0, GL_FALSE, 0, GL_READ_WRITE, GL_RGBA32F);
For a deeper explanation of the functions used to set up a texture, check out the Getting Started - Textures chapter. Here the
Executing the Compute Shader
With everything set up we can now finally execute our compute shader. In the drawing loop we can use/bind our compute shader and execute it using the
// render loop
// -----------
computeShader.use();
glDispatchCompute((unsigned int)TEXTURE_WIDTH, (unsigned int)TEXTURE_HEIGHT, 1);
// make sure writing to image has finished before read
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);
We first bind our shader using the
Before accessing the image data after the compute shader execution, we need to define a barrier to make sure the data writing is completely finished. The
Rendering the image
Lastly, we will render a rectangle and apply the texture in the fragment shader.
// render image to quad
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
screenQuad.use();
screenQuad.setInt("tex", 0);
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, texture);
renderQuad();
We will bind our texture now as sampler2D and use the texture coordinates of the rectangle to sample it.
The vertex and fragment shader are very simple as seen below.
Vertex Shader
#version 430 core
layout (location = 0) in vec3 aPos;
layout (location = 1) in vec2 aTexCoords;
out vec2 TexCoords;
void main()
{
    TexCoords = aTexCoords;
    gl_Position = vec4(aPos, 1.0);
}
Fragment Shader
#version 430 core
out vec4 FragColor;
in vec2 TexCoords;
uniform sampler2D tex;
void main()
{
    vec3 texCol = texture(tex, TexCoords).rgb;
    FragColor = vec4(texCol, 1.0);
}
Image Output
Adding Time Variable and Speed Measuring
We will now add time to the program for performance measuring to test which settings (
// timing
float deltaTime = 0.0f; // time between current frame and last frame
float lastFrame = 0.0f; // time of last frame
int fCounter = 0;
// render loop
// -----------
...
// Set frame time
float currentFrame = static_cast<float>(glfwGetTime());
deltaTime = currentFrame - lastFrame;
lastFrame = currentFrame;
if(fCounter > 500) {
    std::cout << "FPS: " << 1 / deltaTime << std::endl;
    fCounter = 0;
} else {
    fCounter++;
}
The code above prints the frames per second, limited to one print roughly every 500 frames, as printing too frequently slows the program down. When running our program with this "stopwatch" we will see that it never gets above 60 frames per second, as GLFW locks the refresh rate to 60 fps by default.
To bypass this lock we can set the swap interval for the current OpenGL context to 0 to get a higher refresh rate than 60 fps. We can use the function
glfwMakeContextCurrent(window);
glfwSetFramebufferSizeCallback(window, framebuffer_size_callback);
glfwSwapInterval(0);
Now we can get many more frames per second rendered/calculated. To be fair, this hello-world program is very simple and doesn't contain any complex calculations, so the calculation times are very low.
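Note that the counter shown earlier prints 1 / deltaTime for a single frame only, which can be noisy at high frame rates. As a possible refinement, the sketch below averages over a whole window of frames instead; the FpsCounter class is a hypothetical helper and not part of the demo's source code.

```cpp
#include <cassert>

// Hypothetical helper: accumulates frame times and reports the average
// frames per second once per window of frames.
class FpsCounter {
public:
    explicit FpsCounter(int framesPerReport) : framesPerReport_(framesPerReport)
    {
        assert(framesPerReport > 0);
    }

    // Adds one frame time in seconds. Returns true once a full window has
    // been collected; averageFps then holds the mean rate over that window.
    bool addFrame(double deltaSeconds, double& averageFps)
    {
        elapsed_ += deltaSeconds;
        ++frames_;
        if (frames_ < framesPerReport_) {
            return false;
        }
        averageFps = elapsed_ > 0.0 ? frames_ / elapsed_ : 0.0;
        elapsed_ = 0.0;
        frames_ = 0;
        return true;
    }

private:
    int framesPerReport_;
    int frames_ = 0;
    double elapsed_ = 0.0;
};
```

In the render loop, one would call addFrame(deltaTime, fps) once per iteration and print fps whenever it returns true.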
We can now animate our texture (moving from left to right) using the time variable. First, we change our compute shader to be animated:
#version 430 core
layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;
// images
layout(rgba32f, binding = 0) uniform image2D imgOutput;
// variables
layout (location = 0) uniform float t; /** Time */
void main() {
    vec4 value = vec4(0.0, 0.0, 0.0, 1.0);
    ivec2 texelCoord = ivec2(gl_GlobalInvocationID.xy);
    float speed = 100;
    // the width of the texture
    float width = 1000;
    value.x = mod(float(texelCoord.x) + t * speed, width) / (gl_NumWorkGroups.x);
    value.y = float(texelCoord.y)/(gl_NumWorkGroups.y);
    imageStore(imgOutput, texelCoord, value);
}
We create a uniform variable t, which will hold the current time. To animate a repeating rolling of the texture from left to right we can use
the modulo operation
In the host program, we can assign the variable value the same way as we assign them for any other shader using
computeShader.use();
computeShader.setFloat("t", currentFrame);
Since currentFrame changes every frame, we have to do the assignment in every iteration of the render loop.
The layout (location = 0) qualifier in front of the float variable is in general not necessary, as the shader class implementation queries the location of every variable on each uniform assignment. This lookup might slow down program execution if it is performed for multiple variables in every render loop iteration.
If you know that the location won't change and you want to increase the performance of the program as much as possible you can either query the location just once before the render loop and save it in the host program or hardcode it in the host program.
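The "query once and save it" option can be sketched as a small cache. To keep the example independent of an OpenGL context, the lookup is injected as a function; in a real program it would presumably wrap glGetUniformLocation for the program ID. The class name UniformLocationCache is hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical helper: remembers uniform locations so that the (possibly
// slow) lookup runs only once per uniform name.
class UniformLocationCache {
public:
    using Lookup = std::function<int(const std::string&)>;

    explicit UniformLocationCache(Lookup lookup) : lookup_(std::move(lookup))
    {
        assert(static_cast<bool>(lookup_));
    }

    // Returns the cached location, performing the lookup on first use only.
    int location(const std::string& name)
    {
        const auto it = cache_.find(name);
        if (it != cache_.end()) {
            return it->second;
        }
        const int loc = lookup_(name);
        cache_.emplace(name, loc);
        return loc;
    }

private:
    Lookup lookup_;
    std::unordered_map<std::string, int> cache_;
};
```

With such a cache, the render loop could pass the stored location directly to glUniform1f instead of resolving the name "t" on every frame.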
Altering local size
Lastly, we can make use of the
In this last section, we are going to add some local
For simplicity, we increase the resolution of our texture to get a number that can be divided by 10 without a remainder. Here we will have 1,000,000 pixels and thus need 1 million shader
// texture size
const unsigned int TEXTURE_WIDTH = 1000, TEXTURE_HEIGHT = 1000;
We can now lower the amount of
glDispatchCompute((unsigned int)TEXTURE_WIDTH/10, (unsigned int)TEXTURE_HEIGHT/10, 1);
If we run the program without altering the shader we will see that only 1/100 of the image will be calculated.
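The arithmetic behind this can be written down explicitly: along each axis, the number of covered pixels is the number of work groups times the local size. The sketch below uses two hypothetical helpers; groupCountFor additionally rounds up for texture sizes that are not a multiple of the local size, a case this chapter avoids by choosing 1000 x 1000 pixels.

```cpp
#include <cassert>
#include <cstdint>

// Number of work groups needed along one axis so that every pixel is
// covered, rounded up when pixels is not a multiple of localSize.
std::uint32_t groupCountFor(std::uint32_t pixels, std::uint32_t localSize)
{
    assert(localSize > 0);
    return (pixels + localSize - 1) / localSize;
}

// Number of pixels covered along one axis by a given dispatch.
std::uint32_t coveredPixels(std::uint32_t groups, std::uint32_t localSize)
{
    return groups * localSize;
}
```

With 100 groups per axis and a local size of 1, only 100 x 100 = 10,000 of the 1,000,000 pixels are covered, which is the 1/100 mentioned above; a local size of 10 restores the full 1000 x 1000 coverage. When rounding up produces more invocations than pixels, a bounds check on the texel coordinate in the shader is the usual safeguard.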
To calculate the whole image again we have to adjust the local_size of the compute shader accordingly. Here we distribute the
#version 430 core
layout (local_size_x = 10, local_size_y = 10, local_size_z = 1) in;
layout(rgba32f, binding = 0) uniform image2D imgOutput;
layout (location = 0) uniform float t; /** Time */
void main() {
    vec4 value = vec4(0.0, 0.0, 0.0, 1.0);
    ivec2 texelCoord = ivec2(gl_GlobalInvocationID.xy);
    float speed = 100;
    // the width of the texture
    float width = 1000;
    value.x = mod(float(texelCoord.x) + t * speed, width) / (gl_NumWorkGroups.x * gl_WorkGroupSize.x);
    value.y = float(texelCoord.y)/(gl_NumWorkGroups.y*gl_WorkGroupSize.y);
    imageStore(imgOutput, texelCoord, value);
}
As seen above we have to adjust the ratio for the relative
You can find the full source code for this demo here.
Final Words
The above introduction is meant as a very simple overview of the compute shader and how to make it work. As it is not part of the render pipeline, it can get even more complicated to debug non-working shaders/programs. This implementation only shows one of the ways to manipulate data with the compute shader using
In upcoming articles we will go into creating a particle simulation and deal with buffer objects to work on input data and output data after manipulation. As well as having a look at
Exercises