Introduction

GPU Computing

In this chapter, we will have a look on the compute shader and try to understand how it works and how we can create and run a compute shader. While traditionally the graphics card (GPU) has been a rendering co-processor which is handling graphics, it got more and more common to use graphics cards for other (not necessarily graphics related) computational tasks (General Purpose Computing on Graphics Processing Units; short: GPGPU-Programming). The reason for this purpose change is performance, as GPUs perform floating-point calculations much faster than today's CPUs. However, this performance boost comes with a hurdle in programming algorithms. Since the GPU is not a serial but a stream processor it's not trivial to program the same algorithms which were designed for the CPU to run on the GPU as well.

A stream processor uses a function/kernel (e.g. a fragment Shader) to run over a set of input records/stream (e.g. fragments) to produce a set of output records (pixels for the final image) in parallel. Due to the parallel execution, each element is processed independently, without any dependencies between elements.

As stated above the most important (mandatory) aspect of programs running on GPUs is that they must be parallelizable. Sharing of memory is not easily possible and very limited for kernels running on the graphics card, this means that calculations that the kernel performs must be computed independently of each other. For example, it's easy to implement a program that multiplies each element in one stream with the corresponding element (e.g. by index) in a second stream while it's more complicated (or not completely possible in parallel) to accumulate the values of one stream to one single sum value as it always needs the result of the executions before.

(Even though this operation can be enhanced by the GPU using a kernel that accumulates sub-stream data in parallel and reducing the amount of serial accumulations for bigger streams. The results of the sub-stream data has to be combined in the host program afterwards).

It is important to keep this mandatory parallelism in mind when writing GPU kernels as the GPU is not suitable for all problems due to its stream programming model.

In order to complete this chapter, you will need to be able to create an OpenGL 4.3+ context. The compute shaders to be discussed are only available starting in OpenGL 4.3. Using OpenGL 3.3 or earlier will result in errors. The sample shader code will use OpenGL 4.3.

To summarize, compute shaders work great for many small parallel batches. Check out: Mythbusters Demo GPU versus CPU

Compute Shader Stage

To make GPU computing easier accessible especially for graphics applications while sharing common memory mappings, the OpenGL standard introduced the compute shader in OpenGL version 4.3 as a shader stage for computing arbitrary information. While other GPGPU APIs like OpenCL and CUDA offer more features as they are aimed for heavyweight GPGPU projects, the OpenGL compute shader is intentionally designed to incorporate with other OpenGL functionality and uses GLSL to make it easier to integrate with the existing OpenGL graphics pipeline/application. Using the compute shader in OpenGL graphics applications makes it possible to avoid complicated interfacing, as it would be needed with OpenCL or CUDA.

Compute shaders are general-purpose shaders and in contrast to the other shader stages, they operate differently as they are not part of the graphics pipeline. (see OpenGL 4.3 with Computer Shaders). The compute shader itself defines the data "space" it operates on. An OpenGL function can be used to define the amount of executions that also initiates the execution of the compute operation. The computer shader does not have user-defined inputs or any outputs as known from the other shaders.

To pass data to the compute shader, the shader needs to fetch the data for example via texture access, image loads or shader storage block access, which has to be used as target to explicitly write the computed data to an image or shader storage block as well.

The following table shows the data any shader stage operates on. As shown below, the compute shaders works on an "abstract work item".

Stage	Data Element
Vertex Shader	per vertex
Tessellation Control Shader	per vertex (in a patch)
Tessellation Evaluation Shader	per vertex (in a patch)
Geometry Shader	per primitive
Fragment Shader	per fragment
Compute Shader	per (abstract) "work item"

Compute space

The user can use a concept called work groups to define the space the compute shader is operating on. Work Groups are the smallest amount of compute operations that the user can execute (from the host application). Wile the space of the work groups is a three-dimensional space ("X", "Y", "Z") the user can set any of the dimension to 1 to perform the computation in one- or two-dimensional space. In the image below every green cube is one work group.

During execution of the work groups the order might vary arbitrarily and the program should not rely on the order in which individual groups are processed.

The work group may contain many compute shader invocations. The amount of invocations of the shader is defined by the local size of the work group, which is again three-dimensional.

The image below shows how every work group is splitted in its local space invocations represented by the red cubes.

An example:
Given the local size of a computer shader of (128, 1, 1) and executing the shader with a work group count of (16, 8, 64). The shader will be 1,048,576 times invoked separately. This is the product of work group dimensions times the product of the local size of the compute shader: (128 * 1 * 1 * 16 * 8 * 64 = 1,048,576). Each invocation can be uniquely identified by a unique set of inputs.

While it is possible to communicate using shared variables and special functions between different invocations in a specific work group, it is not effectively possible to communicate between different work groups without potentially deadlocking the system.

Create your first compute shader

Now that we have a broad overview about compute shaders let's put it into practice by creating a "Hello-World" program. The program should write (color) data to the pixels of an image/texture object in the compute shader. After finishing the compute shader execution it will display the texture on the screen using a second shader program which uses a vertex shader to draw a simple screen filling quad and a fragment shader.

Since compute shaders are introduced in OpenGL 4.3 we need to adjust the context version first:

glfwWindowHint(GLFW_CONTEXT_VERSION_MAJOR, 4);
glfwWindowHint(GLFW_CONTEXT_VERSION_MINOR, 3);

Compile the Compute Shader

To being able to compile a compute shader program we need to create a new shader class. We create a new ComputeShader class, that is almost identically to the normal Shader class, but as we want to use it in combination to the normal shader stage we have to give it a new unique class name.

class ComputeShader
{
     public:
          unsigned int ID;
		
          ComputeShader(const char* computePath)
          {
               ...
          }
}

Note: we could as well add a second constructor in the Shader class, which only has one parameter where we would assume that this is a compute shader but in the sake of clarity, we split them in two different classes.Additionally it is not possible to bake compute shaders into an OpenGL program object alongside other shaders.

The code to create and compile the shader is as well almost identically to the one for other shaders. But as the compute shader is not bound to the rest of the render pipeline we attach the shader solely to the new program using the shader type GL_COMPUTE_SHADER after creating the program itself.

unsigned int compute;
// compute shader
compute = glCreateShader(GL_COMPUTE_SHADER);
glShaderSource(compute, 1, &cShaderCode, NULL);
glCompileShader(compute);
checkCompileErrors(compute, "COMPUTE");

// shader Program
ID = glCreateProgram();
glAttachShader(ID, compute);
glLinkProgram(ID);
checkCompileErrors(ID, "PROGRAM");

Check out the chapter Getting Started - Shaders to get more information about the Shader class.

Create the Compute Shader

With the shader class updated, we can now write our compute shader. As always, we start by defining the version on top of the shader as well as defining the size of the local invocations per dimension in the compute shader.

This can be done using the special layout input declaration in the code below. By default, the local sizes are 1 so if you only want a 1D or 2D work group space, you can specify just the local_size_x or the local_size_x and local_size_y component. For the sake of completeness, we will explicitly set all components as shown below.

#version 430 core

layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

Since we will execute our shader for every pixel of an image, we will keep our local size at 1 in every dimension (1 pixel per work group). We will alter this value later. OpenGL will handle this local size in the background. The values must be an integral constant expression of a value greater than 0 and it must abide by limitations shown in the warning paragraph below.

There is a limitation of work groups that can be dispatched in a single compute shader dispatch call. This limit is defined by GL_MAX_COMPUTE_WORK_GROUP_COUNT, which must/can be queried using the function glGetIntegeri_v where the indices 0, 1 and 2 corresponds to the X, Y and Z dimensions, respectively.

There is as well a limitation on the local size which can be queried with GL_MAX_COMPUTE_WORK_GROUP_SIZE and another limitation of the total number of invocations within a work group, which is that the product of the X, Y and Z components of the local size must be less than GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS.

As we define and divide the tasks and the compute shader groups sizes ourselves, we have to keep these limitations in mind.

We will bind the a 2d image in our shader as the object to write our data onto. The internal format (here rgba32f) needs to be the same as the format of the texture in the host program.

layout(rgba32f, binding = 0) uniform image2D imgOutput;

We have to use image2d as this represents a single image from a texture. While sampler variables use the entire texture including mipmap levels and array layers, images only have a single image from a texture. Note while most texture sampling functions use normalized texture coordinates [0,1], for images we need the absolute integer texel coordinates. Images and samplers are completely separated including their bindings. While samplers can only read data from textures, image variables can read and/or write data.

With this set up, we can now write our main function in the shader where we fill the imgOutput with color values. To determine on which pixel we are currently operating in our shader execution we can use the following GLSL Built-in variables shown in the table below:

Type	Built-in name
uvec3	gl_NumWorkGroups	number of work groups that have been dispatched set by glDispatchCompute()
uvec3	gl_WorkGroupSize	size of the work group (local size) operated on defined with `layout`
uvec3	gl_WorkGroupID	index of the work group currently being operated on
uvec3	gl_LocalInvocationID	index of the current work item in the work group
uvec3	gl_GlobalInvocationID	global index of the current work item (gl_WorkGroupID gl_WorkGroupSize + gl_LocalInvocationID)*
uint	gl_LocalInvocationIndex	1d index representation of gl_LocalInvocationID (gl_LocalInvocationID.z gl_WorkGroupSize.x * gl_WorkGroupSize.y + gl_LocalInvocationID.y * gl_WorkGroupSize.x + gl_LocalInvocationID.x)*

Using the built-in variables from the table above we will create a simple color gradient (st-map) on our image.

void main() {
    vec4 value = vec4(0.0, 0.0, 0.0, 1.0);
    ivec2 texelCoord = ivec2(gl_GlobalInvocationID.xy);
	
    value.x = float(texelCoord.x)/(gl_NumWorkGroups.x);
    value.y = float(texelCoord.y)/(gl_NumWorkGroups.y);
	
    imageStore(imgOutput, texelCoord, value);
}

We will setup the execution of the compute shader that every invocation corresponds to one pixel, though the global x and y size will be equal to the image's x and y dimension. Therefore, the gl_GlobalInvocationID gives us the absolute coordinate of the current pixel.Remember that we only have one single invocation per work group as we set all local dimensions to 1. Using the gl_NumWorkGroups variable, we can calculate the relative coordinate of the image in the range [0, 1] per dimension.

We can then write our calculated pixel data to the image using the imageStore function. The imageStore function takes the image unit to write to as first argument, the absolute texel coordinate as second argument and the data value to store at this texel as third.

Create the Image Objecte

In the host program, we can now create the actual image to write onto. We will create a 512x512 pixel texture.

// texture size
const unsigned int TEXTURE_WIDTH = 512, TEXTURE_HEIGHT = 512;
...
unsigned int texture;

glGenTextures(1, &texture);
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, texture);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F, TEXTURE_WIDTH, TEXTURE_HEIGHT, 0, GL_RGBA, 
             GL_FLOAT, NULL);

glBindImageTexture(0, texture, 0, GL_FALSE, 0, GL_READ, GL_RGBA32F);

To find a deeper explanation of the functions used to setup a texture check out the Getting Started - Textures chapter. Here the glBindImageTexture function is used to bind a specific level of a texture to an image unit. Since we use image2D we need to use this function instead of the glBindTexture function. Note that we use GL_RGBA32F as internal format corresponding to the layout format used in the compute shader.

Executing the Compute Shader

With everything set up we can now finally execute our compute shader. In the drawing loop we can use/bind our compute shader and execute it using the glDispatchCompute function.

// render loop
// -----------

computeShader.use();
glDispatchCompute((unsigned int)TEXTURE_WIDTH, (unsigned int)TEXTURE_HEIGHT, 1);

// make sure writing to image has finished before read
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);

We first bind our shader using the use() function of the ComputeShader. The glDispatchCompute function launches one or more compute work groups based on the given 3 dimensions as arguments. Here we launch the execution two-dimensional corresponding to the image size and leave the third component to 1. While the individual shader invocations within the work group are executed as a unit, work groups are executed completely independent and in unspecific order.

Before accessing the image data after the compute shader execution we need to define a barrier to make sure the data writing is completly finished. The glMemoryBarrier defines such a barrier which orders memory transactions. The GLbitfield parameter barriers specifies the barriers to insert. They must be a bit wise combination of any GL barrier_bit constants (see: glMemoryBarrier - Khronos). In this case, we only need the GL_SHADER_IMAGE_ACCESS_BARRIER_BIT which assures access using the image functions will reflect data written by shaders prior to the barrier.

It is also possible to use the GL_ALL_BARRIER_BITS variable to have a generic barrier for all types of writing. The glMemoryBarrier function will stop the execution of the host program at this point though it makes sense to insert this function right before accessing the barrier's data.

Rendering the image

Lastly, we will render a rectangle and apply the texture in the fragment shader.

// render image to quad
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
screenQuad.use();
screenQuad.setInt("tex", 0);
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, texture);
renderQuad();

We will bind our texture now as sampler2D and use the texture coordinates of the rectangle to sample it.

The vertex and fragment shader are very simple as seen below.

Vertex Shader

#version 430 core
layout (location = 0) in vec3 aPos;
layout (location = 1) in vec2 aTexCoords;
	
out vec2 TexCoords;
	
void main()
{
    TexCoords = aTexCoords;
    gl_Position = vec4(aPos, 1.0);
}

Fragment Shader

#version 430 core
out vec4 FragColor;
	
in vec2 TexCoords;
	
uniform sampler2D tex;
	
void main()
{             
    vec3 texCol = texture(tex, TexCoords).rgb;      
    FragColor = vec4(texCol, 1.0);
}

Image Output

Adding Time Variable and Speed Measuring

We will now add time to the program for performance measuring to test which settings (work group amount/local size) work best for us.

// timing 
float deltaTime = 0.0f; // time between current frame and last frame
float lastFrame = 0.0f; // time of last frame
int fCounter = 0;

// render loop
// -----------
...
// Set frame time
float currentFrame = glfwGetTime();
deltaTime = currentFrame - lastFrame;
lastFrame = currentFrame;
if(fCounter > 500) {
        std::cout << "FPS: " << 1 / deltaTime << std::endl;
        fCounter = 0;
} else {
    fCounter++;
}

The code above prints the frames per second limited to one print every 500 frames as too frequent printing slows the program down. When running our program with this "stopwatch" we will see that it will never get over 60 frames per second as glfw locks the refresh rate by default to 60fps.

To bypass this lock we can set the swap interval for the current OpenGL Context to 0 to get a bigger refresh rate than 60 fps. We can use the function glfwSwapInterval function for this when initializing the glfw context:

glfwMakeContextCurrent(window);
glfwSetFramebufferSizeCallback(window, framebuffer_size_callback);
glfwSwapInterval(0);

Now we can get much more frames per seconds rendered/calculated. To be fair this example/hello world program is very easy and actually doesnt have any complex calculations so the calcuation times are very low.

We can now make our texture animated (moving from left to write) using the time variable. First, we change our computeShader to be animated:

#version 430 core

layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;
	
// images 
layout(rgba32f, binding = 0) uniform image2D imgOutput;

// variables
layout (location = 0) uniform float t;                 /** Time */
	
void main() {
    vec4 value = vec4(0.0, 0.0, 0.0, 1.0);
    ivec2 texelCoord = ivec2(gl_GlobalInvocationID.xy);
    float speed = 100;
    // the width of the texture
    float width = 1000;

    value.x = mod(float(texelCoord.x) + t * speed, width) / (gl_NumWorkGroups.x);
    value.y = float(texelCoord.y)/(gl_NumWorkGroups.y);
    imageStore(imgOutput, texelCoord, value);
}

We create a uniform variable t, which will hold the current time. To animate a repeating rolling of the texture from left to right we can use the module operation %. We animate the texture using the time variable t multiplied by the a speed value as offset for the x coordinate. Having the offseted x coordinate we can use the width of the texture (which in this case is hard-codeded) as divisor to get the rest which will be the new coordinate. We divide this value by the by the Workgroup size in x to get the ratio value between 0 and 1 we do the same for the y value, where we just simply divide the texel coordinate by the number of workgroups in the y dimension.

In the host program, we can assign the variable value the same way as we assign them for any other shader using glUniform functions, which is wrapped in the setFloat function of the ComputeShader class. We use setFloat to set the value of the variable t.

computeShader.use();
computeShader.setFloat("t", currentFrame);

Hence currentFrame is an altering value, we have to do the assignment in the render loop for every iteration.

The layout (location = 0) definition in front of the float variable is in general not necessary as the shader implementation queries the location of every variable on each uniform assignment. This might slow down the program execution speed if executed for multiple variables every render loop.

glUniform1f(glGetUniformLocation(ID, name.c_str()), value);
If you know that the location won't change and you want to increase the performance of the program as much as possible you can either query the location just once before the render loop and save it in the host program or hardcode it in the host program.

Altering local size

Lastly, we can make use of the local size. As it can be seen in the image below the total amount of n dimensional executions is the product of the amount of work groups times local invocations. (compare the calculation in the compute space section above). Currently one pixel corresponds to one work group as we set the local size to 1 in all dimensions (dark gray boxes).

In this last section, we are going to add some local invocations (small light grey boxes) per work group. In other words, we will split the image in batches of a specific size and run over each of these batches per work group. So we have to alter our shader a little bit to calculate and write to the right texel. You could imagine the final image as an overlay over the work group sheet below where each invocation will then be one pixel of the image:

For simplicity, we increase the resolution of our texture to get a number that can be divided by 10 without a rest. Here we will have 1,000,000 pixels though need 1 million shader invocations.

// texture size
const unsigned int TEXTURE_WIDTH = 1000, TEXTURE_HEIGHT = 1000;

We can now lower the amount of work groups dispatches by the ratio of 10 for each dimension. This means we will execute 10,000 work groups.

glDispatchCompute((unsigned int)TEXTURE_WIDTH/10, (unsigned int)TEXTURE_HEIGHT/10, 1);

If we run the program without altering the shader we will see that only 1/100 of the image will be calculated.

Rendered image altered global workgroups

To calculate the whole image again we have to adjust the local_size of the compute shader accordingly. Here we distribute the invocations as well only in 2 dimensions (X and Y).

#version 430 core
layout (local_size_x = 10, local_size_y = 10, local_size_z = 1) in;

layout(rgba32f, binding = 0) uniform image2D imgOutput;

layout (location = 0) uniform float t;                 /** Time */

void main() {
    vec4 value = vec4(0.0, 0.0, 0.0, 1.0);
    ivec2 texelCoord = ivec2(gl_GlobalInvocationID.xy);
    
    float speed = 100;
    // the width of the texture
    float width = 1000;

    value.x = mod(float(texelCoord.x) + t * speed, width) / (gl_NumWorkGroups.x * gl_WorkGroupSize.x);
    value.y = float(texelCoord.y)/(gl_NumWorkGroups.y*gl_WorkGroupSize.y);
    imageStore(imgOutput, texelCoord, value);
}

As seen above we have to adjust the ratio for the relative texel coordinate calculation. The gl_NumWorkGroups variable gives us the amount of the local size per work group. This makes it obvious that the amount of dimensions is the product of the amount of work groups times the amount of local invocations as stated in the introduction.

You can find the full source code for this demo here.

Final Words

The above introduction is meant as a very simple overview of the compute shader and how to make it work. As it is not part of the render pipeline, it can get even more complicated to debug non-working shaders/programs. This implementation only shows one of the ways to manipulate data with the compute shader using image access. Using Uniform Buffers or Shader Storage Buffers is a more common way to manipulate geometry itself like particle or cloth simulations.

In upcoming following articles we will go into creating a particle simulation and deal with buffer objects to work on input data and output data after manipulation. As well as having a look on Shared Memory and atomic operations. The upcoming articles will build on these basics and go more into details of the compute shader and more complex calculations like simulations or image manipulations.

Exercises

Check The book of shaders and try to apply some of the generative designs in the compute shader to get more complex calculations. Compare different ratios between work groups and local sizes and see how the FPS differ.

Try to add noise/pattern parameters as uniform variables for the implementation in the first excersise.

In a later article we will go over blurring with compute shaders and compare it with the fragment shader implementations. Feel free to go ahead and try it on your own. Check the GLSL function imageLoad(image, texelCoordinate)

References

Article by: Jonas Sorgenfrei
Contact: mail