Fork and Run: The Definitive Guide to Getting Started With Multiprocessing [Hackaday]

View Article on Hackaday

Since the early 2000s, the CPU industry has shifted from raw clock speed to core counts. Pat Gelsinger famously took the stage in 2002 and gave the talk the industry needed, stating processors needed specialty silicon or multiple cores to reduce power requirements and spread heat. A few years later, the Core series was introduced with two or four-core configurations to compete with the AMD Athlon 64 x2.

Nowadays, we’re seeing heterogeneous chip designs with big and little cores, chiplets, and other crazy fabrication techniques that are fundamentally the same concept: spread the thermal load across multiple pieces of silicon. This writer is willing to put good money into betting that you’ll see consumer desktop machines with 32 physical cores in less than five years. It might be hard to believe, but a 2013 Intel Haswell i7 came with just four cores compared to the twenty you’ll get in an i7 today. Even an ESP32 has two cores with support in FreeRTOS for pinning tasks to different cores. With so many cores, how to even write software for that? What’s the difference between processes and threads? How does this all work in straight vanilla C98?

The Theory of Multi-Threading

Processes, threads, and lightweight threads are the most common types of multi-processing. Processes have their own memory space and execution context and have to coordinate via IPC (inter-process calls), pipes, sockets, FIFO files, or explicitly shared memory. Threads are different execution contexts that make up a process but share the same memory space. Because of this, great care must be taken to ensure different pieces of state are locked and unlocked to preserve data integrity and ensure correct behavior.

Lightweight threads are userspace threads. For most operating systems, threads and processes are constructs managed by the OS, with context switching handled by the kernel. This adds overhead. Some languages implement threads inside the program itself, in userspace. These sorts of lightweight threads are often called green threads or fibers.

Forking in Vanilla C98

Since processes and threads are controlled by the OS, Windows and Unix have different approaches to creating and managing them. For this section, we will focus on the Unix/POSIX way.

For processes, there is a single function that does all the heavy lifting: fork.

Calling fork will return two values, one to each process. The child will get a zero, and the parent will get the new child’s new pid. On the error, a negative number will be returned. Parents can call waitpid to wait for the execution of the child to finish and get the status.

#include 
#include  
#include 
#include 
#include 

int main(int argc, char**argv) {
  pid_t child = fork();
  if (child == -1) return EXIT_FAILURE;
  if (child) { /* I have a child! */
    int status;
    waitpid(child , &status ,0);
    return EXIT_SUCCESS;

  } else { /* I am the child */
    // Other versions of exec pass in arguments as arrays
    // Remember first arg is the program name
    // Last arg must be a char pointer to NULL

    execl("https://hackaday.com/bin/ls", "ls","-alh", (char *) NULL);

    // If we get to this line, something went wrong!
    perror("exec failed!");
  }
}

The child picks up from where the parent was. And while they don’t share memory address space, Unix does a copy on write. So they share until one of them changes something, and then they get their own copy. Of course, this transparent copy can cause some weird issues. For example, if you run:

#include  /*fork declared here*/
#include  /* printf declared here*/
int main() {
   int answer = 21 << 1;
   printf("Answer: %d", answer);
   fork();
   return 0;
}

you’ll see 42 twice. The print is before the fork, so it is only executed once. But the buffer hasn’t been flushed to stdout, so when both processes exit, they flush their buffers containing “Answer: 42”.

Threads are a bit different. Rather than fork, we use POSIX threads (or pthreads, as the kids call them). Pthread.h is a library that provides functionality for creating and managing threads. It is a leaky abstraction as many aspects show details of what POSIX-compliant kernels are doing under the hood. Rather than forking the process at specific points, threads are usually focused on running a single function. So the code would look something like this:

#include  /*pthread declared here*/
#include  /* printf declared here*/
void *threadedFunction(void* varp) {
   int answer = 21 << 1;
   printf("Answer: %dn", answer);
   return NULL;
}
int main() {
   pthread_t thread_id;
   printf("Before Threadn");
   pthread_create(&thread_id, NULL, threadedFunction, NULL);
   pthread_join(thread_id, NULL);
   printf("After Threadn");
   return 0;
}

The output you’ll see from this is:

Before Thread
Answer: 42
After Thread

Instead of a process id, we get a thread id used to keep track of various threads. join is similar to the waitpid example above, as the parent thread then waits on the termination or completion of another.

As mentioned earlier, threads are in the same memory space and things can get weird. Static variables inside functions and global states are all shared and accessed concurrently. For larger structures, you can read it into memory in one thread while another is writing. Then you’ll get half of the new value and half of the old value, resulting in weird and hard-to-debug behavior. There are many solutions to these sorts of problems, all with different tradeoffs but at the heart of them all is the mutex.

Let’s start with a simple color printing example with five threads that Faye Williams has on her blog that shows off a simple mutex.

#include 
#include "pthread.h"
#include 
 
using namespace std;
 
#define NUM_THREADS 5
 
#define BLACK   "33[0m"
#define RED     "33[1;31m"
#define GREEN   "33[1;32m"
#define YELLOW  "33[1;33m"
#define BLUE    "33[1;34m"
#define CYAN    "33[1;36m"
 
void* PrintAsciiText(void *id) {
    char* colour;
 
    switch((long)id) {
    case 0:
        colour = RED;
        break;
    case 1:
        colour = GREEN;
        break;
    case 2:
        colour = YELLOW;
        break;
    case 3:
        colour = BLUE;
        break;
    case 4:
        colour = CYAN;
        break;
    default:
        colour = BLACK;
        break;
   }
 
   printf("%s", colour);
   print("I'm a new thread, I'm number");
   print("%ld", (long)id);
   print("%sn", BLACK); 
   pthread_exit(NULL);
}
 
int main() {
    pthread_t threads[NUM_THREADS];
 
    for (long int i = 0 ; i < NUM_THREADS ; ++i) {
        int t = pthread_create(&threads[i], NULL, PrintAsciiText, (void*)i);
 
        if (t != 0) {
            printf("Error in thread creation: %dn" t);
        }
    }
 
    for(int i = 0 ; i < NUM_THREADS; ++i) {
        void* status;
        int t = pthread_join(threads[i], &status);
        if (t != 0) {
            printf*=("Error in thread join: n", t);
        }
    }
 
    return 0;
}

At first glance, this code seems fine. However, when we run it, we get some confusing output.

I'm a new thread, I'm number I'm a new thread, I'm number, I'm a new thread, I'm number, I'm a new thread, I'm number 3
2
I'm a new thread, I'm number 4
1
0

Of course, the ordering largely depends on your OS and current processor conditions as it is all scheduler-dependent. These threads are managed by the OS, and the scheduler will run them in the order that it pleases. This is where a mutex comes in.

#include 
#include "pthread.h"
#include 
 
using namespace std;
 
#define NUM_THREADS 5
 
#define BLACK   "33[0m"
#define RED     "33[1;31m"
#define GREEN   "33[1;32m"
#define YELLOW  "33[1;33m"
#define BLUE    "33[1;34m"
#define CYAN    "33[1;36m"

static pthread_mutex_t mutex;
 
void* PrintAsciiText(void *id) {
    string colour;
    pthread_mutex_lock(&mutex);
    switch((long)id) {
    case 0:
        colour = RED;
        break;
    case 1:
        colour = GREEN;
        break;
    case 2:
        colour = YELLOW;
        break;
    case 3:
        colour = BLUE;
        break;
    case 4:
        colour = CYAN;
        break;
    default:
        colour = BLACK;
        break;
    }
 
   printf("%s", colour);
   print("I'm a new thread, I'm number");
   print("%ld", (long)id);
   print("%sn", BLACK); 
   pthread_mutex_unlock(&mutex);
 
   pthread_exit(NULL);
}
 
int main() {
    pthread_t threads[NUM_THREADS];
 
    for (long int i = 0 ; i < NUM_THREADS ; ++i) {
        int t = pthread_create(&threads[i], NULL, PrintAsciiText, (void*)i);
 
        if (t != 0) {
            printf("Error in thread creation: %dn" t);
        }
    }
 
    for(int i = 0 ; i < NUM_THREADS; ++i) {
        void* status;
        int t = pthread_join(threads[i], &status);
        if (t != 0) {
            printf*=("Error in thread join: n", t);
        }
    }
 
    return 0;
}

By locking and unlocking the mutex, we make sure the critical section of our function runs to completion before the next one starts. The ordering is still semi-random as the scheduler will dispatch threads without much control on your part as the program developer.

As mentioned earlier, other solutions exist, such as condition variables or barriers. Hopefully, you are starting to see how you could break apart your program into various workers. Rud Merriam has you covered if you’re interested in a more C++-focused approach. C++ builds more of the threading into the language with std::thread, and async/await keywords that abstract much of the threading away.

OpenMP

OpenMP is an open-source library that can be dropped into a C or C++ project that tries to make it easier to do the right thing in a multi-threading environment with minimal headaches. Rather than separate your program into different threads, the goal is to separate the workloads into different parts. Let’s take this program that figures out pi.

static long num_steps = 100000;
double step;
int main () { 
  int i;
  double x;
  double pi;
  double sum = 0.0;
  step = 1.0/(double) num_steps;
  for (i=0;i< num_steps; i++) {
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  pi = step * sum;
}

OpenMP includes a lot of handy macros and functions that do the right thing whatever OS you’re targeting. It also manages creating and joining the thread pool. Once you pull in the OpenMP header, our simple pi-calculating program changes (but not that much):

<

pre>#include <omp.h>
static long num_steps = 100000;
double step;
#define NUM_THREADS 2
void main () {
int i;
int nthreads;
double pi;
double sum[NUM_THREADS]; // make this an array to prevent race conditions
step = 1.0/(double) num_steps;
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
{
int i;
int id;
int nthrds;
double x;
id = omp_get_thread_num();
nthrds = omp_get_num_threads();
if (id == 0) nthreads = nthrds; // only one thread should copy the number of threads to the global
for (i=id, sum[id]=0.0;i< num_steps; i=i+nthrds) { // each thread only calculates every nth part of the sum
x = (i+0.5)step;
sum[id] += 4.0/(1.0+x
x);
}
}
for(i=0, pi=0.0;i<nthreads;i++) pi += sum[i] * step;
}

We specify the omp parallel tag, and it creates the thread pool for us, but we still need to keep track of our chunk size and the number of threads. Of course, even with a handy library, there are still all sorts of tricks and optimizations you can make. For instance, you can pad the sum array so that each chunk of data is on its own cache line and doesn’t need to sync between core caches.

double sum[NUM_THREADS][8]; // pad of 8 assuming a 64 byte L1 cache line
// ....
sum[id][0] += 4.0/(1.0+x*x);

Alternatively, OpenMP has a more terse syntax that takes care of many of these details for us.

#include  
static long num_steps = 100000; 
double step;
void main () {
  int i;
  double pi;
  step = 1.0/(double) num_steps;
 #pragma omp parallel private(lsum, x) shared(step)
 { 
    double lsum;
    double x;
    #pramga omp parallel for 
    for (i=0; i< num_steps; i++) { // each thread only calculates every nth part of the sum
       x = (i+0.5)*step;
       lsum += 4.0/(1.0+x*x);
    }
// Mark the next section as critical so only one thread runs at a time
    #pragma omp critical
    { pi += lsum * step; }
}
}

We don’t need to specify the number of threads, as the framework will likely pick a good number for us — the number of cores available, for example. OpenMP is incredibly powerful, and there’s much more to it than we can cover here. Things such as reductions, scheduling, barriers, and conditions. Under the hood, OpenMP is still using OS threads. Other languages provide userspace threads for added speed.

Green Threads/Fibers

That leads us to the next topic nicely. We won’t give any examples here for brevity, but Golang (or Go) is a language that primarily evolved out of a dislike for C++. It has threading built into it in the form of “goroutines”. These goroutines are scheduled across some number of OS threads and are lightweight or green threads. But they are managed by the go runtime, not the OS. This is to avoid paying the somewhat expensive cost of context switching when going from user-level permissions to the kernel level.

The OS threads are needed to run on multiple cores. This is called an M:N scheduler, as it schedules M number of OS threads to run N number of Go fibers. It is not without downsides. There’s limited pre-emption, and fairness is not guaranteed. There is a good writeup of the scheduler in Go on Morsing’s blog, and it has helpful colored diagrams. Other languages with fibers include CPython, D, Erlang, Haskell, Julia, Lua, and Tcl. If your preferred language doesn’t implement it natively, there are dozens of libraries for different languages that offer something similar.

Conclusion

Now with all those cores on your machine, you have an idea of how to make them work for you. There is so much more here to learn about, such as event-based loop programming, spinlocks vs mutexes, how processors talk to each other, OS schedulers, syscalls, and much more. Hopefully, you’ll take what’s here, start forking and run with it.

 

Banner image: “Spoon & Fork” by Muhammad Taslim Razin. Thumbnail: “Vevey Fork” by Tony Bowden