Linux Fu: An Odd Use for fork() [Hackaday]

If you are a Star Trek fan, you’ll probably remember the phrase “You have to learn why things work on a starship.” The truth is, in most episodes, knowing how to override another ship’s console or make gunpowder didn’t come in very handy, but boy when it did, it really saved the day. Linux is a lot like that. There are a few things you probably don’t need to know very often, but when you do need to know, it makes a huge difference. In this particular post, I want to look at an odd use of the fork system call. For many purposes, you’ll never need to know this particular irregular use. But when you need it, you are really going to need it.

This is actually based on an old client of mine who used Unix to run a massive and very critical report every day. The report involved a lot of math, since they were trying to optimize something, and then it generated a stack of printed reports. In those days, the output went to old green-bar paper on a line printer. The problem was that the report took something like 14 hours to run, including the printouts. If someone discovered something wrong, there was no time to run the report again, because the next day's report would have to start before the second run would finish.

The client had a bunch of Windows programmers and — at that time — there wasn’t anything really analogous to a real fork call in Windows. I looked at the code and realized that probably most of the code was spending time waiting to print the output. The computer had multiple CPUs and there were multiple printers, but that one program was hanging on the one printer. There was a lot of data, so writing it to a database and then running different reports against it wasn’t a great option. The answer was to use the power of fork. With a change in the code that took less than 30 minutes, the report ran in five hours. They were very pleased.

So how did I do it? The answer lies in how fork works. Just about every time you see a fork, you see some sort of exec call to start a new program. So if you think about fork at all, you probably think it is part of how you start a new program and, most of the time, that’s true.

What does fork() Do Exactly?

The call, however, does something very strange. It actually copies the entire running process into a new process and then runs the new process. Of course, the original process keeps running, too. Normally, when you see fork, it looks like this:

pid_t childPID;
childPID = fork(); /* returns -1 if the call fails */
if (childPID == 0) exec....; /* child: load the new program and run that */
/* the parent only gets here with childPID set to the new process' PID */
...

In other words, fork returns zero in the child process and the child's process ID in the parent (or -1 if the call fails). Some early Unix systems really copied everything in the running process. However, that's really inefficient, especially when most of the time you just immediately load a new program.

Modern systems use COW, or Copy On Write, semantics. That means the new process gets what amounts to a pointer to the original process memory, and the kernel only copies relatively small chunks of memory when the child or parent writes to that region. This is good for things like instruction space that shouldn't change anyway, since very few people still write self-modifying code. The result is that right after a fork call, both parent and child see the exact same data, but any changes one makes are not visible to the other.

Parallel Processing Made Easy

For my client’s long report, the program was mostly I/O bound. However, each report also had some pretty hairy math to go along with it, in addition to all the math required to get to the point that each report could execute. Instead of executing all of it in one process, I broke the program up into multiple pieces. The first piece did as much math as it could that applied to nearly everything. Then the program called fork a bunch of times and each child started a report which did a little more math just for itself and claimed a printer to write the output.

Since the CPU had multiple processors, everything got sped up. Report three didn’t have to wait for reports one and two to complete. Everyone was able to drive the printers at once. It was an overall win and it took almost no time to make this fix.

Granted, not every problem will allow for a fix like this one. But giving each report process a memory copy of the data was very fast compared to reading it from a file or database. The data didn’t change after the reports started, so real memory consumption wasn’t too bad, either.

An Example

So is it really that simple? It is. The only problem now is that with modern machines, it is hard to find a simple problem to demonstrate the technique. I finally settled on doing something simple, but doing lots of it. My made-up task: fill a really large array of double-precision floating-point numbers with some made-up but predictable data and then find the average. By really large, I mean 55 million entries or more.

I created a program that can do the job in two ways. First, it just does it in the simplest way possible: a loop walks each item in the array, adds them up, and divides at the end. On my machine, running this a few times takes an average of about 458 milliseconds, as measured with the time command.

The program can also accept an F parameter on the command line. When that is in effect, the setup is the same, but a fork creates two processes that split the array in half and find the average of each half. I didn't want to add the overhead of having the child communicate its result back to the parent, although that would be easy enough to do. Instead, you just read the two printed averages, add them together, and divide by two to get the true average, which works because the halves are the same size.

The time for the fork version to run? About 395 milliseconds. Of course, your results will vary, and while 60 or so milliseconds doesn’t seem like a lot, it does show that having two processes working together can allow multiple cores to work at the same time.

The larger the array, the bigger the time savings. For example, setting the size to 155,256,000 showed a savings of around 150 milliseconds. Of course, these timings aren't scientific and there are a lot of factors to consider, but the data clearly shows that splitting the work between two processes is faster.

The Code

The code is straightforward. The work isn’t hard, there’s just a lot of it.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

// compile: gcc -o stress stress.c
// run: time stress
// time stress F

#define SIZE 55256000 // how big is the array?
double bigarray[SIZE];

// The process routine will go from llimit to ulimit
// For the single case, that's everything
// For the child case we will split in half
unsigned int ulimit=SIZE;
unsigned int llimit=0;

double total; // running total

// Load up the array with some bogus data
void setup(void)
   {
   unsigned int i;
   for (i=llimit;i<ulimit;i++)
      bigarray[i]=(double)i;  // predictable, made-up data
   }

// Total our slice of the array and print its average
void process(void)
   {
   unsigned int i;
   for (i=llimit;i<ulimit;i++)
      total+=bigarray[i];
   printf("Average: %f\n",total/(ulimit-llimit));
   }

int main(int argc, char *argv[])
   {
   int dofork=0;
   pid_t pid;
   if (argc>1 && (*argv[1]=='f' || *argv[1]=='F')) dofork=1; // f or F will trigger a fork
   setup(); // load array
   if (!dofork)
      {
// single case
// ulimit and llimit are already set
      process();
      exit(0);
      }
   else // forking here
      {
      if ((pid=fork()))
         {
// parent -- adjust ulimit
         ulimit=SIZE/2;
         process();
         waitpid(pid,NULL,0); // wait for child
         exit(0);
         }
      else
         {
// child -- adjust lower and upper limit
         llimit=SIZE/2;
         ulimit=SIZE;
         process();
         exit(0);
         }
     }
// we never get here
   }

Why Things Work on a Starship

Now you know how fork really works. Well, sort of. There are plenty of nuances about what handles get passed to the child and which don't. And, as I said, you won't need this very often. But there are times when it will really save the day.

If you want a higher-level look at multitasking, try an older Linux Fu. Or, check out the GNU Parallel tool.