Linux Fu: Simple Pipes [Hackaday]

March 16, 2022 Kyle Wiggers

In the old days, you had a computer and it did one thing at a time. Literally. You would load your cards or punch tape or whatever and push a button. The computer would read your program, execute it, and spit out the results. Then it would go back to sleep until you fed it some more input.

The problem is computers — especially then — were expensive. And for a typical program, the computer is spending a lot of time waiting for things like the next punched card to show up or the magnetic tape to get to the right position. In those cases, the computer was figuratively tapping its foot waiting for the next event.

Someone smart realized that the computer could be working on something else while it was waiting, so you should feed more than one program in at a time. When program A is waiting for some I/O operation, program B could make some progress. Of course, if program A didn’t do any I/O then program B starved, so we invented preemptive multitasking. In that scheme, program A runs until it can’t run anymore or until a preset time limit occurs, whichever comes first. If time expires, the program is forced to sleep a bit so program B (and other programs) get their turn. This is how virtually all modern computers outside of tiny embedded systems work.

But there is a difference. Most computers now have multiple CPUs and special ways to quickly switch tasks. The desktop I’m writing this on has 12 CPUs and each one can act like two CPUs. So the computer can run up to 12 programs at one time and have 12 more that can replace any of the active 12 very quickly. Of course, the operating system can also flip programs on and off that stack of 24, so you can run a lot more than that, but the switch between the main 12 and the backup 12 is extremely fast.

So the case is stronger than ever for writing your solution using more than one program. There are a lot of benefits. For example, I once took over a program that did a lot of calculations and then spent hours printing out results. I spun off the printing to separate jobs on different printers and cut like 80% of the run time — which was nearly a day when I got started. But even outside of performance, process isolation is like the ultimate encapsulation. Things you do in program A shouldn’t be able to affect program B. Just like we isolate code in modules and objects, we can go further and isolate them in processes.

Doubled-Edged Sword

But that’s also a problem. Presumably, if you want to have two programs cooperate, they need to affect each other in some way. You could just use a file to talk between them but that’s notoriously inefficient. So operating systems like Linux provide IPC — interprocess communications. Just like you make some parts of an object public, you can expose certain things in your program to other programs.

The most fundamental way to do this is with the fork call. When you fork a new process, the new process is a total copy of its parent. You don’t always realize this because the next thing you often do is call something like exec to load a new program or you use a wrapper-like system that calls fork and exec for you. But each time you run, say, ls at the command prompt, your running ls program starts life as a total copy of the shell. That copy then loads the ls executable and runs it.

But what if it didn’t? That was how my report writer worked. The big math, which took hours on a Sequent computer with a lot of CPUs, took place in one process. When it was time to print, I forked off a bunch of subprocesses. Each one had a total copy of the data which I then treated as read-only and started printing. That’s one way to communicate between processes.

Another way is pipes. Imagine a command line like:

cat data.txt | sort | more

Here, you are creating three processes. One dumps data from a text file. It sends that data to a pipe that is connected to the sort program. It outputs to another pipe that is connected to the more program.

One Way

Pipes like that are one-way affairs, but you can create named pipes and talk over them in both directions. You can do both of these in the shell — mknod makes a named pipe — but you can also do both of them in a program. (popen is very easy to use for regular pipes and there is a mknod API call, too.)

There are several other methods you can use to talk across processes:

Message queues – A way to send messages to another process asynchronously
Semaphores – A way to share a counter with another program
Shared memory – Share a block of memory
Signal – You can send signals to other processes which can be used as a form of communication

You might wonder why you need anything beyond shared memory. Honestly, you don’t, but it is easier in many cases to use a different method. The problem is you need some way to have an atomic operation and things like semaphores manage that for you. Imagine if we had a variable in shared memory called busy. If busy is 1, then we know we shouldn’t change data in our shared memory because someone is using it.

We might write:

while (busy) ; // wait for busy==0
busy=1;
do_stuff();
busy=0;

Looks great, right? No. Somewhere in the CPU that while loop looks like this:

while_loop2384: TST busy ; set flags on busy
                JNZ while_loop2384 ; if no zero flag, jump
                MOV busy,#1 ; move 1 to busy

Most of the time this will work fine. Most of the time. But what happens if I do the TST instruction and then I get put to sleep so another program can run the same code? Or another CPU is running the exact same code at the exact same time? It can happen. So both programs will now see that busy is zero. Then they will both set busy to 1 and continue on. That’s a fail.

Semaphores manage this through an atomic access mechanism that allows the program to test and set the operation in one place. There is more to worry about, like what happens if I’m waiting for process B to release a semaphore and process B is waiting for me to release a different one. But that situation — deadlock –is a topic for the future, along with other hiddle gotchas like priority inversion.

In the Pipeline

I have a bit of a made up problem. On Linux, if I type df I can find out all the mounted things and their characteristics. But that list includes things like the root directory and swap file. What if you wanted to just read the loop devices and show the same format output? There are plenty of ways to do that, of course. You could read the loop files out /etc/mtab and then read the other data from /sys or wherever else it resides. Sounds like a lot of work.

Of course, running df gets us almost there. In fact, I could just run a pipeline in the shell to get what I want, sort of:

df | grep '^/dev/loop'

That works but the output is scrambled. On my system /dev/loop3 is first and /dev/loop0 is last and there’s no clear reason why number 4 is between 8 and 14. So I want to sort it. Piping through sort doesn’t help much because it is going to sort alphabetically. You might think about the -n flag to sort, but that won’t work because the number is at the end of the string. Sure, I could use some strange combo of cut or sed to maybe fix all this, but it is getting too complicated. Let’s just write C code.

The first step is to get df to just print everything and capture the output. Since we want to process the output we need to read a pipe, and popen() is an easy way to set this up:

#include 
int main(int argc, char * argv[]) {
// This part reads the output of DF into the lines array (with some modifications)
   FILE * result = popen("df", "r"), * sort;
   int i;
   if (!result) {
      perror("Can't open df");
      return 1;
   }
   while (!feof(result)) {
     int c = getc(result);
     if (c != EOF) putchar(c);
   }
 pclose(result);
 return 0;
}

Half Solved

That’s half the problem solved. If you have the characters, you could do all the sorting and filtering you want, but… wait a minute! I’m still lazy. So let’s ask the shell to help us out. Here’s my plan. I know I only want lines that start with /dev/loop so let’s do this:

Read a whole line at one time
If it isn’t a /dev/loop line throw it out
If it is a /dev/loop line then save it in an array but chop off the /dev/loop part
After we have all the lines ask the shell to sort and then add the /dev/loop back in after sorting

Easy enough:

#include 
#include 

char buffer[4097];
char * lines[512];
unsigned int maxline = 0;

int main(int argc, char * argv[]) {
// This part reads the output of DF into the lines array (with some modifications)
   FILE * result = popen("df", "r"), * sort;
   int i;
   if (!result) {
      perror("Can't open df");
   return 1;
   }
   while (!feof(result)) {
 // get a line from df
     char * rc = fgets(buffer, sizeof(buffer), result);
// only save lines that start with /dev/loop
    if (rc &amp;&amp; !strncmp(buffer, "https://hackaday.com/dev/loop", 9)) {
// out of space
       if (maxline >= sizeof(lines) / sizeof(char * )) {
       fprintf(stderr, "Too many loopsn");
       return 2;
     }
   lines[maxline++] = strdup(buffer + 9); // copy just the number part and the rest of the line
   // should check lines[maxline[1] for null here
   }
 }
 pclose(result);

// Now we are going to print through sort
// The sed replaces the /dev/loop at the front of the line after sorting
 sort = popen("sort -n | sed 's/^/\/dev\/loop/'", "w");
 if (!sort) {
   perror("Can't open sort");
   return 3;
  }
// for each line, send to pipe (note order didn't really matter here ;-)
 for (i = 0; i < maxline; i++)
   fputs(lines[i], sort);
 pclose(sort);
 return 0;
}

And there you have it. Yeah, you could do this with the shell alone but it would be a lot more difficult unless you resort to another programming language like awk and then you really aren’t just using a shell. Besides, this makes a nice example and there are lots of things you could do like this that would be very hard to do otherwise.

You might wonder if you could spin off something like sort and both feed it input and read its output. The answer is yes, but not with popen(). The popen() call is just a handy wrapper around pipe and fork. If you wanted to do both ends you would have to use the pipe() call directly (twice) and then execute sort or whatever. But that’s a topic for the future.

There are plenty of other future topics for interprocess communications, too. But for now, try out pipes using popen. Critical sections come up in shell scripts, too. If you prefer to write your scripts in C, that’s possible, too.

Doubled-Edged Sword

One Way

In the Pipeline

Half Solved

Spread the word!