There are lots of Linux users who don’t care how the kernel works, but only want to use it. That is a tribute to how good Linux is.
— Linus Torvalds
Operating systems run different programs at the same time. As these programs are often written by different developers, the operating system guarantees the correct use of resources. One program cannot use all the memory or read data written by another program.
Whenever your program wants to execute a privileged action, like printing text on the screen, it asks the kernel to perform the action on its behalf. How does it work? This post shows the underlying code executed when you call a Linux system call.
- What is the difference between a library call and a system call.
- Why system calls are less performant than function calls.
- What are the steps when executing a system call.
- What is the difference between an API and an ABI.
- How to execute a system call in assembly language.
- How programming languages expose system calls.
Overview
All programs running on Linux use system calls. System calls are the interface between your application and the Linux kernel. If you need a service provided by the kernel, you need a system call. You use them every time you read a file on disk (open, read, close), allocate a new block of memory (sbrk, mmap), communicate with another process (kill, listen), start a new process (fork, wait), or simply when your program stops (exit).
The command man 2 syscalls lists all system calls:
This command returns 467 different system calls on my Ubuntu 20.04 server. Not all of them are actively in use. Some system calls are deprecated and only kept to preserve compatibility with old binaries.
System calls are unavoidable, but in practice, we rarely interact directly with them.
A library function is simply one of the functions that constitute the standard C library.
Many library functions don’t use system calls (e.g., string-manipulation functions like strcmp()). On the other hand, some library functions are layered on top of system calls. For example, the fopen() library function uses the open system call to actually open a file. Often, library functions are designed to provide a more user-friendly interface than the underlying system call. For example, the printf() library function provides output formatting and data buffering, whereas the write system call just outputs a block of bytes. Sometimes, library functions and system calls have the same name. For example, man 2 exit prints the manual for the exit system call and man 3 exit prints the manual for the exit() library function.
System calls are implemented by your kernel and are an integral part of Linux. In contrast, several implementations of the standard C library exist. The most commonly installed is the GNU C library (glibc), which is the one we will cover in this article.
For example, in C, the PID of your program is accessible using the function getpid():
- 1
- The getpid() library function is defined in unistd.h and returns the PID of the current process using the getpid system call.
Retrieving the PID of the current process is one of the most basic system calls. It simply returns the value stored in memory inside a data structure maintained by the kernel. It takes no arguments. It always succeeds. But calling getpid() is not like calling any other function.
Here is a minimal benchmark illustrating this difference:
- 1
- We defined a basic function returning an integer literal. Retrieving the PID of a process doesn’t interact with a hardware device. Simply returning this integer is relatively close to reading this value from a data structure kept in memory.
- 2
- We call the two functions 1,000,000 times and measure how long it takes.
Here are the results on my laptop using Ubuntu 20.04 in a virtual server:
In this example, a system call is about 200 times slower than a simple function call. Indeed, a system call is not a simple function call. When you call the function getpid(), you use a wrapper implemented by glibc that hides the actual logic needed to execute a system call. Under the hood, this wrapper function does a lot of work:
- Step 1: The library function copies its arguments into registers. It also copies a number identifying the system call into a specific register. The library function then forces the processor to switch from user mode to kernel mode.
- Step 2: The kernel executes the system call:
- The kernel saves the state of the CPU (the register values).
- The kernel checks the validity of the system call number.
- The kernel invokes the right system call routine based on this number. This routine checks the validity of arguments and executes the logic of the system call.
- The kernel restores the state of the CPU and places the return value and the possible error in specific registers.
- Step 3: The library function checks for an error and sets the global variable errno. The library function then returns to the caller.
That’s a lot of work, for sure, and provides the beginning of an explanation for why system calls are more expensive.
Step By Step
It’s time to show the code behind system calls. We will use glibc (v2.33) and Linux kernel (v5.13) to illustrate the lines of code running between the user and kernel modes. We will continue with the getpid example.
User Mode (glibc)
The Objective
For this first step, the objective is to execute the system call getpid from the viewpoint of a user process. Concretely, we have to set values in specific CPU registers before executing a specific CPU instruction. As different CPU architectures have different registers and different instruction sets, the logic will, of course, differ based on your computer architecture 🙂.
For example, here is the assembly code to execute the getpid system call for the amd64 architecture:
Here is the same code for the arm64 architecture:
These two instructions are enough to ask the kernel to return the PID of the current process.
Linux system calls are accessible using an application binary interface (ABI). An ABI defines how a routine is accessed in machine code (hardware-dependent) whereas an API defines a similar access in source code (hardware-independent).
If Linux system calls were implemented using a standard C API, every program would have to call them like C functions. An ABI removes this restriction by asking the compiler or interpreter of your favorite language to generate the machine code (i.e., initializing the registers). ABI is for hardware what API is for software.
The System V Application Binary Interface is the reference specification used by major Unix-like operating systems such as Linux. If we want to understand the previous code sample, we need to look in particular at the System V Application Binary Interface for AMD64. This document is about 100 pages long, but only the section about the calling conventions is relevant here:
- User-level applications use as integer registers for passing the sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9. The kernel interface uses %rdi, %rsi, %rdx, %r10, %r8 and %r9.
- A system-call is done via the syscall instruction. This clobbers %rcx and %r11 as well as the %rax return value, but other registers are preserved.
- The number of the syscall has to be passed in register %rax.
- System-calls are limited to six arguments, no argument is passed directly on the stack.
- Returning from the syscall, register %rax contains the result of the system-call. A value in the range between -4095 and -1 indicates an error, it is -errno.
- Only values of class INTEGER or class MEMORY are passed to the kernel.
In our example, we don’t have arguments to pass, but we still need to specify which system call we want to execute. Under the hood, a Linux system call is just a number. For amd64, the number 39 represents the getpid system call and must be specified in the register rax before calling the CPU instruction syscall.
The Code
Glibc implements the getpid() library function and many other similar functions to make system calls accessible to C programs in a friendly manner. Calling the getpid system call is not so different from calling any other system call. The number of arguments varies, and some calls do not return errors, but the logic is similar. Basically, glibc must put values in registers and execute an instruction to delegate to the kernel. Therefore, to avoid code duplication, glibc adopts a declarative approach to implement these library functions.
For example, if you look inside the source code, you will only find the declaration of the function getpid():
You will not find the implementation directly, at least not in an obvious manner.
System calls are defined in various syscalls.list files reflecting the differences between machine architectures. These files are then merged in a precise order. The format looks like this:
These files contain the metadata required to generate thin assembly wrappers around the corresponding system calls. For example, getpid arguments are defined as Ei:, which means:
- E → errno is not set by the call (i.e., the system call never fails).
- i → returns a scalar value (i.e., an integer representing the pid_t).
- : → separates the return context from the arguments. As there are no letters after it, it means the system call takes no argument.
These files are then read by the script make-syscalls.sh, launched by the Makefile when building glibc. This script outputs one rule for every system call:
Here is an example of the resulting command when all of the pieces are put together:
The command compiles a C snippet from stdin using many directories containing header files, in particular, files named sysdep.h. These files declare macros representing the real assembly code for all supported architectures. For example:
The result of the previous rule command is an object file. Let’s inspect its content:
To sum up, when we call the getpid() library function, the function actually executed is its alias __getpid(). This function is implemented in assembly language and executes the same instructions we presented before.
In practice, not all system calls can be generated like this. A prior version of the getpid() library function used a cache to limit system calls since the PID of a process never changes. This cache was removed by this commit, but if we move back in Git history, we can have a look at a different technique used by glibc to implement library functions.
- 1
- The function __getpid() is implemented as a C function.
- 2
- The code checks for the pid in the thread-local memory area to determine if the function has already been called.
- 3
- If the cache is empty, the code delegates to really_getpid(), which checks the cache again before calling the macro INTERNAL_SYSCALL we have just covered.
Of course, when a file like getpid.c is present, the script make-syscalls.sh does not override it:
- 1
- No rule is generated.
The code will simply be compiled with the rest of the glibc source code, reusing the same macros as the current implementation, which means the code always ends up with the syscall instruction to give control to the kernel.
Kernel Mode (Linux)
The Objective
The user process has just requested a service from the kernel. It filled the registers and called a special instruction to jump to a different location. Enter the kernel.
For this second step, the objective is for the kernel to register a procedure at this location. This procedure reads the system call number and looks at the table of system calls to find the address of the kernel function to call. Then after this function returns, it does a few checks and then returns to the user process.
The Code
First, we will look at the implementation of the getpid() system call.
Implementing a System Call
The main entry point for the getpid system call is called sys_getpid(), but you will not find the function declared under that name. System call functions are defined using the SYSCALL_DEFINEn() macro rather than explicitly, where n indicates the number of arguments. This macro takes the system call name followed by the (type, name) pairs for the parameters as arguments. The motivation is to make metadata available for other tools like tracing.
Here is the definition of the getpid system call:
- 1
- The code uses the current pointer representing the current task, which is issuing the syscall. The PID is then extracted from this struct. We will not cover it further.
This entry point also needs a corresponding function prototype in the reference file include/linux/syscalls.h. This prototype is marked as asmlinkage to match the way that system calls are invoked:
Finally, the system call must be registered in the system call table, so that the kernel can find it from its number.
Most architectures share a generic system call table:
But some architectures (e.g. x86) have their own architecture-specific system call tables. For amd64, the system call table looks like:
- 1
- We find again the number 39 representing the getpid system call on amd64.
That’s pretty much all the steps required when adding a new system call in Linux.
Now, we must look at the glue between the syscall CPU instruction and the system call function we have just presented.
Initializing the System Call Entry
On amd64, the instruction syscall loads the address stored in the register IA32_LSTAR into the register RIP, the instruction pointer. After this step, the handler at that location is executed in a privileged CPU mode. This means that the kernel needs to put the system call entry address into the IA32_LSTAR register during its initialization.
int 0x80 vs syscall
Many online code examples use the int 0x80 instruction instead of syscall. This instruction was the only option on the i386 architecture (x86) and is still available on the amd64 architecture (x86-64), since the latter is a superset of the former for backward-compatibility reasons (i.e., code compiled for x86 also runs on x86-64).
For example, the getpid system call can be executed in two different ways on amd64:
Similar instructions exist for other architectures too. The motivation is always to transition from user to kernel mode in a secure way—an application cannot just jump to arbitrary kernel code.
From an implementation viewpoint:
- int 0x80 relies on software interrupts. The idea is to use the same method to enter the kernel as hardware interrupts do (e.g., when pressing a key on your keyboard).
- syscall (and sysenter) relies on specific CPU instructions designed for the specific use case of system calls and thus comes with optimizations.
syscall is more performant because it does fewer operations (syscall does not generate a software interrupt). Based on some benchmarks, using syscall is several times faster (~5 times faster), which is fast compared to int 0x80 but still slow compared to calling a local function.
The kernel starts when the function start_kernel defined in init/main.c is called. This function installs various interrupt handlers using the function trap_init, which calls cpu_init, which calls syscall_init. Let’s look at the implementation of this last function (for amd64):
- 1
- MSR_* are model-specific registers and can only be written by the privileged CPU instruction wrmsr. This first line is low-level code to ensure that we return to user code with the related privilege.
- 2
- entry_SYSCALL_64 is the system call entry. We store the address of this function.
Now that the system call entry is ready, we are almost ready to see what happens when the syscall instruction is executed, but first, we still have to introduce the system call table.
Initializing the System Calls Table
Any system call will trigger the execution of the system call entry we have just configured. This function determines which system call routine to execute by looking into the system call table for the system call number.
This table is represented by the sys_call_table array in the Linux kernel:
- 1
- All elements point initially to the sys_ni_syscall function, which is a fallback function simply returning -ENOSYS (Function not implemented).
- 2
- The header file asm/syscalls_64.h is generated dynamically from the list of system calls on your system and overrides the default handler for all defined system calls.
This asm/syscalls_64.h file is generated by the script arch/x86/entry/syscalls/syscalltbl.sh and the result looks like:
If we evaluate the macros, our system call table initialization looks like this:
At this point, we have already configured the system call entry, and the system call table is ready for this handler to determine the system call to execute. Let’s do it.
Entering a System Call
As we have seen, the system call entry is the entry_SYSCALL_64 function, defined like this:
The line that interests us is the system call execution:
Where the function do_syscall_64 is defined like this:
- 1
- Check that the system call number is valid. The value of NR_syscalls is determined at compile time.
- 2
- Clamp the index within [0..NR_syscalls - 1], the valid indices of the table.
- 3
- Execute the function present in the system call table with the specified number.
After a system call handler returns, the system call entry restores registers, flags and pushes the return address of the user process before exiting with the sysretq instruction. Then, the user program continues where it left off. Our long journey in Linux system calls is finished.
The code was tested on Ubuntu 20.04. You can recreate the same environment using a local virtual machine if you are running on a different operating system. I use Vagrant on my machine:
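When using a virtual machine, calling a system call is no different from what we have presented: the program fills the same registers and executes the same instruction, and the guest kernel handles the request. How that instruction gets executed depends on the hypervisor. With hardware-assisted virtualization, guest code generally runs directly on the host CPU; when the processors are emulated, the hypervisor is responsible for translating the guest machine code for the host architecture. Several techniques exist to handle this. A naive approach is for the hypervisor to trap system calls and delegate to the guest OS using different system calls specific to this OS and its architecture.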
What follows is a basic program written in Assembly for the amd64 architecture and executing the system calls getpid and exit. (The second is required if you don’t want your program to crash abruptly at the end.)
What we have is still a text file with assembly language instructions. It is not a format that a computer can run. Assembly language is text (source code) that must be converted into bytes (machine code). Therefore, we need to run a few commands first:
- nasm: The assembler “assembles” the instructions to machine code bytes to create an object file.
- ld: The linker turns this object file into an executable file that the operating system can run. (As we have only one object file, the linker does almost nothing but is a mandatory step.)
Let’s create the executable:
The program outputs nothing. We haven’t written code for that. We can solve this problem using a debugger to inspect registers, but first, let’s dump some information about our object file:
The result of the getpid system call will be available in the register rax once execution reaches the address 401007.
Let’s output some information about our file:
- 1
- We retrieve the initial address 0x401000 as reported previously by the command objdump.
Let’s add a breakpoint to stop after the system call execution:
Print the value of the register rax:
In a second terminal:
The output confirms that the PID of our program is 4066. We successfully executed our first system call using assembly code!
Case Studies
Go
We will use the Go programming language and show, through code, how the language makes system calls accessible to Go developers. We will still use the getpid system call as the example.
The function Getpid is implemented by the package os:
The code simply delegates to the package syscall. This package contains files implementing system calls for every supported architecture. For example, the file zsyscall_linux_amd64 provides the implementation of system calls for the amd64 architecture. Other files such as zsyscall_linux_arm64 exist in the same package. Go build constraints are used to determine which file is finally included when building the binary:
Here is the definition of Getpid for the amd64 architecture:
As visible from the code, this file was generated from the template file syscall_linux_amd64.go. Here is a snippet from this file:
This prototype definition of the getpid system call specifies the number of arguments, whether errors can be returned, and whether the current goroutine must be suspended during the execution of the system call (sys vs sysnb = suspend vs continue). The utility program mksyscall.pl, present in the same package, reads this template file to generate the implementations of system call wrappers like Getpid().
What remains to cover is the code behind this line included in the generated code:
This function rawSyscallNoError is defined like this:
In Go, a function declaration may omit the body. Such a declaration provides the signature for a function implemented outside Go, such as an assembly routine.
The code belongs in the sys/unix package. This package groups the assembly code to execute system calls. Porting Go to a new architecture/OS combination or supporting a new syscall implies changes in this package, such as updating the hand-written assembly files at asm_${GOOS}_${GOARCH}.s, which are parsed by the Go tooling to build the final code.
Here is the native implementation for Linux of the function rawSyscallNoError:
The logic should look familiar if you remember the calling conventions in the System V ABI specification.
Concerning the getpid system call, the const SYS_GETPID is defined like this:
And now, the native implementation of the getpid system call:
For comparison, here is the same code for the arm64 architecture:
That’s all. You have reviewed the standard Go code to execute a system call on Linux.
Java
Since Java 9, the process API can be used to get the current process ID.
- 1
- We grab a handle to the current process and then query the PID.
This method is implemented in the type ProcessHandleImpl like this:
The code delegates to the native function getCurrentPid0 implemented in C in ProcessHandleImpl_unix.c:
- 1
- The function getpid() is defined by glibc.
As Java native methods are implemented in C, the code can reuse existing function libraries supported by glibc that we have already covered. We are back to square one. It’s time to close this blog post.
mov rax, 60
As we have seen throughout this article, a system call is definitely not a simple function call. A lot of code runs to hand control over to the kernel, which checks that we are allowed to perform the requested action.
System calls are essential for developers. They define the capabilities of your system. For example, the epoll system call helped Nginx to solve the C10k problem by offering an event-driven I/O model, the inotify_* system calls allow react-scripts to automatically rerun your tests when you are making a code change, the sendfile system call supports the Zero-Copy optimization used by Kafka, which is one of the main reasons explaining its performance, the ptrace system call is used by debuggers like gdb to inspect your program using breakpoints, and so on. The question is, therefore, what are you going to build using these system calls? 😉
- System calls are doors to the kernel. They act like a security guard that must check your identity before executing your action. This could not be as fast as staying in the same room and executing the action yourself.
- System calls are dependent on your architecture. The ABI defines which registers and which instructions must be used on your architecture.
- System calls are accessible in the standard library of your programming language. But as modern languages are often implemented in their own language, they cannot interact directly with registers and must implement workarounds like rewriting the logic in Assembly and relying on the compiler to merge it with the rest of the compiled code.
- System calls are often implemented in a generic way. Glibc lists most system calls in metadata files and uses a build tool to generate the source files. Similar toolings exist in Go and inside the Linux kernel.