Single syscall "Hello, world" - part 1

“Hello, world” is the first program many of us write. Regardless of the programming language we are learning, it is the canonical example of a program that simply prints “Hello, world!” to the screen.

One might then ask: how complex is it really? After all, it is just a single write(2) syscall, right?

NOTE: This post refers specifically to Linux, as I will use some Linux-only tools.

The basics

For the purpose of my “Hello, world” I want to use Rust. It is a modern language suitable for low-level programming, so it surely will have much less overhead than many others. Besides that, it is just a good language.

Let’s get our “Hello, world” going:

fn main() {
    println!("Hello, world!");
}

We can now run it:

cargo run
Hello, world!

Everything works as expected! Well, that is not a big achievement, but hey, we need to be happy with small things.

To see our write(2) syscall and get it over with we will use strace, a system call tracing tool for Linux:

The -c option suppresses the regular output and prints a summary of the system calls at the end of the trace. If you also want to see the individual system calls with their arguments as they occur, use -C instead.

strace -c ./target/debug/hello-world
Hello, world!
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
  0.00    0.000000           0         5           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0         4           close
  0.00    0.000000           0         1           poll
  0.00    0.000000           0        13           mmap
  0.00    0.000000           0         5           mprotect
  0.00    0.000000           0         2           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         5           rt_sigaction
  0.00    0.000000           0         2           pread64
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         3           sigaltstack
  0.00    0.000000           0         2         1 arch_prctl
  0.00    0.000000           0         1           sched_getaffinity
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         4           openat
  0.00    0.000000           0         4           newfstatat
  0.00    0.000000           0         1           set_robust_list
  0.00    0.000000           0         2           prlimit64
  0.00    0.000000           0         1           getrandom
  0.00    0.000000           0         1           rseq
------ ----------- ----------- --------- --------- ------------------
100.00    0.000000           0        63         2 total

That is quite a bit more than one might have expected… This raises the question: does a simple “Hello, world” need to do all of this? We should certainly do something about it.

Tracing complexity

Let’s start by looking at those syscalls a bit closer and see if we can get an idea of what is going on:

strace ./target/debug/hello-world

The output is pretty verbose, so I trimmed it down to the relevant pieces:

...
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=44627, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 44627, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f9c5c9ae000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 832) = 832
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=571848, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f9c5c9ac000
mmap(NULL, 127304, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f9c5c98c000
mmap(0x7f9c5c98f000, 94208, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3000) = 0x7f9c5c98f000
mmap(0x7f9c5c9a6000, 16384, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1a000) = 0x7f9c5c9a6000
mmap(0x7f9c5c9aa000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1d000) = 0x7f9c5c9aa000
close(3)                                = 0
...
openat(AT_FDCWD, "/usr/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P4\2\0\0\0\0\0"..., 832) = 832
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
newfstatat(3, "", {st_mode=S_IFREG|0755, st_size=1953472, ...}, AT_EMPTY_PATH) = 0
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
mmap(NULL, 1994384, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f199ee8c000
mmap(0x7f199eeae000, 1421312, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x22000) = 0x7f199eeae000
mmap(0x7f199f009000, 356352, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17d000) = 0x7f199f009000
mmap(0x7f199f060000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1d4000) = 0x7f199f060000
mmap(0x7f199f066000, 52880, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f199f066000
close(3)
...

We can see a lot of openats, newfstatats, mmaps, reads, and closes, and most of them refer to some dynamic shared object. In the above we can see: ld.so.cache, libgcc_s.so.1, libc.so.6.

libgcc_s.so.1 and libc.so.6 are standard shared libraries, and ld.so.cache is basically a cache built by ldconfig. I was not really familiar with ld.so.preload, however, which, judging by our system calls, was not loaded successfully:

access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)

After a quick search, it turns out it works much like the LD_PRELOAD environment variable: it allows the user to specify ELF shared objects that are loaded before all others. And indeed we can see it was accessed first, but since I do not have this file on my system, the result was ... = -1 ENOENT (No such file or directory).

We can tell that many of those syscalls refer to the same file by looking at the file descriptor, which is the return value of the openat(2) syscall:

openat(AT_FDCWD, "/usr/lib/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 3

In this case, the file descriptor is 3. We can see 3 being passed to the syscalls that follow, and if we consult the man pages for those, we can verify that this argument is indeed expected to be a file descriptor (fd):

read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 832) = 832
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=571848, ...}, AT_EMPTY_PATH) = 0
...
mmap(NULL, 127304, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f9c5c98c000
...

Since those are dynamic libraries, our code never touches their files explicitly (it only calls their functions); making them available is the job of the dynamic linker. This makes sense, since most Rust targets are linked dynamically by default. If we inspect our binary with file, we can see it for ourselves:

file ./target/debug/hello-world
./target/debug/hello-world: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=54d56ea3e059ced4d3b8cc088c409da6411264af, for GNU/Linux 4.4.0, with debug_info, not stripped

In simple terms, “dynamically linked” means that shared libraries are loaded into memory, and their sections are mapped, after the process is started.

Running ldd on our binary shows some of the same files we saw in the strace output:

ldd ./target/debug/hello-world
	linux-vdso.so.1 (0x00007ffc75f26000)
	libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007fdc33996000)
	libc.so.6 => /usr/lib/libc.so.6 (0x00007fdc337af000)
	/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007fdc33a13000)
  • We have already seen libgcc_s.so.1 and libc.so.6 being loaded in our syscalls.
  • vDSO in linux-vdso.so.1 stands for virtual dynamic shared object, and it is used to speed up some system calls.
  • The last one, /usr/lib64/ld-linux-x86-64.so.2, is the dynamic linker itself. You can see it for yourself by trying to run it:
    /usr/lib64/ld-linux-x86-64.so.2 --help | head -n 4
    
    Usage: /usr/lib64/ld-linux-x86-64.so.2 [OPTION]... EXECUTABLE-FILE [ARGS-FOR-PROGRAM...]
    You have invoked 'ld.so', the program interpreter for dynamically-linked
    ELF programs.  Usually, the program interpreter is invoked automatically
    when a dynamically-linked executable is started.
    

What all of this means for our problem is that before our code that simply prints “Hello, world!” even runs, the dynamic linker does all this magic: it opens and memory-maps every dependency, and so on.

While dynamic linking is great, it sounds like way too much work for a simple “Hello, world!”. Let’s try to cut it out…

Eliminating linker

Now that we have identified the first suspect bloating our strace output, we can eliminate it.

From the same Rust docs linked above we can read that it is possible to link Rust with the C runtime (crt) statically using the crt-static target feature. We can pass it to the compiler using RUSTFLAGS:

RUSTFLAGS="-C target-feature=+crt-static" cargo build

Let’s check our improvements in action:

strace -c ./target/debug/hello-world
Hello, world!
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
  0.00    0.000000           0         2           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0         1           close
  0.00    0.000000           0         1           poll
  0.00    0.000000           0         1           mmap
  0.00    0.000000           0         2           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         5           brk
  0.00    0.000000           0         5           rt_sigaction
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           readlink
  0.00    0.000000           0         3           sigaltstack
  0.00    0.000000           0         2         1 arch_prctl
  0.00    0.000000           0         1           sched_getaffinity
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           openat
  0.00    0.000000           0         1           newfstatat
  0.00    0.000000           0         1           set_robust_list
  0.00    0.000000           0         2           prlimit64
  0.00    0.000000           0         1           getrandom
  0.00    0.000000           0         1           rseq
------ ----------- ----------- --------- --------- ------------------
100.00    0.000000           0        35         1 total

This is significantly better as we dropped from 63 to 35 syscalls, but that is still way more than we need. We can however confirm that our binary is now linked statically:

ldd ./target/debug/hello-world
	statically linked

An alternative way of building a statically linked binary is to use musl libc instead of glibc. musl was designed with static linking in mind, so it is worth giving it a shot. We can do that by specifying the x86_64-unknown-linux-musl target. We no longer need to pass RUSTFLAGS, as static linking is the default behavior for the musl target:

cargo build --target x86_64-unknown-linux-musl && strace -c ./target/x86_64-unknown-linux-musl/debug/hello-world
Hello, world!
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0         1           write
  0.00    0.000000           0         1           poll
  0.00    0.000000           0         1           mmap
  0.00    0.000000           0         1           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         2           brk
  0.00    0.000000           0         5           rt_sigaction
  0.00    0.000000           0         3           rt_sigprocmask
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         3           sigaltstack
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           set_tid_address
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000           0        21           total

Mind the different binary path in the target directory!

We dropped a few more syscalls. It is pretty hard to tell “why” without diving into the actual source code of both glibc and musl. They are completely different implementations of libc, so as long as the function interfaces are preserved, each is free to handle things differently.

Coming back to our task, however, we are still quite far from the goal. Perhaps it is Rust that is at fault here? Maybe it was not a good choice after all…

Descending into C

There sometimes comes a time when you have to abandon your ideals and just get the job done. That time is now. To verify whether it is the Rust runtime causing all those syscalls, we can try to write the same program in good old C:

#include <stdio.h>

int main() {
    printf("Hello, world!\n");
}

Wasn’t too bad… Since we already identified musl as a good candidate for static linking, we can build it with musl-gcc (a wrapper for gcc that links against musl):

musl-gcc -static main.c && ./a.out
Hello, world!

Let’s see how it does:

strace -c ./a.out
Hello, world!
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0         1           ioctl
  0.00    0.000000           0         1           writev
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           set_tid_address
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000           0         5           total

Now, that gets us much closer to what we want.

You may have noticed that the write syscall was replaced with writev(2). writev is simply a different version of write that allows writing multiple buffers at once (known as vectored I/O). If we check the actual arguments passed to the syscall:

strace -e trace=writev ./a.out

The -e option lets us specify an expression that controls which events to trace and how to trace them. In our case, we want to trace only the writev syscall.

writev(1, [{iov_base="Hello, world!", iov_len=13}, {iov_base="\n", iov_len=1}], 2Hello, world!
) = 14
+++ exited with 0 +++

We can see that our string was split into two buffers, one for "Hello, world!" and another for the new line "\n".

Why? Well, it is complicated… The syscall itself comes from somewhere around here. If we are adventurous enough and go up the stack, we can find printf_core, which is called by vfprintf, which can (indirectly) take us back to printf itself…

There seems to be really a lot of code until we get to the actual syscall… I am sure it is all justified and so on, but for us, it sounds like a lot of unnecessary complexity.

Fortunately, we can just use the syscall directly, bypassing all that magic:

#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    syscall(SYS_write, 1, "Hello, world!\n", 14);
}

We pass SYS_write as the first argument to syscall. It is nothing more than a constant that represents the syscall number (1 for write on x86-64). The rest of the arguments are syscall-specific, and as described in the man page for write, those are:

  • file descriptor (1 for stdout)
  • buffer
  • number of bytes to write

Let’s run it:

musl-gcc -static main.c && strace -c ./a.out
Hello, world!
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0         1           write
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           set_tid_address
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000           0         4           total

printf was hiding another syscall (ioctl) from us; with it gone, we are back to a simple write.

This brings us down to four system calls remaining. It still sounds like more than necessary. As there are no more obvious things to chop off, it might be time to put down our axe and approach it with a bit more precision.

Last syscalls standing

Let’s start with an easy one. We cannot really get rid of execve, as something (in this case strace) needs to actually execute our program. So even though we see it in the strace output, it is “not really” our “Hello, world!” program that calls it.

When running strace, the process forks and starts tracing the child process. The child then needs to execute the program we passed as an argument (the a.out binary in the case of our C program), and to do that it calls the execve syscall. We can take a peek at that by stracing strace itself:

strace strace -c ./a.out

The output is a bit messy, but if we zoom in on important parts, we can see the clone syscall, which is used to create a new process, followed by ptrace with PTRACE_SEIZE argument as __ptrace_request, which attaches to the process with a pid that we got as a result of clone:

...
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fb11c716550) = 4031
...
ptrace(PTRACE_SEIZE, 4031, NULL, PTRACE_O_TRACESYSGOOD|PTRACE_O_TRACEEXEC|PTRACE_O_TRACEEXIT) = 0
...

Only after that will the child process run our program with the execve syscall, which we can see by stracing our binary:

strace -e trace=execve ./a.out
execve("./a.out", ["./a.out"], 0x7ffd96e60d30 /* 31 vars */) = 0
Hello, world!
+++ exited with 0 +++

We now know we cannot live without the execve, but what about arch_prctl and set_tid_address, then?

To the best of what I have found, those are responsible for setting up thread local storage (TLS). As we can read in man pages for arch_prctl:

arch_prctl - set architecture-specific thread state

Digging a bit more, this means interfacing with the FS (and GS) segment registers. FS in particular is used for TLS: its base address points at the per-thread context, and traditionally it could not be set directly from user space, hence the syscall.

Another syscall related to threading is set_tid_address (“set pointer to thread ID”). I did not find great sources on this one, but from reading the man page we can try to reason about it. set_tid_address will set the clear_child_tid attribute of the given thread to the address specified by the system call. And as the name (clear_child_tid) suggests, when the thread terminates, the value at the address will be set to 0, or in other words, it will be cleared.

Why is it useful? Again, as per the man page, if applicable the kernel will then perform:

futex(clear_child_tid, FUTEX_WAKE, 1, NULL, NULL, 0);

which can be thought of as releasing the lock of a given memory location and waking up a single thread that is waiting on it. This does not happen for our program since we only have a single thread, so there is nothing to wake up.

If you are familiar with Go, this sounds similar to sync.Cond.

Okay, we have a better idea of what those system calls do, and we can reasonably suspect that they come from somewhere in libc (musl). At the same time, neither of them is necessary for simply printing Hello, world!. If only we could get rid of libc…

Look! There is one more door in this dark basement, and it leads to an even darker place…

Assembly

There is one language that we can “easily” reach for to write the “Hello, world!” in without all that overhead – Assembly. We will use 64-bit x86 assembly as this is the machine I am running on. So… brace yourself and create hello.asm:

section .text
   global _start

_start:
  mov rax, 1        ; write syscall number to rax - 1 is write
  mov rdi, 1        ; 1 is stdout file descriptor
  mov rsi, msg      ; msg is our "Hello, world!\n" defined in .rodata section
  mov rdx, msglen   ; specify message length in rdx - 14 bytes, newline included
  syscall           ; execute syscall

  mov rax, 60       ; write syscall number to rax - 60 is exit
  mov rdi, 0        ; write program exit code to rdi
  syscall           ; execute syscall

section .rodata
  msg: db "Hello, world!", 10
  msglen: equ $ - msg

Well, we got through it. The code is pretty simple, and if you do not speak assembly (do not worry, neither do I), the comments on the right side should give you an idea of what is going on.

You may have noticed that we actually call syscall twice, which is not exactly what we wanted. However, the second call is just an exit syscall. Technically we could get rid of it and we would still get our Hello, world! printed on the screen.

The catch here is that the CPU would not know that our program is finished, and would try to run the next instruction, which is not there. So this would cause the CPU to try to read some memory that is not accessible by our program, and result in an error beloved by all C programmers:

Hello, world!
[1]    2529 segmentation fault (core dumped)  ./hello

Let’s be nice to our CPU, accept the exit syscall as necessary, and not count it toward our “one syscall” goal. As we have seen before, strace -c will not show it in the summary anyway.

In most cases one would likely prefer to use exit_group(2) syscall instead of exit(2), as it exits all threads in the process. That is what most (if not all) exit functions in different standard libraries do. You can see that is what our previous programs (both Rust and C) did by running strace.

For this case exit is completely sufficient.

With that in mind, we can assemble the program with the nasm assembler and link it using ld:

nasm -f elf64 hello.asm
ld -static -o hello hello.o

and feed it to strace:

strace -c ./hello
Hello, world!
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0         1           write
  0.00    0.000000           0         1           execve
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000           0         2           total

And there we have it, “Hello, world!” stripped down to a single syscall! Doesn’t victory taste sweet? If only not for this smell of Assembly everywhere, and a touch of C flashbacks… And yeah, I know, I know, it was supposed to be in Rust…

Okay, fine, let’s look at the positives… At least now we have a chance to rewrite it in Rust…

We are going to embark on that journey in the next part.