- How to create an archive manually.
  - How to create a simple package to install a Go binary using the `dpkg` command.
  - What is the format of the archive. How to check its content.
  - How to create the same package using standard Unix tools.
  - Case study: How to create a package in Go.
- What happens when you install a package using `dpkg`.
  - What the database contains.
  - How files are copied to the host.
  - What changed in the database.
  - How to check that the package has been installed.
  - Case study: How to install a package like `dpkg` in Go.
- What happens when you install a package using `apt`.
  - How the command `apt` knows where to search for packages.
  - What is the format of a repository.
  - What the command `apt update` does.
  - How `apt` uses `dpkg` under the hood.
  - Case study: How to install a package like `apt` in Go.
A Linux package is a bundle of files that your package manager knows how to unpack on your system. Installing packages is something you do regularly, and I suggest we look under the hood to understand the steps between the creation and the installation of a Linux package.
I assume you have already installed many Linux packages. A basic comprehension of the languages C and C++ is required and being familiar with the Go language will be helpful to follow the case studies.
Table of Contents
- How to create an archive manually.
  - What you need to know about the Debian package format, the `dpkg` command, the DEB822 format.
  - The command `dpkg --build`.
  - The implementation in Go.
- What happens when you install a package using `dpkg`.
  - What you need to know about conffiles, the Dpkg database.
  - The command `dpkg -i`.
  - The implementation in Go.
- What happens when you install a package using `apt`.
  - What you need to know about `apt`, `apt-get`, `aptitude`, configuration files, configuration options, source lists, repositories, diffs, `/var/cache/apt/`, `/var/lib/apt/`, cache files.
  - The commands `apt update`, `apt list`, and `apt install`.
  - The implementation in Go.
The repositories `dpkg` and `apt` contain more than 100,000 lines of code.
When trying to explain how code works, there is a tough balance to find between showing the code untouched and simplifying it at the risk of distorting it. In this post, I decided to use both approaches. I present the original code slightly annotated, removing only debug messages and the support of command flags not covered in this article. I also present a minimal rewrite of these programs in Go, richly commented. Overall, that represents a lot of code, but as developers, we are used to skimming over large codebases, and I hope you will find your way.
In addition, there are many asides to explain some Dpkg and Apt features that you can safely skip if you are already familiar with the tools.
Please remember that if you find the post too long to read, just imagine how long it took to write 😁. Happy reading!
How to create a package manually
Linux packages are commonly distributed as `.deb` or `.rpm` files, because there are two main families of Linux distributions, Debian and Red Hat, and each one has its own file format:
- The `.deb` files are meant for distributions of Linux that derive from Debian (Ubuntu, Linux Mint, etc.).
- The `.rpm` files (for Red Hat Package Manager) are used primarily by distributions that derive from Red Hat (Fedora, CentOS, RHEL).
Both package formats have a lot in common and we will only discuss Debian packages in this document. The following table summarizes the main differences between the archive files.
Feature | .rpm | .deb |
---|---|---|
Archive Format | Uses the cpio command and file format | Uses the ar command and file format |
Package Manager | rpm (1997, written in C) | dpkg (1993, written in C) |
Frontend Package Manager | yum (2011, written in Python) | apt (1999, written in C++) |
Database | /var/lib/rpm | /var/lib/dpkg |
Database Format | Berkeley DB files | DEB 822 flat files |
A package is a collection of files to distribute applications or libraries via the Debian package management system. The aim of packaging is to allow the automation of installing, upgrading, configuring, and removing computer programs in a consistent manner.
A `.deb` file is an `ar` archive. The `ar` command is an ancestor of the common `tar` command and was already present in the first Unix version in 1971! Now, this command is (mostly) only used by Debian packages. This archive contains 3 files:
- `debian-binary`: A text file containing `2.0\n`. This states the version of the deb file format. For 2.0, all other lines get ignored.
- `data.tar.gz`: A `tar` archive containing all files that will be installed with their destination paths.
- `control.tar.gz`: A `tar` archive containing various files useful for the `dpkg` command to do its job: metadata about the package (`control`) including the list of required dependencies, the MD5 sums of every data file to check integrity (`md5sums`), and also maintainer scripts (ex: `postinst` for post-installation, `prerm` for pre-removal, etc.), which are executables that must be run when installing or removing a package.
Further documentation:
- 5 reasons why a Debian package is more than a simple file archive, Raphaël Hertzog
- Debian New Maintainers’ Guide, the official procedure to create a package the “Debian way”.
You can also learn more about Debian packages by installing a Debian package 😀 (the PDF is also available online):
dpkg
The project Dpkg started in 1994, at the same time the Debian package format was created, and thus the command `dpkg` works only with `.deb` binary archives. You must provide the archive as the command does not know how to retrieve it by itself. The command manages a database stored under `/var/lib/dpkg` to keep track of everything that is installed on the server, which is essential to determine what to clean up when you remove a package.
Note that the command `dpkg --build` redirects to the command `dpkg-deb --build` and the command `dpkg --list` redirects to the command `dpkg-query --list`. The code of these commands is present in the same repository in `./dpkg-deb/` and `./src/querycmd.c` respectively.
- Official Repository: https://git.dpkg.org/cgit/dpkg/dpkg.git
- GitHub Mirror: https://github.com/guillemj/dpkg
To illustrate this post, we will use the Hello World example present in the Go by example tutorial.
Our goal is to package this binary. The most popular solution to build a Debian package for a Go program is the utility `dh-golang`, but as we want to use the most basic commands to stay as close as possible to the process, we will use the standard `dpkg` command, even if that means not building a world-class Debian package.
Prerequisites
To test the packages we are going to build and install, we will use a Debian VM in order to keep your system safe. We will use Vagrant to create this server. Make sure Vagrant is installed on your system by following the installation procedure for your operating system.
There is a companion GitHub repository julien-sobczak/linux-packages-under-the-hood to this blog post. This repository is optional for this article. It mostly contains a `Vagrantfile` to start the virtual machine, the files to create various Debian versions of the package `hello`, and also the Go code that reimplements minimal versions of the `dpkg` and `apt` commands. You will find more information in the `README.md` file of this repository.
Then:
When using Vagrant, the directory containing your `Vagrantfile` is accessible inside the virtual machine under the directory `/vagrant`. We will use it to copy our `hello` binary program:
All commands whose prompt starts with `vagrant#` must be run inside the virtual machine. Otherwise, run the commands from your host.
We are ready to create a Debian package for our Hello program.
- (1) The first version of our package `hello` contains only the binary `hello` built previously and a DEB822 file `control` with the package metadata.
- (2) We also append basic maintainer scripts that display a message in the console so that we will know when the installation process runs them.
The DEB822 format can be seen as an ancestor of YAML or JSON. Here is an example showing the three supported types of fields:
The format is used by the file `control` but also by some files in the `dpkg` database such as `/var/lib/dpkg/status`. This format is also used by the command `apt`, which will be covered later.
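To make the format more concrete, here is a minimal sketch of a parser for a single DEB822 stanza. It is not the parser used by Dpkg or APT: the function name `parseStanza` and the sample field values are assumptions made for illustration, and continuation lines are simply joined with a newline whether the field is folded or multiline.

```go
package main

import (
	"fmt"
	"strings"
)

// parseStanza parses a single DEB822 stanza into a map of field name to value.
// Lines starting with a space or a tab continue the previous field, and a
// continuation line containing only "." stands for an empty line.
func parseStanza(text string) (map[string]string, error) {
	fields := make(map[string]string)
	current := ""
	for _, line := range strings.Split(text, "\n") {
		switch {
		case strings.TrimSpace(line) == "" || strings.HasPrefix(line, "#"):
			continue // blank lines and comments carry no field data in this sketch
		case line[0] == ' ' || line[0] == '\t':
			if current == "" {
				return nil, fmt.Errorf("continuation line without a field: %q", line)
			}
			cont := strings.TrimSpace(line)
			if cont == "." {
				cont = ""
			}
			fields[current] += "\n" + cont
		default:
			name, value, ok := strings.Cut(line, ":")
			if !ok {
				return nil, fmt.Errorf("malformed line: %q", line)
			}
			current = name
			fields[current] = strings.TrimSpace(value)
		}
	}
	return fields, nil
}

func main() {
	// Hypothetical control file content.
	control := "Package: hello\nVersion: 1.1-1\nDescription: Say hello\n A longer description\n .\n spanning several lines.\n"
	fields, err := parseStanza(control)
	if err != nil {
		panic(err)
	}
	fmt.Println(fields["Package"], fields["Version"])
	fmt.Printf("%q\n", fields["Description"])
}
```

A real parser would also split a file into several stanzas on blank lines, which is what a file like `/var/lib/dpkg/status` requires.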
Further documentation: Check the man page for additional information.
dpkg --build
We will use the command `dpkg --build` to build our package:
- (1) This command builds a Debian package, which, as outlined before, consists of building an `ar` archive containing two `tar` archives: the content of our directory `DEBIAN/` in `control.tar.gz` and the other files in `data.tar.gz`. We use the `fakeroot` command to make sure files inside the archive are created with the user `root`.
We can also reproduce what it does using standard Bash commands:
- (1) The package will fail most linter checks. Indeed, we ignored many of the best practices that higher-level commands enforce, but we will still be able to install this package on our server.
Now is the time to look at the code. Dpkg is written in C, and the function executed by the command `dpkg --build` is the function `do_build` in `./dpkg-deb/build.c`.
- (1) The variable `dir` is the local directory containing the package files to build. The variable `dest` is the optional filename for the final package file and `debar` is the final name as determined by the function `gen_dest_pathname`, which determines a default name if the argument is missing.
- (2) The function `dpkg_ar_create` creates the archive file named after the variable `debar`.
- (3) The function `dpkg_ar_put_magic` writes the magic number `!<arch>\n`, which identifies the file as an `ar` archive.
- (4) The function `dpkg_ar_member_put_mem` appends the file `debian-binary` with the content of the variable `deb_magic`.
- (5) The function `dpkg_ar_member_put_file` appends the file `control.tar` with the content of a temporary file.
- (6) Same as above for `data.tar`.
- (7) The function `dpkg_ar_close` is part of the housekeeping logic and simply closes the file descriptor.
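The sequence of `dpkg_ar_*` calls can be approximated in a few lines of Go. This is a minimal sketch and not the code of Dpkg: the names `putMember` and `buildDeb` are hypothetical, the member names follow the archive description given earlier in this article, and the two tarballs are assumed to be already built and passed in as byte slices.

```go
package main

import (
	"bytes"
	"fmt"
)

// putMember appends one ar member: a 60-byte header followed by the data,
// padded with a newline so that the next header starts on an even offset.
func putMember(buf *bytes.Buffer, name string, data []byte, mtime int64) {
	// Header layout: name(16) mtime(12) uid(6) gid(6) mode(8) size(10) end(2).
	fmt.Fprintf(buf, "%-16s%-12d%-6d%-6d%-8s%-10d`\n", name, mtime, 0, 0, "100644", len(data))
	buf.Write(data)
	if len(data)%2 == 1 {
		buf.WriteByte('\n')
	}
}

// buildDeb assembles the three members of a .deb file in memory.
func buildDeb(control, data []byte, mtime int64) []byte {
	var buf bytes.Buffer
	buf.WriteString("!<arch>\n") // magic number of the ar format
	putMember(&buf, "debian-binary", []byte("2.0\n"), mtime)
	putMember(&buf, "control.tar.gz", control, mtime)
	putMember(&buf, "data.tar.gz", data, mtime)
	return buf.Bytes()
}

func main() {
	// Placeholder payloads; real ones would be gzip-compressed tarballs.
	deb := buildDeb([]byte("control"), []byte("data"), 0)
	fmt.Printf("%d bytes, magic %q\n", len(deb), deb[:8])
}
```

Writing the result to a file and running `ar t` on it should list the three member names, although tools such as `dpkg -c` will only accept it once the payloads are real tarballs.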
Case Study
What follows is a minimal rewrite of this code in Go. The full code is available on GitHub in the repository julien-sobczak/linux-packages-under-the-hood.
To run the code:
To inspect the resulting archive `hello.deb`, we can use the command `dpkg -c` to view the data files or use the command `ar` to view the real content of the archive:
🎉 We have finished with the format `.deb`. This completes the first part of this article. We created a Debian package from scratch! Now, we will inspect the installation process.
What happens when you install a package using dpkg
The command to install a Debian binary package file is `dpkg -i myarchive.deb` and will be the subject of this second part.
dpkg -i
Let’s run the command on our Debian archive:
The command does a lot of interesting things and the code is larger than the previous `build` command. The man page details the installation steps and we will present the main code for every one of them.
The entry point for the installation of a package is the function `archivefiles`, and more specifically the function `process_archive`:
- (1) The main function iterates over all packages to install and delegates to the function `process_archive` for the unpacking.
- (2) The function `process_queue` configures all packages that have been unpacked in the previous step. We will explain the differences between these two steps.
Let’s go!
- Extract the control files of the new package.
- (1) Create a temporary directory (commonly `/var/lib/dpkg/tmp.ci/`).
- (2) Run the command `dpkg --control` to extract the `DEBIAN/` directory into it.
Then, the code parses the `control` file to initialize the struct `pkginfo`, which is the main structure to represent a package. (You can check the const `fieldinfos` in `parse.c` to find the mapping between the file and the struct.) Here is a minimal version of this structure with the most important fields annotated:
- (1) The enum `want` determines the expected action for this package, like `PKG_WANT_INSTALL` for installation, or `PKG_WANT_PURGE` for the removal of the package and its configuration files.
- (2) The `eflag` is initialized if the parser finds an error in the control file (ex: missing field), and also later during the installation process.
- (3) The `installed` and `available` fields contain most of the information present in the `control` files concerning a possible installed version of the package and the new version to install.
- (4) Some fields like `files` are initialized later by other functions like `db-fsys-files.c#ensure_packagefiles_available`, which reads the file `/var/lib/dpkg/info/hello.list` to populate this field.
- (5) The `status_dirty` flag is set when the current status of the package changes, for example from `PKG_STAT_UNPACKED` to `PKG_STAT_INSTALLED`.
And now, the function responsible for creating this struct:
- (1) The function `parsedb` simply reads a file in Debian RFC822 format, the format we used to write the `control` file.
- If another version of the same package was installed before the new installation, execute the `prerm` script of the old package.
- (1) The status read during parsing is reused to determine if the package is already installed.
- (2) Update the package status to record that the package has been partially installed. The status will be changed several times during the installation. The function `modstatdb_note` persists the new state to disk.
- (3) `maintscript_fallback` and `maintscript_installed` delegate to `maintscript_exec` defined in the same file `src/script.c`. This function runs the script in a forked process and aborts if the return code is greater than 0. Differences between the various calls are explained in the next step.
- Run the `preinst` script, if provided by the package.
- (1) The function `maintscript_new` is a variadic function whose trailing arguments are passed to the maintainer script to provide context. For example, the `preinst` maintainer script can be called using one of these formats: `preinst install`, `preinst install <old-version>`, or `preinst upgrade <old-version>`. This allows the package developer to take different actions based on the current state of the package.
- Unpack the new files, and at the same time back up the old files, so that if something goes wrong, they can be restored.
This step is similar to running the command `dpkg --unpack`. The unpacking process is simple to understand: extract every file present in `data.tar` to its destination path. But things are not as simple as they seem, as outlined by this comment:
We still haven’t talked about conffiles. When upgrading a package, you want the package manager to overwrite the previous version of the files, except for configuration files. You don’t want to lose your customizations, do you?
A Debian archive can therefore include a file `conffiles` in the `DEBIAN/` directory to list a subset of files present in the `data.tar` archive. These “conffiles” are files that must be managed specially to take care of preserving user changes.
Conffiles explain the difference between the commands `dpkg --remove` and `dpkg --purge`. (The first command leaves conffiles in place while the second removes them completely.)
The version 2.1-1 of our package hello defines a different version written in Python, which reads a configuration file `/etc/hello/settings.conf`, also present in the package. This conffile is referenced in `DEBIAN/conffiles`.
If we try to create this configuration file manually before installing this new version:
The package manager detects the conflict by keeping a checksum of the last installed version of every conffile (files named `md5sums` in the database) and asks the user what to do about it. Options exist to avoid the prompt and the default is, of course, to preserve existing conffiles.
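The checksum comparison can be illustrated with a three-way decision. The following Go code is a minimal sketch under stated assumptions, not a copy of the Dpkg implementation: the names `decide`, `sum`, and the `action` values are hypothetical, and the rules encode the usual reasoning (replace silently when the user never touched the file, keep the file when only the user changed it, prompt when both sides changed).

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

type action string

const (
	keepCurrent action = "keep"
	installNew  action = "install"
	askUser     action = "prompt"
)

// sum returns the hexadecimal MD5 digest of a file content.
func sum(content []byte) string {
	h := md5.Sum(content)
	return hex.EncodeToString(h[:])
}

// decide compares three checksums: the one recorded at the last installation,
// the one of the file currently on disk, and the one shipped in the new package.
func decide(recorded, onDisk, shipped string) action {
	switch {
	case onDisk == shipped:
		return keepCurrent // both versions are identical, nothing to do
	case onDisk == recorded:
		return installNew // the user never touched the file
	case shipped == recorded:
		return keepCurrent // only the user changed the file
	default:
		return askUser // both sides changed: this is a conflict
	}
}

func main() {
	shipped := sum([]byte("name = world\n"))
	onDisk := sum([]byte("name = me\n"))
	// No checksum was recorded because the file was created by hand.
	fmt.Println(decide("", onDisk, shipped))
}
```

With an empty recorded checksum and a hand-written file that differs from the shipped one, the sketch falls into the last branch, which matches the prompt shown when the configuration file is created before the installation.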
The unpacking runs the command `dpkg-deb --fsys-tarfile` to extract the content of `data.tar`. The command sends each file to a pipe created in the same function `process_archive` and delegates to the function `tarobject` defined in `archives.c`, which implements all the rules presented in the previous comment. The code is rather straightforward but too long to reproduce in this article.
We can mention that the backup process consists of extracting files with a special extension like `.dpkg-tmp`, `.dpkg-old`, and `.dpkg-new`. Files are renamed to their definitive name if no problem occurs, except for conffiles, which must wait until the last installation step to be renamed.
- If another version of the same package was installed before the new installation, execute the `postrm` script of the old package. Note that this script is executed after the `preinst` script of the new package, because new files are written at the same time old files are removed.
The execution code of the maintainer script `postrm` is similar to the previous scripts.
What is more interesting is what happens at the end of the unpacking step. Indeed, the Dpkg database is updated to reflect the changes.
Dpkg maintains a database under `/var/lib/dpkg`, which groups various files including the following:
file | description |
---|---|
/var/lib/dpkg/status | A DEB822 file containing the status information for all packages (i.e., the current state of each package and the fields in their control file). |
/var/lib/dpkg/status-old | The last backup of the /var/lib/dpkg/status file. |
/var/lib/dpkg/available | The list of packages available for installation or upgrade from external origins only if you are using dselect as your package manager frontend (instead of apt or aptitude ). See details. (not described in this article) |
/var/lib/dpkg/diversions | The list of diversions used by dpkg and set by dpkg-divert to force a package file to be installed elsewhere. (not described in this article) |
/var/lib/dpkg/statoverride | The stats used by dpkg and set by dpkg-statoverride to change the default ownership and mode of the package files. (not described in this article) |
In addition, for every installed package, Dpkg keeps a list of additional files:
file | description |
---|---|
/var/lib/dpkg/info/<package_name>.list | The list of files and directories installed by the package (the data.tar listing) |
/var/lib/dpkg/info/<package_name>.md5sums | The list of MD5 hash values for files installed by the package. Used for example to detect if a conffile had been edited by the user. |
/var/lib/dpkg/info/<package_name>.conffiles | The list of configuration files. Same as the conffiles file under DEBIAN/ |
/var/lib/dpkg/info/<package_name>.{preinst, postinst, prerm, postrm} | Copies of the maintainer scripts present in the package under DEBIAN/ . |
/var/lib/dpkg/info/<package_name>.config | Debconf-generated configuration files used only by a minority of packages. (not described in this article) |
Here are the different functions called to update the different files in the database:
- (1) Edit the file `/var/lib/dpkg/info/hello.list`.
- (2) Copy all files under `DEBIAN/` into `/var/lib/dpkg/info/` by prefixing them with the package name `hello.`.
- (3) Edit the file `/var/lib/dpkg/info/hello.md5sums`.
- (4) Update the field `Status` in `/var/lib/dpkg/status` for the package `hello` to set the value `install ok unpacked`.
We are getting close to the end of the function `process_archive`. The last instruction is `enqueue_package(pkg)`. This function simply pushes the package onto a queue of packages waiting to be configured. Since the `dpkg` command can be executed with several packages to install, the queue ensures all packages have been unpacked before proceeding to their final configuration.
We are now back to the `archivefiles` function:
- (1) We are here.
What follows is the data structure representing the queue:
- (1) The global variable containing the packages to configure.
- (2) These variables control the algorithm that decides which package must be configured first, which must be postponed, and when to abort the installation completely.
Finally, the logic to empty the queue present in the function `process_queue`:
- (1) The function `deferred_configure` is the main function doing the configuration and is the subject of the next step.
- Configure the package.
  - Unpack the conffiles, and at the same time back up the old conffiles, so that they can be restored if something goes wrong.
  - Run the `postinst` script, if provided by the package.
The last step uses the same code as the command `dpkg --configure`, which may be used to reconfigure a package that had already been unpacked.
The configuration step is implemented by the function `deferred_configure`, which focuses on a single package to configure. If the configuration cannot proceed, the package may be enqueued again to be reprocessed later. Here is a simplified version:
- (1) In case of a missing dependency, the installation will abort only at this step, after the unpacking of the package files.
- (2) The function `deferred_configure_conffile` renames the conffiles still ending with the suffix `.dpkg-new` created during the unpacking. This function also shows the confirmation prompt.
- (3) Run the `postinst` maintainer script.
- (4) Change the status to `PKG_STAT_INSTALLED` and force the update in the `status` database file.
The installation of our package is now completed. We can check the package has been installed by running the `hello` command:
Or by using the command `dpkg` to get the status of the package:
Case Study
What follows is a minimal rewrite in Go of the code covered in this second part. The full code is available on GitHub in the repository julien-sobczak/linux-packages-under-the-hood.
But first, let’s remove the package or we will not be able to test our program:
Here is the code:
Let’s test the new command:
Our package has been correctly installed. The standard `dpkg` command recognized it and can be used to remove the package like any other installed package:
🎉 We have finished with the command `dpkg`. We succeeded in creating a package manually and installed it using a basic Go program. We have a better understanding of how `dpkg` works and what information is available in its database. Now, we will have a look at the package manager frontend `apt` to understand how these programs work together to install a package.
What happens when you install a package using apt
The main reason to use `apt` is for the dependency management support. This command understands that in order to install a given package, other packages may need to be installed too, and `apt` can download and install them. In practice, `dpkg` is called a package manager and `apt` is called a frontend package manager.
`apt`, `apt-get`, `aptitude`
APT is a vast project started in 1997 organized around a core library. The command `apt-get` was the first frontend developed within the project, and `apt` is the second command provided by APT, which overcomes some design mistakes of `apt-get`, for example, `apt` refuses to install dependencies that were not installed beforehand during an upgrade. Under the hood, both tools are built on top of the core library and are thus very close.
External projects like `aptitude` were developed later to support new features like the automatic removal of packages that are no longer required, but most of these features are now available in `apt` too.
The most widespread command remains `apt`, and it is the one that we will use in this section.
- Apt (`apt` and `apt-get`) Official Repository: https://salsa.debian.org/apt-team/apt
- Apt GitHub Mirror: https://github.com/Debian/apt
- Aptitude Official Repository: https://salsa.debian.org/apt-team/aptitude
Further documentation: apt-get, aptitude, … pick the right Debian package manager for you, Raphaël Hertzog
APT makes software available to the user by doing the dirty work of downloading all the required packages and installing them using `dpkg` in the correct order to respect the dependencies. The scope of APT is larger than Dpkg and its behavior is highly configurable.
APT configuration resides under `/etc/apt/`, which contains the following files:
- `apt.conf` and `apt.conf.d/`: The main configuration files where hundreds of options are available (more about them soon). The command `apt-config dump` can be used to view all available options with their default values:
- `sources.list` and `sources.list.d/`: lists of repositories (more about them soon). Here are the default repositories on my Debian server:
- `preferences` and `preferences.d/`: APT pinning is the only available preference. By default, when multiple repositories are configured, a package can exist in several of them and APT applies logic to decide which one must be installed. Pinning allows you to change this logic (called a policy) for some packages. The command `apt-cache policy [pkg]` can be used to view the global policy when called without argument:
  You can create preferences files to favor a specific repository for a given package or to prevent this package from being upgraded. (not covered in this article)
- `trusted.gpg` and `trusted.gpg.d/`: keys for secure authentication of packages (known as “Secure APT” and used in Debian since 2005). The command `apt-key` can be used to show the keys, and to add or remove a key. APT uses public-key (asymmetric) cryptography using GPG:
  When installing a package, APT retrieves the package from an external repository and the `Release` file, which is the entry file to find `Packages` index files, may have been altered (which means checking the MD5 sums inside these index files is useless if we can’t guarantee that the `Release` file is safe against a man-in-the-middle attack). This is the goal of secure APT. Concretely, secure APT always downloads a `Release.gpg` file if existing before downloading a `Release` file. (NB: The file `InRelease` has now merged the intent of these two deprecated files.) Using cryptography, APT can be sure that the file is safe and can trust the MD5 sums present inside it to check other files like `Packages` files. Otherwise, APT will complain with the following message you have probably encountered before:
- `auth.conf` and `auth.conf.d/`: APT configuration and repositories list must be accessible to any user on the system but some repositories may require login information to connect, which are stored in these restrictive files. For example, instead of specifying the user/password `apt:debian` in the source list file directly (`deb https://apt:[email protected]/debian buster main`), you can create an entry in `auth.conf`:
  (not covered in this article)
- `listchanges.conf` and `listchanges.conf.d`: Only used by the command `apt-listchanges` to show what has been changed in a new version of a Debian package, as compared to the version currently installed on the system. It does this by extracting the relevant entries from both the `NEWS.Debian` and `changelog[.Debian]` files, usually found in `/usr/share/doc/<package>` in Debian package archives. (not covered in this article)
In practice, `.d` directories are preferred so that the configuration can be split into several files. The single files may not even exist on your machine and are often deprecated.
Further documentation: APT configuration, Secure APT.
Now is the time to start looking at the code again. APT is written in C++. The entry point for any APT command is the file `cmdline/apt.cc`, which contains a function `GetCommands()` that maps each command to a function defined in the directory `apt-private/`, which delegates to other functions in the main APT lib present in the directory `apt-pkg/` (i.e., cmdline/ -> apt-private/ -> apt-pkg/):
Before invoking the command function, APT simply initializes a few classes like `pkgSystem` to set the default configuration options.
Unlike Dpkg, APT is highly configurable using the files `/etc/apt/apt.conf` and `/etc/apt/apt.conf.d/`. The format is similar to some Linux tools like `bind` or `dhcp`.
The configuration file is organized as a tree of functional groups. For instance, `APT::Get::Assume-Yes` is an option within the `APT` tool group, for the `Get` tool. A new scope can be opened with curly braces, like this:
You can retrieve the full list of options using the command `apt-config`:
Inside the code, the configuration is accessible using the class `Configuration` (defined in `apt-pkg/contrib/configuration.h`):
Further documentation: man page
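To give an idea of how such a tree of options can be queried, here is a minimal Go sketch that stores options in a flat map keyed by their `::`-separated path. It is not the APT API: the method names `Set`, `Find`, and `FindB` are only modeled on the idea of a typed lookup with a default value, the treatment of names as case-insensitive is an assumption of this sketch, and the parsing of the configuration files themselves is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// Configuration is a flat view of the option tree keyed by "::"-separated paths.
type Configuration map[string]string

// Set stores a value; option names are case-insensitive in this sketch.
func (c Configuration) Set(name, value string) {
	c[strings.ToLower(name)] = value
}

// Find returns the value of an option, or def when the option is not set.
func (c Configuration) Find(name, def string) string {
	if v, ok := c[strings.ToLower(name)]; ok {
		return v
	}
	return def
}

// FindB interprets the value of an option as a boolean.
func (c Configuration) FindB(name string, def bool) bool {
	switch strings.ToLower(c.Find(name, "")) {
	case "true", "yes", "1", "on":
		return true
	case "false", "no", "0", "off":
		return false
	}
	return def
}

func main() {
	conf := Configuration{}
	conf.Set("APT::Get::Assume-Yes", "true")
	fmt.Println(conf.FindB("APT::Get::Assume-Yes", false))
	fmt.Println(conf.Find("APT::Architecture", "amd64")) // falls back to the given default
}
```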
apt update
Here is the entry point when running the command `apt update`:
The command is divided in four steps that we will cover separately:
- Read the `sources.list` and `sources.list.d/*` files.
Apt downloads packages from one or more software repositories, which are often remote servers. The precise list of repositories is determined by the file `/etc/apt/sources.list` and the ones inside `/etc/apt/sources.list.d`. Two formats are supported: one source per line (the widespread one-line style) or multiline stanzas defining one or more sources per stanza (the newer deb822 style).
Example using the old format:
Example using the new format:
We will ignore the new DEB 822 format in this article.
Further documentation: man 5 sources.list
The class `pkgSourceList` represents the list of configured sources and is defined like this:
The list is initialized by the method `BuildSourceList()`:
The method `ReadMainList()` is used to read the sources.list files:
- (1) The `Read*` methods parse the sources files. We omit the parsing code for brevity but both parsers push a new instance of `debReleaseIndex` in the `SrcList`.
- Fetch index files from each repository (`InRelease`, `Packages`, …).
- (1) `AcqTextStatus` is used to report the progress of the file downloads.
A repository is a set of Debian binary or source packages organized in a special directory tree along various additional files—checksums, signatures, translations, … APT downloads some of these files to install a package on your system.
Ex: `deb https://deb.debian.org/debian stable main contrib non-free`
- `deb` is used for binary packages, `deb-src` for source packages.
- `https://deb.debian.org/debian` specifies the root of the repository.
- `stable` is the distribution, which is commonly a suite (`stable`, `oldstable`, `testing`, `unstable`), which is an alias for a Debian codename (`wheezy`, `jessie`, `stretch`), which is based on Toy Story characters.
- `main contrib non-free` are the three component types and indicate the licensing terms of the software they contain.
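The structure of such a line is simple enough to be parsed with a few lines of Go. This is a minimal sketch, not the parser of APT: the type `source` and the function `parseSourceLine` are hypothetical, and the optional bracketed options that may follow the type are rejected instead of being parsed.

```go
package main

import (
	"fmt"
	"strings"
)

// source is a hypothetical type holding the fields of a one-line entry.
type source struct {
	Type       string   // "deb" or "deb-src"
	URI        string   // root of the repository
	Suite      string   // distribution, e.g. "stable"
	Components []string // e.g. main, contrib, non-free
}

// parseSourceLine parses one line written in the one-line style. ok is false
// for blank lines and comments.
func parseSourceLine(line string) (src source, ok bool, err error) {
	if i := strings.Index(line, "#"); i >= 0 {
		line = line[:i] // drop a trailing comment
	}
	fields := strings.Fields(line)
	if len(fields) == 0 {
		return source{}, false, nil
	}
	if fields[0] != "deb" && fields[0] != "deb-src" {
		return source{}, false, fmt.Errorf("unknown type %q", fields[0])
	}
	if len(fields) < 3 {
		return source{}, false, fmt.Errorf("expected: type uri suite [component...]")
	}
	if strings.HasPrefix(fields[1], "[") {
		return source{}, false, fmt.Errorf("options are not supported in this sketch")
	}
	return source{fields[0], fields[1], fields[2], fields[3:]}, true, nil
}

func main() {
	src, ok, err := parseSourceLine("deb https://deb.debian.org/debian stable main contrib non-free")
	fmt.Printf("%+v %v %v\n", src, ok, err)
}
```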
Here is a preview of files tree for this repository:
And now the explanations.
The root directory contains a directory `dists/`, which in turn has a directory for each release and suite, the latter usually being symlinks to the former. Each release subdirectory contains a signed `Release` file and a directory for each component. Inside these are directories for the different architectures, named `binary-<arch>` and `sources`. And in these are files `Packages` and `Sources` that are text files (in DEB 822 format and often compressed) containing the metadata of available packages.
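Given this layout, the location of the index files can be derived from a source entry. The following Go function is a minimal sketch: the name `indexURLs` is hypothetical, it only covers binary packages, and it assumes that the `Packages` file is published with the `.xz` compression, while a repository may offer other variants.

```go
package main

import (
	"fmt"
	"strings"
)

// indexURLs returns the URL of the InRelease file of a suite, followed by the
// URL of the Packages index of every component for the given architecture.
func indexURLs(root, suite string, components []string, arch string) []string {
	base := strings.TrimRight(root, "/") + "/dists/" + suite
	urls := []string{base + "/InRelease"}
	for _, c := range components {
		urls = append(urls, fmt.Sprintf("%s/%s/binary-%s/Packages.xz", base, c, arch))
	}
	return urls
}

func main() {
	for _, u := range indexURLs("https://deb.debian.org/debian", "stable", []string{"main", "contrib"}, "amd64") {
		fmt.Println(u)
	}
}
```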
Example of a `Packages` file:
Example of a `Sources` file:
But still no `.deb` packages… We need to move to another directory at the root of the repository to find them:
The directory `pool/` has a directory for each component, and in these are directories named `0`, …, `9`, `a`, …, `z`, `liba`, …, `libz`. And in these are directories named after the software packages they contain, and these directories finally contain the actual packages, i.e. the `.deb` files.
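The naming rule of the `pool/` directory can be expressed as a short Go function. This is a sketch based on the layout described above: the function `poolDir` is hypothetical and expects the name of the source package, since the subdirectories are named after it.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// poolDir returns the directory of a package inside pool/: the first letter of
// the name is used as prefix, or the first four characters for "lib*" packages.
func poolDir(component, sourcePkg string) string {
	if sourcePkg == "" {
		return ""
	}
	prefix := sourcePkg[:1]
	if strings.HasPrefix(sourcePkg, "lib") && len(sourcePkg) > 3 {
		prefix = sourcePkg[:4]
	}
	return path.Join("pool", component, prefix, sourcePkg)
}

func main() {
	fmt.Println(poolDir("main", "hello"))
	fmt.Println(poolDir("main", "libxml2"))
}
```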
Notes:
- The “single letter” directories are just a trick to avoid having too many entries in a single directory, which is what many systems traditionally have performance problems with.
- The `pool/` directory avoids file duplication as binary and source packages are stored only once even if used by many releases under `dists/`.
- `Packages` and `Sources` index files are control files using a format similar to the one used in the first part of this article when creating our Debian archive package, with a special field `Filename` and `Directory` respectively, to link to the `pool/` directory.
- `Release` is an index file in the DEB822 format but containing only a single document whose field names refer to the repository, such as `Origin`, `Suite`, `Codename`, `Architectures` (plural), and `Components`, and whose field `MD5Sum` contains the checksums for all files in this repository.
Further documentation: Debian Repository and the more complete Repository Format
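To show how an index entry links back to pool/, here is a small Go sketch that parses a single DEB822 paragraph and reads its Filename field. The parser is deliberately minimal (no comments, no signature handling) and the package values in the example are invented.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseStanza reads the first DEB822 paragraph of text into a map.
// Continuation lines (starting with a space or a tab) are appended
// to the previous field, separated by a newline.
func parseStanza(text string) map[string]string {
	fields := make(map[string]string)
	last := ""
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := sc.Text()
		if line == "" {
			if len(fields) == 0 {
				continue // skip leading blank lines
			}
			break // a blank line ends the paragraph
		}
		if (line[0] == ' ' || line[0] == '\t') && last != "" {
			fields[last] += "\n" + strings.TrimSpace(line)
			continue
		}
		key, value, ok := strings.Cut(line, ":")
		if !ok {
			continue // not a "Field: value" line, ignored in this sketch
		}
		last = key
		fields[key] = strings.TrimSpace(value)
	}
	return fields
}

func main() {
	// Illustrative entry, not copied from a real repository.
	entry := "Package: hello\n" +
		"Version: 1.0\n" +
		"Filename: pool/main/h/hello/hello_1.0_amd64.deb\n" +
		"Description: example package\n" +
		" A longer description on a continuation line.\n"
	fields := parseStanza(entry)
	fmt.Println(fields["Package"], "->", fields["Filename"])
}
```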
Here is the function ListUpdate that actively downloads index files from the repositories:
- 1
- The class pkgAcquire is the main component of the Acquire subsystem. APT is responsible for retrieving the packages from various sources, mainly remote repositories through HTTP, and the Acquire system is responsible for fetching every Item required by APT in the most efficient way. For example, it uses a pool of workers to speed up downloads and is able to check for diff files before downloading full index files.
- 2
- Most APT commands try to acquire a lock to prevent two processes using the APT library from running at the same time. The lock file is /var/lib/apt/lists/lock, but other lock files exist, for example to update the APT cache.
- 3
- The method GetIndexes() creates new items to download InRelease files using the Acquire system.
- 4
- The function AcquireUpdate() collects the results from the Fetcher and updates the cache.
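The idea of a pool of workers fetching items can be sketched in a few lines of Go. This is not how APT is structured internally; it only illustrates the technique. The function fetchAll and the injected fetch callback are names chosen for this example, and a real program would perform an HTTP download in the callback.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchAll calls fetch for every item using a fixed number of workers
// and returns the error (or nil) obtained for each item.
func fetchAll(items []string, workers int, fetch func(string) error) map[string]error {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan string)
	results := make(map[string]error, len(items))
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for item := range jobs {
				err := fetch(item)
				mu.Lock()
				results[item] = err
				mu.Unlock()
			}
		}()
	}
	for _, item := range items {
		jobs <- item
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	items := []string{"InRelease", "main/binary-amd64/Packages.xz", "main/i18n/Translation-en.xz"}
	// The callback is a stand-in for a real download.
	results := fetchAll(items, 2, func(item string) error {
		fmt.Println("fetching", item)
		return nil
	})
	fmt.Println(len(results), "items fetched")
}
```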
Packages files (and also some other index files present in a Debian repository) can be relatively large. For example, the compressed Packages.xz file for the architecture amd64 and the component main of the stable Debian repository weighs 8 MB. These files are typically retrieved every time you run the command apt update, which is wasteful when only a few entries have changed, and APT provides a solution to this problem.
Indeed, a Debian repository can contain diff files (whose content is similar to the output of the command diff) alongside the standard files like Packages:
The apt command will try to retrieve these files and apply successive diffs on top of its local index file.
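Choosing which diffs to apply can be sketched as follows. The sketch assumes a simplified view of the diff index in which each history entry pairs the hash of an older index file with the name of the patch that upgrades it to the next version. The type patch and the function patchesToApply are hypothetical, and applying the patches themselves is left out.

```go
package main

import "fmt"

// patch is one entry of the (simplified) diff history.
type patch struct {
	FromHash string // hash of the index file this patch applies to
	Name     string // name of the patch file
}

// patchesToApply returns the ordered list of patches needed to bring a
// local index file with hash local up to the version with hash current.
// ok is false when the local file is too old (or unknown), in which case
// the caller has to download the full index file instead.
func patchesToApply(history []patch, current, local string) (names []string, ok bool) {
	if local == current {
		return nil, true // already up to date
	}
	for i, p := range history {
		if p.FromHash == local {
			for _, q := range history[i:] {
				names = append(names, q.Name)
			}
			return names, true
		}
	}
	return nil, false
}

func main() {
	// Hashes and names are placeholders.
	history := []patch{{"h1", "patch-1"}, {"h2", "patch-2"}, {"h3", "patch-3"}}
	names, ok := patchesToApply(history, "h4", "h2")
	fmt.Println(names, ok)
}
```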
- Read the package lists and build the dependency tree.
/var/cache/apt/
This directory stores the latest version of the APT cache, used to speed up the execution of most commands:
The APT Cache files under this directory (except the lock file) can be safely deleted using the command apt clean to reclaim disk space:
APT is highly configurable and there are several options to clean the cache regularly, for example after every package installation.
/var/lib/apt/
This directory stores the current state of APT, that is, which packages have been installed, what is the latest version of the retrieved index files used when updating the cache, etc.
This directory doesn’t have to be edited like /etc/apt/ and doesn’t have to be cleaned like /var/cache/apt/. It can be safely ignored by the APT user, but we will still have to talk about it in this article.
The method pkgCacheFile::BuildCaches() calls the method BuildSourceList() we covered in the previous step, and then delegates to the method pkgCacheGenerator::MakeStatusCache() for the actual cache initialization:
- 1
- The cache is stored in /var/cache/apt/pkgcache.bin and /var/cache/apt/srcpkgcache.bin. These are binary files that are loaded in memory.
- 2
- The method CheckValidity loads each cache file in memory and checks that it is up to date, by verifying that every required index file for every source exists.
- 3
- If both cache files are correct, we can return immediately. Otherwise, we need to rebuild from scratch the ones that are not valid.
The APT Cache files are two binary files /var/cache/apt/pkgcache.bin and /var/cache/apt/srcpkgcache.bin.
Basically, these cache files contain all index files (InRelease, Packages, Sources, and Translations) retrieved from the APT repositories present in the list of sources (/etc/apt/sources.list and /etc/apt/sources.list.d/). The only difference between these two files is that the file pkgcache.bin also includes the content of /var/lib/dpkg/status.
Therefore, every time a new index file is retrieved by APT or when the Dpkg status file changes, the APT cache must be updated too.
The format of the cache files is optimized for the sole usage of APT, and the main motivations are to speed up the loading of the cache in memory and to reduce memory usage. Therefore, the cache uses a binary format, which means you cannot read the files using your text editor. For example, Header is the first struct copied and starts like this:
Field names are logically omitted and only values (sometimes converted to enums, like the status string installed that becomes 6 in the binary file) are appended in successive order, as confirmed by the command xxd, which dumps a file in hexadecimal:
When APT is launched, these two files are loaded in memory using the mmap() system call and the rest of the code interacts with an instance of the class pkgCache and another of the class pkgDepCache. In fact, pkgDepCache wraps pkgCache to add state information about the packages on the system so that pkgCache is mostly read-only.
The code to initialize these instances is not covered in the article. Check the files apt-pkg/pkgcache.h, apt-pkg/cachefile.h and apt-pkg/pkgcachegen.h if you are curious.
Further Documentation: APT Cache File Format
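To illustrate what "only values, in successive order" means for a binary format, here is a Go sketch that serializes a small struct and dumps it in hexadecimal, like xxd does. The struct header below is a hypothetical, much smaller stand-in for the real cache header, and its field values are placeholders.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// header is NOT the real APT header, just a stand-in with a few fields.
type header struct {
	Signature    uint32
	MajorVersion uint8
	MinorVersion uint8
	Dirty        uint8
	State        uint8 // an enum value stored instead of a string like "installed"
}

// encodeHeader writes the field values one after the other,
// without any field name, in little-endian byte order.
func encodeHeader(h header) ([]byte, error) {
	var buf bytes.Buffer
	if err := binary.Write(&buf, binary.LittleEndian, h); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	b, err := encodeHeader(header{Signature: 0x12345678, MajorVersion: 1, State: 6})
	if err != nil {
		fmt.Println("encoding failed:", err)
		return
	}
	fmt.Print(hex.Dump(b))
}
```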
We will not go deeper into the APT Cache code. We have already inspected the structure of the different index files (InRelease, Packages, …) and we know that APT commands use pkgCacheFile.GetPkgCache() and pkgCacheFile.GetDepCache() to retrieve information from the cache.
What follows are annotated definitions to give you an idea of the kind of information present in the APT Cache:
Here is the definition of the class pkgDepCache:
- Display statistics about package upgrades.
This last step simply traverses the cache to extract the relevant information.
- 1
- The operator [] is overloaded in pkgDepCache to return PkgState[I->ID], which is a struct StateCache containing the current installed and candidate versions.
- 2
- The method Upgradable() reads the state to determine if a new candidate version is available; if so, the code increments a counter.
- 3
- The macro P_ is defined by #define P_(msg,plural,n) (n == 1 ? msg : plural).
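The traversal can be sketched in Go as follows. The struct state is a simplified stand-in for StateCache, and the sketch treats a package as upgradable when its candidate version differs from the installed one, whereas APT performs a real version comparison. The function plural mirrors the macro P_.

```go
package main

import "fmt"

// state is a simplified stand-in for the per-package state kept by APT.
type state struct {
	Installed string // installed version, empty if the package is not installed
	Candidate string // candidate version found in the repositories
}

// upgradable is a simplification: any different candidate counts as an upgrade.
func (s state) upgradable() bool {
	return s.Installed != "" && s.Candidate != "" && s.Installed != s.Candidate
}

// countUpgradable traverses all states and counts the upgradable packages.
func countUpgradable(states map[string]state) int {
	n := 0
	for _, s := range states {
		if s.upgradable() {
			n++
		}
	}
	return n
}

// plural mirrors the macro P_: the singular form is used only when n == 1.
func plural(msg, pluralMsg string, n int) string {
	if n == 1 {
		return msg
	}
	return pluralMsg
}

func main() {
	// Package names and versions are invented.
	states := map[string]state{
		"hello": {Installed: "1.0", Candidate: "1.1"},
		"vim":   {Installed: "9.0", Candidate: "9.0"},
		"curl":  {Candidate: "8.0"},
	}
	n := countUpgradable(states)
	fmt.Printf("%d %s can be upgraded.\n", n, plural("package", "packages", n))
}
```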
That’s all for the command apt update. We will now cover other APT commands, reusing the knowledge we built about the APT cache.
apt list
Here is the code of the command apt list. This version omits optional arguments that are used to filter the list of results.
- 1
- The function CacheFile.GetPkgCache() delegates to the method BuildCaches() we covered in the previous section about apt update. This method is responsible for building the APT cache.
- 2
- The placeholders ${Package}, ${Origin}, … are replaced by their real values in the function ListSingleVersion.
- 3
- The real implementation uses the type LocalitySortedVersionSet, which is a list ordering packages based on their names in the Translation files of the user locale.
As with the apt update command, the code simply uses the information present in the APT cache. In this case, it happens in the function GetVersionSet:
- 1
- The command apt list --installed searches for installed packages.
- 2
- The command apt list --upgradable searches for installed packages that can be upgraded.
- 3
- The command apt list --all-versions searches for all packages in the APT cache.
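The three cases can be sketched in Go with a simple filter over an in-memory list. The types pkg and filter are hypothetical, and "upgradable" is again simplified to "the candidate differs from the installed version".

```go
package main

import "fmt"

// pkg is a simplified view of one package in the cache.
type pkg struct {
	Name      string
	Installed string // empty if the package is not installed
	Candidate string
}

type filter int

const (
	allVersions    filter = iota // like --all-versions
	installedOnly                // like --installed
	upgradableOnly               // like --upgradable
)

// selectPackages returns the names of the packages matching the filter.
func selectPackages(pkgs []pkg, f filter) []string {
	var names []string
	for _, p := range pkgs {
		installed := p.Installed != ""
		upgradable := installed && p.Candidate != "" && p.Candidate != p.Installed
		switch {
		case f == installedOnly && !installed:
			continue
		case f == upgradableOnly && !upgradable:
			continue
		}
		names = append(names, p.Name)
	}
	return names
}

func main() {
	// Invented data.
	pkgs := []pkg{
		{Name: "hello", Installed: "1.0", Candidate: "1.1"},
		{Name: "vim", Installed: "9.0", Candidate: "9.0"},
		{Name: "curl", Candidate: "8.0"},
	}
	fmt.Println(selectPackages(pkgs, installedOnly))
	fmt.Println(selectPackages(pkgs, upgradableOnly))
	fmt.Println(selectPackages(pkgs, allVersions))
}
```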
The packages are then formatted in the function ListSingleVersion():
- 1
- The function does not check which fields are present in the output format and thus tries to replace all of them. If a field is missing, the replacement does nothing.
- 2
- The code uses the state information present in pkgDepCache to determine if the package is installed, upgradable, and so on.
- 3
- The code ensures no remaining braces are left.
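The replacement logic described above can be approximated in Go like this. The function formatVersion is a sketch written for this article, not a translation of the APT code: it substitutes every known field and then removes any placeholder that is still present.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// leftover matches any ${...} placeholder that was not substituted.
var leftover = regexp.MustCompile(`\$\{[^}]*\}`)

// formatVersion replaces every ${Field} placeholder for which a value is
// known, whether or not it appears in the format, then drops the others.
func formatVersion(format string, values map[string]string) string {
	out := format
	for field, value := range values {
		out = strings.ReplaceAll(out, "${"+field+"}", value)
	}
	return leftover.ReplaceAllString(out, "")
}

func main() {
	// Field values are invented.
	values := map[string]string{
		"Package":      "hello",
		"Origin":       "stable",
		"Version":      "1.0",
		"Architecture": "amd64",
	}
	fmt.Println(formatVersion("${Package}/${Origin} ${Version} ${Architecture}${Unknown}", values))
}
```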
We will close the APT section by covering the most useful command.
apt install
The entry point is the function DoInstall(), which is called by various commands: install, reinstall, remove, purge, … The code will be simplified to keep only the installation usage.
- 1
- The package problem resolver is launched during step 2 and can add new packages to install to satisfy dependencies. Therefore, the number of packages to install can differ from the number of packages specified on the command line.
- Load the APT cache
The first step is, without surprise, to load the APT Cache using the method pkgCacheFile::Open(), which reuses methods we have already discussed.
- Determine the packages to install
Installing a package can also mean uninstalling some other packages. For example, the new version of a package may stop using a dependency that was used only by this package, and APT will try to autoremove it. The code is therefore a little more complicated.
For this step, we ignore most of these problems and focus on the installation of new packages with only new dependencies to install. The code is simplified accordingly.
For every package to install, the code updates the state in pkgDepCache using the functions Cache->GetDepCache()->SetCandidateVersion() and Cache.MarkInstall(). After that, the code executes the pkgProblemResolver. The goal is to fix broken packages, that is, packages that would have missing or conflicting dependencies if the installation continued. The resolver is large, with more than 1000 lines of code. To give you an idea of the kind of constraints the resolver must satisfy, here are the relevant fields for a common package:
The code documentation acknowledges that the code has become complex and very sophisticated over time. Moreover, the resolver may not even be able to fix all broken packages. Packages may be missing and conflicts may still exist. Check the function pkgProblemResolver::ResolveInternal() defined in apt-pkg/algorithms.cc for more details.
- 1
- Add one to CmdL.FileList to skip the install command name.
- 2
- Mark this package version to be installed.
- 3
- Ensure the resolver fixed the broken packages before continuing the installation.
- Ask confirmation for additional packages to install
This step simply iterates over the packages to install and inspects the calculated dependency list to keep the packages present in the fields Recommends and Suggests. The “recommended” dependencies are the most important and considerably improve the functionality offered by the package (these recommended packages are now installed by default by APT).
Here is an example of a package with recommended and suggested packages:
Note that dependencies of a package can also have recommended and suggested packages, and so on. Therefore, the final list displayed to the user is often pretty long:
We can confirm from the previous output that recommended packages are indeed installed by default.
- Proceed to the installation
The last step is managed by the function InstallPackages:
- 1
- APT acquires a lock using the fcntl() system call, which is used to manipulate file descriptors. When called with the command F_SETLK, the call returns -1 if the lock is already held by another process.
- 2
- APT supports multiple package managers but the default is the dpkg command. APT uses the class debSystem and the associated pkgDPkgPM to interact with the dpkg command.
- 3
- The Acquire subsystem is reused to download the archives. Internally, the code keeps two fields for every item to retrieve, FileSize and PartialSize, which are the size of the object to fetch and how much was already fetched. The methods Fetcher.FetchNeeded() and Fetcher.FetchPartial() iterate over the items to determine the total values.
- 4
- APT asks for confirmation before proceeding to the installation, except if you use options like apt -y install.
- 5
- Release the Dpkg lock /var/lib/dpkg/lock to make sure the dpkg command can acquire it.
- 6
- The package manager reads the file /var/lib/dpkg/status to find out the packages that were removed because none of their files was referenced by another package.
The installation logic is implemented by the class pkgDPkgPM.
- 1
- The package manager keeps a list of actions to perform.
- 2
- The method Install simply appends a new item of type Install to List.
- 3
- The method Go reads the list of actions and executes them.
The only remaining code is the dpkg command execution:
- 1
- The code is a classic example of Linux programming. The code uses the system calls fork(), exec(), and wait() to delegate to the command dpkg.
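In Go, the same fork/exec/wait sequence is hidden behind the package os/exec. Here is a minimal sketch that delegates to an external command and waits for it; the helper run is a name chosen for this example, and main expects the path of a .deb file as its first argument.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// run starts an external command, connects it to our terminal and
// waits for it to finish. A non-zero exit status is returned as an error.
func run(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run() // Run = start the child process + wait for it
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: install <file.deb>")
		os.Exit(2)
	}
	if err := run("dpkg", "-i", os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, "dpkg failed:", err)
		os.Exit(1)
	}
}
```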
After the dpkg command has run, the APT cache still has to be updated as the state of some packages has changed. There is nothing really new and we can stop our inspection of the APT code.
Case Study
As in the other parts, we will write a minimal version of the command apt install in Go. We will not bother with a cache and will simply read the Debian repositories every time.
To test our program, we need a basic package so that we can focus on the core logic of the APT installation process without having to support advanced logic. We will use a new version of our package hello (the code is available in the companion GitHub repository):
- 1
- Declare a required dependency available in the standard Debian repository.
- 2
- Use the binary installed by this dependency.
To build the new package:
- 1
- We use the command dpkg but we could also have used our Go version created in the first part.
Example of installation using APT:
The challenge is to install the same package using a basic Go program. We will reuse the dpkg version we wrote in Go.
Here is the code:
🎉 We have finished with the command apt. We have also finished with this article! We created a Debian archive using a basic Go program and we installed the package using Go versions of dpkg and apt.
“One” Last Word
Linux packages are just archives containing files to extract into a different system. The problem looks trivial but the devil is in the details.
In this article, we have glimpsed some of the challenges that a package manager must address. Packages use other packages, which means the package manager must face one of the most difficult problems in computing, dependency management. Despite that, Dpkg and Apt are still approachable programs.
We wrote basic versions from scratch using only a few hundred lines of Go code. The biggest difference is that the commands dpkg and apt are interactive and try to do a lot to avoid relying on the user to fix problems, which explains why the two programs together represent approximately 100,000 lines of C and C++ code.
If you are managing a large pool of servers like a datacenter, reimplementing your own package manager can be interesting. For example, you could centralize all local databases to ensure that all machines share the same state, or you could take corrective actions like excluding a server from the pool when an upgrade ends in a bad state. Google provides a great example. They decided to implement their own package management system. “Any package change is guaranteed to succeed, or the machine is rolled back completely to the previous state. If the rollback fails, the machine is sent through our repairs process for reinstallation and potential hardware replacement. This approach allows us to eliminate much of the complexity of the package states.” [1] The decision was surely not an easy one, but the benefits are clear.
Implementing a package manager from scratch can be intimidating, but as we have seen in this article, the reality is not so bad, especially if we consider the long list of features that Apt supports that are not useful when managing a large number of homogeneous machines in an automated way.
Footnotes
- [1] Building Secure and Reliable Systems, O’Reilly, Chapter 9 - Design for Recovery, Footnote 18