Yaird — Yet Another Mkinitrd

Erik van Konijnenburg

2005-02-09

Abstract

This memo discusses the design goals and implementation of Yaird (Yet Another mkInitRD), a proof-of-concept application to create an initial boot image, a minimal filesystem used to bring a booting Linux kernel to a level where it can access the root file system and use startup scripts to bring the system to the normal run level. It differs from earlier mkinitrd implementations in that it leverages the information in sysfs to minimise the number of special cases that the application has to know about, and in that it uses a template system to separate the analysis of the system from the building of the image.


Table of Contents

Introduction
Goals, features, to do
Concepts
The interface between kernel and image
Supporting Raid Devices
Supporting EVMS
Supporting Encrypted Disks
Supporting NFS Root
Supporting S390
Supporting Input Devices
Supporting Shared Libraries
Security
Tool Chain
Authors
License

Introduction

Yaird (Yet Another mkInitRD) is an application to create an initial boot image, a minimal filesystem used to bring a booting Linux kernel to a level where it can access the root file system and use startup scripts to bring the system to the normal run level.

It differs from earlier mkinitrd implementations in that it attempts to leverage the information in sysfs to minimise the number of special cases that the application has to know about, and in that it uses a template system to separate the analysis of the system from the building of the image.

This document gives an overview of the design and implementation of Yaird; see the README file for usage information. This text assumes familiarity with Linux system administration and the basics of hotplug and sysfs.

This document describes version 0.0.12 of Yaird. This is a very rough proof-of-concept version.

Goals, features, to do

The purpose in life of a tool like Yaird is to produce an initial boot image that loads the required modules to allow a booting kernel to access the root file system and from there use the startup scripts to get to the default run level. This means that hardly any drivers need to be compiled into the kernel itself, so a distribution can produce a kernel with a large number of modules that will run unchanged on practically any hardware, without introducing a large number of unused drivers that would waste RAM. In a sense, the initial boot image customises the kernel to the hardware it happens to be running on.

That purpose still leaves a lot of room to optimise for different goals: as an example, you could attempt to make the generated image as small as possible, or you could attempt to make the generated image so flexible that it will boot on any hardware. This chapter discusses the goals that determined the design, the resulting features, and what's still left to do.

The goals[1] of Yaird are as follows:

  • Be free, as in GPL.

  • Be maintainable. Small functions with documented arguments and result are better than a shell script full of constructs like eval "awk | bash | tac 3>&7".

  • Be secure and reliable. The application should stop with an error message at the slightest provocation, rather than run the risk of producing a non-booting initrd image. The application should not open loopholes that allow the 'bad guys' to modify the image, gain access to raw devices or overwrite system files.

  • Be distribution agnostic. Fedora and Debian run similar kernels and similar startup scripts, so there's little reason why the glue between the two levels should be completely different.

  • Have limited footprint. The tools needed to build and run the application should be few and widely available, with a preference for tools that are installed anyway.

  • Be future proof. Future kernels may use different modules and may change device numbers; the application should need no changes to cope with such migrations.

  • Promote code reuse. Make functions side-effect free and independent of context, so that it's easy to package the core as a library that can be reused in other applications.

  • Generate small images. The application should accurately detect what modules are needed to get the root file system running and include only those modules on the generated image.[2]

Requirements:

  • Linux 2.6.8 or later, both when running yaird and when running the generated image. By limiting the goal to support only recent kernels, we can drastically reduce the number of special cases and knowledge about modules in the application.

  • A version of modprobe suitable for 2.6 kernels.

  • Sysfs and procfs, both on the old and on the new kernel.

  • Perl and the HTML-Template module.

To achieve these goals, the following features are implemented:

  • Templating system to tune the generated image to a given distribution; templates for Debian and Fedora FC3 included.

  • Interprets /etc/fstab, including details such as octal escapes, ignore and noauto keywords, and — for ext3 and reiser file systems — label and uuid detection. Where applicable, options in /etc/fstab are used in the generated image.

  • Supports volume management via LVM2; activates only the volume group required for the root file system.

  • Supports software RAID via mdadm; activates only required devices.

  • Supports EVMS, an alternative user level interface to LVM2 and MD.

  • Supports encrypted filesystems via cryptsetup and cryptsetup-luks.

  • Supports NFS root filesystems.

  • Understands SATA, IDE, DASD devices.

  • Generated image does not use hard coded device numbers.[3]

  • Image generation understands how included executables may depend on symbolic links and shared libraries. Shared libraries work for both glibc and klibc.

  • Supports input devices such as a USB keyboard, if the input device supports sysfs. Input devices are needed in the initial image to supply a password for an encrypted root disk and for debugging.

  • Basic support for kernel command line as passed by the boot loader. Interprets init=, ro, rw.

  • Module aliases and options as specified in /etc/modprobe.d are supported.

  • Interprets the blacklist information from hotplug.

  • Interprets the kernel configuration file that defines whether a component is built in, available as a module or unavailable. By maintaining a mapping between module name and config parameter for selected modules, we avoid error messages if for instance a required file system is built into the kernel.

  • Supports initramfs, both in Debian and Fedora versions. An example template using the older initrd model is included for Debian.

  • Does not require devfs in either the old or the new kernel.

  • Behaviour of the generated image can be tuned using configuration files.

Obviously, this tool is far from complete. Here's a list of features that still need to be implemented:

  • USB storage is understood and needs no special provisions for code generation, but it has not been tested yet.

  • Swsusp is not supported yet.

  • Firewire is not supported.

  • Loopback file systems are not supported yet.

  • Filesystems encrypted via loopaes are not supported yet.

Concepts

This section discusses the basic concepts underlying yaird. The main procedure of the program is this:

      #
      # go -- main program:
      # write an image to destination in specified format.
      # destination must not exist yet.
      # templates define how actions in masterplan are expanded.
      #
      sub go ($$) {
	      my ($config, $destination) = @_;
	      my $masterPlan = Plan::makePlan ($config->{goals});
	      my $image = $masterPlan->expand ($config->{templates});
	      Pack::package ($image, $config->{format}, $destination);
      }
    

What it does is simple:

  1. given some goals, make a plan with a number of actions that the generated image should execute;

  2. transform the plan to a detailed description of the image;

  3. build and pack the image.

About Goals

The generated initial boot image should achieve a number of goals before handing over control to the root file system. There is a configuration file that determines what these goals are; the default list of goals is as follows:

	GOALS
		TEMPLATE	prologue
		INPUT
		MODULE		mousedev
		MODULE		evdev
		MOUNTDIR	"/" "/mnt"
		TEMPLATE	postlude
	END GOALS
      

The complete list of goals that yaird knows about is as follows:

TEMPLATE name

Add the contents of the named template to the image. It is not possible to pass arguments to the template.

MODULE name

Add the named module to the image.

INPUT

Add modules for every keyboard device found on the system to the image.

NETWORK

Add modules for every ethernet device found on the system to the image.

MOUNTDIR fsdir mountPoint

Given a directory that occurs in /etc/fstab, get the underlying block device and file system type working, then mount it at mountPoint.

MOUNTDEV blockDevice mountPoint

Given a block device that occurs in /etc/fstab, get the block device and corresponding file system type working, then mount it at mountPoint. It is not possible to express activating a block device without mounting it somewhere.
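As a rough illustration of the lookup that MOUNTDIR starts from, the following shell sketch finds the device and file system type for a mount point. The function name fstab_lookup is invented for this example; it is not part of yaird, and it deliberately ignores the octal escapes and the label and uuid forms that yaird itself interprets.

```shell
# fstab_lookup DIR [FSTAB]
# Print "device fstype" for the first fstab entry mounted on DIR.
# Returns non-zero if DIR does not occur in the file.
# Sketch only: octal escapes, labels and uuids are not decoded.
fstab_lookup() {
    awk -v dir="$1" '
        $1 ~ /^#/ || NF < 3 { next }     # skip comments and short lines
        $2 == dir { print $1, $3; found = 1; exit }
        END { exit !found }
    ' "${2:-/etc/fstab}"
}

# Possible use on a running system:
#   fstab_lookup /
```

The second argument exists only so that the function can be tried against a copy of the file rather than the live /etc/fstab.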

It is likely that new types of goal will need to be introduced to support features such as software suspend.

Making the Plan

The goals listed in the configuration file need to be translated into actions to be taken by the generated image. As an example, before mounting a file system, the modules containing the implementation of the file system need to be loaded.

To refine the goal of loading a kernel module, the ModProbe module invokes the modprobe command to find any prerequisite modules, skipping any modules that are blacklisted or compiled into the kernel. Aliases are handled transparently by modprobe; module options are recorded so that they can be included in the initial image. If the modprobe command decides a module needs an install command, an error is generated, because we cannot in general determine which executables the install command would need on the initial boot image.

The KConfig module determines if loading a module can be omitted because the module is hardcoded into the kernel. As an example, it is aware of the fact that the module ext3 is not needed if the new kernel configuration contains CONFIG_EXT3_FS=y.[4] Only a few modules are known: yaird looks for modules such as ext3 when that filesystem is used, so it makes sense to check whether a missing module is compiled in. On the other hand, hardware modules that are compiled in never show up in modules.pcimap and friends, so they remain completely outside the view of yaird.
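The built-in check described above can be sketched in a few lines of shell. This is an illustration rather than the KConfig implementation: the function name config_builtin is made up, and the location of the kernel configuration file is left to the caller.

```shell
# config_builtin FILE OPTION
# Succeed if OPTION is compiled into the kernel (OPTION=y) according to
# the kernel configuration file FILE. "=m" (module) and
# "# OPTION is not set" both count as not built in.
config_builtin() {
    grep -q "^$2=y\$" "$1"
}

# Hypothetical use, with a config file path chosen by the caller:
#   if config_builtin "$configfile" CONFIG_EXT3_FS; then
#       echo "ext3 is built in, no module needed"
#   fi
```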

Before a device as listed in /etc/fstab can be mounted, that device needs to be enabled. That device could be an NFS mount, a loopback mount or it could be a block device. The loopback case is not supported yet, but block devices are. This support is based on a number of sources of information:

  1. Scanning the /dev directory gives us the relation between all block special files and major/minor numbers.

  2. Scanning the /sys/block directory gives us the relation between all major/minor numbers and kernel names such as dm-0 or sda1; it also gives the relation between partitions and complete devices.

  3. If there is a symlink in a /sys/block subdirectory to the /sys/devices directory, it also gives us the relation between a block device and the underlying hardware.

Based on the kernel name and partition relationships of the device, we determine the steps needed to activate the device. As an example, to activate sda1, we need to activate sda, then create a block special file for sda1. As another example, to activate dm-0, our first bet is to check whether this is an LVM logical volume, and if so activate the physical volumes underlying the volume group, and finally run vgchange -a y. Otherwise, it could be an encrypted device, for which we generate different code.
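The second source of information, the scan of /sys/block, could look roughly like the shell sketch below. It assumes the layout described above, where every whole device and every partition directory holds a dev file containing major:minor. The function name list_block_devices is hypothetical.

```shell
# list_block_devices [SYSFS_BLOCK_DIR]
# Print "major:minor kernel-name" for every whole device and every
# partition found under a sysfs block directory (default /sys/block).
list_block_devices() {
    root=${1:-/sys/block}
    for devfile in "$root"/*/dev "$root"/*/*/dev; do
        [ -f "$devfile" ] || continue    # pattern did not match anything
        dir=${devfile%/dev}              # directory holding the dev file
        printf '%s %s\n' "$(cat "$devfile")" "${dir##*/}"
    done
}
```

Joining this output with the major/minor numbers of the block special files in /dev gives the mapping from a name in /etc/fstab to a kernel name such as sda1 or dm-0.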

Hardware Planning

Some devices, such as sdx or hdy, are expected to have underlying hardware; as an example, sda may be backed by pci0000:00/0000:00:1f.2/host0/0:0:0:0. This represents a hardware path, in this case a controller on the PCI bus that connects to a SCSI device. In order to use the device, every component on the path needs to be activated, the component closest to the CPU first. Based on the pathname in /sys/devices and on files within the directory for the component, we can determine what kind of component we're dealing with, and how to find the required modules.
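To make the "component closest to the CPU first" ordering concrete, here is a small shell sketch that splits a hardware path into the successive components that need to be activated. The function name component_paths is invented for this illustration.

```shell
# component_paths PATH
# Print every prefix of a sysfs hardware path, one per line,
# starting with the component closest to the CPU.
component_paths() {
    rest=$1
    prefix=
    while [ -n "$rest" ]; do
        first=${rest%%/*}                # leading path component
        case $rest in
            */*) rest=${rest#*/} ;;      # drop that component
            *)   rest= ;;                # it was the last one
        esac
        prefix=${prefix:+$prefix/}$first
        printf '%s\n' "$prefix"
    done
}

component_paths pci0000:00/0000:00:1f.2/host0/0:0:0:0
```

For the example path from the text this prints four lines: the PCI root, the controller, the SCSI host and finally the SCSI device itself.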

Finding modules closely follows the methods used in the hotplug package, and the hotplug approach in turn is an almost literal translation of the code that the kernel uses to find a driver for a newly detected piece of hardware.

For components that talk some protocol over a bus, like SCSI or IDE disks or CDROMs, this is a simple hard coded selection; as an example, the ScsiDev module knows that a SCSI device with a type file containing "5" is a CDROM, and that sr-mod is the appropriate driver.

Devices such as PCI or USB devices cannot be classified into a few simple categories. These devices have properties such as "Vendor", "Device" and "Class" that are visible in sysfs. The source code of kernel driver modules for these devices contains a table listing which combination of properties mark a device that the driver is prepared to handle. When the kernel is compiled, these tables are summarised in a text file such as modules.pcimap. Based on this table, we find a driver module needed for the device and mark it for inclusion on the image.
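The table lookup can be sketched in shell as follows. This is a simplified illustration, not yaird's code: it assumes the usual modules.pcimap column order (module, vendor, device, subvendor, subdevice, class, class mask, driver data), treats 0xffffffff as "match anything", and ignores subvendor and subdevice. The function name pci_modules is hypothetical.

```shell
# pci_modules MAPFILE VENDOR DEVICE CLASS
# Print every module in a modules.pcimap style file whose entry matches
# the given ids. The ids are hex numbers with a 0x prefix, as found in
# the vendor, device and class files in sysfs.
# Sketch only: subvendor and subdevice are not compared.
pci_modules() {
    any=$((0xffffffff))                  # wildcard value in the map
    while read -r module vendor device subvendor subdevice class mask rest; do
        case $module in ''|'#'*) continue ;; esac    # blank or comment
        [ $(($vendor)) -eq "$any" ] || [ $(($vendor)) -eq $(($2)) ] || continue
        [ $(($device)) -eq "$any" ] || [ $(($device)) -eq $(($3)) ] || continue
        # class matches if all bits selected by the mask are equal
        [ $(( ($class ^ $4) & $mask )) -eq 0 ] || continue
        printf '%s\n' "$module"
    done < "$1"
}
```

Because several entries may match, the function prints all of them rather than stopping at the first hit.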

Multiple modules can match the same hardware: as an example, usb-storage and ub both match a USB stick. In such cases, we load all matching modules into the kernel and leave it to the kernel to decide who gets to manage the device. There's one complication: some modules, such as usb-core, match any device (probably to maintain some administration of their own, or to provide an ultra-generic interface), but do not actually provide access to the device. Such modules are weeded out by the Blacklist module, based on information in /etc/hotplug/blacklist and /etc/hotplug/blacklist.d.
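A minimal sketch of such a filter is shown below. It assumes a blacklist file with one module name per line and # for comments, and it does not treat - and _ in module names as equivalent. The function name not_blacklisted is invented for the example.

```shell
# not_blacklisted BLACKLIST_FILE
# Copy module names from stdin to stdout, one per line, dropping
# every name that appears as a complete line in BLACKLIST_FILE.
not_blacklisted() {
    while read -r module; do
        # -x: match the whole line, -F: fixed string, not a pattern
        grep -qxF -e "$module" "$1" || printf '%s\n' "$module"
    done
}

# Possible use:
#   printf '%s\n' usb-core usb-storage ub | not_blacklisted /etc/hotplug/blacklist
```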

It turns out that the "load modules for every component in the sysfs path" approach is not always sufficient: sometimes you have to load siblings as well. As an example, consider a combined EHCI/UHCI USB controller on a single chip. The same ports can show up as EHCI or UHCI devices, different PCI functions in the same PCI slot, with different sysfs directories, depending on what kind of hardware is connected. Purely following the sysfs path, we would only need to load the EHCI driver, but it appears that on this kind of chip, EHCI devices are not reliably detected unless the UHCI driver is loaded as well. For this reason, we extend the algorithm with a rule: "for PCI devices, load modules for every function in the PCI slot".

That's actually a bit much: it would load all of ALSA if you have a combined ISA/IDE/USB/Multimedia chipset. So we limit the above to those PCI functions that provide USB ports.

Plan Transformation

The plan generated in the first phase is a collection of general intentions, stuff like 'load this module', but it does not specify exactly what files must be placed on the image and what lines are to be added to the initialisation scripts.

The module ActionList represents this plan with a list of hashes; every hash contains at least 'action' and 'target', with other keys added to provide extra information as needed. If two steps in the plan have identical action and target, the last one is considered redundant and silently omitted.
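The redundancy rule can be illustrated with a one-line filter. The sketch assumes a simplified plan with one step per line, written as "action target" without spaces inside either field. The real ActionList works on Perl hashes. The action name mkbdev and the file names are made up.

```shell
# dedupe_steps
# Read "action target" lines on stdin and print each (action, target)
# pair only the first time it appears, preserving the original order.
dedupe_steps() {
    awk '!seen[$1, $2]++'
}

printf '%s\n' 'insmod /lib/a.ko' 'mkbdev /dev/sda' 'insmod /lib/a.ko' | dedupe_steps
```

The repeated insmod step is dropped, so the two remaining steps are printed in their original order.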

This plan is transformed to an exact image description with the help of templates. These templates are read from a configuration file; for every type of action they can contain:

  • files to be copied from the mother system to the same location on the image;

  • directories to be created on the image; these do not have to exist on the mother system;

  • trees to be copied recursively from the mother system to the image;

  • script fragments: a few lines of code to be appended to the named file on the image.

All of the above are fed through HTML-Template, with the hash describing this action as parameters. In practice, this looks like so:

	#
	# Template -- translating general intentions to exact image layout
	#
	TEMPLATE SET

	    TEMPLATE insmod
	    BEGIN
		FILE "<TMPL_VAR NAME=target>"
		FILE "/sbin/insmod"

		# optionList may be undef
		# and already is suitably escaped.
		SCRIPT "/init"
		BEGIN
		    !/sbin/insmod '<TMPL_VAR NAME=target>' \
		    !      <TMPL_VAR NAME=optionList>
		END SCRIPT
	    END TEMPLATE

	    # lots of other templates omitted ...
	END TEMPLATE SET

      

There are a few attributes that are available to every template:

version

The kernel version we're generating an image for. Useful if you want your image to include a complete copy of /lib/modules/(version)/kernel.

appVersion

The version of yaird used to build the image.

auxDir

The directory where yaird keeps executables intended to go on the image, such as run_init.

Currently, there are templates for Debian and for Fedora, plus a template showing how to use the older initrd approach.

Image Generation

The detailed image description consists of a collection of names of files, directories, symbolic links and block or character devices, plus a number of lines of shell script. The image description does not contain permission or ownership information: files always have mode 444, executables and directories always 555, devices always mode 600,[5] and everything is owned by root.

The Image module contains the image description and can write the image to a directory. It understands about symlinks: if /sbin/vgscan is added to the image and it happens to be a symlink to lvmiopversion, both vgscan and lvmiopversion will be added to the image. Shared libraries are supported via the SharedLibraries module, as discussed in the section called “Supporting Shared Libraries”. Invocations of other executables are not recognised automatically: if lvmiopversion executes /etc/lvm-200/vgscan, the latter needs to be added explicitly to the image.

The copying of complete trees to the image is influenced by the copying for executables: if there is a symlink in the tree, its target is also included on the image, but if the target is a directory, its contents are not copied recursively. This approach avoids loops in image generation. Note that the target of a symlink must exist: yaird refuses to copy dangling links.
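The symlink handling can be pictured with the following shell sketch, which lists a file together with every link target along the way and refuses dangling links. It is an illustration only: the function name link_chain is hypothetical, it expects an absolute path, and it relies on the readlink utility being available.

```shell
# link_chain PATH
# Print PATH and, as long as the current name is a symlink, each
# successive target, so that every name in the chain can be put on
# the image. Fails without output for a dangling link or a link loop.
# PATH is expected to be absolute.
link_chain() {
    path=$1
    [ -e "$path" ] || return 1           # dangling, looping or missing
    while [ -L "$path" ]; do
        printf '%s\n' "$path"
        target=$(readlink "$path") || return 1
        case $target in
            /*) path=$target ;;                  # absolute target
            *)  path=${path%/*}/$target ;;       # relative to link's directory
        esac
    done
    printf '%s\n' "$path"
}
```

For the vgscan example above, link_chain /sbin/vgscan would print both /sbin/vgscan and the lvmiopversion file it points to.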

Packing the Image

The final step is packing the image in a format that the bootloader can process; this is handled by the module Pack. The following formats are supported:

cpio

A gzip-compressed cpio file (new ASCII format), required for the initramfs model as used in the templates for Debian and Fedora.

directory

An unpacked directory, good for debugging or manually creating odd formats.

cramfs

A cramfs filesystem, used for Debian initrd images.

The interface between kernel and image

The initial boot image is supposed to load enough modules to let the real root device be mounted cleanly. It starts up in a very bare environment and it has to do tricky stuff like juggling root filesystems; to pull that off successfully it makes sense to take a close look at the environment that the kernel creates for the image and what the kernel expects it to do. This section contains raw design notes based on kernel 2.6.8.

The processing of the image starts even before the kernel is activated. The bootloader, grub or lilo for example, reads two files from the boot file system into ram: the kernel and image. The bootloader somehow manages to set two variables in the kernel: initrd_start and initrd_end; these variables point to the copy of the image in ram. The bootloader now hands over control to the kernel.

During setup, the kernel creates a special file system, rootfs. This mostly reuses ramfs code, but there are a few twists: it can never be mounted from userspace, there's only one copy, and it's not mounted on top of anything else. The existence of rootfs means that the rest of the kernel always can assume there's a place to mount other file systems. It also is a place where temporary files can be created during the boot sequence.

In initramfs.c:populate_rootfs(), there are two possibilities. If the image looks like a cpio.gz file, it is unpacked into rootfs. If the file /init is among the files unpacked from the cpio file, the initramfs model is used; otherwise we get a more complex interaction between kernel and initrd, discussed in the section called “Booting with initrd”.

Booting with Initramfs

If the image was a cpio file, and it contains a file /init, the initramfs model is used. The kernel does some basic setup and hands over control to /init; it is then up to /init to make a real root available and to transfer control to the /sbin/init command on the real root.

The tricky part is to do that in such a way that there is no way for user processes to gain access to the rootfs filesystem; and in such a way that rootfs remains empty and hidden under the user root file system. This is best done using some C code; yaird uses run_init, a small tool based on klibc.

	# invoked as last command in /init, with no other processes running,
	# as follows:
	# exec run_init /newroot /sbin/init "$@"
	- chdir /newroot
	# following after lots of sanity checks and not across mounts:
	- rm -rf /*
	- mount --move . /
	- chroot .
	- chdir /
	- open /dev/console
	- exec /sbin/init "$@"
      

Booting with initrd

If the image was not a cpio file, the kernel copies the initrd image from wherever the boot loader left it to rootfs:/initrd.image, and frees the ram used by the bootloader for the initrd image.

After reading initrd, the kernel does more setup to the point where we have:

  • working CPU and memory management

  • working process management

  • compiled in drivers activated

  • a number of support processes such as ksoftirqd are created. (These processes have the rootfs as root; they can get a new root when the pivot_root() system call is used.)

  • something like a console. Console_init() is called before PCI or USB probes, so expect only compiled in console devices to work.

At this point, in do_mounts.c:prepare_namespace(), the kernel looks for a root filesystem to mount. That root file system can come from a number of places: NFS, a raid device, a plain disk or an initrd. If it's an initrd, the sequence is as follows (where devfs can fail if it's not compiled into the kernel):

      - mount -t devfs devfs /dev
      - md_run_setup()
      - process initrd
      - umount /dev
      - mount --move . /
      - chroot .
      - mount -t devfs devfs /dev
      

Once that returns, in init/main.c:init(), initialisation memory is freed and /sbin/init is executed with /dev/console as file descriptor 0, 1 and 2. /sbin/init can be overruled with an init=/usr/bin/firefox parameter passed to the boot loader; if /sbin/init is not found, /etc/init and a number of other fallbacks are tried. We're in business.

The processing of initrd starts in do_mounts_initrd.c:initrd_load(). It creates rootfs:/dev/ram, then copies rootfs:/initrd.image there and unlinks rootfs:/initrd.image. Now we have the initrd image in a block device, which is good for mounting. It calls handle_initrd(), which does:

      # make another block special file for ram0
      - mknod /dev/root.old b 1 0
      # try mounting initrd with all known file systems,
      # optionally read-only
      - mount -t xxx /dev/root.old /root
      - mkdir rootfs:/old
      - cd /root
      - mount --move . /
      - chroot .
      - mount -t devfs devfs /dev
      - system ("/linuxrc");
      - cd rootfs:/old
      - mount --move / .
      - cd rootfs:/
      - chroot .
      - umount rootfs:/old/dev
      - ... more ...
      

So initrd:/linuxrc runs in an environment where initrd is the root, with devfs mounted if available, and rootfs is invisible (except that there are open file handles to directories in rootfs, needed to change back to the old environment).

Now the idea seems to have been that /linuxrc would mount the real root and pivot_root into it, then start /sbin/init. Thus, linuxrc would never return. However, main.c:init() does some useful stuff only after linuxrc returns: freeing init memory segments and starting numa policy, so in, for example, Debian and Fedora, /linuxrc will end, and /sbin/init is started by main.c:init().

After linuxrc returns, the variable real_root_dev determines what happens. This variable can be read and written via /proc/sys/kernel/real-root-dev. If it is 0x0100 (the device number of /dev/ram0) or something equivalent, handle_initrd() will change directory to /old and return. If it is something else, handle_initrd() will decode it, mount it as root, mount initrd as /root/initrd, and again start /sbin/init. (If mounting as /root/initrd fails, the block device is freed.)

Remember handle_initrd() was called via initrd_load() from prepare_namespace(), and prepare_namespace() ends by chrooting into the current directory: rootfs:/old.

Note that rootfs:/old was move-mounted from '/' after /linuxrc returned. When /linuxrc started, the root was initrd, but /linuxrc may have done a pivot_root(), replacing the root with a real root, say /dev/hda1.

Thus:

  • /linuxrc is started with initrd mounted as root.

  • There is working memory management, processes, compiled in drivers, and stdin/out/err are connected to a console, if the relevant drivers are compiled in.

  • Devfs may be mounted on /dev.

  • /linuxrc can pivot_root.

  • If you echo 0x0100 to /proc/sys/kernel/real-root-dev, the pivot_root will remain in effect after /linuxrc ends.

  • After /linuxrc returns, /dev may be unmounted and replaced with devfs.

Thus a good strategy for /linuxrc is to do as little as possible, and defer the real initialisation to /sbin/init on the initrd; this /sbin/init can then pivot_root into the real root device.

	#!/bin/dash
	set -x
	mount -nt proc proc /proc
	# root=$(cat /proc/sys/kernel/real-root-dev)
	# 256 is 0x0100, the device number of /dev/ram0
	echo 256 > /proc/sys/kernel/real-root-dev
	umount -n /proc
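
The value 256 written to real-root-dev in the script above is the decimal form of 0x0100: major number 1 in the high byte and minor number 0 in the low byte. A small sketch of that arithmetic follows. The function name devno is made up, and the formula only holds for minor numbers below 256.

```shell
# devno MAJOR MINOR
# Encode a device number with the major number in the high byte and
# the minor number in the low byte. Only valid for minors below 256.
devno() {
    echo $(( $1 * 256 + $2 ))
}

devno 1 0      # /dev/ram0 (block device 1,0): prints 256, i.e. 0x0100
```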
      

Kernel command line parameters

The kernel passes more information than just an initial file system to the initrd or initramfs image; there also are the kernel boot parameters. The bootloader passes these to the kernel, and the kernel in turn passes them on via /proc/cmdline.

An old version of these parameters is documented in the bootparam(7) manual page; more recent information is in the kernel documentation file kernel-parameters.txt. Mostly, these parameters are used to configure non-modular drivers, and are thus not very interesting to yaird. Then there are parameters such as noapic, which are interpreted by the kernel core and also irrelevant to yaird. Finally there are a few parameters which are used by the kernel to determine how to mount the root file system.

Whether the initial image should emulate these options or ignore them is open to discussion; you can make a case that the flexibility these options offer has become irrelevant now that initrd/initramfs offers far more fine grained control over the way in which the system is booted. Support for these options is mostly a matter of tuning the distribution specific templates, but it is possible that the templates need an occasional hint from the planner. To find out just how much "mostly" is, we'll try to implement full support for these options and see where we run into limitations. An inventory of the relevant options follows.

ydebug

The kernel does not know about this option, so we can use it to enable debugging in the generated image.

ide

These are options for the modular ide-core driver. This could be supported by adding an attribute "isIdeCore" to insmod actions, and expanding the ide kernel options only for insmod actions where that attribute is true. It seems cleaner to support the options from /etc/modprobe.conf. Unsupported for now.

init

The first program to be started on the definitive root device, default /sbin/init. Supported.

ro

Mount the definitive root device read only, so that it can be submitted to fsck. Supported; this is the default behaviour.

rw

Three guesses. Supported.

resume, noresume

Which device (not) to use for software suspend. To be done.

root

The device to mount as root. This is a nasty one: the planner by default only creates device nodes that are needed to mount the root device, and even if you were to put hotplug on the initial image to create all possible device nodes, there's still the matter of putting support for the proper file system on the initial image. We could make an option to yaird to specify a list of possible root devices and load the necessary modules for all of them. Unsupported until there's a clear need for it.

rootflags

Flags to use while mounting root file system. Implement together with root option.

rootfstype

File system type for root file system. Implement together with root option.

ip, nfsaddrs = <client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>

These two are aliases, with "ip" being the preferred form. This option may appear more than once. It tells the kernel to configure a network device, either based on values that are part of the option string or based on values supplied by DHCP.

In yaird, it also triggers the mounting of an NFS root.[6]

See the section called “Supporting NFS Root” and the kernel documentation file nfsroot.txt for details.

nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>]

Where the root file system to be mounted is coming from. If you don't give any options, we try first with NFS over TCP, then over UDP and finally NFSv2. If DHCP specifies a root directory, server and root are based on DHCP, but options in nfsroot are still applied. If nfsroot does not give server-ip, the server IP given by DHCP is used.
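
As an illustration of how a script on the image might interpret the supported init, ro and rw options from the list above, consider the following shell sketch. The function and variable names (parse_cmdline, init, root_mode) are invented for the example and are not taken from the yaird templates.

```shell
# parse_cmdline "WORDS..."
# Set the shell variables init and root_mode from a kernel command
# line string. Unknown words are ignored; a later ro or rw overrides
# an earlier one.
parse_cmdline() {
    init=/sbin/init          # default first program on the real root
    root_mode=ro             # read-only mount is the default
    set -f                   # no filename expansion while splitting
    for word in $1; do
        case $word in
            init=*) init=${word#init=} ;;
            ro)     root_mode=ro ;;
            rw)     root_mode=rw ;;
        esac
    done
    set +f
}

# On a running system:
#   parse_cmdline "$(cat /proc/cmdline)"
#   echo "would start $init on a root mounted $root_mode"
```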

Supporting Raid Devices

This section discusses software raid devices from an initial boot image perspective: how to get the root device up and running. There are other aspects to consider, the bootloader for example: if your root device is on a mirror for reliability, it would be a disappointment if after the crash you still had a long downtime because the MBR was only available on the crashed disk. Then there's the issue of managing raid devices in combination with hotplugging: once the system is operational, how should the raid devices that the initial image left untouched be brought online?

Raid devices are managed via ioctls (mostly; there is something called "autorun" in the kernel). The interface from userland is simple: mknod a block device file, send an ioctl to it specifying the devnos of the underlying block devices and whether you'd like mirroring or striping, then send a final ioctl to activate the device. This leaves the managing application free to pick any unused device (minor) number and makes no assumptions about device file names.

Devices that take part in a raid set also have a "superblock", a header at the end of the device that contains a uuid and indicates how many drives and spares are supposed to take part in the raid set. This can be used by the kernel to do consistency checking; it can also be used by applications to scan for all disks belonging in a raid set, even if one of the component drives is moved to another disk controller.

The fact that the superblock is at the end of a device has an obvious advantage: if you somehow lose your raid software, the device underlying a mirror can be mounted directly as a fallback measure.

If raid is compiled into the kernel rather than provided as a module, the kernel uses superblocks at boot time to find raid sets and make them available without user interaction. In this case the filename of the created block device is hardcoded: /dev/md\d. This feature is intended for machines with root on a raid device that don't use an initial boot image. This autorun feature is also accessible via an ioctl, but it's not used in management applications, since it won't work with an initial boot image and it can be a nuisance if some daemon brought a raid set online just after the administrator took it off line for replacement.

Finally, by picking a different major device number for the raid device, the raid device can be made partitionable without use of LVM.

There are at least three different raid management applications for Linux: raidtools, the oldest; mdadm, more modern; and EVMS, a suite of graphical and command line tools that manages not only raid but also LVM, partitioning and file system formatting. We'll only consider mdadm for now. The use of mdadm is simple:

  • There's an option to create a new device from components, building the superblock.

  • Another option assembles a raid device from components, assuming the superblocks are already available.

  • Optionally, a configuration file can be used, specifying which components make up a device, whether a device file should be created or it is assumed to exist, whether it's stripe or mirror, and the uuid. Also, a wildcard pattern can be given: disks matching this pattern will be searched for superblocks.

  • Information given in the configuration file can be omitted on the command line. If there's a wildcard, you don't even have to specify the component devices of the raid device. A typical command is mdadm --assemble /dev/md-root auto=md uuid=..., which translates to "create /dev/md-root with some unused minor number, and put the components with matching uuid in it."

So far, raid devices look fairly simple to use; the complications arise when you have to play nicely with all the other software on the box. It turns out there are quite a lot of packages that interact with raid devices:

  • When the md module is loaded, it registers 256 block devices with devfs. These devices are not actually allocated, they're just names set up to allocate the underlying device when opened. These names in devfs have no counterpart in sysfs.

  • When the LVM vgchange is started, it opens all md devices to scan for headers, only to find the raid devices have no underlying components and will return no data. In this process, all these stillborn md devices get registered with sysfs.

  • When udevstart is executed at boot time, it walks over the sysfs tree and lets udev create block device files for every block device it finds in sysfs. The name and permissions of the created file are configurable, and there is a hook to initialise SELinux access controls.

  • When mdadm is invoked with the auto option, it will create a block device file with an unused device number and put the requested raid volume under it. The created device file is owned by whoever executed the mdadm command, permissions are 0600 and there are no hooks for SELinux.

  • When the Debian installer builds a system with LVM and raid, the raid volumes have names such as /dev/md0, where there is an assumption about the device minor number in the name of the file.

For the current Debian mkinitrd, this all works together in a wonderful manner: devfs creates file names for raid devices; LVM scans them, with the side effect of entering the devices in sysfs; and after pivotroot, udevstart triggers udev into creating block device files with proper permissions and SELinux hooks. Later in the processing of rcS.d, mdadm will put a raid device under the created special file. Convoluted but correct, except for the fact that out of 256 generated raid device files, up to 255 are unused.

In yaird, we do not use devfs. Instead, we do a mknod before the mdadm, taking care to use the same device number that's in use in the running kernel. We expect mdadm.conf to contain an auto=md option for any raid device files that need to be created. This approach should work regardless of whether the fstab uses /dev/md\d or a device number independent name.
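
The "same device number" step can be sketched as follows. This is an illustration rather than yaird's actual code: it assumes the running kernel exposes the number in a sysfs file such as /sys/block/md0/dev in the form "major:minor", and the function names are made up.

```python
def parse_sysfs_dev(text):
    """Parse the contents of a sysfs 'dev' file such as '9:0\\n' into (major, minor)."""
    major, minor = text.strip().split(":")
    return int(major), int(minor)

def mknod_command(name, sysfs_dev_text):
    """Build the mknod command that recreates a block device file with the same number."""
    major, minor = parse_sysfs_dev(sysfs_dev_text)
    return "mknod /dev/%s b %d %d" % (name, major, minor)
```

For example, mknod_command("md0", "9:0\n") gives a command line that can be written into the image's startup script ahead of the mdadm invocation.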

Supporting EVMS

The EVMS suite aims to be a complete disk management solution: it recognises disk partitions, RAID configurations, concatenation of disk partitions, and file systems. It does all of this using its own plugin architecture, and is largely self-contained: in particular, LVM, mdadm or libdevmapper are not required. There are some external dependencies though: EVMS uses the same kernel modules to do RAID that other packages use, and it uses an external mkfs command to support file systems.

What can be moved out of the kernel has been moved out: as an example, EVMS does not rely on code in the kernel to interpret partition tables: a partition such as hda1 is unused. Instead, EVMS uses the dm mechanism to present parts of a physical disk as independent block devices. The advantage of this approach is that new partition table formats can be supported without kernel changes.

The plugin architecture provides three different user interfaces: command line, curses based and GUI. There also is a configuration and backup/restore mechanism, where plugins can send and receive state related data. There does not seem to be a central state file other than basic configuration: all state information is kept with the plugins.

Plugins are implemented as shared libraries, but the relation between library and plugin is not simple: there's no command to determine which plugins are contained in a library. This makes it difficult to determine in a maintainable way what's the minimal set of plugins needed to boot the system; the current implementation makes no attempt in that direction, and just loads the lot of them.

Once the hardware is available and device drivers are loaded, the EVMS system expects to take care of everything. This means yaird support can be fairly simple: once we find that a device is supported by EVMS (it's listed by the command "evms_query volumes"), we determine the underlying physical disks with the command "evms_query disks". We then build a boot image that loads drivers for the physical disk and afterwards runs the command "evms_activate" that will recreate all volumes.

There's a twist: the volume may need RAID drivers; to accommodate this, all RAID related modules are inserted into the kernel before starting "evms_activate". A possible improvement is to include modprobe on the image, and to let EVMS load only the required modules. This would save RAM at the expense of a somewhat larger initial boot image.

Note that some devices are visible in EVMS without actually working; these normally are shown with device number 0:0. This seems to happen mostly with devices that are not completely under the control of EVMS. I'm not sure whether this is a bug or a feature; but either way yaird will need to be aware of such devices and the fact that they may be visible, but that they are not bootable.

Supporting Encrypted Disks

To protect the content of your disk against unwanted reading even if the machine is stolen, it can make sense to encrypt the disk. This section discusses Linux support for disk encryption and the impact this has on the initial boot image.

The idea here is to encrypt the entire disk with a single key: the kernel encrypts and decrypts all blocks on an underlying device and presents it as a new ordinary block device, where you can use mkfs and fsck as always. Thus an encrypted disk only protects the confidentiality of your data in cases where the hardware is first switched off and then taken away for later perusal by the bad guys. It will not protect confidentiality if the bad guy gains access to a running system, either through an exploit or with a valid account.

There are different implementations of this idea. All implementations use the kernel crypto modules (the same stuff that supports IPsec), but they differ in how that cryptography is squeezed between userland and the disk platter.[7] Note that we do not compare how effective the various implementations are at keeping your data secret: if your data is important enough to encrypt, it's also important enough to do your own research into which implementation is most robust.

cryptoloop

Is in mainline kernel 2.6.10, but has reliability problems, such as possible deadlocks. The cryptoloop maintainer: "We should support cryptoloop. No new features, but working well. At the same time we should declare it 'deprecated' and provide dm-crypt as alternative." See kerneltrap for background. The on-disk format is trivial: just the encrypted data. When the device is initialised, the user enters a passphrase and a hash of this phrase is used as key to do the decryption, and if the result is a filesystem, the key was valid.

dm-crypt

Is in mainline kernel since 2.6.4. It uses device mapper (the same framework that is also used by LVM), which makes it more stable than cryptoloop. See dm-crypt: a device-mapper crypto target. Dm-crypt can use the same on-disk format as cryptoloop, but the device mapper makes it easy to reserve part of the disk for a partition header with key material.

Such a partition header, LUKS, is now under development; it will offer improved protection against dictionary attacks and will make it easier to change the password on an encrypted disk. Due to the way the device mapper works, support for the partition header can be implemented completely in userspace.

LUKS is integrated in Gentoo and included in Fedora FC4 test1. A debian package exists (cryptsetup-luks), but is not (yet) included in the main archive.

loopaes

An encrypting loop device; see http://loop-aes.sourceforge.net, http://mail.nl.linux.org/linux-crypto/. It's not in mainline kernel, and the author has no intentions of pushing it in.

All these implementations need some kind of userspace tool to pass key material to the kernel; this key material may come from lots of places:

  • in the most simple case, it could be a hashed version of the password

  • it could be a large random key stored in a gpg-encrypted file

  • for swap devices, it could be randomly regenerated on each reboot

  • for file systems other than the root, it could be from a file with mode 600 on the root file system

  • the key could be stored on a USB stick, stored separately from the machine.

An overview of relevant userspace tools:

  • the losetup command has an encryption option to use the cryptoloop module. Note that this does not cause cryptoloop to be mounted automatically.

  • versions of the mount command in Debian and Fedora have a 'loop,encryption' option that will be passed to losetup for use with cryptoloop, like so:

    	    /dev/vg0/crwrap /crypt1 ext3 loop,encryption=aes,noauto 0 0
    	  

  • The dmsetup command can set and show parameters (including key hashes!) for dm based devices, including dm-crypt and LVM. With a bit of shell scripting, you can hash a password and pass it on the command line to set up a dm-crypt device.

    	    # : dont bother cracking this key, its a dummy
    	    # dmsetup table
    	    crypted: 0 2097152 crypt aes-cbc-plain \
    		    e9975dcb10992fbc03a52f44e8f830d8e997\
    		    5dcb10992fbc03a52f44e8f830d8e9975dcb\
    		    10992fbc03a52f44e8f830d8 0 254:7 0
    	    vg0-crwrap: 0 2097152 linear 8:3 56623488
    	    #
    	  

  • The cryptsetup command adds a friendly wrapper around this. In particular, it has hashing of the keyword built in.

  • A modified package cryptsetup-luks exists, that adds extra options to (1) create a LUKS header for a partition and (2) open a partition given one of a number of possible passphrases.

  • The file /etc/crypttab is a debian extension to cryptsetup: it provides a list of crypted devices, their underlying devices, corresponding cipher and hash settings, plus the source for the passphrase: either some file or the controlling terminal. This allows the devices to be activated by /etc/init.d/cryptdisks. There is a thread on adding /etc/crypttab to Fedora: too late for FC3, to be considered again for FC4: see here and here.
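
As an illustration of what the dmsetup table output shown above carries, the sketch below picks apart one crypt line. It assumes the whole entry is on a single line (the backslashes in the listing are only there for layout) and that the fields after the device name are start, length, target type, cipher, key, IV offset, underlying device and offset. The function name is hypothetical.

```python
def parse_crypt_table(line):
    """Parse one line of 'dmsetup table' output; return None unless it is a crypt target."""
    name, _, rest = line.partition(":")
    fields = rest.split()
    # Expected fields: start length "crypt" cipher key iv_offset device offset
    if len(fields) != 8 or fields[2] != "crypt":
        return None
    start, length, _target, cipher, key, iv_offset, device, offset = fields
    return {
        "name": name.strip(),
        "sectors": int(length),
        "cipher": cipher,
        "key_bits": len(key) * 4,  # four bits per hex digit; the key itself is not kept
        "device": device,          # underlying device as major:minor
        "offset": int(offset),
    }
```

Lines for other targets, such as the linear LVM mapping in the listing, yield None.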

In order to activate an encrypted device with cryptsetup, we need to detect:

  • which underlying device to use

  • which encryption and hash algorithm to use

  • where the passphrase comes from

  • whether we have a plain crypted partition or a LUKS partition

In order to determine all these points we need information from /etc/crypttab; as a consistency check, we'll compare this to the output from "cryptsetup status".[8]
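
A minimal sketch of reading that information, assuming the Debian crypttab layout of four whitespace-separated fields (target name, source device, key file, comma-separated options) and assuming that a key file of "none" stands for the controlling terminal. Function and key names are invented for the example.

```python
def parse_crypttab(text):
    """Return a list of dicts, one per non-comment line of a crypttab file."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            raise ValueError("crypttab line needs at least target and source: %r" % line)
        options = {}
        if len(fields) > 3:
            for opt in fields[3].split(","):
                key, _, value = opt.partition("=")
                options[key] = value or True  # flag options such as 'verify' become True
        entries.append({
            "target": fields[0],
            "source": fields[1],
            "keyfile": fields[2] if len(fields) > 2 else "none",
            "options": options,
        })
    return entries
```

The caller can then compare each entry against "cryptsetup status" for the same target before trusting it.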

The resulting actions:

  1. If the source of the passphrase is something other than the console, abort. There are too many variables to support this reliably.

  2. For the passphrase hash algorithm, no modules need to be loaded, since it is included by cryptsetup from a user space library.

  3. Make the underlying device available.

  4. Modprobe the dm-crypt and the cipher (the module name is the part of the cipher name before the first hyphen). If the cipher block mode needs a hash, load that too. Note that the cipher block mode hash is something different from the passphrase hash: it's the part after the colon in eg 'aes-cbc-essiv:sha256'.

  5. For plain cryptsetup, invoke an action with the following parameters:

    	    cryptsetup  target=device
    			src= ...
    			hash= ...
    			cipher= ...
    			size= ...
    			verify=y|undef
    	  

    Here the cryptsetup action will result in a script fragment in /init that has "cryptsetup create" in a loop until exit status is 0. For plain cryptsetup, this only has effect in combination with the "verify" option: exit status is 0 if the user gives the same password twice in succession. With cryptsetup-luks, this would test that the passphrase actually gives access to the encrypted device.

  6. For cryptsetup-luks, invoke a similar action with fewer parameters, since so much of the required information is already in the header.
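
The module selection in step 4 can be sketched like this. It is an illustration under the assumption that the block mode hash is provided by a kernel module of the same name; the function name is made up.

```python
def modules_for_cipher(cipher):
    """List modules to modprobe for a dm-crypt cipher string such as 'aes-cbc-essiv:sha256'."""
    modules = ["dm-crypt"]
    spec, _, mode_hash = cipher.partition(":")
    # The cipher module is the part of the name before the first hyphen.
    modules.append(spec.split("-", 1)[0])
    # The block mode hash, if any, is the part after the colon.
    if mode_hash:
        modules.append(mode_hash)
    return modules
```

Note that the passphrase hash from step 2 never appears here, since it is handled in user space.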

Supporting NFS Root

It is possible to use an NFS share rather than a local disk as root device; this is (obviously) useful for diskless terminals, but it also can come in handy for recovery.

Examples of projects using NFS root for diskless work are LTSP, Lessdisks and Stateless Linux. In these projects, the initial boot image comes with the distribution and it must be sufficiently generic to support a wide range of hardware; in particular it must probe for different network cards. For yaird, we'll focus on recovery use, where the initial boot image is tailored for a single computer.

Although in principle the kernel and initial boot image for an NFS root system can be stored on a local disk, it's more common to have them loaded over the network with TFTP. This means you'll need a boot loader that can work over the network, such as pxelinux. This takes place before the initial boot image takes over; we won't dive into the details here.

There are a number of issues that make it impossible to automatically determine exactly what is needed to do a network boot:

  • Not all interfaces are suitable for booting: think of loopback devices, IPsec tunnels, 802.1Q endpoints.

  • Interfaces may be renamed by udev; thus there is no link between the name while running yaird and the name while running the initial boot image.

  • Once the system is running, there is no way to determine how an interface got its IP address: could be RARP, DHCP or static.

  • An NFS share in /etc/fstab contains a hostname and directory, with no portable indication how that name is resolved to an IP address, whether that IP address will be unchanged during the next reboot and whether the route to that IP address will stay unchanged.

This means we cannot determine how to mount the NFS root using only information that is readily available on the running system: we'll need a hint. Rather than give that hint in the form of yaird configuration options, we will use the kernel command line.

The NFS part of the boot process takes place after loading of keyboard drivers and before switching to the final root. It has the following phases:

  • Load device drivers for every interface that is backed by hardware: /sys/class/net/*/device.

  • load protocols: nfs for file sharing (this implies lockd and sunrpc), and af_packet for raw ether, needed for DHCP.

  • Configure interfaces: get an IP address, netmask, broadcast, gateway. As a side effect, get hostname, dns, rootserver, rootpath.

  • Mount the NFS root.
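
The first phase relies on the device link in sysfs to tell hardware interfaces from purely virtual ones. A small sketch of that test, with a hypothetical function name and the sysfs location as a parameter so it can be tried on a scratch directory:

```python
import os

def hardware_interfaces(sys_class_net="/sys/class/net"):
    """Return the names of network interfaces that have a 'device' entry in sysfs."""
    names = []
    for name in sorted(os.listdir(sys_class_net)):
        # Interfaces such as loopback have no 'device' entry and are skipped.
        if os.path.exists(os.path.join(sys_class_net, name, "device")):
            names.append(name)
    return names
```

Each name returned is a candidate for driver lookup via its device link.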

The last two steps are done by a single program, trynfs. This is based on the klibc components ipconfig and nfsmount. This program only is invoked if the kernel command line parameter ip= (or its alias nfsaddrs=) is set. The kernel parameters ip=, nfsaddrs=, nfsroot= are passed as arguments to trynfs.
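
The decision whether to run trynfs can be sketched as below. This is not the code in the generated image; it assumes the kernel command line is available as a string (for example from /proc/cmdline) and that the parameters are handed on in their original "name=value" form, which is a guess at the calling convention.

```python
def nfs_boot_args(cmdline):
    """Return the arguments for trynfs, or None if neither ip= nor nfsaddrs= is set."""
    wanted = ("ip", "nfsaddrs", "nfsroot")
    params = {}
    for word in cmdline.split():
        key, sep, value = word.partition("=")
        if sep and key in wanted:
            params[key] = value
    if "ip" not in params and "nfsaddrs" not in params:
        return None  # no hint on the command line: do not attempt an NFS root
    return ["%s=%s" % (key, params[key]) for key in wanted if key in params]
```

A command line without ip= or nfsaddrs= yields None, so the image falls through to its other ways of finding a root device.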

Earlier versions of Yaird had a command line option "--nfs" to enable NFS code generation. Starting with version 0.0.11, this option no longer is available. Instead, write a configuration file based on Default.cfg that uses the 'nfsstart' template to get an IP address and mount a root file system. The reason the command line option is dropped is that there are more ways to use NFS than can be expressed with a simple command line option: some people need only a driver for a specific card, others need lots of network drivers; you may or may not want to use a local drive as backup if no network is available; using a configuration file makes it possible to tune the generated image exactly for the situation at hand.

NFS Pitfalls

Yaird can get the system to a state where init is running from an NFS mounted root device, but that is not always sufficient to get a reliable system: the init scripts will also need to be written to work well in an NFS mounted environment. This section discusses some potential problems.

The Linux version of NFSv4 (Working Group, Linux reference implementation) has a new channel of communication between the kernel and user space: rpc_pipefs. This is normally mounted on /var/lib/nfs/rpc_pipefs, and is used to let a user space daemon do locking and Kerberos on behalf of the kernel.

The rpc_pipefs support on a machine can interfere with yaird. As an example, in Fedora, /etc/modprobe.conf.dist has an 'install' line for module 'sunrpc' that automatically mounts the rpc_pipefs filesystem when the module is loaded. This means the filesystem is not mounted if the sunrpc module happens to be compiled into the kernel; it also can't be mounted if sunrpc is loaded from the initial boot image, since there is no /var/lib/nfs/rpc_pipefs yet to mount it on. When yaird sees such an install line, it can no longer determine what should go on the initial boot image and terminates.

The workaround is to remove the 'install' line from modprobe.conf and to do the mounting in an /etc/init.d script before the rpc.gssd and rpc.statd daemons are started.

Note that using Kerberos with an NFS mounted root is of questionable value: Kerberos relies on a secret file on the root file system to guarantee the security of NFS, and if that secret file is on an NFS file system that is itself not protected by Kerberos, the guarantee loses value.

Another potential problem is dhclient, a tool to configure a network interface with DHCP. This can call a user script to manage DHCP state changes, and on FC4, that script happens to stop and start the interface to get it to a known state. Since the script itself is accessed over NFS via the interface, the stopping works, but the starting doesn't ... By using a fixed IP address you avoid this problem, but that is not a generally applicable solution.

Supporting S390

Information on the IBM zSeries architecture can be found in the Linux 2.4 to 2.6 Transition Guide.

The most common boot device will have a kernel name like "dasdb2", and is visible in sysfs as /sys/devices/css0/0.0.a104/0.0.010a; here the subdirectories of css0 are for subchannel, then device.

There is a file modules.ccwmap that looks like so:

      # ccw module         match_flags cu_type cu_model dev_type dev_model
      zfcp                 0x000f      0x1731  0x03      0x1732  0x03
      zfcp                 0x000f      0x1731  0x03      0x1732  0x04
      qeth                 0x0003      0x1731  0x01      0x0000  0x00
      qeth                 0x0003      0x1731  0x05      0x0000  0x00
      cu3088               0x0003      0x3088  0x08      0x0000  0x00
    

Here control unit type/model and device type/model are found in the files cutype and devtype, formatted like "%04x/%02x\n". The match_flags have this interpretation:

      #define CCW_DEVICE_ID_MATCH_CU_TYPE             0x01
      #define CCW_DEVICE_ID_MATCH_CU_MODEL            0x02
      #define CCW_DEVICE_ID_MATCH_DEVICE_TYPE         0x04
      #define CCW_DEVICE_ID_MATCH_DEVICE_MODEL        0x08
    

The upshot of all this seems to be that we can ignore the css0 and subchannel directory, then should look up required module based on the modules.ccwmap, in the same way that lookup is done in usbmap and pcimap.
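
A sketch of such a lookup, using the map layout and flag values quoted above. The function names are invented, and the contents of cutype and devtype are passed in as strings instead of being read from sysfs.

```python
MATCH_CU_TYPE, MATCH_CU_MODEL, MATCH_DEV_TYPE, MATCH_DEV_MODEL = 0x01, 0x02, 0x04, 0x08

def parse_ccwmap(text):
    """Return (module, flags, cu_type, cu_model, dev_type, dev_model) tuples."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        entries.append((fields[0],) + tuple(int(f, 16) for f in fields[1:6]))
    return entries

def parse_type_model(text):
    """Parse a cutype or devtype file formatted like '%04x/%02x\\n'."""
    type_, model = text.strip().split("/")
    return int(type_, 16), int(model, 16)

def ccw_modules(entries, cutype, devtype):
    """Return the modules whose map entry matches the given device."""
    cu_type, cu_model = parse_type_model(cutype)
    dev_type, dev_model = parse_type_model(devtype)
    found = []
    for module, flags, m_cu_type, m_cu_model, m_dev_type, m_dev_model in entries:
        # A field only has to match if its bit is set in match_flags.
        if flags & MATCH_CU_TYPE and m_cu_type != cu_type:
            continue
        if flags & MATCH_CU_MODEL and m_cu_model != cu_model:
            continue
        if flags & MATCH_DEV_TYPE and m_dev_type != dev_type:
            continue
        if flags & MATCH_DEV_MODEL and m_dev_model != dev_model:
            continue
        if module not in found:
            found.append(module)
    return found
```

With the map shown above, a device whose cutype is 1731/01 matches the first qeth entry whatever its devtype, because that entry's flags (0x0003) only ask for control unit type and model.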

There also is the concept of "ccwgroup", where a single device uses a number of S390 channels. No indications that this has implications for booting.

Supporting Input Devices

A working console and keyboard during the initial boot image execution is needed to enter a password for encrypted file systems; it also helps while debugging. This section discusses the kernel input layer and how it can be supported during image generation.

The console is a designated terminal, where kernel output goes, and that is the initial I/O device for /sbin/init. Like all terminal devices, it provides a number of functions: you can read and write to it, plus it has a number of ioctl() functions to manage line buffering, interrupt characters and baudrate or parity where applicable.

Terminals come in different types: it can be a VT100 or terminal emulator connected via an RS232 cable, or it can be a combination of a CRT and a keyboard. The keyboard can be connected via USB or it can talk a byte oriented protocol via a legacy UART chip.

The CRT is managed in two layers. The top layer, "virtual terminal", manages a two dimensional array describing which letter should go in which position of the screen. In fact, there are a number of different arrays, and which one is actually visible on the screen is selected by a keyboard combination. Below the virtual terminals is a layer that actually places the letters on the screen. This can be done a letter at a time, using a VGA interface, or the letters can be painted pixel by pixel, using a frame buffer.

Below the terminal concept we find the input layer. This provides a unified interface to the various user input devices: mouse, keyboard, PC speaker, joystick, tablet. These input devices not only generate data, they can also receive input from the computer. As an example, the keyboard needs computer input to operate the NUM LOCK indicator. Hardware devices such as keyboards register themselves with the input layer, describing their capabilities (I can send relative position, have two buttons and no LEDs), and the input layer assigns a handler to the hardware device. The handler presents the device to upper layers, either as a char special file or as the input part of a terminal device. This is not a one-to-one mapping: every mouse gets its own handler, but keyboard and PC speaker share a handler, so it looks to userland like you have a keyboard that can do "beep".

In addition to handlers for specific types of upper layers (mouse, joystick, touch screen) there is a generic handler that provides a character device file such as /dev/input/event0 for every input device detected; input events are presented through these devices in a unified format. The input layer generates hotplug events for these generic event handlers; hotplug uses modules.inputmap to load a module containing a suitable upper layer event handler. The keyboard handler is a special case that does not occur in this map, so for image generation there is little to be learned from hotplug input support.

To guarantee a working console, yaird should examine /dev/console, determine whether it's RS232 or hardware directly connected to the computer, and then load modules for either serial port, or for virtual terminals, the input layer and any hardware underlying it. Unfortunately, /dev/console does not give a hint what is below the terminal interface, and unfortunately, lots of input devices are legacy hardware that is hard to probe and only sketchily described by sysfs in kernel 2.6.10.

This means that a guarantee for a working console cannot be made, which is why distribution kernels come with components such as the keyboard and serial port driver compiled into the kernel. We can do something else though: provide modules for keyboard devices provided the kernel provides correct information. That covers the case of USB keyboards, and that's something that's not compiled into distribution kernels, so that the administrator has to add modules explicitly in order to get the keyboard working in the initial boot image.

Let's examine the sources of information we have to find which input hardware we have to support.

  • In /sys/class/input, all input devices are enumerated. Mostly, these only contain a dev file containing major/minor number, but USB devices also have a device symlink into /sys/devices identifying the underlying hardware.

  • In kernel 2.6.15, /sys/class/input is far more complete. It has links from class device to hardware devices, and hardware devices such as atkbd and psmouse have a 'modalias' file that can be fed to modprobe. This contains everything that's in /proc/bus/input/devices, in a nice accessible manner.

    As an aside, can we do all device probing based on the modalias file? This would mean we no longer would have to distinguish between sysfs format for usb and pci, making the code simpler. The tricky part is to distinguish between modules compiled in and modules simply missing from the kernel: dealing with "FATAL: Module ... not found". As a first step, we could simply assume that aliases that cannot be resolved refer to compiled in modules; this is in essence what the current scan of eg modules.usbmap does.

  • In /boot/grub/menu.lst, kernel options can be defined that determine whether to use a serial line as console and whether to use a frame buffer. The consequence is that it is fundamentally impossible to determine by looking at the hardware alone what's needed to get an image that will boot without problems. This probably means we'll have to consider supplying some modules in the image that will only get loaded depending on kernel options.

  • The file /proc/bus/input/devices gives a formatted overview of all known input devices; entries look like this:

    	    I: Bus=0003 Vendor=413c Product=2003 Version=0100
    	    N: Name="DELL DELL USB Keyboard"
    	    P: Phys=usb-0000:00:1d.7-4.1/input1
    	    H: Handlers=kbd event2
    	    B: EV=100003
    	    B: KEY=7f f0000 0 3878 d801d101 1e0000 0 0 0
    	  

    Here the "I" line shows identification information passed to the input layer by the hardware driver that is used to look up the appropriate handler. "N" is a printable name provided by the hardware driver. "P" is a hint at location in a bus of the device; note how this line is completely unrelated to the location of the hardware in /sys/devices. The H (Handlers) line is obvious; the B lines specify capabilities of the device, plus extra information for each capability. Known capabilities include:

    Capability  Description
    SYN         Input event is completed
    KEY         Key press/release event
    REL         Relative measure, as in mouse movement
    ABS         Absolute position, as in graphics tablet
    MSC         Miscellaneous
    SND         Beep
    REP         Set hardware repeat
    FF          Don't know
    PWR         Power event: on/off switch pressed.
    FF_STATUS   Don't know.
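
The "B: EV=" line is a hexadecimal bitmask over these capabilities. The sketch below decodes it, assuming the event type numbers from the kernel header linux/input.h (SYN is bit 0, KEY bit 1, and so on); only the capabilities listed above are included, and the function name is made up.

```python
# Bit positions of event types, as assumed from linux/input.h.
EV_BITS = {
    0x00: "SYN", 0x01: "KEY", 0x02: "REL", 0x03: "ABS", 0x04: "MSC",
    0x12: "SND", 0x14: "REP", 0x15: "FF", 0x16: "PWR", 0x17: "FF_STATUS",
}

def decode_ev(hex_mask):
    """Turn the value of a 'B: EV=' line into a list of capability names."""
    mask = int(hex_mask, 16)
    return [name for bit, name in sorted(EV_BITS.items()) if mask & (1 << bit)]
```

For the Dell keyboard entry above, EV=100003 has bits 0, 1 and 20 set, so decode_ev("100003") gives SYN, KEY and REP.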

Finally, let's consider some kernel configuration defines, the corresponding modules and their function. This could be used as a start to check whether all components required to make an operational console are available on the generated image:

Define               Module  Description
VT                   (bool)  Support multiple virtual terminals, irrespective of what hardware is used to display letters from the virtual terminal on the CRT.
VT_CONSOLE           (bool)  Make the VT a candidate for console output. The alternative is a serial line to a VT100 or terminal emulator.
VGA_CONSOLE          (bool)  Display a terminal on CRT using the VGA interface.
FRAMEBUFFER_CONSOLE  fbcon   Display a terminal on a framebuffer, painting letters a pixel at a time. This has to know about fonts.
FB_VESA              vesafb  Implement a framebuffer based on VESA (a common standard for PC graphic cards), a place where an X server or the framebuffer console can write pixels to be displayed on CRT. There are many different framebuffer modules that optimise for different graphics cards. Note that while vesafb and other drivers such as intelfb can be built as a module, they only function correctly when built into the kernel. Most framebuffer modules depend on three other modules to function correctly: cfbfillrect, cfbcopyarea, cfbimgblt.
ATKBD                atkbd   Interpret input from a standard AT or PS/2 keyboard. Other keyboards use other byte codes, see for example the Acorn keyboard (rpckbd).
SERIO                serio   Module that manages a stream of bytes from and to an IO port. It includes a kernel thread (kseriod) that handles the queue needed to talk to slow ports. It is normally used for dedicated IO ports talking to PS/2 mouse and keyboard, but can also be interfaced to serial ports (COM1, COM2). The atkbd driver uses a serio driver to communicate with the keyboard.
SERIO_I8042          i8042   Implement a serio stream on top of the i8042 chip, the chip that connects the standard AT keyboard and PS/2 mouse to the computer. This is legacy hardware: it's not connected via PCI but directly to the 'platform bus'. When a chip such as i8042 that implements serio is detected, it registers itself with the input layer. The input layer then lets drivers that use serio (such as atkbd and psmouse) probe whether a known device is connected via the chip; if such a device is found, it is registered as a new input device.
SERIAL_8250          serial  Support for serial ports (COM1, COM2) on PC hardware. Lots of other configuration options exist to support multiple cards and fiddle with interrupts. If compiled in rather than modular, a further option, SERIAL_8250_CONSOLE, allows using the serial port as a console.
USB_HID              usbhid  Driver for USB keyboards and mice. Another define, USB_HIDINPUT, needs to be true for these devices to actually work.
USB_KBD              usbkbd  Severely limited form of USB keyboard; uses the "boot protocol". This conflicts with the complete driver.

The following figure gives an example of how the various modules can fit together.

Figure 1.  Module relation for common console setup

In practical terms, a first step toward a more robust boot image is to support new keyboard types, such as USB keyboards. The following algorithm should do that.

  1. Interpret /proc/bus/input/devices.

  2. Look for devices that have handler kbd and that have buttons. Mice and the PC speaker don't match that criterion, keyboards do.

  3. You could interpret the name field of such devices if you're interested in supporting legacy keyboards.

  4. The devices that have handler 'kbd' also have a handler 'event\d', where input is presented in a generalised event format; look up this device in /sys/class/input/event\d/.

  5. If it's got a device symlink, load the hardware drivers for that hardware device (most likely it's usbhid plus a usb core driver).

  6. Don't bother with a mknod, the input is handled via /dev/console.

  7. Otherwise it's presumably a legacy device; you could check for the existence of /sys/devices/platform/i8042/serio\d/, or you could just assume the appropriate driver to be compiled in.

  8. Implement support for /etc/hotplug/blacklist, since some USB keyboards publish two interfaces (full HID and the limited boot protocol), the input layer makes both visible in /proc/bus/input/devices and the corresponding modules are mutually conflicting. The blacklist is used to filter out one of these modules.
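
Steps 1, 2 and 4 of this algorithm can be sketched as follows. This is an illustration, not yaird's implementation: it assumes entries in /proc/bus/input/devices are separated by blank lines, takes "has buttons" to mean that a "B: KEY=" line is present, and uses invented function names.

```python
import re

def parse_input_devices(text):
    """Split the contents of /proc/bus/input/devices into one dict per device."""
    devices = []
    for block in re.split(r"\n\s*\n", text.strip()):
        dev = {"name": None, "handlers": [], "caps": {}}
        for line in block.splitlines():
            tag, _, rest = line.strip().partition(": ")
            if tag == "N":
                dev["name"] = rest.split("=", 1)[1].strip('"')
            elif tag == "H":
                dev["handlers"] = rest.split("=", 1)[1].split()
            elif tag == "B":
                cap, _, value = rest.partition("=")
                dev["caps"][cap] = value
        devices.append(dev)
    return devices

def keyboard_event_nodes(text):
    """Return the event handler names (such as 'event2') of devices that look like keyboards."""
    nodes = []
    for dev in parse_input_devices(text):
        # Keyboards have the kbd handler and key capabilities; the PC speaker
        # has kbd but no keys, and mice have keys but no kbd handler.
        if "kbd" in dev["handlers"] and "KEY" in dev["caps"]:
            nodes.extend(h for h in dev["handlers"] if re.fullmatch(r"event\d+", h))
    return nodes
```

Each name returned can then be looked up under /sys/class/input/ to see whether it has a device symlink, as in step 5.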

Supporting Shared Libraries

When an executable is added to the image, we want any required shared libraries to be added automatically. The SharedLibraries module determines which files are required. This section discusses the features of kernel and compiler we need to be aware of in order to do this reliably.

Linux executables today are in ELF format; it is defined in Generic ELF Specification ELFVERSION, part of the Linux Standard Base. This is based on part of the System V ABI: Tool Interface Standard (TIS), Executable and Linking Format (ELF) Specification.

ELF has consequences in different parts of the system: in the link-editor, that needs to merge ELF object files into ELF executables; in the kernel (fs/binfmt_elf.c), that has to place the executable in RAM and transfer control to it, and in the runtime loader, that is invoked when starting the application to load the necessary shared libraries into RAM. The idea is as follows.

  • Executables are in ELF format, with a type of either ET_EXEC (executable) or ET_DYN (shared library; yes, you can execute those.) There are other types of ELF file (core files for example) but you can't execute them.

  • These files contain two kinds of headers: program headers and section headers. Program headers define segments of the file that the kernel should store consecutively in RAM; section headers define parts of the file that should be treated by the link editor as a single unit. Program headers normally point to a group of adjacent sections.

  • The program may be statically linked or dynamically (with shared libraries). If it's statically linked, the kernel loads relevant segments, then transfers control to main() in userland.

  • If it's dynamically linked, one of the program headers has type PT_INTERP. It points to a segment that contains the name of a (static) executable; this executable is loaded in RAM together with the segments of the dynamic executable.

  • The kernel then transfers control to the userland interpreter, passing program headers and related info in a fourth argument to main(), after envp.

  • There's one interesting twist: one of the segments loaded into RAM (linux-gate.so) does not come from the executable, but is a piece of kernel mapped into user space. It contains a subroutine that the kernel provides to do a system call; the idea is that this way, the C library does not have to know which calling convention for system calls is supported by the kernel and optimal for the current hardware. The link editor knows nothing about this, only the interpreter knows that the kernel can pass the address of this subroutine together with the program headers. [9]

  • The interpreter interprets the .dynamic section of the dynamic executable. This is a table containing various types of info; if the type is DT_NEEDED, the info is the name of a shared library that is needed to run the executable. Normally, it's the basename.

  • The interpreter searches LD_LIBRARY_PATH for the library and loads the first working version it finds, using a breadth-first search. Once everything is loaded, the interpreter hands over control to main in the executable.

  • Except that that's not how it really works: the path that glibc uses depends on whether threads are supported, and klibc can function as a PT_INTERP but will not load additional libraries.

The ldd command finds the pathnames of shared libraries used by an executable. This works only for glibc: it invokes the interpreter with the executable as argument plus an environment variable that tells it to print the pathnames rather than load them. For other C libraries, there's no guaranteed correct way to find the path of shared libraries.

Update: ldd also works for another C library, uclibc, unless you disable that support while building the library by unsetting LDSO_LDD_SUPPORT.
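
As an illustration of the ldd approach, the sketch below extracts the pathnames from ldd output. It is a Python sketch, not yaird code; the function name and the sample output are assumptions, modelled on the usual glibc ldd layout. Note how the kernel-supplied linux-gate.so entry has no pathname and is skipped.

```python
import re

# One ldd line: optional "name =>", optional absolute path, load address.
LDD_LINE = re.compile(
    r"^\s*(?:(?P<name>\S+)\s+=>\s*)?(?P<path>/\S+)?\s*\(0x[0-9a-fA-F]+\)\s*$"
)

# Made-up sample output (assumption), in the style printed by glibc's ldd.
SAMPLE = (
    "\tlinux-gate.so.1 =>  (0xffffe000)\n"
    "\tlibc.so.6 => /lib/tls/libc.so.6 (0xb7e5f000)\n"
    "\t/lib/ld-linux.so.2 (0xb7fe9000)\n"
)


def parse_ldd_output(output):
    """Return the absolute pathnames mentioned in ldd output, in order."""
    paths = []
    for line in output.splitlines():
        if line.strip().endswith("=> not found"):
            raise LookupError("unresolved library: " + line.strip())
        match = LDD_LINE.match(line)
        if match and match.group("path"):
            paths.append(match.group("path"))
    return paths


if __name__ == "__main__":
    print(parse_ldd_output(SAMPLE))
```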

Thus, to figure out what goes on the initial ram image, first try ldd. If that gives an answer, good. Otherwise, use a helper program to find PT_INTERP and DT_NEEDED. If there's only PT_INTERP, good, add it to the image. If there are DT_NEEDED libraries as well, and they have relative rather than absolute pathnames, we can't determine the full path, so don't generate an image.

There are a number of options to build a helper to extract the relevant information from the executable:

  • Build it in perl. The problem here is that unpacking 64-bit integers is an optional part of the language.

  • Build a wrapper around objdump or readelf. The drawback is that these programs are not part of a minimal Linux distribution: depending on them in yaird would increase the footprint.

  • Build a C program using libbfd. This is a library intended to simplify working with object files. Drawbacks are that it adds complexity that is not necessary in our context since it supports multiple executable formats; furthermore, at least in Debian it is treated as internal to the gcc tool chain, complicating packaging the tool.

  • Build a C program based on elf.h. This turns out to be easy to do.

Yaird uses the last approach listed.
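
To show what such a helper has to extract, here is a minimal sketch of the same idea in Python rather than C. It is not the yaird helper: the function names are hypothetical, and it assumes the whole file fits in memory and that the dynamic string table lies inside a PT_LOAD segment. The constants and field layouts are the standard ones from elf.h.

```python
import struct

PT_LOAD, PT_DYNAMIC, PT_INTERP = 1, 2, 3
DT_NULL, DT_NEEDED, DT_STRTAB = 0, 1, 5


def _cstring(data, offset):
    """Read a NUL-terminated string starting at offset."""
    return data[offset:data.index(b"\0", offset)].decode("ascii", "replace")


def elf_dependencies(data):
    """Return (PT_INTERP name or None, list of DT_NEEDED names) for ELF bytes."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is64 = data[4] == 2                 # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    e = "<" if data[5] == 1 else ">"    # EI_DATA: 1 = little endian
    if is64:
        (phoff,) = struct.unpack_from(e + "Q", data, 32)
        phentsize, phnum = struct.unpack_from(e + "HH", data, 54)
    else:
        (phoff,) = struct.unpack_from(e + "I", data, 28)
        phentsize, phnum = struct.unpack_from(e + "HH", data, 42)

    segments = []                       # (type, file offset, vaddr, file size)
    for i in range(phnum):
        base = phoff + i * phentsize
        if is64:
            p_type, _, off, vaddr, _, size = struct.unpack_from(e + "IIQQQQ", data, base)
        else:
            p_type, off, vaddr, _, size = struct.unpack_from(e + "IIIII", data, base)
        segments.append((p_type, off, vaddr, size))

    interp, needed, strtab_vaddr = None, [], None
    for p_type, off, vaddr, size in segments:
        if p_type == PT_INTERP:
            interp = _cstring(data, off)
        elif p_type == PT_DYNAMIC:
            fmt = e + ("qQ" if is64 else "iI")
            usable = size - size % struct.calcsize(fmt)
            for tag, val in struct.iter_unpack(fmt, data[off:off + usable]):
                if tag == DT_NULL:
                    break
                if tag == DT_NEEDED:
                    needed.append(val)      # offset into the string table
                elif tag == DT_STRTAB:
                    strtab_vaddr = val      # virtual address, not file offset

    names = []
    if needed:
        if strtab_vaddr is None:
            raise ValueError("DT_NEEDED without DT_STRTAB")
        for p_type, off, vaddr, size in segments:
            if p_type == PT_LOAD and vaddr <= strtab_vaddr < vaddr + size:
                strtab = off + (strtab_vaddr - vaddr)
                names = [_cstring(data, strtab + n) for n in needed]
                break
        else:
            raise ValueError("string table not inside a PT_LOAD segment")
    return interp, names


def can_build_image(needed):
    """Rule from the text: relative DT_NEEDED names cannot be resolved here."""
    return all(name.startswith("/") for name in needed)
```

A caller would read the executable with open(path, "rb").read(), add the interpreter to the image, and refuse to generate an image when can_build_image() returns False.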

Security

This section discusses security: avoiding downtime, avoiding revealing sensitive information, avoiding unwanted modifications to the data; either through accident or malice. A good introduction to secure programming can be found in Secure Programming for Linux and Unix HOWTO.

For yaird, security is not very complicated: although it runs with root privileges, the program is not setuid, and all external input comes from files or programs installed by the administrator, so our main focus is on avoiding downtime caused by ignored error codes. A full-blown risk assessment would be overkill, so we'll just use the HOWTO as a checklist to verify that the basic precautions are in place.

Group | Mitigation | Status
Bad input | Verify command line | Yes.
Bad input | Verify and clean up environment | Complete environment is reset at start of program.
Bad input | Avoid assumptions about file descriptors | Handled by perl.
Bad input | Verify file names | Perl taint check shows filenames are verified for absence of odd characters before passing to subprocesses. TODO: examine UTF-8 impact.
Bad input | Verify file content | File contents in sysfs verified. Fstab entries properly quoted. TODO: check for spaces in names of LVM volume or of modules; could end up in generated /sbin/init.
Bad input | Verify locale settings | All locale related environment variables are wiped at program startup.
Bad input | Verify character encoding | All IO is byte oriented.
Bad input | Buffer overflow | In perl?
Program structure | Separate data and control | Under this heading, the HOWTO discusses the dangers of auto-executing macros in data files. The closest thing we have to a data file are the templates that tune the image to the distribution. We use a templating language that does not allow code embedding, and the image generation module does not make it possible for template output to end up outside of the image. Conclusion: broken templates can produce a broken image, but cannot affect the running system.
Program structure | Minimize privileges | The user is supposed to bring his own root privileges to the party, not much to be done here. A related issue is the minimizing of privileges in the system that is started with the generated image. This would include starting SELinux at the earliest possible moment. At least in Fedora, that earliest possible moment is in rc.sysinit, well past the moment where the initial boot image hands over control to the newly mounted root file system. No yaird support needed.
Program structure | Safe defaults | Configuration only specifies sources of information, like /etc/hotplug, not much can go wrong here.
Program structure | Safe Initialisation | The location of the main configuration file is configured as an absolute path into the application.
Program structure | Fail safe | Planning and writing the image is separated; writing only starts after planning is successfully completed. TODO: consider backout on write failure.
Program structure | Avoid race conditions | Temporary files and directories are created with the File::Temp module, which is resistant to name guessing attacks. The completed image is installed with rename rather than link; if an existing file is overwritten, this guarantees there's no race where the old image has been deleted but the new one is not yet in place. (Note that there is no option in place yet which allows overwriting of existing files.) TODO: examine File::Temp safe_level=HIGH.
Underlying resources | Handle meta characters | Protection against terminal escape sequences in output is not yet in place.
Underlying resources | Check system call results | Yes.
Language specific | Verify perl behaviour with taint. | Yes.
Language specific | Avoid perl open magic with 3rd argument. | Yes.
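
The "Avoid race conditions" entry can be illustrated with a short sketch. It is written in Python with the tempfile module standing in for perl's File::Temp; the function name install_image is hypothetical, and unlike the current yaird behaviour described above, this sketch does overwrite an existing destination.

```python
import os
import tempfile


def install_image(data, destination):
    """Write data to a temporary file, then rename it into place.

    The temporary file lives in the destination directory, so the rename
    stays within one file system and replaces any old image atomically.
    """
    directory = os.path.dirname(os.path.abspath(destination))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".image-")
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(data)
            out.flush()
            os.fsync(out.fileno())      # make sure the content is on disk
        os.rename(tmp, destination)     # old image stays visible until here
    except BaseException:
        os.unlink(tmp)                  # never leave a half-written file behind
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        target = os.path.join(scratch, "initrd.img")
        install_image(b"example image content", target)
        print(os.listdir(scratch))
```

At every moment, a reader of the destination path sees either the complete old image or the complete new one.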

Tool Chain

This section discusses which tools are used in implementing yaird and why.

The application is built as a collection of perl modules. The use of a scripting language makes consistent error checking and building sane data structures a lot easier than shell scripting; using perl rather than python is mainly because in Debian perl has 'required' status while python is only 'standard'. The code follows some conventions:

  • Where there are multiple items of a kind, say fstab entries, the perl module implements a class for individual items. All classes share a common base class, Obj, that handles constructor argument validation and that offers a place to plug in debugging code.

  • Object attributes are used via accessor methods to catch typos in attribute names.

  • Objects have a string method that returns a string version of the object. Binary data is not guaranteed to be absent from the string version.

  • Where there are multiple items of a kind, say fstab entries, the collection is implemented as a module that is not a class. There is a function all that returns a list of all known items, and functions findByXxx to retrieve an item where the Xxx attribute has a given value. There is an init function that initializes the collection; this is called automatically upon first invocation of all or findByXxx. Collections may have convenience functions findXxxByYyy: return attribute Xxx, given a value for attribute Yyy.
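
The shape of these conventions can be sketched as follows. Yaird itself is written in perl; this is only a Python analogy, and every name in it (FstabEntry, all_entries, find_by_mount_point, the sample fstab text) is hypothetical.

```python
class Obj:
    """Common base class: validates constructor arguments."""
    fields = ()

    def __init__(self, **args):
        unknown = sorted(set(args) - set(self.fields))
        missing = sorted(set(self.fields) - set(args))
        if unknown or missing:
            raise TypeError("%s: unknown %s, missing %s"
                            % (type(self).__name__, unknown, missing))
        self._attrs = dict(args)

    def __getattr__(self, name):
        # Only called when normal lookup fails; a typo raises AttributeError.
        attrs = self.__dict__.get("_attrs", {})
        if name in attrs:
            return attrs[name]
        raise AttributeError(name)


class FstabEntry(Obj):
    """One item of the collection."""
    fields = ("device", "mount_point", "fs_type")

    def string(self):
        return "%s on %s type %s" % (self.device, self.mount_point, self.fs_type)


# Made-up fstab content (assumption); a real collection would read /etc/fstab.
SAMPLE_FSTAB = """\
# device     mount point  type
/dev/hda1    /            ext3
/dev/hda2    /home        ext3
"""

_entries = None     # the collection is module state, not a class


def init(text=SAMPLE_FSTAB):
    global _entries
    _entries = []
    for line in text.splitlines():
        words = line.split("#", 1)[0].split()
        if len(words) >= 3:
            _entries.append(FstabEntry(device=words[0], mount_point=words[1],
                                       fs_type=words[2]))


def all_entries():
    if _entries is None:    # initialise automatically on first use
        init()
    return list(_entries)


def find_by_mount_point(mount_point):
    for entry in all_entries():
        if entry.mount_point == mount_point:
            return entry
    return None
```

With this layout, find_by_mount_point("/").string() works without an explicit call to init().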

The generated initrd image needs a command interpreter; the choice of command interpreter is exclusively determined by the image generation template. At this point, both Debian and Fedora templates use the dash shell, for historical reasons only. Presumably busybox could be used to build a smaller image. However, support for initramfs requires a complicated construction involving a combination of mount, chroot and chdir; to do that reliably, nash as used in Fedora seems a more attractive option.

Documentation is in docbook format, since it's widely supported, supports numerous output formats, has better separation between content and layout than texinfo, and provides better guarantees against malformed HTML than texinfo.

Autoconf

GNU automake is used to build and install the application, where 'building' is perhaps too big a word for adding the location of the underlying modules to the wrapper script. The reasons for using automake: it provides packagers with a well known mechanism for changing installation directories, and it makes it easy for developers to produce a cruft-free and reproducible tarball based on the tree extracted from version control.

C Library

The standard C library under Linux is glibc. This is big: 1.2 MB, where an alternative implementation, klibc, is only 28 KB. The reason klibc can be so much smaller than glibc is that a lot of features of glibc, like NIS support, are not relevant for applications that need to do basic stuff like loading an IDE driver.

There are other small libc implementations: in the embedded world, dietlibc and uClibc are popular. However, klibc was specifically developed to support the initial image: it's intended to be included with the mainline kernel and allow moving a lot of startup magic out of the kernel into the initial image. See LKML: [RFC] klibc requirements, round 2 for requirements on klibc; the mailing list is the most current source of information.

Recent versions of klibc (1.0 and later) include a wrapper around gcc, named klcc, that will compile a program with klibc. This means yaird does not need to include klibc, but can easily be configured to use klibc rather than glibc. Of course this will only pay off if every executable on the initial image uses klibc.

Yaird does not have to be extended in order to support klibc, but it is necessary to avoid assumptions about which shared libraries are used. This is discussed in the section called “Supporting Shared Libraries”.

Template Processing

This section discusses the templates used to transform high-level actions to lines of script in the generated image. These templates are intended to cope with small differences between distributions: a shell that is named dash in Debian and ash in Fedora for example. By processing the output of yaird through a template, we can confine the tuning of yaird for a specific distribution to the template, without having to touch the core code.

One important function of a template library is to enforce a clear separation between program logic and output formatting: there should be no way to put perl fragments inside a template. See StringTemplate for a discussion of what is needed in a templating system, plus a Java implementation.

Let's consider a number of possible templating solutions:

  • Template Toolkit: widely used, not in perl core distribution, does not prevent mixing of code and templates.

  • Text::Template: not in perl core distribution, does not prevent mixing of code and templates.

  • Some XSLT processor. Not in core distribution, more suitable for file-to-file transformations than for expanding in-process data; overkill.

  • HTML-Template: not in perl core distribution, prevents mixing of code and templates, simple, no dependencies, dual GPL/Artistic license. Available in Debian as libhtml-template-perl, in Fedora 2 as perl-HTML-Template, dropped from Fedora 3, but available via Fedora Extras.

  • A home grown templating system: a simple system such as the HTML-Template module is over 100 KB. We can cut down on that by dropping functions we don't immediately need, but the effort to get a tested and documented implementation remains substantial.

The HTML-Template approach is the best match for our requirements, so it is used in yaird.
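
The idea of confining distribution differences to substitution-only templates can be sketched as below. This is a Python analogy that uses string.Template in place of HTML-Template; the template names, action names and paths are invented for the example and are not taken from the yaird templates.

```python
from string import Template

# Hypothetical per-distribution templates: placeholders only, no code.
TEMPLATES = {
    "debian": {
        "prologue": Template("#!/bin/dash\n"),
        "insmod": Template("/sbin/insmod $module\n"),
    },
    "fedora": {
        "prologue": Template("#!/bin/ash\n"),
        "insmod": Template("/sbin/insmod $module\n"),
    },
}


def render(distribution, actions):
    """Turn a list of (action name, parameters) into lines of script."""
    templates = TEMPLATES[distribution]
    return "".join(templates[name].substitute(params) for name, params in actions)


if __name__ == "__main__":
    plan = [
        ("prologue", {}),
        ("insmod", {"module": "/lib/modules/ide-disk.ko"}),
    ]
    print(render("debian", plan), end="")
```

The core code only produces the list of actions; switching distribution changes the template set, not the planning code.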

Configuration Parsing

Yaird has a fair number of configuration items: templates containing a list of files and trees, named shell script fragments with a value that spans multiple lines. If future versions of the application are going to be more flexible, the number of configuration items is only going to grow. Somehow this information has to be passed to the application; an overview of the options follows.

  • Configuration as part of the program. Simply hard-code all configuration choices, and structure the program so that the configuration part is a well defined part of the program. The advantage is that there is no need for any infrastructure, the disadvantage is that there is no clear boundary where problems can be reported, and that it requires the user to be familiar with the programming language.

  • AppConfig. A mature perl module that parses configuration files in a format similar to Win32 "INI" files. Widely used, stable, flexible, well-documented, with the added bonus that it unifies options given on the command line and in the configuration file. An ideal solution, except for the fact that we need a more complex configuration than can conveniently be expressed in INI-file format.

  • An XML based configuration format. XML parsers for perl are readily available. The advantage is that it's an industry standard; the disadvantage that the markup can get very verbose and that support for input validation is limited (XML::LibXML mentions a binding for RelaxNG, but the code is missing, and defining an input format in XML-Schema ... just say no).

  • YAML is a data serialisation format that is a lot more readable than XML. The disadvantage is that it's not as widely known as XML, that it's an indentation based language (so confusion over tabs versus spaces can arise) and that support for input validation is completely missing.

  • A custom made configuration language, based on Parse::RecDescent, a widely used, mature module to do recursive descent parsing in perl. Using a custom language means we can structure the language to minimise opportunities for mistakes, can provide relevant error messages, can support complex configuration structures and can easily parse the configuration file to a tree format that's suitable for further processing. The disadvantage is that a custom language is yet another syntax to learn.

Building a recursive descent parser seems the best match for this application.
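
To give an impression of what recursive descent parsing to a tree looks like, here is a small hand-written sketch in Python. The grammar is invented for this example (an item is a name followed by either a quoted string or a block of items in braces); it is not the yaird configuration language, and it does not use Parse::RecDescent.

```python
import re

TOKEN = re.compile(
    r'\s*(?:(?P<word>[A-Za-z_][\w-]*)|"(?P<string>[^"]*)"|(?P<punct>[{}]))'
)


def tokenize(text):
    """Split the text into (kind, value) tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError("unexpected input near offset %d" % pos)
        kind = match.lastgroup
        tokens.append((kind, match.group(kind)))
        pos = match.end()
    return tokens


def parse_items(tokens, i):
    """items := item*    item := WORD ( STRING | '{' items '}' )"""
    items = []
    while i < len(tokens) and tokens[i] != ("punct", "}"):
        kind, name = tokens[i]
        if kind != "word":
            raise SyntaxError("expected a name, got %r" % name)
        if i + 1 >= len(tokens):
            raise SyntaxError("%s: missing value" % name)
        kind, value = tokens[i + 1]
        if kind == "string":
            items.append((name, value))
            i += 2
        elif (kind, value) == ("punct", "{"):
            children, i = parse_items(tokens, i + 2)    # recurse into the block
            if i >= len(tokens):
                raise SyntaxError("%s: missing closing brace" % name)
            items.append((name, children))
            i += 1                                      # skip the '}'
        else:
            raise SyntaxError("%s: expected a string or a block" % name)
    return items, i


def parse(text):
    """Parse configuration text into a tree of (name, value) pairs."""
    tokens = tokenize(text)
    items, i = parse_items(tokens, 0)
    if i != len(tokens):
        raise SyntaxError("unexpected closing brace")
    return items


if __name__ == "__main__":
    print(parse('image { format "cpio" file "/bin/dash" }'))
```

Each grammar rule maps to one place in the code, which is what makes it straightforward to attach a relevant error message to each kind of mistake.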

Authors

This is a place holder section. Yaird was written by ... website here ... comments to ... bug reports ...

License

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; you may also obtain a copy of the GNU General Public License from the Free Software Foundation by visiting their Web site or by writing to


      Free Software Foundation, Inc.
      51 Franklin St, Fifth Floor
      Boston, MA 02110-1301
      USA
    

Klibc code

Yaird contains code based on klibc; this code is made available by the author under the following licence. The relevant source files have this copyright notice included.

/* ----------------------------------------------------------------------- *
 *   
 *   Copyright 2004 H. Peter Anvin - All Rights Reserved
 *
 *   Permission is hereby granted, free of charge, to any person
 *   obtaining a copy of this software and associated documentation
 *   files (the "Software"), to deal in the Software without
 *   restriction, including without limitation the rights to use,
 *   copy, modify, merge, publish, distribute, sublicense, and/or
 *   sell copies of the Software, and to permit persons to whom
 *   the Software is furnished to do so, subject to the following
 *   conditions:
 *   
 *   The above copyright notice and this permission notice shall
 *   be included in all copies or substantial portions of the Software.
 *   
 *   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 *   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
 *   OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 *   NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
 *   HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
 *   WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
 *   FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
 *   OTHER DEALINGS IN THE SOFTWARE.
 *
 * ----------------------------------------------------------------------- */
      



[1] Well, not really. I started this thingy to show off a small algorithm to find required modules based on sysfs information. To make that a credible demonstration, the small algorithm turned out to need a lot of scaffolding to turn it into a working program ...

[2] An alternative and equally interesting exercise would be an attempt to generate a universal initrd that could be distributed together with the kernel. Such an image would most likely be based on udev/hotplug.

[3] Except where the distribution depends on it; there are some issues with mdadm in Debian.

[4] Having knowledge of the relation between module names and kernel defines hardcoded into yaird is hardly elegant. Perhaps it is possible to generate this mapping based on the kernel Makefiles when building the kernel, but that's too complex just now.

[5] Having device files on the image is wrong: it will break if the new kernel uses different device numbers. Mostly this can be avoided by using the dev files provided by sysfs, but there is a bootstrap problem: the mount command needed to access sysfs assumes /dev/null and /dev/console are available.

[6] The idea that the "ip=" kernel command line option implies mounting an NFS root is debatable. Since the only use of the network for now is mounting NFS we can get away with it, and it simplifies passing a DHCP supplied boot path to the NFS mount code. If we find situations where IP is needed but NFS is not, we'll have to trigger NFS mount when "root=/dev/nfs".

[7] See GBDE for a similar mechanism under BSD.

[8] The output from "dmsetup table" would be an alternative. It's easier to parse, but introduces an additional package dependency.

[9] For more info on the kernel-supplied shared library for system calls, see LWN: How to speed up system calls, LWN: Patch: i386 vsyscall DSO implementation, LKML: common name for the kernel DSO.