Kapil Hari Paranjape
An operating system is a suite of programs. To answer the question ``What does this suite of programs do?'', we need to ask ourselves what we use computers for. Here are some possible answers:
Other than user requirements, three other critical ingredients have gone into the practical solution that we see before us today. The next input is from technology. The existence and availability of technology is often neglected, but when one looks at GNU/Linux it is clear that its development and growth are closely tied to the emergence of comparatively cheap computer hardware. In the 1980's A. Tanenbaum had designed a nice operating system called Minix which worked on the PC/XT, which had an 8086 chip. However, it was limited by the fact that the hardware did not offer any form of protection (see below) and so a badly designed program could crash the machine. With 80386-based machines becoming widely available in the early 1990's, many users of Minix wanted to ``enhance'' it to make use of the memory protection features of the 80386. This was quite complicated, though Bruce Evans did in fact do a lot of work in this direction. One of the Minix users was not quite satisfied with this piecemeal solution, however, and decided to write it all from ``scratch''--this user was Linus Torvalds. Thus the emergence of the 80386 was closely tied to the inception of the idea of Linux.
Neither Tanenbaum nor Torvalds may like it, but the former had a big influence on the operating system we see before us. This is an instance of the influence of ``theoretical'' computer science (TCS) on our system, which comes about in two ways. On the one hand, algorithms and data structures that computer scientists create do eventually make their way into real programs. On the other hand, the logical ideas and metaphors created by TCS people are important as ways in which we learn to visualise problems and look for their algorithmic solutions. One need not look too deep to find concepts like ``B-trees'' and ``semaphores'' in our modern operating system. On a personal note, I first heard about a categorical view of programming in a TCS seminar at IIT, Kanpur in 1980 or so. Clearly this became a ``hot'' TCS topic for a while. Today similar ideas have led to the much touted ``object-oriented'' programming paradigm.
The GNU aspect of the practical solution is perhaps not as evident as the ones above. We must thank Richard Stallman for pointing it out and further strengthening it by means of the GNU General Public License (GPL). An important aspect of programming is re-usability--if it were not, we would just hand everyone a computer with no programs and say ``Get to work''. We need to re-use what others have created before us. At the same time we often need to re-work, improve or adapt that work. In order that this process be carried forward, we should of course provide others access to all the material that we made use of and all that we added to it. These principles are what the GPL clearly underlines. We could not use, learn from, study and adapt a system to our needs were it not for ``Open Source''--this much is perhaps obvious. But we also need to create a system that will remain usable, learnable, studiable and adaptable--for that we need the GPL copyleft.
When we combine all these inputs, we can list some of the features that we expect from our Operating System:
Let us further examine the chosen method of offering a solution. The solution suite is divided into three parts: the kernel, the libraries and the utilities (the last are also called applications).
The user typically interacts with the utilities and (perversely!) for that reason we will discuss that aspect the least during this course. A better reason is that a sufficiently good understanding of the kernel and the libraries should give enough clues about how utilities can be (and are) written; on the other hand there are far too many interesting utilities to learn in one lifetime let alone in one course! However, we will learn about a few utilities along the way as these will be essential in order to run practical tests and examples under the GNU/Linux system.
The libraries serve two purposes. On the one hand, they consist of programs that are used frequently and so have been written in an optimised fashion by ``super''-users (in the sense of very good programmers!). This allows other programmers to develop programs quickly as they need not ``re-invent the wheel''--an important philosophical basis of the design, as we saw earlier. A more important aspect of the libraries (from the point of view of this course) is that they provide the methods by which we can interact with the kernel.
The kernel is so named because it is the central program in the operating system. One of the functions of the kernel is to perform ``hardware abstraction''. The kernel gives us a few basic object types that we can use to deal with the plethora of hardware currently available without really having to worry about the specialities of each device. In addition, the kernel provides us with ``software abstraction'' by replacing bit manipulation with the manipulation of higher level objects.
The basic objects that were originally used by the Unix system to deal with all our requirements are files and processes. With the emergence of networks came the addition of the notion of sockets, which can (but need not) be thought of as a special kind of file. In particular, documents (respectively programs) usually start their life as a source file (e.g. TeX or C) which is processed into binary files (e.g. DVI or a.out) which can be used for printing or viewing (respectively for creating new processes). Data is also stored in files, which can be ``flat'' source files or binary files (e.g. DB files) which are hash-indexed for quick searching; various processing mechanisms are provided to visualise this data--i.e. to convert it into other files. Games are processes that interact with the user through the manipulation of devices which are also represented via the file abstraction. This allows the games to be written in a device-independent fashion (e.g. music is stored as WAV or Ogg); similar ideas also allow us to provide non-textual means to visualise data (e.g. PBM or JPEG). The ``slow'' communication channels (such as e-mail) are also handled by files which store (spool) incoming and outgoing communication. The interactive usage of computers (which is nowadays the primary way in which computers are used) is handled by devices called ``terminals'', which are represented by a type of device file.
If we accept this point of view it is clear that we should now more specifically ask ourselves what we would like to do with files and processes--to be modern we will also include sockets in this question. Thus the three principal sub-questions that we can ask are: what services would we like our Operating System to provide with regard to Files, Processes and Sockets respectively?
Before going ahead we should mention some attempts that have been made to split up the job of the kernel into smaller pieces. The current structure of the Linux kernel is monolithic. One of the proposed designs is to separate the ``hardware abstraction'' layer from those parts of the kernel that support the files (file-system), processes (execution or process server) and the network layer. Such an approach is called a micro-kernel. Interestingly, Minix was such a solution; perhaps it was before its time or perhaps it was too much of a ``toy''. A further extension of this idea takes the point of view that, with the expanding hardware arena, even the drivers for devices need to be moved out of the kernel--one may wonder what the kernel does in that case! Such a kernel is called an exo-kernel. Some of the features of this approach can already be found in Linux with the introduction of ``modular'' device drivers, file-systems and so on. However, the separation in an exo-kernel takes place at the level of the processor task space. In particular, a fault in a particular device driver or file-system should not make it possible to crash the kernel; this is not as yet a feature of the Linux kernel, thus it is not quite an exo-kernel.
There are other possible choices for the namespace. For example, the convention on the internet is to use a forest as a namespace--these names are called URLs. The root of each tree in the forest has a name that describes the way in which one accesses this tree. For example, if we use the http protocol to access port 8080 on the host the.end.of.the.internet then the URL is http://the.end.of.the.internet:8080/. A similar convention is found in DOS, where the root of a tree gets a single-letter name followed by a `:'.
Now the namespace must itself be structured in terms of files since ``everything in Unix is a file''. One way to do this (and this is the traditional solution in Unix) is by means of ``special'' files called directories. Note that a name that ends in `/' doesn't really belong to our namespace since this denotes a node and not a leaf. At the same time, we can ask what the relation is between /usr and /usr/lib. We thus extend our namespace to allow names to end in / except that such a name should be the name of a directory, which is a specially structured file that contains (at least) a list of all leaves that lie directly below the node. This allows us to dynamically manage the tree. Every time we look for a file with the name (say) /usr/local/lib/stow/apache, we recurse as follows. We examine the directory /usr/local/lib/stow/ for the existence of the leaf apache in it. Note that /usr/local/lib/stow/ is a leaf in the directory /usr/local/lib/ so that is where we can find it, and so on. It is further convenient (but not essential!) to mandate that, in order for /usr/lib to make sense, the name /usr should point to the directory /usr/. (Let me point out that I have used ``point'' in the above sentence knowingly.)
What we have described so far is the space of names of files but not files themselves. We do not dictate what a file may or may not contain except for the special files called directories. In this sense all files are binary or data files. It is thus useful to assign attributes to files to indicate what we may or may not do with them--in what way we can ``access'' them. In traditional Unix these attributes are restricted to the read, write and execute attributes, which are further distributed among the users under the headings user, group and world. Nowadays, people like to have extended file attributes, including the possibility of providing detailed ``access control lists''.
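As a small illustration, the following sketch in C (assuming a GNU/Linux system; /etc/passwd is merely a convenient example file) reads these traditional attributes with the stat system call and prints them in the familiar user/group/world form.

    /* Print the traditional user/group/world permission bits of a file.
     * A minimal sketch; the default file name is only an example. */
    #include <stdio.h>
    #include <sys/stat.h>

    int main(int argc, char *argv[])
    {
        const char *name = (argc > 1) ? argv[1] : "/etc/passwd";
        struct stat st;

        if (stat(name, &st) == -1) {
            perror("stat");
            return 1;
        }
        printf("%s: %c%c%c %c%c%c %c%c%c\n", name,
               (st.st_mode & S_IRUSR) ? 'r' : '-',
               (st.st_mode & S_IWUSR) ? 'w' : '-',
               (st.st_mode & S_IXUSR) ? 'x' : '-',
               (st.st_mode & S_IRGRP) ? 'r' : '-',
               (st.st_mode & S_IWGRP) ? 'w' : '-',
               (st.st_mode & S_IXGRP) ? 'x' : '-',
               (st.st_mode & S_IROTH) ? 'r' : '-',
               (st.st_mode & S_IWOTH) ? 'w' : '-',
               (st.st_mode & S_IXOTH) ? 'x' : '-');
        return 0;
    }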
A directory is essentially an association list (or a list of pairs) which associates a pointer to actual data with each leaf that lies below the directory node. Thus, it is clear that different names could point to the same data; two such names are said to be hard linked. Since the relationship between these names is symmetric it does not always satisfy our requirements. For example, one user may create a file and want to allow only another user to read it; at the same time, the second user may have a different way of organising files--in other words, the same file could have a different name for each user.
One solution for this is to allow another kind of special file which only contains the name of another file. When an access is made to such a special file the operating system would then ``automatically'' use the name inside it to access the ``actual'' data. Such a special file is called a symbolic link.
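The following sketch in C illustrates the two kinds of links (the file names used are purely hypothetical, and data.txt is assumed to already exist in the current directory):

    /* Create a hard link and a symbolic link to an existing file. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* "data.txt" is assumed to exist in the current directory. */
        if (link("data.txt", "data-hard.txt") == -1)
            perror("link");      /* same data, two names */
        if (symlink("data.txt", "data-soft.txt") == -1)
            perror("symlink");   /* a new file containing the name "data.txt" */
        return 0;
    }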
Some other kinds of special files were also present in the traditional Unix file system. We saw earlier that ``everything'' in Unix is a file. So, in particular, access to devices should also be through a file interface. For example, as I am typing this the operating system is reading my keystrokes from a file called /dev/pts/3. Such files are called device special files and consist of names that point to devices. Devices are classified according to a ``major'' and a ``minor'' number, and this is the information stored in the file-system entry for the device. In our modern system, Linux, the creation and deletion of device special files can be managed by a separate file system called devfs, so we don't really have to bother with them too much at this point.
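As a quick illustration, the following C sketch (assuming a GNU/Linux system with the GNU C library, which provides the major() and minor() macros in sys/sysmacros.h) reads the device numbers of /dev/null:

    /* Read the major and minor device numbers of a device special file. */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/sysmacros.h>   /* major() and minor() on GNU/Linux */

    int main(void)
    {
        struct stat st;

        if (stat("/dev/null", &st) == -1) {
            perror("stat");
            return 1;
        }
        printf("/dev/null: %s device, major %u, minor %u\n",
               S_ISCHR(st.st_mode) ? "character" : "block",
               major(st.st_rdev), minor(st.st_rdev));
        return 0;
    }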
Finally, the tree of files needs to be dynamic--growing and receding according to need. We might want to add to the tree all the files residing on a floppy or somewhere on the network. This is achieved by ``mounting'' the new file system at some directory node. Some have even suggested that the use of nodes as directories is too limited. In their view, it should be possible to attach a process to any node; this process will provide the files required by the part of the name space that lies below the node. Thus ``special'' files are a particular case of a more general mechanism called ``translators'' in the GNU/Hurd operating system that we might discuss towards the end of this course.
Now, since only processes can create new processes there must be a first process (the parent of all processes). This process, ``init'', is started by the operating system and we shall see more details about that when we examine the way the computer boots up. Since each process may have child processes we see that the structure of processes is again tree-like. Perhaps for this reason Linux provides a file-system like view of the space of all processes under the /proc file-system (procfs). (However, in this view the tree is flattened and all processes are referred to by their process identification number--PID.)
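For instance, the following C sketch (assuming the proc file-system is mounted at /proc, as it is on a standard GNU/Linux system) recovers this flattened view by listing the numerically named entries:

    /* List process identification numbers by scanning /proc. */
    #include <stdio.h>
    #include <ctype.h>
    #include <dirent.h>

    int main(void)
    {
        DIR *d = opendir("/proc");
        struct dirent *e;

        if (d == NULL) {
            perror("opendir");
            return 1;
        }
        while ((e = readdir(d)) != NULL) {
            /* Entries whose names are purely numeric correspond to PIDs. */
            if (isdigit((unsigned char) e->d_name[0]))
                printf("PID %s\n", e->d_name);
        }
        closedir(d);
        return 0;
    }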
But how does a process start anyway?
The basic operation to create processes that is provided by the Unix system is the fork. A process that calls this causes a replica of itself to be created as a child process. The value returned by fork to the calling process is the PID of the child process. The child process meanwhile starts running at the same point but is given 0 as the value returned by fork. In this way the child and the parent can recognise their own identities.
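A minimal illustration of this in C:

    /* The parent and child distinguish themselves by the value fork returns. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();

        if (pid == -1) {
            perror("fork");
            return 1;
        }
        if (pid == 0) {
            /* The child receives 0 from fork. */
            printf("child: my PID is %d\n", (int) getpid());
        } else {
            /* The parent receives the PID of the child. */
            printf("parent: my child has PID %d\n", (int) pid);
            wait(NULL);   /* collect the child's exit status */
        }
        return 0;
    }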
A process often needs more memory than is currently available to it or no longer needs memory that was previously allotted to it. The malloc and free calls are provided to the process to manage its memory needs. Note that the memory so allocated may not be contiguous with the memory previously allotted, and so the kernel maintains certain memory maps for the running process that indicate which chunks of actual memory are available to it. Processes that allocate memory for themselves must be careful to keep track of it or they will be the cause of memory leaks.
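A small sketch in C (note that malloc and free are library routines which in turn make system calls to obtain memory from the kernel):

    /* Request memory, use it, and return it; forgetting to call free()
     * would be a memory leak in a long-running process. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        char *buf = malloc(1024);   /* ask for 1024 bytes */

        if (buf == NULL) {          /* the request may fail */
            perror("malloc");
            return 1;
        }
        strcpy(buf, "hello");
        printf("%s\n", buf);
        free(buf);                  /* return the memory */
        return 0;
    }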
More often than not what we really want to do is to run a different program in a new process. This operation is called exec and is essentially carried out by allocating memory for the new process image, writing the program into that memory and then jumping to the newly created process image after deallocating the memory assigned to one's own process image. We thus see that exec does not create a new PID.
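The following C sketch combines fork and exec in the usual way (the command ls is just an example):

    /* Run a different program: fork a child and replace its image with exec. */
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();

        if (pid == -1) {
            perror("fork");
            return 1;
        }
        if (pid == 0) {
            /* The child replaces its own image; its PID stays the same. */
            execlp("ls", "ls", "-l", (char *) NULL);
            perror("execlp");   /* reached only if exec fails */
            _exit(127);
        }
        waitpid(pid, NULL, 0);  /* the parent waits for the child */
        return 0;
    }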
Every process has a number of attributes. In particular, the process has access to the command line which was invoked in order to execute it and a certain number of ``environment'' variables that were set by the process that called it. In addition, there are the maps to memory locations that we saw above and the file descriptors that we shall see below. Further, in order to handle security matters a process also has two pairs of user/group identification numbers. One of them is the ``real'' UID/GID pair while the other is the ``effective'' UID/GID pair. We will see the utility of this when we examine how security and access control are implemented.
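The following C sketch prints some of these attributes: the command line, the first few environment variables, and the real and effective IDs.

    /* Examine some of a process's attributes. */
    #include <stdio.h>
    #include <unistd.h>

    extern char **environ;   /* the environment passed by the caller */

    int main(int argc, char *argv[])
    {
        int i;

        for (i = 0; i < argc; i++)
            printf("argv[%d] = %s\n", i, argv[i]);
        for (i = 0; i < 3 && environ[i] != NULL; i++)
            printf("environ[%d] = %s\n", i, environ[i]);
        printf("real UID %d, effective UID %d\n",
               (int) getuid(), (int) geteuid());
        printf("real GID %d, effective GID %d\n",
               (int) getgid(), (int) getegid());
        return 0;
    }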
Another important set of attributes is a collection of flags called signals--actually the signal handling instructions. There are a number of different signals that can be sent to a running process by another running process (provided it satisfies certain access control restrictions--see below). Some of these signals are sent by the operating system as well--these are ``reserved''. For all the other signals, the running process can decide how it wants to deal with them: it can ignore certain signals, or it can set up certain sub-routines called signal handlers to deal with them. Note that these handlers are entered from a context more unpredictable than that available to the final end-point of a ``goto'' and so these sub-routines need care in writing! Signals that are not ignored or handled will cause a running process to terminate.
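As an illustration, the following C sketch installs a handler for SIGINT (the signal usually sent by Ctrl-C); note how little work is done inside the handler itself, precisely because of the unpredictable context mentioned above.

    /* Catch SIGINT instead of letting it terminate the process. */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static volatile sig_atomic_t got_signal = 0;

    static void handler(int signo)
    {
        /* Only do very simple things here: the context is unpredictable. */
        (void) signo;
        got_signal = 1;
    }

    int main(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof sa);
        sa.sa_handler = handler;
        sigemptyset(&sa.sa_mask);
        if (sigaction(SIGINT, &sa, NULL) == -1) {
            perror("sigaction");
            return 1;
        }
        printf("waiting for SIGINT (press Ctrl-C)...\n");
        while (!got_signal)
            pause();   /* sleep until a signal arrives */
        printf("caught SIGINT, exiting cleanly\n");
        return 0;
    }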
Signals give a rather minimal way for running processes to communicate with each other since each signal is essentially just a single up/down flag. To provide more detailed communication, the operating system typically allows each process to create a queue where others can deposit messages for it. In order for processes to exchange large chunks of data one also allows the primitive operation of allocating shared memory. Such memory can be shared by two or more processes. In order to synchronise their access to such memory, processes can also create new sets of flags called semaphores in order to signal each other. The exact semantics of these flags are left to the processes to decide upon.
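The following C sketch shows one possible way of doing this on GNU/Linux (linking with -pthread may be required for the semaphore routines): a parent and a child share a small region of memory, and a semaphore placed in that region synchronises their access to it.

    /* Parent and child share a page of memory and use a POSIX semaphore
     * (kept inside that shared memory) to synchronise access. */
    #include <semaphore.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    struct shared {
        sem_t ready;        /* posted by the child when the data is in place */
        char  message[64];
    };

    int main(void)
    {
        /* Anonymous shared mapping: visible to both parent and child. */
        struct shared *sh = mmap(NULL, sizeof *sh, PROT_READ | PROT_WRITE,
                                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        pid_t pid;

        if (sh == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        sem_init(&sh->ready, 1, 0);   /* 1 => shared between processes */

        pid = fork();
        if (pid == -1) {
            perror("fork");
            return 1;
        }
        if (pid == 0) {
            strcpy(sh->message, "hello from the child");
            sem_post(&sh->ready);     /* tell the parent the data is ready */
            _exit(0);
        }
        sem_wait(&sh->ready);         /* block until the child posts */
        printf("parent read: %s\n", sh->message);
        wait(NULL);
        return 0;
    }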
When a process starts it is provided with three ready-made file descriptors (standard input, standard output and standard error). We have already seen examples of system calls when we saw that we need operations for processes to deal with files. The open system call allows a process to open a new file descriptor and the close system call allows a process to close one of its file descriptors. The read and write system calls are the natural calls that one can make on an open file descriptor. In addition, there are ``control'' functions that one can use for device files. One can reposition the place at which operations are taking place (in order to append or insert) by means of the lseek system call. All these operate at the level of file descriptors.
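A minimal illustration of these calls in C (the file name example.txt is just an example):

    /* Create a file, write to it, reposition with lseek and read it back. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[32];
        ssize_t n;
        int fd = open("example.txt", O_RDWR | O_CREAT | O_TRUNC, 0644);

        if (fd == -1) {
            perror("open");
            return 1;
        }
        write(fd, "hello, world\n", 13);
        lseek(fd, 0, SEEK_SET);            /* move back to the beginning */
        n = read(fd, buf, sizeof buf - 1);
        if (n > 0) {
            buf[n] = '\0';
            printf("read back: %s", buf);
        }
        close(fd);
        return 0;
    }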
Note that file descriptors are inherited by a child process that is forked. Also note that a file descriptor comes along with a ``position'', which is the location in the file at which read/write/seek operations will take place. However, file control operations such as open/close act independently in each process. Thus, a parent can transfer control of a file to the child by opening it before the child is forked and closing its own copy afterwards; the child inherits the open file descriptor. This has important security implications as we shall see.
When one asks the operating system to execute such a program, the operating system must load the program and set it up as a process. This involves memory allocation, relocation and dynamic linking. Now, one way would be for the kernel to allocate a contiguous chunk of memory for each program as its ``working space''. The problem then arises that if one runs a large number of small programs and some of them terminate but others keep running then memory becomes ``fragmented'' preventing larger programs from getting loaded. If a large program is made up of smaller fragments that can be relocated then this can be solved. However, the operating system has then to do a lot of work in order to relocate the fragments; moreover, all this relocation information must be stored in the program header, making that quite complex. A much better solution called memory paging is available. For this to work, the underlying hardware must support ``page faults''.
In this solution, each program is allocated virtual memory in a contiguous manner in integer multiples of a unit called a memory page. The kernel maintains a table for each process that maps a virtual memory page to a ``backing store'' of real address space--but doesn't map all of them to the processor memory if not enough is available. Whenever the program tries to access memory that is not backed by processor memory, a ``page fault'' is triggered by the memory-management unit of the underlying hardware. This interrupts the kernel which then suspends the process until it can make a memory map that provides backing store for this memory location. When the process has not requested allocation of that portion of memory at all, a segmentation violation signal is sent by the operating system to the process.
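The mmap system call exposes this machinery directly to a process: the following C sketch (using /etc/hostname merely as an example of a small readable file) maps a file into the virtual address space, and the kernel pages its contents in only when they are actually touched.

    /* Map a file into the virtual address space: the kernel records the file
     * as backing store and pages data in only when it is accessed. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        struct stat st;
        char *p;
        int fd = open("/etc/hostname", O_RDONLY);

        if (fd == -1 || fstat(fd, &st) == -1 || st.st_size == 0) {
            perror("open/fstat");
            return 1;
        }
        p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        /* Touching the mapping triggers a page fault which the kernel
         * services by reading the file into memory. */
        write(STDOUT_FILENO, p, st.st_size);
        munmap(p, st.st_size);
        close(fd);
        return 0;
    }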
This paging scheme also allows a useful extension called ``shared libraries''. Many running processes may actually be using the same routines from the standard libraries to print, read from files and so on. The code for these can thus be kept in the same backing store by the operating system for all the programs. The data segments that are accessed by these code fragments will (in general) be different as these will belong to different processes. This leads to the possibility of dynamic linking of programs--a certain number of symbols are resolved into locations within the shared library. The operating system links these code fragments with the program when it creates the process and its virtual address space.
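A process can even perform such linking explicitly at run time; the following C sketch (assuming the GNU/Linux shared library name libm.so.6, and possibly requiring -ldl when linking on older systems) resolves the symbol cos from the mathematics library:

    /* Load a shared library at run time and resolve a symbol from it. */
    #include <dlfcn.h>
    #include <stdio.h>

    int main(void)
    {
        void *handle = dlopen("libm.so.6", RTLD_LAZY);
        double (*cosine)(double);

        if (handle == NULL) {
            fprintf(stderr, "dlopen: %s\n", dlerror());
            return 1;
        }
        /* Resolve the symbol "cos" to a location inside the shared library. */
        cosine = (double (*)(double)) dlsym(handle, "cos");
        if (cosine == NULL) {
            fprintf(stderr, "dlsym: %s\n", dlerror());
            return 1;
        }
        printf("cos(0.0) = %f\n", cosine(0.0));
        dlclose(handle);
        return 0;
    }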
Even this can be limiting at times. For example, when a user logs in to the system, the system allocates resources such as a new terminal access point. It is clear that users should not in general be allowed to create new terminal access points for other users; this function should be reserved for the system alone. Thus it should be possible for a process to start as one user and then be converted to another user. One mechanism for this is to have the notion of ``real'' and ``effective'' UID/GID pairs for each process. Any process could be temporarily run with an ``effective'' ID different from its real one. Again it is clear that this facility should only be granted on a limited basis. The decision about which programs are allowed to change their UID/GIDs when they are loaded is made on the basis of additional permission bits that are stored for them by the file-system.
Another useful mechanism is to impose per-process restrictions--like the number of open files, the amount of memory that can be allocated and so on. The process in turn can be allowed to ``lock'' access to certain files for a limited period of time--this means that during that period only this process can access that file. Similarly, a process could be permitted to lock itself into memory for a certain critical section so that it would not be swapped out, which would delay its execution. Moreover, it may be a good idea to do this while the process is manipulating passwords so that the passwords are never written to the disk (in the swap area).
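The following C sketch illustrates both ideas: it inspects and lowers a per-process limit, and then takes an advisory lock on a (hypothetical) file for a critical section.

    /* Inspect and lower a per-process limit, and lock a file briefly. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/resource.h>
    #include <unistd.h>

    int main(void)
    {
        struct rlimit rl;
        struct flock  fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        int fd;

        /* Per-process restriction: the number of open file descriptors. */
        getrlimit(RLIMIT_NOFILE, &rl);
        printf("open-file limit: soft %ld, hard %ld\n",
               (long) rl.rlim_cur, (long) rl.rlim_max);
        rl.rlim_cur = 64;              /* lower our own soft limit */
        setrlimit(RLIMIT_NOFILE, &rl);

        /* Lock a file (the whole file, since l_start = l_len = 0). */
        fd = open("spool.dat", O_RDWR | O_CREAT, 0644);
        if (fd != -1 && fcntl(fd, F_SETLKW, &fl) == 0) {
            /* ... critical section: we hold the lock on this file ... */
            fl.l_type = F_UNLCK;
            fcntl(fd, F_SETLK, &fl);   /* release the lock */
            close(fd);
        }
        return 0;
    }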
As Linux evolves it is becoming clear that even the mechanisms described above have a number of limitations leading to incorrect use and security compromises. Thus the notion of ``capabilities'' as a way of limiting process resource access has come into play, along with access control lists as a way for files to dictate ``who'', ``how'' and ``when'' they can be accessed. The important change is (in the words of a kernel hacker) to make access to resources a question to which the answer is dynamically decided rather than a fact which is statically stored on the system.
One wasteful method is called polling. The process waiting for a message (or an acknowledgement, which is also a message, albeit a shorter one) keeps coming back to a specific location to check if some data has arrived. Another method is to block or sleep waiting for data. In other words, a process tells the operating system to ``wake it up'' when data is available. Finally, there is the method of signals. The process continues to run after telling the system to send it a signal whenever some data is ready. Once the signal has been sent it is up to the process to handle the data in a fashion that interferes constructively with its on-going task. Clearly, the latter method is the best if one can manage it.
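The blocking approach is illustrated by the following C sketch, which uses the select system call to sleep until standard input becomes readable or a timeout expires:

    /* Block waiting for input rather than polling: select() puts the process
     * to sleep until the descriptor is readable or the timeout expires. */
    #include <stdio.h>
    #include <sys/select.h>
    #include <unistd.h>

    int main(void)
    {
        fd_set readable;
        struct timeval timeout = { 5, 0 };   /* give up after 5 seconds */
        int ret;

        FD_ZERO(&readable);
        FD_SET(STDIN_FILENO, &readable);     /* watch standard input */

        ret = select(STDIN_FILENO + 1, &readable, NULL, NULL, &timeout);
        if (ret == -1)
            perror("select");
        else if (ret == 0)
            printf("timed out: no data arrived\n");
        else
            printf("data is ready to be read on standard input\n");
        return 0;
    }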
A pipe is a one-way communication channel on a Unix system that one process writes to and another one reads from. A process that tries to read from an empty pipe is blocked by the operating system until there is something to read; similarly, writes to a full pipe block until some process reads from it (and a write to a pipe with no reader at all elicits a signal, SIGPIPE).
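A minimal illustration in C, in which the child writes into the pipe and the parent reads from it:

    /* A pipe between parent and child: the child writes, the parent reads. */
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int pfd[2];          /* pfd[0] is the read end, pfd[1] the write end */
        char buf[64];
        ssize_t n;

        if (pipe(pfd) == -1) {
            perror("pipe");
            return 1;
        }
        if (fork() == 0) {
            close(pfd[0]);   /* the child only writes */
            write(pfd[1], "one-way message\n", 16);
            close(pfd[1]);
            _exit(0);
        }
        close(pfd[1]);       /* the parent only reads */
        while ((n = read(pfd[0], buf, sizeof buf)) > 0)
            write(STDOUT_FILENO, buf, n);   /* blocks until data or end-of-file */
        close(pfd[0]);
        wait(NULL);
        return 0;
    }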
Another complication of communication over ``long distances'', and in particular between distinct machines, is that having bi-directional communication becomes essential--even if all that one side is doing is sending acknowledgements. This is because there is a non-trivial possibility of some part of the communication being lost due to a temporary failure in the physical layer of the communication.
The common metaphor of a telephone is the most natural one to use in this context. Another way to think about sockets is as the ends of bidirectional pipes. Each end of the connection is called a socket. There are various namespaces for these sockets depending on the type of communication being used. One of the most common is the INET or Internet Protocol (IP) namespace. Each socket is then associated with an address of the form a.b.c.d:e where a, b, c and d are integers between 0 and 255 (some addresses are reserved) and e is an integer between 0 and 65535 (again, some ports are usually allocated for certain types of communications). An IP communication channel thus takes the form of a pair of such addresses, a.b.c.d:e and w.x.y.z:t.
How does one use sockets to communicate? First of all we create a socket, then we ``bind'' it to a certain name in a namespace. We can then use this socket to ``listen'' for incoming connections. On the other end, the other machine creates a socket and again gets a name for it. Next, it tries to ``connect'' to a listening socket. Once it succeeds the channel is ``established''. Note that the creation of a channel is asymmetric since one side must initiate a connection to the other side, which must be waiting for such initiations. However, once a channel is opened it is symmetric since such a channel is bidirectional. This method of using the IP namespace is called the TCP/IP method.
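The listening side of this sequence might look as follows in C (a sketch only; port 8080 is just an example, and the client side would create its own socket and then connect to this address):

    /* A minimal TCP/IP listener: create a socket, bind it to a port,
     * listen and accept one connection. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int listener = socket(AF_INET, SOCK_STREAM, 0);   /* TCP */
        struct sockaddr_in addr;
        int conn;

        if (listener == -1) {
            perror("socket");
            return 1;
        }
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);  /* any local address */
        addr.sin_port = htons(8080);               /* the a.b.c.d:e form */

        if (bind(listener, (struct sockaddr *) &addr, sizeof addr) == -1 ||
            listen(listener, 5) == -1) {
            perror("bind/listen");
            return 1;
        }
        conn = accept(listener, NULL, NULL);       /* wait for a connect() */
        if (conn != -1) {
            write(conn, "hello\n", 6);             /* the channel is bidirectional */
            close(conn);
        }
        close(listener);
        return 0;
    }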
Another method is a ``packet''-oriented approach. After creating a socket, one just sends packets out onto the wire addressed to a specified receiving address. If someone else has created a socket to receive these packets then the packets are received--otherwise they are lost ``in the ether''. This is a rough description of the UDP protocol.
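A minimal UDP sender in C (the destination 127.0.0.1:9999 is just an example; if nothing is listening there, the packet simply vanishes):

    /* Send a single UDP packet towards an address. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);   /* UDP */
        struct sockaddr_in dest;

        if (sock == -1) {
            perror("socket");
            return 1;
        }
        memset(&dest, 0, sizeof dest);
        dest.sin_family = AF_INET;
        dest.sin_port = htons(9999);
        inet_pton(AF_INET, "127.0.0.1", &dest.sin_addr);

        if (sendto(sock, "ping\n", 5, 0,
                   (struct sockaddr *) &dest, sizeof dest) == -1)
            perror("sendto");
        close(sock);
        return 0;
    }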
Our modern operating system must support both kinds of protocols. There are also numerous variants and a number of different namespaces associated with those. We should try to support as many as possible. Note that the actual sequence and kind of data that can be sent in the data packets or over the bidirectional channel is a matter to be decided between the communicating processes. In order that certain ``Well-Known Services'' are properly recognised and looked for, the Internet Engineering Task Force (IETF) has created a number of ``standard'' port and address assignments so that machines that do not share the same application programs or operating system or even hardware can safely and usefully exchange data. Examples of such protocols are the HyperText Transfer Protocol (http), the File Transfer Protocol (ftp), the Simple Mail Transfer Protocol (smtp) and so on. Since these protocols lie in the realm of application programs we will not discuss them in this course on operating systems. Instead we will focus on the provision of the basic framework of sockets, their creation and use.