The iron penguin, Part 1
Linux takes to big iron within virtual machines
By Neale Ferguson
So what is this thing called the S/390? What are VM/ESA and LPAR? Where did such a port come from? For those unfamiliar with the S/390 system but interested in hearing about this Linux port, this three-part series explains the platform and how Linux came to run on it.
Introduction to S/390
The S/390
(System/390) architecture evolved from the S/360 (System/360) of the 1960s. IBM,
and Thomas Watson, Jr., in particular, risked the family jewels in undertaking
the development of the S/360. It was the largest private venture in American
history, with $5 billion spent on five new plants and 60,000 additional
employees. S/360 was first to employ instruction microprogramming to facilitate
derivative designs and create the concept of a family architecture. The family
originally consisted of six computers that could each use the same software and
peripherals. The system also popularized remote computing, with terminals
communicating to the host via phone lines (see Resources,
Data General, 1997).
Since that time there have been radical changes and enhancements, but a programmer from that era would recognize many of the facilities of S/390. S/360, S/370, and S/390 (or ESA/390 as it is now known) are upwardly compatible. The S/360 was originally designed to allow programs written for earlier IBM hardware to migrate to the new platform. This required that S/360 use the IBM EBCDIC character set rather than the standard ASCII system. Above all else, this feature is what has set the S/360 and its successors apart from the rest of the computing world.
Currently, the three leading vendors that offer mainframes are IBM, Hitachi Data Systems (HDS), and Amdahl, with IBM leading sales by a good margin. The S/390 uses a custom 31-bit processor, compared to the more common 32-bit systems; the 31 bits apply only to memory addressing, not to the general processor architecture. A 64-bit version of the hardware is rumored to be in the works for release in the near future.
Basic architecture
S/390 uses 31
bits to address 2 GB of physical memory. Like many other processor platforms
(e.g., i386, PowerPC), the S/390 uses a two-tier paging scheme
(segments and pages) as opposed to the three-tier mechanism
defined in Linux. The good news is that the three-tier mechanism has already
been built for these other environments, helping ease some of the porting tasks.
In addition, ESA/390 allows for multiple address spaces of 2 GB each and
multiple translation lookaside buffers (TLBs) for mapping each separate address
space to the physical memory. Theoretically, up to 16 terabytes of address
spaces can be controlled by the hardware. We exploited this feature in the Linux
for S/390 port, simplifying complex memory processes like
copy_to_user()
to a couple of instructions.
SMP support
The ESA/390 architecture
is implemented on processors that range from a card that slips into your laptop
to a 16-way SMP configuration not much larger than a refrigerator that sits in a
corner of the machine room. IBM's largest model is a 12-way SMP system. HDS
currently ships a 13-way and has a 16-way system on the way. Amdahl already
offers a 16-way model.
Processor partitioning
Processor
partitioning goes by various names according to manufacturer: Amdahl calls it
Multiple Domain Facility (MDF); Hitachi calls it Multiple Logical Partition
Feature (MLPF); and IBM calls it Logical Partitioning (LPAR). Whatever the name,
the intent is the same: It divides a single machine into multiple virtual
systems or images, each of which appears to the operating system
running in it as a complete and isolated processor. Partitioning allows you to
share all processing resources selectively. The number of partitions you can
create depends on the manufacturer and the machine type, but typically the
maximum is in the range of 10 to 15 images.
Partitioning can also be achieved using the hypervisor VM/ESA, which I'll discuss in greater detail in the next part of this series. VM/ESA divides a processor into virtual machines, and there the limit is measured in the hundreds to tens of thousands.
I/O subsystem
One of the
distinguishing features of S/390 is its channel subsystem. S/390 defines a
unified means of accessing its I/O subsystem. It does this by defining a channel
subsystem that is, in effect, a collection of sophisticated, independent outboard
processors that take complete responsibility for I/O operations off the CPU.
A System/390 operating system has only to issue a
single instruction to get an I/O operation initiated. The channel subsystem and
the I/O devices will perform all the support actions, such as memory access,
path selection, and connection, and handle conditions such as RPS miss, caching,
and error recovery.
Computers are often rated for speed in terms of MIPS, sometimes (correctly) referred to as "meaningless indicators of processor speed." This is especially true of S/390. Any true estimate of MIPS must include the work performed by the channel subsystem. Each component of the subsystem may have considerable processing power that is equivalent to a standalone server. Bear this in mind when you see comparisons of CPU performance.
The I/O subsystem, as it affects the implementation of Linux on S/390, will be explained in more detail in part two of this series.
Early operating systems
In the early
days, computing was batch oriented, and the operating systems first used on the
S/390 architecture reflected this. They had names like Basic Operating System,
Tape Operating System, Disk Operating System, and (my favorite acronym) PCP
(Primary Control Program).
These evolved into the predecessors of the OS/390 and VSE/ESA that are available today. As they evolved, significant and robust timesharing and realtime transaction processing capabilities were added.
A brief history of IBM, S/360, and Unix
In her treatise "VM, Past, Present, and Future," Melinda
Varian (see Resources)
of Princeton University describes some interesting machinations involving the
development of System/360, MIT, timesharing, and Unix. This passage is
reproduced here with permission.
At the time IBM was embarking on its "make-or-break" development of System/360 (the grandfather of S/390), MIT was committed to timesharing and was providing timesharing services to several other New England universities as well as to its own users. At MIT, it was "no longer a question of the feasibility of a timesharing system, but rather a question of how useful a system [could] be produced". The IBMers in the MIT Liaison Office and the Cambridge Branch Office, being well aware of what was happening at MIT, had become strong proponents of timesharing and were making sure that the System/360 designers knew about the work that was being done at MIT. They arranged for several of the leading System/360 architects to visit MIT and talk with the faculty. However, inside IBM at that time there was a strong belief that timesharing would never amount to anything and that what the world needed was faster batch processing. MIT and other leading-edge customers were dismayed, and even angered, on April 7, 1964, when IBM announced System/360 without address relocation capability.
The previous fall, MIT had founded Project MAC to design and build an even more useful timesharing system based on the CTSS prototype. Within Project MAC, MIT were to draw on the lessons they had learned from CTSS to build the Multics system. The basic goal of the Multics project "was to develop a working prototype for a computer utility embracing the whole complex of hardware, software, and users that would provide a desirable, as well as feasible, model for other system designers to study." At the outset, Project MAC purchased a second modified 7094 on which to run CTSS while developing Multics. It then requested bids for the processor on which Multics would run.
One of the first jobs for the staff of the new center was to put together IBM's proposal to Project MAC. In the process, they brought in many of IBM's finest engineers to work with them to specify a machine that would meet Project MAC's requirements, including address translation. They were delighted to discover that one of the lead S/360 designers, Gerry Blaauw, had already done a preliminary design for address translation on System/360. Address translation had not been incorporated into the basic System/360 design, however, because it was considered to add too much risk to what was already a very risky undertaking. It must be remembered that IBM was placing the entire future of its business on the line with System/360.
The machine that IBM proposed to Project MAC was a System/360 that had been modified to include the "Blaauw Box." This machine was also bid to Bell Labs at about the same time. It was never built, however, because both MIT and Bell Labs chose another vendor. MIT's stated reason for rejecting IBM's bid was that it wanted a processor that was a mainline product, so that others could readily acquire a machine on which to run Multics. It was generally believed, however, that displeasure with IBM's attitude toward timesharing was a factor in Project MAC's decision.
Losing Project MAC and Bell Labs had important consequences for IBM. Seldom after that would IBM processors be the machines of choice for leading-edge academic computer science research. Project MAC would go on to implement Multics on a GE 645 and would have it in general use at MIT by October 1969. Also in 1969, the system that was to become Unix would be begun at Bell Labs as an offshoot and elegant simplification of both CTSS and Multics, and that project, too, would not make use of IBM processors.
So started a period of long estrangement between System/360 and its descendants and the world of Unix. How different things might have been!
In the late '80s and early '90s, IBM made attempts to get back into the Unix game on its mainframes with the introduction of AIX/370 and AIX/ESA. Unfortunately, these birds would not fly, and they were quickly retired to the operating system graveyard. Fortunately for IBM, AIX on the RT and RS/6000 platforms did take off and has been a great line of business for the company.
The proliferation of business applications appearing in the Unix world prompted IBM to try a different approach to making the Unix APIs available to System/390 programmers. This time IBM came up with OpenEdition for OS/390 (later called Unix System Services, or USS) and VM/ESA. The premise behind these offerings was to add a set of Unix APIs to the base operating systems, allowing vendors to port their Unix applications to System/390 without rewriting the programs.
Both USS and OpenEdition still have an important, and even growing, role to play within an enterprise as a result of the advent of Linux for S/390. Their chief problem is that they are both EBCDIC implementations. The beauty of Linux for S/390 for software vendors is that it is an ASCII implementation that should look, feel, and act the same in all important respects as any other port of Linux.
Enter VM
So where did VM come from
and why was it created? Again, Melinda Varian's history of VM is the canonical
source for this material:
In the fall of 1964, the folks in Cambridge suddenly found themselves in the position of having to cast about for something to do next. A few months earlier, before Project MAC was lost to GE, they had been expecting to be in the center of IBM's timesharing activities. Now, inside IBM, "timesharing" meant TSS, and that was being developed in New York State. However, Norm Rasmussen (who had headed IBM's bid for Project MAC) was very dubious about the prospects for TSS and knew that IBM must have a credible timesharing system for the S/360. He decided to go ahead with his plan to build a timesharing system, with Bob Creasy leading what became known as the CP-40 Project. The official objectives of the CP-40 Project were the following:
- The development of means for obtaining data on the operational characteristics of both systems and application programs;
- The analysis of this data with a view toward more efficient machine structures and programming techniques, particularly for use in interactive systems;
- The provision of a multiple-console computer system for the center's computing requirements; and
- The investigation of the use of associative memories in the control of multiuser systems.
The project's real purpose was to build a timesharing system, but the other objectives were genuine, too, and they were always emphasized in order to disguise the project's "counter-strategic" aspects.
Bob Creasy and Les Comeau spent the last week of 1964 joyfully brainstorming the design of CP-40, a new kind of operating system, a system that would provide not only virtual memory, but also virtual machines. They had seen that the cleanest way to protect users from one another (and to preserve compatibility as the new System/360 design evolved) was to use the System/360 Principles of Operations manual to describe the user's interface to the Control Program. Each user would have a complete System/360 virtual machine (which at first was called a "pseudo-machine"). (The term virtual machine has been attributed to Dave Sayre at IBM Research.)
This skunk-works project (which seems to be paralleled 30 years later by the Linux for S/390 effort) resulted in CP-40, which became CP-67, VM/370, VM/SP, and VM/XA and had been transformed by the early '90s into VM/ESA. The internals are probably unrecognizable to the original developers, but the underlying principles remain the same.
Virtual machines
Virtual machines have seen renewed interest of late in the form of VMware and
the Java Virtual Machine. A VM/ESA virtual machine can run anything that could
be run on the bare iron, including a copy of VM/ESA itself (and a copy running
in that copy, and so on).
Virtual machines provide a "padded-cell environment" that isolates one user from
another while also allowing all users access to both the real resources of the
machine and the virtual resources of the VM operating system. You can, for
example, define more virtual CPUs than actually exist in the real machine, or
virtual disks that may or may not correspond to real hardware.
So, why virtual machines? R. P. Goldberg, in the March 1973 Proceedings of ACM SIGARCH-SIGOPS Workshop on Virtual Computer Systems, describes the rationale:
The development of interest in virtual computer systems can be traced to a number of causes. First, there has been a gradual understanding by the technical community of certain limitations inherent in conventional timeshared multiprogramming operating systems. While these systems have proved valuable and quite flexible for most ordinary programming activities, they have been totally inadequate for system programming tasks. Virtual machine systems have been developed to extend the benefits of modern operating system environments to system programmers. This has greatly expedited operating system debugging and has also simplified the transporting of system software. Because of the complexity of evolving systems, this is destined to be an even more significant benefit in the future.
As a second point, a number of independent researchers have begun to propose architectures that are designed to directly support virtual machines, i.e., virtualizable architectures. These architectures trace their origins to an accumulated body of experience with earlier virtual machines, plus a set of principles taken from other areas of operating system analysis. They also depend upon a number of technical developments, such as the availability of low-cost associative memories and very large control stores, which now make proposals of innovative architectures feasible.
A third reason for the widespread current interest in virtual machines stems from its proposed use in attacking some important new problems and applications such as software reliability and system privacy/security. A final point is that IBM has recently announced the availability of VM/370 as a fully supported software product on System/370. With this action, IBM has officially endorsed the virtual machine concept and transformed what had been regarded as an academic curiosity into a major commercial product.
VM/ESA is a hypervisor; that is, it presents to the entities running on it the same interface definition that the real hardware provides. This means that the logical entities we call virtual machines are idealized simulations of a computer. The Control Program (CP) component of VM/ESA operates the real machine hardware and multiplexes the physical resources of the computing system among the virtual machines.
The System/390 architecture allows VM to do this because it separates its instruction set into privileged (aka Supervisor State) and nonprivileged (aka Problem State) groups. In the Supervisor State, all instructions are valid. In the Problem State, only those instructions are valid that provide meaningful information to the problem program and that cannot affect system integrity; such instructions are called unprivileged instructions. The instructions that are never valid in the Problem State are called privileged instructions. When a CPU in the Problem State attempts to execute a privileged instruction, a privileged-operation exception is recognized. A CPU executes another group of instructions, called semiprivileged instructions, in the Problem State only if specific authority tests are met; otherwise, a privileged-operation exception or a special-operation exception is recognized.
An operating system uses these privileged operations to schedule resources between competing applications running under it. CP dispatches a virtual machine running the operating system in the nonprivileged Problem State and then traps any privileged operations the virtual machine performs. When it traps such an operation, CP can simulate it against the virtual machine's virtual resources and then resume the guest as though the instruction had run on the real hardware.
Similarly, when interrupts occur on the real machine, CP will determine if the interrupt needs to be reflected to a particular virtual machine, such as when an I/O operation that had been initiated by a Linux virtual machine has just completed.
Much of the workload for intercepting and simulating instructions and interrupts for a virtual machine has been lifted from CP by the inclusion of hardware assist functions built into the processor complexes. These hardware assists provide significant performance boosts for the virtual machine.
VM and open source
VM started out
within IBM but was soon adopted by the user community, which then started
providing new functions, enhancements, and fixes to the operating system. The
code was a licensed program product of IBM but was free of charge and came with
complete source. The philosophy of the development team is best described in the
words of one of the chief architects, Bob Creasy:
"The design of CP/CMS by a small and varied software research and development group for its own use and support was, in retrospect, a very important consideration. It was to provide a system for the new IBM System/360 hardware. It was for experimenting with timesharing system design. It was not part of a formal product development. Schedules and budgets, plans and performance goals did not have to be met. It drew heavily on past experience. New features were not suggested before old ones were completed or understood. It was not supposed to be all things to all people. We did what we thought was best within reasonable bounds. We also expected to redo the system at least once after we got it going. For most of the group, it was meant to be a learning experience. Efficiency was specifically excluded as a software design goal, although it was always considered. We did not know if the system would be of practical use to us, let alone anyone else. In January 1965, after starting work on the system, it became apparent from presentations to outside groups that the system would be controversial. This is still true today." (Varian, p. 97)
However, gradually what had been public started to become more and more private. On February 8, 1983, IBM announced its Object Code Only (OCO) policy. The VM community made an enormous effort to convince IBM's management that the OCO policy was a mistake. Many people contributed to the effort in SHARE (an IBM user group) and in the other user groups.
In February 1985, the SHARE VM Group presented IBM with a White Paper that concluded with the sentence, "We hope that IBM will decide not to kill the goose that lays the golden eggs." IBM chose not to reply to it.
A few months after the announcement of the OCO policy, IBM released the first OCO version of VM, VM/PC. VM/PC had a number of problems, including poor performance and incorrect, missing, or incompatible functions. Without source, users were unable to correct or compensate for these problems, so nobody was surprised when VM/PC fell flat.
IBM continued throughout the decade to divert much of its energy to closing up its systems, not noticing until too late that the rest of the industry (and many of its customers) was moving rapidly toward open systems. By 1991, around the time Linus Torvalds was releasing his first Linux efforts, IBM had made major parts of VM Object Code Only (OCO: no source) and Object Code Maintained (OCM: source available, but fixes shipped as object code only). IBM was doing the exact opposite of what Richard Stallman was advocating with regard to open source.
This is a salutary lesson for devotees of open source software: The price of open source is eternal vigilance.
VM has always been the bastard child of IBM. It is extremely efficient, which means that you do not need as much hardware to run it. This does not please those who sell hardware. Every so often IBM attempts to kill it off, but it has proven resilient:
"Throughout 1967 and very early 1968, IBM's Systems Development Division, the guys who brought you TSS/360 and OS/360, continued its effort to have CP-67 killed, sometimes with the help of some IBM Research staff. Substantial amounts of Norm Rasmussen's, John Harmon's, and my time was spent participating in technical audits which attempted to prove we were leading IBM's customers down the wrong path and that for their (the customers'!) good, all work on CP-67 should be stopped and IBM's support of existing installations withdrawn." (R. U. Bayles quoted in Varian, p. 97).
Now with Linux for S/390, VM is again coming into its own. VM has a lot to offer Linux in the S/390 environment. Think of it as a highly intelligent BIOS that relieves Linux of distractions such as dynamic sparing and hardware recovery, as well as supporting the concurrent operation of thousands of virtual machines.
Finally, after years of working its way through the beast that is the IBM
bureaucracy (helped along by a bottom line that was starting to hurt), IBM
rediscovered open source.
About the author
Neale Ferguson is a long-time IBM S/390 system administrator with over nineteen years of experience with VM/ESA. He worked on the non-IBM port of Linux to S/390 and jumped to the IBM-sponsored effort as soon as it was released. Formerly with TAB Limited in Sydney, Australia, he currently works as a consultant at Computer Associates in Reston, Virginia.