			Introduction to the OpenSSI Cluster
				Release 1.9

I:	Overview
II:	Installation
III:	Single Root and Context Symlinks
IV:	/dev and devices
V:	Filesystems
VI:	NFS client and NFS server
VII:	Swap
VIII:	/proc and cluster process model
IX:	Networking model and CVIP (cluster virtual IP, cluster alias)
X:      Load balancing
XI:	Inter-process communication (IPC)
XII:	Added and changed commands/utilities
XIII:	HPTC Middleware
XIV:	libcluster, cluster.h and programming
XV:	System Management
XVI:    Functional Updates from the 1.0.0 release


I: Overview
   The goal of the SSI software is to provide a scalable, available and
manageable server cluster environment.  At the most general level, the
idea is that the cluster looks like one very big machine, with
applications and management working as if it were one machine.  Because the
cluster is made up of multiple physical servers, however, it is not
possible to make all the seams completely transparent.
   The filesystem model is quite SSI, with a single root filesystem and
automatic visibility of all filesystems on all nodes.
   While remote device access is supported, each node does have its own
set of devices (via devfs) and device directory.
   Swap space is not particularly SSI, with each node providing its own
swap space.
   Processes have clusterwide unique pids and process management
is completely SSI.  Sessions and process groups can be distributed, as can
parent-child pairs.  Users, administrators and processes have complete
visibility and access to all processes on all nodes, just as if it were one
big machine.  Processes can be launched on other nodes in a variety of ways and
can even move from node to node while they are running (process migration),
without the process even being aware.  /proc shows all the processes on all
the nodes, but the non-process part of /proc shows local information
(e.g. cpuinfo, meminfo, etc.).
   
   The networking model has two parts.  First, each node has one or more
addresses that act in a per-node manner.  One address is used for 
kernel-to-kernel communication to support the SSI.  That address can 
also be used for MPI or other cross-node application communication.  
Second, to make the cluster look like a single, highly available
machine, there is a CVIP, or cluster alias.  This address is on an external
network and is an alias on the external network card.  This CVIP fails over
to another card on another node if the first node leaves the cluster.
   There are two forms of load balancing built into the system.  First, 
incoming network connections can be load balanced using the HA-LVS 
capability.  Processes can be load balanced through the load leveling 
subsystem, which does exec-time load leveling and process migration.
   All inter-process communication (IPC) is nameable clusterwide and is
shared clusterwide.  This means there is one namespace for semaphores,
message queues and shared memory but all objects are usable from any node.
Pipes and fifos are shared, as are Unix Domain sockets.
   There are very few new commands, the idea being that standard single
system commands should just work.  There is a command to see
the cluster membership (cluster).  There are some commands to get
processes launched on particular nodes (onnode,onall,fast,migrate) and to see
where processes are running (where_pid).  There are a couple of commands to
control load levelling (loads, loadlevel) and a command to run other programs
in a "localview" mode (localview - more on this in the process management and
IPC sections below).  Man pages should exist for all of the commands.
   Much of the open-source HPTC middleware has been run on the SSI cluster,
along with some commercial products.  The list of things tested
includes MPICH, LAMPI, HP MPI, openPBS, ScalablePBS, SLURM, ganglia,
supermon, Totalview, Lustre, PVFS, and Maui.
   Programming is very SSI and many/most programs don't need any specific
SSI calls.  However, there is a libcluster.so and cluster.h available to
do some cluster-specific functions.  In libcluster.so there are
rexec(), rfork() and migrate() calls to create processes on, or move them to, other nodes.
There are also calls to check the membership of the cluster, get information
about nodes in the cluster and to get a cluster membership history.
   System management means a lot of things from installation to filesystem
management to networking and user accounts and performance monitoring.
Some aspects of this are covered in other sections.  The System Management
section below describes the "services" model in SSI and a few other topics.
This release has a first cut at some of the man pages.  There are also
documents in /usr/share/doc/openssi on the following subjects:
running X (README.X-Windows); filesystems (README.cfs, README.hardmounts, 
README.fstab, README.lustre, cdsl, README.nfs-server); networking 
(README.networking, README.ipvs, README.CVIP, README.network-bonding); 
OpenSSI and LTSP (README.ltsp); process loadbalancing (README.mosixll); 
clusterwide RC (rc-design-notes); and other topics (README.serial-console, 
README.upgrade, README.ntp, README.edit-ramdisk).

II: Installation
   Installation has three major pieces: installing on the first node, adding nodes,
and the "group node installation".  Currently all pieces are incomplete and
somewhat clunky.  There are, however, installation instructions, so no further
information is included here.

III: Single Root and Context Symlinks
   The filesystem model is quite SSI, with a single root filesystem and
automatic visibility of all filesystems on all nodes.
   There are some context dependent symbolic links to allow node specific
files.  The syntax of the links is .../node{nodenumber}/... (e.g.
/var/log -> /cluster/node{nodenumber}/var/log ).
/cluster has the node specific files, for the most part.  You will
see /cluster/node1, /cluster/node2, etc.
/var has three context links (lock, log and run).
/cluster/var holds the global /var files, symlinked back from each node's
var (utmp is an example of this).
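   For example, a rough illustration of how a context link and its per-node
targets appear (node numbers are examples and the listing output is abbreviated):
        ls -ld /var/log                 # a context link: /var/log -> /cluster/node{nodenumber}/var/log
        ls -d /cluster/node1/var/log    # node 1's copy of the log directory
        ls -d /cluster/node2/var/log    # node 2's copy of the log directory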
/usr/share/doc/openssi has documents on CFS, the
clusterwide fstab and shared-disk hardmounts.

IV: /dev and devices
   While remote device access is supported, each node does have its own
set of devices (via devfs) and device directory.
/dev is a context dependent mount, meaning you see the local /dev on
each node.  Within /dev is a set of directories (/dev/1, /dev/2, etc.)
which are bind mounts to /cluster/nodex/dev (so you can name the device
on any node).  If a process migrates, it retains the /dev context of the
node it started on.  The onnode command changes the context
to that of the destination node.   /dev/pts is now supported as a
clusterwide device.  There is a single pool of ptys, allocated uniquely
clusterwide.
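   For example (node numbers are illustrative, and the onnode argument order
shown here is an assumption; see its man page):
        ls /dev/2           # node 2's devices, via the bind mount of /cluster/node2/dev
        onnode 2 ls /dev    # run the command on node 2, with node 2's /dev as the device context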

V: Filesystems
   The goal is that the same filesystem tree is seen by all processes on
all nodes and that this is easily managed and enforced.  To that end, CFS
(the stacked cluster filesystem) is automatically stacked on mounted ext3
physical filesystems, which then become automatically visible on all nodes.
One need only do a standard ext3 mount and all processes on all nodes
see it.  This automatic stacking is done for ext2 and ext3 and should also
work for other physical filesystems (JFS, Reiserfs, XFS, etc.).  See the
next section for NFS.  OpenGFS was supported and the opensource RedHat GFS
will be supported in 2.6.  The Lustre client capability is supported.
Filesystems with CFS automatically stacked can transparently
fail over from one node to another if they are stored on a shared disk.
File record locking is visible and enforced clusterwide.
BSD-style "flock" is only visible and enforced locally.
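   For example, a standard mount on one node is visible from every node
(device, mount point and node number are placeholders, and the onnode
argument order is an assumption):
        mount -t ext3 /dev/sda3 /data   # an ordinary ext3 mount; CFS is stacked automatically
        onnode 3 ls /data               # the same filesystem tree is visible from any other node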
/etc/fstab is now used to describe all filesystems on all nodes, with a
way to indicate which filesystems are on each node and which can
fail over from node to node.  /etc/mtab now better indicates where a
filesystem is mounted. 
/usr/share/doc/openssi has documents on CFS, the
clusterwide fstab and shared-disk hardmounts.

VI: NFS client and NFS server
   NFS client capability is fully functional (including locking and statd
and failure handling), assuming all nodes have connectivity to the remote
NFS server.  An NFS mount done on any node in the cluster automatically
causes all nodes to do that mount.  Once mounted, all access from any
node goes directly to the remote NFS server, so there is no added
coherency in the cluster.
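   For example (server name, export path, mount point and node number are
placeholders; the onnode argument order is an assumption):
        mount -t nfs fileserver:/export /mnt/data   # an NFS mount issued on one node...
        onnode 4 df /mnt/data                       # ...is automatically mounted on all nodes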
   NFS server capability is supported via the CVIP (see networking below).
There is one lockd/statd for the server, which works independently from the
NFS client function (so failure handling w.r.t. remote machines can
work correctly).  Much of the capability needed to support HA-NFS server
is in place but it is not fully operational.

VII: Swap
   Swap space is not particularly SSI, with each node providing its own
swap space.  Swap space designation is done in the common /etc/fstab.
To determine which nodes are using which swap, you can execute an
onall swapon -s.

VIII: /proc and cluster process model
   Processes have clusterwide unique pids and process management
is completely SSI.  Sessions and process groups can be distributed, as can
parent-child pairs.  Users, administrators and processes have complete
visibility and access to all processes on all nodes, just as if it were one
big machine.  Processes can be launched on other nodes in a variety of ways and
can even move from node to node while they are running (process migration),
without the process even being aware.  /proc shows all the processes on all
the nodes, but the non-process part of /proc shows local information
(e.g. cpuinfo, meminfo, etc.).
   /proc/cluster has some clusterwide info: "events" is for getting
notification of membership changes; "loadlevellist" is for setting which
applications can be automatically moved from node to node if
loadlevelling is turned on (which it is by default); "lvs" and the
"ip_vs_..." entries have to do with the cluster IP alias.
   /proc/cluster/nodex/* are files used for load levelling and are
discussed in section X below.
   Note that the "localview" feature can restrict the view of the
processes in /proc to just those executing locally.  localview can
be used as a command, so "localview top" will just display the most
active processes on the local node and not clusterwide.  The
localview attribute is inherited.  The onnode command has a -l option
to restrict the view of the command executed to the node it is run
on.
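   For example (the node number is illustrative and the onnode option and
argument order is an assumption):
        localview top        # top sees only the processes running on the local node
        onnode -l 3 ps -ef   # run ps on node 3 and see only node 3's processes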

IX: Networking model and CVIP (cluster virtual IP, cluster alias)
   The networking model has two parts.  First, each node has one
or more addresses that are only locally visible.  One address is 
used for kernel-to-kernel communication in support of SSI.  
That address is in /etc/clustertab (which is used to build /etc/dhcpd.conf
and a file in the ramdisk so networking with the address can
begin early in the boot sequence).  The internal cluster address
can also be used for MPI or other cross-node application
communication.  The internal address is typically associated with a
hostname.  There is also a "nodename" command, which uses the
address of the internal interface.
   Second, to make the cluster look like a single, highly available
machine, there is a CVIP, or cluster alias.  This address is on an external
network and is an alias on the external network card.  This CVIP fails over
to another card on another node if the first node leaves the cluster.
The "clustername" command returns the same result on all nodes and should
be associated with the CVIP.  In /usr/share/doc/openssi
there is a document called README.CVIP and another called README.ipvs.
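   A quick way to see the two halves of the model from a shell (the output,
of course, depends on your configuration):
        nodename       # name tied to this node's internal interface address; differs per node
        clustername    # same result on every node; should be associated with the CVIP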

X: Load Balancing
   There are two forms of load balancing built into the system.  First, 
incoming network connections can be load balanced using the HA-LVS 
capability described in the previous section.  
    Processes can be load balanced through the load leveling 
subsystem, which does exec-time load leveling and process migration.
/proc/cluster/nodex/ has "load" with a load value (algorithm from
Mosix) which is used by the loadleveller, and "loadlevel", which
indicates if loadlevelling is turned on for that node.  The loads(1)
command shows the loads of each node.
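   For example, to inspect the load levelling state by hand (the node number
is illustrative):
        cat /proc/cluster/node2/load        # Mosix-style load value used by the load leveller
        cat /proc/cluster/node2/loadlevel   # whether load levelling is turned on for node 2
        loads                               # the per-node loads for the whole cluster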
   /proc/cluster/loadlevellist is the list of programs that can be
loadlevelled (and this characteristic is inherited), so if you run
/bin/bash-ll, everything you execute after that can be loadlevelled,
either at exec time or while it is running (via process migration).
Loadlevelling is started by the standard RC service mechanism and the
/proc/cluster/loadlevellist is initialized by the loadlevel service.
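   For example (the list contents depend on what the loadlevel service
installed):
        cat /proc/cluster/loadlevellist   # programs eligible for load levelling
        /bin/bash-ll                      # a shell that can be loadlevelled; what it runs inherits the attribute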

XI: Inter-process communication (IPC)
   All inter-process communication (IPC) is nameable clusterwide and is
shared clusterwide.  This means there is one namespace for semaphores,
message queues and shared memory, but all objects are usable from any node.
Pipes and fifos are shared, as are Unix Domain sockets.
   IPC objects are created on the node where the process creating them
is executing.  Once created, an object can be accessed by any process on any
node.  IPC objects do not currently move from one node to another the way
processes can.
   The ipcs command shows all SysV IPC objects, as does /proc/sysvipc/*.
   The "localview" command also limits the scope of
ipcs and allows "local" SysV IPC objects (keys need only be unique on
the given node).
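   For example:
        ipcs              # clusterwide view of semaphores, message queues and shared memory
        localview ipcs    # the same report, restricted to a local view of the objects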

XII: Added and changed commands/utilities
   There are very few new commands, the idea being that standard single
system commands should just work.  There is a command to see
the cluster membership (cluster).  There are some commands to get
processes launched on particular nodes (onnode,onall,fast,migrate) and to see
where processes are running (where_pid).  There are a couple of commands to
control load levelling (loads, loadlevel) and a command to run other programs
in a "localview" mode (localview - more on this in the process management and
IPC sections).
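   A few quick illustrations (node numbers and the pid are placeholders, and
the argument order of these commands is assumed, so check the man pages):
        cluster             # show the current cluster membership
        onnode 2 hostname   # run a command on node 2
        onall date          # run a command on every node
        where_pid 1234      # report which node pid 1234 is running on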
   The "localview" command can be used to launch other commands in a mode
where readdir of /proc only shows local processes.  It also limits the 
visibility of sysVipc objects.  "onnode -l" allows you to
run a command on a specific node and only see the /proc of that node.
   ps shows all the processes on all the nodes.  It is modified to
include a --shownode option which changes the display output to have a
column indicating which node each process is running on.  A --node
option provides a way to indicate you only want the processes running on
a particular set of nodes.  Also, running ps under localview will just show 
the processes on that node.
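   For example (the --node argument format is an assumption; see the ps man
page):
        ps -ef --shownode   # adds a column showing the node each process is running on
        ps -ef --node 2     # only the processes running on node 2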
   The unmodified top command is interesting - it does show all the
processes from all the nodes;  the %cpu is relative to the node each process is
running on (which top doesn't tell you);  the header info is a mixture of
clusterwide and local information (clusterwide is the number of processes,
etc. and local is the number of cpus and memory).  Again, it can be run under
localview to see just the local processes.
   fuser just works clusterwide.
   Some enhancements have been made to rc, sysinit and services to
allow a clusterwide view of booting and service startup/management.
A little more about this is described below in the System Management
section.
   who is clusterwide, as is last.

XIII: HPTC Middleware
   Much of the open-source HPTC middleware has been run on the SSI cluster,
along with some commercial products.  The list of things tested
includes MPICH, LAMPI, HP MPI, openPBS, ScalablePBS, SLURM, ganglia,
supermon, Totalview, Lustre, PVFS, and Maui.  A separate document
describes a little about each of these capabilities and how they relate
to the SSI cluster.

XIV: libcluster, cluster.h and programming
   Programming is very SSI and many/most programs don't need any
SSI-specific calls.  However, there is a libcluster.so and cluster.h available to
do some cluster-specific functions.  In libcluster.so there are
rexec(), rfork() and migrate() calls to create processes on, or move them to, other nodes.
There are also calls to check the membership of the cluster, get information
about nodes in the cluster and to get a cluster membership history.
Man pages for the libcluster.so calls are included in the distribution.
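   A minimal build sketch, assuming cluster.h is on the default include path
and libcluster.so is on the default linker path (the program and file names
are placeholders):
        cc -o mytool mytool.c -lcluster   # mytool.c includes cluster.h and may call rexec(), rfork(), migrate(), etc.
        ./mytool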

XV: System Management
   System management means a lot of things from installation to filesystem
management to networking and user accounts and performance monitoring.
Some aspects of this are covered in other sections.  In this section
we describe the "services" model in SSI and a few other topics.
   Cluster/system booting works the following way.  One of the potential root
nodes (nodes with access to a potentially shared root filesystem) boots
up initially. It is referred to as the initnode because /sbin/init (process
1) is running on that node.  It does not have to be node 1 and isn't really
a master node unless there is no root failover.  The initnode runs rc,
starts a number of services, and enables other services which will start
when the node they are to run on boots up.  After the initnode is up, init
allows other nodes to join and in parallel it runs a sysinit and an rc
on each node.  Services set up to run on all nodes or on the specific node
are started as that node boots.  Nodes are not generally considered UP
until they have finished their rc processing and are at the same run level
as the rest of the cluster.  Services can be configured to run on the
initnode (default) or on all nodes or on a specific node.  Services set up
to run on the initnode will automatically restart on the takeover init node
if the initnode fails (you can disable this if appropriate).
   It is assumed that ntp will be enabled and run on each node so the
time on each node will be almost the same.

XVI: Functional Updates from the 1.0.0 release
   A. CFS performance enhancements;
   B. atomic migration of process groups;
        If you do a migrate command with a negative pid, the pgrp with
	that number will either be migrated completely or none of the
	processes in it will be migrated (see the example after this list).
   C. migration of thread groups;
        If you migrate a member of a thread group, all members will migrate
	with it.
   D. support for LVS-NAT;
   E. support for process migration while holding file record locks;
   F. Added support for the Fedora Core 2 distribution;
   G. Upgraded the base Linux kernel to 2.6.10;
   H. Addition of the "fast" and "fastnode" commands and their man pages;
   I. Added the /proc/cluster/{nm_rate, nm_log_threshold, and 
      nm_nodedown_disabled} files.  nm_rate can be used to alter how 
      often node monitoring messages are exchanged (default is 1 per 
      second) and how long before a node is declared down (default 
      10 seconds);  nm_log_threshold indicates how many monitoring messages
      must be missed before a message is printed to the kernel log (default 2);
      nm_nodedown_disabled can be set to turn off nodedown detection, which
      is useful if you need to go into the debugger on one of the nodes.
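   As referenced in item B, a sketch of atomic process group migration (the
pid and node number are placeholders, and the "migrate <pid> <node>" argument
order is an assumption; see the migrate man page):
        migrate -1234 3   # moves every process in pgrp 1234 to node 3, or none of them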
