Design Notes for Cluster RC Script Handling
written by David Zafman
August 21, 2003

Design by David Zafman & Bruce Walker

Note that this design document does not accurately describe the current 
implementation of clusterwide RC handling. Instead, it describes what
clusterwide RC handling should ultimately look like. Do not be alarmed
by differences between this document and the current implementation.


	I will describe the basic mechanism for how RC scripts, booting,
shutdown and run-level changes occur.  Although I'm basing this on the
Redhat distribution, much of it is the same in other distributions.
	If an application package wants to be started during system boot
it can install an RC script in /etc/rc.d/init.d (a symbolic link exists
pointing /etc/init.d at /etc/rc.d/init.d for backward compatibility).
An RC script generally responds to the following commands for argument 1:
start stop status restart reload condrestart.  The script may support
other package-specific commands.  Many packages which comprise one or more
daemons can use /etc/init.d/functions to make writing the script
easier.  After installing its script in /etc/rc.d/init.d the package
installation executes "chkconfig --add svc_name".  More on what chkconfig does
below.
	When /sbin/init (process 1) enters a new run level it executes
/etc/rc.d/rc with argument 1 being the new run-level.  Part of booting is just
changing to run levels 1, 2, 3, 4, or 5.  Shutdown is simply going to runlevels
0 (halt) or 6 (reboot).  The runlevel changes are controlled by the contents of
the /etc/rc.d/rc?.d directories.  Symbolic links for /etc/rc?.d (? = 0 - 6)
exist for backward compatibility.  Inside /etc/rc.d/rc?.d are symbolic links to
the actual RC scripts in /etc/rc.d/init.d.  The naming of these links controls
the sequence and whether a service is enabled or disabled for a particular
run-level.  Symbolic links that begin with "S" are executed with "start" and
links that begin with "K" are executed with "stop".  When entering a runlevel
the K scripts are executed first, then the S scripts.  They are executed in
shell globbing order, which is why they are named S##service_name and
K##service_name.  The numeric part controls the sequencing.
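	The ordering rule above can be illustrated with a small sketch.  It
is not the real /etc/rc.d/rc; rc_order is a hypothetical helper that only
prints what would be run, and the demo directory stands in for a run-level
directory such as /etc/rc.d/rc3.d.

```shell
#!/bin/sh
# Sketch only: print the order in which a run-level directory's links would
# be executed. rc_order is a hypothetical helper, not part of /etc/rc.d/rc.
rc_order() {
    dir=$1
    # K links first, then S links; the shell sorts each glob lexically.
    for link in "$dir"/K[0-9][0-9]* "$dir"/S[0-9][0-9]*; do
        [ -e "$link" ] || continue      # skip unmatched glob patterns
        name=${link##*/}
        case $name in
            K*) echo "${name#K[0-9][0-9]} stop" ;;
            S*) echo "${name#S[0-9][0-9]} start" ;;
        esac
    done
}

# Demonstration in a scratch directory standing in for /etc/rc.d/rc3.d.
demo=$(mktemp -d)
touch "$demo/S55sshd" "$demo/S10network" "$demo/K95test" "$demo/K20nfs"
rc_order "$demo"
rm -rf "$demo"
```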
	The chkconfig command completely automates the creation of symbolic
links in the /etc/rc.d/rc?.d directories.  It only works if the RC script has
a comment line with the following format:
# chkconfig: 345 05 95
The first field "345" specifies the default runlevels the service should be
started on.  The second field specifies the numeric sequencing for starting
it.  And the third field specifies the numeric sequencing for stopping.  If
the service named "test" had the chkconfig line above and chkconfig --add test
were executed, it would create the following symbolic links:
/etc/rc.d/rc0.d/K95test -> ../init.d/test
/etc/rc.d/rc1.d/K95test -> ../init.d/test
/etc/rc.d/rc2.d/K95test -> ../init.d/test
/etc/rc.d/rc3.d/S05test -> ../init.d/test
/etc/rc.d/rc4.d/S05test -> ../init.d/test
/etc/rc.d/rc5.d/S05test -> ../init.d/test
/etc/rc.d/rc6.d/K95test -> ../init.d/test
	From this we see that the "test" service is enabled in runlevels 3, 4
and 5.  Although this is the default as specified by the RC script itself, the
administrator can modify this at any time.  Also, if the chkconfig line has
"-" instead of run-levels in the first field, all run-levels will get stop
scripts.  This might seem like a pointless configuration, but although
the service doesn't start on boot, it will be stopped on shutdown if the
administrator starts it manually.  Also, many installable services
open up remote access to files or services, and they
should not be enabled by default.
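	The mapping from the chkconfig comment line to symbolic links can be
sketched as follows.  This is an illustration of the rule described above,
not the real chkconfig; chkconfig_links is a hypothetical helper that only
prints the links it would create.

```shell
#!/bin/sh
# Sketch only: list the links "chkconfig --add" would create for a service,
# given the three fields of its "# chkconfig:" comment line.
# chkconfig_links is a hypothetical helper, not the real chkconfig.
chkconfig_links() {
    svc=$1 levels=$2 start=$3 stop=$4
    for rl in 0 1 2 3 4 5 6; do
        case $levels in
            *"$rl"*) link="S$start$svc" ;;   # run-level listed: start link
            *)       link="K$stop$svc" ;;    # otherwise ("-" included): stop link
        esac
        echo "/etc/rc.d/rc$rl.d/$link -> ../init.d/$svc"
    done
}

chkconfig_links test 345 05 95
```

Passing "-" as the second argument yields a K link in every run-level, which
is the stop-only configuration discussed above.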
	Now that we understand how machines boot, shut down and change runlevels,
let's look at what happens when a particular service is started and stopped.
First of all the /etc/rc.d/rc script looks at /var/lock/subsys/SVC_NAME to
see if a service is running.  This means that the RC script should create
this file after performing whatever "start" action it wants to.  It must
remove the file after performing whatever "stop" action it wants to.  The RC
script uses the presence of the lock file to control "condrestart" activity.
The conditional restart means restart only if the service is currently running;
otherwise do nothing.  During shutdown, or really any runlevel change, the K
scripts are executed only for services that are actually running.  In the same
vein, when executing S scripts, if /var/lock/subsys/SVC_NAME already exists,
then the start is skipped.  Most likely for backward compatibility,
/etc/rc.d/rc also looks for "/var/lock/subsys/SVC_NAME.init".
	On a running system the contents of /var/lock/subsys indicate all the
services that are currently running.  This may or may not have any relation to
the contents of /etc/rc.d/rc3.d if the machine is currently at runlevel 3, for
example.  An administrator can manually start and stop services at will.  Also,
during boot he can enter interactive start-up by typing "I" and selectively
disable services that would otherwise have been started.
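	The lock file test described above can be sketched as a small
function.  This is an illustration only; should_run and the LOCKDIR override
are assumptions made for the sketch, not names taken from /etc/rc.d/rc.

```shell
#!/bin/sh
# Sketch only: decide whether a K or S link should run, based on the lock
# file. LOCKDIR defaults to the real location but can be overridden for
# experimentation. should_run is a hypothetical helper.
LOCKDIR=${LOCKDIR:-/var/lock/subsys}

should_run() {
    action=$1 svc=$2
    running=no
    # Either lock file name marks the service as running.
    if [ -f "$LOCKDIR/$svc" ] || [ -f "$LOCKDIR/$svc.init" ]; then
        running=yes
    fi
    case $action in
        start) [ "$running" = no ] ;;    # skip start if already running
        stop)  [ "$running" = yes ] ;;   # only stop what is running
        *)     return 2 ;;               # unknown action
    esac
}
```

A caller would use it as a guard, for example
"should_run stop sshd && /etc/rc.d/init.d/sshd stop".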


	Can boot the cluster to any run-level.  For now the initnode boots
first then other nodes join.

	Can shutdown the entire cluster.  Services on all nodes are stopped.

	Can shutdown a single cluster node.  Only services on that node are
stopped.

telinit N
	The cluster can change run-levels.

	The redhat-config-services GUI tool must be clusterized and provide
the ability to edit new cluster parameters.

Interactive start-up
	If services are disabled using this feature on the initnode, those
services do not try to start on dependent nodes.  Dependent nodes do not have
interactive start-up capability when booting.  You can either stop the service,
or disable it (remove from node list), then boot the node.  A stopped service
does not start on failover.

Existing RC scripts
	Non-cluster aware RC scripts can be installed.  They default to running
on the initnode only.  They even get started on the new init node during
failover by default.  Most scripts which are used this way do not need
modification, although some CDSLs may be required.

	The /sbin/service script is cluster-aware and allows start, stop,
restart, condrestart and any other service-specific requests to occur on a
cluster-wide basis.  The --status-all and --full-restart options will be
cluster-wide as well.


Can't directly execute /etc/init.d/SERVICE start
	An error is output and it does an exit 1.  This may cause some
non-cluster-aware programs, scripts, and RPMs to fail, since they don't use
the /sbin/service script.

RPM installations will require special handling

Manually stopping and then starting individual services will change the
counter files in /cluster/var/lock/subsys.  This may cause
incorrect node join semantics in terms of service start order.
Fortunately, restart and condrestart will not exhibit this behavior as designed.
They don't manipulate the /cluster/var/lock/subsys directory even
though the RC script MIGHT delete and re-create the
/cluster/node*/var/lock/subsys files.


Implementation notes:

/cluster/node{nodenum} is a symbolic link to /cluster/node1
	SSI installation must modify the root filesystem for this.
	This allows the base kernel to boot on node 1 using the SSI modified
root.  The value of 1 is really the initial install node, which currently is
assumed to be 1, but in the future this could be made more flexible.

/var/lock is made a CDSL to /cluster/node#/var/lock
	SSI installation must modify the root filesystem for this.

/var/run is made a CDSL to /cluster/node#/var/run
	SSI installation must modify the root filesystem for this.
	The directories console, named, netreport, radvd, saslauthd, sudo might
need to be sym linked back to the root.  We must determine if these are global
or node specific items.  If an unknown service puts stuff here, it would be
local to a node.

Other root files
	Depending on the service, other root files must be made into CDSLs.
For example /var/lib/random-seed must be made node specific.

New way to specify nodes (classes)
	We are creating a new uniform way to specify sets of nodes.  In the
future this may become even more flexible.  In the initial release the
following are the allowable node classes:
none		- no nodes
initnode	- the current node running init (process 1)
all		- all currently UP nodes in the cluster
node=X,Y,Z	- the subset of nodes X, Y, and Z that are UP in the cluster
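
	How a node class expands into a set of nodes can be sketched as
follows.  This is an illustration only; resolve_class is a hypothetical
helper, and the caller supplies the initnode number and the list of UP nodes,
since how those are obtained from the cluster is outside this sketch.

```shell
#!/bin/sh
# Sketch only: expand a node class into the list of nodes it selects.
# resolve_class is a hypothetical helper; arguments are the class, the
# initnode number, and the space-separated list of UP nodes.
resolve_class() {
    class=$1 initnode=$2 up=$3
    case $class in
        none)     ;;                      # selects nothing
        initnode) echo "$initnode" ;;
        all)      echo "$up" ;;
        node=*)
            result=
            # Keep only the requested nodes that are currently UP.
            for n in $(echo "${class#node=}" | tr ',' ' '); do
                case " $up " in
                    *" $n "*) result="$result${result:+ }$n" ;;
                esac
            done
            echo "$result"
            ;;
        *) echo "unknown node class: $class" >&2; return 1 ;;
    esac
}

resolve_class node=2,5,7 1 "1 2 3 7"    # prints "2 7"
```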

/etc/sysconfig files
	Some files in this directory will be changed into a CDSL to
/cluster/node#/etc/sysconfig/XXXX.  This must be determined on a per-service
basis.  Most "initnode" and "all" node services have global sysconfig.  Some
services, for example those which are device specific, need a symbolic link
here.
A service like "rawdevices" is node specific.  So its /etc/sysconfig/rawdevices
configuration file is turned into a CDSL.  The xinetd service MUST be
the same on all nodes, since right now we don't want to add node specifications
to the redhat-config-services for xinetd services.  Its configuration file
is left global.
	SSI installation must modify the root filesystem for this.

New directory tree /cluster/nodetemplate
	This directory tree contains all the files needed to create an initial
version of a node's /cluster/node# tree.  During "addnode" the following is run:
"cp -a /cluster/nodetemplate /cluster/node#".  When a file in /etc/sysconfig,
for example, is converted into a CDSL, a file in /cluster/nodetemplate may or
may not need to be installed.  Depending on the service, we might not put a file
in /cluster/nodetemplate, if the service had a missing configuration file, for
example, on its initial install.  Also, we could include in the SSI package
the "initial" version of some of these configuration files for the
/cluster/nodetemplate directory.

/etc/init.d/functions modifications
	The killproc function and any code using pidof must work locally only.
An unmodified RC script may use these functions, and when a pid file is not
present the pidof command reads /proc, which would look at the entire
cluster.
	Also the first thing this script should do is check for an SSI kernel
and, if an environment variable (say, RC_SSI) is NOT set, output an error and
fail.  This prevents an administrator from executing /etc/init.d/$service_name
directly without getting the clusterization of the /sbin/service script.

New command "onclass"
	This has onall-type functionality, but it makes it easier to
implement shell scripts which have class designations for nodes.  The initial
classes are described above.  If we wanted later to add something like "all-5,7"
to exclude 5 and 7 from all, we would only need to expand the onclass command.
So the /etc/rc.d/rc script described below can simply take the class
specified for a given service and pass it to the onclass command to start
the service, for example:
onclass $class /etc/rc.d/init.d/$subsys start
This command is also used by redhat-config-services.  If the class is
"none" the command will exit successfully without doing anything.  Also,
if node 5 is not UP and the class is "node=5" the command will exit
successfully without doing anything.
	An optional argument -L restricts the action to the local node only.
This way the onclass command can completely parse the node_class argument then
determine if the local node is included in that class.  This would be used by
rc.nodeup to determine if the joining node needs to start the service, for
example.

New command "onsvc"
	This has onall-type functionality, but it makes it easier to implement
shell scripts which need to run things on the set of nodes on which a service
is currently running.  The set of nodes is determined by looking for nodes
that are UP and on which the file /cluster/node*/var/lock/subsys/SVC_NAME
exists.  For example, to get the status of a service which is running on
some set of nodes in the cluster, use the command "onsvc $service
/etc/init.d/$service status".  For the most part an administrator would not have
to use the onsvc command, but it is available.  The onsvc command is how the
/sbin/service script will actually get the status of, stop, or reload a service
in the cluster.
	An optional argument -L restricts the action to the local node only.
This way the onsvc command can look for the lock file only for the local node.
This would be used by clusternode_shutdown to determine if the leaving node
needs to stop a service.
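	The node selection that onsvc performs can be sketched as follows.
This is an illustration only; svc_nodes and the CLUSTER_ROOT override are
assumptions made for the sketch, and the caller supplies the list of UP
nodes.

```shell
#!/bin/sh
# Sketch only: list the UP nodes on which a service's lock file exists.
# CLUSTER_ROOT defaults to /cluster but can be overridden for testing.
# svc_nodes is a hypothetical helper; the caller supplies the UP node list.
CLUSTER_ROOT=${CLUSTER_ROOT:-/cluster}

svc_nodes() {
    svc=$1 up=$2
    result=
    for n in $up; do
        if [ -f "$CLUSTER_ROOT/node$n/var/lock/subsys/$svc" ]; then
            result="$result${result:+ }$n"
        fi
    done
    echo "$result"
}
```

The -L behavior corresponds to passing only the local node number as the
second argument.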

/etc/rc.d/rc script modifications
	This script looks at K??* and S??* files just as in the base; however,
it also checks to see if an SSI kernel is running and, if so, changes its
behavior as follows.
	This script will keep a counter in a file and create files in
the directory /cluster/var/lock/subsys with names L???$service_name
where the ??? is the current value of the counter.  The service_name is the
standard service name as in the standard /var/lock/subsys directory.
	When trying to "start" services in the cluster this script will do
the following.  It will check /cluster/var/lock/subsys/L???$service_name
file instead of the local file, to make sure the service is not already running
in the cluster.  It will parse the /etc/rc.d/rc.nodeinfo and retrieve the
node_class field for the service in question.  If the service is not in the
file a warning will be issued and the script skipped.  After issuing the
onclass command, it will create the file
/cluster/var/lock/subsys/L???$service_name.  It will then bump its
counter.
	When trying to perform the "stop" operations on services this script
will do the following.  It will check /cluster/var/lock/subsys to make
sure the service is running.  If it is, it will issue an
onsvc $service_name /etc/rc.d/init.d/$service_name stop
Afterward it will remove the /cluster/var/lock/subsys file.
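
	The counter-named lock files could be maintained along these lines.
This is a sketch under stated assumptions: the counter file name ".counter",
a counter that starts at 0, and the helper names are all invented for
illustration, and the three-digit width limits the sketch to 1000 entries.

```shell
#!/bin/sh
# Sketch only: maintain the cluster-wide start-order lock files.
# SUBSYS defaults to the directory named in the text; the counter file name
# ".counter" and the helper names are assumptions made for this sketch.
SUBSYS=${SUBSYS:-/cluster/var/lock/subsys}

cluster_running() {            # is an L???<service> file present?
    for f in "$SUBSYS"/L[0-9][0-9][0-9]"$1"; do
        [ -f "$f" ] && return 0
    done
    return 1
}

cluster_mark_started() {       # create L???<service>, then bump the counter
    n=$(cat "$SUBSYS/.counter" 2>/dev/null || echo 0)
    : > "$SUBSYS/$(printf 'L%03d%s' "$n" "$1")"
    echo $((n + 1)) > "$SUBSYS/.counter"
}

cluster_mark_stopped() {       # remove the service's L??? file
    rm -f "$SUBSYS"/L[0-9][0-9][0-9]"$1"
}
```

Because the counter only increases, the sorted file names record the order
in which services were started, which is what a joining node needs.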

The /sbin/service script is modified to behave much as /etc/rc.d/rc does for
"start" and "stop".  All other requests are assumed to be for running services
and use the same mechanism as "stop".

	The rc.nodeup script must join a node to an existing cluster.
Currently, even simultaneous booting requires this to work, since dependent
nodes wait for the initnode to finish booting before they can join.  This
script must go through /cluster/var/lock/subsys/L???$service_name files in
sorted order just as /etc/rc.d/rc does the S* and K* files in sorted order.  The
service_name is then determined and based on the node_class in
/etc/rc.d/rc.nodeinfo, the service may or may not be started.  Instead of
having the script determine if the node class includes the current node, we
should have an onclass local option to restrict an exec to the local node
only.  This way onclass is always called and it may be a no-op if the action
should not be performed.
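	The walk over the lock files during a node join can be sketched as
follows.  This is an illustration only; nodeup_plan is a hypothetical helper
that prints the commands rather than running them, and placing -L before the
class on the onclass command line is an assumption.

```shell
#!/bin/sh
# Sketch only: walk the L???<service> files in sorted order and print the
# command a joining node would run for each. nodeup_plan is a hypothetical
# helper; it only prints and never executes anything.
nodeup_plan() {
    subsys=$1 nodeinfo=$2
    for f in "$subsys"/L[0-9][0-9][0-9]*; do
        [ -f "$f" ] || continue
        name=${f##*/}
        svc=${name#L[0-9][0-9][0-9]}
        # rc.nodeinfo lines: service_name node_class failover_flag
        class=$(awk -v s="$svc" '$1 == s { print $2; exit }' "$nodeinfo")
        [ -n "$class" ] || continue     # not listed in rc.nodeinfo: skip
        # Assumed argument order: -L first, then the class.
        echo "onclass -L $class /etc/rc.d/init.d/$svc start"
    done
}
```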

	The redhat-config-services GUI is modified to allow you to specify what
nodes a service
should run on.  Initially this might be "all", "none", or specific nodes
"1,2,3".  Also this interface does start/stop/restart of a service on all
appropriate nodes, just as during boot or run-level change.
	When changing an xinetd service, a "reload" must be performed on all
nodes after saving changes.

	The /etc/rc.d/ssiconfig data file specifies the default behavior of
non-cluster-aware services in the cluster.  Entries will consist of lines
having the following format:
service_name	type	default_class	default_failover_flag
The service_name matches the subsystem name (the file name in /etc/rc.d/init.d).
The type can be S for single cluster node, or M for multiple nodes allowed.
The default_class will be shipped as "none", "initnode", or "all".  Hardware
specific services will default to "none", since the administrator must select
the node or node(s) to run the service on.  Other services which we want to
run on every node will be set to "all".  The default_failover_flag
can be Y or N, indicating whether to start the service again on initnode
failover.
This file is the SSI enhanced portion of chkconfig information.  In a
cluster-aware RC script this information is in a comment line following
"ssiconfig:".  We don't want to be in the business of maintaining this
information.  An RC script can over-ride the contents of this file.  This file
is just for non-cluster-aware services that are not operating per the defaults.
We don't bother to put in entries which are purely the default, for example:
cron	S	initnode	Y
As you can see from this, the default for non-cluster-aware services is to
have the service running once in the cluster on the initnode at all times.
	The /etc/rc.d/rc.nodeinfo data file drives the behavior of enabled
services in the cluster.
Entries have the following format:
service_name	node_class	failover_flag
When chkconfig is run the default_class and default_failover_flag specified
are put into this file.  First, if the RC script has a valid ssiconfig comment
line the information is retrieved from there.  If not, the information comes
from the /etc/rc.d/ssiconfig file.  Otherwise, the values "initnode" and "Y" are
put in for node_class and failover_flag, respectively.
If chkconfig is used to delete a service, the line is removed from this file. 
The service_name matches the subsystem name (the file name in /etc/rc.d/init.d).
The node_class can be modified by using chkconfig (see below).  The
failover_flag indicates whether the service should be activated with "start"
on init node failover.
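	The lookup of defaults for a newly added service can be sketched as
follows.  This is an illustration only: nodeinfo_line is a hypothetical
helper, and the step that reads an "ssiconfig:" comment line from the RC
script is left out because the format of that line is not given here.

```shell
#!/bin/sh
# Sketch only: compute the rc.nodeinfo line for a newly added service.
# The ssiconfig lookup and the fallback to "initnode"/"Y" follow the text;
# the in-script "ssiconfig:" comment step is omitted because its format is
# not specified. nodeinfo_line is a hypothetical helper.
nodeinfo_line() {
    svc=$1 ssiconfig=$2
    # ssiconfig lines: service_name type default_class default_failover_flag
    entry=$(awk -v s="$svc" '$1 == s { print $3, $4; exit }' "$ssiconfig" 2>/dev/null)
    # Fall back to the documented defaults when the service is not listed.
    set -- ${entry:-initnode Y}
    printf '%s\t%s\t%s\n' "$svc" "$1" "$2"
}
```

A service missing from the ssiconfig file, such as cron in the example
above, comes out with the "initnode" and "Y" defaults.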

	We will add new options to chkconfig to output what nodes a given
service will be run on.  This is not the default in case anyone is relying on
the standard behavior of showing just the run-levels.  We will have this command
allow you to change what nodes a service will be run on.  The node
specification will be in the same format as the onclass command accepts and it
will be written to "/etc/rc.d/rc.nodeinfo".
As indicated above the chkconfig command will have new options to set the
class of node the service runs on.  For now the failover flag cannot be set.
Also, as indicated above, when a service is --add'ed an entry in rc.nodeinfo
is created based on the defaults.  Again the default would be to look in the RC
script's ssiconfig line, then in the /etc/rc.d/ssiconfig file, otherwise use
node_class = "initnode", failover_flag = "Y".  Initially, we might not support
the ssiconfig line in the RC script.  We should add two options,
--ssi and --node <class>.  The first option modifies the --list option to
display the current node_class and failover flag selection from
/etc/rc.d/rc.nodeinfo, if present.  The usage is
chkconfig --ssi --list [name].  The --node <class> option allows the user to
modify /etc/rc.d/rc.nodeinfo for a given service.  The usage is:
chkconfig --node <class> <name>

	A new /etc/rc.d/ssiconfig will be installed by the chkconfig RPM. As
part of post installation, if there is no /etc/rc.d/rc.nodeinfo all currently
enabled services should be determined and "chkconfig --add $subsys" should be
run for each.  The standard behavior of chkconfig should be to add the default
rc.nodeinfo line if it is missing for an added service even when the symbolic
links are already there.

	This curses based configuration tool should let you edit node
specification parameters, much like redhat-config-services will be made to.

	This command is mentioned in chkconfig documentation, but my Redhat
installed machines don't have it nor its documentation.

	The /etc/init.d/halt script must be clusterized.  It is used to both
halt and reboot a machine.  Unfortunately, it also does some final clean-up on
the node.  Most of these activities must be done on all nodes.  We should break
some of these operations out so that they can be done using "onall" on all
nodes for cluster shutdown and locally for individual node shutdown.
The final halt/reboot must be sent to all nodes as a kernel message,
possibly with CLMS monitoring disabled.

ssiadmin command
	The administrator might have to use a new command to successfully
install an RPM.  If an RPM directly calls /etc/init.d/$service_name start/stop
it will fail.  We must document that to upgrade a package after SSI is
installed, the following command must be used:

	ssiadmin $service_name rpm -Uv RPM-NAME.rpm

The ssiadmin command will record the /cluster/lock/subsys/$service_name
state.  Then execute /sbin/service $service_name stop to stop service on all
nodes.  Then it will set the environment variable RC_SSI and execute rpm.
Afterward it will issue an /sbin/service $service_name start only if the global
state that was recorded first showed that the service was running.  The
administrator must figure out the name to specify ("service_name") for any rpm
which is also a Linux service.

Local /proc option
	We need an inheritable task flag which causes the readdir of /proc
to work completely locally.  This is needed partly for performance and partly
for correctness.  This will speed up the pidof program extensively used by
RC scripts in /etc/init.d/functions shell functions.  However, if an RC
script did its own thing like call killall command itself, or used ps to find
its daemon, it would not operate properly.  The worst case is you stop a
service for clusternode_shutdown and the service stops on all nodes.

	Need a clusternode_shutdown command based on "shutdown" that initiates a
specific node going down.  It changes the node state to "SHUTDOWN" in CLMS
and eventually either signals init or just stops services locally on the
node being shut down.  This ends when the halt script is run, but it somehow
knows not to halt/reboot all the nodes.
	Similar options to shutdown.  New required option -N # to specify
which node is to be shut down.  Additional option -A which is the application
notification time (app grace).  The app grace defaults to 20 seconds if not
specified.
	Let's change shutdown.c to check if it is invoked as
clusternode_shutdown and modify its behavior.  It should have a usage as below.
	Also, after sending warnings and getting ready to shut down, it changes
the state of the node to SHUTDOWN and sleeps for the app_grace period.
Afterward it changes the node state to GOINGDOWN and execs a shell script much
as init would during shutdown.  This script is similar to /etc/rc.d/rc, but it
specifies the local flag to onsvc for stop and onclass for start.  We need
special handling of /etc/rc.d/rc6.d/S*reboot which is a sym link to
/etc/init.d/halt.  This script must either be run with special options, or we
detect the clusternode_shutdown case and run a different script or have
different behavior.
The /etc/init.d/halt script as it is today will bring the entire cluster down.

 Usage:	clusternode_shutdown [-Akrh] [-t secs] -N # time [warning message]
		  -N: Specify the node to take down
		  -A: Specify how long applications get warning
		  -k: don't really shutdown, only warn.
		  -r: reboot after shutdown.
		  -h: halt after shutdown.
		  -t secs: delay between SIGTERM and SIGKILL for init.

Future items:

Have a richer set of node classes which can be specified
	user defined class
	n1 or n2 or n3 (to allow non-initnode failover)

Cluster-aware RC scripts
	Need to design and implement features which can be used by a
cluster-aware script.  For example, a script might want to do things on every
node, but differentiate between nodes joining and initial start.  It might
also want to be informed of nodes going down.  It might want to do special
processing for failover.  It must have an ssiconfig comment line in its script.
I envision a special node_class value going into /etc/rc.d/rc.nodeinfo from the
ssiconfig line that marks it cluster-aware and controls where it runs things in
the cluster itself.  This must affect how the GUI and chkconfig react to this
kind of service.

	Allow a set of nodes to boot together.  The RC scripts should still
work since booting is the same as a run-level change, executing the
/etc/rc.d/rc script.
This feature requires enhancements to the init command.

	This program could also be made to immediately start/stop services on
specific nodes not just cluster-wide.  This might be in the form of a
reconfigure button and/or "node start", "node stop" or "node restart" buttons.
The reconfigure button would cause the current node settings to take effect.
To use it you could modify a service's node list parameter and press the
reconfigure button.  The service would be started or stopped as appropriate
on whatever nodes are required.  The "node XX" feature would allow you to
start/stop/restart a service on only a particular node, independent of the
current node settings.

Other /sbin/init improvements
Don't allow run-level changes while nodes are joining or leaving (doing
clusternode_shutdown).  Delay transitioning to SHUTDOWN state if there is
a run-level change in progress.  Must do all node transitions coordinated
through init since it changes the node states so that these restrictions are
maintained.  Only init (process 1) can declare nodes fully up or down with the
exception of going back to UP from the SHUTDOWN state.  Only init (process 1)
can put a node into SHUTDOWN or GOINGDOWN state.

Ultimately, clusternode_shutdown will use process 1 to perform
various functions, so that it can coordinate run-level changes,
clusternode_shutdown, and node up and down transitions. Also, init will be
asked to perform node state changes.

This page last updated on Wed Feb 9 21:41:41 2005 GMT