/**
 *
 * Clusterwide Process Management hooks 
 *
 * 
 *     This is a summary of a longer paper describing the design and hooks
 * proposed for the clusterwide process model.
 *     The goals include: 
 *     a) no impact on the performance of the base Linux;
 *     b) installable module to enable the clusterwide process management
 *        capability;
 *     c) flexibility to configure either a very tightly coupled cluster or a
 *        more loosely coupled one;
 *     d) visibility and access to any process from any node; 
 *     e) ability to have distributed process relationships; 
 *     f) clusterwide unique pids; 
 *     g) ability to move running processes from one node to another either at 
 *        exec/fork time or during execution; 
 *     h) ability to have processes continue to execute even if the node they 
 *        were created on leaves the cluster; 
 *     i) ability to retain process relationships in the face of arbitrary node 
 *     	  failure; 
 *     j) optional ability to have a clusterwide /proc/<pid> view with full SSI 
 *     	  semantics.
 *    The above goals are accomplished with the help of several other cluster
 * subsystems.  A cluster membership subsystem that one can register
 * with to get call-backs on node joining/up/leaving/down events is assumed. An
 * inter-node communication subsystem for kernel-to-kernel communication is 
 * assumed.  The ability to transparently do copy-in and copy-out calls from 
 * the kernel on one node to process space on another node is assumed. Remote 
 * device (particularly terminal device) access via remote file block 
 * operations is assumed.  A cluster filesystem is not assumed but would be 
 * valuable.  Instances of all these subsystems exist (in OpenSSI and 
 * elsewhere) and have minimal kernel impact.
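 *    Purely as an illustration of the kind of membership interface assumed
 * (the type and function names here are hypothetical, not the actual OpenSSI
 * API), registering for node up/down events might look like this:
 *
 *        #include <stdint.h>
 *
 *        struct node_info {
 *                uint16_t node_num;        // the node's small unique number
 *        };
 *
 *        struct membership_ops {
 *                void (*node_up)(const struct node_info *node);
 *                void (*node_down)(const struct node_info *node);
 *        };
 *
 *        // Returns 0 on success, a negative errno-style value on failure.
 *        int register_membership_callbacks(const struct membership_ops *ops);
 *
 * The clusterwide process management code would register such callbacks so
 * that, when a node goes down, surrogates and lists referring to it can be
 * cleaned up or rebuilt on a surviving node.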
 *    Clusterwide pids are accomplished by encoding the node number (each node
 * has a unique small integer node number) in the higher order bits of the pid.
 * The task structure is only modified by adding a flag to the flags field and
 * by adding a pointer to a clusterproc structure, which would be allocated
 * only if clusterproc is installed.
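 *    A minimal sketch of such an encoding follows.  The macro names and the
 * particular bit split are assumptions chosen for illustration, not the
 * actual OpenSSI definitions:
 *
 *        #include <stdint.h>
 *
 *        #define CLUSTER_NODE_BITS    9           // assume up to 512 nodes
 *        #define CLUSTER_LOCAL_BITS   (30 - CLUSTER_NODE_BITS)
 *        #define CLUSTER_LOCAL_MASK   ((1u << CLUSTER_LOCAL_BITS) - 1)
 *
 *        // Build a clusterwide pid from a node number and a node-local pid.
 *        static inline uint32_t make_cluster_pid(uint32_t node, uint32_t local)
 *        {
 *                return (node << CLUSTER_LOCAL_BITS) | (local & CLUSTER_LOCAL_MASK);
 *        }
 *
 *        // Recover the creation node from a clusterwide pid.
 *        static inline uint32_t cluster_pid_node(uint32_t pid)
 *        {
 *                return pid >> CLUSTER_LOCAL_BITS;
 *        }
 *
 *        // Recover the node-local part of a clusterwide pid.
 *        static inline uint32_t cluster_pid_local(uint32_t pid)
 *        {
 *                return pid & CLUSTER_LOCAL_MASK;
 *        }
 *
 * With such a layout, cluster_pid_node() yields a process's creation node
 * directly from its pid, without any lookup.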
 *    To accomplish the availability goals, when a process moves, nothing that
 * can't be rebuilt is left behind.  The process maintains its pid, and the
 * creation node only has to track its existence and location.
 *    To deal with clusterwide relationships, we are proposing the use of
 * surrogate task structures.  These are just struct task_structs but don't
 * have a kernel stack.  They are not hashed on PIDHASH_PID and are not on the 
 * tasks list of local processes. They would be hashed on a clusterproc-only
 * hash.  These surrogates would not be executable but would serve to contain 
 * information about processes running remotely.  Surrogates are proposed for: 
 *    a) each remotely executing child or ptrace_child, linked to the parent; 
 *    b) each remote parent (real or otherwise);
 *    c) process creation node; 
 *    d) possibly for /proc/<pid> support.
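 *    As a rough sketch of the clusterproc-only hash, a surrogate might carry
 * roughly the following information.  In the proposal the surrogate is a full
 * struct task_struct without a kernel stack; the reduced stand-in type and
 * all of the names below are illustrative assumptions, not the actual code:
 *
 *        #include <stddef.h>
 *        #include <stdint.h>
 *
 *        #define CSURR_HASH_SIZE 256
 *
 *        struct surrogate_task {
 *                uint32_t pid;              // clusterwide pid of the remote task
 *                uint16_t exec_node;        // node where the task actually runs
 *                int      exit_state;       // cached state, e.g. for a local wait()
 *                struct surrogate_task *hash_next;
 *        };
 *
 *        static struct surrogate_task *csurr_hash[CSURR_HASH_SIZE];
 *
 *        // Look up a surrogate by clusterwide pid on the clusterproc-only hash.
 *        static struct surrogate_task *csurr_find(uint32_t pid)
 *        {
 *                struct surrogate_task *s = csurr_hash[pid % CSURR_HASH_SIZE];
 *
 *                for (; s != NULL; s = s->hash_next)
 *                        if (s->pid == pid)
 *                                return s;
 *                return NULL;
 *        }
 *
 * Because surrogates live only on this hash and never on PIDHASH_PID or the
 * local tasks list, the base kernel's pid lookups never see them.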
 * Distributed parent/child support, including support for exit, wait and reap, 
 * can be done in one of two ways.  One could simply record with the parent
 * the existence of remote children.  Then, on wait, the parent would have to
 * interrogate all remote children to see if any need reaping.  Alternatively,
 * one can have surrogate task structures for remotely executing children and
 * maintain those surrogates so that the parent can execute its wait completely
 * locally, only going to the child execution node to do reap processing.  The
 * proposal and accompanying hooks are geared
 * to the second option, in order to avoid the significant overhead inherent 
 * in the simpler option.  On the child execution node there would be a 
 * surrogate task structure for the parent, in part to support orphan process 
 * group calculations.  Certain child events send updates to the parent 
 * execution node to update the child's surrogate task structure.
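 *    As an illustration of the second option (reusing the surrogate_task
 * stand-in sketched above; the helper names here are hypothetical, not the
 * proposed hooks), the parent's wait path could scan its remote children
 * entirely locally:
 *
 *        // Scan the parent's remote children.  Each is represented locally by
 *        // a surrogate whose exit_state is kept current by updates from the
 *        // child's execution node, so no remote traffic is needed to decide
 *        // whether anything is reapable.
 *        static struct surrogate_task *
 *        wait_scan_surrogates(struct surrogate_task **children, size_t nchildren)
 *        {
 *                size_t i;
 *
 *                for (i = 0; i < nchildren; i++)
 *                        if (children[i]->exit_state != 0)   // zombie: ready to reap
 *                                return children[i];
 *                return NULL;                                // nothing to reap yet
 *        }
 *
 * Only when the scan finds a reapable child does the parent contact that
 * child's execution node (e.g. via a hypothetical remote_reap(exec_node, pid)
 * message) to complete the reap.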
 *    Thread groups would not be split across nodes, although movement of 
 * complete thread groups would be supported.
 *    Process groups and sessions could have members on different nodes, with
 * full process group, session and controlling terminal semantics.  No surrogate
 * task structures are needed.  For each process group or session, the creation
 * node (also referred to as the origin node) of the id would have the master
 * list of members (it would actually have local members linked in the standard
 * way, plus a list of nodes that have additional members), along with a sleep lock
 * to protect list walking from list changing.  On other nodes where members 
 * are executing there would be the PIDTYPE_PGID list for local members.  There
 * is some complexity w.r.t. handling orphan process groups.  The base code 
 * exhaustively searches one or more process groups when processes exit and this
 * could be quite expensive if the process group were distributed.  Consequently,
 * when clusterproc is installed, we record with each process whether it is
 * part of an orphan or non-orphan pgrp.  The pgrp leader list keeps track of
 * which nodes have processes that keep the pgrp from being orphan.  On exit,
 * we only have to inform the node holding the pgrp leader list if the local
 * node's overall contribution changes.
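 *    A rough sketch of the bookkeeping this implies at the pgrp origin node
 * (the type, field and function names are assumptions made for illustration):
 *
 *        #include <stdbool.h>
 *        #include <stdint.h>
 *
 *        #define MAX_NODES 512
 *
 *        struct pgrp_origin {
 *                uint32_t pgid;                       // clusterwide process group id
 *                bool     member_node[MAX_NODES];     // nodes with members of this pgrp
 *                bool     nonorphan_node[MAX_NODES];  // nodes keeping the pgrp non-orphan
 *                int      nonorphan_count;            // how many nodes contribute
 *        };
 *
 *        // Invoked (conceptually, via a message to the origin node) only when a
 *        // node's overall contribution changes, not on every local exit.
 *        static void pgrp_node_contribution(struct pgrp_origin *po,
 *                                           uint16_t node, bool contributes)
 *        {
 *                if (po->nonorphan_node[node] == contributes)
 *                        return;                      // no net change for this node
 *                po->nonorphan_node[node] = contributes;
 *                po->nonorphan_count += contributes ? 1 : -1;
 *                if (po->nonorphan_count == 0) {
 *                        // the pgrp has just become orphan; the usual SIGHUP/SIGCONT
 *                        // handling for stopped members would be triggered here
 *                }
 *        }
 *
 * The orphan check at the origin node is then a single counter test rather
 * than an exhaustive clusterwide search of the process group.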
 *    The controlling terminal is supported in part by maintaining the tty pointer
 * in the task structure (if the controlling terminal is local to the process)
 * and by having supplemental fields in the clusterproc structure if the
 * controlling terminal is managed on another node.  Hooks are needed in a few
 * places to interrogate or clear the supplemental clusterproc fields.
 *    Many process-related operations are protected under the tasklist_lock, a
 * base spin-lock.  Clusterproc hook routines are documented as to whether
 * they expect to be called with the tasklist_lock held and whether they might
 * release it to complete their operation.  If it is released, it may be
 * reacquired before returning, and in some cases special return codes are
 * presented to the base code to avoid continuing to follow linked-list chains.
 * To deal with the fact that some operations (e.g. exit, setpgid, setsid) may
 * not always be atomic w.r.t. the tasklist_lock, the proc_lock is made a sleep
 * lock when clusterproc is configured.  Thus setpgid, setsid and exit don't
 * collide (the proc_lock is also used in process migration).  A pgrp_list_sleep_lock and
 * session_list_sleep_lock are also introduced.  These are locked when 
 * operations (like kill_pg_info()) are walking the list of remote process
 * group members and when processes are migrating.  As a result, processes
 * don't miss actions because of migrations.
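 *    Purely to illustrate that calling convention (the hook name and return
 * code below are made up for this sketch, not part of the proposed hook set),
 * a hook that may drop the tasklist_lock could be documented and used roughly
 * as follows:
 *
 *        #define CLUSTERPROC_RESTART 1   // hypothetical: "the list may have changed"
 *
 *        // Hypothetical hook.  Called with tasklist_lock held.  May release and
 *        // reacquire it while messaging remote nodes; if it does, it returns
 *        // CLUSTERPROC_RESTART so the caller abandons the (possibly stale)
 *        // chain it was following and restarts its walk.
 *        int clusterproc_signal_remote_pgrp(int pgid, int sig);
 *
 *        // Corresponding caller pattern in the base code (sketch):
 *        //
 *        //     do {
 *        //             read_lock(&tasklist_lock);
 *        //             ... signal local members of pgid ...
 *        //             rc = clusterproc_signal_remote_pgrp(pgid, sig);
 *        //             read_unlock(&tasklist_lock);
 *        //     } while (rc == CLUSTERPROC_RESTART);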
 *    Finally, we have two clusterwide sleep locks -
 * clusterwide_migrate_lock and task_capability_lock.  The migrate lock
 * interlocks process migration with four clusterwide activities - kill(-1),
 * setpriority(PRIO_USER), getpriority(PRIO_USER) and capset(-1).  The
 * capability lock (which is a spinlock if clusterproc is not configured)
 * ensures only one capability operation is in progress in the cluster at a
 * time.
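 *    As an example of how the migrate lock is used (a sketch only; the three
 * helpers below are hypothetical wrappers, not the real interfaces), kill(-1)
 * and process migration might interlock as follows:
 *
 *        // Hypothetical clusterwide sleep-lock wrappers and signalling helper.
 *        void clusterwide_migrate_lock_acquire(void);
 *        void clusterwide_migrate_lock_release(void);
 *        int  signal_all_nodes(int sig);   // fan the signal out to every node
 *
 *        // kill(-1): while the migrate lock is held no process can migrate, so
 *        // none can dodge the signal (or receive it twice) by moving between
 *        // nodes mid-operation.
 *        static int cluster_kill_all(int sig)
 *        {
 *                int err;
 *
 *                clusterwide_migrate_lock_acquire();
 *                err = signal_all_nodes(sig);
 *                clusterwide_migrate_lock_release();
 *                return err;
 *        }
 *
 * Process migration takes the same lock around the move itself, so the two
 * activities serialize cleanly.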
 * The hooks are currently quite complete except w.r.t. /proc.
 */
