/**
 *
 * Clusterwide Process Management hooks 
 *
 * 
 *     This is a summary of a longer paper describing the design and the
 * hooks proposed for the clusterwide process model.
 *     The goals include: 
 *     a) no impact on the performance of the base Linux kernel;
 *     b) an installable module to enable the clusterwide process management
 *        capability;
 *     c) flexibility to configure either a very tightly coupled cluster or a
 *        more loosely coupled one;
 *     d) visibility and access to any process from any node; 
 *     e) ability to have distributed process relationships; 
 *     f) clusterwide unique pids; 
 *     g) ability to move running processes from one node to another either at 
 *        exec/fork time or during execution; 
 *     h) ability to have processes continue to execute even if the node they 
 *        were created on leaves the cluster; 
 *     i) ability to retain process relationships in the face of arbitrary node 
 *     	  failure; 
 *     j) optional ability to have a clusterwide /proc/<pid> view with full SSI 
 *     	  semantics.
 *    The above goals are accomplished with the help of several other cluster
 * subsystems.  A cluster membership subsystem that one can register
 * with to get call-backs on node joining/up/leaving/down events is assumed. An
 * inter-node communication subsystem for kernel-to-kernel communication is 
 * assumed.  The ability to transparently do copy-in and copy-out calls from 
 * the kernel on one node to process space on another node is assumed. Remote 
 * device (particularly terminal device) access via remote file block 
 * operations is assumed.  A cluster filesystem is not assumed but would be 
 * valuable.  Instances of all these subsystems exist (in OpenSSI and 
 * elsewhere) and have minimal kernel impact.
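 *    As an illustration only, the assumed membership interface might permit a
 * registration roughly like the sketch below; every name here is hypothetical
 * and merely stands in for whatever the chosen membership subsystem provides:
 *
 *	struct cluster_membership_ops {
 *		void (*node_up)(int node);
 *		void (*node_down)(int node);
 *	};
 *
 *	int cluster_membership_register(const struct cluster_membership_ops *ops);
 *	void cluster_membership_unregister(const struct cluster_membership_ops *ops);
 *
 * The clusterproc module would register at install time and use the node_down
 * callback to rebuild or take over any state it was tracking on behalf of the
 * departed node.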
 *    Clusterwide pids are accomplished by encoding the node number (each node
 * has a unique small integer node number) in the higher order bits of the pid.
 * The task structure is only modified by adding a flag to the flags field and
 * by adding a pointer to a clusterproc structure, which would be allocated
 * only if clusterproc was installed.
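 *    A minimal sketch of one possible encoding follows; the shift width and
 * the helper names are illustrative assumptions, not part of the base kernel:
 *
 *	#define CLUSTERPID_NODE_SHIFT	16
 *	#define CLUSTERPID_LOCAL_MASK	((1 << CLUSTERPID_NODE_SHIFT) - 1)
 *
 *	static inline pid_t clusterpid(int node, pid_t local_pid)
 *	{
 *		return (node << CLUSTERPID_NODE_SHIFT) | local_pid;
 *	}
 *
 *	static inline int clusterpid_node(pid_t pid)
 *	{
 *		return pid >> CLUSTERPID_NODE_SHIFT;
 *	}
 *
 *	static inline pid_t clusterpid_local(pid_t pid)
 *	{
 *		return pid & CLUSTERPID_LOCAL_MASK;
 *	}
 *
 * With such a split each node can allocate local pids without clusterwide
 * coordination, and any node can recover a pid's creation node from the
 * high-order bits.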
 *    To accomplish the availability goals, when a process moves, nothing that
 * can't be rebuilt is left behind.  The process maintains its pid, and the
 * creation node only has to track its existence and location.
 *    To deal with clusterwide relationships, we propose surrogate task
 * structures (a lookup sketch follows the list below).  These are just
 * struct task_struct instances without a kernel stack.  They are not hashed
 * on PIDHASH_PID and are not on the
 * tasks list of local processes. They would be hashed on a clusterproc-only
 * hash.  These surrogates would not be executable but would serve to contain 
 * information about processes running remotely.  Surrogates are proposed for: 
 *    a) each remotely executing child or ptrace_child, linked to the parent; 
 *    b) each remote parent (real or otherwise);
 *    c) process creation node; 
 *    d) possibly for /proc/<pid> support.
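 *    The lookup sketch promised above follows.  Because the task structure
 * only gains a flag and a clusterproc pointer, the hash chain is assumed to
 * live in the clusterproc structure; the layout, hash size and names are
 * illustrative assumptions:
 *
 *	#include <linux/sched.h>
 *	#include <linux/list.h>
 *	#include <linux/hash.h>
 *
 *	#define SURROGATE_HASH_BITS	8
 *
 *	struct clusterproc {
 *		pid_t			pid;
 *		int			exec_node;
 *		struct task_struct	*task;
 *		struct hlist_node	hash_chain;
 *	};
 *
 *	static struct hlist_head surrogate_hash[1 << SURROGATE_HASH_BITS];
 *
 *	static struct task_struct *find_surrogate(pid_t pid)
 *	{
 *		struct hlist_head *head;
 *		struct hlist_node *pos;
 *
 *		head = &surrogate_hash[hash_long(pid, SURROGATE_HASH_BITS)];
 *		hlist_for_each(pos, head) {
 *			struct clusterproc *cp =
 *				hlist_entry(pos, struct clusterproc, hash_chain);
 *			if (cp->pid == pid)
 *				return cp->task;
 *		}
 *		return NULL;
 *	}
 *
 * Whether this hash is protected by the tasklist_lock or by a
 * clusterproc-private lock is left open here.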
 *    Distributed parent/child support, including support for exit, wait and
 * reap, can be done in one of two ways.  One could simply record the
 * existence of remote children with the parent.  Then, on wait, the parent
 * would have to interrogate all remote children to see if any need reaping.
 * Alternatively, one can keep surrogate task structures for remotely
 * executing children and maintain those surrogate structures so the parent
 * can execute its wait completely locally, only going to the child execution
 * node to do reap processing.  The proposal and accompanying hooks are geared 
 * to the second option, in order to avoid the significant overhead inherent 
 * in the simpler option.  On the child execution node there would be a 
 * surrogate task structure for the parent, in part to support orphan process 
 * group calculations.  Certain child events send updates to the parent 
 * execution node to update the child's surrogate task structure.
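 *    As a rough sketch of the second option, the hook below (all names are
 * hypothetical, including the PF_SURROGATE flag and the messaging helper)
 * shows how wait could stay local and only cross the cluster for the final
 * reap of a remote child:
 *
 *	int clusterproc_wait_reap(struct task_struct *child)
 *	{
 *		struct clusterproc *cp = child->clusterproc;
 *
 *		if (!cp || !(child->flags & PF_SURROGATE))
 *			return 0;
 *
 *		return cluster_reap_remote(cp->exec_node, child->pid);
 *	}
 *
 * A return of 0 means the child is local and the base code reaps it as usual;
 * the surrogate's exit_state and exit_code fields are kept current by the
 * update messages mentioned above, so candidate selection in wait never has
 * to leave the node.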
 *    Thread groups would not be split across nodes, although movement of 
 * complete thread groups would be supported.
 *    Process groups and sessions could have members on different nodes, with
 * full process group, session and controlling terminal semantics.  No surrogate
 * task structures are needed.  For each process group or session, the creation 
 * node (also referred to as the origin node) of the id would have the master
 * list of members (local members linked in the standard way, plus a list of
 * nodes that have additional members), along with a sleep lock
 * to protect list walking from list changing.  On other nodes where members 
 * are executing there would be the PIDTYPE_PGID list for local members.  There
 * is some complexity w.r.t. handling orphan process groups.  The base code 
 * exhaustively searches one or more process groups when processes exit and this
 * could be quite expensive if the process group was distributed.  Consequently,
 * when clusterproc is installed, we maintain with the process whether it is
 * part of an orphan or non-orphan pgrp.  The pgrp leader list keeps track of
 * which nodes have processes contributing to the pgrp not being an orphan.
 * On exit, we only have to inform the pgrp leader list node if the
 * local node changes its overall contribution.
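 *    The per-pgrp bookkeeping on the origin node might look roughly like the
 * following; the structure and field names are illustrative assumptions:
 *
 *	#include <linux/list.h>
 *	#include <asm/semaphore.h>
 *
 *	struct pgrp_node_entry {
 *		struct list_head	list;
 *		int			node;
 *		int			non_orphan_count;
 *	};
 *
 *	struct clusterproc_pgrp {
 *		pid_t			pgid;
 *		struct semaphore	list_sem;
 *		struct list_head	member_nodes;
 *	};
 *
 * Here list_sem is the sleep lock that keeps list walking (e.g. a clusterwide
 * kill_pg_info()) from colliding with list changes due to exit or migration,
 * and the per-node non_orphan_count lets the origin node decide whether the
 * pgrp is an orphan without interrogating every member.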
 *    Controlling terminal is supported in part by maintaining the tty pointer
 * in the task structure (if the controlling terminal is local to the process)
 * and by having supplemental fields in the clusterproc structure if the
 * controlling terminal is managed on another node.  Hooks are needed in a few
 * places to interrogate or clear the supplemental clusterproc fields.
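 *    A sketch of the supplemental fields and a helper (hypothetical names;
 * the real clusterproc structure may arrange this differently) might be:
 *
 *	struct clusterproc_ctty {
 *		int	ctty_node;
 *		dev_t	ctty_dev;
 *	};
 *
 *	static inline int
 *	clusterproc_ctty_is_remote(const struct clusterproc_ctty *ct)
 *	{
 *		return ct->ctty_node >= 0;
 *	}
 *
 * where ctty_node would be -1 whenever the controlling terminal is local, in
 * which case tsk->tty is used exactly as in the base kernel.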
 *    Many process-related operations are protected under the tasklist_lock, a
 * base spin-lock.  Clusterproc hook routines are documented as to whether
 * they expect to be called with the tasklist_lock held and whether they might
 * release it to complete their operation.  If it is released, it may be
 * reacquired before returning, and in some cases special return codes are
 * presented to the base code to avoid continuing to follow linked-list chains.
 * To deal with the fact that some operations (e.g. exit, setpgid, setsid) may
 * not always be atomic w.r.t. the tasklist_lock, the proc_lock is made a sleep
 * lock when clusterproc is configured.  Thus setpgid, setsid and exit don't
 * collide (also used in process migration).  A pgrp_list_sleep_lock and
 * session_list_sleep_lock are also introduced.  These are locked when 
 * operations (like kill_pg_info()) are walking the list of remote process
 * group members and when processes are migrating, so processes don't miss
 * actions because of migrations.
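 *    One illustrative hook following the convention just described (all names
 * below are hypothetical) might be:
 *
 *	#define CLUSTERPROC_LOCK_DROPPED	1
 *
 *	int clusterproc_notify_remote_parent(struct task_struct *tsk)
 *	{
 *		struct clusterproc *cp = tsk->clusterproc;
 *
 *		if (!cp || cp->parent_node == this_cluster_node())
 *			return 0;
 *
 *		write_unlock_irq(&tasklist_lock);
 *		cluster_send_exit_event(cp->parent_node, tsk->pid, tsk->exit_code);
 *		write_lock_irq(&tasklist_lock);
 *		return CLUSTERPROC_LOCK_DROPPED;
 *	}
 *
 * Returning CLUSTERPROC_LOCK_DROPPED tells the base code that the
 * tasklist_lock was released, so any list it was walking must be restarted or
 * re-validated rather than followed further.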
 *    Finally, we have two clusterwide sleep locks:
 * clusterwide_migrate_lock and task_capability_lock.  The migrate lock
 * interlocks process migration with 4 clusterwide activities - kill(-1), 
 * setpriority(PRIO_USER), getpriority(PRIO_USER) and capset(-1).  The 
 * capability lock (which is a spinlock if clusterproc is not configured)
 * ensures that only one capability operation is ongoing in the cluster at a
 * time.
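 *    As a rough sketch of how the migrate lock might be used (every helper
 * name here is assumed rather than an existing interface), a clusterwide
 * kill(-1) could hold it across its fan-out so that a migrating process is
 * neither missed nor signalled twice:
 *
 *	int clusterproc_kill_all(int sig, struct siginfo *info)
 *	{
 *		int node, err = 0;
 *
 *		cluster_sleep_lock(&clusterwide_migrate_lock);
 *		for_each_cluster_node(node)
 *			err |= cluster_remote_kill_all(node, sig, info);
 *		cluster_sleep_unlock(&clusterwide_migrate_lock);
 *		return err;
 *	}
 *
 * setpriority(PRIO_USER), getpriority(PRIO_USER) and capset(-1) would take
 * the same lock for the same reason.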
 *    The hooks are currently quite complete except w.r.t. /proc.
 */
