cluster-commits July 2010

cluster-commits@lists.fedorahosted.org

9 participants
175 discussions

cluster: RHEL56 - rgmanager: Document failover domains
by Lon Hohberger 29 Jul '10

Gitweb: http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=82… Commit: 8270fa123369c509fda7ef31db75dfe93d2d3749 Parent: 97f2d5b9bc72174c8ecfbe26dbbf89aabd4958ab Author: Lon Hohberger <lhh(a)redhat.com> AuthorDate: Mon Jul 12 16:41:38 2010 -0400 Committer: Lon Hohberger <lhh(a)redhat.com> CommitterDate: Thu Jul 29 13:48:32 2010 -0400 rgmanager: Document failover domains This is a backport of the following commit from STABLE3: 3e34a1933d625f315dd0cc9778ef603dc17b4c59 Resolves: rhbz#545229 Signed-off-by: Lon Hohberger <lhh(a)redhat.com> --- rgmanager/man/clurgmgrd.8 | 356 ++++++++++++++++++++++++++++++++++++++++++++- 1 files changed, 350 insertions(+), 6 deletions(-) diff --git a/rgmanager/man/clurgmgrd.8 b/rgmanager/man/clurgmgrd.8 index 7b43925..af4d324 100644 --- a/rgmanager/man/clurgmgrd.8 +++ b/rgmanager/man/clurgmgrd.8 @@ -1,9 +1,9 @@ -.TH "clusvcmgrd" "8" "Jan 2005" "" "Red Hat Cluster Suite" +.TH "clurgmgrd" "8" "Jul 2010" "" "Red Hat Cluster Suite" .SH "NAME" -Clurgmgrd \- Resource Group (Cluster Service) Manager Daemon +clurgmgrd \- Cluster Resource Group (Service) Manager Daemon .SH "DESCRIPTION" .PP -.B Clurgmgrd +.B rgmanager handles management of user-defined cluster services (also known as resource groups). This includes handling of user requests including service start, service disable, service relocate, and service restart. The service @@ -12,8 +12,8 @@ event of failures. .SH "HOW IT WORKS" .PP The service manager is spawned by an init script after the cluster -infrastructure has been started and ccsd has been spawned, and only -functions when the cluster is quorate. +infrastructure has been started and only functions when the cluster +is quorate and locks are working. .LP During initialization, the service manager runs scripts which ensure that all services are clear to be started. After that, it determines which services @@ -26,11 +26,353 @@ the member has been fenced whenever fencing is available. 
When a cluster member determines that it is no longer in the cluster quorum, the service manager stops all services and waits for a new quorum to form. +.SH "CONFIGURATION" +.PP +Rgmanager is configured via cluster.conf. With the exception of logging, +all of rgmanager's configuration resides with the +.B <rm> +tag. The general parameters for rgmanager are as follows: +.LP +.B central_processing +- Enable central processing mode (requires cluster-wide shut down and +restart of rgmanager). This alternative mode of handling failures +externalizes most of rgmanager's features into a user-editable script. +This mode is disabled by default. +.LP +.B status_poll_interval +- This defines the amount of time, in seconds, rgmanager waits +between resource tree scans for status checks. Decreasing this value +may improve rgmanager's ability to detect failures in services, but +at a cost of decreased performance and increased system utilization. +The default is 10 seconds. +.LP +.B status_child_max +- Maximum number of status check threads (default = 5). It is not +recommended that this ever be changed. This simply controls how +many instances of clustat queries may be outstanding on a single +node at any given time. +.LP +.B transition_throttling +- This is the amount of time the event processing thread stays alive +after the last event has been processed. The default is 5 seconds. +It is not recommended that this ever be changed. +.LP +.B log_level +- DEPRECATED; DO NOT USE. Controls log level filtering to syslog. +Default is 5; valid values range from 0-7. See cluster.conf(5) +for the current method to configure logging. +.LP +.B log_facility +- DEPRECATED; DO NOT USE. Controls log level facility when sending +messages to syslog. Default is "daemon". See cluster.conf(5) +for the current method to configure logging. + +.SH "RESOURCE AGENTS" +.PP +.B Resource agents +define resource classes rgmanager can manage. 
Rgmanager follows the Open +Cluster Framework Resource Agent API v1.0 (draft) standard, with the following +two notable exceptions: +.LP +.in 8 +* Rgmanager does not call \fImonitor\fP; it only calls \fIstatus\fP +.in +.in 8 +* Rgmanager looks for resource agets in /usr/share/cluster +.in +.LP +Rgmanager uses the metadata from resource agents to determine what +parameters to look for in cluster.conf for a each resource type. Viewing +the resource agent metadata is the best way to understand all the various +resource agent parameters. + +.SH "SERVICES / RESOURCE GROUPS" +.PP +A +.B service +or +.B resource group +is a collection of resources defined in cluster.conf for rgmanager's +use. Resource groups are also called +.B resource trees. +.LP +A resource group is the atomic unit of failover in rgmanager. That +is, even though rgmanager calls out to various resource agents +individually in order to start or stop various resources, everything +in the resource group is always moved around together +in the event of a relocation or failover. + +.SH "STARTUP POLICIES" +.PP +Rgmanager supports only two startup policies, +.LP +.B autostart +- if set to 1 (the default), the service is started when a quorum +forms. If set to 0, the service is not automatically started. +.LP +Startup Policy Configuration: +Recovery Configuration: +.in 8 +<rm> +.in 10 +<service name="service1" autostart="[0|1]" .../> +.in 8 +.in 10 + ... +.in 8 +</rm> + +.SH "RECOVERY POLICIES" +.PP +Rgmanager supports three recovery policies for services; this is +configured by the +.B +recovery +parameter in the service definition. +.LP +.B restart +- means to attempt to restart the resource group in place in the +event of one or more failures of individual resources. This can +further be augmented by the +.B max_restarts +and +.B restart_expire_time +parameters, which define a tolerance for the amount of service +restarts over the given amount of time. 
+.LP +.B relocate +- means to move the resource group to another host in the cluster +instead of restarting on the same host. +.LP +.B disable +- means to not try to recover the resource group. Instead, just +place it in to the disabled state. +.LP +Recovery Configuration: +.in 8 +<rm> +.in 10 +<service name="service1" recovery="[restart|relocate|disable]" .../> +.in 8 +.in 10 + ... +.in 8 +</rm> + +.SH "FAILOVER DOMAINS" +.PP +A failover domain is an ordered subset of members to which a +service may be bound. The following is a list of semantics +governing the options as to how the different configuration +options affect the behavior of a failover domain: +.LP +.B preferred node +or +.B preferred member +: The preferred node was the member designated to run a given +service if the member is online. We can emulate this behavior +by specifying an unordered, unrestricted failover domain of +exactly one member. +.LP +.B restricted domain +: Services bound to the domain may only run on cluster members +which are also members of the failover domain. If no members +of the failover domain are available, the service is placed +in the stopped state. +.LP +.B unrestricted domain +: Services bound to this domain may run on all cluster members, +but will run on a member of the domain whenever one is +available. This means that if a service is running outside of +the domain and a member of the domain comes online, the +service will migrate to that member. +.LP +.B ordered domain +: The order specified in the configuration dictates the order +of preference of members within the domain. The +highest-ranking member of the domain will run the service +whenever it is online. This means that if member A has a +higher rank than member B, the service will migrate to A if it +was running on B if A transitions from offline to online. +.LP +.B unordered domain +: Members of the domain have no order of preference; any +member may run the service. 
Services will always migrate to +members of their failover domain whenever possible, however, +in an unordered domain. +.LP +.B nofailback +: Enabling this option for an ordered failover domain will +prevent automated fail-back after a more-preferred node +rejoins the cluster. Consequently, nofailback requires an +ordered domain in order to be meaningful. When nofailback +is used, the following two behaviors should be noted: +.in 8 +* If a subset of cluster nodes forms a quorum, the node +with the highest priority in the failover domain is selected +to run a service bound to the domain. After this point, a +higher priority member joining the cluster will not trigger a +relocation. +.in +.in 8 +* When a service is running outside of its unrestricted +failover domain and a cluster member boots which is a part +of the service's failover domain, the service will relocate +to that member. That is, nofailback does not prevent +transitions from outside of a failover domain to inside a +failover domain. After this point, a higher priority member +joining the cluster will not trigger a relocation. +.in +.LP +Ordering, restriction, and nofailback are flags and may +be combined in almost any way (ie, ordered+restricted, +unordered+unrestricted, etc.). These combinations affect both +where services start after initial quorum formation and which +cluster members will take over services in the event that +the service has failed. +.LP +Failover Domain Configuration: +.in 8 +<rm> +.in 10 +<failoverdomains> +.in 12 +<failoverdomain name="NAME" ordered="[0|1]" restricted="[0|1]" nofailback="[0|1" > +.in 14 +<failoverdomainnode name="node1" priority="[1..100]" /> +.in 12 +.in 14 + ... +.in 12 +</failoverdomain> +.in 10 +</failoverdomains> +.in 8 +.in 10 + ... +.in 8 +</rm> + +.SH "SERVICE OPERATIONS" +.PP +These are how the basic user-initiated service operations +(via +.B clusvcadm +) work. 
+.LP +.B enable +- start the service, optionally on a preferred target and +optionally according to failover domain rules. In absence +of either, the local host where clusvcadm is run will start +the service. If the original start fails, the service behaves +as though a relocate operation was requested (see below). If +the operation succeeds, the service is placed in the started state. +.LP +.B disable +- stop the service and place into the disabled state. This +is the only permissible operation when a service is in the failed state. +.LP +.B relocate +- move the service to another node. Optionally, the +administrator may specify a preferred node to receive the +service, but the inability for the service to run on that +host (e.g. if the service fails to start or the host is offline) +does not prevent relocation, and another node is chosen. +Rgmanager attempts to start the service on every permissible node +in the cluster. If no permissible target node in the cluster +successfully starts the service, the relocation fails and the +service is attempted to be restarted on the original owner. +If the original owner can not restart the service, the service is +placed in the stopped state. +.LP +.B stop +- stop the service and place into the stopped state. +.LP +.B migrate +- migrate the virtual machine to another node. The administrator +must specify a target node. Depending on the failure, a failure +to migrate may result with the virtual machine in the failed state +or in the started state on the original owner. +.LP +.B freeze +- freeze the service or virtual machine in place and prevent +status checks from occurring. Administrators may do this in order +to perform maintenance on one or more parts of a given service +without having rgmanager interfere. It is very important that +the administrator unfreezes the service once maintenance is +complete, as a frozen service will not fail over. Freezing +a service does NOT affect is operational state. 
For example, +it does not 'pause' virtual machines or suspend them to disk. +.LP +.B unfreeze +- unfreeze (thaw) the service or virtual machine. This command +makes rgmanager perform status checks on the service again. + +.SH "SERVICE STATES" +.PP +These are the most common service states. +.LP +.B disabled +- The service will remain in the disabled state until either an +administrator re-enables the service or the cluster loses quorum +(when the cluster regains quorum, the autostart parameter is +evaluated). An administrator may enable the service from this state. +.LP +.B failed +- The service is presumed dead. A service is placed in to this +state whenever a resource's stop operation fails. After a service +is placed in to this state, the administrator must verify that there +are no allocated resources (mounted file systems, etc.) prior to +issuing a disable request. The only operation which can take place +when a service has entered this state is a disable. +.LP +.B stopped +- When in the stopped state, the service will be evaluated for +starting after the next service or node transition. This is considered +a temporary state. An administrator may disable or enable the service +from this state. +.LP +.B recovering +- The cluster is trying to recover the service. An administrator may +disable the service to prevent recovery if desired. +.LP +.B started +- If a service status check fails, recover it according to the service +recovery policy. If the host running the service fails, recover it +following failover domain & exclusive service rules. An +administrator may relocate, stop, disable, and (with virtual +machines) migrate the service from this state. + +.SH "VIRTUAL MACHINE FEATURES" +.PP +Apart from what is noted in the VM resource agent, rgmanager +provides a few convenience features when dealing with virtual machines. 
+.in 8 +* it will use live migration when transferring a virtual machine +to a more-preferred host in the cluster as a consequence of +failover domain operation +.in +.in 8 +* it will search the other instances of rgmanager in the cluster +in the case that a user accidentally moves a virtual machine +using other management tools +.in +.in 8 +* unlike services, adding a virtual machine to rgmanager's +configuration will not cause the virtual machine to be restarted +.in +.in 8 +* removing a virtual machine from rgmanager's configuration +will leave the virtual machine running. +.in + .SH "COMMAND LINE OPTIONS" .IP \-f Run in the foreground (do not fork). .IP \-d Enable debug-level logging. +.IP \-w +Disable internal process monitoring (for debugging). .IP \-N Do not perform stop-before-start. Combined with the .I -Z @@ -38,4 +380,6 @@ flag to clusvcadm, this can be used to allow rgmanager to be upgraded without stopping a given user service or set of services. .SH "SEE ALSO" -clusvcadm(8), ccsd(8) +http://sources.redhat.com/cluster/wiki/RGManager + +clusvcadm(8), cluster.conf(5)
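The failover domain semantics documented in the man page text above can be illustrated with a small selection routine. This is only a sketch under stated assumptions, not rgmanager's implementation: `struct fd_node`, `pick_domain_node`, `FD_NONE`, and `FD_ANY_OTHER` are hypothetical names, a lower `priority` number is assumed to mean a more-preferred member, and neither nofailback nor the current owner of the service is modelled.

```c
#include <stddef.h>

/* Hypothetical, simplified model of one failover domain member. */
struct fd_node {
	const char *name;
	int priority;	/* assumption: lower number = more preferred */
	int online;	/* 1 if the member is currently in the cluster */
};

#define FD_NONE      (-1)	/* restricted domain with no member online */
#define FD_ANY_OTHER (-2)	/* unrestricted: any other cluster member may run it */

/*
 * Return the index of the domain member that should run the service.
 * ordered:    prefer the online member with the best priority
 * restricted: never fall back to members outside the domain
 */
static int pick_domain_node(const struct fd_node *nodes, size_t n,
			    int ordered, int restricted)
{
	int best = FD_NONE;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!nodes[i].online)
			continue;
		if (best == FD_NONE) {
			best = (int)i;
			if (!ordered)
				break;	/* unordered: any online member will do */
		} else if (nodes[i].priority < nodes[best].priority) {
			best = (int)i;
		}
	}

	if (best == FD_NONE && !restricted)
		return FD_ANY_OTHER;
	return best;
}
```

Picking the first online member in the unordered case is an arbitrary choice made for the sketch; the man page only says that any member may run the service.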

cluster: RHEL56 - resource-agents: Fix vxfs support
by Lon Hohberger 29 Jul '10

Gitweb: http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=97…
Commit: 97f2d5b9bc72174c8ecfbe26dbbf89aabd4958ab
Parent: 9c2ff30de99fa340feb5b86c801829ffd1638f69
Author: Lon Hohberger <lhh(a)redhat.com>
AuthorDate: Mon Nov 9 17:10:46 2009 -0500
Committer: Lon Hohberger <lhh(a)redhat.com>
CommitterDate: Thu Jul 29 13:48:10 2010 -0400

resource-agents: Fix vxfs support

Resolves: rhbz#531843
Signed-off-by: Lon Hohberger <lhh(a)redhat.com>
---
 rgmanager/src/resources/fs.sh | 1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/rgmanager/src/resources/fs.sh b/rgmanager/src/resources/fs.sh
index 8214079..900dca5 100755
--- a/rgmanager/src/resources/fs.sh
+++ b/rgmanager/src/resources/fs.sh
@@ -1017,6 +1017,7 @@ Cannot mount $dev on $mp, the device or mount point is already in use!"
 	ext3) typeset fsck_needed="" ;;
 	jfs) typeset fsck_needed="" ;;
 	xfs) typeset fsck_needed="" ;;
+	vxfs) typeset fsck_needed="" ;;
 	ext2) typeset fsck_needed=yes ;;
 	minix) typeset fsck_needed=yes ;;
 	vfat) typeset fsck_needed=yes ;;

cluster: RHEL56 - resource-agents: Fix samba netbios name
by Lon Hohberger 29 Jul '10

Gitweb: http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=9c…
Commit: 9c2ff30de99fa340feb5b86c801829ffd1638f69
Parent: 72ec71ba0a79be20edf1093d0185c67dd01970da
Author: Lon Hohberger <lhh(a)redhat.com>
AuthorDate: Mon Nov 9 17:12:59 2009 -0500
Committer: Lon Hohberger <lhh(a)redhat.com>
CommitterDate: Thu Jul 29 13:48:00 2010 -0400

resource-agents: Fix samba netbios name

Spaces should not be allowed in the NetBIOS name.

Resolves: rhbz#531843
Signed-off-by: Lon Hohberger <lhh(a)redhat.com>
---
 rgmanager/src/resources/samba.sh | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/rgmanager/src/resources/samba.sh b/rgmanager/src/resources/samba.sh
index da9a9e6..857555c 100755
--- a/rgmanager/src/resources/samba.sh
+++ b/rgmanager/src/resources/samba.sh
@@ -92,7 +92,7 @@ generate_config_file()
 	echo "pid directory = \"$SAMBA_pid_dir\"" >> "$generated_file"
 	echo "interfaces = $ip_addresses" >> "$generated_file"
 	echo "bind interfaces only = Yes" >> "$generated_file"
-	echo "netbios name = \"$OCF_RESKEY_name\"" >> "$generated_file"
+	echo "netbios name = ${OCF_RESKEY_name/ /_}" >> "$generated_file"
 	echo >> "$generated_file"
 	sed 's/^[[:space:]]*pid directory/### pid directory/i;s/^[[:space:]]*interfaces/### interfaces/i;s/^[[:space:]]*bind interfaces only/### bind interfaces only/i;s/^[[:space:]]*netbios name/### netbios name/i' \
 		< "$original_file" >> "$generated_file"

cluster: STABLE3 - resource-agents: Remove nfs service temp directories
by Lon Hohberger 29 Jul '10

Gitweb: http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=db…
Commit: db1a93f32b36da95a19716ed1b6832b04f4c68dd
Parent: 0697d629bf2908b0ea09caf974352410762a566e
Author: Carlos Eduardo Maiolino <cmaiolin(a)redhat.com>
AuthorDate: Thu Jul 29 13:45:29 2010 -0400
Committer: Lon Hohberger <lhh(a)redhat.com>
CommitterDate: Thu Jul 29 13:46:38 2010 -0400

resource-agents: Remove nfs service temp directories

Resolves: rhbz#595455
Signed-off-by: Lon Hohberger <lhh(a)redhat.com>
---
 rgmanager/src/resources/svclib_nfslock | 4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/rgmanager/src/resources/svclib_nfslock b/rgmanager/src/resources/svclib_nfslock
index f69a57c..6ea3f42 100644
--- a/rgmanager/src/resources/svclib_nfslock
+++ b/rgmanager/src/resources/svclib_nfslock
@@ -19,7 +19,7 @@
 #
 nfslock_statd_notify()
 {
-	declare tmpdir=$(mktemp -d /tmp/statd-$2.XXXXXX)
+	declare tmpdir
 	declare nl_dir=$1
 	declare nl_ip=$2
 	declare command # Work around bugs in rpc.statd
@@ -35,6 +35,8 @@ nfslock_statd_notify()
 		ocf_log debug "No hosts to notify"
 		return 0
 	fi
+
+	tmpdir=$(mktemp -d /tmp/statd-$2.XXXXXX)

 	# Ok, copy the HA directory to something we can use.
 	mkdir -p $tmpdir/sm

cluster: RHEL55 - cman: fix consensus calculation
by Lon Hohberger 28 Jul '10

Gitweb: http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=bb…
Commit: bbb7969d9cdcfa2b5a129b69148dde8034d071df
Parent: bb581356afc53b8b1dba8594cae5ee8970f8ce58
Author: Lon Hohberger <lhh(a)redhat.com>
AuthorDate: Wed Jul 28 15:49:12 2010 -0400
Committer: Lon Hohberger <lhh(a)redhat.com>
CommitterDate: Wed Jul 28 19:02:48 2010 -0400

cman: fix consensus calculation

This is a backport of Fabio's patch:

043c603d46ab401e69cb8e09a3a818e2006134c5

Instead of using the object database, it simply queries CCS
to gather the node count for consensus calculation later.

Resolves: rhbz#618639
Signed-off-by: Lon Hohberger <lhh(a)redhat.com>
---
 cman/daemon/ais.c    | 28 ++++++++++++++++++++++------
 cman/daemon/config.c | 25 +++++++++++++++++++++++++
 2 files changed, 47 insertions(+), 6 deletions(-)

diff --git a/cman/daemon/ais.c b/cman/daemon/ais.c
index ab51b12..c0d7525 100644
--- a/cman/daemon/ais.c
+++ b/cman/daemon/ais.c
@@ -52,7 +52,6 @@ extern char *key_filename;
 extern unsigned int quorumdev_poll;
 extern unsigned int ccsd_poll_interval;
 extern unsigned int shutdown_timeout;
-extern int two_node;
 extern int init_config(struct objdb_iface_ver0 *objdb);

 struct totem_ip_address mcast_addr[MAX_INTERFACES];
@@ -60,6 +59,7 @@ struct totem_ip_address ifaddrs[MAX_INTERFACES];
 int num_interfaces;
 uint64_t incarnation;
 int num_ais_nodes;
+unsigned int node_count = 0;

 static int config_run;
 static int startup_pipe;
@@ -522,9 +522,20 @@ static int comms_init_ais(struct objdb_iface_ver0 *objdb)
 			"60", strlen("60")+1);
 	}

-	/* bz#611391
-	 * consensus should be 1.2*token or for 0.2*token for two_node clusters
+	/*
+	 * consensus should be:
+	 * 2 nodes - 200 ms <= consensus = token * 0.2 <= 2000
+	 * > 2 nodes - consensus = token + 2000
+	 *
+	 * autoconfig clusters will work as > 2 nodes
+	 *
+	 * See 611391#c19
 	 */
+
+	/* if we are running in autoconfig or we can't count the nodes, then play safe */
+	if ((getenv("CMAN_NOCONFIG")) || (node_count == 0))
+		node_count=3;
+
 	if (objdb_get_string(objdb, object_handle, "consensus", &value)) {
 		unsigned int token=0;
 		unsigned int consensus;
@@ -532,10 +543,15 @@ static int comms_init_ais(struct objdb_iface_ver0 *objdb)

 		objdb_get_int(objdb, object_handle, "token", &token);

-		if (two_node)
+		if (node_count > 2) {
+			consensus = (float)token+2000;
+		} else {
 			consensus = (float)token*0.2;
-		else
-			consensus = (float)token*1.2;
+			if (consensus < 200)
+				consensus = 200;
+			if (consensus > 2000)
+				consensus = 2000;
+		}

 		snprintf(calc_consensus, sizeof(calc_consensus), "%d", consensus);
 		objdb->object_key_create(object_handle, "consensus", strlen("consensus"),
diff --git a/cman/daemon/config.c b/cman/daemon/config.c
index 86ea2fe..29049d7 100644
--- a/cman/daemon/config.c
+++ b/cman/daemon/config.c
@@ -22,6 +22,8 @@
 #define MAXXMLNODES 1024
 #endif

+extern int node_count;
+
 static int read_config_for(int ccs_fd, struct objdb_iface_ver0 *objdb, unsigned
 			   int parent, char *object, char *key, int always_create)
 {
@@ -128,6 +130,27 @@ static int read_config_for(int ccs_fd, struct objdb_iface_ver0 *objdb, unsigned
 	return gotcount;
 }

+static int count_clusternodes(int cd)
+{
+	char path[256];
+	int count = 1;
+	char *val;
+
+	do {
+		snprintf(path, sizeof(path),
+			 "/cluster/clusternodes/clusternode[%d]/@name",
+			 count);
+
+		if (ccs_get(cd, path, &val) != 0)
+			break;
+
+		free(val);
+		++count;
+	} while (1);
+
+	return count-1;
+}
+
 int init_config(struct objdb_iface_ver0 *objdb)
 {
 	int cd, err;
@@ -136,6 +159,8 @@ int init_config(struct objdb_iface_ver0 *objdb)
 	if (cd < 0)
 		return -1;

+	node_count = count_clusternodes(cd);
+
 	/* These first few are just versions of openais.conf */
 	err = read_config_for(cd, objdb, OBJECT_PARENT_HANDLE, "totem", "totem", 1);
 	if (err < 0)
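The consensus rule introduced by this patch can be read in isolation as a pure function of the token timeout and the node count. The following is a minimal sketch, not the cman code itself: `calc_consensus` is a hypothetical name, integer division by 5 stands in for the patch's floating-point multiply by 0.2, and the `CMAN_NOCONFIG` environment check is left out (only the "node count unknown" fallback is kept).

```c
/*
 * Sketch of the consensus rule from the patch comment.
 * All values are in milliseconds. node_count == 0 means "could not
 * count the nodes" and is handled like a cluster of more than two nodes.
 */
static unsigned int calc_consensus(unsigned int token, unsigned int node_count)
{
	unsigned int consensus;

	if (node_count == 0)
		node_count = 3;		/* play safe, as the patch does */

	if (node_count > 2)
		return token + 2000;

	consensus = token / 5;		/* integer stand-in for token * 0.2 */
	if (consensus < 200)
		consensus = 200;	/* lower bound for small clusters */
	if (consensus > 2000)
		consensus = 2000;	/* upper bound for small clusters */
	return consensus;
}
```

With a 10000 ms token, `calc_consensus(10000, 2)` gives 2000 and `calc_consensus(10000, 3)` gives 12000; with a 500 ms token on two nodes the lower bound applies and the result is 200.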

cluster: RHEL55 - cman: Reduce consensus value
by Lon Hohberger 28 Jul '10

Gitweb: http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=bb…
Commit: bb581356afc53b8b1dba8594cae5ee8970f8ce58
Parent: 9e4ac14c0c65116a05e8b6a7dfd5f7f8ce8f0742
Author: Christine Caulfield <ccaulfie(a)redhat.com>
AuthorDate: Tue Jul 27 09:06:18 2010 +0100
Committer: Lon Hohberger <lhh(a)redhat.com>
CommitterDate: Wed Jul 28 19:02:08 2010 -0400

cman: Reduce consensus value

Make consensus 1.2*totem under normal circumstances
or 0.2*totem for a two_node cluster

Resolves: rhbz#618639
Signed-off-by: Christine Caulfield <ccaulfie(a)redhat.com>
---
 cman/daemon/ais.c | 33 ++++++++++++++++++++++-----------
 1 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/cman/daemon/ais.c b/cman/daemon/ais.c
index 787f0bd..ab51b12 100644
--- a/cman/daemon/ais.c
+++ b/cman/daemon/ais.c
@@ -52,6 +52,7 @@ extern char *key_filename;
 extern unsigned int quorumdev_poll;
 extern unsigned int ccsd_poll_interval;
 extern unsigned int shutdown_timeout;
+extern int two_node;
 extern int init_config(struct objdb_iface_ver0 *objdb);

 struct totem_ip_address mcast_addr[MAX_INTERFACES];
@@ -515,21 +516,31 @@ static int comms_init_ais(struct objdb_iface_ver0 *objdb)
 			"20", strlen("20")+1);
 	}

-	/* Extend consensus & join timeouts per bz#214290 */
+	/* Extend join timeout per bz#214290 */
 	if (objdb_get_string(objdb, object_handle, "join", &value)) {
 		global_objdb->object_key_create(object_handle, "join", strlen("join"),
 			"60", strlen("60")+1);
 	}
-	/* consensus should be 2*token, see bz#544482*/
-	if (objdb_get_string(objdb, object_handle, "consensus", &value)) {
-		unsigned int token=0;
-		char calc_consensus[32];
-
-		objdb_get_int(objdb, object_handle, "token", &token);
-		sprintf(calc_consensus, "%d", token*2);
-		objdb->object_key_create(object_handle, "consensus", strlen("consensus"),
-			calc_consensus, strlen(calc_consensus)+1);
-	}
+
+	/* bz#611391
+	 * consensus should be 1.2*token or for 0.2*token for two_node clusters
+	 */
+	if (objdb_get_string(objdb, object_handle, "consensus", &value)) {
+		unsigned int token=0;
+		unsigned int consensus;
+		char calc_consensus[32];
+
+		objdb_get_int(objdb, object_handle, "token", &token);
+
+		if (two_node)
+			consensus = (float)token*0.2;
+		else
+			consensus = (float)token*1.2;
+
+		snprintf(calc_consensus, sizeof(calc_consensus), "%d", consensus);
+		objdb->object_key_create(object_handle, "consensus", strlen("consensus"),
+			calc_consensus, strlen(calc_consensus)+1);
+	}

 	/* Set RRP mode appropriately */
 	if (num_interfaces > 1) {

cluster: RHEL56 - cman: fix consensus calculation
by Lon Hohberger 28 Jul '10

Gitweb: http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=72…
Commit: 72ec71ba0a79be20edf1093d0185c67dd01970da
Parent: 7dfc82bc15db1f2b92526010eb17b46efecb19f1
Author: Lon Hohberger <lhh(a)redhat.com>
AuthorDate: Wed Jul 28 15:49:12 2010 -0400
Committer: Lon Hohberger <lhh(a)redhat.com>
CommitterDate: Wed Jul 28 18:45:30 2010 -0400

cman: fix consensus calculation

This is a backport of Fabio's patch:

043c603d46ab401e69cb8e09a3a818e2006134c5

Instead of using the object database, it simply queries CCS
to gather the node count for consensus calculation later.

Resolves: rhbz#611391
Signed-off-by: Lon Hohberger <lhh(a)redhat.com>
---
 cman/daemon/ais.c    | 28 ++++++++++++++++++++++------
 cman/daemon/config.c | 25 +++++++++++++++++++++++++
 2 files changed, 47 insertions(+), 6 deletions(-)

diff --git a/cman/daemon/ais.c b/cman/daemon/ais.c
index efb2b44..b6f23c4 100644
--- a/cman/daemon/ais.c
+++ b/cman/daemon/ais.c
@@ -52,7 +52,6 @@ extern char *key_filename;
 extern unsigned int quorumdev_poll;
 extern unsigned int ccsd_poll_interval;
 extern unsigned int shutdown_timeout;
-extern int two_node;
 extern int init_config(struct objdb_iface_ver0 *objdb);

 struct totem_ip_address mcast_addr[MAX_INTERFACES];
@@ -60,6 +59,7 @@ struct totem_ip_address ifaddrs[MAX_INTERFACES];
 int num_interfaces;
 uint64_t incarnation;
 int num_ais_nodes;
+unsigned int node_count = 0;

 static int config_run;
 static int startup_pipe;
@@ -527,9 +527,20 @@ static int comms_init_ais(struct objdb_iface_ver0 *objdb)
 			"2500", strlen("2500")+1);
 	}

-	/* bz#611391
-	 * consensus should be 1.2*token or for 0.2*token for two_node clusters
+	/*
+	 * consensus should be:
+	 * 2 nodes - 200 ms <= consensus = token * 0.2 <= 2000
+	 * > 2 nodes - consensus = token + 2000
+	 *
+	 * autoconfig clusters will work as > 2 nodes
+	 *
+	 * See 611391#c19
 	 */
+
+	/* if we are running in autoconfig or we can't count the nodes, then play safe */
+	if ((getenv("CMAN_NOCONFIG")) || (node_count == 0))
+		node_count=3;
+
 	if (objdb_get_string(objdb, object_handle, "consensus", &value)) {
 		unsigned int token=0;
 		unsigned int consensus;
@@ -537,10 +548,15 @@ static int comms_init_ais(struct objdb_iface_ver0 *objdb)

 		objdb_get_int(objdb, object_handle, "token", &token);

-		if (two_node)
+		if (node_count > 2) {
+			consensus = (float)token+2000;
+		} else {
 			consensus = (float)token*0.2;
-		else
-			consensus = (float)token*1.2;
+			if (consensus < 200)
+				consensus = 200;
+			if (consensus > 2000)
+				consensus = 2000;
+		}

 		snprintf(calc_consensus, sizeof(calc_consensus), "%d", consensus);
 		objdb->object_key_create(object_handle, "consensus", strlen("consensus"),
diff --git a/cman/daemon/config.c b/cman/daemon/config.c
index 86ea2fe..29049d7 100644
--- a/cman/daemon/config.c
+++ b/cman/daemon/config.c
@@ -22,6 +22,8 @@
 #define MAXXMLNODES 1024
 #endif

+extern int node_count;
+
 static int read_config_for(int ccs_fd, struct objdb_iface_ver0 *objdb, unsigned
 			   int parent, char *object, char *key, int always_create)
 {
@@ -128,6 +130,27 @@ static int read_config_for(int ccs_fd, struct objdb_iface_ver0 *objdb, unsigned
 	return gotcount;
 }

+static int count_clusternodes(int cd)
+{
+	char path[256];
+	int count = 1;
+	char *val;
+
+	do {
+		snprintf(path, sizeof(path),
+			 "/cluster/clusternodes/clusternode[%d]/@name",
+			 count);
+
+		if (ccs_get(cd, path, &val) != 0)
+			break;
+
+		free(val);
+		++count;
+	} while (1);
+
+	return count-1;
+}
+
 int init_config(struct objdb_iface_ver0 *objdb)
 {
 	int cd, err;
@@ -136,6 +159,8 @@ int init_config(struct objdb_iface_ver0 *objdb)
 	if (cd < 0)
 		return -1;

+	node_count = count_clusternodes(cd);
+
 	/* These first few are just versions of openais.conf */
 	err = read_config_for(cd, objdb, OBJECT_PARENT_HANDLE, "totem", "totem", 1);
 	if (err < 0)
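The node-counting approach in count_clusternodes() is to probe 1-based indexed configuration paths until a lookup fails. The sketch below shows that probing loop against a mock lookup so it can be compiled without CCS; `lookup_fn`, `mock_lookup`, `mock_nodes`, and `count_nodes` are hypothetical names, and the mock is only an assumption about how a lookup such as ccs_get() signals success (0) and failure (non-zero).

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a configuration lookup: returns 0 and
 * sets *val when the path exists, non-zero otherwise. */
typedef int (*lookup_fn)(const char *path, const char **val);

/* Mock configuration with three cluster nodes. */
static const char *mock_nodes[] = { "node1", "node2", "node3" };

static int mock_lookup(const char *path, const char **val)
{
	int idx;

	/* Extract the 1-based index from the probed path. */
	if (sscanf(path, "/cluster/clusternodes/clusternode[%d]/@name", &idx) != 1)
		return -1;
	if (idx < 1 || (size_t)idx > sizeof(mock_nodes) / sizeof(mock_nodes[0]))
		return -1;
	*val = mock_nodes[idx - 1];
	return 0;
}

/* Probe clusternode[1], clusternode[2], ... until a lookup fails;
 * the number of successful probes is the node count. */
static int count_nodes(lookup_fn lookup)
{
	char path[256];
	const char *val;
	int count = 1;

	for (;;) {
		snprintf(path, sizeof(path),
			 "/cluster/clusternodes/clusternode[%d]/@name", count);
		if (lookup(path, &val) != 0)
			break;
		++count;
	}
	return count - 1;
}
```

Calling `count_nodes(mock_lookup)` returns 3 for the mock table. Unlike the patch, the mock returns pointers to static strings, so there is nothing to free.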

cluster: RHEL56 - gfs-kernel: assertion "!get_transaction" fails on mmaps between gfs filesystems
by Benjamin Marzinski 28 Jul '10

Gitweb: http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=7d…
Commit: 7dfc82bc15db1f2b92526010eb17b46efecb19f1
Parent: 52965892810bc1af110f18a891c09fc214f82cee
Author: Benjamin Marzinski <bmarzins(a)redhat.com>
AuthorDate: Wed Jul 28 17:17:04 2010 -0500
Committer: Benjamin Marzinski <bmarzins(a)redhat.com>
CommitterDate: Wed Jul 28 17:17:04 2010 -0500

gfs-kernel: assertion "!get_transaction" fails on mmaps between gfs filesystems

The gfs walk_vm() code only checked that the buffer which got passed in
was not mmaped to a file on the same filesystem as the file being written
to. However, if that buffer was from a file on another GFS filesystem,
and that file did not have blocks allocated for all of the area mapped to
the buffer, GFS would need to start a transaction when it tried to read
from the buffer. Since gfs had already started a transaction, the
!get_transaction assert would fail.

Furthermore, gfs is not able to safely hold glocks on files in two
separate filesystems at the same time, since inode numbers are used for
lock ordering, and these are not unique across filesystems. Because of
this there is no way to guarantee that the necessary blocks won't be
removed from the mmaped file via a truncate before gfs needs to read
them in.

This fix simply reads the buffer in before doing any locking, if gfs
notices that the buffer is mmaped to a file on another gfs filesystem.
This should fix the problem except in corner cases, where the mmaped file
is being truncated at the same time.

Resolves: bz617339
Signed-off-by: Benjamin Marzinski <bmarzins(a)redhat.com>
---
 gfs-kernel/src/gfs/ops_file.c | 70 +++++++++++++++++++++++------------------
 1 files changed, 39 insertions(+), 31 deletions(-)

diff --git a/gfs-kernel/src/gfs/ops_file.c b/gfs-kernel/src/gfs/ops_file.c
index 455debb..3d748b6 100644
--- a/gfs-kernel/src/gfs/ops_file.c
+++ b/gfs-kernel/src/gfs/ops_file.c
@@ -192,6 +192,37 @@ walk_vm_hard(struct file *file, char *buf, size_t size, loff_t *offset,
 }
 
 /**
+ * grope_mapping - feel up a mapping that needs to be written
+ * @buf: the start of the memory to be written
+ * @size: the size of the memory to be written
+ *
+ * We do this after acquiring the locks on the mapping,
+ * but before starting the write transaction. We need to make
+ * sure that we don't cause recursive transactions if blocks
+ * need to be allocated to the file backing the mapping.
+ *
+ * Returns: errno
+ */
+
+static int
+grope_mapping(char *buf, size_t size)
+{
+	unsigned long start = (unsigned long)buf;
+	unsigned long stop = start + size;
+	char c;
+
+	while (start < stop) {
+		if (copy_from_user(&c, (char *)start, 1))
+			return -EFAULT;
+
+		start += PAGE_CACHE_SIZE;
+		start &= PAGE_CACHE_MASK;
+	}
+
+	return 0;
+}
+
+/**
  * walk_vm - Walk the vmas associated with a buffer for read or write.
  * If any of them are gfs, pass the gfs inode down to the read/write
  * worker function so that locks can be acquired in the correct order.
@@ -211,8 +242,11 @@ walk_vm(struct file *file, char *buf, size_t size, loff_t *offset,
 	struct kiocb *iocb,
 	do_rw_t operation)
 {
+	int needs_groping = 0;
+
 	if (current->mm) {
 		struct super_block *sb = file->f_dentry->d_inode->i_sb;
+		struct file_system_type *type = file->f_dentry->d_inode->i_sb->s_type;
 		struct mm_struct *mm = current->mm;
 		struct vm_area_struct *vma;
 		unsigned long start = (unsigned long)buf;
@@ -228,6 +262,9 @@ walk_vm(struct file *file, char *buf, size_t size, loff_t *offset,
 			if (vma->vm_file &&
 			    vma->vm_file->f_dentry->d_inode->i_sb == sb)
 				goto do_locks;
+			else if (vma->vm_file &&
+				 vma->vm_file->f_dentry->d_inode->i_sb->s_type == type)
+				needs_groping = 1;
 		}
 
 		if (!dumping)
@@ -236,6 +273,8 @@ walk_vm(struct file *file, char *buf, size_t size, loff_t *offset,
 
 	{
 		struct gfs_holder gh;
+		if (needs_groping)
+			grope_mapping(buf, size);
 		return operation(file, buf, size, offset, iocb, 0, &gh);
 	}
 
@@ -284,37 +323,6 @@ do_read_readi(struct file *file, char *buf, size_t size, loff_t *offset,
 }
 
 /**
- * grope_mapping - feel up a mapping that needs to be written
- * @buf: the start of the memory to be written
- * @size: the size of the memory to be written
- *
- * We do this after acquiring the locks on the mapping,
- * but before starting the write transaction. We need to make
- * sure that we don't cause recursive transactions if blocks
- * need to be allocated to the file backing the mapping.
- *
- * Returns: errno
- */
-
-static int
-grope_mapping(char *buf, size_t size)
-{
-	unsigned long start = (unsigned long)buf;
-	unsigned long stop = start + size;
-	char c;
-
-	while (start < stop) {
-		if (copy_from_user(&c, (char *)start, 1))
-			return -EFAULT;
-
-		start += PAGE_CACHE_SIZE;
-		start &= PAGE_CACHE_MASK;
-	}
-
-	return 0;
-}
-
-/**
  * do_read_direct - Read bytes from a file
  * @file: The file to read from
  * @buf: The buffer to copy into
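The page-stepping loop in grope_mapping() can be tried outside the kernel. What follows is a minimal userspace sketch under stated assumptions, not the GFS code: `sketch_pages_touched`, `SKETCH_PAGE_SIZE` and `SKETCH_PAGE_MASK` are hypothetical names, a 4096-byte page is assumed in place of PAGE_CACHE_SIZE, and the copy_from_user() read is replaced by a counter so that only the address arithmetic is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for illustration; the kernel code uses PAGE_CACHE_SIZE. */
#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/*
 * Hypothetical helper: count how many pages the range
 * [start, start + size) overlaps, stepping the same way
 * grope_mapping() does. Each loop iteration corresponds to
 * one single-byte read in the kernel version.
 */
static size_t
sketch_pages_touched(uintptr_t start, size_t size)
{
	uintptr_t stop = start + size;
	size_t touched = 0;

	/* The mask trick below only works for power-of-two page sizes. */
	assert((SKETCH_PAGE_SIZE & (SKETCH_PAGE_SIZE - 1)) == 0);

	while (start < stop) {
		touched++;                  /* kernel: read one byte at 'start' */
		start += SKETCH_PAGE_SIZE;  /* move somewhere into the next page */
		start &= SKETCH_PAGE_MASK;  /* align down to that page's first byte */
	}

	return touched;
}
```

With these assumptions, a 2-byte buffer starting at offset 4095 straddles a page boundary and the loop runs twice, once per page, while a page-aligned 4096-byte buffer needs only one iteration. The first read happens at the unaligned start address; every later read is at a page boundary.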


cluster: RHEL55 - Revert "gfs-kernel: assertion "!get_transaction" fails on mmaps between gfs filesystems"
by Benjamin Marzinski 28 Jul '10


Gitweb: http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=9e…
Commit: 9e4ac14c0c65116a05e8b6a7dfd5f7f8ce8f0742
Parent: 48194711384ab9a6af2701fd7313e05c8c0ce1d7
Author: Benjamin Marzinski <bmarzins(a)redhat.com>
AuthorDate: Wed Jul 28 16:18:21 2010 -0500
Committer: Benjamin Marzinski <bmarzins(a)redhat.com>
CommitterDate: Wed Jul 28 16:18:21 2010 -0500

Revert "gfs-kernel: assertion "!get_transaction" fails on mmaps between gfs filesystems"

This reverts commit 48194711384ab9a6af2701fd7313e05c8c0ce1d7.
---
 gfs-kernel/src/gfs/ops_file.c | 70 ++++++++++++++++++-----------------------
 1 files changed, 31 insertions(+), 39 deletions(-)

diff --git a/gfs-kernel/src/gfs/ops_file.c b/gfs-kernel/src/gfs/ops_file.c
index 44a73e5..150f328 100644
--- a/gfs-kernel/src/gfs/ops_file.c
+++ b/gfs-kernel/src/gfs/ops_file.c
@@ -192,37 +192,6 @@ walk_vm_hard(struct file *file, char *buf, size_t size, loff_t *offset,
 }
 
 /**
- * grope_mapping - feel up a mapping that needs to be written
- * @buf: the start of the memory to be written
- * @size: the size of the memory to be written
- *
- * We do this after acquiring the locks on the mapping,
- * but before starting the write transaction. We need to make
- * sure that we don't cause recursive transactions if blocks
- * need to be allocated to the file backing the mapping.
- *
- * Returns: errno
- */
-
-static int
-grope_mapping(char *buf, size_t size)
-{
-	unsigned long start = (unsigned long)buf;
-	unsigned long stop = start + size;
-	char c;
-
-	while (start < stop) {
-		if (copy_from_user(&c, (char *)start, 1))
-			return -EFAULT;
-
-		start += PAGE_CACHE_SIZE;
-		start &= PAGE_CACHE_MASK;
-	}
-
-	return 0;
-}
-
-/**
  * walk_vm - Walk the vmas associated with a buffer for read or write.
  * If any of them are gfs, pass the gfs inode down to the read/write
  * worker function so that locks can be acquired in the correct order.
@@ -242,11 +211,8 @@ walk_vm(struct file *file, char *buf, size_t size, loff_t *offset,
 	struct kiocb *iocb,
 	do_rw_t operation)
 {
-	int needs_groping = 0;
-
 	if (current->mm) {
 		struct super_block *sb = file->f_dentry->d_inode->i_sb;
-		struct file_system_type *type = file->f_dentry->d_inode->i_sb->s_type;
 		struct mm_struct *mm = current->mm;
 		struct vm_area_struct *vma;
 		unsigned long start = (unsigned long)buf;
@@ -262,9 +228,6 @@ walk_vm(struct file *file, char *buf, size_t size, loff_t *offset,
 			if (vma->vm_file &&
 			    vma->vm_file->f_dentry->d_inode->i_sb == sb)
 				goto do_locks;
-			else if (vma->vm_file &&
-				 vma->vm_file->f_dentry->d_inode->i_sb->s_type == type)
-				needs_groping = 1;
 		}
 
 		if (!dumping)
@@ -273,8 +236,6 @@ walk_vm(struct file *file, char *buf, size_t size, loff_t *offset,
 
 	{
 		struct gfs_holder gh;
-		if (needs_groping)
-			grope_mapping(buf, size);
 		return operation(file, buf, size, offset, iocb, 0, &gh);
 	}
 
@@ -323,6 +284,37 @@ do_read_readi(struct file *file, char *buf, size_t size, loff_t *offset,
 }
 
 /**
+ * grope_mapping - feel up a mapping that needs to be written
+ * @buf: the start of the memory to be written
+ * @size: the size of the memory to be written
+ *
+ * We do this after acquiring the locks on the mapping,
+ * but before starting the write transaction. We need to make
+ * sure that we don't cause recursive transactions if blocks
+ * need to be allocated to the file backing the mapping.
+ *
+ * Returns: errno
+ */
+
+static int
+grope_mapping(char *buf, size_t size)
+{
+	unsigned long start = (unsigned long)buf;
+	unsigned long stop = start + size;
+	char c;
+
+	while (start < stop) {
+		if (copy_from_user(&c, (char *)start, 1))
+			return -EFAULT;
+
+		start += PAGE_CACHE_SIZE;
+		start &= PAGE_CACHE_MASK;
+	}
+
+	return 0;
+}
+
+/**
  * do_read_direct - Read bytes from a file
  * @file: The file to read from
  * @buf: The buffer to copy into


cluster: RHEL55 - gfs-kernel: assertion "!get_transaction" fails on mmaps between gfs filesystems
by Benjamin Marzinski 28 Jul '10


Gitweb: http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=48…
Commit: 48194711384ab9a6af2701fd7313e05c8c0ce1d7
Parent: e44f2175836fbc9ec95b761781911ca2a6cc570a
Author: Benjamin Marzinski <bmarzins(a)redhat.com>
AuthorDate: Wed Jul 28 15:54:25 2010 -0500
Committer: Benjamin Marzinski <bmarzins(a)redhat.com>
CommitterDate: Wed Jul 28 15:54:25 2010 -0500

gfs-kernel: assertion "!get_transaction" fails on mmaps between gfs filesystems

The gfs walk_vm() code only checked that the buffer which got passed in
was not mmaped to a file on the same filesystem as the file being written
to. However, if that buffer was from a file on another GFS filesystem,
and that file did not have blocks allocated for all of the area mapped to
the buffer, GFS would need to start a transaction when it tried to read
from the buffer. Since gfs had already started a transaction, the
!get_transaction assert would fail.

Furthermore, gfs is not able to safely hold glocks on files in two
separate filesystems at the same time, since inode numbers are used for
lock ordering, and these are not unique across filesystems. Because of
this there is no way to guarantee that the necessary blocks won't be
removed from the mmaped file via a truncate before gfs needs to read
them in.

This fix simply reads the buffer in before doing any locking, if gfs
notices that the buffer is mmaped to a file on another gfs filesystem.
This should fix the problem except in corner cases, where the mmaped file
is being truncated at the same time.

Resolves: bz617339
Signed-off-by: Benjamin Marzinski <bmarzins(a)redhat.com>
---
 gfs-kernel/src/gfs/ops_file.c | 70 +++++++++++++++++++++++------------------
 1 files changed, 39 insertions(+), 31 deletions(-)

diff --git a/gfs-kernel/src/gfs/ops_file.c b/gfs-kernel/src/gfs/ops_file.c
index 150f328..44a73e5 100644
--- a/gfs-kernel/src/gfs/ops_file.c
+++ b/gfs-kernel/src/gfs/ops_file.c
@@ -192,6 +192,37 @@ walk_vm_hard(struct file *file, char *buf, size_t size, loff_t *offset,
 }
 
 /**
+ * grope_mapping - feel up a mapping that needs to be written
+ * @buf: the start of the memory to be written
+ * @size: the size of the memory to be written
+ *
+ * We do this after acquiring the locks on the mapping,
+ * but before starting the write transaction. We need to make
+ * sure that we don't cause recursive transactions if blocks
+ * need to be allocated to the file backing the mapping.
+ *
+ * Returns: errno
+ */
+
+static int
+grope_mapping(char *buf, size_t size)
+{
+	unsigned long start = (unsigned long)buf;
+	unsigned long stop = start + size;
+	char c;
+
+	while (start < stop) {
+		if (copy_from_user(&c, (char *)start, 1))
+			return -EFAULT;
+
+		start += PAGE_CACHE_SIZE;
+		start &= PAGE_CACHE_MASK;
+	}
+
+	return 0;
+}
+
+/**
  * walk_vm - Walk the vmas associated with a buffer for read or write.
  * If any of them are gfs, pass the gfs inode down to the read/write
  * worker function so that locks can be acquired in the correct order.
@@ -211,8 +242,11 @@ walk_vm(struct file *file, char *buf, size_t size, loff_t *offset,
 	struct kiocb *iocb,
 	do_rw_t operation)
 {
+	int needs_groping = 0;
+
 	if (current->mm) {
 		struct super_block *sb = file->f_dentry->d_inode->i_sb;
+		struct file_system_type *type = file->f_dentry->d_inode->i_sb->s_type;
 		struct mm_struct *mm = current->mm;
 		struct vm_area_struct *vma;
 		unsigned long start = (unsigned long)buf;
@@ -228,6 +262,9 @@ walk_vm(struct file *file, char *buf, size_t size, loff_t *offset,
 			if (vma->vm_file &&
 			    vma->vm_file->f_dentry->d_inode->i_sb == sb)
 				goto do_locks;
+			else if (vma->vm_file &&
+				 vma->vm_file->f_dentry->d_inode->i_sb->s_type == type)
+				needs_groping = 1;
 		}
 
 		if (!dumping)
@@ -236,6 +273,8 @@ walk_vm(struct file *file, char *buf, size_t size, loff_t *offset,
 
 	{
 		struct gfs_holder gh;
+		if (needs_groping)
+			grope_mapping(buf, size);
 		return operation(file, buf, size, offset, iocb, 0, &gh);
 	}
 
@@ -284,37 +323,6 @@ do_read_readi(struct file *file, char *buf, size_t size, loff_t *offset,
 }
 
 /**
- * grope_mapping - feel up a mapping that needs to be written
- * @buf: the start of the memory to be written
- * @size: the size of the memory to be written
- *
- * We do this after acquiring the locks on the mapping,
- * but before starting the write transaction. We need to make
- * sure that we don't cause recursive transactions if blocks
- * need to be allocated to the file backing the mapping.
- *
- * Returns: errno
- */
-
-static int
-grope_mapping(char *buf, size_t size)
-{
-	unsigned long start = (unsigned long)buf;
-	unsigned long stop = start + size;
-	char c;
-
-	while (start < stop) {
-		if (copy_from_user(&c, (char *)start, 1))
-			return -EFAULT;
-
-		start += PAGE_CACHE_SIZE;
-		start &= PAGE_CACHE_MASK;
-	}
-
-	return 0;
-}
-
-/**
  * do_read_direct - Read bytes from a file
  * @file: The file to read from
  * @buf: The buffer to copy into
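The per-mapping decision that walk_vm() makes with this patch applied can be modelled in plain C. This is only a sketch of the control flow as read from the diff: the `sketch_*` types, the enum and `sketch_classify` are hypothetical stand-ins for the kernel's super_block and file_system_type handling, and a NULL entry is assumed to represent a vma with no backing file.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct file_system_type and struct super_block. */
struct sketch_fs_type { const char *name; };
struct sketch_super { const struct sketch_fs_type *type; };

enum sketch_action {
	SKETCH_PLAIN,     /* no special handling before the operation */
	SKETCH_PREFAULT,  /* touch the buffer first (grope_mapping in the patch) */
	SKETCH_DO_LOCKS   /* same filesystem: take the lock-ordering path */
};

/*
 * Decide how to treat a write to a file on 'target', given the
 * superblocks backing each mapping the user buffer overlaps.
 */
static enum sketch_action
sketch_classify(const struct sketch_super *target,
		const struct sketch_super *const *mapped, size_t n)
{
	enum sketch_action action = SKETCH_PLAIN;
	size_t i;

	assert(target != NULL);

	for (i = 0; i < n; i++) {
		if (mapped[i] == NULL)
			continue;               /* no backing file */
		if (mapped[i] == target)
			return SKETCH_DO_LOCKS; /* mirrors "goto do_locks" */
		if (mapped[i]->type == target->type)
			action = SKETCH_PREFAULT; /* another filesystem of the same type */
	}

	return action;
}
```

As in the patch, a mapping on the same superblock wins immediately, even if an earlier mapping had already requested the pre-read. A mapping on a different filesystem type leaves the result unchanged.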

