TipsOnHA
Aug 1, 2017
Technology
Network Preparation
libvirt network preparation:
$ cat internal.xml
<network>
  <name>internal</name>
  <bridge name='virbr8'/>
</network>

$ cat external.xml
<network>
  <name>external</name>
  <bridge name='virbr9'/>
</network>

$ cat management.xml
<network>
  <name>management</name>
  <bridge name='virbr7'/>
  <ip address='192.168.3.1' netmask='255.255.255.0'>
  </ip>
</network>

$ cat heartbeat.xml
<network>
  <name>heartbeat</name>
  <bridge name='virbr6'/>
</network>
Define all of the networks; take the heartbeat network as an example:
$ sudo virsh net-define heartbeat.xml
$ sudo virsh net-autostart heartbeat
$ sudo virsh net-start heartbeat
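The remaining networks are defined the same way; a short loop (assuming the XML files above sit in the current directory):
$ for net in internal external management; do
    sudo virsh net-define ${net}.xml
    sudo virsh net-autostart ${net}
    sudo virsh net-start ${net}
  done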
iSCSI Node
Create a new machine (192.168.122.200) running CentOS 6.9, installed from a local ISO.
First add a second network card (192.168.3.200) and disable SELinux (a sketch of that follows at the end of this section), then do the following steps:
# yum install -y scsi-target-utils
# mkdir -p /var/lib/tgtd/cluster01
# cd /var/lib/tgtd/cluster01/
# dd if=/dev/zero of=volume01.img bs=1M count=100
# dd if=/dev/zero of=volume02.img bs=1M count=1000
Edit the tgtd configuration:
# vim /etc/tgt/targets.conf
<target iqn.2011-10.com.example.kvmhost01:tgt01>
    backing-store /var/lib/tgtd/cluster01/volume01.img
    backing-store /var/lib/tgtd/cluster01/volume02.img
</target>
# chkconfig tgtd on
# service tgtd start
# tgt-admin -s
# chkconfig iptables off
# service iptables stop
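As mentioned above, SELinux should be disabled on this machine; a minimal sketch (assuming the stock /etc/selinux/config):
# setenforce 0                                                  # stop enforcing for the running system
# sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config  # persist across reboots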
node01/node02
Take node01 for example:
[root@node01 ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
TYPE=Ethernet
ONBOOT=yes
NM_CONTROLLED=no
BOOTPROTO=static
IPADDR=192.168.3.201
NETMASK=255.255.255.0
[root@node01 ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth2
DEVICE=eth2
TYPE=Ethernet
ONBOOT=yes
NM_CONTROLLED=no
BOOTPROTO=static
IPADDR=192.168.4.201
NETMASK=255.255.255.0
And for node02:
[root@node02 ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
TYPE=Ethernet
ONBOOT=yes
NM_CONTROLLED=no
BOOTPROTO=static
IPADDR=192.168.3.202
NETMASK=255.255.255.0
[root@node02 ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth2
DEVICE=eth2
TYPE=Ethernet
ONBOOT=yes
NM_CONTROLLED=no
BOOTPROTO=static
IPADDR=192.168.4.202
NETMASK=255.255.255.0
Define /etc/hosts on both nodes:
127.0.0.1 localhost
192.168.122.201 node01
192.168.122.202 node02
192.168.3.201 node01m
192.168.3.202 node02m
192.168.4.201 node01h
192.168.4.202 node02h
Also disable iptables on both nodes.
Generate SSH key pairs with ssh-keygen and copy them so the nodes can log in to each other without a password:
# ssh-keygen -N ""
# ssh-copy-id node01
# ssh-copy-id node02
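Repeat the key generation and copy on node02 as well, then verify passwordless login works in both directions (a quick check):
[root@node01 ~]# ssh node02 hostname
[root@node02 ~]# ssh node01 hostname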
Find iSCSI
On node01 and node02, do the following:
# yum install -y iscsi-initiator-utils
# chkconfig iscsi on
# iscsiadm -m discovery --type sendtargets --portal 192.168.3.200
# service iscsi start
The newly added disks show up as /dev/sda and /dev/sdb.
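To double-check which block devices appeared, list them (device names may differ on your setup):
# fdisk -l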
HA Add-On
On node01 and node02, install the package group via:
# yum groupinstall -y "High Availability"
Start the ricci service, set a password for the ricci user, and keep cman and rgmanager from starting at boot (they will be started by the scripts below):
# chkconfig ricci on; service ricci start
# passwd ricci
# chkconfig cman off; chkconfig rgmanager off
Install httpd on both nodes:
# yum install -y httpd
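Since rgmanager will start and stop httpd through the apache resource configured below, leave the httpd init script disabled (it usually is by default):
# chkconfig httpd off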
Node01
Quorum Disk:
[root@node01 ~]# mkqdisk -c /dev/sda -l qdisk01
mkqdisk v3.0.12.1
Writing new quorum disk label 'qdisk01' to /dev/sda.
WARNING: About to destroy all data on /dev/sda; proceed [N/y] ? y
Initializing status block for node 1...
Initializing status block for node 2...
Initializing status block for node 3...
Initializing status block for node 4...
Initializing status block for node 5...
Initializing status block for node 6...
Initializing status block for node 7...
Initializing status block for node 8...
Initializing status block for node 9...
Initializing status block for node 10...
Initializing status block for node 11...
Initializing status block for node 12...
Initializing status block for node 13...
Initializing status block for node 14...
Initializing status block for node 15...
Initializing status block for node 16...
Then format /dev/sdb and use this filesystem to hold the Apache content:
# mkfs.ext4 /dev/sdb
# mount /dev/sdb /mnt
# cp -ar /var/www/* /mnt/
# umount /mnt
Cluster Configuration
/etc/cluster/cluster.conf:
<?xml version="1.0"?>
<cluster config_version="1" name="cluster01">
  <cman expected_votes="3"/>
  <clusternodes>
    <clusternode name="node01h" nodeid="1" votes="1">
      <fence>
        <method name="virsh_reboot">
          <device name="kvmhost01" port="node1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node02h" nodeid="2" votes="1">
      <fence>
        <method name="virsh_reboot">
          <device name="kvmhost01" port="node2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <totem token="20000"/>
  <quorumd interval="1" label="qdisk01" master_wins="1" tko="10" votes="1"/>
  <fencedevices>
    <fencedevice name="kvmhost01" agent="fence_virsh" ipaddr="192.168.3.1" login="root" passwd="gwoguwoguoeg" option="reboot"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="dom01">
        <failoverdomainnode name="node01h"/>
        <failoverdomainnode name="node02h"/>
      </failoverdomain>
    </failoverdomains>
    <service autostart="0" domain="dom01" name="service01">
      <ip address="192.168.122.209" monitor_link="on">
        <fs name="webdata01" device="/dev/sdb" fstype="ext4" mountpoint="/var/www" self_fence="1">
          <apache name="webserver01"/>
        </fs>
      </ip>
    </service>
  </rm>
</cluster>
Validate the configuration and copy it to node02 with scp:
# ccs_config_validate
# scp ./cluster.conf node02:/etc/cluster/
Start/stop scripts for the cluster services:
[root@node01 ~]# cd /usr/local/bin/
[root@node01 bin]# ls
clstart clstart_all clstop clstop_all
[root@node01 bin]# pwd
/usr/local/bin
[root@node01 bin]# cat clstart
#!/bin/sh
service cman start
service rgmanager start
[root@node01 bin]# cat clstart_all
#!/bin/sh
ssh node01 /usr/local/bin/clstart &
ssh node02 /usr/local/bin/clstart &
wait
[root@node01 bin]# cat clstop
#!/bin/sh
service rgmanager stop
service cman stop
[root@node01 bin]# cat clstop_all
#!/bin/sh
ssh node01 /usr/local/bin/clstop &
ssh node02 /usr/local/bin/clstop &
wait
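The scripts must be executable and present on both nodes; a minimal sketch (assuming the same /usr/local/bin path on node02):
[root@node01 bin]# chmod +x clstart clstart_all clstop clstop_all
[root@node01 bin]# scp clstart clstop node02:/usr/local/bin/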
Now start the service:
# clstart_all
# clusvcadm -e service01 -m node01h
View the service status:
[root@node01 bin]# clustat
Cluster Status for cluster01 @ Tue Aug 1 15:57:18 2017
Member Status: Quorate
 Member Name                  ID   Status
 ------ ----                  ---- ------
 node01h                         1 Online, Local, rgmanager
 node02h                         2 Online, rgmanager
 /dev/block/8:0                  0 Online, Quorum Disk

 Service Name           Owner (Last)           State
 ------- ----           ----- ------           -----
 service:service01      node01h                started
Check ip addr on node01; you should see two addresses attached to eth0 (the node's own 192.168.122.201 plus the 192.168.122.209 service IP).
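For example (a quick check; 192.168.122.209 appears as a secondary address):
[root@node01 ~]# ip addr show eth0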
Error
Simulate a failure on node01 by killing corosync:
# pkill -9 corosync
Now node02 will detect that node01's heartbeat is gone and will eventually fence (reboot) node01:
$ tail -f /var/log/messages
Aug 1 15:58:21 node02 corosync[4089]: [CMAN ] quorum device re-registered
Aug 1 15:58:21 node02 corosync[4089]: [QUORUM] Members[2]: 1 2
Aug 1 15:58:21 node02 qdiskd[4148]: Assuming master role
Aug 1 15:58:21 node02 qdiskd[4148]: Writing eviction notice for node 1
Aug 1 15:58:22 node02 qdiskd[4148]: Node 1 evicted
Aug 1 15:58:24 node02 corosync[4089]: [TOTEM ] A processor failed, forming new configuration.
Aug 1 15:58:26 node02 corosync[4089]: [QUORUM] Members[1]: 2
Aug 1 15:58:26 node02 corosync[4089]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 1 15:58:26 node02 kernel: dlm: closing connection to node 1
Aug 1 15:58:26 node02 corosync[4089]: [CPG ] chosen downlist: sender r(0) ip(192.168.4.202) ; members(old:2 left:1)
Aug 1 15:58:26 node02 corosync[4089]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 1 15:58:26 node02 rgmanager[4511]: State change: node01h DOWN
Aug 1 15:58:26 node02 fenced[4332]: fencing node node01h
Aug 1 15:58:29 node02 fenced[4332]: fence node01h success
Aug 1 15:58:29 node02 rgmanager[4511]: Taking over service service:service01 from down member node01h
Aug 1 15:58:29 node02 rgmanager[5640]: [ip] Adding IPv4 address 192.168.122.209/24 to eth0
Aug 1 15:58:33 node02 rgmanager[5755]: [fs] mounting /dev/sdb on /var/www
Aug 1 15:58:33 node02 kernel: EXT4-fs (sdb): recovery complete
Aug 1 15:58:33 node02 kernel: EXT4-fs (sdb): mounted filesystem with ordered data mode. Opts:
Aug 1 15:58:33 node02 rgmanager[5923]: [apache] Checking Existence Of File /var/run/cluster/apache/apache:webserver01.pid [apache:webserver01] > Failed
Aug 1 15:58:33 node02 rgmanager[5945]: [apache] Monitoring Service apache:webserver01 > Service Is Not Running
Aug 1 15:58:33 node02 rgmanager[5967]: [apache] Starting Service apache:webserver01
Aug 1 15:58:34 node02 rgmanager[4511]: Service service:service01 started
After node01 reboots, run clstart on it to rejoin the cluster.
Relocate the service back to node01:
[root@node01 ~]# clustat
Cluster Status for cluster01 @ Tue Aug 1 16:02:23 2017
Member Status: Quorate
 Member Name                  ID   Status
 ------ ----                  ---- ------
 node01h                         1 Online, Local, rgmanager
 node02h                         2 Online, rgmanager
 /dev/block/8:0                  0 Online, Quorum Disk

 Service Name           Owner (Last)           State
 ------- ----           ----- ------           -----
 service:service01      node02h                started
[root@node01 ~]# clusvcadm -r service01 -m node01h
Trying to relocate service:service01 to node01h...Success
service:service01 is now running on node01h
[root@node01 ~]# clustat
Cluster Status for cluster01 @ Tue Aug 1 16:03:38 2017
Member Status: Quorate
 Member Name                  ID   Status
 ------ ----                  ---- ------
 node01h                         1 Online, Local, rgmanager
 node02h                         2 Online, rgmanager
 /dev/block/8:0                  0 Online, Quorum Disk

 Service Name           Owner (Last)           State
 ------- ----           ----- ------           -----
 service:service01      node01h                started
Configuration Modification
Use the cman_tool version -r command to push configuration changes to the running cluster; note that not all changes can be applied this way.
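A typical runtime update looks like this (a sketch; the important part is incrementing config_version before pushing):
# vim /etc/cluster/cluster.conf    # bump config_version, e.g. "1" -> "2"
# ccs_config_validate              # sanity-check the edited file
# cman_tool version -r             # distribute the new configuration to all nodes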