[Ocfs2-users] 2 node cluster with shared LUN via FC

Thu Nov 4 11:06:54 PDT 2010

It seems that the initialization order is reversed.

The init scripts should start o2net first and only then try to mount the
filesystem (I think the o2cb init script should do it). But in your case,
the scripts are mounting the filesystem first and starting the cluster
stack afterwards:

> Nov  4 17:27:37 localhost kernel: [  487.105327] ocfs2: Mounting
> device (8,49) on (node 0, slot 0) with ordered data mode.
> Nov  4 17:28:11 localhost kernel: [  521.163897] o2net: accepted
> connection from node xen02b (num 1) at 192.168.100.101:7777

So when the second node comes up, it checks the disk heartbeat and
detects the first node, but because the network stack isn't up yet, it
"thinks" that the first node is dead due to the lack of network
connectivity and starts the recovery procedure.
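
To check for this ordering in a node's kernel log, something like the
following rough Python sketch could be used. It is not part of
ocfs2-tools, the function names are made up for illustration, and it
assumes each log entry has been joined back onto a single line. The two
sample lines are the ones quoted above.

```python
import re

# Captures the kernel uptime stamp and the message that follows it,
# e.g. "[  487.105327] ocfs2: Mounting device ...".
LINE_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+(.*)")


def first_event_time(lines, prefixes):
    """Return the uptime (seconds) of the first message starting with any prefix."""
    for line in lines:
        m = LINE_RE.search(line)
        if m and m.group(2).startswith(prefixes):
            return float(m.group(1))
    return None


def mounted_before_network(lines):
    """True if the ocfs2 mount is logged before any o2net connection (or none is logged)."""
    mount = first_event_time(lines, ("ocfs2: Mounting",))
    net = first_event_time(
        lines, ("o2net: accepted connection", "o2net: connected to")
    )
    if mount is None:
        return False
    return net is None or mount < net


# Sample taken from the node 0 log quoted above (wrapped lines joined).
NODE0_LOG = [
    "Nov  4 17:27:37 localhost kernel: [  487.105327] ocfs2: Mounting "
    "device (8,49) on (node 0, slot 0) with ordered data mode.",
    "Nov  4 17:28:11 localhost kernel: [  521.163897] o2net: accepted "
    "connection from node xen02b (num 1) at 192.168.100.101:7777",
]

if __name__ == "__main__":
    print(mounted_before_network(NODE0_LOG))
```

If it reports that the mount came first, the node mounted the volume
before it had a network connection to its peer, which matches the
symptom described here.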

Are you starting the ocfs2 script and then the o2cb script? If so,
change the order and bring up o2cb first (heartbeat + network stack).
If not, check the actual init scripts and make sure that the last
thing the ocfs2 script does is mount the filesystem.
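
A minimal sketch of how the start order could be checked, assuming a
SysV-style runlevel directory where start links are named "S<number><name>"
and run in lexicographic order. That layout, the link names in the example
and the function names are my assumptions, not something taken from your
systems:

```python
import re

# SysV-style start links look like "S20o2cb"; kill links ("K...") are ignored.
LINK_RE = re.compile(r"^S(\d+)(.+)$")


def start_sequence(links):
    """Return service names in the order the start links would be run."""
    started = []
    for link in sorted(links):  # rc runs the links in lexicographic order
        m = LINK_RE.match(link)
        if m:
            started.append(m.group(2))
    return started


def o2cb_starts_first(links):
    """True if both services are present and o2cb is started before ocfs2."""
    seq = start_sequence(links)
    return (
        "o2cb" in seq
        and "ocfs2" in seq
        and seq.index("o2cb") < seq.index("ocfs2")
    )


if __name__ == "__main__":
    # Hypothetical link names, only to illustrate the check.
    print(o2cb_starts_first(["S21ocfs2", "S20o2cb", "K01example"]))
```

On a real node the list of names could come from os.listdir() on the
runlevel directory (for example /etc/rc2.d, if that is the default
runlevel on your installation).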

Regards,
Sérgio

On Thu, 04 Nov 2010 17:31:41 +0100
Manuel Bogner <manuel.bogner at geizhals.at> wrote:

> Hi,
> 
> I just upgraded to a bpo kernel 2.6.32-bpo.5-amd64 and now it logs the
> following:
> 
> Nov  4 17:27:37 localhost kernel: [  487.098196] ocfs2_dlm: Nodes in
> domain ("8CEAFACAAE3B4A9BB6AAC6A7664EE094"): 0
> Nov  4 17:27:37 localhost kernel: [  487.105327] ocfs2: Mounting
> device (8,49) on (node 0, slot 0) with ordered data mode.
> Nov  4 17:28:11 localhost kernel: [  521.163897] o2net: accepted
> connection from node xen02b (num 1) at 192.168.100.101:7777
> 
> 
> Nov  4 17:27:59 localhost kernel: [  577.338311] ocfs2_dlm: Nodes in
> domain ("8CEAFACAAE3B4A9BB6AAC6A7664EE094"): 1
> Nov  4 17:27:59 localhost kernel: [  577.351868] ocfs2: Mounting
> device (8,49) on (node 1, slot 1) with ordered data mode.
> Nov  4 17:27:59 localhost kernel: [  577.352241]
> (2287,2):ocfs2_replay_journal:1607 Recovering node 0 from slot 0 on
> device (8,49)
> Nov  4 17:28:00 localhost kernel: [  578.505783]
> (2287,0):ocfs2_begin_quota_recovery:376 Beginning quota recovery in
> slot 0
> Nov  4 17:28:00 localhost kernel: [  578.569121]
> (2241,0):ocfs2_finish_quota_recovery:569 Finishing quota recovery in
> slot 0
> Nov  4 17:28:11 localhost kernel: [  589.359996] o2net:
> connected to node xen02a (num 0) at 192.168.100.100:7777
> 
> process description for the log:
> 
> node1: mount
> node2: mount
> 
> The behavior is still the same, but now it also logs something about
> quota recovery.
> 
> (I also changed the network port used for the traffic. The nodes are
> now directly attached to each other.)
> 
> regards,
> Manuel
> 
> 
> On 2010-11-04 15:49, Manuel Bogner wrote:
> > Hi,
> > 
> > this could also be interesting. I tried mount /dev/sdd1 /shared/ on
> > both nodes at the same time with the following log result:
> > 
> > [  331.158166] OCFS2 1.5.0
> > [  336.155577] ocfs2_dlm: Nodes in domain
> > ("55A9D0B0050C484F97257788A3B9DDE0"): 0
> > [  336.166327] kjournald starting.  Commit interval 5 seconds
> > [  336.166327] ocfs2: Mounting device (8,49) on (node 0, slot 1)
> > with ordered data mode.
> > [  336.166664] (3239,0):ocfs2_replay_journal:1149 Recovering node 1
> > from slot 0 on device (8,49)
> > [  337.350942] kjournald starting.  Commit interval 5 seconds
> > [  351.142229] o2net: accepted connection from node xen02b (num 1)
> > at 10.0.0.102:7777
> > [  495.059065] o2net: no longer connected to node xen02b (num 1) at
> > 10.0.0.102:7777
> > 
> > 
> > [ 4841.036991] ocfs2_dlm: Nodes in domain
> > ("55A9D0B0050C484F97257788A3B9DDE0"): 1
> > [ 4841.039225] kjournald starting.  Commit interval 5 seconds
> > [ 4841.039997] ocfs2: Mounting device (8,49) on (node 1, slot 0)
> > with ordered data mode.
> > [ 4862.033837] o2net: connected to node xen02a (num 0) at
> > 10.0.0.168:7777
> > [ 5005.996422] o2net: no longer connected to node xen02a (num 0) at
> > 10.0.0.168:7777
> > [ 5005.998393] ocfs2: Unmounting device (8,49) on (node 1)
> > 
> > 
> > at the end xen02a was the only one that had it mounted.
> > 
> > regards,
> > Manuel
> > 
> > 
> On 2010-11-04 15:14, Manuel Bogner wrote:
> >> Hi Sérgio,
> >>
> >> thanks for your quick answer.
> >>
> >> Such lines do appear after waiting a little while, but the behavior
> >> is still the same.
> >>
> >> [ 2063.720211] o2net: connected to node xen02a (num 0) at
> >> 10.0.0.168:7777
> >>
> >> [ 1979.611076] o2net: accepted connection from node xen02b (num 1)
> >> at 10.0.0.102:7777
> >>
> >>
> >> xen02a:~# lsmod | egrep 'jbd|ocfs2|configfs'
> >> ocfs2                 395816  1
> >> ocfs2_dlmfs            23696  1
> >> ocfs2_stack_o2cb        9088  1
> >> ocfs2_dlm             197824  2 ocfs2_dlmfs,ocfs2_stack_o2cb
> >> ocfs2_nodemanager     208744  8 ocfs2,ocfs2_dlmfs,ocfs2_stack_o2cb,ocfs2_dlm
> >> ocfs2_stackglue        16432  2 ocfs2,ocfs2_stack_o2cb
> >> configfs               29736  2 ocfs2_nodemanager
> >> jbd                    54696  2 ocfs2,ext3
> >>
> >> xen02a:~# netstat -an | grep 7777
> >> tcp        0      0 10.0.0.168:7777         0.0.0.0:*               LISTEN
> >> tcp        0      0 10.0.0.168:7777         10.0.0.102:47547        ESTABLISHED
> >>
> >> xen02b:~# lsmod | egrep 'jbd|ocfs2|configfs'
> >> ocfs2                 395816  1
> >> ocfs2_dlmfs            23696  1
> >> ocfs2_stack_o2cb        9088  1
> >> ocfs2_dlm             197824  2 ocfs2_dlmfs,ocfs2_stack_o2cb
> >> ocfs2_nodemanager     208744  8 ocfs2,ocfs2_dlmfs,ocfs2_stack_o2cb,ocfs2_dlm
> >> ocfs2_stackglue        16432  2 ocfs2,ocfs2_stack_o2cb
> >> configfs               29736  2 ocfs2_nodemanager
> >> jbd                    54696  2 ocfs2,ext3
> >>
> >> xen02b:~# netstat -an | grep 7777
> >> tcp        0      0 10.0.0.102:7777         0.0.0.0:*               LISTEN
> >> tcp        0      0 10.0.0.102:47547        10.0.0.168:7777         ESTABLISHED
> >>
> >> There are no iptables entries on either node, as they are just
> >> test servers.
> >>
> >> xen02a:~# uname -a
> >> Linux xen02a 2.6.26-2-xen-amd64 #1 SMP Thu Sep 16 16:32:15 UTC 2010
> >> x86_64 GNU/Linux
> >>
> >> xen02b:~# uname -a
> >> Linux xen02b 2.6.26-2-xen-amd64 #1 SMP Thu Sep 16 16:32:15 UTC 2010
> >> x86_64 GNU/Linux
> >>
> >> xen02b:~# cat /etc/default/o2cb
> >> #
> >> # This is a configuration file for automatic startup of the O2CB
> >> # driver.  It is generated by running /etc/init.d/o2cb configure.
> >> # On Debian based systems the preferred method is running
> >> # 'dpkg-reconfigure ocfs2-tools'.
> >> #
> >>
> >> # O2CB_ENABLED: 'true' means to load the driver on boot.
> >> O2CB_ENABLED=true
> >>
> >> # O2CB_STACK: The name of the cluster stack backing O2CB.
> >> O2CB_STACK=o2cb
> >>
> >> # O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
> >> O2CB_BOOTCLUSTER=ocfs2
> >>
> >> # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
> >> O2CB_HEARTBEAT_THRESHOLD=31
> >>
> >> # O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
> >> O2CB_IDLE_TIMEOUT_MS=30000
> >>
> >> # O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent
> >> O2CB_KEEPALIVE_DELAY_MS=2000
> >>
> >> # O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
> >> O2CB_RECONNECT_DELAY_MS=2000
> >>
> >>
> >> xen02b:~# mount
> >> /dev/sda1 on / type ext3 (rw,errors=remount-ro)
> >> tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
> >> proc on /proc type proc (rw,noexec,nosuid,nodev)
> >> sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
> >> procbususb on /proc/bus/usb type usbfs (rw)
> >> udev on /dev type tmpfs (rw,mode=0755)
> >> tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
> >> devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
> >> configfs on /sys/kernel/config type configfs (rw)
> >> ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)
> >> /dev/sdd1 on /shared type ocfs2 (rw,_netdev,heartbeat=local)
> >>
> >>
> >> regards,
> >> Manuel
> >>
> >> On 2010-11-04 15:03, Sérgio Surkamp wrote:
> >>> It seems that o2net (the network stack) is not running, because
> >>> otherwise you should see network messages in dmesg. Something like:
> >>>
> >>> xen02a kernel: o2net: connected to node xen02b (num 0) at
> >>> 10.0.0.102:7777
> >>>
> >>> Check your firewall and network configuration. Also check that
> >>> the [o2net] kernel thread is running and that TCP port 7777 is
> >>> listening on both nodes. If the thread is not running, check that
> >>> you have all the needed kernel modules loaded:
> >>>
> >>> ocfs2
> >>> jbd
> >>> ocfs2_dlm
> >>> ocfs2_dlmfs
> >>> ocfs2_nodemanager
> >>> configfs
> >>>
> >>> Regards,
> >>> Sérgio
> >>>
> >>> On Thu, 04 Nov 2010 14:12:11 +0100
> >>> Manuel Bogner <manuel.bogner at geizhals.at> wrote:
> >>>
> >>>> sorry for the repost, but just saw that i mixed german and
> >>>> english... here is the corrected version:
> >>>>
> >>>>
> >>>>
> >>>> Hi,
> >>>>
> >>>> I'm trying to create a cluster out of 2 nodes. Both systems
> >>>> share the same LUN via FC and see it as /dev/sdd.
> >>>>
> >>>> /dev/sdd has one partition
> >>>>
> >>>> Disk /dev/sdd: 21.4 GB, 21474836480 bytes
> >>>> 64 heads, 32 sectors/track, 20480 cylinders
> >>>> Units = cylinders of 2048 * 512 = 1048576 bytes
> >>>> Disk identifier: 0xc29cb93d
> >>>>
> >>>>    Device Boot      Start         End      Blocks   Id  System
> >>>> /dev/sdd1               1       20480    20971504   83  Linux
> >>>>
> >>>> which is formatted with
> >>>>
> >>>>   mkfs.ocfs2 -L ocfs2 /dev/sdd1
> >>>>
> >>>>
> >>>> Here is my /etc/ocfs2/cluster.conf
> >>>>
> >>>> node:
> >>>>     ip_port = 7777
> >>>>     ip_address = 10.0.0.168
> >>>>     number = 0
> >>>>     name = xen02a
> >>>>     cluster = ocfs2
> >>>>
> >>>> node:
> >>>>     ip_port = 7777
> >>>>     ip_address = 10.0.0.102
> >>>>     number = 1
> >>>>     name = xen02b
> >>>>     cluster = ocfs2
> >>>>
> >>>> cluster:
> >>>>     node_count = 2
> >>>>     name = ocfs2
> >>>>
> >>>>
> >>>> Everything seems to be fine:
> >>>>
> >>>> xen02a:~# /etc/init.d/o2cb status
> >>>> Driver for "configfs": Loaded
> >>>> Filesystem "configfs": Mounted
> >>>> Stack glue driver: Loaded
> >>>> Stack plugin "o2cb": Loaded
> >>>> Driver for "ocfs2_dlmfs": Loaded
> >>>> Filesystem "ocfs2_dlmfs": Mounted
> >>>> Checking O2CB cluster ocfs2: Online
> >>>> Heartbeat dead threshold = 31
> >>>>   Network idle timeout: 30000
> >>>>   Network keepalive delay: 2000
> >>>>   Network reconnect delay: 2000
> >>>> Checking O2CB heartbeat: Active
> >>>>
> >>>> And mounting the fs on each node works fine:
> >>>>
> >>>> /dev/sdd1 on /shared type ocfs2 (rw,_netdev,heartbeat=local)
> >>>>
> >>>> Both nodes can ping each other.
> >>>>
> >>>>
> >>>> xen02a:~# mounted.ocfs2 -d
> >>>> Device                FS     UUID                                  Label
> >>>> /dev/sdd1             ocfs2  55a9d0b0-050c-484f-9725-7788a3b9dde0  ocfs2
> >>>>
> >>>> xen02b:~# mounted.ocfs2 -d
> >>>> Device                FS     UUID                                  Label
> >>>> /dev/sdd1             ocfs2  55a9d0b0-050c-484f-9725-7788a3b9dde0  ocfs2
> >>>>
> >>>>
> >>>> Now the problem:
> >>>>
> >>>> I first mount the device on node1:
> >>>>
> >>>>  xen02a:~# mount -L ocfs2 /shared/
> >>>> => /dev/sdd1 on /shared type ocfs2 (rw,_netdev,heartbeat=local)
> >>>> without any errors.
> >>>>
> >>>> dmesg says:
> >>>>
> >>>> [   97.244054] ocfs2_dlm: Nodes in domain
> >>>> ("55A9D0B0050C484F97257788A3B9DDE0"): 0
> >>>> [   97.245869] kjournald starting.  Commit interval 5 seconds
> >>>> [   97.247045] ocfs2: Mounting device (8,49) on (node 0, slot 0)
> >>>> with ordered data mode.
> >>>>
> >>>> xen02a:~# mounted.ocfs2 -f
> >>>> Device                FS     Nodes
> >>>> /dev/sdd1             ocfs2  xen02a
> >>>>
> >>>> xen02a:~# echo "slotmap" | debugfs.ocfs2 -n /dev/sdd1
> >>>> 	Slot#   Node#
> >>>> 	    0       0
> >>>>
> >>>>
> >>>> Now I mount the device on the second node:
> >>>>
> >>>> xen02b:~# mount -L ocfs2 /shared/
> >>>> => /dev/sdd1 on /shared type ocfs2 (rw,_netdev,heartbeat=local)
> >>>>
> >>>> [  269.741168] OCFS2 1.5.0
> >>>> [  269.765171] ocfs2_dlm: Nodes in domain
> >>>> ("55A9D0B0050C484F97257788A3B9DDE0"): 1
> >>>> [  269.779620] kjournald starting.  Commit interval 5 seconds
> >>>> [  269.779620] ocfs2: Mounting device (8,49) on (node 1, slot 1)
> >>>> with ordered data mode.
> >>>> [  269.779620] (2953,0):ocfs2_replay_journal:1149 Recovering
> >>>> node 0 from slot 0 on device (8,49)
> >>>> [  270.950540] kjournald starting.  Commit interval 5 seconds
> >>>>
> >>>> xen02b:~# echo "slotmap" | debugfs.ocfs2 -n /dev/sdd1
> >>>> 	Slot#   Node#
> >>>> 	    1       1
> >>>>
> >>>> xen02b:~# mounted.ocfs2 -f
> >>>> Device                FS     Nodes
> >>>> /dev/sdd1             ocfs2  xen02b
> >>>>
> >>>>
> >>>> So the first mount seems to be gone and any changes on the fs on
> >>>> that node are not distributed.
> >>>>
> >>>> At the moment I have no idea what this could be. I hope someone
> >>>> can help me.
> >>>>
> >>>> regards,
> >>>> Manuel
> >>>>
> >>>> _______________________________________________
> >>>> Ocfs2-users mailing list
> >>>> Ocfs2-users at oss.oracle.com
> >>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users

-- 
  .:''''':.
.:'        `     Sérgio Surkamp | Gerente de Rede
::    ........   sergio at gruposinternet.com.br
`:.        .:'
  `:,   ,.:'     *Grupos Internet S.A.*
    `: :'        R. Lauro Linhares, 2123 Torre B - Sala 201
     : :         Trindade - Florianópolis - SC
     :.'
     ::          +55 48 3234-4109
     :
     '           http://www.gruposinternet.com.br