[Ocfs2-users] Unstable Cluster Node

Fri Nov 30 03:25:27 PST 2007

Hi,

I have a 2-Node OCFS2 Cluster on top of DRBD 8.0.4. 
The kernel version I use is:

uname -a
Linux webhost1 2.6.18-028stab039 #2 SMP Tue Aug 21 17:49:05 UTC 2007 i686 GNU/Linux

Both nodes are in the same BladeCenter and directly connected at 1 Gbit/s through the BladeCenter's internal Ethernet switch.

One of the nodes stops working at least once a day with the following messages:

Nov 23 19:05:02 webhost2 kernel: (4424,3):o2net_sendpage:827 ERROR: sendpage of size 24 to node webhost1 (num 0) at 10.2.0.70:7777 failed with 4294967264
Nov 23 19:05:02 webhost2 kernel: (6774,0):dlm_send_remote_convert_request:395 ERROR: status = -107
Nov 23 19:05:02 webhost2 kernel: (4997,2):dlm_send_remote_convert_request:395 ERROR: status = -107
Nov 23 19:05:02 webhost2 kernel: (4997,2):dlm_wait_for_node_death:374 225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of node 0
Nov 23 19:05:02 webhost2 kernel: (6774,0):dlm_wait_for_node_death:374 225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of node 0
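
For reading the numeric codes above, here is a minimal sketch in Python. It assumes the value printed by o2net_sendpage is a negative return code shown as an unsigned 32-bit number, and that both codes are negated Linux errno values. The function name and the small lookup table are only for illustration (the table hardcodes a few Linux errno numbers and is not complete).

```python
# Hypothetical helper: interpret a logged status as a signed 32-bit value
# and look it up in a small table of Linux errno numbers.
# Assumption: the logged values are negated errno codes.
LINUX_ERRNO = {
    32: "EPIPE",          # Broken pipe
    104: "ECONNRESET",    # Connection reset by peer
    107: "ENOTCONN",      # Transport endpoint is not connected
    110: "ETIMEDOUT",     # Connection timed out
    113: "EHOSTUNREACH",  # No route to host
}

def decode_status(value, bits=32):
    """Return (signed_value, errno_name) for a status taken from the log."""
    # Values at or above 2**(bits-1) are negative numbers printed as unsigned.
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    name = LINUX_ERRNO.get(abs(value), "UNKNOWN")
    return value, name

for raw in (4294967264, -107):
    signed, name = decode_status(raw)
    print(f"{raw} -> {signed} ({name})")
```

Under these assumptions, the sendpage failure reads as -32 (EPIPE) and the DLM status -107 as ENOTCONN, which would both point at the network connection between the two nodes rather than at the disk.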

After that, the node hangs and does not even reboot, although /proc/sys/kernel/panic and /proc/sys/kernel/panic_on_oops are both set to 1.
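
To double-check those two settings on each node, a small sketch like the following can read them back. The function name and the proc_root parameter are only for illustration. As far as I understand, kernel.panic is the number of seconds to wait before rebooting after a panic, and panic_on_oops turns an oops into a panic, so neither would take effect if the node hangs without an actual oops or panic.

```python
from pathlib import Path

def read_panic_settings(proc_root="/proc/sys/kernel"):
    """Return the current panic-related sysctl values as integers.

    A value of None means the file could not be read or parsed.
    proc_root is a parameter only so the function can be pointed
    at another directory.
    """
    settings = {}
    for name in ("panic", "panic_on_oops"):
        path = Path(proc_root) / name
        try:
            settings[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            settings[name] = None
    return settings

print(read_panic_settings())
```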

Can anybody please help me understand these error messages and make that node more stable?

Thanks,
- Rainer
