[Ocfs2-users] Debugging help / Guidance on architecture

Damon Miller damon at thinkingphones.com
Fri May 15 22:06:04 PDT 2009


Sunil Mushran wrote:

> Damon Miller wrote:

> > We're running a 3-node OCFS2 1.2.9 cluster with a 5-TB iSCSI block
> > device as the backing store.  All machines are running CentOS, with the
> > iSCSI target running CentOS 5.2 and the initiators running CentOS 4.7.
> > The purpose of the cluster is to evaluate alternatives to our current
> > solution for replicating audio files which are generated from multiple PBX
> > servers running Asterisk.

> > We currently use Unison for file-level replication to and from a
> > dedicated machine such that there are multiple copies of the audio tree--
> > one per PBX server.  This allows us to quickly and easily move customers
> > among our servers for load-balancing and disaster recovery purposes.
> > Unfortunately, we're encountering scalability problems with the Unison-
> > based approach, e.g. conflicts, slow propagation time, etc.

> > The hope was that moving to a clustered filesystem would improve
> > propagation time, reduce conflicts, and allow us to scale more
> > effectively.  I chose OCFS2 because it seemed the simplest solution
> > architecturally and because of its certification by Oracle for use with
> > the database product.  (My thought was that Oracle's certification
> > requirements would likely supersede those of a general-purpose filesystem,
> > though please correct me if this was naïve or misguided.)

> Oracle cert requirements are based on the Oracle db workload. General
> purpose is all-encompassing: there is no single certification that can cover
> general-purpose use, as it is hard to capture the essence of all possible workloads.
> Having said that, we have many users who have been using it in many different
> environments for many years now. So you are not breaking any new ground.

Sunil, thanks for your quick reply.  This makes perfect sense.


> > Having said all that, this morning around 7:00am EDT we began seeing
> > OCFS2-related errors in one of our server's syslog.  Specifically:
> >
> > --
> >
> > May 15 07:08:00 cam-c6 kernel: o2net: no longer connected to node cam-p1 (num 1) at 10.10.89.110:7777
> > May 15 07:08:00 cam-c6 kernel: (17170,0):ocfs2_broadcast_vote:731 ERROR: status = -112
> > May 15 07:08:00 cam-c6 kernel: (17170,0):ocfs2_do_request_vote:804 ERROR: status = -112
> > May 15 07:08:00 cam-c6 kernel: (17170,0):ocfs2_rename:1207 ERROR: status = -112
> > May 15 07:08:00 cam-c6 kernel: (17170,0):ocfs2_broadcast_vote:731 ERROR: status = -107
> > May 15 07:08:00 cam-c6 kernel: (17170,0):ocfs2_do_request_vote:804 ERROR: status = -107
> > May 15 07:08:00 cam-c6 kernel: (17170,0):ocfs2_rename:1103 ERROR: status = -107
> >
> > [last message repeated many times]
> >
> > May 15 07:08:30 cam-c6 kernel: (4335,0):o2net_connect_expired:1585
> > ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
> >
> > ...
> >
> > May 15 09:22:29 cam-c6 kernel: (4335,0):o2net_connect_expired:1585
> > ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
> >
> > --
> >
> >
> > This continued until 9:22am EDT, at which point one of our engineers
> > manually rebooted the machine in an attempt to remedy the voicemail
> > problems in response to Asterisk complaining of read/write problems to its
> > voicemail tree.

> > I was surprised OCFS2 didn't panic the kernel and automatically reboot
> > the machine after the 30-second timeout.  I thought this was the default
> > behavior and in fact I forced this condition by manually stopping the
> > iSCSI daemon during preliminary testing.  Instead, the kernel complained
> > for over two hours before someone manually rebooted the machine, at which
> > point the cluster reconnected and resumed operation.  Is this expected?

> The connection between two nodes can snap, but its failing to reconnect is strange.
> One would think that 30 secs would be more than adequate for two nodes to
> make a tcp connect. Do you have any firewalls in-between that could be
> interfering?

The two servers are actually connected to the same switch.  We are using iptables for basic packet filtering on all of our hosts, but TCP/7777 is open on all machines participating in the cluster.  iSCSI is also enabled on TCP/3260.  Here are the relevant excerpts from 'iptables -L -n' on the iSCSI target:

ACCEPT     tcp  --  10.10.89.0/24        0.0.0.0/0           state NEW tcp dpt:3260 
ACCEPT     tcp  --  10.10.89.0/24        0.0.0.0/0           state NEW tcp dpt:7777


> > According to the relevant switch (a managed Cisco) there was no
> > interruption in network connectivity between these two machines.  Neither
> > server logged anything related to a network link failure so the only real
> > information I have is from OCFS2.  Frankly I'm not sure how to proceed
> > from here but I obviously want to address the reliability concerns this
> > problem raises since we're considering OCFS2 for replacing our existing
> > solution throughout our datacenters.

> If a firewall (iptables) is responsible, then it will not show up as
> a link failure.


> > I tried to map the numerical error codes -112 and -107 to specific
> > problems based on the code ('tcp.c' and 'vote.c' in particular) but I was
> > unsuccessful.

> ENOTCONN.

Ah, thank you!
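For the archives, those negative status codes are just negated Linux errno values, so they can be decoded directly.  A quick sketch in Python (-107 is ENOTCONN as you say; -112 comes out as EHOSTDOWN on Linux):

```python
import errno
import os

# OCFS2 logs negative errno values; negate them before lookup.
for status in (-112, -107):
    code = -status
    print(code, errno.errorcode[code], "-", os.strerror(code))
```

This prints the symbolic name (e.g. ENOTCONN) and the human-readable message for each code.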


> > In general, I suppose I'm curious if anyone has high-level feedback on
> > the planned use of OCFS2 in this scenario.  Am I overcomplicating things?
> > Assuming the pilot works, we do plan to roll out a dedicated storage
> > network which will include redundant switching, NICs, iSCSI targets with
> > multiple paths to the physical storage, etc.  I just need to validate the
> > basic approach at present.

> Your email does not actually say how you are using the fs. You have
> mentioned the older replication method. I would imagine that that is not your
> concern now. The question is: how are the nodes accessing the fs now? How many files
> do you have in a dir? Are all nodes creating files in one dir? What other types of
> contention are there?

Excellent point.  In terms of the filesystem layout, we assign one directory to each of our customers (with a very few exceptions).  Within each of these directories is another set of directories representing each extension or DID (phone number) that receives voicemail.  Finally, each extension has its own directory hierarchy responsible for storing messages and greetings.  Here's a quick example showing one mailbox for a single customer (actual number and name obscured for privacy purposes):

cust
|-- 502___1667
|   |-- INBOX
|   |   |-- msg0000.txt
|   |   |-- msg0000.wav
|   |   |-- msg0001.txt
|   |   |-- msg0001.wav
|   |   |-- msg0002.txt
|   |   |-- msg0002.wav
|   |   |-- msg0003.txt
|   |   |-- msg0003.wav
|   |   |-- msg0004.txt
|   |   `-- msg0004.wav
|   |-- Old
|   |-- busy.wav
|   |-- greet
|   |-- greet.wav
|   |-- temp
|   |-- tmp
|   |-- unavail
|   `-- unavail.wav

[etc.]

The whole tree currently houses 194,713 files and 52,768 directories consuming just under 60 GB of storage.

In terms of utilization, I just scanned the tree for all messages created within the past 24 hours and came up with a total of 87,207 consuming nearly 50 GB (>80% of the total usage).  That's spread across at least seven different servers.  Usage varies considerably during that 24-hour period but we've done some basic estimation along these lines.

The average write I/O over the 24-hour period is 600 KB/s (again, spread across seven servers).  Peak utilization is on the order of 5-6 times this, or roughly 3.5 MB/s.  This number is a little misleading in the current configuration, as messages are first stored locally and then propagated via Unison, but it's certainly relevant for the shared storage approach.
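For what it's worth, those figures are just back-of-envelope arithmetic assuming the ~50 GB of new messages is spread evenly over the 24 hours:

```python
# Rough estimate: ~50 GB of new voicemail written over 24 hours.
bytes_per_day = 50 * 10**9
seconds_per_day = 24 * 60 * 60

avg_write = bytes_per_day / seconds_per_day          # ~579,000 bytes/sec
print(f"average: {avg_write / 1000:.0f} KB/s")       # ~600 KB/s
print(f"peak (x6): {avg_write * 6 / 10**6:.1f} MB/s")  # ~3.5 MB/s
```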

We're also using OCFS2 for call recordings, though we first store them on a local ramdisk to ensure sufficient throughput and then copy them to persistent storage.  We are able to guarantee globally unique filenames to eliminate conflicts.
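The uniqueness guarantee comes from composing each recording's name from per-server and per-call identifiers.  A minimal sketch of the idea (the hostname/timestamp/call-ID scheme shown here is illustrative, not our exact format):

```python
import socket
import time
import uuid

def recording_filename(call_id: str) -> str:
    """Build a globally unique recording name from the hostname,
    an epoch timestamp, the call identifier, and a random suffix."""
    host = socket.gethostname()
    stamp = int(time.time())
    return f"{host}-{stamp}-{call_id}-{uuid.uuid4().hex}.wav"

print(recording_filename("ext502"))
```

Because no two servers share a hostname and the random suffix breaks ties within a server, two nodes writing into the same OCFS2 directory can never collide on a name.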

While I'm providing too much information, I might as well describe the transition and remote replication plans.  :)

Assuming we can reach the necessary comfort level with OCFS2, the transition plan is to incrementally migrate each of our customer servers away from file-based replication by establishing an initial replica on the iSCSI target, which would be mounted by members of the OCFS2 cluster.  Doing so would obviate the need for maintaining a file-based replica on these members, thus allowing us to slowly move away from the file-based solution.  However, until all customer servers are using shared storage we would need to continue syncing the iSCSI target to our current repository with Unison.  In practice this means mounting the OCFS2 volume through the iSCSI target's loopback address and running Unison against the current repository.

Lastly, we would deploy a replica of this OCFS2/iSCSI-based solution in our other primary datacenter and migrate its servers as described above.  Once completed, the plan is to use file-based replication between the two iSCSI targets in order to propagate changes across the WAN.  Conflicts could occur here if customers fail over to their tertiary server, but these are infrequent and we're comfortable resolving them manually.

Hopefully this provides some context.



> OCFS2 can handle contention. The thing to remember is that contention even
> on a single node will affect the performance. It only affects more in a
> clustered setup.

Agreed, and this is an area I frankly do not fully understand.

One problem we're hoping to solve with OCFS2 is the frequent conflicts we see as a result of the current file-based approach.  Asterisk makes no attempt to ensure unique filenames for messages, thus each server effectively operates independently.  If a secondary server is used prior to file synchronization, it's quite possible that a new message will introduce a conflict.  Asterisk will simply increment the message number based on its local filesystem (e.g. "msg0001" -> "msg0002") and store the recording.  This will generate a conflict if these files exist on the primary server as a result of earlier, unsynchronized messages.  The hope is that OCFS2 will provide filesystem consistency across all member nodes such that filename-based conflicts as described above will be avoided.
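To illustrate the conflict: Asterisk picks the next message number by scanning the mailbox directory, roughly like this (a simplified sketch in Python, not Asterisk's actual code).  With a shared OCFS2 mount, every server scans the same directory, so two servers can no longer pick the same number from divergent local copies:

```python
import os
import re

def next_msg_name(inbox_dir: str) -> str:
    """Pick the next msgNNNN name by scanning the existing files,
    analogous to how Asterisk numbers voicemail messages."""
    pattern = re.compile(r"msg(\d{4})\.wav$")
    used = [int(m.group(1)) for f in os.listdir(inbox_dir)
            if (m := pattern.match(f))]
    return f"msg{(max(used) + 1 if used else 0):04d}"

# Example: if INBOX already holds msg0000.wav..msg0004.wav,
# the next name is msg0005 -- on a shared filesystem, every
# node sees the same set of files and agrees on that answer.
```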

In terms of actual I/O contention, we've established basic operating characteristics governing the number of customers we can place on a single server.  This is dictated by a combination of CPU and I/O capacity, the latter of these being impacted significantly by the file-based replication approach.  What I do not yet have is an understanding of how OCFS2 affects this recipe (ideally for the better!).


Thanks very much for your response.


Regards,

Damon 


