[Ocfs2-tools-devel] [PATCH 1/1] o2hbmonitor: Disk heartbeat monitor

Sunil Mushran sunil.mushran at oracle.com
Wed Oct 27 10:35:19 PDT 2010


So the next drop of the same is here. Srini will be shepherding the
test/review cycle as I will be out for 2 weeks.

http://oss.oracle.com/~smushran/elapsed/tools/0001-o2hbmonitor-Disk-heartbeat-monitor.patch

On 10/27/2010 04:35 AM, tristan wrote:
> I guess we'd betther have a signal handler in the program to
> deal with SIGINT/SIGTERM/SIGKILL correctly doing some cleanups,
> such as free_config(), especially for a dameon, which normally
> is going to be killed by an arbitrary 'kill -9'

A kill will cleanup those allocs. Though a signal handler would be cleaner code-wise.
Anycase, the new drop does not have that structure. The flaw in the previous
one was that it did not detect new hb regions. Only a stopped hbregion would
make it re-read the config.

>> +static char *do_strchomp(char *str)
>> +{
>> +    int len = strlen(str);
>> +    char *p;
>> +
>> +    if (!len)
>> +        return str;
>> +
>> +    p = str + len - 1;
>
> I guess you're trying to remove all trailing whitespace here,
> so if the string is terminated by a null-whitespace char, then
> we're safe to leave:
>
> if (!isspace(*p))
> return str;

The current return is 3 lines below. We don't gain anything.
Most compilers can do this optimization.

> +    while ((isspace(*p) || *p == '\n') && len--)
>
> Also there is a corner bug when the target string was nothing but all
> consists of whitespaces:when the 'len' reaches 0, '*p' is actually
> accessing the invalid byte which is ahead of string address. it should be:
>
> +    while ((len--) && (isspace(*p) || (*p == '\n')))

Yes. Make sure Srini fixes this. ;)

>> +static int get_value(char *path, char *value, int count)
>> +{
>> +    int fd = -1, ret = -1;
>> +    char *p = value;
>> +
>> +    fd = open(path, O_RDONLY);
>> +    if (fd > 0)
>> +        ret = read(fd, value, count);
> +        *p-- = '\0';
>
> else
> return ret;

Current flow also works.

>> +static int populate_elapsed_time(struct o2hb_config *oc)
>> +{
>> +    int i, ret;
> +
>
> ret = -1;

0 regions does not mean error.

>
> /* in case the directory was empty, except for '.' and '..' */
> if (!oc->num_regions)
> goto bail;

Code is no more.

> This light utility was just lack of a mechanism to log detailed log if user want, I mean not to
> pollute the syslog, just wanna let user know if syscalls in the program like open/read/readdir
> succeeded or not, a tricky way is to output this kind of log via socket, sending the UDP logs to the
> localhost, it's all up to the user to choose listen or not. while having said that, we're also going
> to get a good chance to avoid launching 2 or more 'o2hbmonitor' daemons in the same node, by judging
> if a UNIQUE port was using by the daemon, we usually don't want to run 2 'o2hbmonitor' at the same time,
> do we?

A monitor for a monitor is classic over-engineering. If open/read/readdir
are failing, it assumes that the cluster is not up or debugfs is not mounted.
Occam's Razor. If former, try again after some timeout. If latter, print
a discreet message.

We don't want multiple o2hbmonitors running. Easy to do using a semaphore.
I have left that for Srini. ;)




More information about the Ocfs2-tools-devel mailing list