[rds-devel] [PATCH 01/21] RDS: Socket interface

Andrew Grover andy.grover at gmail.com
Wed Jan 28 20:02:49 PST 2009


On Tue, Jan 27, 2009 at 4:08 AM, Evgeniy Polyakov <zbr at ioremap.net> wrote:
> Hi Andy.

Hi Evgeniy, thanks for taking the time to review this.

>> +/* this is just used for stats gathering :/ */
>
> Shouldn't this be some kind of per-cpu data?
> Global list of all sockets? This does not scale, maybe it should be
> grouped into a hash table or be per-device?

sch mentioned this too... is socket creation often a bottleneck? If so
we can certainly improve scalability here.

In any case, this is in the code to support a listing of RDS sockets
via the rds-info utility. Instead of having our own custom program to
list rds sockets we probably want to export an interface so netstat
will list them. Unfortunately netstat seems to be hardcoded to look
for particular entries in /proc/net, so both rds and netstat would
need to be updated before this would work, and RDS's custom
socket-listing interface dropped.

>> +static int rds_release(struct socket *sock)
>> +{
>> +     struct sock *sk = sock->sk;
>> +     struct rds_sock *rs;
>> +     unsigned long flags;
>> +
>> +     if (sk == NULL)
>> +             goto out;
>> +
>> +     rs = rds_sk_to_rs(sk);
>> +
>> +     sock_orphan(sk);
>
> Why is it needed when the socket is about to be freed?

from the comments above that code:

 * We have to be careful about racing with the incoming path.  sock_orphan()
 * sets SOCK_DEAD and we use that as an indicator to the rx path that new
 * messages shouldn't be queued.

Is that an appropriate usage of sock_orphan()?

> Does RDS sockets work with high number of creation/destruction
> workloads?

I'd guess from your comments that performance probably wouldn't be great :)

>> +static unsigned int rds_poll(struct file *file, struct socket *sock,
>> +                          poll_table *wait)
>> +{
>> +     struct sock *sk = sock->sk;
>> +     struct rds_sock *rs = rds_sk_to_rs(sk);
>> +     unsigned int mask = 0;
>> +     unsigned long flags;
>> +
>> +     poll_wait(file, sk->sk_sleep, wait);
>> +
>> +     poll_wait(file, &rds_poll_waitq, wait);
>> +
>
> Are you absolutely sure that provided poll_table callback
> will not do the bad things here? It is quite unusual to add several
> different queues into the same head in the poll callback.
> And shouldn't rds_poll_waitq be lock protected here?

I don't know. I looked into the poll_wait code a little and it
appeared to be designed to allow adding multiple wait queues.

I'm not very strong in this area and would love some more expert input here.

>> +     read_lock_irqsave(&rs->rs_recv_lock, flags);
>> +     if (!rs->rs_cong_monitor) {
>> +             /* When a congestion map was updated, we signal POLLIN for
>> +              * "historical" reasons. Applications can also poll for
>> +              * WRBAND instead. */
>> +             if (rds_cong_updated_since(&rs->rs_cong_track))
>> +                     mask |= (POLLIN | POLLRDNORM | POLLWRBAND);
>> +     } else {
>> +             spin_lock(&rs->rs_lock);
>
> Is there a possibility to have lock iteraction problem with above
> rs_recv_lock read lock?

I didn't see anywhere they were being acquired in reverse order, or
simultaneously. This is the kind of thing lockdep would catch
immediately, right? I think I have it turned on, but I'll double-check.

>> +#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 24)
>
> This should be dropped in the mainline tree.

yup.

> Hash table with the appropriate size will have faster lookup/access
> times btw.

No doubt. Definitely want to make this improvement at some point.

>> +static struct rds_sock *rds_bind_tree_walk(__be32 addr, __be16 port,
>> +                                        struct rds_sock *insert)
>> +{
>> +     struct rb_node **p = &rds_bind_tree.rb_node;
>> +     struct rb_node *parent = NULL;
>> +     struct rds_sock *rs;
>> +     u64 cmp;
>> +     u64 needle = ((u64)be32_to_cpu(addr) << 32) | be16_to_cpu(port);
>> +
>> +     while (*p) {
>> +             parent = *p;
>> +             rs = rb_entry(parent, struct rds_sock, rs_bound_node);
>> +
>> +             cmp = ((u64)be32_to_cpu(rs->rs_bound_addr) << 32) |
>> +                   be16_to_cpu(rs->rs_bound_port);
>> +
>> +             if (needle < cmp)
>
> Should it use wrapping logic if some field overflows?

Sorry, could you explain what you mean?

>> +     rdsdebug("returning rs %p for %u.%u.%u.%u:%u\n", rs, NIPQUAD(addr),
>> +             ntohs(port));
>
> Iirc there is a new %pi4 or similar format id.

Yup, will do.

Thanks again.

Regards -- Andy


