From metze at samba.org Tue Apr 1 08:19:05 2025
From: metze at samba.org (Stefan Metzmacher)
Date: Tue, 1 Apr 2025 10:19:05 +0200
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer via optlen_t to proto[_ops].getsockopt()
In-Reply-To:
References:
Message-ID: <39515c76-310d-41af-a8b4-a814841449e3@samba.org>

On 31.03.25 at 23:04, Stanislav Fomichev wrote:
> On 03/31, Stefan Metzmacher wrote:
>> The motivation for this is to remove the SOL_SOCKET limitation
>> from io_uring_cmd_getsockopt().
>>
>> The reason for this limitation is that io_uring_cmd_getsockopt()
>> passes a kernel pointer as optlen to do_sock_getsockopt()
>> and can't reach the ops->getsockopt() path.
>>
>> The first idea would be to change the optval and optlen arguments
>> to the protocol specific hooks also to sockptr_t, as that
>> is already used for setsockopt() and also by do_sock_getsockopt()
>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
>>
>> But as Linus don't like 'sockptr_t' I used a different approach.
>>
>> @Linus, would that optlen_t approach fit better for you?
>
> [..]
>
>> Instead of passing the optlen as user or kernel pointer,
>> we only ever pass a kernel pointer and do the
>> translation from/to userspace in do_sock_getsockopt().
>
> At this point why not just fully embrace iov_iter? You have the size
> now + the user (or kernel) pointer. Might as well do
> s/sockptr_t/iov_iter/ conversion?

I think that would only be possible if we introduce
proto[_ops].getsockopt_iter() and then convert the implementations
step by step. Doing it all in one go has a lot of potential to break
the uapi. I could try to convert things like socket, ip and tcp myself, but
the rest needs to be converted by the maintainer of the specific protocol,
as it needs to be tested. There are crazy things happening in the existing
implementations, e.g. some getsockopt() implementations use optval as both
an input and an output buffer.

I first tried to convert both optval and optlen of getsockopt to sockptr_t,
and that showed that touching the optval part gets complex very quickly, see
https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
(note that it didn't convert everything; I gave up after hitting
sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
more also do both copy_from_user and copy_to_user on optval).

I also came across one implementation that returned -ERANGE because *optlen was
too short and put the required length into *optlen, which means the returned
*optlen is larger than the optval buffer given from userspace.

Because of all these strange things I tried to make a minimal change
that gets rid of the io_uring limitation: I only converted
optlen and left optval as is, so that the patchset has a low risk
of causing regressions.

But as an alternative we could introduce a prototype like this:

        int (*getsockopt_iter)(struct socket *sock, int level, int optname,
                               struct iov_iter *optval_iter);

It returns a non-negative value, which can be placed into *optlen,
or a negative value as an error, in which case *optlen is not changed.
optval_iter gets the direction ITER_DEST, so it can only be written to.

Implementations could then opt in to the new interface and
allow do_sock_getsockopt() to also work for the io_uring case,
while all others would still get -EOPNOTSUPP.

So what should be the way to go?
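
As a rough illustration of the return-value contract described above, here is
a userspace model in plain C. It is only a sketch: struct dest_iter,
copy_to_dest(), demo_getsockopt_iter() and wrap_getsockopt_iter() are
hypothetical names made up for this example, the socket argument is dropped,
and no kernel API is used. It only shows that *optlen is updated on success
and left alone on error.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-in for a write-only (ITER_DEST) iov_iter. */
struct dest_iter {
	char *base;	/* next byte to write */
	size_t count;	/* bytes still available */
};

/* Copy into the destination; returns the number of bytes actually written. */
static size_t copy_to_dest(struct dest_iter *it, const void *src, size_t len)
{
	if (len > it->count)
		len = it->count;
	memcpy(it->base, src, len);
	it->base += len;
	it->count -= len;
	return len;
}

/* Hook type modelled on the proposed getsockopt_iter(), minus the socket. */
typedef int (*getsockopt_iter_fn)(int level, int optname,
				  struct dest_iter *optval_iter);

/* Example hook: reports a fixed int and refuses buffers that are too short. */
static int demo_getsockopt_iter(int level, int optname,
				struct dest_iter *optval_iter)
{
	int val = 42;

	(void)level;
	(void)optname;
	if (optval_iter->count < sizeof(val))
		return -EINVAL;
	return (int)copy_to_dest(optval_iter, &val, sizeof(val));
}

/* Wrapper: *optlen is only written back when the hook succeeds. */
static int wrap_getsockopt_iter(getsockopt_iter_fn fn, int level, int optname,
				void *optval, int *optlen)
{
	struct dest_iter it;
	int ret;

	if (fn == NULL)
		return -EOPNOTSUPP;	/* hook not implemented */
	if (*optlen < 0)
		return -EINVAL;

	it.base = optval;
	it.count = (size_t)*optlen;

	ret = fn(level, optname, &it);
	if (ret < 0)
		return ret;		/* error: *optlen stays untouched */

	*optlen = ret;			/* success: report the produced length */
	return 0;
}
```

A hook that also needs to read from optval first, which some existing
implementations do, does not fit this write-only model.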
metze

From metze at samba.org Tue Apr 1 08:24:32 2025
From: metze at samba.org (Stefan Metzmacher)
Date: Tue, 1 Apr 2025 10:24:32 +0200
Subject: [rds-devel] [RFC PATCH 3/4] net: pass a kernel pointer via 'optlen_t' to proto[ops].getsockopt() hooks
In-Reply-To: <20250331224946.13899fcf@pumpkin>
References: <20250331224946.13899fcf@pumpkin>
Message-ID: <51bb66d4-eaf3-4247-ba11-d793b6f0d56c@samba.org>

On 31.03.25 at 23:49, David Laight wrote:
> On Mon, 31 Mar 2025 22:10:55 +0200
> Stefan Metzmacher wrote:
>
>> The motivation for this is to remove the SOL_SOCKET limitation
>> from io_uring_cmd_getsockopt().
>>
>> The reason for this limitation is that io_uring_cmd_getsockopt()
>> passes a kernel pointer.
>>
>> The first idea would be to change the optval and optlen arguments
>> to the protocol specific hooks also to sockptr_t, as that
>> is already used for setsockopt() and also by do_sock_getsockopt()
>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
>>
>> But as Linus don't like 'sockptr_t' I used a different approach.
>>
>> Instead of passing the optlen as user or kernel pointer,
>> we only ever pass a kernel pointer and do the
>> translation from/to userspace in do_sock_getsockopt().
>>
>> The simple solution would be to just remove the
>> '__user' from the int *optlen argument, but it
>> seems the compiler doesn't complain about
>> '__user' vs. without it, so instead I used
>> a helper struct in order to make sure everything
>> compiles with a typesafe change.
>>
>> That together with get_optlen() and put_optlen() helper
>> macros make it relatively easy to review and check the
>> behaviour is most likely unchanged.
>
> I've looked into this before (and fallen down the patch rabbit hole).

Yes, if you want to change the logic at the same time as changing
the kind of argument variable, then it gets messy quite fast.

> I think the best (final) solution is to pass a validated non-negative
> 'optlen' into all getsockopt() functions and to have them usually return
> either -errno or the modified length.
> This simplifies 99% of the functions.

Yes, maybe not 99%, but a lot.

> The problem case is functions that want to update the length and return
> an error.
> My best solution is to support return values of -errno << 20 | length
> (as well as -errno and length).
>
> There end up being some slight behaviour changes.
> - Some code tries to 'undo' actions if the length can't be updated.
>   I'm sure this is unnecessary and the recovery path is untested and
>   could be buggy. Provided the kernel data is consistent there is
>   no point trying to get code to recover from EFAULT.
>   The 'length' has been read - so would also need to be readonly
>   or unmapped by a second thread!
> - A lot of getsockopt functions actually treat a negative length as 4.
>   I think this 'bug' needs to be preserved to avoid breaking applications.
>
> The changes are mechanical but very widespread.
>
> They also give the option of not writing back the length if unchanged.

See my other mail regarding proto[_ops].getsockopt_iter(), where
implementations could be converted step by step. But we may still need
to keep the current proto[ops].getsockopt() as proto[ops].getsockopt_legacy()
in order to keep the insane uapi semantics alive.
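
For reference, one possible reading of the combined return value quoted
above ("-errno << 20 | length") is sketched below as plain userspace C.
The names OPT_LEN_BITS, OPT_MAX_ERRNO, struct opt_result, opt_pack() and
opt_unpack() are invented for this example, and treating values down to
-4095 as a plain -errno is an assumption of the sketch, not something
stated in the thread.

```c
#include <errno.h>

#define OPT_LEN_BITS	20
#define OPT_LEN_UNIT	(1 << OPT_LEN_BITS)	/* 2^20 */
#define OPT_MAX_ERRNO	4095			/* assumed bound for a plain -errno */

/* Decoded form of a getsockopt-style return value. */
struct opt_result {
	int err;	/* positive errno, 0 on success */
	int len;	/* length, valid if has_len is set */
	int has_len;
};

/*
 * Pack a positive errno and a length. Numerically this equals
 * (-err << 20) | len, but avoids shifting a negative value.
 * Assumes 0 <= len < OPT_LEN_UNIT - OPT_MAX_ERRNO, so that the result
 * cannot be confused with a plain -errno.
 */
static int opt_pack(int err, int len)
{
	return -(err * OPT_LEN_UNIT) + len;
}

/* Split a return value into its three possible shapes. */
static struct opt_result opt_unpack(int ret)
{
	struct opt_result r = { 0, 0, 0 };

	if (ret >= 0) {				/* plain length */
		r.len = ret;
		r.has_len = 1;
	} else if (ret >= -OPT_MAX_ERRNO) {	/* plain -errno */
		r.err = -ret;
	} else {				/* errno and length combined */
		r.len = (int)((unsigned int)ret & (OPT_LEN_UNIT - 1u));
		r.err = -(ret - r.len) / OPT_LEN_UNIT;
		r.has_len = 1;
	}
	return r;
}
```

With this layout a caller can tell "error, and here is the length to write
back" apart from a bare error, which is the problem case named in the quote.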
metze

From leitao at debian.org Tue Apr 1 12:17:09 2025
From: leitao at debian.org (Breno Leitao)
Date: Tue, 1 Apr 2025 05:17:09 -0700
Subject: [rds-devel] [RFC PATCH 1/4] net: introduce get_optlen() and put_optlen() helpers
In-Reply-To: <156e83128747b2cf7c755bffa68f2519bd255f78.1743449872.git.metze@samba.org>
References: <156e83128747b2cf7c755bffa68f2519bd255f78.1743449872.git.metze@samba.org>
Message-ID:

Hello Stefan,

On Mon, Mar 31, 2025 at 10:10:53PM +0200, Stefan Metzmacher wrote:
> --- a/include/linux/sockptr.h
> +++ b/include/linux/sockptr.h
> @@ -169,4 +169,26 @@ static inline int check_zeroed_sockptr(sockptr_t src, size_t offset,
>  	return memchr_inv(src.kernel + offset, 0, size) == NULL;
>  }
>
> +#define __check_optlen_t(__optlen) \
> +({ \
> +	int __user *__ptr __maybe_unused = __optlen; \
> +	BUILD_BUG_ON(sizeof(*(__ptr)) != sizeof(int)); \
> +})

I am a bit confused about this macro. I understand that its goal is to
check that __optlen is a pointer to an integer, and to fail the build
otherwise.

It is unclear to me whether that is what it does. Let's suppose that
__optlen is not an integer pointer. Then:

> int __user *__ptr __maybe_unused = __optlen;

This will generate a compile failure or warning due to the invalid
pointer conversion, depending on -Wincompatible-pointer-types.

> BUILD_BUG_ON(sizeof(*(__ptr)) != sizeof(int));

Then this comparison will always be false, since __ptr is a pointer to int
and you are comparing the size of what it points to with sizeof(int).
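
The observation above can be reproduced outside the kernel. The following is
a minimal sketch assuming a C11 compiler: BUILD_BUG_ON() is replaced by a
_Static_assert-based stand-in, the __user annotation is dropped, and
check_optlen_ptr() and read_optlen() are names made up for this example.

```c
/* Stand-in for the kernel's BUILD_BUG_ON(): break the build if cond is true. */
#define BUILD_BUG_ON(cond) _Static_assert(!(cond), "BUILD_BUG_ON failed")

/*
 * Same shape as __check_optlen_t() in the patch, without __user.
 * The initialisation is the only real check: a non-int pointer triggers an
 * incompatible-pointer-types diagnostic there. ptr_ is declared as int *,
 * so sizeof(*ptr_) is sizeof(int) by construction and the assertion can
 * never fire.
 */
#define check_optlen_ptr(optlen)				\
do {								\
	int *ptr_ = (optlen);					\
	(void)ptr_;						\
	BUILD_BUG_ON(sizeof(*(ptr_)) != sizeof(int));		\
} while (0)

/* Example caller: compiles cleanly because the argument really is int *. */
static int read_optlen(int *optlen)
{
	check_optlen_ptr(optlen);
	return *optlen;
}
```

Passing, say, a long * to check_optlen_ptr() would be reported at the
initialisation of ptr_ (as a warning or an error, depending on compiler and
flags), never by the size assertion.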
From metze at samba.org Tue Apr 1 12:22:50 2025
From: metze at samba.org (Stefan Metzmacher)
Date: Tue, 1 Apr 2025 14:22:50 +0200
Subject: [rds-devel] [RFC PATCH 1/4] net: introduce get_optlen() and put_optlen() helpers
In-Reply-To:
References: <156e83128747b2cf7c755bffa68f2519bd255f78.1743449872.git.metze@samba.org>
Message-ID: <90334e83-618b-41e0-a35c-9ce8b0d1d990@samba.org>

Hello Breno,

> On Mon, Mar 31, 2025 at 10:10:53PM +0200, Stefan Metzmacher wrote:
>> --- a/include/linux/sockptr.h
>> +++ b/include/linux/sockptr.h
>> @@ -169,4 +169,26 @@ static inline int check_zeroed_sockptr(sockptr_t src, size_t offset,
>>  	return memchr_inv(src.kernel + offset, 0, size) == NULL;
>>  }
>>
>> +#define __check_optlen_t(__optlen) \
>> +({ \
>> +	int __user *__ptr __maybe_unused = __optlen; \
>> +	BUILD_BUG_ON(sizeof(*(__ptr)) != sizeof(int)); \
>> +})
>
> I am a bit confused about this macro. I understand that this macro's
> goal is to check that __optlen is a pointer to an integer, otherwise
> failed to build.
>
> It is unclear to me if that is what it does. Let's suppose that __optlen
> is not an integer pointer. Then:
>
>> int __user *__ptr __maybe_unused = __optlen;
>
> This will generate a compile failure/warning due invalid casting,
> depending on -Wincompatible-pointer-types.
>
>> BUILD_BUG_ON(sizeof(*(__ptr)) != sizeof(int));
>
> Then this comparison will always false, since __ptr is a pointer to int,
> and you are comparing the size of its content with the sizeof(int).

Yes, it is redundant in the first patch; it becomes a little more useful
in the 2nd and 3rd patches.

metze From metze at samba.org Tue Apr 1 13:37:28 2025 From: metze at samba.org (Stefan Metzmacher) Date: Tue, 1 Apr 2025 15:37:28 +0200 Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer via optlen_t to proto[_ops].getsockopt() In-Reply-To: <39515c76-310d-41af-a8b4-a814841449e3@samba.org> References: <39515c76-310d-41af-a8b4-a814841449e3@samba.org> Message-ID: <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org> Am 01.04.25 um 10:19 schrieb Stefan Metzmacher: > Am 31.03.25 um 23:04 schrieb Stanislav Fomichev: >> On 03/31, Stefan Metzmacher wrote: >>> The motivation for this is to remove the SOL_SOCKET limitation >>> from io_uring_cmd_getsockopt(). >>> >>> The reason for this limitation is that io_uring_cmd_getsockopt() >>> passes a kernel pointer as optlen to do_sock_getsockopt() >>> and can't reach the ops->getsockopt() path. >>> >>> The first idea would be to change the optval and optlen arguments >>> to the protocol specific hooks also to sockptr_t, as that >>> is already used for setsockopt() and also by do_sock_getsockopt() >>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT(). >>> >>> But as Linus don't like 'sockptr_t' I used a different approach. >>> >>> @Linus, would that optlen_t approach fit better for you? >> >> [..] >> >>> Instead of passing the optlen as user or kernel pointer, >>> we only ever pass a kernel pointer and do the >>> translation from/to userspace in do_sock_getsockopt(). >> >> At this point why not just fully embrace iov_iter? You have the size >> now + the user (or kernel) pointer. Might as well do >> s/sockptr_t/iov_iter/ conversion? > > I think that would only be possible if we introduce > proto[_ops].getsockopt_iter() and then convert the implementations > step by step. Doing it all in one go has a lot of potential to break > the uapi. I could try to convert things like socket, ip and tcp myself, but > the rest needs to be converted by the maintainer of the specific protocol, > as it needs to be tested. 
As there are crazy things happening in the existing
> implementations, e.g. some getsockopt() implementations use optval as in and out
> buffer.
>
> I first tried to convert both optval and optlen of getsockopt to sockptr_t,
> and that showed that touching the optval part starts to get complex very soon,
> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
> (note it didn't converted everything, I gave up after hitting
> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
> more are the ones also doing both copy_from_user and copy_to_user on optval)
>
> I come also across one implementation that returned -ERANGE because *optlen was
> too short and put the required length into *optlen, which means the returned
> *optlen is larger than the optval buffer given from userspace.
>
> Because of all these strange things I tried to do a minimal change
> in order to get rid of the io_uring limitation and only converted
> optlen and leave optval as is.
>
> In order to have a patchset that has a low risk to cause regressions.
>
> But as alternative introducing a prototype like this:
>
>         int (*getsockopt_iter)(struct socket *sock, int level, int optname,
>                                struct iov_iter *optval_iter);
>
> That returns a non-negative value which can be placed into *optlen
> or negative value as error and *optlen will not be changed on error.
> optval_iter will get direction ITER_DEST, so it can only be written to.
>
> Implementations could then opt in for the new interface and
> allow do_sock_getsockopt() work also for the io_uring case,
> while all others would still get -EOPNOTSUPP.
>
> So what should be the way to go?

Ok, I've added the infrastructure for getsockopt_iter, see below,
but the first part I wanted to convert was
tcp_ao_copy_mkts_to_user() and that also reads from userspace before
writing.

So we could go with the optlen_t approach, or we need
logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one
with ITER_DEST...

So who wants to decide?

Thanks!
metze

---
 include/linux/net.h |  4 +++
 include/net/sock.h  | 64 +++++++++++++++++++++++++++++++++++++++++++++
 net/core/sock.c     | 12 +++++++--
 net/socket.c        | 12 +++++++--
 4 files changed, 88 insertions(+), 4 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index 0ff950eecc6b..ceb9f9ed84b9 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -194,6 +194,10 @@ struct proto_ops {
 			unsigned int optlen);
 	int (*getsockopt)(struct socket *sock, int level,
 			int optname, char __user *optval, int __user *optlen);
+	int (*getsockopt_iter)(struct socket *sock,
+			int level,
+			int optname,
+			struct iov_iter *optval_iter);
 	void (*show_fdinfo)(struct seq_file *m, struct socket *sock);
 	int (*sendmsg) (struct socket *sock, struct msghdr *m,
 			size_t total_len);
diff --git a/include/net/sock.h b/include/net/sock.h
index 8daf1b3b12c6..e741b219056e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1249,6 +1249,11 @@ struct proto {
 	int (*getsockopt)(struct sock *sk, int level,
 			int optname, char __user *optval,
 			int __user *option);
+	int (*getsockopt_iter)(struct sock *sk,
+			int level,
+			int optname,
+			struct iov_iter *optval_iter);
+
 	void (*keepalive)(struct sock *sk, int valbool);
 #ifdef CONFIG_COMPAT
 	int (*compat_ioctl)(struct sock *sk,
@@ -1781,6 +1786,65 @@ int do_sock_setsockopt(struct socket *sock, bool compat, int level,
 int do_sock_getsockopt(struct socket *sock, bool compat, int level,
 		int optname, sockptr_t optval, sockptr_t optlen);
 
+#define __generic_wrap_getsockopt_iter(__s, __level, \
+		__optname, __optval, __optlen, \
+		__getsockopt_iter) \
+do { \
+	struct iov_iter optval_iter; \
+	struct kvec optval_kvec; \
+	int len; \
+	int err; \
+	\
+	if (unlikely(__getsockopt_iter == NULL)) \
+		return -EOPNOTSUPP; \
+	\
+	if (copy_from_sockptr(&len, __optlen, sizeof(len))) \
+		return -EFAULT; \
+	\
+	if (len < 0) \
+		return -EINVAL; \
+	\
+	if (__optval.is_kernel) { \
+		if (__optval.kernel == NULL && len != 0) \
+			return -EFAULT; \
+		\
+		optval_kvec = (struct kvec) { \
+			.iov_base = __optval.kernel, \
+			.iov_len = len, \
+		}; \
+		\
+		iov_iter_kvec(&optval_iter, ITER_DEST, \
+			&optval_kvec, 1, optval_kvec.iov_len); \
+	} else { \
+		if (import_ubuf(ITER_DEST, __optval.user, len, &optval_iter)) \
+			return -EFAULT; \
+	} \
+	\
+	err = getsockopt_iter(__s, __level, __optname, &optval_iter); \
+	if (unlikely(err < 0)) \
+		return err; \
+	\
+	len = err; \
+	if (copy_to_sockptr(__optlen, &len, sizeof(len))) \
+		return -EFAULT; \
+	\
+	return 0; \
+} while (0)
+
+static __always_inline
+int sk_wrap_getsockopt_iter(struct sock *sk, int level, int optname, sockptr_t optval, sockptr_t optlen,
+		int (*getsockopt_iter)(struct sock *sk, int level, int optname, struct iov_iter *optval_iter))
+{
+	__generic_wrap_getsockopt_iter(sk, level, optname, optval, optlen, getsockopt_iter);
+}
+
+static __always_inline
+int sock_wrap_getsockopt_iter(struct socket *sock, int level, int optname, sockptr_t optval, sockptr_t optlen,
+		int (*getsockopt_iter)(struct socket *sock, int level, int optname, struct iov_iter *optval_iter))
+{
+	__generic_wrap_getsockopt_iter(sock, level, optname, optval, optlen, getsockopt_iter);
+}
+
 int sk_getsockopt(struct sock *sk, int level, int optname,
 		sockptr_t optval, sockptr_t optlen);
 int sock_gettstamp(struct socket *sock, void __user *userstamp,
diff --git a/net/core/sock.c b/net/core/sock.c
index 323892066def..61625060e724 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3857,9 +3857,17 @@ int sock_common_getsockopt(struct socket *sock, int level, int optname,
 		char __user *optval, int __user *optlen)
 {
 	struct sock *sk = sock->sk;
-	/* IPV6_ADDRFORM can change sk->sk_prot under us. */
-	return READ_ONCE(sk->sk_prot)->getsockopt(sk, level, optname, optval, optlen);
+	struct proto *prot = READ_ONCE(sk->sk_prot);
+
+	if (prot->getsockopt_iter) {
+		return sk_wrap_getsockopt_iter(sk, level, optname,
+				USER_SOCKPTR(optval),
+				USER_SOCKPTR(optlen),
+				prot->getsockopt_iter);
+	}
+
+	return prot->getsockopt(sk, level, optname, optval, optlen);
 }
 EXPORT_SYMBOL(sock_common_getsockopt);
diff --git a/net/socket.c b/net/socket.c
index 9a0e720f0859..792cfd272611 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2335,6 +2335,7 @@ int do_sock_getsockopt(struct socket *sock, bool compat, int level,
 {
 	int max_optlen __maybe_unused = 0;
 	const struct proto_ops *ops;
+	const struct proto *prot;
 	int err;
 
 	err = security_socket_getsockopt(sock, level, optname);
@@ -2345,12 +2346,19 @@ int do_sock_getsockopt(struct socket *sock, bool compat, int level,
 		copy_from_sockptr(&max_optlen, optlen, sizeof(int));
 
 	ops = READ_ONCE(sock->ops);
+	prot = READ_ONCE(sock->sk->sk_prot);
 	if (level == SOL_SOCKET) {
 		err = sk_getsockopt(sock->sk, level, optname, optval, optlen);
-	} else if (unlikely(!ops->getsockopt)) {
+	} else if (ops->getsockopt_iter) {
+		err = sock_wrap_getsockopt_iter(sock, level, optname, optval, optlen,
+				ops->getsockopt_iter);
+	} else if (ops->getsockopt == sock_common_getsockopt && prot->getsockopt_iter) {
+		err = sk_wrap_getsockopt_iter(sock->sk, level, optname, optval, optlen,
+				prot->getsockopt_iter);
+	} else if (unlikely(!ops->getsockopt || optlen.is_kernel)) {
 		err = -EOPNOTSUPP;
 	} else {
-		if (WARN_ONCE(optval.is_kernel || optlen.is_kernel,
+		if (WARN_ONCE(optval.is_kernel,
 			      "Invalid argument type"))
 			return -EOPNOTSUPP;
 
-- 
2.34.1

From metze at samba.org Tue Apr 1 13:48:58 2025
From: metze at samba.org (Stefan Metzmacher)
Date: Tue, 1 Apr 2025 15:48:58 +0200
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer via optlen_t to proto[_ops].getsockopt()
In-Reply-To: <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org>
References:
<39515c76-310d-41af-a8b4-a814841449e3@samba.org> <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org> Message-ID: Am 01.04.25 um 15:37 schrieb Stefan Metzmacher: > Am 01.04.25 um 10:19 schrieb Stefan Metzmacher: >> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev: >>> On 03/31, Stefan Metzmacher wrote: >>>> The motivation for this is to remove the SOL_SOCKET limitation >>>> from io_uring_cmd_getsockopt(). >>>> >>>> The reason for this limitation is that io_uring_cmd_getsockopt() >>>> passes a kernel pointer as optlen to do_sock_getsockopt() >>>> and can't reach the ops->getsockopt() path. >>>> >>>> The first idea would be to change the optval and optlen arguments >>>> to the protocol specific hooks also to sockptr_t, as that >>>> is already used for setsockopt() and also by do_sock_getsockopt() >>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT(). >>>> >>>> But as Linus don't like 'sockptr_t' I used a different approach. >>>> >>>> @Linus, would that optlen_t approach fit better for you? >>> >>> [..] >>> >>>> Instead of passing the optlen as user or kernel pointer, >>>> we only ever pass a kernel pointer and do the >>>> translation from/to userspace in do_sock_getsockopt(). >>> >>> At this point why not just fully embrace iov_iter? You have the size >>> now + the user (or kernel) pointer. Might as well do >>> s/sockptr_t/iov_iter/ conversion? >> >> I think that would only be possible if we introduce >> proto[_ops].getsockopt_iter() and then convert the implementations >> step by step. Doing it all in one go has a lot of potential to break >> the uapi. I could try to convert things like socket, ip and tcp myself, but >> the rest needs to be converted by the maintainer of the specific protocol, >> as it needs to be tested. As there are crazy things happening in the existing >> implementations, e.g. some getsockopt() implementations use optval as in and out >> buffer. 
>>
>> I first tried to convert both optval and optlen of getsockopt to sockptr_t,
>> and that showed that touching the optval part starts to get complex very soon,
>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
>> (note it didn't converted everything, I gave up after hitting
>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
>> more are the ones also doing both copy_from_user and copy_to_user on optval)
>>
>> I come also across one implementation that returned -ERANGE because *optlen was
>> too short and put the required length into *optlen, which means the returned
>> *optlen is larger than the optval buffer given from userspace.
>>
>> Because of all these strange things I tried to do a minimal change
>> in order to get rid of the io_uring limitation and only converted
>> optlen and leave optval as is.
>>
>> In order to have a patchset that has a low risk to cause regressions.
>>
>> But as alternative introducing a prototype like this:
>>
>>         int (*getsockopt_iter)(struct socket *sock, int level, int optname,
>>                                struct iov_iter *optval_iter);
>>
>> That returns a non-negative value which can be placed into *optlen
>> or negative value as error and *optlen will not be changed on error.
>> optval_iter will get direction ITER_DEST, so it can only be written to.
>>
>> Implementations could then opt in for the new interface and
>> allow do_sock_getsockopt() work also for the io_uring case,
>> while all others would still get -EOPNOTSUPP.
>>
>> So what should be the way to go?
>
> Ok, I've added the infrastructure for getsockopt_iter, see below,
> but the first part I wanted to convert was
> tcp_ao_copy_mkts_to_user() and that also reads from userspace before
> writing.
>
> So we could go with the optlen_t approach, or we need
> logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one
> with ITER_DEST...
>
> So who wants to decide?

I just noticed that it's even possible in some cases
to pass in a short buffer to optval, but have a longer value in optlen;
hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.

This makes it really hard to believe that trying to use iov_iter for this
is a good idea :-(

Any ideas besides just going with optlen_t?

metze

From leitao at debian.org Tue Apr 1 15:35:50 2025
From: leitao at debian.org (Breno Leitao)
Date: Tue, 1 Apr 2025 08:35:50 -0700
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer via optlen_t to proto[_ops].getsockopt()
In-Reply-To:
References: <39515c76-310d-41af-a8b4-a814841449e3@samba.org> <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org>
Message-ID:

On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote:
> Am 01.04.25 um 15:37 schrieb Stefan Metzmacher:
> > Am 01.04.25 um 10:19 schrieb Stefan Metzmacher:
> > > Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:
> > > > On 03/31, Stefan Metzmacher wrote:
> > > > > The motivation for this is to remove the SOL_SOCKET limitation
> > > > > from io_uring_cmd_getsockopt().
> > > > >
> > > > > The reason for this limitation is that io_uring_cmd_getsockopt()
> > > > > passes a kernel pointer as optlen to do_sock_getsockopt()
> > > > > and can't reach the ops->getsockopt() path.
> > > > >
> > > > > The first idea would be to change the optval and optlen arguments
> > > > > to the protocol specific hooks also to sockptr_t, as that
> > > > > is already used for setsockopt() and also by do_sock_getsockopt()
> > > > > sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
> > > > >
> > > > > But as Linus don't like 'sockptr_t' I used a different approach.
> > > > >
> > > > > @Linus, would that optlen_t approach fit better for you?
> > > >
> > > > [..]
> > > > > > > > > Instead of passing the optlen as user or kernel pointer, > > > > > we only ever pass a kernel pointer and do the > > > > > translation from/to userspace in do_sock_getsockopt(). > > > > > > > > At this point why not just fully embrace iov_iter? You have the size > > > > now + the user (or kernel) pointer. Might as well do > > > > s/sockptr_t/iov_iter/ conversion? > > > > > > I think that would only be possible if we introduce > > > proto[_ops].getsockopt_iter() and then convert the implementations > > > step by step. Doing it all in one go has a lot of potential to break > > > the uapi. I could try to convert things like socket, ip and tcp myself, but > > > the rest needs to be converted by the maintainer of the specific protocol, > > > as it needs to be tested. As there are crazy things happening in the existing > > > implementations, e.g. some getsockopt() implementations use optval as in and out > > > buffer. > > > > > > I first tried to convert both optval and optlen of getsockopt to sockptr_t, > > > and that showed that touching the optval part starts to get complex very soon, > > > see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1 > > > (note it didn't converted everything, I gave up after hitting > > > sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs. > > > sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe > > > more are the ones also doing both copy_from_user and copy_to_user on optval) > > > > > > I come also across one implementation that returned -ERANGE because *optlen was > > > too short and put the required length into *optlen, which means the returned > > > *optlen is larger than the optval buffer given from userspace. > > > > > > Because of all these strange things I tried to do a minimal change > > > in order to get rid of the io_uring limitation and only converted > > > optlen and leave optval as is. 
> > >
> > > In order to have a patchset that has a low risk to cause regressions.
> > >
> > > But as alternative introducing a prototype like this:
> > >
> > >         int (*getsockopt_iter)(struct socket *sock, int level, int optname,
> > >                                struct iov_iter *optval_iter);
> > >
> > > That returns a non-negative value which can be placed into *optlen
> > > or negative value as error and *optlen will not be changed on error.
> > > optval_iter will get direction ITER_DEST, so it can only be written to.
> > >
> > > Implementations could then opt in for the new interface and
> > > allow do_sock_getsockopt() work also for the io_uring case,
> > > while all others would still get -EOPNOTSUPP.
> > >
> > > So what should be the way to go?
> >
> > Ok, I've added the infrastructure for getsockopt_iter, see below,
> > but the first part I wanted to convert was
> > tcp_ao_copy_mkts_to_user() and that also reads from userspace before
> > writing.
> >
> > So we could go with the optlen_t approach, or we need
> > logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one
> > with ITER_DEST...
> >
> > So who wants to decide?
>
> I just noticed that it's even possible in same cases
> to pass in a short buffer to optval, but have a longer value in optlen,
> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.
>
> This makes it really hard to believe that trying to use iov_iter for this
> is a good idea :-(

That was my finding as well a while ago, when I was planning to get the
__user pointers converted to iov_iter. There are some weird ways of
using optlen and optval, which make them non-trivial to convert to
iov_iter.

From stfomichev at gmail.com Tue Apr 1 15:45:39 2025
From: stfomichev at gmail.com (Stanislav Fomichev)
Date: Tue, 1 Apr 2025 08:45:39 -0700
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer via optlen_t to proto[_ops].getsockopt()
In-Reply-To:
References: <39515c76-310d-41af-a8b4-a814841449e3@samba.org> <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org>
Message-ID:

On 04/01, Breno Leitao wrote:
> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote:
> > On 01.04.25 at 15:37, Stefan Metzmacher wrote:
> > > On 01.04.25 at 10:19, Stefan Metzmacher wrote:
> > > > [..]
> > > >
> > > > So what should be the way to go?
> > >
> > > Ok, I've added the infrastructure for getsockopt_iter, see below,
> > > but the first part I wanted to convert was
> > > tcp_ao_copy_mkts_to_user() and that also reads from userspace before
> > > writing.
> > >
> > > So we could go with the optlen_t approach, or we need
> > > logic for ITER_BOTH or pass two iov_iters one with ITER_SOURCE and one
> > > with ITER_DEST...
> > >
> > > So who wants to decide?
> >
> > I just noticed that it's even possible in some cases
> > to pass in a short buffer to optval, but have a longer value in optlen,
> > hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.
> >
> > This makes it really hard to believe that trying to use iov_iter for this
> > is a good idea :-(
>
> That was my finding as well a while ago, when I was planning to get the
> __user pointers converted to iov_iter. There are some weird ways of
> using optlen and optval, which makes them non-trivial to convert to
> iov_iter.

Can we ignore all non-ip/tcp/udp cases for now? This should cover +90%
of useful socket opts. See if there are any obvious problems with them
and if not, try converting. The rest we can cover separately when/if
needed.
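An aside for readers of the archive: the obstacle quoted above, a getsockopt handler that reads from optval before writing to it, can be modelled in plain userspace C. This is only a sketch under stated assumptions. `struct dest_cursor`, `cursor_write()`, `cursor_read()`, `struct key_request` and `get_key_handler()` are hypothetical names invented for illustration, not kernel APIs, and the cursor merely imitates the write-only property of an ITER_DEST iov_iter.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a write-only destination (ITER_DEST-like). */
struct dest_cursor {
	unsigned char *buf;	/* start of the option buffer */
	size_t len;		/* capacity of the buffer */
	size_t off;		/* bytes written so far */
};

/* Writing into the destination is the one supported operation. */
int cursor_write(struct dest_cursor *c, const void *src, size_t n)
{
	assert(c->off <= c->len);
	if (n > c->len - c->off)
		return -EFAULT;
	memcpy(c->buf + c->off, src, n);
	c->off += n;
	return 0;
}

/* Reading back is refused in this model, mirroring a write-only iterator. */
int cursor_read(const struct dest_cursor *c, void *dst, size_t n)
{
	(void)c;
	(void)dst;
	(void)n;
	return -EOPNOTSUPP;
}

/* Hypothetical in/out option: the caller stores a selector in optval first. */
struct key_request {
	unsigned int index;
};

/* The handler has to read the selector before it can produce a reply. */
int get_key_handler(struct dest_cursor *c)
{
	struct key_request req;
	unsigned int reply;
	int ret;

	ret = cursor_read(c, &req, sizeof(req));
	if (ret < 0)
		return ret;	/* always taken in this model */

	reply = req.index * 10;
	return cursor_write(c, &reply, sizeof(reply));
}
```

In this model a handler that only writes works unchanged, while any handler that needs the caller's input fails at its first read. That is the shape of the problem reported for the TCP-AO options.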
From metze at samba.org Tue Apr 1 21:20:45 2025
From: metze at samba.org (Stefan Metzmacher)
Date: Tue, 1 Apr 2025 23:20:45 +0200
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer via optlen_t to proto[_ops].getsockopt()
In-Reply-To:
References: <39515c76-310d-41af-a8b4-a814841449e3@samba.org> <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org>
Message-ID: <0f0f9cfd-77be-4988-8043-0d1bd0d157e7@samba.org>

On 01.04.25 at 17:45, Stanislav Fomichev wrote:
> [..]
>
> Can we ignore all non-ip/tcp/udp cases for now? This should cover +90%
> of useful socket opts. See if there are any obvious problems with them
> and if not, try converting. The rest we can cover separately when/if
> needed.

That's what I tried, but it fails with

tcp_getsockopt ->
   do_tcp_getsockopt ->
     tcp_ao_get_mkts ->
        tcp_ao_copy_mkts_to_user ->
           copy_struct_from_sockptr
     tcp_ao_get_sock_info ->
        copy_struct_from_sockptr

That's not possible with an ITER_DEST iov_iter.
metze

From stfomichev at gmail.com Tue Apr 1 22:04:29 2025
From: stfomichev at gmail.com (Stanislav Fomichev)
Date: Tue, 1 Apr 2025 15:04:29 -0700
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer via optlen_t to proto[_ops].getsockopt()
In-Reply-To: <0f0f9cfd-77be-4988-8043-0d1bd0d157e7@samba.org>
References: <39515c76-310d-41af-a8b4-a814841449e3@samba.org> <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org> <0f0f9cfd-77be-4988-8043-0d1bd0d157e7@samba.org>
Message-ID:

On 04/01, Stefan Metzmacher wrote:
> On 01.04.25 at 17:45, Stanislav Fomichev wrote:
> > [..]
> >
> > Can we ignore all non-ip/tcp/udp cases for now? This should cover +90%
> > of useful socket opts. See if there are any obvious problems with them
> > and if not, try converting. The rest we can cover separately when/if
> > needed.
>
> That's what I tried, but it fails with
>
> tcp_getsockopt ->
>    do_tcp_getsockopt ->
>      tcp_ao_get_mkts ->
>         tcp_ao_copy_mkts_to_user ->
>            copy_struct_from_sockptr
>      tcp_ao_get_sock_info ->
>         copy_struct_from_sockptr
>
> That's not possible with an ITER_DEST iov_iter.
>
> metze

Can we create two iterators over the same memory? One for ITER_SOURCE and
another for ITER_DEST. And then make getsockopt_iter accept optval_in and
optval_out. We can also use optval_out position (iov_offset) as optlen output
value. Don't see why it won't work, but I agree that's gonna be a messy
conversion so let's see if someone else has better suggestions.
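For readers of the archive, the two-iterator idea above can be sketched in userspace C. This is a minimal model under stated assumptions, not the kernel implementation. `struct cursor`, `cursor_copy_from()`, `cursor_copy_to()`, `demo_getsockopt_iter()`, `demo_do_getsockopt()` and the request/reply structs are hypothetical names. The `off` field plays the role that iov_offset would play for the output iterator.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cursor over a buffer; 'off' models the iterator position. */
struct cursor {
	unsigned char *base;
	size_t len;
	size_t off;
};

/* Source-side copy: consume n bytes from the input cursor. */
static int cursor_copy_from(struct cursor *in, void *dst, size_t n)
{
	if (n > in->len - in->off)
		return -EFAULT;
	memcpy(dst, in->base + in->off, n);
	in->off += n;
	return 0;
}

/* Destination-side copy: append n bytes at the output cursor. */
static int cursor_copy_to(struct cursor *out, const void *src, size_t n)
{
	if (n > out->len - out->off)
		return -EFAULT;
	memcpy(out->base + out->off, src, n);
	out->off += n;
	return 0;
}

/* Hypothetical in/out option payloads. */
struct sel_request { unsigned int index; };
struct sel_reply { unsigned int index; unsigned int value; };

/* Handler in the proposed style: read via optval_in, write via optval_out. */
int demo_getsockopt_iter(struct cursor *optval_in, struct cursor *optval_out)
{
	struct sel_request req;
	struct sel_reply rep;
	int ret;

	ret = cursor_copy_from(optval_in, &req, sizeof(req));
	if (ret < 0)
		return ret;

	rep.index = req.index;
	rep.value = req.index * 10;
	return cursor_copy_to(optval_out, &rep, sizeof(rep));
}

/* Caller: two cursors over the same memory; optlen comes from the output. */
int demo_do_getsockopt(void *optval, unsigned int *optlen)
{
	struct cursor in = { optval, *optlen, 0 };
	struct cursor out = { optval, *optlen, 0 };
	int ret;

	ret = demo_getsockopt_iter(&in, &out);
	if (ret < 0)
		return ret;	/* *optlen is left untouched on error */

	assert(out.off <= out.len);
	*optlen = (unsigned int)out.off;
	return 0;
}
```

One property of this model is that the handler has to finish reading its input before it starts writing, because both cursors alias the same bytes. The reported length can also never exceed the supplied buffer, so the existing cases that return a larger required length would still need separate handling.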
From metze at samba.org Tue Apr 1 22:53:58 2025
From: metze at samba.org (Stefan Metzmacher)
Date: Wed, 2 Apr 2025 00:53:58 +0200
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer via optlen_t to proto[_ops].getsockopt()
In-Reply-To:
References: <39515c76-310d-41af-a8b4-a814841449e3@samba.org> <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org> <0f0f9cfd-77be-4988-8043-0d1bd0d157e7@samba.org>
Message-ID: <4b7ac4e9-6856-4e54-a2ba-15465e9622ac@samba.org>

On 02.04.25 at 00:04, Stanislav Fomichev wrote:
> On 04/01, Stefan Metzmacher wrote:
>> [..]
>>
>> That's not possible with an ITER_DEST iov_iter.
>>
>> metze
>
> Can we create two iterators over the same memory? One for ITER_SOURCE and
> another for ITER_DEST. And then make getsockopt_iter accept optval_in and
> optval_out. We can also use optval_out position (iov_offset) as optlen output
> value. Don't see why it won't work, but I agree that's gonna be a messy
> conversion so let's see if someone else has better suggestions.

Yes, that might work, but it would be good to get some feedback
if this would be the way to go:

    int (*getsockopt_iter)(struct socket *sock,
                           int level, int optname,
                           struct iov_iter *optval_in,
                           struct iov_iter *optval_out);

And *optlen = optval_out->iov_offset;

Any objection or better ideas? Linus would that be what you had in mind?

Thanks!
metze

From stfomichev at gmail.com Wed Apr 2 14:19:46 2025
From: stfomichev at gmail.com (Stanislav Fomichev)
Date: Wed, 2 Apr 2025 07:19:46 -0700
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer via optlen_t to proto[_ops].getsockopt()
In-Reply-To: <20250402132906.0ceb8985@pumpkin>
References: <39515c76-310d-41af-a8b4-a814841449e3@samba.org> <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org> <0f0f9cfd-77be-4988-8043-0d1bd0d157e7@samba.org> <4b7ac4e9-6856-4e54-a2ba-15465e9622ac@samba.org> <20250402132906.0ceb8985@pumpkin>
Message-ID:

On 04/02, David Laight wrote:
> On Wed, 2 Apr 2025 00:53:58 +0200
> Stefan Metzmacher wrote:
>
> > On 02.04.25 at 00:04, Stanislav Fomichev wrote:
> > > [..]
> > >
> > > Can we create two iterators over the same memory? One for ITER_SOURCE and
> > > another for ITER_DEST. And then make getsockopt_iter accept optval_in and
> > > optval_out. We can also use optval_out position (iov_offset) as optlen output
> > > value. Don't see why it won't work, but I agree that's gonna be a messy
> > > conversion so let's see if someone else has better suggestions.
> >
> > Yes, that might work, but it would be good to get some feedback
> > if this would be the way to go:
> >
> >     int (*getsockopt_iter)(struct socket *sock,
> >                            int level, int optname,
> >                            struct iov_iter *optval_in,
> >                            struct iov_iter *optval_out);
> >
> > And *optlen = optval_out->iov_offset;
> >
> > Any objection or better ideas? Linus would that be what you had in mind?
>
> I'd worry about performance - yes I know 'iter' are used elsewhere but...
> Also look at the SCTP code.

Performance usually does not matter for set/getsockopts, there are
a few exceptions that I know (TCP_ZEROCOPY_RECEIVE) and maybe recent
devmem sockopts; we can special-case these if needed, or keep sockptr_t,
idk. I'm skeptical we can convert everything though, that's why the
suggestion to start with sk/ip/tcp/udp.

> How do you handle code that wants to return an updated length (often longer
> than the one provided) and an error code (eg ERRSIZE or similar).
>
> There is also a very strange use (I think it is a sockopt rather than an ioctl)
> where the buffer length the application provides is only that of the header.
> The actual buffer length is contained in the header.
> The return length is the amount written into the full buffer.

Let's discuss these special cases as they come up? Worst case these
places can always re-init iov_iter with a comment on why it is ok.
But I do agree in general that there are a few places that do wild
stuff.

From torvalds at linux-foundation.org Wed Apr 2 00:40:19 2025
From: torvalds at linux-foundation.org (Linus Torvalds)
Date: Tue, 1 Apr 2025 17:40:19 -0700
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer via optlen_t to proto[_ops].getsockopt()
In-Reply-To:
References:
Message-ID:

On Mon, 31 Mar 2025 at 13:11, Stefan Metzmacher wrote:
>
> But as Linus don't like 'sockptr_t' I used a different approach.

So the sockptr_t thing has already happened. I hate it, and I think
it's ugly as hell, but it is what it is.

I think it's a complete hack and having that "kernel or user" pointer
flag is disgusting.

Making things worse, the naming is disgusting too, talking about some
random "socket pointer", when it has absolutely nothing to do with
socket, and isn't even a pointer. It's something else.

It's literally called "socket" not because it has anything to do with
sockets, but because it's a socket-specific hack that isn't acceptable
anywhere else in the kernel.
So that "socket" part of the name is literally shorthand for "only
sockets are disgusting enough to use this, and nobody else should ever
touch this crap".

At least so far that part has mostly worked, even if there's some
"sockptr_t" use in the crypto code. I didn't look closer, because I
didn't want to lose my lunch.

I don't understand why the networking code uses that thing.

If you have a "fat pointer", you should damn well make it have the
size of the area too, and do things *right*. Instead of doing what
sockptr_t does, which is a complete hack to just pass a kernel/user
flag, and then passes the length *separately* because the socket code
couldn't be arsed to do the right thing.

So I do still think "sockptr_t" should die.

As Stanislav says, if you actually want that "user or kernel" thing,
just use an "iov_iter".

No, an "iov_iter" isn't exactly a pretty thing either, but at least
it's the standard way to say "this pointer can have multiple different
kinds of sources". And it keeps the size of the thing it points to
around, so it's at least a fat pointer with proper ranges, even if it
isn't exactly "type safe" (yes, it's type safe in the sense that it
stays as a "iov_iter", but it's still basically a "random pointer").

> @Linus, would that optlen_t approach fit better for you?

The optlen_t thing is slightly better mainly because it's more
type-safe. At least it's not a "random misnamed
user-or-kernel-pointer" thing where the name is about how nothing else
is so broken as to use it.

So it's better because it's more limited, and it's better in that at
least it has a type-safe pointer rather than a "void *" with no size
or type associated with it.

That said, I don't think it's exactly great. It's just another case of
"networking can't just do it right, and uses a random hack with
special flag values".

So I do think that it would be better to actually get rid of "sockptr_t
optval, unsigned int optlen" ENTIRELY, and replace that with iov_iter
and just make networking bite the bullet and do the RightThing(tm).

In fact, to make it *really* typesafe, it might be a good idea to wrap
the iov_iter in another struct, something like

    typedef struct sockopt {
        struct iov_iter iter;
    } sockopt_t;

and make the networking functions make the typing very clear, and end
up with an interface something like

    int do_tcp_setsockopt(struct sock *sk,
                          int level, int optname,
                          sockopt_t *val);

where that "sockopt_t *val" replaces not just the "sockptr_t optval",
but also the "unsigned int optlen" thing.

And no, I didn't look at how much churn that would be. Probably a lot.
Maybe more than people are willing to do - even if I think some of it
could be automated with coccinelle or whatever.

Linus

From david.laight.linux at gmail.com Wed Apr 2 12:29:06 2025
From: david.laight.linux at gmail.com (David Laight)
Date: Wed, 2 Apr 2025 13:29:06 +0100
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer via optlen_t to proto[_ops].getsockopt()
In-Reply-To: <4b7ac4e9-6856-4e54-a2ba-15465e9622ac@samba.org>
References: <39515c76-310d-41af-a8b4-a814841449e3@samba.org> <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org> <0f0f9cfd-77be-4988-8043-0d1bd0d157e7@samba.org> <4b7ac4e9-6856-4e54-a2ba-15465e9622ac@samba.org>
Message-ID: <20250402132906.0ceb8985@pumpkin>

On Wed, 2 Apr 2025 00:53:58 +0200
Stefan Metzmacher wrote:

> On 02.04.25 at 00:04, Stanislav Fomichev wrote:
> > On 04/01, Stefan Metzmacher wrote:
> >> On 01.04.25 at 17:45, Stanislav Fomichev wrote:
> >>> On 04/01, Breno Leitao wrote:
> >>>> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote:
> >>>>> On 01.04.25 at 15:37, Stefan Metzmacher wrote:
> >>>>>> On 01.04.25 at 10:19, Stefan Metzmacher wrote:
> >>>>>>> On 31.03.25 at 23:04, Stanislav Fomichev wrote:
> >>>>>>>> On 03/31, Stefan Metzmacher wrote:
> >>>>>>>>> The motivation for this is to remove the SOL_SOCKET limitation
> >>>>>>>>> from io_uring_cmd_getsockopt().
> >>>>>>>>>
> >>>>>>>>> The reason for this limitation is that io_uring_cmd_getsockopt()
> >>>>>>>>> passes a kernel pointer as optlen to do_sock_getsockopt()
> >>>>>>>>> and can't reach the ops->getsockopt() path.
> >>>>>>>>>
> >>>>>>>>> The first idea would be to change the optval and optlen arguments
> >>>>>>>>> to the protocol specific hooks also to sockptr_t, as that
> >>>>>>>>> is already used for setsockopt() and also by do_sock_getsockopt()
> >>>>>>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
> >>>>>>>>>
> >>>>>>>>> But as Linus don't like 'sockptr_t' I used a different approach.
> >>>>>>>>>
> >>>>>>>>> @Linus, would that optlen_t approach fit better for you?
> >>>>>>>>
> >>>>>>>> [..]
> >>>>>>>>
> >>>>>>>>> Instead of passing the optlen as user or kernel pointer,
> >>>>>>>>> we only ever pass a kernel pointer and do the
> >>>>>>>>> translation from/to userspace in do_sock_getsockopt().
> >>>>>>>>
> >>>>>>>> At this point why not just fully embrace iov_iter? You have the size
> >>>>>>>> now + the user (or kernel) pointer. Might as well do
> >>>>>>>> s/sockptr_t/iov_iter/ conversion?
> >>>>>>>
> >>>>>>> I think that would only be possible if we introduce
> >>>>>>> proto[_ops].getsockopt_iter() and then convert the implementations
> >>>>>>> step by step. Doing it all in one go has a lot of potential to break
> >>>>>>> the uapi. I could try to convert things like socket, ip and tcp myself, but
> >>>>>>> the rest needs to be converted by the maintainer of the specific protocol,
> >>>>>>> as it needs to be tested. As there are crazy things happening in the existing
> >>>>>>> implementations, e.g. some getsockopt() implementations use optval as in and out
> >>>>>>> buffer.
> >>>>>>>
> >>>>>>> I first tried to convert both optval and optlen of getsockopt to sockptr_t,
> >>>>>>> and that showed that touching the optval part starts to get complex very soon,
> >>>>>>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
> >>>>>>> (note it didn't converted everything, I gave up after hitting
> >>>>>>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
> >>>>>>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
> >>>>>>> more are the ones also doing both copy_from_user and copy_to_user on optval)
> >>>>>>>
> >>>>>>> I come also across one implementation that returned -ERANGE because *optlen was
> >>>>>>> too short and put the required length into *optlen, which means the returned
> >>>>>>> *optlen is larger than the optval buffer given from userspace.
> >>>>>>>
> >>>>>>> Because of all these strange things I tried to do a minimal change
> >>>>>>> in order to get rid of the io_uring limitation and only converted
> >>>>>>> optlen and leave optval as is.
> >>>>>>>
> >>>>>>> In order to have a patchset that has a low risk to cause regressions.
> >>>>>>>
> >>>>>>> But as alternative introducing a prototype like this:
> >>>>>>>
> >>>>>>>         int (*getsockopt_iter)(struct socket *sock, int level, int optname,
> >>>>>>>                                struct iov_iter *optval_iter);
> >>>>>>>
> >>>>>>> That returns a non-negative value which can be placed into *optlen
> >>>>>>> or negative value as error and *optlen will not be changed on error.
> >>>>>>> optval_iter will get direction ITER_DEST, so it can only be written to.
> >>>>>>>
> >>>>>>> Implementations could then opt in for the new interface and
> >>>>>>> allow do_sock_getsockopt() work also for the io_uring case,
> >>>>>>> while all others would still get -EOPNOTSUPP.
> >>>>>>>
> >>>>>>> So what should be the way to go?
> >>>>>>
> >>>>>> Ok, I've added the infrastructure for getsockopt_iter, see below,
> >>>>>> but the first part I wanted to convert was
> >>>>>> tcp_ao_copy_mkts_to_user() and that also reads from userspace before
> >>>>>> writing.
> >>>>>>
> >>>>>> So we could go with the optlen_t approach, or we need
> >>>>>> logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one
> >>>>>> with ITER_DEST...
> >>>>>>
> >>>>>> So who wants to decide?
> >>>>>
> >>>>> I just noticed that it's even possible in same cases
> >>>>> to pass in a short buffer to optval, but have a longer value in optlen,
> >>>>> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.
> >>>>>
> >>>>> This makes it really hard to believe that trying to use iov_iter for this
> >>>>> is a good idea :-(
> >>>>
> >>>> That was my finding as well a while ago, when I was planning to get the
> >>>> __user pointers converted to iov_iter. There are some weird ways of
> >>>> using optlen and optval, which makes them non-trivial to covert to
> >>>> iov_iter.
> >>>
> >>> Can we ignore all non-ip/tcp/udp cases for now? This should cover +90%
> >>> of useful socket opts. See if there are any obvious problems with them
> >>> and if not, try converting. The rest we can cover separately when/if
> >>> needed.
> >>
> >> That's what I tried, but it fails with
> >> tcp_getsockopt ->
> >>    do_tcp_getsockopt ->
> >>      tcp_ao_get_mkts ->
> >>         tcp_ao_copy_mkts_to_user ->
> >>            copy_struct_from_sockptr
> >>      tcp_ao_get_sock_info ->
> >>         copy_struct_from_sockptr
> >>
> >> That's not possible with a ITER_DEST iov_iter.
> >>
> >> metze
> >
> > Can we create two iterators over the same memory? One for ITER_SOURCE and
> > another for ITER_DEST. And then make getsockopt_iter accept optval_in and
> > optval_out. We can also use optval_out position (iov_offset) as optlen output
> > value. Don't see why it won't work, but I agree that's gonna be a messy
> > conversion so let's see if someone else has better suggestions.
>
> Yes, that might work, but it would be good to get some feedback
> if this would be the way to go:
>
>         int (*getsockopt_iter)(struct socket *sock,
>                                int level, int optname,
>                                struct iov_iter *optval_in,
>                                struct iov_iter *optval_out);
>
> And *optlen = optval_out->iov_offset;
>
> Any objection or better ideas? Linus would that be what you had in mind?

I'd worry about performance - yes I know 'iter' are used elsewhere but...
Also look at the SCTP code.

How do you handle code that wants to return an updated length (often longer
than the one provided) and an error code (eg ERRSIZE or similar).

There is also a very strange use (I think it is a sockopt rather than an ioctl)
where the buffer length the application provides is only that of the header.
The actual buffer length is contained in the header.
The return length is the amount written into the full buffer.

David

From david.laight.linux at gmail.com Wed Apr 2 12:35:20 2025
From: david.laight.linux at gmail.com (David Laight)
Date: Wed, 2 Apr 2025 13:35:20 +0100
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer via optlen_t to proto[_ops].getsockopt()
In-Reply-To:
References:
Message-ID: <20250402133520.40451468@pumpkin>

On Tue, 1 Apr 2025 17:40:19 -0700
Linus Torvalds wrote:

> "
>
> On Mon, 31 Mar 2025 at 13:11, Stefan Metzmacher wrote:
> >
> > But as Linus don't like 'sockptr_t' I used a different approach.
>
> So the sockptr_t thing has already happened. I hate it, and I think
> it's ugly as hell, but it is what it is.
>
> I think it's a complete hack and having that "kernel or user" pointer
> flag is disgusting.

I have proposed a patch which replaced it with a structure.
That showed up some really hacky code in IIRC io_uring.

Using sockptr_t for the buffer was one thing, the generic code
can't copy the buffer to/from user because code lies about the length.
But using it for the length is just brain-dead.
That is fixed size and can be copied from/to user by the wrapper.
The code bloat reduction will be significant.

David

From david.laight.linux at gmail.com Wed Apr 2 20:46:38 2025
From: david.laight.linux at gmail.com (David Laight)
Date: Wed, 2 Apr 2025 21:46:38 +0100
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer via optlen_t to proto[_ops].getsockopt()
In-Reply-To:
References: <39515c76-310d-41af-a8b4-a814841449e3@samba.org> <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org> <0f0f9cfd-77be-4988-8043-0d1bd0d157e7@samba.org> <4b7ac4e9-6856-4e54-a2ba-15465e9622ac@samba.org> <20250402132906.0ceb8985@pumpkin>
Message-ID: <20250402214638.0b5eed55@pumpkin>

On Wed, 2 Apr 2025 07:19:46 -0700
Stanislav Fomichev wrote:

> On 04/02, David Laight wrote:
> > On Wed, 2 Apr 2025 00:53:58 +0200
> > Stefan Metzmacher wrote:
> >
> > > On 02.04.25 at 00:04, Stanislav Fomichev wrote:
> > > > [..]
> > > > Can we create two iterators over the same memory? One for ITER_SOURCE and
> > > > another for ITER_DEST. And then make getsockopt_iter accept optval_in and
> > > > optval_out. We can also use optval_out position (iov_offset) as optlen output
> > > > value. Don't see why it won't work, but I agree that's gonna be a messy
> > > > conversion so let's see if someone else has better suggestions.
> > >
> > > Yes, that might work, but it would be good to get some feedback
> > > if this would be the way to go:
> > >
> > >         int (*getsockopt_iter)(struct socket *sock,
> > >                                int level, int optname,
> > >                                struct iov_iter *optval_in,
> > >                                struct iov_iter *optval_out);
> > >
> > > And *optlen = optval_out->iov_offset;
> > >
> > > Any objection or better ideas? Linus would that be what you had in mind?
> >
> > I'd worry about performance - yes I know 'iter' are used elsewhere but...
> > Also look at the SCTP code.
>
> Performance usually does not matter for set/getsockopts, there
> are a few exceptions that I know (TCP_ZEROCOPY_RECEIVE)

That might be the one that is really horrid and completely abuses
the 'length' parameter.

> and maybe recent
> devmem sockopts; we can special-case these if needed, or keep sockptr_t,
> idk. I'm skeptical we can convert everything though, that's why the
> suggestion to start with sk/ip/tcp/udp.
>
> > How do you handle code that wants to return an updated length (often longer
> > than the one provided) and an error code (eg ERRSIZE or similar).
> >
> > There is also a very strange use (I think it is a sockopt rather than an ioctl)
> > where the buffer length the application provides is only that of the header.
> > The actual buffer length is contained in the header.
> > The return length is the amount written into the full buffer.
>
> Let's discuss these special cases as they come up? Worst case these
> places can always re-init iov_iter with a comment on why it is ok.
> But I do agree in general that there are a few places that do wild
> stuff.

The problem is that the generic code has to deal with all the 'wild stuff'.
It is also common to do non-sequential accesses - so iov_iter doesn't match
at all.
There also isn't a requirement for scatter-gather.

For 'normal' getsockopt (and setsockopt) with short lengths it actually makes
sense for the syscall wrapper to do the user copies.
But it would need to pass the user ptr+len as well as the kernel ptr+len
to give the required flexibility.
Then you have to work out whether the final copy to user is needed or not.
(not that hard, but it all adds complication).

David

From torvalds at linux-foundation.org Wed Apr 2 21:07:54 2025
From: torvalds at linux-foundation.org (Linus Torvalds)
Date: Wed, 2 Apr 2025 14:07:54 -0700
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer via optlen_t to proto[_ops].getsockopt()
In-Reply-To: <20250402214638.0b5eed55@pumpkin>
References: <39515c76-310d-41af-a8b4-a814841449e3@samba.org> <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org> <0f0f9cfd-77be-4988-8043-0d1bd0d157e7@samba.org> <4b7ac4e9-6856-4e54-a2ba-15465e9622ac@samba.org> <20250402132906.0ceb8985@pumpkin> <20250402214638.0b5eed55@pumpkin>
Message-ID:

On Wed, 2 Apr 2025 at 13:46, David Laight wrote:
>
> The problem is that the generic code has to deal with all the 'wild stuff'.
> It is also common to do non-sequential accesses - so iov_iter doesn't match
> at all.
> There also isn't a requirement for scatter-gather.

Note that the generic code has special cases for the simple stuff,
which is all that the sockopt code would need.

Now, that's _particularly_ true for the "single user address range"
thing, where there's a special ITER_UBUF thing.

We don't actually have a "single kernel range" version of that, but
ITER_KVEC is simple to use, and the sockopt code could say "I only
ever look at the first buffer".

It's ok to just not handle all the cases, and you don't *have* to use
the generic "copy_from_iter()" routines if you don't want to.

In fact, I would expect that something like sockopt generally wouldn't
want to use the normal iter copying routines, since those are
basically all geared towards "copy and update the iter".
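
The "only ever look at the first buffer" idea above can be modelled in
userspace C roughly as follows. This is a sketch under assumptions, not the
kernel's struct iov_iter: struct model_iter, MODEL_UBUF, MODEL_KVEC and both
helpers are hypothetical names, and user memory is modelled with plain
pointers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the iterator flavours named above. */
enum model_iter_type { MODEL_UBUF, MODEL_KVEC };

struct model_kvec {
    void   *base;
    size_t  len;
};

struct model_iter {
    enum model_iter_type type;
    size_t nr_segs;                     /* only meaningful for MODEL_KVEC */
    union {
        struct model_kvec ubuf;         /* the single range of MODEL_UBUF */
        const struct model_kvec *kvec;  /* array of ranges for MODEL_KVEC */
    } u;
};

/* Resolve the one range a sockopt handler cares about; the iterator is not advanced. */
int model_sockopt_range(const struct model_iter *it, void **base, size_t *len)
{
    assert(it != NULL && base != NULL && len != NULL);

    switch (it->type) {
    case MODEL_UBUF:
        *base = it->u.ubuf.base;
        *len = it->u.ubuf.len;
        return 0;
    case MODEL_KVEC:
        if (it->nr_segs < 1)
            return -EINVAL;
        /* Deliberately ignore every segment after the first one. */
        *base = it->u.kvec[0].base;
        *len = it->u.kvec[0].len;
        return 0;
    }
    return -EINVAL;
}

/* Write an int-sized option value; returns bytes written or a negative errno-style value. */
int model_put_int(const struct model_iter *it, int value)
{
    void *base;
    size_t len;
    int err = model_sockopt_range(it, &base, &len);

    if (err)
        return err;
    if (len < sizeof(value))
        return -EINVAL;
    memcpy(base, &value, sizeof(value));
    return (int)sizeof(value);
}
```

The handler takes the iterator as const and never advances it, which is the
opposite of the "copy and update the iter" behaviour of the generic copy
routines.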

Linus

From stfomichev at gmail.com Wed Apr 2 21:21:35 2025
From: stfomichev at gmail.com (Stanislav Fomichev)
Date: Wed, 2 Apr 2025 14:21:35 -0700
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer via optlen_t to proto[_ops].getsockopt()
In-Reply-To: <20250402214638.0b5eed55@pumpkin>
References: <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org> <0f0f9cfd-77be-4988-8043-0d1bd0d157e7@samba.org> <4b7ac4e9-6856-4e54-a2ba-15465e9622ac@samba.org> <20250402132906.0ceb8985@pumpkin> <20250402214638.0b5eed55@pumpkin>
Message-ID:

On 04/02, David Laight wrote:
> On Wed, 2 Apr 2025 07:19:46 -0700
> Stanislav Fomichev wrote:
>
> > On 04/02, David Laight wrote:
> > > On Wed, 2 Apr 2025 00:53:58 +0200
> > > Stefan Metzmacher wrote:
> > >
> > > > On 02.04.25 at 00:04, Stanislav Fomichev wrote:
> > > > > [..]
> > > > > Can we create two iterators over the same memory? One for ITER_SOURCE and
> > > > > another for ITER_DEST. And then make getsockopt_iter accept optval_in and
> > > > > optval_out. We can also use optval_out position (iov_offset) as optlen output
> > > > > value. Don't see why it won't work, but I agree that's gonna be a messy
> > > > > conversion so let's see if someone else has better suggestions.
> > > >
> > > > Yes, that might work, but it would be good to get some feedback
> > > > if this would be the way to go:
> > > >
> > > >         int (*getsockopt_iter)(struct socket *sock,
> > > >                                int level, int optname,
> > > >                                struct iov_iter *optval_in,
> > > >                                struct iov_iter *optval_out);
> > > >
> > > > And *optlen = optval_out->iov_offset;
> > > >
> > > > Any objection or better ideas? Linus would that be what you had in mind?
> > >
> > > I'd worry about performance - yes I know 'iter' are used elsewhere but...
> > > Also look at the SCTP code.
> >
> > Performance usually does not matter for set/getsockopts, there
> > are a few exceptions that I know (TCP_ZEROCOPY_RECEIVE)
>
> That might be the one that is really horrid and completely abuses
> the 'length' parameter.

It is reading and writing, yes, but it's not a huge problem. And it
does enforce the optlen (to copy back the same amount of bytes). It's
not that bad, it's just an example of where we need to be extra
careful.

> > and maybe recent
> > devmem sockopts; we can special-case these if needed, or keep sockptr_t,
> > idk. I'm skeptical we can convert everything though, that's why the
> > suggestion to start with sk/ip/tcp/udp.
> >
> > > How do you handle code that wants to return an updated length (often longer
> > > than the one provided) and an error code (eg ERRSIZE or similar).
> > >
> > > There is also a very strange use (I think it is a sockopt rather than an ioctl)
> > > where the buffer length the application provides is only that of the header.
> > > The actual buffer length is contained in the header.
> > > The return length is the amount written into the full buffer.
> >
> > Let's discuss these special cases as they come up? Worst case these
> > places can always re-init iov_iter with a comment on why it is ok.
> > But I do agree in general that there are a few places that do wild
> > stuff.
>
> The problem is that the generic code has to deal with all the 'wild stuff'.

getsockopt_iter will have optval_in for the minority of socket options
(like TCP_ZEROCOPY_RECEIVE) that want to read user's value as well
as optval_out. The latter is what the majority of socket options
will use to write their value. That doesn't seem too complicated to
handle?

> It is also common to do non-sequential accesses - so iov_iter doesn't match
> at all.

I disagree that it's 'common'. Searching for copy_from_sockptr_offset
returns a few cases and they are mostly using read-with-offset because
there is no sequential read (iterator) semantics with sockptr_t.

> There also isn't a requirement for scatter-gather.
>
> For 'normal' getsockopt (and setsockopt) with short lengths it actually makes
> sense for the syscall wrapper to do the user copies.
> But it would need to pass the user ptr+len as well as the kernel ptr+len
> to give the required flexibilty.
> Then you have to work out whether the final copy to user is needed or not.
> (not that hard, but it all adds complication).

Not sure I understand what's the problem. The user vs kernel part will
be abstracted by iov_iter. The callers will have to write the optlen
back. And there are two call sites we care about: io_uring and regular
system call. What's your suggestion? Maybe I'm missing something. Do you
prefer get_optlen/put_optlen?
From david.laight.linux at gmail.com Wed Apr 2 22:38:05 2025
From: david.laight.linux at gmail.com (David Laight)
Date: Wed, 2 Apr 2025 23:38:05 +0100
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer via optlen_t to proto[_ops].getsockopt()
In-Reply-To:
References: <407c1a05-24a7-430b-958c-0ca78c467c07@samba.org> <0f0f9cfd-77be-4988-8043-0d1bd0d157e7@samba.org> <4b7ac4e9-6856-4e54-a2ba-15465e9622ac@samba.org> <20250402132906.0ceb8985@pumpkin> <20250402214638.0b5eed55@pumpkin>
Message-ID: <20250402233805.464ed70e@pumpkin>

On Wed, 2 Apr 2025 14:21:35 -0700
Stanislav Fomichev wrote:

> [..]
>
> Not sure I understand what's the problem. The user vs kernel part will
> be abstracted by iov_iter. The callers will have to write the optlen
> back. And there are two call sites we care about: io_uring and regular
> system call. What's your suggestion? Maybe I'm missing something. Do you
> prefer get_optlen/put_optlen?

I think the final aim should be to pass the user supplied length to the
per-protocol code and have it return the length/error to be passed back to the
user.

But in a lot of cases the syscall wrapper can do the buffer copies (as well
as the length copies).
That would be restricted to short length (on stack).
So code that needed a long buffer (like some of the sctp options)
would need to directly access the user buffer (or a long buffer provided
by an in-kernel user).

But you'll find code that reads/writes well beyond the apparent size of
the user buffer.
(And not just code that accesses 4 bytes without checking the length).
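
The split described above, where the syscall wrapper does the copies for
short options and the handler only sees kernel memory plus the supplied
length, could look roughly like the userspace model below. Everything
here is an assumption made for illustration: the 64-byte cut-off, the
handler type demo_opt_handler, and the names demo_get_mtu() and
demo_getsockopt_wrapper() do not come from the patch set. The long-buffer
case, which would need direct access to the caller's buffer, is not
modelled and is simply rejected.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define DEMO_OPT_STACK_MAX 64	/* assumed cut-off for a "short" option */

/*
 * Hypothetical handler type: gets a kernel-side buffer and the length the
 * caller supplied, returns the number of bytes produced or a negative errno.
 */
typedef int (*demo_opt_handler)(void *kbuf, int len);

/* Example handler: reports a fixed integer value. */
static int demo_get_mtu(void *kbuf, int len)
{
	int value = 1500;

	if (len < (int)sizeof(value))
		return -EINVAL;
	memcpy(kbuf, &value, sizeof(value));
	return (int)sizeof(value);
}

/* Wrapper: copies in, calls the handler, copies out, updates the length. */
static int demo_getsockopt_wrapper(demo_opt_handler handler,
				   void *ubuf, int *ulen)
{
	unsigned char kbuf[DEMO_OPT_STACK_MAX];
	int len, ret;

	assert(handler && ubuf && ulen);
	len = *ulen;		/* models reading the length from the caller */
	if (len < 0)
		return -EINVAL;
	if (len > DEMO_OPT_STACK_MAX)
		return -EOPNOTSUPP;	/* long buffers are not modelled here */

	memcpy(kbuf, ubuf, (size_t)len);	/* copy in, for read-write options */
	ret = handler(kbuf, len);
	if (ret < 0)
		return ret;	/* caller's buffer and length stay untouched */
	if (ret > len)
		return -EINVAL;	/* handler claimed more than it was given */

	memcpy(ubuf, kbuf, (size_t)ret);	/* copy out only what was produced */
	*ulen = ret;
	return 0;
}
```

The model deliberately cannot express handlers that report a required
length larger than the buffer together with an error, which is one of
the awkward cases raised earlier in the thread.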
	David

From stfomichev at gmail.com Wed Apr 2 23:39:17 2025
From: stfomichev at gmail.com (Stanislav Fomichev)
Date: Wed, 2 Apr 2025 16:39:17 -0700
Subject: [rds-devel] [RFC PATCH 0/4] net/io_uring: pass a kernel pointer via optlen_t to proto[_ops].getsockopt()
In-Reply-To: <20250402233805.464ed70e@pumpkin>
References: <0f0f9cfd-77be-4988-8043-0d1bd0d157e7@samba.org> <4b7ac4e9-6856-4e54-a2ba-15465e9622ac@samba.org> <20250402132906.0ceb8985@pumpkin> <20250402214638.0b5eed55@pumpkin> <20250402233805.464ed70e@pumpkin>
Message-ID:

On 04/02, David Laight wrote:
> [..]
>
> I think the final aim should be to pass the user supplied length to the
> per-protocol code and have it return the length/error to be passed back to the
> user.

Like what Stefan's patch 3 is doing? Or are you suggesting to change
getsockopt handlers to handle length more explicitly? If we were to proceed
with the sockptr to iov_iter conversion we'll have to do it anyway (or pass the
length as the size of iov_iter).

> But in a lot of cases the syscall wrapper can do the buffer copies (as well
> as the length copies).
> That would be restricted to short length (on stack).
> So code that needed a long buffer (like some of the sctp options)
> would need to directly access the user buffer (or a long buffer provided
> by an in-kernel user).

This sounds similar to what we did with bpf hooks - copy (head of) the buffer
and run the bpf program on top of it. I remember iptables setsockopt being
problematic because of its huge size. It is an option, yes (to convert
protocol handlers to kernel memory mostly).

> But you'll find code that reads/writes well beyond the apparent size of
> the user buffer.
> (And not just code that accesses 4 bytes without checking the length).

We can start with getsockopt_iter + sk_getsockopt to see if there are any
issues with that approach. If not, adding ip/tcp/udp to the mix should be
doable. We can explain and comment on special cases if needed.
When other protocols are needed from io_uring, we can convert more. But
at least the new code will use the correct abstractions.

From kangyan91 at outlook.com Wed Apr 2 16:15:56 2025
From: kangyan91 at outlook.com (YAN KANG)
Date: Wed, 2 Apr 2025 16:15:56 +0000
Subject: [rds-devel] BUG: KASAN: slab-use-after-free in rds_inc_put
Message-ID:

Dear maintainers,

My fuzzing tool found a new kernel bug titled "BUG: KASAN: slab-use-after-free in rds_inc_put".
I tested it on the Linux upstream version (6.14.0-rc6).
Because the target object is freed in a kernel workqueue kthread, I have no reproducer for this bug.
But the crash log is sufficient to describe the cause of the bug.

Root cause analysis:

In net/rds/recv.c:

void rds_inc_put(struct rds_incoming *inc)
{
	rdsdebug("put inc %p ref %d\n", inc, refcount_read(&inc->i_refcount));
	if (refcount_dec_and_test(&inc->i_refcount)) {
		BUG_ON(!list_empty(&inc->i_item));

		inc->i_conn->c_trans->inc_free(inc); // crash, because inc->i_conn is a dangling pointer
	}
}

The struct rds_connection object is allocated in the rds_sendmsg function and added to the loop_conns list.
After that, two structures hold a reference to the struct rds_connection object:
1. struct rds_sock has a field (struct rds_connection *rs_conn); rs->rs_conn is initialized in the rds_sendmsg function.
2. Each item of the global list loop_conns (of type struct rds_loop_connection *) has a field (struct rds_connection *conn).
   In the function __rds_conn_create, conn is allocated and added to the global list.

In the workqueue, cleanup_net calls rds_loop_kill_conns and frees all connections.
But in another thread, the rds_sock still holds the dangling pointer.

Fix suggestion:
I think there needs to be some synchronization mechanism for the lifecycle of rds_connection.

If you fix this issue, please add the following tags to the commit:
Reported-by: yan kang
Reported-by: yue sun

I hope it helps.

Best regards
yan kang

Kernel crash log is below.
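
One common way to get the kind of lifecycle protection suggested in the
report is reference counting: every holder of the pointer takes a
reference, and the object is released only when the last reference is
dropped. The userspace model below is a sketch of that idea only. It is
not the RDS code and not a proposed fix; struct demo_conn and the
demo_conn_*() helpers are hypothetical, and a plain int stands in for
what kernel code would express with refcount_t or kref plus the locking
needed for concurrent access.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical connection object with a reference count. */
struct demo_conn {
	int refs;	/* kernel code would use refcount_t or kref instead */
	int *freed;	/* demo only: set to 1 when the object is released */
};

/* Create a connection; the initial reference belongs to the global list. */
static struct demo_conn *demo_conn_create(int *freed)
{
	struct demo_conn *c = malloc(sizeof(*c));

	if (!c)
		return NULL;
	c->refs = 1;
	c->freed = freed;
	*freed = 0;
	return c;
}

/* Take an extra reference, for example when a socket caches the pointer. */
static struct demo_conn *demo_conn_get(struct demo_conn *c)
{
	assert(c && c->refs > 0);
	c->refs++;
	return c;
}

/* Drop a reference; the object is released only with the last one. */
static void demo_conn_put(struct demo_conn *c)
{
	assert(c && c->refs > 0);
	if (--c->refs == 0) {
		*c->freed = 1;
		free(c);
	}
}
```

In this model, namespace teardown dropping the list's reference would
not release the object while a socket still holds its own reference, so
the socket's later release path would not touch freed memory.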
==================================================================
==================================================================
BUG: KASAN: slab-use-after-free in rds_inc_put+0x210/0x220 net/rds/recv.c:83
Read of size 8 at addr ffff88803d111048 by task syz.0.615/15412

CPU: 0 UID: 0 PID: 15412 Comm: syz.0.615 Not tainted 6.14.0-rc6-00006-g7122647c49bb-dirty #112
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Call Trace:
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x116/0x1b0 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:378 [inline]
 print_report+0xc0/0x5e0 mm/kasan/report.c:489
 kasan_report+0xbd/0xf0 mm/kasan/report.c:602
 rds_inc_put+0x210/0x220 net/rds/recv.c:83
 rds_clear_recv_queue+0x3e6/0x610 net/rds/recv.c:778
 rds_release+0xdb/0x460 net/rds/af_rds.c:73
 __sock_release+0xb0/0x270 net/socket.c:640
 sock_close+0x1c/0x30 net/socket.c:1408
 __fput+0x3f8/0xb40 fs/file_table.c:450
 task_work_run+0x169/0x260 kernel/task_work.c:239
 exit_task_work include/linux/task_work.h:43 [inline]
 do_exit+0xacc/0x2ce0 kernel/exit.c:938
 do_group_exit+0xd3/0x2a0 kernel/exit.c:1087
 get_signal+0x222c/0x2500 kernel/signal.c:3017
 arch_do_signal_or_restart+0x81/0x7d0 arch/x86/kernel/signal.c:337
 exit_to_user_mode_loop kernel/entry/common.c:111 [inline]
 exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline]
 __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
 syscall_exit_to_user_mode+0x150/0x2a0 kernel/entry/common.c:218
 do_syscall_64+0xd8/0x250 arch/x86/entry/common.c:89
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7ff6239e6d48
Code: Unable to access opcode bytes at 0x7ff6239e6d1e.
RSP: 002b:00007ff6214f5e90 EFLAGS: 00000293 ORIG_RAX: 00000000000000e6
RAX: fffffffffffffdfc RBX: 00007ff623bb5f01 RCX: 00007ff6239e6d48
RDX: 00007ff6214f5f20 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 00007ff623a39f8e R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 00007ff6214f5f20
R13: 0000000000000000 R14: 0000000000000000 R15: 00007ff6214d6000

Allocated by task 16518:
 kasan_save_stack+0x24/0x50 mm/kasan/common.c:47
 kasan_save_track+0x14/0x30 mm/kasan/common.c:68
 unpoison_slab_object mm/kasan/common.c:319 [inline]
 __kasan_slab_alloc+0x87/0x90 mm/kasan/common.c:345
 kasan_slab_alloc include/linux/kasan.h:250 [inline]
 slab_post_alloc_hook mm/slub.c:4119 [inline]
 slab_alloc_node mm/slub.c:4168 [inline]
 kmem_cache_alloc_noprof+0x167/0x3e0 mm/slub.c:4175
 __rds_conn_create+0x83c/0x2330 net/rds/connection.c:193
 rds_conn_create_outgoing+0x44/0x60 net/rds/connection.c:363
 rds_sendmsg+0x11b2/0x3160 net/rds/send.c:1294
 sock_sendmsg_nosec net/socket.c:711 [inline]
 __sock_sendmsg net/socket.c:726 [inline]
 __sys_sendto+0x4fc/0x570 net/socket.c:2197
 __do_sys_sendto net/socket.c:2204 [inline]
 __se_sys_sendto net/socket.c:2200 [inline]
 __x64_sys_sendto+0xe0/0x1c0 net/socket.c:2200
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xcb/0x250 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Freed by task 9656:
 kasan_save_stack+0x24/0x50 mm/kasan/common.c:47
 kasan_save_track+0x14/0x30 mm/kasan/common.c:68
 kasan_save_free_info+0x3b/0x60 mm/kasan/generic.c:582
 poison_slab_object mm/kasan/common.c:247 [inline]
 __kasan_slab_free+0x54/0x70 mm/kasan/common.c:264
 kasan_slab_free include/linux/kasan.h:233 [inline]
 slab_free_hook mm/slub.c:2353 [inline]
 slab_free mm/slub.c:4613 [inline]
 kmem_cache_free+0x145/0x4b0 mm/slub.c:4715
 rds_conn_destroy+0x61f/0x850 net/rds/connection.c:513
 rds_loop_kill_conns net/rds/loop.c:213 [inline]
 rds_loop_exit_net+0x2cd/0x410 net/rds/loop.c:219
 ops_exit_list+0xb0/0x180 net/core/net_namespace.c:172
 cleanup_net+0x5b3/0xd90 net/core/net_namespace.c:648
 process_one_work+0x966/0x1b90 kernel/workqueue.c:3236
 process_scheduled_works kernel/workqueue.c:3317 [inline]
 worker_thread+0x66e/0xe80 kernel/workqueue.c:3398
 kthread+0x2c7/0x3b0 kernel/kthread.c:389
 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

The buggy address belongs to the object at ffff88803d111000
 which belongs to the cache rds_connection of size 240
The buggy address is located 72 bytes inside of
 freed 240-byte region [ffff88803d111000, ffff88803d1110f0)

The buggy address belongs to the physical page:
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88803d111000 pfn:0x3d111
flags: 0xfff00000000000(node=0|zone=1|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 00fff00000000000 ffff88802aefb500 dead000000000122 0000000000000000
raw: ffff88803d111000 00000000800d000c 00000001f5000000 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x52cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP), pid 16518, tgid 16516 (syz.2.792), ts 132666258205, free_ts 129603681735
 set_page_owner include/linux/page_owner.h:32 [inline]
 post_alloc_hook+0x2e7/0x350 mm/page_alloc.c:1558
 prep_new_page mm/page_alloc.c:1566 [inline]
 get_page_from_freelist+0xe4e/0x2b20 mm/page_alloc.c:3476
 __alloc_pages_noprof+0x219/0x2190 mm/page_alloc.c:4753
 alloc_pages_mpol_noprof+0x2b6/0x600 mm/mempolicy.c:2269
 alloc_slab_page mm/slub.c:2423 [inline]
 allocate_slab mm/slub.c:2589 [inline]
 new_slab+0x2d5/0x420 mm/slub.c:2642
 ___slab_alloc+0xbb7/0x1850 mm/slub.c:3830
 __slab_alloc.constprop.0+0x56/0xb0 mm/slub.c:3920
 __slab_alloc_node mm/slub.c:3995 [inline]
 slab_alloc_node mm/slub.c:4156 [inline]
 kmem_cache_alloc_noprof+0x264/0x3e0 mm/slub.c:4175
 __rds_conn_create+0x83c/0x2330 net/rds/connection.c:193
rds_conn_create_outgoing+0x44/0x60 net/rds/connection.c:363 rds_sendmsg+0x11b2/0x3160 net/rds/send.c:1294 sock_sendmsg_nosec net/socket.c:711 [inline] __sock_sendmsg net/socket.c:726 [inline] __sys_sendto+0x4fc/0x570 net/socket.c:2197 __do_sys_sendto net/socket.c:2204 [inline] __se_sys_sendto net/socket.c:2200 [inline] __x64_sys_sendto+0xe0/0x1c0 net/socket.c:2200 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcb/0x250 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f page last free pid 49 tgid 49 stack trace: reset_page_owner include/linux/page_owner.h:25 [inline] free_pages_prepare mm/page_alloc.c:1127 [inline] free_unref_page+0x700/0x10a0 mm/page_alloc.c:2659 vfree+0x172/0x940 mm/vmalloc.c:3383 delayed_vfree_work+0x57/0x70 mm/vmalloc.c:3303 process_one_work+0x966/0x1b90 kernel/workqueue.c:3236 process_scheduled_works kernel/workqueue.c:3317 [inline] worker_thread+0x66e/0xe80 kernel/workqueue.c:3398 kthread+0x2c7/0x3b0 kernel/kthread.c:389 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 Memory state around the buggy address: ffff88803d110f00: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ffff88803d110f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff88803d111000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88803d111080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc ffff88803d111100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ================================================================== From leon at kernel.org Tue Apr 8 11:04:55 2025 From: leon at kernel.org (Leon Romanovsky) Date: Tue, 8 Apr 2025 14:04:55 +0300 Subject: [rds-devel] [PATCH net-next] rds: rely on IB/core to determine if device is ODP capable Message-ID: From: Leon Romanovsky There is no need to perform checks if IB device ODP capable as ib_reg_user_mr() will check all access flags anyway. 
RDS is the only one in-kernel ODP user, so change return value for ODP not supported case, to the value used by RDS. Signed-off-by: Leon Romanovsky --- drivers/infiniband/core/verbs.c | 2 +- net/rds/ib.c | 8 -------- net/rds/ib.h | 1 - net/rds/ib_rdma.c | 5 ----- 4 files changed, 1 insertion(+), 15 deletions(-) diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index c5e78bbefbd0..61620787ee48 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -2218,7 +2218,7 @@ struct ib_mr *ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, if (!(pd->device->attrs.kernel_cap_flags & IBK_ON_DEMAND_PAGING)) { pr_debug("ODP support not available\n"); - return ERR_PTR(-EINVAL); + return ERR_PTR(-EOPNOTSUPP); } } diff --git a/net/rds/ib.c b/net/rds/ib.c index 9826fe7f9d00..c62aa2ff4963 100644 --- a/net/rds/ib.c +++ b/net/rds/ib.c @@ -153,14 +153,6 @@ static int rds_ib_add_one(struct ib_device *device) rds_ibdev->max_wrs = device->attrs.max_qp_wr; rds_ibdev->max_sge = min(device->attrs.max_send_sge, RDS_IB_MAX_SGE); - rds_ibdev->odp_capable = - !!(device->attrs.kernel_cap_flags & - IBK_ON_DEMAND_PAGING) && - !!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps & - IB_ODP_SUPPORT_WRITE) && - !!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps & - IB_ODP_SUPPORT_READ); - rds_ibdev->max_1m_mrs = device->attrs.max_mr ? 
min_t(unsigned int, (device->attrs.max_mr / 2), rds_ib_mr_1m_pool_size) : rds_ib_mr_1m_pool_size; diff --git a/net/rds/ib.h b/net/rds/ib.h index 8ef3178ed4d6..f3ec4ff5951f 100644 --- a/net/rds/ib.h +++ b/net/rds/ib.h @@ -246,7 +246,6 @@ struct rds_ib_device { struct list_head conn_list; struct ib_device *dev; struct ib_pd *pd; - u8 odp_capable:1; unsigned int max_mrs; struct rds_ib_mr_pool *mr_1m_pool; diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c index d1cfceeff133..75ab7b8db864 100644 --- a/net/rds/ib_rdma.c +++ b/net/rds/ib_rdma.c @@ -568,11 +568,6 @@ void *rds_ib_get_mr(struct scatterlist *sg, unsigned long nents, struct ib_sge sge = {}; struct ib_mr *ib_mr; - if (!rds_ibdev->odp_capable) { - ret = -EOPNOTSUPP; - goto out; - } - ib_mr = ib_reg_user_mr(rds_ibdev->pd, start, length, virt_addr, access_flags); -- 2.49.0 From jgg at nvidia.com Tue Apr 8 12:23:38 2025 From: jgg at nvidia.com (Jason Gunthorpe) Date: Tue, 8 Apr 2025 09:23:38 -0300 Subject: [rds-devel] [PATCH net-next] rds: rely on IB/core to determine if device is ODP capable In-Reply-To: References: Message-ID: <20250408122338.GA1778492@nvidia.com> On Tue, Apr 08, 2025 at 02:04:55PM +0300, Leon Romanovsky wrote: > diff --git a/net/rds/ib.c b/net/rds/ib.c > index 9826fe7f9d00..c62aa2ff4963 100644 > --- a/net/rds/ib.c > +++ b/net/rds/ib.c > @@ -153,14 +153,6 @@ static int rds_ib_add_one(struct ib_device *device) > rds_ibdev->max_wrs = device->attrs.max_qp_wr; > rds_ibdev->max_sge = min(device->attrs.max_send_sge, RDS_IB_MAX_SGE); > > - rds_ibdev->odp_capable = > - !!(device->attrs.kernel_cap_flags & > - IBK_ON_DEMAND_PAGING) && > - !!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps & > - IB_ODP_SUPPORT_WRITE) && > - !!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps & > - IB_ODP_SUPPORT_READ); This patch seems to drop the check for WRITE and READ support on the ODP. 
Jason From leon at kernel.org Tue Apr 8 12:34:13 2025 From: leon at kernel.org (Leon Romanovsky) Date: Tue, 8 Apr 2025 15:34:13 +0300 Subject: [rds-devel] [PATCH net-next] rds: rely on IB/core to determine if device is ODP capable In-Reply-To: <20250408122338.GA1778492@nvidia.com> References: <20250408122338.GA1778492@nvidia.com> Message-ID: <20250408123413.GA199604@unreal> On Tue, Apr 08, 2025 at 09:23:38AM -0300, Jason Gunthorpe wrote: > On Tue, Apr 08, 2025 at 02:04:55PM +0300, Leon Romanovsky wrote: > > diff --git a/net/rds/ib.c b/net/rds/ib.c > > index 9826fe7f9d00..c62aa2ff4963 100644 > > --- a/net/rds/ib.c > > +++ b/net/rds/ib.c > > @@ -153,14 +153,6 @@ static int rds_ib_add_one(struct ib_device *device) > > rds_ibdev->max_wrs = device->attrs.max_qp_wr; > > rds_ibdev->max_sge = min(device->attrs.max_send_sge, RDS_IB_MAX_SGE); > > > > - rds_ibdev->odp_capable = > > - !!(device->attrs.kernel_cap_flags & > > - IBK_ON_DEMAND_PAGING) && > > - !!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps & > > - IB_ODP_SUPPORT_WRITE) && > > - !!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps & > > - IB_ODP_SUPPORT_READ); > > This patch seems to drop the check for WRITE and READ support on the > ODP. Right, and they are part of IBK_ON_DEMAND_PAGING support. All ODP providers support both IB_ODP_SUPPORT_WRITE and IB_ODP_SUPPORT_READ. RDS doesn't need to check more than general ODP support and can safely rely on internal driver logic to create right MR. 
Thanks > > Jason From jgg at nvidia.com Tue Apr 8 12:38:14 2025 From: jgg at nvidia.com (Jason Gunthorpe) Date: Tue, 8 Apr 2025 09:38:14 -0300 Subject: [rds-devel] [PATCH net-next] rds: rely on IB/core to determine if device is ODP capable In-Reply-To: <20250408123413.GA199604@unreal> References: <20250408122338.GA1778492@nvidia.com> <20250408123413.GA199604@unreal> Message-ID: <20250408123814.GC1778492@nvidia.com> On Tue, Apr 08, 2025 at 03:34:13PM +0300, Leon Romanovsky wrote: > On Tue, Apr 08, 2025 at 09:23:38AM -0300, Jason Gunthorpe wrote: > > On Tue, Apr 08, 2025 at 02:04:55PM +0300, Leon Romanovsky wrote: > > > diff --git a/net/rds/ib.c b/net/rds/ib.c > > > index 9826fe7f9d00..c62aa2ff4963 100644 > > > --- a/net/rds/ib.c > > > +++ b/net/rds/ib.c > > > @@ -153,14 +153,6 @@ static int rds_ib_add_one(struct ib_device *device) > > > rds_ibdev->max_wrs = device->attrs.max_qp_wr; > > > rds_ibdev->max_sge = min(device->attrs.max_send_sge, RDS_IB_MAX_SGE); > > > > > > - rds_ibdev->odp_capable = > > > - !!(device->attrs.kernel_cap_flags & > > > - IBK_ON_DEMAND_PAGING) && > > > - !!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps & > > > - IB_ODP_SUPPORT_WRITE) && > > > - !!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps & > > > - IB_ODP_SUPPORT_READ); > > > > This patch seems to drop the check for WRITE and READ support on the > > ODP. > > Right, and they are part of IBK_ON_DEMAND_PAGING support. All ODP > providers support both IB_ODP_SUPPORT_WRITE and IB_ODP_SUPPORT_READ. Where? mlx5 reads this from FW and I don't see anything blocking IBK_ON_DEMAND_PAGING if the FW is weird. 
Jason From leon at kernel.org Tue Apr 8 19:11:38 2025 From: leon at kernel.org (Leon Romanovsky) Date: Tue, 8 Apr 2025 22:11:38 +0300 Subject: [rds-devel] [PATCH net-next] rds: rely on IB/core to determine if device is ODP capable In-Reply-To: <20250408123814.GC1778492@nvidia.com> References: <20250408122338.GA1778492@nvidia.com> <20250408123413.GA199604@unreal> <20250408123814.GC1778492@nvidia.com> Message-ID: <20250408191138.GF199604@unreal> On Tue, Apr 08, 2025 at 09:38:14AM -0300, Jason Gunthorpe wrote: > On Tue, Apr 08, 2025 at 03:34:13PM +0300, Leon Romanovsky wrote: > > On Tue, Apr 08, 2025 at 09:23:38AM -0300, Jason Gunthorpe wrote: > > > On Tue, Apr 08, 2025 at 02:04:55PM +0300, Leon Romanovsky wrote: > > > > diff --git a/net/rds/ib.c b/net/rds/ib.c > > > > index 9826fe7f9d00..c62aa2ff4963 100644 > > > > --- a/net/rds/ib.c > > > > +++ b/net/rds/ib.c > > > > @@ -153,14 +153,6 @@ static int rds_ib_add_one(struct ib_device *device) > > > > rds_ibdev->max_wrs = device->attrs.max_qp_wr; > > > > rds_ibdev->max_sge = min(device->attrs.max_send_sge, RDS_IB_MAX_SGE); > > > > > > > > - rds_ibdev->odp_capable = > > > > - !!(device->attrs.kernel_cap_flags & > > > > - IBK_ON_DEMAND_PAGING) && > > > > - !!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps & > > > > - IB_ODP_SUPPORT_WRITE) && > > > > - !!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps & > > > > - IB_ODP_SUPPORT_READ); > > > > > > This patch seems to drop the check for WRITE and READ support on the > > > ODP. > > > > Right, and they are part of IBK_ON_DEMAND_PAGING support. All ODP > > providers support both IB_ODP_SUPPORT_WRITE and IB_ODP_SUPPORT_READ. > > Where? mlx5 reads this from FW and I don't see anything blocking > IBK_ON_DEMAND_PAGING if the FW is weird. As the one who added it, I can assure you that we added these checks not because of weird FW, but because these caps existed. RDS calls to ib_reg_user_mr() with the following access_flags. 
564 int access_flags = 565 (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ | 566 IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_ATOMIC | 567 IB_ACCESS_ON_DEMAND); <...> 575 576 ib_mr = ib_reg_user_mr(rds_ibdev->pd, start, length, virt_addr, 577 access_flags); If for some reason ODP doesn't support WRITE and/or READ, ib_reg_user_mr() will return an error from FW, Thanks > > Jason From pranav.tyagi03 at gmail.com Tue Apr 8 19:41:53 2025 From: pranav.tyagi03 at gmail.com (Pranav Tyagi) Date: Wed, 9 Apr 2025 01:11:53 +0530 Subject: [rds-devel] [PATCH] net: rds: replace strncpy with memcpy Message-ID: <20250408194153.6570-1-pranav.tyagi03@gmail.com> Replace deprecated strncpy() function with memcpy() as the destination buffer is length bounded and not required to be NUL-terminated Signed-off-by: Pranav Tyagi --- net/rds/connection.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/net/rds/connection.c b/net/rds/connection.c index c749c5525b40..3718c3edb32e 100644 --- a/net/rds/connection.c +++ b/net/rds/connection.c @@ -749,8 +749,7 @@ static int rds_conn_info_visitor(struct rds_conn_path *cp, void *buffer) cinfo->laddr = conn->c_laddr.s6_addr32[3]; cinfo->faddr = conn->c_faddr.s6_addr32[3]; cinfo->tos = conn->c_tos; - strncpy(cinfo->transport, conn->c_trans->t_name, - sizeof(cinfo->transport)); + memcpy(cinfo->transport, conn->c_trans->t_name, min(sizeof(cinfo->transport), strnlen(conn->c_trans->t_name, sizeof(cinfo->transport)))); cinfo->flags = 0; rds_conn_info_set(cinfo->flags, test_bit(RDS_IN_XMIT, &cp->cp_flags), @@ -775,8 +774,7 @@ static int rds6_conn_info_visitor(struct rds_conn_path *cp, void *buffer) cinfo6->next_rx_seq = cp->cp_next_rx_seq; cinfo6->laddr = conn->c_laddr; cinfo6->faddr = conn->c_faddr; - strncpy(cinfo6->transport, conn->c_trans->t_name, - sizeof(cinfo6->transport)); + memcpy(cinfo6->transport, conn->c_trans->t_name, min(sizeof(cinfo6->transport), strnlen(conn->c_trans->t_name, sizeof(cinfo6->transport)))); cinfo6->flags = 0; 
rds_conn_info_set(cinfo6->flags, test_bit(RDS_IN_XMIT, &cp->cp_flags), -- 2.49.0 From shannon.nelson at amd.com Tue Apr 8 21:18:12 2025 From: shannon.nelson at amd.com (Nelson, Shannon) Date: Tue, 8 Apr 2025 14:18:12 -0700 Subject: [rds-devel] [PATCH] net: rds: replace strncpy with memcpy In-Reply-To: <20250408194153.6570-1-pranav.tyagi03@gmail.com> References: <20250408194153.6570-1-pranav.tyagi03@gmail.com> Message-ID: On 4/8/2025 12:41 PM, Pranav Tyagi wrote: > > Replace deprecated strncpy() function with memcpy() I suspect that strtomem() is a better answer here than a raw memcpy() - it already has all the strnlen() and min() stuff baked into it, along with some other compile-time checking. > as the destination buffer is length bounded > and not required to be NUL-terminated Are you sure that null-termination is not required? I'm not familiar with this bit of code, but the definitions of both of the .transport[] fields do say /* null term ascii */ sln > > Signed-off-by: Pranav Tyagi > --- > net/rds/connection.c | 6 ++---- > 1 file changed, 2 insertions(+), 4 deletions(-) > > diff --git a/net/rds/connection.c b/net/rds/connection.c > index c749c5525b40..3718c3edb32e 100644 > --- a/net/rds/connection.c > +++ b/net/rds/connection.c > @@ -749,8 +749,7 @@ static int rds_conn_info_visitor(struct rds_conn_path *cp, void *buffer) > cinfo->laddr = conn->c_laddr.s6_addr32[3]; > cinfo->faddr = conn->c_faddr.s6_addr32[3]; > cinfo->tos = conn->c_tos; > - strncpy(cinfo->transport, conn->c_trans->t_name, > - sizeof(cinfo->transport)); > + memcpy(cinfo->transport, conn->c_trans->t_name, min(sizeof(cinfo->transport), strnlen(conn->c_trans->t_name, sizeof(cinfo->transport)))); > cinfo->flags = 0; > > rds_conn_info_set(cinfo->flags, test_bit(RDS_IN_XMIT, &cp->cp_flags), > @@ -775,8 +774,7 @@ static int rds6_conn_info_visitor(struct rds_conn_path *cp, void *buffer) > cinfo6->next_rx_seq = cp->cp_next_rx_seq; > cinfo6->laddr = conn->c_laddr; > cinfo6->faddr = conn->c_faddr; > - 
strncpy(cinfo6->transport, conn->c_trans->t_name, > - sizeof(cinfo6->transport)); > + memcpy(cinfo6->transport, conn->c_trans->t_name, min(sizeof(cinfo6->transport), strnlen(conn->c_trans->t_name, sizeof(cinfo6->transport)))); > cinfo6->flags = 0; > > rds_conn_info_set(cinfo6->flags, test_bit(RDS_IN_XMIT, &cp->cp_flags), > -- > 2.49.0 > > From allison.henderson at oracle.com Tue Apr 8 22:45:24 2025 From: allison.henderson at oracle.com (Allison Henderson) Date: Tue, 8 Apr 2025 22:45:24 +0000 Subject: [rds-devel] [PATCH] net: rds: replace strncpy with memcpy In-Reply-To: References: <20250408194153.6570-1-pranav.tyagi03@gmail.com> Message-ID: <32b8d4635b8f15cce3ae898cc480616428bc93ba.camel@oracle.com> On Tue, 2025-04-08 at 14:18 -0700, Nelson, Shannon wrote: > On 4/8/2025 12:41 PM, Pranav Tyagi wrote: > > > > Replace deprecated strncpy() function with memcpy() > > I suspect that strtomem() is a better answer here than a raw memcpy() - > it already has all the strnlen() and min() stuff baked into it, along > with some other compile-time checking. > > > as the destination buffer is length bounded > > and not required to be NUL-terminated > > Are you sure that null-termination is not required? I'm not familiar > with this bit of code, but the definitions of both of the .transport[] > fields do say /* null term ascii */ > > sln > Hi all, It appears that the transport names are null-terminated. Looking at rds_ib_transport, rds_tcp_transport, and rds_loop_transport, the t_name member is initialized to "infiniband", "tcp", or "loop", respectively, which include the null terminator. Given that, I think strscpy seems to be the appropriate function to use here. However, it looks like Baris has already submitted a similar patch yesterday, and unfortunately, we can't accept both. That said, thank you very much for your contribution; we really appreciate it!
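The point about NUL termination in the message above can be illustrated outside the kernel. The following is a minimal userspace sketch, not kernel code: `bounded_copy()` is a hypothetical stand-in written for this illustration (it is not the kernel's strscpy() and is not taken from any patch in this thread). It shows the property under discussion: a bounded copy that always leaves the destination NUL-terminated, whereas strncpy() leaves it unterminated when the source fills the buffer.

```c
#define _POSIX_C_SOURCE 200809L /* for strnlen() */
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper for illustration only (an assumption of this sketch,
 * not a kernel API). Copies at most size - 1 bytes from src into dst and
 * always NUL-terminates dst when size > 0.
 *
 * Returns the number of bytes copied (excluding the NUL), or -1 if src did
 * not fit and was truncated, or if size is 0.
 */
static long bounded_copy(char *dst, const char *src, size_t size)
{
	size_t len;

	if (size == 0)
		return -1;

	/* Look at no more than size bytes of the source. */
	len = strnlen(src, size);
	if (len == size) {
		/* Source does not fit: keep size - 1 bytes and terminate. */
		memcpy(dst, src, size - 1);
		dst[size - 1] = '\0';
		return -1;
	}

	/* Source fits: copy it together with its NUL terminator. */
	memcpy(dst, src, len + 1);
	return (long)len;
}
```

With a 4-byte destination (a size chosen only for this example), `bounded_copy(buf, "tcp", sizeof(buf))` copies the name with its terminator, while `bounded_copy(buf, "infiniband", sizeof(buf))` truncates to "inf" and still terminates. A plain `strncpy(buf, "infiniband", sizeof(buf))` would fill all four bytes and leave no terminator, which matters if the field is documented as null-terminated ASCII.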
Allison > > > > Signed-off-by: Pranav Tyagi > > --- > > net/rds/connection.c | 6 ++---- > > 1 file changed, 2 insertions(+), 4 deletions(-) > > > > diff --git a/net/rds/connection.c b/net/rds/connection.c > > index c749c5525b40..3718c3edb32e 100644 > > --- a/net/rds/connection.c > > +++ b/net/rds/connection.c > > @@ -749,8 +749,7 @@ static int rds_conn_info_visitor(struct rds_conn_path *cp, void *buffer) > > cinfo->laddr = conn->c_laddr.s6_addr32[3]; > > cinfo->faddr = conn->c_faddr.s6_addr32[3]; > > cinfo->tos = conn->c_tos; > > - strncpy(cinfo->transport, conn->c_trans->t_name, > > - sizeof(cinfo->transport)); > > + memcpy(cinfo->transport, conn->c_trans->t_name, min(sizeof(cinfo->transport), strnlen(conn->c_trans->t_name, sizeof(cinfo->transport)))); > > cinfo->flags = 0; > > > > rds_conn_info_set(cinfo->flags, test_bit(RDS_IN_XMIT, &cp->cp_flags), > > @@ -775,8 +774,7 @@ static int rds6_conn_info_visitor(struct rds_conn_path *cp, void *buffer) > > cinfo6->next_rx_seq = cp->cp_next_rx_seq; > > cinfo6->laddr = conn->c_laddr; > > cinfo6->faddr = conn->c_faddr; > > - strncpy(cinfo6->transport, conn->c_trans->t_name, > > - sizeof(cinfo6->transport)); > > + memcpy(cinfo6->transport, conn->c_trans->t_name, min(sizeof(cinfo6->transport), strnlen(conn->c_trans->t_name, sizeof(cinfo6->transport)))); > > cinfo6->flags = 0; > > > > rds_conn_info_set(cinfo6->flags, test_bit(RDS_IN_XMIT, &cp->cp_flags), > > -- > > 2.49.0 > > > > > > From allison.henderson at oracle.com Wed Apr 9 00:54:39 2025 From: allison.henderson at oracle.com (Allison Henderson) Date: Wed, 9 Apr 2025 00:54:39 +0000 Subject: [rds-devel] [PATCH net-next] rds: rely on IB/core to determine if device is ODP capable In-Reply-To: <20250408191138.GF199604@unreal> References: <20250408122338.GA1778492@nvidia.com> <20250408123413.GA199604@unreal> <20250408123814.GC1778492@nvidia.com> <20250408191138.GF199604@unreal> Message-ID: <94c8e113c11ec18c5e9330d7f2175a4469518e44.camel@oracle.com> On Tue, 2025-04-08 at 22:11 
+0300, Leon Romanovsky wrote: > On Tue, Apr 08, 2025 at 09:38:14AM -0300, Jason Gunthorpe wrote: > > On Tue, Apr 08, 2025 at 03:34:13PM +0300, Leon Romanovsky wrote: > > > On Tue, Apr 08, 2025 at 09:23:38AM -0300, Jason Gunthorpe wrote: > > > > On Tue, Apr 08, 2025 at 02:04:55PM +0300, Leon Romanovsky wrote: > > > > > diff --git a/net/rds/ib.c b/net/rds/ib.c > > > > > index 9826fe7f9d00..c62aa2ff4963 100644 > > > > > --- a/net/rds/ib.c > > > > > +++ b/net/rds/ib.c > > > > > @@ -153,14 +153,6 @@ static int rds_ib_add_one(struct ib_device *device) > > > > > rds_ibdev->max_wrs = device->attrs.max_qp_wr; > > > > > rds_ibdev->max_sge = min(device->attrs.max_send_sge, RDS_IB_MAX_SGE); > > > > > > > > > > - rds_ibdev->odp_capable = > > > > > - !!(device->attrs.kernel_cap_flags & > > > > > - IBK_ON_DEMAND_PAGING) && > > > > > - !!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps & > > > > > - IB_ODP_SUPPORT_WRITE) && > > > > > - !!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps & > > > > > - IB_ODP_SUPPORT_READ); > > > > > > > > This patch seems to drop the check for WRITE and READ support on the > > > > ODP. > > > > > > Right, and they are part of IBK_ON_DEMAND_PAGING support. All ODP > > > providers support both IB_ODP_SUPPORT_WRITE and IB_ODP_SUPPORT_READ. > > > > Where? mlx5 reads this from FW and I don't see anything blocking > > IBK_ON_DEMAND_PAGING if the FW is weird. > > As the one who added it, I can assure you that we added these checks not > because of weird FW, but because these caps existed. Hi Leon, Thanks for the patch. Is there a commit id for the FW checks we can see? Maybe we can just add a little more detail to the commit description to make clear where they are and what they're checking for. Thank you! Allison > > RDS calls to ib_reg_user_mr() with the following access_flags. 
> > 564 int access_flags = > 565 (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ | > 566 IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_ATOMIC | > 567 IB_ACCESS_ON_DEMAND); > <...> > 575 > 576 ib_mr = ib_reg_user_mr(rds_ibdev->pd, start, length, virt_addr, > 577 access_flags); > > If for some reason ODP doesn't support WRITE and/or READ, ib_reg_user_mr() will return an error from FW, > > Thanks > > > > > > Jason From leon at kernel.org Thu Apr 10 11:35:05 2025 From: leon at kernel.org (Leon Romanovsky) Date: Thu, 10 Apr 2025 14:35:05 +0300 Subject: [rds-devel] [PATCH net-next] rds: rely on IB/core to determine if device is ODP capable In-Reply-To: <94c8e113c11ec18c5e9330d7f2175a4469518e44.camel@oracle.com> References: <20250408122338.GA1778492@nvidia.com> <20250408123413.GA199604@unreal> <20250408123814.GC1778492@nvidia.com> <20250408191138.GF199604@unreal> <94c8e113c11ec18c5e9330d7f2175a4469518e44.camel@oracle.com> Message-ID: <20250410113505.GQ199604@unreal> On Wed, Apr 09, 2025 at 12:54:39AM +0000, Allison Henderson wrote: > On Tue, 2025-04-08 at 22:11 +0300, Leon Romanovsky wrote: > > On Tue, Apr 08, 2025 at 09:38:14AM -0300, Jason Gunthorpe wrote: > > > On Tue, Apr 08, 2025 at 03:34:13PM +0300, Leon Romanovsky wrote: > > > > On Tue, Apr 08, 2025 at 09:23:38AM -0300, Jason Gunthorpe wrote: > > > > > On Tue, Apr 08, 2025 at 02:04:55PM +0300, Leon Romanovsky wrote: > > > > > > diff --git a/net/rds/ib.c b/net/rds/ib.c > > > > > > index 9826fe7f9d00..c62aa2ff4963 100644 > > > > > > --- a/net/rds/ib.c > > > > > > +++ b/net/rds/ib.c > > > > > > @@ -153,14 +153,6 @@ static int rds_ib_add_one(struct ib_device *device) > > > > > > rds_ibdev->max_wrs = device->attrs.max_qp_wr; > > > > > > rds_ibdev->max_sge = min(device->attrs.max_send_sge, RDS_IB_MAX_SGE); > > > > > > > > > > > > - rds_ibdev->odp_capable = > > > > > > - !!(device->attrs.kernel_cap_flags & > > > > > > - IBK_ON_DEMAND_PAGING) && > > > > > > - !!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps & > > > > > 
> - IB_ODP_SUPPORT_WRITE) && > > > > > > - !!(device->attrs.odp_caps.per_transport_caps.rc_odp_caps & > > > > > > - IB_ODP_SUPPORT_READ); > > > > > > > > > > This patch seems to drop the check for WRITE and READ support on the > > > > > ODP. > > > > > > > > Right, and they are part of IBK_ON_DEMAND_PAGING support. All ODP > > > > providers support both IB_ODP_SUPPORT_WRITE and IB_ODP_SUPPORT_READ. > > > > > > Where? mlx5 reads this from FW and I don't see anything blocking > > > IBK_ON_DEMAND_PAGING if the FW is weird. > > > > As the one who added it, I can assure you that we added these checks not > > because of weird FW, but because these caps existed. > Hi Leon, > > Thanks for the patch. Is there a commit id for the FW checks we can see? It is part of FW checks to provided access_flags. In this case, you are asking for IB_ACCESS_REMOTE_READ and IB_ACCESS_ON_DEMAND. The check of IB_ODP_SUPPORT_READ is used when you need to dig which transport actually supports it. The thing is that ODP was always supported for RC QPs, from day one. > Maybe we can just add a little more detail to > the commit description to make clear where they are and what they're checking for. Thank you! Sure, will update it. Thanks > > Allison > > > > > RDS calls to ib_reg_user_mr() with the following access_flags. > > > > 564 int access_flags = > > 565 (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ | > > 566 IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_ATOMIC | > > 567 IB_ACCESS_ON_DEMAND); > > <...> > > 575 > > 576 ib_mr = ib_reg_user_mr(rds_ibdev->pd, start, length, virt_addr, > > 577 access_flags); > > > > If for some reason ODP doesn't support WRITE and/or READ, ib_reg_user_mr() will return an error from FW, > > > > Thanks > > > > > > > > > > Jason >
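For readers following the ODP discussion above, the difference between the check the patch removes and the check that remains can be sketched as standalone C. This is an illustration under stated assumptions, not the kernel code: the `SKETCH_*` bit values, `struct sketch_dev_attrs`, and both function names are hypothetical and exist only in this example; the real flags are IBK_ON_DEMAND_PAGING, IB_ODP_SUPPORT_WRITE and IB_ODP_SUPPORT_READ, whose actual values are not reproduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define SKETCH_CAP_ON_DEMAND_PAGING (1u << 0)
#define SKETCH_ODP_SUPPORT_WRITE    (1u << 1)
#define SKETCH_ODP_SUPPORT_READ     (1u << 2)

/* Hypothetical, reduced stand-in for the device attributes used above. */
struct sketch_dev_attrs {
	uint32_t kernel_cap_flags; /* general capability bits */
	uint32_t rc_odp_caps;      /* per-transport (RC) ODP capability bits */
};

/*
 * Shape of the three-part test removed from rds_ib_add_one():
 * general ODP support AND RC write support AND RC read support.
 */
static bool sketch_odp_capable_old(const struct sketch_dev_attrs *a)
{
	return (a->kernel_cap_flags & SKETCH_CAP_ON_DEMAND_PAGING) != 0 &&
	       (a->rc_odp_caps & SKETCH_ODP_SUPPORT_WRITE) != 0 &&
	       (a->rc_odp_caps & SKETCH_ODP_SUPPORT_READ) != 0;
}

/*
 * Shape of the capability test shown in the ib_reg_user_mr() hunk:
 * only the general ODP flag is examined up front.
 */
static bool sketch_odp_capable_new(const struct sketch_dev_attrs *a)
{
	return (a->kernel_cap_flags & SKETCH_CAP_ON_DEMAND_PAGING) != 0;
}
```

The two functions disagree only for a device that advertises general ODP support without both RC read and RC write support, which is the case Jason asks about. According to Leon's replies, such a registration request would be rejected later, when the access flags are validated, so this sketch models only the up-front test and not that later failure path.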