From: l...@neteng.engr.sgi.com (Larry McVoy) Subject: Re: Some matters on the new verify_area Date: 1996/10/11 Message-ID: <53kt8g$juf@fido.asd.sgi.com>#1/1 X-Deja-AN: 188684072 references: <m0vBVQB-0005FdC@lightning.swansea.linux.org.uk> x-submitted-via: n...@ratatosk.yggdrasil.com (linux.* gateway) x-hdr-sender: l...@neteng.engr.sgi.com organization: Silicon Graphics Inc., Mountain View, CA x-env-sender: n...@fido.asd.sgi.com newsgroups: linux.dev.kernel Alan Cox (a...@lxorguk.ukuu.org.uk) wrote: : 2. It causes some extremely hard to solve problems in meeting both written : and the unwritten API specification of Unix. You get strange things like : a partial write of data returning EFAULT even though data was written. That : breaks stuff. In addition subtle bugs in the exception handling will leak : resources. This breaks POSIX.1. See IEEE Std 1003.1-1988, page 114, paragraph 1. I also think that Alan is probably onto something in the rest of the message which I didn't quote here. Linus, any chance of revisiting this issue? -- --- Larry McVoy l...@sgi.com http://reality.sgi.com/lm (415) 933-1804
From: Linus Torvalds <torva...@cs.Helsinki.FI> Subject: Re: Some matters on the new verify_area Date: 1996/10/11 Message-ID: <Pine.LNX.3.91.961011142512.28692A-100000@linux.cs.Helsinki.FI>#1/1 X-Deja-AN: 188713750 references: <53kt8g$juf@fido.asd.sgi.com> x-submitted-via: n...@ratatosk.yggdrasil.com (linux.* gateway) content-type: TEXT/PLAIN; charset=US-ASCII x-hdr-sender: torva...@cs.Helsinki.FI mime-version: 1.0 x-env-sender: torva...@cs.Helsinki.FI newsgroups: linux.dev.kernel On 11 Oct 1996, Larry McVoy wrote: > > Alan Cox (a...@lxorguk.ukuu.org.uk) wrote: > : 2. It causes some extremely hard to solve problems in meeting both written > : and the unwritten API specification of Unix. You get strange things like > : a partial write of data returning EFAULT even though data was written. That > : breaks stuff. In addition subtle bugs in the exception handling will leak > : resources. > > This breaks POSIX.1. See IEEE Std 1003.1-1988, page 114, paragraph 1. > I also think that Alan is probably onto something in the rest of the message > which I didn't quote here. > > Linus, any chance of revisiting this issue? My next version of this will have a better interface, and that includes partial reads/writes. The current exception() handling was the "simple" way to do it: it's ugly and hard to use, but the low-level implementation is very simple. I have a better interface already, but it's a not-so-small matter of actually getting it working correctly ;) With the new interface you can do just: bytes_unwritten = copy_to_user(user_buf, kernel_buf, nr); and it does all the checking and exception handling for you. The hard part is not actually getting it to work, but to get it to work as quickly as I want it to ;) Linus
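A minimal user-space sketch of the return-value convention Linus describes above, where the copy routine reports how many bytes it could not copy. Nothing here is kernel code: `struct sim_user_buf` and `sim_copy_to_user` are hypothetical names, and the "fault" is simulated by treating only the first `valid` bytes of the destination as writable.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a user buffer: only the first `valid`
 * bytes of `base` are "mapped"; anything beyond that would fault. */
struct sim_user_buf {
    unsigned char *base;
    size_t valid;
};

/* Copies up to nr bytes from kernel_buf into the simulated user buffer.
 * Returns the number of bytes NOT copied, so 0 means complete success. */
size_t sim_copy_to_user(struct sim_user_buf *ubuf,
                        const void *kernel_buf, size_t nr)
{
    size_t ok = nr < ubuf->valid ? nr : ubuf->valid;

    if (ok)
        memcpy(ubuf->base, kernel_buf, ok);
    return nr - ok;
}
```

A caller that only needs success or failure can test the result against zero; a caller that wants partial reads or writes subtracts it from `nr`.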
From: Linus Torvalds <torva...@cs.helsinki.fi> Subject: Re: Some matters on the new verify_area Date: 1996/10/11 Message-ID: <Pine.LNX.3.91.961011185153.3395A-100000@linux.cs.Helsinki.FI>#1/1 X-Deja-AN: 188763427 sender: owner-linux-ker...@vger.rutgers.edu references: <199610111258.NAA28937@oberon.di.fc.ul.pt> content-type: TEXT/PLAIN; charset=US-ASCII x-hdr-sender: torva...@cs.helsinki.fi mime-version: 1.0 x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu newsgroups: linux.dev.kernel On Fri, 11 Oct 1996, Pedro Roque wrote: > > Linus, > it still doesn't work for the network stack at least... Actually, it _does_ work for the network stack. Even just the old "exception()" interface worked fine for the network stack, although the patches to do so weren't exactly pretty (one reason I decided the exception handling needs some work is that it was hard to use and I worried about the compiler doing things to the code that made exceptions unreliable). But yes, the code does need some changes. I did the changes for the TCP send side in 2.1.3, you can look at what I did there.. > We need to be able to either write the user buffer or reject all of it, in > a transaction-like way. The problem is that if one gets an exception half-way > through, some skbs were already sent or part of the data is on send queues... Exceptions aren't totally asynchronous. In fact, they are totally synchronous wrt user level accesses, so the problem isn't all that large. > Now, the old verify area, unmodified, doesn't work either since the network > stack can sleep from the verify point till it actually uses the buffer... Indeed. The optimizations Alan suggested (to keep the old verify_area) simply do not work in a threaded environment. > Is there any way to pin down the mapping on a verify ? Efficiently? No. Believe me, I've been thinking about this, and the only efficient and thread-safe way to handle this is exceptions. 
But the bare exceptions exposed by 2.1.3 are certainly a bit rough around the edges. Linus
From: Pedro Roque <ro...@di.fc.ul.pt> Subject: Re: Some matters on the new verify_area Date: 1996/10/11 Message-ID: <199610111613.RAA29632@oberon.di.fc.ul.pt>#1/1 X-Deja-AN: 188908845 sender: owner-linux-ker...@vger.rutgers.edu x-hdr-sender: ro...@di.fc.ul.pt references: <199610111258.NAA28937@oberon.di.fc.ul.pt> x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu newsgroups: linux.dev.kernel >>>>> "Linus" == Linus Torvalds <torva...@cs.Helsinki.FI> writes: Linus> On Fri, 11 Oct 1996, Pedro Roque wrote: >> Linus, it still doesn't work for the network stack at least... Linus> Actually, it _does_ work for the network stack. Linus> Even just the old "exception()" interface worked fine for Linus> the network stack, although the patches to do so weren't Linus> exactly pretty (one reason I decided the exception handling Linus> needs some work is that it was hard to use and I worried Linus> about the compiler doing things to the code that made Linus> exceptions unreliable). Linus> But yes, the code does need some changes. I did the changes Linus> for the TCP send side in 2.1.3, you can look at what I did Linus> there.. Linus, it is all a question of semantics... both TCP and UDP sends do: while(data from user) { copy_from_user... send packet to the network if (maybe) sleep(); /* usually waiting for memory */ } now what do you want Linux to do if the second or third copy fails ? Just returning -EFAULT seems inappropriate to me. >> We need to be able to either write the user buffer or reject >> all of it, in a transaction-like way. The problem is that if >> one gets an exception half-way through, some skbs were already >> sent or part of the data is on send queues... Linus> Exceptions aren't totally asynchronous. In fact, they are Linus> totally synchronous wrt user level accesses, so the problem Linus> isn't all that large. Rolling back all the processing is usually not possible. 
And yes, I tried to think of ways of building skb_queues on send(), before things get passed to the IP level... it would be extremely slow and cumbersome. >> Now, the old verify area, unmodified, doesn't work either since >> the network stack can sleep from the verify point till it >> actually uses the buffer... Linus> Indeed. The optimizations Alan suggested (to keep the old Linus> verify_area) simply do not work in a threaded environment. >> Is there any way to pin down the mapping on a verify ? Linus> Efficiently? No. Believe me, I've been thinking about this, Linus> and the only efficient and thread-safe way to handle this Linus> is exceptions. But the bare exceptions exposed by 2.1.3 are Linus> certainly a bit rough in the edges. That is not what I am worried about... Tell us what semantics you want for the case where a read from user other than the first of several fails, and we'll code it up... regards, Pedro.
From: Linus Torvalds <torva...@cs.helsinki.fi> Subject: Re: Some matters on the new verify_area Date: 1996/10/11 Message-ID: <Pine.LNX.3.91.961011195618.5160A-100000@linux.cs.Helsinki.FI>#1/1 X-Deja-AN: 188932035 sender: owner-linux-ker...@vger.rutgers.edu references: <199610111613.RAA29632@oberon.di.fc.ul.pt> content-type: TEXT/PLAIN; charset=US-ASCII x-hdr-sender: torva...@cs.helsinki.fi mime-version: 1.0 x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu newsgroups: linux.dev.kernel On Fri, 11 Oct 1996, Pedro Roque wrote: > > Linus, it is all a question of semantics... > > both TCP and UDP sends do: > > while(data from user) > { > copy_from_user... > > send packet to the network > > if (maybe) > sleep(); /* usually waiting for memory */ > } > > now what do you want Linux to do if the second or third copy fails ? > > Just returning -EFAULT seems inappropriate to me. Sure. That's why some mods are required. But most of the code _already_ does things like if (!written) written = error; return written; (so an error like EFAULT is only returned if the _first_ copy fails). Anyway, it's not hard to do things like that. In the future, if you give a buffer that is partially valid, Linux will use up as much as possible of the buffer, and return the "used up" portion. Admittedly 2.1.3 doesn't do that, but it's not actually very hard to fix (my personal kernel already does the appropriate fix-ups). Linus
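The "if (!written) written = error" rule above can be sketched in user space as a chunked send loop shaped like the one Pedro outlines. This is an illustration under assumptions, not the 2.1.x networking code: `struct sim_user_src`, `sim_copy_from_user` and `sim_send` are hypothetical names, and a fault is simulated by marking only the first `valid` bytes of the source as readable.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a user buffer: bytes at index >= valid "fault". */
struct sim_user_src {
    const unsigned char *base;
    size_t valid;
};

/* Returns the number of bytes NOT copied, like the interface in the thread. */
size_t sim_copy_from_user(void *dst, const struct sim_user_src *src,
                          size_t off, size_t nr)
{
    size_t room = off < src->valid ? src->valid - off : 0;
    size_t ok = nr < room ? nr : room;

    if (ok)
        memcpy(dst, src->base + off, ok);
    return nr - ok;
}

/* Consumes len bytes in chunks of at most `chunk` bytes, appending to `out`.
 * Returns the bytes consumed, or a negative errno only if nothing was. */
long sim_send(unsigned char *out, const struct sim_user_src *src,
              size_t len, size_t chunk)
{
    long written = 0;
    long error = 0;

    if (chunk == 0)
        return -EINVAL;
    while ((size_t)written < len) {
        size_t want = len - (size_t)written;
        size_t left;

        if (want > chunk)
            want = chunk;
        left = sim_copy_from_user(out + written, src, (size_t)written, want);
        written += (long)(want - left);
        if (left) {                 /* simulated fault part-way through */
            error = -EFAULT;
            break;
        }
    }
    if (!written)                   /* error surfaces only if nothing was sent */
        written = error;
    return written;
}
```

With a source that is valid for 6 of 10 bytes, the caller sees a short count of 6 and would get -EFAULT only on a later call that makes no progress at all.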
From: Linus Torvalds <torva...@cs.helsinki.fi> Subject: Re: Some matters on the new verify_area Date: 1996/10/12 Message-ID: <Pine.LNX.3.91.961012093405.21523A-100000@linux.cs.Helsinki.FI>#1/1 X-Deja-AN: 188896135 sender: owner-linux-ker...@vger.rutgers.edu references: <m0vBqxm-0005FeC@lightning.swansea.linux.org.uk> content-type: TEXT/PLAIN; charset=US-ASCII x-hdr-sender: torva...@cs.helsinki.fi mime-version: 1.0 x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu newsgroups: linux.dev.kernel On Sat, 12 Oct 1996, Alan Cox wrote: > > This is still not right. A Pipe is guaranteed to atomically write _ALL_ the > data or none, not half. A short write isn't the correct or POSIX legal response. Alan, don't be silly. Guys, you're being stupid on purpose here, or something. Can't you understand that THERE IS NO PROBLEM! You can make any damn semantics you like with the exception model, you just have to check the return value of "copy_from_user()" or "copy_to_user()". Depending on the return value you can choose to ignore the partial data you've written or not. For pipes, for example, we already lock the pipe data over any user-level transaction, so if a pipe write notices "oops, this copy failed", it is _trivial_ to just undo the write. > Consider a UDP send, where we fault half way through sending. At that point > we have sent half of the IP datagram, but the latter half. Bite me. I don't _CARE_. The new way of handling things is a damn lot faster than the old one, and it doesn't actually break any semantics at all, despite all your whining. Your argument that it makes it easier to send partial IP packets is equally bogus. We get a RETURN VALUE from the copy, for Christ's sake. If you're so scared of somebody doing something evil, you can zero the rest of the packet and send out incomplete copies. 
Total code needed:

    bytes_uncopied = copy_and_csum_from_user(buf, ubuf, len, &skb->csum, partial_csum);
    if (bytes_uncopied) {
        memset(buf+len-bytes_uncopied, 0, bytes_uncopied);
        skb->csum = csum_partial(buf, len, partial_csum);
        sk->error = EFAULT; /* tell the user not to do that */
    }

Complex? Nope. Rocket science? Definitely not. The above is assuming we want to care about the whole problem with only partial IP packets sent in the first place. Quite frankly, it's not exactly _our_ problem if there are sites out there that crash when they get too many partial packets. Any random hacker can put a DOS machine on the net (or get root access to a Linux machine) and send out partial fragments without any help from the kernel at all.. You're arguing for "security by making it a bit harder to do".. Also, Alan, you are sadly mistaken if you think you can easily do memory area lockings. It's a _lot_ more complex than you think, because it's not enough to lock the actual virtual memory areas (the "struct vm_area_struct"), you _also_ have to lock any inodes that are associated with a shared virtual memory area. The problem is that UNIX semantics for shared file mappings are _complex_: it's not enough that the virtual memory area is mapped, because any access past the end of the file is _also_ a fault. We don't handle it correctly right now, simply because it's so hard to handle. But with the new exception handling we _can_ handle things correctly. If you wanted to do a locking approach, you'd not only have to lock any vm_areas that are involved with a transfer from other threads, you'd _also_ have to lock any files that are shared-mapped if you want to get it right. The possibilities for problems are endless, and you also have some local security issues due to it (essentially the same security issues that arise from mandatory locking and NFSD). Trust me, you _cannot_ handle it even reasonably efficiently. 
You can do a half-assed job with reasonable overhead (that's what "verify_area()" essentially has done before I re-wrote the whole thing), but you simply _cannot_ do a good job efficiently. The new exception handling isn't going away. I get the overhead of doing a verified copy from user space down to about _five_ machine instructions with the new exception handling code, and quite frankly, no locking scheme can even come _close_. Not even a half-assed one, and certainly not a scheme that takes shared file faults into account. Linus
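The zero-the-tail fix-up from the message above can be tried out in user space with a sketch like the following. It is an assumption-laden illustration only: `sim_csum` is a toy byte sum (not the Internet checksum that `csum_partial` computes), `sim_fill_packet` is a hypothetical name, and the fault is simulated by treating only the first `valid` bytes of the user buffer as readable.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Toy checksum: a plain byte sum, NOT the real Internet checksum. */
unsigned sim_csum(const unsigned char *buf, size_t len, unsigned seed)
{
    size_t i;

    for (i = 0; i < len; i++)
        seed += buf[i];
    return seed;
}

/* Fills buf[0..len) from a simulated user buffer of which only the first
 * `valid` bytes are readable. On a short copy the uncopied tail is zeroed,
 * as in the snippet from the thread, and *err is set to EFAULT.
 * Returns the checksum over the full len-byte payload. */
unsigned sim_fill_packet(unsigned char *buf, const unsigned char *ubuf,
                         size_t valid, size_t len, int *err)
{
    size_t ok = len < valid ? len : valid;
    size_t bytes_uncopied = len - ok;

    if (ok)
        memcpy(buf, ubuf, ok);
    *err = 0;
    if (bytes_uncopied) {
        memset(buf + len - bytes_uncopied, 0, bytes_uncopied);
        *err = EFAULT;              /* tell the caller not to do that */
    }
    return sim_csum(buf, len, 0);
}
```

The checksum is recomputed over the whole buffer after zeroing, which mirrors the order of operations in the original snippet: the packet that goes out is complete and self-consistent even though part of its payload is padding.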
From: l...@neteng.engr.sgi.com (Larry McVoy) Subject: Re: Some matters on the new verify_area Date: 1996/10/12 Message-ID: <199610120741.AAA15159@neteng.engr.sgi.com>#1/1 X-Deja-AN: 188896394 sender: owner-linux-ker...@vger.rutgers.edu x-hdr-sender: l...@neteng.engr.sgi.com x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu newsgroups: linux.dev.kernel : > This is still not right. A Pipe is guaranteed to atomically write _ALL_ the : > data or none, not half. A short write isn't the correct or POSIX legal response. : : You can make any damn semantics you like with the exception model, you just : have to check the return value of "copy_from_user()" or "copy_to_user()". If I understand this correctly (maybe I don't) then it implies that all I/O gizmos are now of the form bytes_uncopied = copy_from_user(args...); if (bytes_uncopied && i_care_that_it_didnt_do_it_all) { do something to undo it } Which in turn implies that all callers of copy_from_user() are copying into a buffer that can be undone. This would seem to be hard, but maybe it isn't. Certainly in file systems you will have the object (file/pipe/whatever) locked such that only you can muck with it, so conceivably you can unmuck it. Can anyone think of a case that can't be handled? Rephrased: Linus thinks that all Unix semantics can be handled through his new interface. As long as you can tell how much has been moved, and you can either undo or return the bytes moved, then I think that we are OK, are we not? And I assume Linus is going to show us some studly i/o rates (I take it that the new thing is much faster - it seems like it is aimed at common usage, i.e., reducing copyin/copyout latency?). --lm
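The "undo it if you care" shape above might look roughly like this for a pipe-like object. This is a user-space sketch with hypothetical names (`struct sim_pipe`, `sim_pipe_write_atomic`, `SIM_PIPE_SIZE`); it assumes the caller already holds whatever lock keeps readers away during the write, and it simulates the fault by treating only the first `valid` bytes of the user buffer as readable.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SIM_PIPE_SIZE 4096   /* hypothetical capacity, not a kernel constant */

struct sim_pipe {
    unsigned char data[SIM_PIPE_SIZE];
    size_t len;              /* bytes currently queued */
};

/* All-or-nothing append of nr bytes. Returns nr on success; on a simulated
 * fault returns -EFAULT and leaves the queued length exactly as it was. */
long sim_pipe_write_atomic(struct sim_pipe *p, const unsigned char *ubuf,
                           size_t valid, size_t nr)
{
    size_t saved_len = p->len;
    size_t ok = nr < valid ? nr : valid;
    size_t bytes_uncopied = nr - ok;

    if (nr > SIM_PIPE_SIZE - p->len)
        return -EAGAIN;      /* no room; a real pipe would block instead */
    if (ok)
        memcpy(p->data + p->len, ubuf, ok);
    p->len += ok;
    if (bytes_uncopied) {
        p->len = saved_len;  /* undo: readers never see the partial data */
        return -EFAULT;
    }
    return (long)nr;
}
```

Here the undo is just restoring one length field, which is why the rollback is cheap when the object is locked for the duration of the copy.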
From: Linus Torvalds <torva...@cs.helsinki.fi> Subject: Re: Some matters on the new verify_area Date: 1996/10/12 Message-ID: <Pine.LNX.3.91.961012121223.23937A-100000@linux.cs.Helsinki.FI> X-Deja-AN: 188913763 sender: owner-linux-ker...@vger.rutgers.edu references: <199610120741.AAA15159@neteng.engr.sgi.com> content-type: TEXT/PLAIN; charset=US-ASCII x-hdr-sender: torva...@cs.helsinki.fi mime-version: 1.0 x-env-sender: owner-linux-kernel-outgo...@vger.rutgers.edu newsgroups: linux.dev.kernel On Sat, 12 Oct 1996, Larry McVoy wrote: > > If I understand this correctly (maybe I don't) then it implies that all > I/O gizmos are now of the form > > bytes_uncopied = copy_from_user(args...); > if (bytes_uncopied && i_care_that_it_didnt_do_it_all) { > do something to undo it > } Some are. The hairy ones. But actually I'd expect only very few copies to be of that type, and most being of the type where you can just do a partial read or write. There are actually _very_ few places that _require_ an "atomic" operation. For example, the file read code in mm/filemap.c just does

    nr -= copy_to_user(buf, page_cache, nr);
    error = -EFAULT;
    if (!nr)
        break;
    buf += nr;
    pos += nr;
    read += nr;
    ...

(and then the return condition from the actual system call is

    if (!read)
        read = error;
    return read;

but that's actually not something new: that has been there before to handle various other errors, so EFAULT is not a special case at all in this case). Note that the UDP datagram example that Alan was worried about was a totally different matter: there the worry wasn't so much the return value of the function itself, but Alan worried that we'd send out the first few fragments of a larger IP packet, and then not send out the rest at all. So he essentially worried about what showed up on the wire, because some BSD stacks don't handle partial fragments very well (out-of-memory errors because they don't time out the fragments?) 
> Which in turn implies that all callers of copy_from_user() are copying into > a buffer that can be undone. This would seem to be hard, but maybe it isn't. > Certainly in file systems you will have the object (file/pipe/whatever) > locked such that only you can muck with it, so conceivably you can unmuck > it. Pipes are really very special cases because they require that a write be atomic if it is smaller than or equal to PIPE_BUF (or whatever the size was called). In short, POSIX pipes are actually very "non-unixy" (and I can't say I like the behaviour, but hey, it's not that hard to do). For just about anything else (including pipes with larger buffer sizes), it's completely acceptable to just do a partial read or write. If I remember correctly, POSIX says that a partial read or write from a file indicates either EOF or an IO error, and EFAULT is certainly an IO error as far as the reader/writer is concerned ;) Again, I'd like to point out that EFAULT is a _programmer_ error, and that we don't really have to worry about any standards-conforming programs at all. In some sense any program that _ever_ results in EFAULT or a partial read/write due to that fault is never a POSIX-conforming program, and as such we could even just say "to hell with it, let's kill the program outright". Quite frankly, I don't think we'd need to undo anything at all for the pipe case (just do a partial IO operation), and we'd still be "POSIX-conforming". EFAULT really _is_ just a "segmentation fault" in a system call. Using exceptions internally is only the natural way to handle it. For example, at least my copy of some old XPG3 thing doesn't even _mention_ EFAULT in the error cases. Likewise, the "POSIX Programmer's Manual" does actually mention EFAULT and expands its meaning, but it also states that "no function is actually required to check against this" or something like that (and it has an empty list of calls that can return it). 
I don't actually have the official POSIX standards, but I suspect the situation is similar (ie EFAULT is not really considered a "real" error that can occur in a POSIX-conforming program). Can somebody check? > And I assume Linus is going to show us some studly i/o rates (I take that > the new thing is much faster - it seems like it is aimed at common usage, > i.e., reducing copyin/copyout latency?). Actually, the studly thing isn't the IO bandwidth, because that is pretty much always limited by memory copy speeds and/or device limitations. The _studly_ thing is the latency, because for latency numbers the actual copy doesn't dwarf the time it takes to check. And as you may remember, I actually think latency is at _least_ as important as throughput. Linus
From: l...@neteng.engr.sgi.com (Larry McVoy) Subject: Re: Some matters on the new verify_area Date: 1996/10/12 Message-ID: <53oqt2$l44@fido.asd.sgi.com>#1/1 X-Deja-AN: 189008214 references: <Pine.LNX.3.91.961011195618.5160A-100000@linux.cs.Helsinki.FI> <m0vBqxm-0005FeC@lightning.swansea.linux.org.uk> x-submitted-via: n...@ratatosk.yggdrasil.com (linux.* gateway) x-hdr-sender: l...@neteng.engr.sgi.com organization: Silicon Graphics Inc., Mountain View, CA x-env-sender: n...@fido.asd.sgi.com newsgroups: linux.dev.kernel : > Anyway, it's not hard to do things like that. In the future, if you give : > a buffer that is partially valid, Linux will use up as much as possible : > of the buffer, and return the "used up" portion. Admittedly 2.1.3 doesn't : > do that, but it's not actually very hard to fix (my personal kernel : > already does the appropriate fix-ups). : This is still not right. A Pipe is guaranteed to atomically write _ALL_ the : data or none, not half. A short write isn't the correct or POSIX legal response. I used to think this too (and I implemented all of posix.1 in SunOS, shame on me :-) but it isn't so. If you read the spec carefully, what it is really saying is that writes of up to PIPE_BUF bytes must be atomic. Anything after that is up for grabs. The intent of the spec is to allow multiple writes to a pipe to get their data in there without it being garbled with the other guys. So as long as everyone is <= PIPE_BUF, then the order is undefined but the results are defined: each guy gets their stuff in there, all of it, or doesn't. But there is never a half in half out case. This implies that you can not context switch or preempt whilst in the middle of a copyin/copyout to/from a pipe (actually I don't remember if the read side has the same semantics but it would be pretty silly if it didn't - yeah, I can get it in unscrambled but it comes out scrambled. I don't _think_ so). As to what happens for sizes > PIPE_BUF? 
Posix is explicit about not defining this. My personal preference is that the kernel treat each PIPE_BUF chunk individually, resetting its thinking on each boundary. That would lend reasonable results to applications that want pipe semantics for nbyte > PIPE_BUF. The rationale for not doing that was that 4k was big enough, you can always do multiple writes. Short sighted. Getting back to the semantics of <= PIPE_BUF, which is where all our issues are: POSIX didn't consider the EFAULT semantics that Linus has added so we get no help there. What they did consider is something quite similar, signals. They carefully spelled out what happens if you are doing a write and you get interrupted in the middle. I think we should follow those semantics. In that case, POSIX states: If a write() is interrupted by a signal before it writes any data, it shall return -1 with errno set to EINTR. If a write() is interrupted by a signal after it successfully writes some data, it shall return -1 with errno set to EINTR, or it shall return the number of bytes written. A write() to a pipe or a FIFO shall *never* return with errno set to EINTR if it has transferred any data and nbyte is less than or equal to PIPE_BUF. I think what they are saying is that you can be as sloppy as you like elsewhere, but the results on pipes have to be exact. In general, Linus/Linux has always taken the high moral ground and applied the clean semantics to everything. I don't see why this should be an exception. My personal suggestions:
. for everything except pipes & sockets, return the number of bytes written, no matter what that was.
. for pipes, at least do the right thing for <= PIPE_BUF. If you get a fault (or if for any other reason you transfer < nbyte), undo the entire I/O rather than return < nbyte. This is easy for applications to handle and is in the spirit of supporting atomic transfers for communication.
. for sockets, treat them like pipes with a PIPE_BUF size == MTU. 
POSIX doesn't cover sockets (maybe they will/do in a later spec) but I'll bet you a plugged nickel they will have those semantics. Finally, let's stick together on this one. I know it is anal in the extreme but this sort of stuff is the kind of stuff that differentiates a toy from a real product. There are billions of dollars of Unix systems sold every year that depend on POSIX semantics - and spec the conformance as a purchasing go/no go. -- --- Larry McVoy l...@sgi.com http://reality.sgi.com/lm (415) 933-1804
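Larry's three suggestions can be folded into one small decision function. The sketch below is one possible reading, not anything from the kernel: `sim_write_result` is a hypothetical name, `atomic_limit` is an assumed parameter (0 for plain files, PIPE_BUF for pipes, the MTU for sockets), and returning -EFAULT when nothing at all was transferred borrows the "error only if the first copy fails" rule from earlier in the thread rather than Larry's literal wording.

```c
#include <errno.h>
#include <stddef.h>

/* Decides what write() should return when `copied` of `nbyte` bytes were
 * transferred before a fault.
 *
 * atomic_limit: requests of at most this many bytes are all-or-nothing
 *               (assumed: 0 for files, PIPE_BUF for pipes, MTU for sockets).
 *
 * For the all-or-nothing case the caller is still responsible for undoing
 * the partial transfer; this function only picks the return value. */
long sim_write_result(size_t atomic_limit, size_t nbyte, size_t copied)
{
    if (copied >= nbyte)
        return (long)nbyte;         /* complete transfer */
    if (nbyte <= atomic_limit)
        return -EFAULT;             /* atomic request: never a short count */
    return copied ? (long)copied : -EFAULT;
}
```

Keeping the policy in a pure function like this makes the disputed cases easy to enumerate: a 100-byte pipe write that faults after 40 bytes yields -EFAULT, while the same fault on a regular file yields 40.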