Newsgroups: comp.os.linux
From: pnakada@oracle.com (Paul Nakada)
Subject: linux scheduling
Nntp-Posting-Host: pnakada.us.oracle.com
Organization: Oracle Corporation, Redwood Shores, CA
Distribution: comp
Date: Mon, 15 Jun 1992 21:20:16 GMT
X-Disclaimer: This message was written by an unauthenticated user at Oracle
	Corporation.  The opinions expressed are those of the user and not
	necessarily those of Oracle.

Linux Hackers -

Could someone (linus?) please give a high level description of the
scheduling mechanism in Linux?  Some things of concern:

   - process timeout
   - process blocking (for i/o)
   - process wake-up (i/o interrupt)
   - process scheduling (description of queues, when/where processes
     are added to queues)
   - what is a jiffy?

How about a little "walk through", i.e.:

a) a process calls a kernel trap;
b) process enters kernel mode, current processor state saved.
c) pop return address from stack and save
d) kernel initiates disk i/o and places process on i/o queue
e) get next active process
f) push return address on stack and return from handler to new process
g) process times out
   etc etc etc.

Why do I ask this?  I've been studying the scheduling code for a couple
of days, trying to absorb enough to take a stab at improving response
time under heavy disk i/o.  I can understand the parts as separate
modules, but it's not obvious to me how they work together as a system.
I imagine it's second nature to Linus and other veteran kernel hackers,
but all I've got is an OS class from a few years back.

Thanks for any help.

-Paul

--
Paul Nakada  |  Oracle Corporation  |  pnakada@oracle.com
From: torvalds@klaava.Helsinki.FI (Linus Benedict Torvalds)
Newsgroups: comp.os.linux
Subject: Re: linux scheduling
Date: 15 Jun 92 23:13:13 GMT
Organization: University of Helsinki

In article <PNAKADA.92Jun15132016@pnakada.oracle.com> pnakada@oracle.com (Paul Nakada) writes:
>
>Could someone (linus?) please give a high level description of the
>scheduling mechanism in Linux?

The scheduler in linux is pretty simple, but it does a reasonably good
job of giving good IO response while not being too unfair to cpu-bound
processes.

Simply put: Linux gets a timer interrupt at every jiffy (100 Hz) and
decrements current->counter.  If current->counter reaches zero and the
process isn't running in kernel mode, schedule() is called.  Also, when
a process is woken up and its counter is greater than the current
process's, a flag is set, which forces a re-schedule.

The scheduling itself is pretty simple: when schedule() is called, the
runnable process with the highest non-zero counter gets to run.  If no
runnable process exists, task[0] ("swapper") gets to run, but it
currently doesn't do anything.  In the future, the swapper task might
do page-aging or similar.

If all runnable processes have a zero counter, the schedule() algorithm
goes through all processes once more, giving each a new counter value
of "priority + counter/2".  It might seem useless to add counter/2, as
counter is 0, but in fact this is done to /all/ processes, including
sleeping ones, which thereby get higher counter values for when they
wake up.

The above algorithm seems to be pretty fair, and works well while being
very easy to understand.  It also gives much snappier response than
many systems I've seen.  And it's all done without any queues at all,
which I find clean and simple.  In fact, the schedule() algorithm isn't
more than a couple of lines of C, and going to sleep or waking up is
just a matter of setting the process status and calling schedule().

>a) a process calls a kernel trap;
>b) process enters kernel mode, current processor state saved.

In fact, the current processor state is never saved anywhere as in
"normal" unices.  A kernel trap under linux is very much like a system
call: registers get saved on the stack as needed, but we don't mess
around with updating task-structures etc.

>c) pop return address from stack and save

No, let it lie on the stack: when the process returns, it gets
restored.  Why save it anywhere else?

>d) kernel initiates disk i/o and places process on i/o queue
>e) get next active process
>f) push return address on stack and return from handler to new process
>g) process times out
>   etc etc etc.

It's all much simpler than this: the process, running in kernel mode,
does what it wants, and if it wants to sleep, it just does

	current->state = TASK_INTERRUPTIBLE;
	schedule();

assuming it has made sure it will wake up some way (due to signals, or
the kernel time-out feature, or by putting the task-struct address in a
wait-variable).  The other processes run happily in between, and when
something causes the process to wake up, it just continues after the
schedule() call.  When it has done all the things the system call (or
page exception or whatever) required it to do, it just does a return,
and it's back in user mode.
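To make that wait-variable remark concrete, here is a minimal sketch of
the sleep/wake helpers, modeled on sleep_on()/wake_up() in
kernel/sched.c (simplified, so check the real file; the actual
sleep_on() uses TASK_UNINTERRUPTIBLE, with an interruptible variant
next to it):

	/* Sleep on a wait-variable: *p points at the most recent
	 * sleeper's task-struct.  Each sleeper remembers the previous
	 * one in a local on its own kernel stack, so no queue
	 * structure is ever allocated. */
	void sleep_on(struct task_struct **p)
	{
		struct task_struct *tmp;

		if (!p)
			return;
		tmp = *p;	/* remember the earlier sleeper, if any */
		*p = current;	/* we are now the visible sleeper */
		current->state = TASK_INTERRUPTIBLE;
		schedule();	/* other processes run until wake_up() */
		if (tmp)	/* pass the wake-up on down the chain */
			tmp->state = TASK_RUNNING;
	}

	/* Wake whoever is sleeping on the wait-variable. */
	void wake_up(struct task_struct **p)
	{
		if (p && *p) {
			(*p)->state = TASK_RUNNING;	/* runnable again */
			*p = NULL;
		}
	}

Note how the chain of sleepers lives entirely on their kernel stacks
(the tmp locals), which is what lets this work "without any queues at
all".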
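Similarly, the counter rules above translate into a selection loop
along these lines (a simplified sketch of schedule(), not the verbatim
kernel source):

	/* Pick the runnable task with the largest counter.  If every
	 * runnable task has counter 0, recompute all counters
	 * (including those of sleeping tasks) as priority + counter/2
	 * and try again.  If nothing is runnable at all, fall back to
	 * task[0], the swapper. */
	void schedule(void)
	{
		int i, next, c;

		for (;;) {
			c = -1;
			next = 0;
			for (i = 1; i < NR_TASKS; i++)
				if (task[i] &&
				    task[i]->state == TASK_RUNNING &&
				    task[i]->counter > c) {
					c = task[i]->counter;
					next = i;
				}
			if (c)	/* found one, or nothing runnable (c == -1) */
				break;
			for (i = 1; i < NR_TASKS; i++)
				if (task[i])
					task[i]->counter =
						(task[i]->counter >> 1)
						+ task[i]->priority;
		}
		switch_to(next);
	}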
As to how schedule() switches tasks: it's built into the 386, so it's
actually just a couple of machine-code instructions.  It's slightly
more complicated than just a jump through a task-segment, as it checks
that it's not the same task, and does minimal 387-saving by testing the
"last_task_used_math" variable and doing a clts if possible, but it's
still less than 10 insns.  To understand what /really/ goes on, you'd
better read a 386 book, but it's not too complicated if you just care
about the general idea.

>Why do I ask this?  I've been studying the scheduling code for a
>couple of days, trying to absorb enough to take a stab at improving
>response time under heavy disk i/o.  I can understand the parts as
>separate modules, but it's not obvious to me how they work together as
>a system.  I imagine it's second nature to Linus and other veteran
>kernel hackers, but all I've got is an OS class from a few years back.

I doubt the scheduler is the problem when it comes to response time
under heavy disk i/o.  It's probably at least partly the way
buffer-cache blocks are locked, which sometimes forces the buffer
algorithms to sleep on them unnecessarily.  Thus if a buffer request
comes through while blocks are locked due to actual IO, the request
sometimes blocks just to make sure there are no races.  I'd take a look
at fs/buffer.c before doing anything else.  But be /very/ careful when
changing buffer.c: it's very easy to add a race condition if you don't
think of all the possible things that can happen due to interrupts etc.
while a buffer is searched for.

		Linus
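For reference, the task switch described above is the switch_to() macro
in include/linux/sched.h; in the 0.1x-era sources it looks roughly like
this (reconstructed here as an illustration, so trust the real header
over this sketch):

	/* Switch to task n: if it isn't already the current task, load
	 * the selector of its TSS and ljmp through it -- the 386 does
	 * the rest.  When we are later scheduled back in, clear the TS
	 * flag with clts if we were the last task to use the 387, so
	 * our first FPU instruction doesn't trap. */
	#define switch_to(n) {\
	struct {long a,b;} __tmp; \
	__asm__("cmpl %%ecx,_current\n\t"	/* same task? do nothing */ \
		"je 1f\n\t" \
		"movw %%dx,%1\n\t"		/* TSS selector of task n */ \
		"xchgl %%ecx,_current\n\t" \
		"ljmp %0\n\t"			/* the switch itself */ \
		"cmpl %%ecx,_last_task_used_math\n\t" \
		"jne 1f\n\t" \
		"clts\n"			/* we own the 387 */ \
		"1:" \
		::"m" (*&__tmp.a),"m" (*&__tmp.b), \
		"d" (_TSS(n)),"c" ((long) task[n])); \
	}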
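And to ground the fs/buffer.c pointer, the locking pattern in question
looks roughly like this (a sketch in the style of those sources):

	/* Wait until the buffer is unlocked.  Interrupts are disabled
	 * around the test so an IO-completion interrupt can't unlock
	 * the buffer between the check of b_lock and the sleep_on() --
	 * exactly the kind of race that is easy to reintroduce when
	 * changing buffer.c.  The task switch inside sleep_on()
	 * restores the next task's eflags, so interrupts come back on
	 * while we sleep. */
	static inline void wait_on_buffer(struct buffer_head * bh)
	{
		cli();				/* shut out IO interrupts */
		while (bh->b_lock)		/* locked for actual IO */
			sleep_on(&bh->b_wait);	/* wait for the unlock */
		sti();
	}

Any request that hits a locked buffer goes through a wait like this,
which is why heavy disk IO can stall unrelated buffer operations.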