Path: gmd.de!newsserver.jvnc.net!howland.reston.ans.net!europa.eng.gtefsd.com!uunet!news.univie.ac.at!fstgds15.tu-graz.ac.at!piis10.joanneum.ac.at!zloebl
From: zlo...@piis10.joanneum.ac.at (Klaus ZLOEBL)
Newsgroups: comp.os.linux.development
Subject: linux scheduler alternatives???
Date: 27 Sep 1993 11:19:36 GMT
Organization: Graz University of Technology, Austria
Lines: 22
Message-ID: <286i88INN9rh@fstgds15.tu-graz.ac.at>
NNTP-Posting-Host: piis10.joanneum.ac.at
X-Newsreader: TIN [version 1.1 PL8]

I found some time last weekend and read the KHG. After looking around
in the kernel sources, I noticed that I can't find the place where the
timeout timer is set. Could somebody be so nice as to tell me where to
find it?

Also, I'd like to write a new scheduler (multilevel feedback queue).

Is there a new KHG? (I haven't seen one.)

What algorithm is used for paging?

I have played with OSP (an operating system simulator), but a real OS
would be an even better playground!

Thanks for any hints and pointers
--
Klaus Zloebl      | E-Mail: zlo...@piis10.joanneum.ac.at
Joanneum Research | PSI   : PSI%2631102911::ZLOEBL
Steyrergasse 17   |
A-8020 Graz      | Phone: ++43/316/8020/243
AUSTRIA           |
Path: gmd.de!newsserver.jvnc.net!howland.reston.ans.net!pipex!sunic!news.funet.fi!klaava!klaava!not-for-mail
From: torva...@klaava.Helsinki.FI (Linus Torvalds)
Newsgroups: comp.os.linux.development
Subject: Re: linux scheduler alternatives???
Date: 28 Sep 1993 13:18:01 +0200
Organization: University of Helsinki
Lines: 33
Message-ID: <2896h9$ou1@klaava.Helsinki.FI>
References: <286i88INN9rh@fstgds15.tu-graz.ac.at>
NNTP-Posting-Host: klaava.helsinki.fi

In article <286i88INN...@fstgds15.tu-graz.ac.at>,
Klaus ZLOEBL <zlo...@piis10.joanneum.ac.at> wrote:
>I found some time last weekend and read the KHG. After looking around
>in the kernel sources, I noticed that I can't find the place where the
>timeout timer is set. Could somebody be so nice as to tell me where to
>find it?

It's not there. Linux just uses the normal timer to keep the
heartbeat: after all, we don't allow just anybody to mess with
interrupts, so the timeout timer (I assume you meant the one that
sends an NMI) isn't needed.

Actually, I've been thinking of using it for a better kernel profiler,
as the current profiler doesn't work too well in interrupt handlers
(understatement of the week), and one based on NMI would work. I've
never bothered to find out how it works, though.

>Also, I'd like to write a new scheduler (multilevel feedback queue).

There has been one that used something like this, and it reportedly
worked better under heavy load. I have to admit to liking the current
one, though: simple and reasonably well-behaved.

>What algorithm is used for paging?

It's essentially a modified "clock" algorithm, which seems to work
reasonably well. I once tried an NRU-like thing, but it was so good at
finding pages that belonged to programs that were waiting for user
input that I gave up on it. I could probably have tuned it to give
better performance, but the clock algorithm is reasonably fair, needs
no tuning, and doesn't give *too* bad results.

		Linus
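[For readers unfamiliar with the algorithm Linus mentions, a generic
"clock" (second-chance) page-replacement pass can be sketched as
below. This is a textbook illustration, not Linux's actual code; the
frame table and names are invented for the example.]

```c
#include <assert.h>

#define NFRAMES 8

struct frame {
    int referenced;   /* the MMU sets this on access; we clear it */
    int page;         /* which page occupies this frame */
};

static struct frame frames[NFRAMES];
static int hand;      /* the "clock hand" sweeps the frame table */

/* Sweep until a frame with referenced == 0 is found. Referenced
 * frames get a "second chance": the bit is cleared and the hand
 * moves on, so a page is only evicted if it has not been touched
 * since the hand last passed it. */
int clock_evict(void)
{
    for (;;) {
        if (!frames[hand].referenced) {
            int victim = hand;
            hand = (hand + 1) % NFRAMES;
            return victim;
        }
        frames[hand].referenced = 0;
        hand = (hand + 1) % NFRAMES;
    }
}
```

[The appeal Linus alludes to is visible here: there is no per-page
age bookkeeping to tune, just one bit per frame and a rotating hand.]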
Path: gmd.de!newsserver.jvnc.net!udel!gatech!swrinde!cs.utexas.edu!uunet!news2.uunet.ca!spool.mu.edu!sgiblab!idiom.berkeley.ca.us!moonshot.west.oic.com!moonshot.west.oic.com!not-for-mail
From: dil...@moonshot.west.oic.com (Matthew Dillon)
Newsgroups: comp.os.linux.development
Subject: Re: linux scheduler alternatives??? - MY IDEA
Date: 1 Oct 1993 02:03:19 -0700
Organization: Homeless Electron
Lines: 178
Distribution: world
Message-ID: <28gron$f27@moonshot.west.oic.com>
References: <286i88INN9rh@fstgds15.tu-graz.ac.at>
NNTP-Posting-Host: moonshot.west.oic.com

In article <286i88INN...@fstgds15.tu-graz.ac.at>
zlo...@piis10.joanneum.ac.at (Klaus ZLOEBL) writes:
>I found some time last weekend and read the KHG. After looking around
>in the kernel sources, I noticed that I can't find the place where the
>timeout timer is set. Could somebody be so nice as to tell me where to
>find it?
>
>Also, I'd like to write a new scheduler (multilevel feedback queue).
>
>Is there a new KHG? (I haven't seen one.)
>
>What algorithm is used for paging?
>
>I have played with OSP (an operating system simulator), but a real OS
>would be an even better playground!
>
>Thanks for any hints and pointers

I have some ideas re: scheduling. Specifically, the scheduler I did
for a recent embedded OS. It's real simple, yet real powerful. It
requires NO list traversal calculations at any time. There is exactly
ONE run queue and ONE wait queue. It's incredibly easy to manage.

Essentially, it works like this: each process is given a fixed 8-bit
unsigned priority. The time slice the process gets is:

                    pri
    maxTime * -----------
                  sum-pri

    maxTime = 50mS (smaller on faster machines.
The smaller this number, the greater the overhead, but the better the
realtime response.)

    pri     = priority of the currently running task
    sum-pri = aggregate priority of all RUNNABLE tasks

As you may have figured out, this scheme is really quite awesome in
that process priorities are all relative and queue management is
minimized. In my embedded OS, a complete task switch took 50uS on a
10MHz 68000. A 386 or 486 is at least 10 times faster (a 486DX2 is
approximately 20 times faster), so I would expect task switch times
on the order of 5-10uS... Furthermore, realtime response can be
maintained even for low priority tasks because they get some amount
of time each 50mS period (more on that in a moment).

Block/unblock overhead is also greatly minimized. The only parameter
the system needs to maintain is <sum-pri>, the aggregate sum of the
priorities for all runnable tasks. In other words, you add a task's
priority to <sum-pri> when you put it on the run list, subtract it
when you take it off, and modify <sum-pri> when a task that is on the
run list has its priority changed (i.e. renice).

The implementation is even simpler... there is NO multiplication or
division actually involved in the task switch interrupt itself,
saving a lot of time. Basically, you set up a fixed-rate timer
interrupt at approximately 100Hz (10mS). The faster the better... my
embedded operating system implemented a 2mS time slice. For me, the
overhead was around 10uS/2mS (you don't switch tasks every
interrupt!), or 0.5%. For a 386/486 with its greater interrupt
overhead I would expect about the same percentage using a 2mS
interrupt.

What you do when the interrupt occurs is a single subtraction:

    task->counter = task->counter - sum_pri

When the counter goes negative you switch to the next task. Note that
YOU LEAVE THE COUNTER NEGATIVE! You do not zero it.
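[The tick step just described can be sketched as below; this is a
user-space illustration of the idea, not code from Dillon's embedded
OS, and the names (struct task, sum_pri, timer_tick) are invented.]

```c
struct task {
    int counter;   /* slice budget; deliberately allowed to go negative */
};

static int sum_pri;   /* aggregate priority of all runnable tasks */

/* Called from the fixed-rate timer interrupt: one subtraction, no
 * multiplies or divides. Returns nonzero when it is time to switch.
 * The counter is left negative, never zeroed, so the debt carries
 * into the next round. */
int timer_tick(struct task *cur)
{
    cur->counter -= sum_pri;
    return cur->counter < 0;
}
```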
When you switch in the next task in the run queue you perform the
following calculation:

    task->counter = task->counter + (task->pri * CONSTANT)

where (task->pri * CONSTANT) is PRECALCULATED (only calculated when
the task's priority actually changes, which is usually never!).

Now comes the tricky part. If the counter is still negative, you skip
the task and go on to the next one. If the counter becomes positive
you continue with the switch-in and let it run.

The CONSTANT is directly related to the <maxTime> variable and based
on the frequency of the timer interrupt versus <maxTime>. It is
effectively a constant.

The end result is that the AVERAGE CPU time the task gets is EXACTLY
the ratio of its priority over the aggregate priority of all runnable
tasks, NO MATTER WHAT INTERRUPT RATE YOU USE FOR YOUR BASE TIMER. All
the interrupt rate of the timer does is provide you with finer
granularity, allowing you to get better real-time response from tasks
in a heavily loaded system.

The system has many advantages:

* Low overhead in a lightly loaded system because the time slices are
  larger. For example, two processes of equal priority with <maxTime>
  set to 50mS will both get 25mS time slices.

* Minimum time slice in a heavily loaded system. A task will get a
  minimum time slice of one interrupt period. Instead of the time
  slice getting smaller, the task will start to skip round-robin
  periods when the switch-in addition to task->counter fails to make
  it positive again.

* Adjustable real-time response. Making <maxTime> smaller will
  generally increase your real-time response at a cost of switch
  overhead. I suggest 50mS... that will allow the load factor to get
  to 25.0 with a 2mS timer interrupt before real-time response starts
  to suffer.

* Extremely low overhead queueing and dequeueing operations and no
  other maintenance required. Both the run and the wait queues are
  circular.
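[The switch-in arithmetic above, including the skip-if-still-negative
rule, can be sketched as below; again a user-space illustration with
invented names, not the original embedded-OS code.]

```c
struct task {
    int pri;      /* fixed 8-bit priority */
    int counter;  /* driven negative by the timer-tick subtraction */
    int credit;   /* pri * CONSTANT, precalculated at renice time */
};

/* Called when the scheduler considers switching a task in: add the
 * precalculated credit, and skip the task (return 0) if its counter
 * is still in debt from earlier rounds. */
int switch_in(struct task *t)
{
    t->counter += t->credit;
    return t->counter >= 0;
}
```

[Note that the per-slice cost is one addition and one comparison;
the multiply happens only when a priority changes.]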
When you move a task to the run queue you ALWAYS ADD IT TO THE END OF
THE QUEUE, JUST BEHIND THE CURRENTLY RUNNING TASK. This is necessary
in order to avoid task starvation due to a high rate of
deschedule/reschedule operations (i.e. synchronous IPC between two
tasks could bring the rest of the system to a grinding halt). But the
system will still compensate for the scheduling of a high priority
task. It's automatic and it's free... the act of scheduling in a high
priority task will cause <sum_pri> to increase and thus INSTANTLY
give currently running tasks a shorter time slice, thereby getting to
the scheduled-in high priority task's queue position much faster. If
you need even faster response you can queue it at the front instead
of the back, but I do not recommend it due to the possibility of
locking out other tasks in a runaway situation.

Now, what about VM paging and thrashing? Well, that's easy.

* The task is not on the run queue during a pagein operation.

* You can avoid thrashing by implementing thrashing management IN THE
  PAGEIN CODE, NOT IN THE SWITCHER CODE! Thrash management really
  does belong in the pagein code. It consists simply of holding off
  the pagein operation for a period of time when thrashing is
  occurring. This causes pagein activity to effectively 'burst'
  (hopefully with sequential block reads) rather than attempt to run
  multiple pageins for multiple tasks simultaneously (major seeking).

* This allows you to implement different schemes based on the type of
  pagein occurring... i.e. file cache versus executable text page,
  data page, etc.

What about fast-waits? A fast-wait is when you do not bother to
deschedule a task for an operation that 'blocks'. Instead, you leave
the task in the run queue and 'poll' for completion, skipping the
task if the poll returns FALSE. This is EXTREMELY useful for
synchronous IPC applications.
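[A fast-wait as described can be sketched as below: the task never
leaves the run queue, and the scheduler checks a completion flag
before switching it in. The flag-based representation and the names
are invented for the illustration.]

```c
struct task {
    int fast_waiting;  /* nonzero while a fast-wait is pending */
    int complete;      /* set by the I/O or IPC side on completion */
};

/* Called by the scheduler before switching a task in. A fast-waiting
 * task whose operation has not completed is simply skipped; it stays
 * on the run queue, so no deschedule/reschedule calls are needed. */
int may_run(struct task *t)
{
    if (t->fast_waiting && !t->complete)
        return 0;          /* still blocked: skip this round */
    t->fast_waiting = 0;   /* completed (or never waiting): run it */
    return 1;
}
```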
By avoiding the four deschedule/reschedule calls that would normally
occur for a single IPC transaction, you get much faster synchronous
response when the task is able to run the IPC and reply to the
message in less than a single time slice. Of course, linux doesn't
have to worry about that... yet!

I don't have time to implement this for linux myself; does someone
else care to?

Next week, class, I'll describe the I/O subsystem for the embedded OS
I did ;-)

					-Matt

Matthew Dillon              dil...@moonshot.west.oic.com
1005 Apollo Way             dil...@overload.berkeley.ca.us
Incline Village, NV. 89451  ham: KC6LVW (no mail drop)
USA                         Sandel-Avery Engineering (702)831-8000
[always include a portion of the original email in any response!]