Message-ID: <3AC93417.7B7814FC@chromium.com> Date: Tue, 03 Apr 2001 04:40:03 +0200 From: Fabio Riccardi <fa...@chromium.com> X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.2 i686) X-Accept-Language: en MIME-Version: 1.0 Subject: a quest for a better scheduler Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 31912.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-02!supernews.com!news.tele.dk!134.222.94.5! npeer.kpnqwest.net!newssrv.ita.tip.net!bofh.it!robomod X-Original-Date: Mon, 02 Apr 2001 19:23:19 -0700 X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: linux-ker...@vger.kernel.org Lines: 54 Hello, I sent a message a few days ago about some limitations I found in the linux scheduler. In servers like Apache where a large (> 1000) number of processes can be running at the same time and where many of them are runnable at the same time, the default Linux scheduler just starts trashing and the machine becomes very rapidly unusable. Performance degradations are quite noticeable on a two-way SMP machine (20-30% of the CPU gets lost) and are even more glaring on a multi-cpu machine. As an example, an 8-way Compaq Proliant just crawls with linux. From the feedback I received I realized that there are at least two possible solutions to the problem: http://lse.sourceforge.net/scheduling/ http://resourcemanagement.unixsolutions.hp.com/WaRM/schedpolicy.html Indeed I've tried the patches available on the sites for the multi-queue scheduler and I was amazed by the performance improvement that I got. Both patches allow me to get to a 100% real CPU utilization on a two way machine running ~1500 processes. What those patches do is quite simple, instead of having the single global process queue present in the normal Linux scheduler, they add multiple queues (one per CPU). In this way the scheduling decision can be greatly simplified and almost made local to each CPU. No hotspots, no global locks (well, almost). Although some scalability problems are still there (there still is a global decision to make), the performance improvement obtained and the simplicity of the solution are remarkable. The HP patch is probably the most interesting, since it consists of really a few lines of code and it gets (for what I could measure) the same kind of performance improvement of the more elaborate (but still quite simple) sourceforge patch. Is there any special reason why any of those patches didn't make it to the mainstream kernel code? TIA, ciao, - Fabio - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Date: Tue, 03 Apr 2001 12:00:06 +0200 From: Ingo Molnar <mi...@elte.hu> Reply-To: <mi...@elte.hu> Subject: Re: a quest for a better scheduler In-Reply-To: <3AC93417.7B7814FC@chromium.com> Message-ID: <Pine.LNX.4.30.0104031035271.2794-100000@elte.hu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 79779.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod References: <3AC93417.7B7814FC@chromium.com> X-Original-Cc: Linux Kernel List <linux-ker...@vger.kernel.org> X-Original-Date: Tue, 3 Apr 2001 10:55:12 +0200 (CEST) X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: Fabio Riccardi <fa...@chromium.com> Lines: 68 On Mon, 2 Apr 2001, Fabio Riccardi wrote: > I sent a message a few days ago about some limitations I found in the > linux scheduler. > > In servers like Apache where a large (> 1000) number of processes can > be running at the same time and where many of them are runnable at the > same time, the default Linux scheduler just starts trashing and the > machine becomes very rapidly unusable. it is *not* a scheduler problem. This is an application design problem. Do not expect > 1000 runnable processes to perform well. Apache's one-process-per-parallel-client-connection application model is especially bad for such kind of loads. The scheduler has more important priorities to worry about than to fix application design problems. > Indeed I've tried the patches available on the sites for the > multi-queue scheduler and I was amazed by the performance improvement > that I got. Both patches allow me to get to a 100% real CPU > utilization on a two way machine running ~1500 processes. There were many promises along the years, and none as of now met all the requirements the scheduler has to fulfill. 'many runnable processes' is not on the top of the list - if we are scheduling like mad then we have lost lots of performance already. even with such 'improvements', the many-parallel-clients performance of one-process-per-request HTTP server designs is an order slower than what is possible. having said that, improving this workload is still a useful task and we added improvements to the scheduler to improve this case. But those patches sacrifice other functionality just to get better many-runnable-processes performance, which is unacceptable. > What those patches do is quite simple, instead of having the single > global process queue present in the normal Linux scheduler, they add > multiple queues (one per CPU). In this way the scheduling decision can > be greatly simplified and almost made local to each CPU. No hotspots, > no global locks (well, almost). the problem is that the *real* scheduling complexity (and needed functionality) shows when the number of runnable processes is in the neighborhood of the number of processors. Running many processes just masks eg. load-balancing deficiencies in schedulers, so obviously the simplest and most localized scheduler 'wins'. So comparing schedulers based on many-runnable-processes load is like comparing cars solely based on the size of their fuel tanks. > Although some scalability problems are still there (there still is a > global decision to make), the performance improvement obtained and the > simplicity of the solution are remarkable. you can easily make the scheduler fast in the many-processes case by sacrificing functionality in the much more realistic, few-processes case. None of the patch i've seen so far maintained the current scheduler's few-processes logic. But i invite you to improve the current scheduler's many-processes behavior, without hurting its behavior in the few-processes case. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Subject: Re: a quest for a better scheduler Date: Tue, 03 Apr 2001 14:40:05 +0200 In-Reply-To: <3AC93417.7B7814FC@chromium.com> from "Fabio Riccardi" at Apr 02, 2001 07:23:19 PM X-Mailer: ELM [version 2.5 PL1] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <E14kPy7-0007xx-00@the-village.bc.nu> From: Alan Cox <a...@lxorguk.ukuu.org.uk> Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 46228.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod References: <3AC93417.7B7814FC@chromium.com> X-Original-Cc: linux-ker...@vger.kernel.org X-Original-Date: Tue, 3 Apr 2001 13:31:29 +0100 (BST) X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: fa...@chromium.com (Fabio Riccardi) Lines: 13 > Is there any special reason why any of those patches didn't make it to > the mainstream kernel code? All of them are worse for the normal case. Also 1500 running apache's isnt a remotely useful situation, you are thrashing the cache even if you are now not thrashing the scheduler. Use an httpd designed for that situation. Then you can also downgrade to a cheap pentium class box for the task ;) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Date: Tue, 03 Apr 2001 21:20:08 +0200 From: Mike Kravetz <mkrav...@sequent.com> Subject: Re: a quest for a better scheduler Message-ID: <20010403121308.A1054@w-mikek2.sequent.com> References: <3AC93417.7B7814FC@chromium.com> <Pine.LNX.4.30.0104031035271.2794-100000@elte.hu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2i In-Reply-To: <Pine.LNX.4.30.0104031035271.2794-100000@elte.hu>; from mingo@elte.hu on Tue, Apr 03, 2001 at 10:55:12AM +0200 Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 71163.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod X-Original-Cc: Fabio Riccardi <fa...@chromium.com>, Linux Kernel List <linux-ker...@vger.kernel.org> X-Original-Date: Tue, 3 Apr 2001 12:13:08 -0700 X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: Ingo Molnar <mi...@elte.hu> Lines: 55 On Tue, Apr 03, 2001 at 10:55:12AM +0200, Ingo Molnar wrote: > > you can easily make the scheduler fast in the many-processes case by > sacrificing functionality in the much more realistic, few-processes case. > None of the patch i've seen so far maintained the current scheduler's > few-processes logic. But i invite you to improve the current scheduler's > many-processes behavior, without hurting its behavior in the few-processes > case. > Maintaining the current scheduler's logic is exactly what we are trying to do in the projects at: http://lse.sourceforge.net/scheduling/ A common design goal for the the two alternative scheduler implementations at this site is to maintain the current scheduler's behavior/scheduling decisions. In the case of the priority queue scheduler, we have actually used a copy of the existing scheduler running in parallel (in the same kernel) to determine if we are making the same scheduling decisions. Currently, in this implementation we only deviate from the current scheduler in a small number of cases where tasks get a boost due to having the same memory map. The multi-queue implementation is more interesting. It is also designed to maintain the behavior of the current scheduler. However, as the runqueues get longer (and we start getting contention on the runqueue locks) it starts to deviate from existing scheduler behavior and make more local scheduling decisions. Ideally, this implementation will exhibit the behavior of the current scheduler at low thread counts and make more localized decisions as pressure on the scheduler is increased. Neither of these implementations are at a point where I would advocate their adoption; yet. Can someone tell me what a good workload/benchmark would be to examine 'low thread count' performance? In the past people have used the 'spinning on sched_yield' benchmark. However, this now makes little sense with the sched_yield optimizations introduced in 2.4. In addition, such a benchmark mostly ignores the 'reschedule_idle' component of the scheduler. We have developed a 'token passing' benchmark which attempts to address these issues (called reflex at the above site). However, I would really like to get a pointer to a community acceptable workload/benchmark for these low thread cases. -- Mike Kravetz mkrav...@sequent.com IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Date: Tue, 03 Apr 2001 22:00:08 +0200 From: Ingo Molnar <mi...@elte.hu> Reply-To: <mi...@elte.hu> Subject: Re: a quest for a better scheduler In-Reply-To: <20010403121308.A1054@w-mikek2.sequent.com> Message-ID: <Pine.LNX.4.30.0104032024290.9285-100000@elte.hu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 40993.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod References: <20010403121308.A1054@w-mikek2.sequent.com> X-Original-Cc: Fabio Riccardi <fa...@chromium.com>, Linux Kernel List <linux-ker...@vger.kernel.org> X-Original-Date: Tue, 3 Apr 2001 20:47:52 +0200 (CEST) X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: Mike Kravetz <mkrav...@sequent.com> Lines: 45 On Tue, 3 Apr 2001, Mike Kravetz wrote: > [...] Currently, in this implementation we only deviate from the > current scheduler in a small number of cases where tasks get a boost > due to having the same memory map. thread-thread-affinity pretty much makes it impossible to use a priority queue. This 'small number of cases' is actually a fundamental issue: can the 'goodness()' function be an arbitrary function of: goodness(process_prev,process_next) := f(process_prev,process_next) , or is the queue design restricting the choice of goodness() functions? Priority queues for example restrict the choice of the goodness() function to subset of functions: goodness(process_prev,process_next) := f(process_next); this restriction (independence of the priority from the previous process) is a fundamentally bad property, scheduling is nonlinear and affinity decisions depend on the previous context. [i had a priority-queue SMP scheduler running 2-3 years ago but dropped the idea due to this issue.] IMO it's more important to have a generic and flexible scheduler, and arbitrary, nonnatural restrictions tend to bite us later on. one issue regarding multiqueues is the ability of interactive processes to preempt other, lower priority processes on other CPUs. These sort of things tend to happen while using X. In a system where process priorities are different, scheduling decisions cannot be localized per-CPU (unfortunately), and 'the right to run' as such is a global thing. > Can someone tell me what a good workload/benchmark would be to examine > 'low thread count' performance? [...] lmbench's lat_ctx for example, and other tools in lmbench trigger various scheduler workloads as well. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Date: Wed, 04 Apr 2001 00:50:07 +0200 From: Mike Kravetz <mkrav...@sequent.com> Subject: Re: a quest for a better scheduler Message-ID: <20010403154314.E1054@w-mikek2.sequent.com> References: <20010403121308.A1054@w-mikek2.sequent.com> <Pine.LNX.4.30.0104032024290.9285-100000@elte.hu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2i In-Reply-To: <Pine.LNX.4.30.0104032024290.9285-100000@elte.hu>; from mingo@elte.hu on Tue, Apr 03, 2001 at 08:47:52PM +0200 Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 3375.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod X-Original-Cc: fran...@us.ibm.com, Fabio Riccardi <fa...@chromium.com>, Linux Kernel List <linux-ker...@vger.kernel.org> X-Original-Date: Tue, 3 Apr 2001 15:43:14 -0700 X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: Ingo Molnar <mi...@elte.hu> Lines: 63 On Tue, Apr 03, 2001 at 08:47:52PM +0200, Ingo Molnar wrote: > > this restriction (independence of the priority from the previous process) > is a fundamentally bad property, scheduling is nonlinear and affinity > decisions depend on the previous context. [i had a priority-queue SMP > scheduler running 2-3 years ago but dropped the idea due to this issue.] > IMO it's more important to have a generic and flexible scheduler, and > arbitrary, nonnatural restrictions tend to bite us later on. It seems like we may be talking about two different things. Our 'priority queue' implementation uses almost the same goodness function as the current scheduler. The main difference between our 'priority queue' scheduler and the current scheduler is the structure of the runqueue. We break up the single runqueue into a set of priority based runqueues. The 'goodness' of a task determines what sub-queue a task is placed in. Tasks with higher goodness values are placed in higher priority queues than tasks with lower goodness values. This allows us to limit the scan of runnable tasks to the highest priority sub-queue, as opposed to the entire runquue. When scanning the highest priority sub-queue we use almost the same goodness function as that which is used today (it does take CPU affinity into account). In fact, if we used the same goodness function I'm pretty sure we would exactly match the behavior of the existing scheduler. Hopefully, Hubertus Franke will speak up and provide more details, as he is much more familiar with this implementation than I am. In any case, I think we have almost reached the conclusion that our priority queue implementation may not be acceptable due to the extra overhead in low thread counts. > one issue regarding multiqueues is the ability of interactive processes to > preempt other, lower priority processes on other CPUs. These sort of > things tend to happen while using X. In a system where process priorities > are different, scheduling decisions cannot be localized per-CPU > (unfortunately), and 'the right to run' as such is a global thing. Agreed. This is why our multi-queue scheduler always attempts to make global decisions. It will preempt lower priority tasks on other CPUs. This implementation tends to make 'more localized' scheduling decisions as contention on the runqueue locks increase. However, at this point one could argue that we have moved away from a 'realistic' low task count system load. > lmbench's lat_ctx for example, and other tools in lmbench trigger various > scheduler workloads as well. Thanks, I'll add these to our list. -- Mike Kravetz mkrav...@sequent.com IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Message-ID: <3ACA683A.89D24DED@chromium.com> Date: Wed, 04 Apr 2001 02:20:08 +0200 From: Fabio Riccardi <fa...@chromium.com> X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.2 i686) X-Accept-Language: en MIME-Version: 1.0 Subject: Re: a quest for a better scheduler References: <20010403121308.A1054@w-mikek2.sequent.com> <Pine.LNX.4.30.0104032024290.9285-100000@elte.hu> <20010403154314.E1054@w-mikek2.sequent.com> Content-Type: multipart/mixed; boundary="------------B46085350AE53D8B14855445" Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 82541.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod X-Original-Cc: Ingo Molnar <mi...@elte.hu>, fran...@us.ibm.com, Linux Kernel List <linux-ker...@vger.kernel.org>, Alan Cox <a...@lxorguk.ukuu.org.uk> X-Original-Date: Tue, 03 Apr 2001 17:18:03 -0700 X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: Mike Kravetz <mkrav...@sequent.com> Lines: 410 This is a multi-part message in MIME format. --------------B46085350AE53D8B14855445 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Dear all, I've spent my afternoon running some benchmarks to see if MQ patches would degrade performance in the "normal case". To measure performance I've used the latest lmbench and I have mesured the kernel compile times on a dual pentium III box runing at 1GHz with an 133MHz bus. Results (attached) show that there is no measurable difference in performace between a vanilla scheduler and a multiqueue scheduler when running only few processes (the compilation benchmark runs essentially two processes, one per CPU). I have measured the HP and not the "scalability" patch because the two do more or less the same thing and give me the same performance advantages, but the former is a lot simpler and I could port it with no effort on any recent kernel. It is indeed interesting to see that this patch was originally designed for a 2.4.0-test10 kernel, and still works fine on the latest kernels, only a minor change (one line) was required. A version of the patch for the 2.4.2-ac26 kernel is attached to this message. Given the zero impact on "normal case" performance and the huge positive impact (> 20%) in the heavy load case (70-80 runnable processess on a load of about 1400 total) I don't see why such a thing shouldn't be accepted in the mainstream scheduler. - Fabio Patch Kernel compilation times ======================== three runs of "time make -j2" -- Linux 2.4.2-ac26 + HP Multiqueue patch 216.34user 14.36system 2:00.03elapsed 192%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (482269major+691020minor)pagefaults 0swaps 216.53user 14.23system 1:57.91elapsed 195%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (482269major+691020minor)pagefaults 0swaps 217.65user 13.46system 1:58.05elapsed 195%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (482269major+691020minor)pagefaults 0swaps -- Linux 2.4.2-ac26 220.07user 14.88system 2:02.67elapsed 191%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (482269major+691019minor)pagefaults 0swaps 220.31user 14.90system 2:00.64elapsed 194%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (482269major+691018minor)pagefaults 0swaps 220.58user 14.84system 2:00.57elapsed 195%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (482269major+691018minor)pagefaults 0swaps LMBENCH-2BETA1 ============== First two: Linux 2.4.2-ac26 + HP Multiqueue patch Second Two: Linux 2.4.2-ac26 Processor, Processes - times in microseconds - smaller is better ---------------------------------------------------------------- Host OS Mhz null null open selct sig sig fork exec sh call I/O stat clos TCP inst hndl proc proc proc --------- ------------- ---- ---- ---- ---- ---- ----- ---- ---- ---- ---- ---- skinny Linux 2.4.2-a 997 0.34 0.56 3.96 5.04 27 0.89 2.72 245 1128 4044 skinny Linux 2.4.2-a 997 0.34 0.57 4.19 5.32 25 0.89 2.71 247 1150 4067 skinny Linux 2.4.2-a 997 0.34 0.58 3.90 5.00 25 0.89 2.69 249 1121 3968 skinny Linux 2.4.2-a 997 0.34 0.57 3.90 5.01 25 0.87 2.70 246 1126 4018 Context switching - times in microseconds - smaller is better ------------------------------------------------------------- Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw --------- ------------- ----- ------ ------ ------ ------ ------- ------- skinny Linux 2.4.2-a 1.820 4.7700 12 8.0800 109 27 110 skinny Linux 2.4.2-a 1.890 4.7300 20 6.6500 109 27 110 skinny Linux 2.4.2-a 1.620 4.5900 12 6.7000 109 24 109 skinny Linux 2.4.2-a 1.700 4.6400 12 7.0600 109 26 109 *Local* Communication latencies in microseconds - smaller is better ------------------------------------------------------------------- Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP ctxsw UNIX UDP TCP conn --------- ------------- ----- ----- ---- ----- ----- ----- ----- ---- skinny Linux 2.4.2-a 1.820 7.390 15 16 52 23 91 55 skinny Linux 2.4.2-a 1.890 7.185 14 16 41 23 56 54 skinny Linux 2.4.2-a 1.620 6.793 15 16 40 21 56 54 skinny Linux 2.4.2-a 1.700 6.801 15 16 40 21 55 54 File & VM system latencies in microseconds - smaller is better -------------------------------------------------------------- Host OS 0K File 10K File Mmap Prot Page Create Delete Create Delete Latency Fault Fault --------- ------------- ------ ------ ------ ------ ------- ----- ----- skinny Linux 2.4.2-a 6.2054 0.7192 12 1.6973 451 0.672 2.00000 skinny Linux 2.4.2-a 6.2469 0.7360 12 1.7142 465 0.668 2.00000 skinny Linux 2.4.2-a 3.1992 0.7182 12 1.5857 449 0.680 2.00000 skinny Linux 2.4.2-a 6.1576 0.7119 12 1.5817 448 0.669 2.00000 *Local* Communication bandwidths in MB/s - bigger is better ----------------------------------------------------------- Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem UNIX reread reread (libc) (hand) read write --------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- ----- skinny Linux 2.4.2-a 823 233 162 434 558 271 218 558 282 skinny Linux 2.4.2-a 828 289 243 408 558 269 216 558 281 skinny Linux 2.4.2-a 839 285 249 445 558 222 147 558 219 skinny Linux 2.4.2-a 811 284 242 445 558 222 147 558 219 Memory latencies in nanoseconds - smaller is better (WARNING - may not be correct, check graphs) --------------------------------------------------- Host OS Mhz L1 $ L2 $ Main mem Guesses --------- ------------- ---- ----- ------ -------- ------- skinny Linux 2.4.2-a 997 3.009 7.0230 101 skinny Linux 2.4.2-a 997 3.010 7.0220 101 skinny Linux 2.4.2-a 997 3.010 7.0280 101 skinny Linux 2.4.2-a 997 3.009 7.0290 101 --------------B46085350AE53D8B14855445-- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Message-ID: <3ACA6BF4.9A3F93A2@chromium.com> Date: Wed, 04 Apr 2001 02:40:05 +0200 From: Fabio Riccardi <fa...@chromium.com> X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.2 i686) X-Accept-Language: en MIME-Version: 1.0 Subject: Re: a quest for a better scheduler References: <E14kPy7-0007xx-00@the-village.bc.nu> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 16281.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod X-Original-Cc: linux-ker...@vger.kernel.org X-Original-Date: Tue, 03 Apr 2001 17:33:57 -0700 X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: Alan Cox <a...@lxorguk.ukuu.org.uk> Lines: 42 Alan, for the "normal case" performance see my other message. I agree that a better threading model would surely help in a web server, but to me this is not an excuse to live up with a broken scheduler. The X15 server I'm working on now is a sort of user-space TUX, it uses only 8 threads per CPU and it achieves the same level of performance of the kernel space TUX. Even in this case the performance advantage of the multiqueue scheduler is remarkable, especially on a multi-CPU (> 2) architecture. To achieve some decent disk/CPU/network overlapping with the current linux blocking disk IO limitations there is no way to avoid a "bunch of server threads". I've (experimentally) figured out that 8-16 threads per CPU can assure some reasonable overlapping, depending on the memory size and disk subsystem speed. On a 8-way machine this means 64-128 active tasks, a total disaster with the current scheduler. Unless we want to maintain the position tha the only way to achieve good performance is to embed server applications in the kernel, some minimal help should be provided to goodwilling user applications :) TIA, ciao, - Fabio Alan Cox wrote: > > Is there any special reason why any of those patches didn't make it to > > the mainstream kernel code? > > All of them are worse for the normal case. Also 1500 running apache's isnt > a remotely useful situation, you are thrashing the cache even if you are now > not thrashing the scheduler. Use an httpd designed for that situation. Then > you can also downgrade to a cheap pentium class box for the task ;) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Subject: Re: a quest for a better scheduler Date: Wed, 04 Apr 2001 02:40:05 +0200 In-Reply-To: <3ACA6BF4.9A3F93A2@chromium.com> from "Fabio Riccardi" at Apr 03, 2001 05:33:57 PM X-Mailer: ELM [version 2.5 PL1] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <E14kbH2-0000qX-00@the-village.bc.nu> From: Alan Cox <a...@lxorguk.ukuu.org.uk> Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 16282.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod References: <3ACA6BF4.9A3F93A2@chromium.com> X-Original-Cc: a...@lxorguk.ukuu.org.uk (Alan Cox), linux-ker...@vger.kernel.org X-Original-Date: Wed, 4 Apr 2001 01:35:46 +0100 (BST) X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: fa...@chromium.com (Fabio Riccardi) Lines: 24 > for the "normal case" performance see my other message. I did - and with a lot of interest > I agree that a better threading model would surely help in a web server, but to > me this is not an excuse to live up with a broken scheduler. The problem has always been - alternative scheduler, crappier performance for 2 tasks running (which is most boxes). If your numbers are right then the HP patch is working as well for 1 or 2 tasks too > Unless we want to maintain the position tha the only way to achieve good > performance is to embed server applications in the kernel, some minimal help > should be provided to goodwilling user applications :) Indeed. I'd love to see you beat tux entirely in userspace. It proves the rest of the API for the kernel is right - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Message-ID: <3ACA7629.E8C54D13@chromium.com> Date: Wed, 04 Apr 2001 03:20:05 +0200 From: Fabio Riccardi <fa...@chromium.com> X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.2 i686) X-Accept-Language: en MIME-Version: 1.0 Subject: Re: a quest for a better scheduler References: <E14kbH2-0000qX-00@the-village.bc.nu> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 50808.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod X-Original-Cc: linux-ker...@vger.kernel.org X-Original-Date: Tue, 03 Apr 2001 18:17:30 -0700 X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: Alan Cox <a...@lxorguk.ukuu.org.uk> Lines: 52 Alan Cox wrote: > > for the "normal case" performance see my other message. > > I did - and with a lot of interest thanks! :) > > I agree that a better threading model would surely help in a web server, but to > > me this is not an excuse to live up with a broken scheduler. > > The problem has always been - alternative scheduler, crappier performance for > 2 tasks running (which is most boxes). If your numbers are right then the > HP patch is working as well for 1 or 2 tasks too Please verify them if you have a couple of spare hours. BTW: I measured similar results for the "scalability" patches on a 2.4.1 kernel, it would be worth the effort to seriously compare them from an architectural point of view, but I don't have the time right now... > > Unless we want to maintain the position tha the only way to achieve good > > performance is to embed server applications in the kernel, some minimal help > > should be provided to goodwilling user applications :) > > Indeed. I'd love to see you beat tux entirely in userspace. It proves the > rest of the API for the kernel is right Indeed, I'm using RT sigio/sigwait event scheduling, bare clone threads and zero-copy io. If only I had a really asynchronous sendfile, or a smarter madvise that wouldn't require to map files :) My server cannot execute dynamic stuff yet, it relies on Apache for that. Running X15 and TUX in the same conditions (i.e. dynamic code in Apache) I get exactly the same score in both cases. I'm adding a TUX-like dynamic interface, I hope to get it to work by next week, then I'll make a real confrontation. Regards, ciao, - Fabio - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Date: Wed, 04 Apr 2001 06:00:06 +0200 From: Mike Kravetz <mkrav...@sequent.com> Subject: Re: a quest for a better scheduler Message-ID: <20010403194700.A1024@w-mikek2.sequent.com> References: <20010403121308.A1054@w-mikek2.sequent.com> <Pine.LNX.4.30.0104032024290.9285-100000@elte.hu> <20010403154314.E1054@w-mikek2.sequent.com> <3ACA683A.89D24DED@chromium.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2i In-Reply-To: <3ACA683A.89D24DED@chromium.com>; from fabio@chromium.com on Tue, Apr 03, 2001 at 05:18:03PM -0700 Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 61183.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod X-Original-Cc: Mike Kravetz <mkrav...@sequent.com>, Ingo Molnar <mi...@elte.hu>, fran...@us.ibm.com, Linux Kernel List <linux-ker...@vger.kernel.org>, Alan Cox <a...@lxorguk.ukuu.org.uk> X-Original-Date: Tue, 3 Apr 2001 19:47:00 -0700 X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: Fabio Riccardi <fa...@chromium.com> Lines: 22 On Tue, Apr 03, 2001 at 05:18:03PM -0700, Fabio Riccardi wrote: > > I have measured the HP and not the "scalability" patch because the two do more > or less the same thing and give me the same performance advantages, but the > former is a lot simpler and I could port it with no effort on any recent > kernel. Actually, there is a significant difference between the HP patch and the one I developed. In the HP patch, if there is a schedulable task on the 'local' (current CPU) runqueue it will ignore runnable tasks on other (remote) runqueues. In the multi-queue patch I developed, the scheduler always attempts to make the same global scheduling decisions as the current scheduler. -- Mike Kravetz mkrav...@sequent.com IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Message-ID: <3ACAA164.BDFF9B4C@chromium.com> Date: Wed, 04 Apr 2001 06:30:05 +0200 From: Fabio Riccardi <fa...@chromium.com> X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.2 i686) X-Accept-Language: en MIME-Version: 1.0 Subject: Re: a quest for a better scheduler References: <20010403121308.A1054@w-mikek2.sequent.com> <Pine.LNX.4.30.0104032024290.9285-100000@elte.hu> <20010403154314.E1054@w-mikek2.sequent.com> <3ACA683A.89D24DED@chromium.com> <20010403194700.A1024@w-mikek2.sequent.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 74717.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod X-Original-Cc: Ingo Molnar <mi...@elte.hu>, fran...@us.ibm.com, Linux Kernel List <linux-ker...@vger.kernel.org>, Alan Cox <a...@lxorguk.ukuu.org.uk> X-Original-Date: Tue, 03 Apr 2001 21:21:57 -0700 X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: Mike Kravetz <mkrav...@sequent.com> Lines: 44 I was actually suspecting that the extra lines in your patch were there for a reason :) A few questions: What is the real impact of a (slight) change in scheduling semantics? Under which situation one should notice a difference? As you state in your papers the global decision comes with a cost, is it worth it? Could you make a port of your thing on recent kernels? I tried and I failed and I don't have enough time to figure out why, that should be trivial for you though. TIA, ciao, - Fabio Mike Kravetz wrote: > On Tue, Apr 03, 2001 at 05:18:03PM -0700, Fabio Riccardi wrote: > > > > I have measured the HP and not the "scalability" patch because the two do more > > or less the same thing and give me the same performance advantages, but the > > former is a lot simpler and I could port it with no effort on any recent > > kernel. > > Actually, there is a significant difference between the HP patch and > the one I developed. In the HP patch, if there is a schedulable task > on the 'local' (current CPU) runqueue it will ignore runnable tasks on > other (remote) runqueues. In the multi-queue patch I developed, the > scheduler always attempts to make the same global scheduling decisions > as the current scheduler. > > -- > Mike Kravetz mkrav...@sequent.com > IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Date: Wed, 04 Apr 2001 09:40:05 +0200 From: Ingo Molnar <mi...@elte.hu> Reply-To: <mi...@elte.hu> Subject: Re: a quest for a better scheduler In-Reply-To: <20010403154314.E1054@w-mikek2.sequent.com> Message-ID: <Pine.LNX.4.30.0104040825460.1717-100000@elte.hu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 11424.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod References: <20010403154314.E1054@w-mikek2.sequent.com> X-Original-Cc: <fran...@us.ibm.com>, Fabio Riccardi <fa...@chromium.com>, Linux Kernel List <linux-ker...@vger.kernel.org> X-Original-Date: Wed, 4 Apr 2001 08:28:31 +0200 (CEST) X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: Mike Kravetz <mkrav...@sequent.com> Lines: 28 On Tue, 3 Apr 2001, Mike Kravetz wrote: > Our 'priority queue' implementation uses almost the same goodness > function as the current scheduler. The main difference between our > 'priority queue' scheduler and the current scheduler is the structure > of the runqueue. We break up the single runqueue into a set of > priority based runqueues. The 'goodness' of a task determines what > sub-queue a task is placed in. Tasks with higher goodness values are > placed in higher priority queues than tasks with lower goodness > values. [...] we are talking about the same thing, re-read my mail. this design assumes that 'goodness' is constant in the sense that it does not depend on the previous process. and no, your are not using the same goodness() function, you omitted the prev->mm == next->mm component to goodness(), due to this design limitation. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Date: Wed, 04 Apr 2001 10:00:08 +0200 From: Ingo Molnar <mi...@elte.hu> Reply-To: <mi...@elte.hu> Subject: Re: a quest for a better scheduler In-Reply-To: <3ACA683A.89D24DED@chromium.com> Message-ID: <Pine.LNX.4.30.0104040835470.1708-100000@elte.hu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 61015.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-02!supernews.com!bignews.mediaways.net! unlisys!news.snafu.de!news.tele.dk!194.25.134.62!newsfeed00.sul.t-online.de! t-online.de!bofh.it!robomod References: <3ACA683A.89D24DED@chromium.com> X-Original-Cc: Mike Kravetz <mkrav...@sequent.com>, <fran...@us.ibm.com>, Linux Kernel List <linux-ker...@vger.kernel.org>, Alan Cox <a...@lxorguk.ukuu.org.uk>, Linus Torvalds <torva...@transmeta.com> X-Original-Date: Wed, 4 Apr 2001 08:53:36 +0200 (CEST) X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: Fabio Riccardi <fa...@chromium.com> Lines: 35 On Tue, 3 Apr 2001, Fabio Riccardi wrote: > I've spent my afternoon running some benchmarks to see if MQ patches > would degrade performance in the "normal case". no doubt priority-queue can run almost as fast as the current scheduler. What i'm worried about is the restriction of the 'priority' of processes, it cannot depend on previous processes (and other 'current state') anymore. to so we have two separate issues: #1: priority-queue: has the fundamental goodness() design limitation. #2: per-CPU-runqueues: changes semantics, makes scheduler less effective due to nonglobal decisions. about #1: while right now the prev->mm rule appears to be a tiny issue (it might not affect performance significantly), but forbidding it by hardcoding the assumption into data structures is a limitation of *future* goodness() functions. Eg. with the possible emergence of CPU-level threading and other, new multiprocessing technologies, this could be a *big* mistake. the Linux scheduler is not designed for the case of 1000 runnable processes. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Date: Wed, 04 Apr 2001 15:00:08 +0200 From: Ingo Molnar <mi...@elte.hu> Reply-To: <mi...@elte.hu> Subject: Re: a quest for a better scheduler In-Reply-To: <3ACA6BF4.9A3F93A2@chromium.com> Message-ID: <Pine.LNX.4.30.0104041338020.4611-100000@elte.hu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 24083.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-02!supernews.com!news.tele.dk! 194.25.134.62!newsfeed00.sul.t-online.de!t-online.de!bofh.it!robomod References: <3ACA6BF4.9A3F93A2@chromium.com> X-Original-Cc: Alan Cox <a...@lxorguk.ukuu.org.uk>, Linux Kernel List <linux-ker...@vger.kernel.org> X-Original-Date: Wed, 4 Apr 2001 13:51:14 +0200 (CEST) X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: Fabio Riccardi <fa...@chromium.com> Lines: 61 On Tue, 3 Apr 2001, Fabio Riccardi wrote: > I agree that a better threading model would surely help in a web > server, but to me this is not an excuse to live up with a broken > scheduler. believe me, there are many other parts of the kernel that are not optimized for the nutcase. In this case it's the scheduler that punishes you first. Do not shoot the messenger. > The X15 server I'm working on now is a sort of user-space TUX, it uses > only 8 threads per CPU and it achieves the same level of performance > of the kernel space TUX. Even in this case the performance advantage > of the multiqueue scheduler is remarkable, especially on a multi-CPU > (> 2) architecture. then your design is still bad. TUX has no problems with the scheduler, and TUX doesnt have many runnable processes at once. And this has nothing to do with TUX being within the kernel. > To achieve some decent disk/CPU/network overlapping with the current > linux blocking disk IO limitations there is no way to avoid a "bunch > of server threads". [...] false. > [...] I've (experimentally) figured out that 8-16 threads per CPU can > assure some reasonable overlapping, depending on the memory size and > disk subsystem speed. On a 8-way machine this means 64-128 active > tasks, [...] if they are doing IO *only* then there wont be 64-128 active tasks. This is how TUX does async IO: helper threads do the IO and *only* the IO. This way cache locality is maximized. (the scheduler is irrelevant in this case.) Again you are blaming the scheduler - the poor scheduler is simply the first kernel subsystem that tells you: your design still sucks. (if you have 64-128 active tasks then you already pay a very big cache footprint price as well.) [I'm wondering how your solution was just as fast as TUX, although you claim that the scheduler was already killing things. Perhaps your test was done over localhost or was saturating network bandwidth?] > Unless we want to maintain the position tha the only way to achieve > good performance is to embed server applications in the kernel, [...] FYI, TUX related functionality is constantly being pushed out and made accessible to userspace as well. Witness zerocopy, IRQ-affinity, and tons of generic patches such as pagecache-scalability and timer-scalability. (and soon the remaining, TUX-private improvements too.) TUX is intended to be a testbed for kernel subsystems, where performance optimizations and APIs can be tested without having to export them to userspace. (exporting to userspace cannot be done lightly and makes things less flexible.) Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Date: Wed, 04 Apr 2001 15:10:05 +0200 From: Ingo Molnar <mi...@elte.hu> Reply-To: <mi...@elte.hu> Subject: Re: a quest for a better scheduler In-Reply-To: <E14kbH2-0000qX-00@the-village.bc.nu> Message-ID: <Pine.LNX.4.30.0104041351400.4611-100000@elte.hu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 39492.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod References: <E14kbH2-0000qX-00@the-village.bc.nu> X-Original-Cc: Fabio Riccardi <fa...@chromium.com>, <linux-ker...@vger.kernel.org> X-Original-Date: Wed, 4 Apr 2001 13:57:40 +0200 (CEST) X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: Alan Cox <a...@lxorguk.ukuu.org.uk> Lines: 27 On Wed, 4 Apr 2001, Alan Cox wrote: > The problem has always been - alternative scheduler, crappier > performance for 2 tasks running (which is most boxes). [...] it's not only the 2-task case, but also less flexibility or lost semantics. > Indeed. I'd love to see you beat tux entirely in userspace. It proves > the rest of the API for the kernel is right well, until the cost of entry into the kernel is eliminated, this is not possible - unless there are performance bugs in TUX :-) but yes, getting a userspace solution that gets 'close enough' in eg. SPECweb99 benchmarks (which is complex enough to be trusted as a generic performance metric) would be a nice thing to have. There are existing SIGIO based, multithreaded solutions (eg. phttpd), with varying success. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Importance: Normal Subject: Re: a quest for a better scheduler X-Mailer: Lotus Notes Release 5.0.5 September 22, 2000 Message-ID: <OF401BD38B.CF3B1E9F-ON85256A24.0048543A@pok.ibm.com> From: "Hubertus Franke" <fran...@us.ibm.com> Date: Wed, 04 Apr 2001 15:50:07 +0200 X-Mimetrack: Serialize by Router on D01ML244/01/M/IBM(Release 5.0.7 |March 21, 2001) at 04/04/2001 09:42:05 AM MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 85077.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod X-Original-Cc: Fabio Riccardi <fa...@chromium.com>, Linux Kernel List <linux-ker...@vger.kernel.org>, lse-t...@lists.sourceforge.net X-Original-Date: Wed, 4 Apr 2001 09:43:55 -0400 X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: Ingo Molnar <mi...@elte.hu>, Mike Kravetz <mkrav...@sequent.com> Lines: 126 This is an important point that Mike is raising and it also addresses a critique that Ingo issued yesterday, namely interactivity and fairness. The HP scheduler completely separates the per-CPU runqueues and does not take preemption goodness or alike into account. This can lead to unfair proportionment of CPU cycles, strong priority inversion and a potential lack of interactivity. Our MQ scheduler does yield the same decision in most cases (other than defined by some race condition on locks and counter members) It is not clear that yielding the same decision as the current scheduler is the ultimate goal to shoot for, but it allows comparision. Another point to raise is that the current scheduler does a exhaustive search for the "best" task to run. It touches every process in the runqueue. this is ok if the runqueue length is limited to a very small multiple of the #cpus. But that is not what high end server systems encounter. With the rising number of processors, lock contention can quickly become a bottleneck. If we assume that load (#running-task) increases somewhat linear with #cpus, the problem gets even worth because not only have I increased the number of clients but also the lock hold time. Clinging on to the statement that #running-tasks ~ #cpus, ofcourse saves you from that dilemma, but not everybody is signing on to this limitation. MQ and priority-list help in 2 ways. MQ reduces the average lock holdtime because on average it only inspects #running-tasks/#cpus tasks to make a local decision. It then goes on to inspect (#cpus-1) datastructures representing the next best to run tasks on those remote cpus all without holding the lock, thus substantially reducing lock contention. Note we still search the entire set of runnable tasks, however we do it in a distributed collaborative manner. The only time we deviate from the current scheduler decision is in the case when two cpus have identified the same remote task as a target for remote stealing. In that case one will win and the other cpu will continue looking somewhere else, although there might have been another tasks on that cpu to steal. priority list schedulers (PRS) only helps in reducing lock hold time, which can result in some relieve wrt lock contention, but not a whole lot. PRS can limit the lists it has to search based on the PROC_CHANGE_PENALTY. It also keeps 0-counter in a list that is never inspected. One can even go further and put YIELD tasks in a separate list, given that the sys_sched_yield already does some optimizations. The older version (12/00) posted on LSE is functionally equivalent to the current scheduler. I will put up another version this week that is based on a bitmask and which is a bit more agressive in that it ignores the MM goodness boost of 1 which in my books is merely a tie breaker between two task of equal goodness. Beyond that we have done some work on cpu pooling, which is to identify a set of cpus that form a scheduling set. We still consider in reschedule_idle all cpus for preemption but in schedule it is sufficient to only schedule within the own set. That again can limit lock hold time with MQ and we have seen some improvements. We also deploy load balacing. To summarize, we have taken great care of trying to preserve the current scheduler semantics and have laid out a path to relax some of the semantics for further improvements. I don't believe that the HP scheduler is sufficient since it is lacking load balacing, which naturally occurs in our MQ scheduler, and it lacks the interactivity requirements that Ingo pointed out. Most of these things are discussed in great detail in the writeups under lse.sourceforge.net/scheduling. I also believe we show there that the MQ performance for low thread counts is also matching the vanilla case. I further don't understand the obsession of keeping the scheduler simple. If there are improvements and I don't believe the MQ is all that complicated and its well documented, why not put it in. Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) email: fran...@us.ibm.com (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003 Mike Kravetz <mkrav...@sequent.com> on 04/03/2001 10:47:00 PM To: Fabio Riccardi <fa...@chromium.com> cc: Mike Kravetz <mkrav...@sequent.com>, Ingo Molnar <mi...@elte.hu>, Hubertus Franke/Watson/IBM@IBMUS, Linux Kernel List <linux-ker...@vger.kernel.org>, Alan Cox <a...@lxorguk.ukuu.org.uk> Subject: Re: a quest for a better scheduler On Tue, Apr 03, 2001 at 05:18:03PM -0700, Fabio Riccardi wrote: > > I have measured the HP and not the "scalability" patch because the two do more > or less the same thing and give me the same performance advantages, but the > former is a lot simpler and I could port it with no effort on any recent > kernel. Actually, there is a significant difference between the HP patch and the one I developed. In the HP patch, if there is a schedulable task on the 'local' (current CPU) runqueue it will ignore runnable tasks on other (remote) runqueues. In the multi-queue patch I developed, the scheduler always attempts to make the same global scheduling decisions as the current scheduler. -- Mike Kravetz mkrav...@sequent.com IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Importance: Normal Subject: Re: a quest for a better scheduler X-Mailer: Lotus Notes Release 5.0.5 September 22, 2000 Message-ID: <OFB30A8B18.2E3AD16C-ON85256A24.004BD696@pok.ibm.com> From: "Hubertus Franke" <fran...@us.ibm.com> Date: Wed, 04 Apr 2001 16:10:06 +0200 X-Mimetrack: Serialize by Router on D01ML244/01/M/IBM(Release 5.0.7 |March 21, 2001) at 04/04/2001 10:02:33 AM MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 86834.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod X-Original-Cc: Mike Kravetz <mkrav...@sequent.com>, Fabio Riccardi <fa...@chromium.com>, Linux Kernel List <linux-ker...@vger.kernel.org> X-Original-Date: Wed, 4 Apr 2001 10:03:10 -0400 X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: <mi...@elte.hu> Lines: 85 I grant you that the code is not as clean as the current scheduler, so maybe you missed that part. For the priority scheduler: Yes the task_to_qid assumes a NON-affinity (no cpu, no mm) to determine the list index to where the task has to be enqueued. However, if you wonder down to the search_range section of the scheduler, you see that we do add the "+1" for equal mm. All I did here was take the goodness() function a part for search_range in order to insert some extra code that speeds up the code. Again, I don't want to lean to hard on the priority scheduler, because we only did this to have a reference point regarding lock contention and lock hold. This is stuff that many operating systems did years ago, most of which have moved on to add MQ to that. So that is what the LSE scheduling team is pushing. I understand the dilemma that the Linux scheduler is in, namely satisfy the low end at all cost. But I believe that in order for Linux to push into the high end we need to address the scalability of the current scheduler. Simply handwaving and declaring that lots of running tasks is a stupid idea doesn't get us there. If indeed you assume that there #running-tasks ~ #cpus then each task should alot a reasonable amount of work and any small overhead incurred should be amortizable. However as we contend if #running-tasks >> #cpus is a situation we need to deal with, the living with the current lock contention can really drag us down. Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) email: fran...@us.ibm.com (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003 Ingo Molnar <mi...@elte.hu>@elte.hu> on 04/04/2001 02:28:31 AM Please respond to <mi...@elte.hu> Sent by: <mi...@elte.hu> To: Mike Kravetz <mkrav...@sequent.com> cc: Hubertus Franke/Watson/IBM@IBMUS, Fabio Riccardi <fa...@chromium.com>, Linux Kernel List <linux-ker...@vger.kernel.org> Subject: Re: a quest for a better scheduler On Tue, 3 Apr 2001, Mike Kravetz wrote: > Our 'priority queue' implementation uses almost the same goodness > function as the current scheduler. The main difference between our > 'priority queue' scheduler and the current scheduler is the structure > of the runqueue. We break up the single runqueue into a set of > priority based runqueues. The 'goodness' of a task determines what > sub-queue a task is placed in. Tasks with higher goodness values are > placed in higher priority queues than tasks with lower goodness > values. [...] we are talking about the same thing, re-read my mail. this design assumes that 'goodness' is constant in the sense that it does not depend on the previous process. and no, your are not using the same goodness() function, you omitted the prev->mm == next->mm component to goodness(), due to this design limitation. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Date: Wed, 04 Apr 2001 16:40:08 +0200 From: Ingo Molnar <mi...@elte.hu> Reply-To: <mi...@elte.hu> Subject: Re: a quest for a better scheduler In-Reply-To: <OF401BD38B.CF3B1E9F-ON85256A24.0048543A@pok.ibm.com> Message-ID: <Pine.LNX.4.30.0104041527190.5382-100000@elte.hu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 77167.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-02!supernews.com!bignews.mediaways.net! newsfeed.germany.net!newsfeed01.sul.t-online.de!newsfeed00.sul.t-online.de! t-online.de!bofh.it!robomod References: <OF401BD38B.CF3B1E9F-ON85256A24.0048543A@pok.ibm.com> X-Original-Cc: Mike Kravetz <mkrav...@sequent.com>, Fabio Riccardi <fa...@chromium.com>, Linux Kernel List <linux-ker...@vger.kernel.org>, <lse-t...@lists.sourceforge.net> X-Original-Date: Wed, 4 Apr 2001 15:34:22 +0200 (CEST) X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: Hubertus Franke <fran...@us.ibm.com> Lines: 24 On Wed, 4 Apr 2001, Hubertus Franke wrote: > Another point to raise is that the current scheduler does a exhaustive > search for the "best" task to run. It touches every process in the > runqueue. this is ok if the runqueue length is limited to a very small > multiple of the #cpus. [...] indeed. The current scheduler handles UP and SMP systems, up to 32 (perhaps 64) CPUs efficiently. Agressively NUMA systems need a different approach anyway in many other subsystems too, Kanoj is doing some scheduler work in that area. but the original claim was that the scheduling of thousands of runnable processes (which is not equal to having thousands of sleeping processes) must perform well - which is a completely different issue. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Date: Wed, 04 Apr 2001 16:40:05 +0200 From: Ingo Molnar <mi...@elte.hu> Reply-To: <mi...@elte.hu> Subject: Re: a quest for a better scheduler In-Reply-To: <OFB30A8B18.2E3AD16C-ON85256A24.004BD696@pok.ibm.com> Message-ID: <Pine.LNX.4.30.0104041518540.5382-100000@elte.hu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 77157.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-02!supernews.com!news.tele.dk! 194.25.134.62!newsfeed00.sul.t-online.de!t-online.de!bofh.it!robomod References: <OFB30A8B18.2E3AD16C-ON85256A24.004BD696@pok.ibm.com> X-Original-Cc: Mike Kravetz <mkrav...@sequent.com>, Fabio Riccardi <fa...@chromium.com>, Linux Kernel List <linux-ker...@vger.kernel.org> X-Original-Date: Wed, 4 Apr 2001 15:23:34 +0200 (CEST) X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: Hubertus Franke <fran...@us.ibm.com> Lines: 22 On Wed, 4 Apr 2001, Hubertus Franke wrote: > I understand the dilemma that the Linux scheduler is in, namely > satisfy the low end at all cost. [...] nope. The goal is to satisfy runnable processes in the range of NR_CPUS. You are playing word games by suggesting that the current behavior prefers 'low end'. 'thousands of runnable processes' is not 'high end' at all, it's 'broken end'. Thousands of runnable processes are the sign of a broken application design, and 'fixing' the scheduler to perform better in that case is just fixing the symptom. [changing the scheduler to perform better in such situations is possible too, but all solutions proposed so far had strings attached.] Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Date: Wed, 04 Apr 2001 16:40:06 +0200 From: Ingo Molnar <mi...@elte.hu> Reply-To: <mi...@elte.hu> Subject: Re: a quest for a better scheduler In-Reply-To: <OF401BD38B.CF3B1E9F-ON85256A24.0048543A@pok.ibm.com> Message-ID: <Pine.LNX.4.30.0104041524010.5382-100000@elte.hu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 77160.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod References: <OF401BD38B.CF3B1E9F-ON85256A24.0048543A@pok.ibm.com> X-Original-Cc: Mike Kravetz <mkrav...@sequent.com>, Fabio Riccardi <fa...@chromium.com>, Linux Kernel List <linux-ker...@vger.kernel.org>, <lse-t...@lists.sourceforge.net> X-Original-Date: Wed, 4 Apr 2001 15:25:22 +0200 (CEST) X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: Hubertus Franke <fran...@us.ibm.com> Lines: 21 On Wed, 4 Apr 2001, Hubertus Franke wrote: > It is not clear that yielding the same decision as the current > scheduler is the ultimate goal to shoot for, but it allows > comparision. obviously the current scheduler is not cast into stone, it never was, never will be. but determining whether the current behavior is possible in a different scheduler design is sure a good metric of how flexible that different scheduler design is. Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Date: Wed, 04 Apr 2001 17:20:07 +0200 From: Andrea Arcangeli <and...@suse.de> Subject: Re: a quest for a better scheduler Message-ID: <20010404170846.V20911@athlon.random> References: <OF401BD38B.CF3B1E9F-ON85256A24.0048543A@pok.ibm.com> <Pine.LNX.4.30.0104041527190.5382-100000@elte.hu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <Pine.LNX.4.30.0104041527190.5382-100000@elte.hu>; from mingo@elte.hu on Wed, Apr 04, 2001 at 03:34:22PM +0200 X-Gnupg-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 79354.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-02!supernews.com!news.tele.dk!194.25.134.62! newsfeed00.sul.t-online.de!t-online.de!bofh.it!robomod X-Original-Cc: Hubertus Franke <fran...@us.ibm.com>, Mike Kravetz <mkrav...@sequent.com>, Fabio Riccardi <fa...@chromium.com>, Linux Kernel List <linux-ker...@vger.kernel.org>, lse-t...@lists.sourceforge.net X-Original-Date: Wed, 4 Apr 2001 17:08:47 +0200 X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: Ingo Molnar <mi...@elte.hu> Lines: 41 On Wed, Apr 04, 2001 at 03:34:22PM +0200, Ingo Molnar wrote: > > On Wed, 4 Apr 2001, Hubertus Franke wrote: > > > Another point to raise is that the current scheduler does a exhaustive > > search for the "best" task to run. It touches every process in the > > runqueue. this is ok if the runqueue length is limited to a very small > > multiple of the #cpus. [...] > > indeed. The current scheduler handles UP and SMP systems, up to 32 > (perhaps 64) CPUs efficiently. Agressively NUMA systems need a different > approach anyway in many other subsystems too, Kanoj is doing some > scheduler work in that area. I didn't seen anything from Kanoj but I did something myself for the wildfire: ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/10_numa-sched-1 this is mostly an userspace issue, not really intended as a kernel optimization (however it's also partly a kernel optimization). Basically it splits the load of the numa machine into per-node load, there can be unbalanced load across the nodes but fairness is guaranteed inside each node. It's not extremely well tested but benchmarks were ok and it is at least certainly stable. However Ingo consider that in a 32-way if you don't have at least 32 tasks running all the time _always_ you're really stupid paying such big money for nothing ;). So the fact the scheduler is optimized for 1/2 tasks running all the time is not nearly enough for those machines (and of course also the scheduling rate automatically increases linearly with the increase of the number of cpus). Now it's perfectly fine that we don't ask the embedded and desktop guys to pay for that, but a kernel configuration option to select an algorithm that scales would be a good idea IMHO. The above patch just adds a CONFIG_NUMA_SCHED. The scalable algorithm can fit into it and nobody will be hurted by that (CONFIG_NUMA_SCHED cannot even be selected by x86 compiles). Andrea - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Date: Wed, 04 Apr 2001 17:20:08 +0200 From: Andrea Arcangeli <and...@suse.de> Subject: Re: a quest for a better scheduler Message-ID: <20010404171227.W20911@athlon.random> References: <OFB30A8B18.2E3AD16C-ON85256A24.004BD696@pok.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <OFB30A8B18.2E3AD16C-ON85256A24.004BD696@pok.ibm.com>; from frankeh@us.ibm.com on Wed, Apr 04, 2001 at 10:03:10AM -0400 X-Gnupg-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 79358.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-02!supernews.com!news.tele.dk!195.158.233.21! news1.ebone.net!news.ebone.net!skynet.be!bofh.it!robomod X-Original-Cc: mi...@elte.hu, Mike Kravetz <mkrav...@sequent.com>, Fabio Riccardi <fa...@chromium.com>, Linux Kernel List <linux-ker...@vger.kernel.org> X-Original-Date: Wed, 4 Apr 2001 17:12:27 +0200 X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: Hubertus Franke <fran...@us.ibm.com> Lines: 13 On Wed, Apr 04, 2001 at 10:03:10AM -0400, Hubertus Franke wrote: > I understand the dilemma that the Linux scheduler is in, namely satisfy > the low end at all cost. [..] We can satisfy the low end by making the numa scheduler at compile time (that's what I did in my patch at least). Andrea - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Importance: Normal Subject: Re: a quest for a better scheduler X-Mailer: Lotus Notes Release 5.0.5 September 22, 2000 Message-ID: <OF44F2A6FB.BD0D4064-ON85256A24.00543C9D@pok.ibm.com> From: "Hubertus Franke" <fran...@us.ibm.com> Date: Wed, 04 Apr 2001 17:40:06 +0200 X-Mimetrack: Serialize by Router on D01ML244/01/M/IBM(Release 5.0.7 |March 21, 2001) at 04/04/2001 11:30:01 AM MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 79665.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-02!supernews.com!news.tele.dk! 195.158.233.21!news1.ebone.net!news.ebone.net!skynet.be!bofh.it!robomod X-Original-Cc: Linux Kernel List <linux-ker...@vger.kernel.org>, lse-t...@lists.sourceforge.net X-Original-Date: Wed, 4 Apr 2001 11:28:14 -0400 X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: Andrea Arcangeli <and...@suse.de> Lines: 104 Yes, Andrea. We actually already went a step further. We treat the scheduler as a single entity, rather than splitting it up. Based on the MQ scheduler we do the balancing across all nodes at reschedule_idle time. We experimented to see whether only looking for idle tasks remotely is a good idea, but it bloats the code. Local scheduling decisions are limited to a set of cpus, which could coincide with the cpus of one node, or if desirable on smaller granularities. In addition we implemented a global load balancing scheme that ensures that load is equally distributed across all run queues. This is made a loadable module, so you can plug in what ever you want. I grant in NUMA it might actually be desirable to separate schedulers completely (we can do that trivially in reschedule_idle), but you need loadbalancing at some point. Here is the list of patches: MultiQueue Scheduler: http://lse.sourceforge.net/scheduling/2.4.1.mq1-sched Pooling Extension: http://lse.sourceforge.net/scheduling/LB/2.4.1-MQpool LoadBalancing: http://lse.sourceforge.net/scheduling/LB/loadbalance.c Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) , OS-PIC (Chair) email: fran...@us.ibm.com (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003 Andrea Arcangeli <and...@suse.de> on 04/04/2001 11:08:47 AM To: Ingo Molnar <mi...@elte.hu> cc: Hubertus Franke/Watson/IBM@IBMUS, Mike Kravetz <mkrav...@sequent.com>, Fabio Riccardi <fa...@chromium.com>, Linux Kernel List <linux-ker...@vger.kernel.org>, lse-t...@lists.sourceforge.net Subject: Re: a quest for a better scheduler On Wed, Apr 04, 2001 at 03:34:22PM +0200, Ingo Molnar wrote: > > On Wed, 4 Apr 2001, Hubertus Franke wrote: > > > Another point to raise is that the current scheduler does a exhaustive > > search for the "best" task to run. It touches every process in the > > runqueue. this is ok if the runqueue length is limited to a very small > > multiple of the #cpus. [...] > > indeed. The current scheduler handles UP and SMP systems, up to 32 > (perhaps 64) CPUs efficiently. Agressively NUMA systems need a different > approach anyway in many other subsystems too, Kanoj is doing some > scheduler work in that area. I didn't seen anything from Kanoj but I did something myself for the wildfire: ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/10_numa-sched-1 this is mostly an userspace issue, not really intended as a kernel optimization (however it's also partly a kernel optimization). Basically it splits the load of the numa machine into per-node load, there can be unbalanced load across the nodes but fairness is guaranteed inside each node. It's not extremely well tested but benchmarks were ok and it is at least certainly stable. However Ingo consider that in a 32-way if you don't have at least 32 tasks running all the time _always_ you're really stupid paying such big money for nothing ;). So the fact the scheduler is optimized for 1/2 tasks running all the time is not nearly enough for those machines (and of course also the scheduling rate automatically increases linearly with the increase of the number of cpus). Now it's perfectly fine that we don't ask the embedded and desktop guys to pay for that, but a kernel configuration option to select an algorithm that scales would be a good idea IMHO. The above patch just adds a CONFIG_NUMA_SCHED. The scalable algorithm can fit into it and nobody will be hurted by that (CONFIG_NUMA_SCHED cannot even be selected by x86 compiles). Andrea - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Importance: Normal Subject: Re: a quest for a better scheduler X-Mailer: Lotus Notes Release 5.0.5 September 22, 2000 Message-ID: <OF63B30B4E.49FE1F2A-ON85256A24.0050715E@pok.ibm.com> From: "Hubertus Franke" <fran...@us.ibm.com> Date: Wed, 04 Apr 2001 17:50:04 +0200 X-Mimetrack: Serialize by Router on D01ML244/01/M/IBM(Release 5.0.7 |March 21, 2001) at 04/04/2001 11:37:00 AM MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 78517.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod X-Original-Cc: Linux Kernel List <linux-ker...@vger.kernel.org>, lse-t...@lists.sourceforge.net X-Original-Date: Wed, 4 Apr 2001 11:36:24 -0400 X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: <mi...@elte.hu> Lines: 130 You imply that high end means thousands of processes, simply because we have shown that in our graphs as an asymptotic end. No, it could mean 5*#cpus and that is not all that absurd. This could happen with a spike in demand. TUX is not the greatest example to use, because it does static webpage serving and is hence tied into the file cache. If you move up the food chain, where middleware is active and things are a bit more complicated than sending stuff out of the filecache, having a bunch of threads hanging around to deal with the spike in demand is common practive, although you think its bad. Now coming again to MQ (forget about priority list from now on). When we scan either the local or the realtime queues we do use goodness value. So we have the same flexibility as the current scheduler if it comes to goodness() flexibility and future improvements. For remote stealing we do use na_goodness to compare for a better process to steal. Hence we would loose the "+1" information here, nevertheless, once a decision has been made, we still use preemption verification with goodness. Eitherway, being off by "+1" for regular tasks once in a while is no big deal, because this problem already exists today. While walking the runqueue, another processor can either update the counter value of task (ok that's covered by can_schedule) or can run recalculate, which makes the decision that one is about to make wrong from the point of always running the best. But that's ok, because counter, nice etc. are approximations anyway. Through in PROC_CHANGE_PENALTY and you have a few knobs that are used to control interactivity and throughput. If you look at some of the results with our reflex benchmark. For the low thread count we basically show improved performance on the 2,4,5,6,7,8 way system if #threads < #cpus, they all show improvements. Look at the following numbers running the reflex benchmark, left most columns are number of threads, with typically 1/2 of them runnable. You can clearly see that the priority list suffers from overhead, but MQ is beating vanilla pretty much everywhere. Again, this is due because of rapid scheduler invocation and resulting lock contention. 2-way 2.4.1 2.4.1-mq1 2.4.1-prbit 4 6.725 4.691 7.387 8 6.326 4.766 6.421 12 6.838 5.233 6.431 16 7.13 5.415 7.29 4-way 2.4.1 2.4.1-mq1 2.4.1-prbit 4 9.42 7.873 10.592 8 8.143 3.799 8.691 12 7.877 3.537 8.101 16 7.688 2.953 7.144 6-way 2.4.1 2.4.1-mq1 2.4.1-prbit 4 9.595 7.88 10.358 8 9.703 7.278 10.523 12 10.016 4.652 10.985 16 9.882 3.629 10.525 8-way 2.4.1 2.4.1-mq1 2.4.1-prbit 4 9.804 8.033 10.548 8 10.436 5.783 11.475 12 10.925 6.787 11.646 16 11.426 5.048 11.877 20 11.438 3.895 11.633 24 11.457 3.304 11.347 28 11.495 3.073 11.09 32 11.53 2.944 10.898 Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) email: fran...@us.ibm.com (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003 Ingo Molnar <mi...@elte.hu>@elte.hu> on 04/04/2001 09:23:34 AM Please respond to <mi...@elte.hu> Sent by: <mi...@elte.hu> To: Hubertus Franke/Watson/IBM@IBMUS cc: Mike Kravetz <mkrav...@sequent.com>, Fabio Riccardi <fa...@chromium.com>, Linux Kernel List <linux-ker...@vger.kernel.org> Subject: Re: a quest for a better scheduler On Wed, 4 Apr 2001, Hubertus Franke wrote: > I understand the dilemma that the Linux scheduler is in, namely > satisfy the low end at all cost. [...] nope. The goal is to satisfy runnable processes in the range of NR_CPUS. You are playing word games by suggesting that the current behavior prefers 'low end'. 'thousands of runnable processes' is not 'high end' at all, it's 'broken end'. Thousands of runnable processes are the sign of a broken application design, and 'fixing' the scheduler to perform better in that case is just fixing the symptom. [changing the scheduler to perform better in such situations is possible too, but all solutions proposed so far had strings attached.] Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Message-ID: <XFMail.20010404091254.davidel@xmailserver.org> X-Mailer: XFMail 1.4.7 on Linux Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: <Pine.LNX.4.30.0104040835470.1708-100000@elte.hu> Date: Wed, 04 Apr 2001 18:20:08 +0200 From: Davide Libenzi <davi...@xmailserver.org> Subject: Re: a quest for a better scheduler Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 11263.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-03!supernews.com!cyclone-sjo1.usenetserver.com! news-out-sjo.usenetserver.com!feed2.onemain.com!feed1.onemain.com! news-spur1.maxwell.syr.edu!news.maxwell.syr.edu!newsfeed.icl.net!skynet.be!bofh.it!robomod References: <Pine.LNX.4.30.0104040835470.1708-100000@elte.hu> X-Original-Cc: Linus Torvalds <torva...@transmeta.com>, Alan Cox <a...@lxorguk.ukuu.org.uk>, Linux Kernel List <linux-ker...@vger.kernel.org>, fran...@us.ibm.com, Mike Kravetz <mkrav...@sequent.com>, Fabio Riccardi <fa...@chromium.com> X-Original-Date: Wed, 04 Apr 2001 09:12:54 -0700 (PDT) X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: Ingo Molnar <mi...@elte.hu> Lines: 60 On 04-Apr-2001 Ingo Molnar wrote: > > On Tue, 3 Apr 2001, Fabio Riccardi wrote: > >> I've spent my afternoon running some benchmarks to see if MQ patches >> would degrade performance in the "normal case". > > no doubt priority-queue can run almost as fast as the current scheduler. > What i'm worried about is the restriction of the 'priority' of processes, > it cannot depend on previous processes (and other 'current state') > anymore. > > to so we have two separate issues: > >#1: priority-queue: has the fundamental goodness() design limitation. > >#2: per-CPU-runqueues: changes semantics, makes scheduler less > effective due to nonglobal decisions. > > about #1: while right now the prev->mm rule appears to be a tiny issue (it > might not affect performance significantly), but forbidding it by > hardcoding the assumption into data structures is a limitation of *future* > goodness() functions. Eg. with the possible emergence of CPU-level > threading and other, new multiprocessing technologies, this could be a > *big* mistake. This is not correct Ingo. I haven't seen the HP code but if You store processes in slots S : S = FS( goodness(p, p->processor, p->mm) ) and You start scanning from the higher slots, as soon as you find a task with a goodness G' that is equal to the max goodness in slot You can choose that process to run. Again, if You haven't found such a goodness during the slot scan but You've found a task with a goodness G' : G' >= SG - DD where : SG = max slot goodness DD = SG(i) - SG(i - 1) You can select that task as the next to spin. This was the behaviour that was implemented in my scheduler patch about 2 years ago. Beside this, I this that with such loads We've more serious problem to face with inside the kernel. - Davide - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Importance: Normal Subject: Re: a quest for a better scheduler X-Mailer: Lotus Notes Release 5.0.5 September 22, 2000 Message-ID: <OFD82F9D22.584E5FF7-ON85256A24.005DB7E0@pok.ibm.com> From: "Hubertus Franke" <fran...@us.ibm.com> Date: Wed, 04 Apr 2001 19:30:06 +0200 X-Mimetrack: Serialize by Router on D01ML244/01/M/IBM(Release 5.0.7 |March 21, 2001) at 04/04/2001 01:17:40 PM MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 95793.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod X-Original-Cc: Linux Kernel List <linux-ker...@vger.kernel.org> X-Original-Date: Wed, 4 Apr 2001 13:17:38 -0400 X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: Davide Libenzi <davi...@xmailserver.org> Lines: 103 Well put, this how we can eliminate searching all bins or lists and that's how we do it under. http://lse.sourceforge.net/scheduling/2.4.1-pre8-prioSched. If you have a list per priority level, then you can even pick the first one you find if its on the same level. That's what I tried in a more recent implementation. Also combined that with using a bitmask to represent non-empty tasks. Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) , OS-PIC (Chair) email: fran...@us.ibm.com (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003 Davide Libenzi <davi...@xmailserver.org>@ewok.dev.mycio.com on 04/04/2001 12:12:54 PM Sent by: davi...@ewok.dev.mycio.com To: Ingo Molnar <mi...@elte.hu> cc: Linus Torvalds <torva...@transmeta.com>, Alan Cox <a...@lxorguk.ukuu.org.uk>, Linux Kernel List <linux-ker...@vger.kernel.org>, Hubertus Franke/Watson/IBM@IBMUS, Mike Kravetz <mkrav...@sequent.com>, Fabio Riccardi <fa...@chromium.com> Subject: Re: a quest for a better scheduler On 04-Apr-2001 Ingo Molnar wrote: > > On Tue, 3 Apr 2001, Fabio Riccardi wrote: > >> I've spent my afternoon running some benchmarks to see if MQ patches >> would degrade performance in the "normal case". > > no doubt priority-queue can run almost as fast as the current scheduler. > What i'm worried about is the restriction of the 'priority' of processes, > it cannot depend on previous processes (and other 'current state') > anymore. > > to so we have two separate issues: > >#1: priority-queue: has the fundamental goodness() design limitation. > >#2: per-CPU-runqueues: changes semantics, makes scheduler less > effective due to nonglobal decisions. > > about #1: while right now the prev->mm rule appears to be a tiny issue (it > might not affect performance significantly), but forbidding it by > hardcoding the assumption into data structures is a limitation of *future* > goodness() functions. Eg. with the possible emergence of CPU-level > threading and other, new multiprocessing technologies, this could be a > *big* mistake. This is not correct Ingo. I haven't seen the HP code but if You store processes in slots S : S = FS( goodness(p, p->processor, p->mm) ) and You start scanning from the higher slots, as soon as you find a task with a goodness G' that is equal to the max goodness in slot You can choose that process to run. Again, if You haven't found such a goodness during the slot scan but You've found a task with a goodness G' : G' >= SG - DD where : SG = max slot goodness DD = SG(i) - SG(i - 1) You can select that task as the next to spin. This was the behaviour that was implemented in my scheduler patch about 2 years ago. Beside this, I this that with such loads We've more serious problem to face with inside the kernel. - Davide - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Date: Wed, 04 Apr 2001 19:40:06 +0200 From: Mike Kravetz <mkrav...@sequent.com> Subject: Re: a quest for a better scheduler Message-ID: <20010404102721.B1118@w-mikek2.sequent.com> References: <20010403121308.A1054@w-mikek2.sequent.com> <Pine.LNX.4.30.0104032024290.9285-100000@elte.hu> <20010403154314.E1054@w-mikek2.sequent.com> <3ACA683A.89D24DED@chromium.com> <20010403194700.A1024@w-mikek2.sequent.com> <3ACAA164.BDFF9B4C@chromium.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2i In-Reply-To: <3ACAA164.BDFF9B4C@chromium.com>; from fabio@chromium.com on Tue, Apr 03, 2001 at 09:21:57PM -0700 Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 85227.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod X-Original-Cc: Ingo Molnar <mi...@elte.hu>, fran...@us.ibm.com, Linux Kernel List <linux-ker...@vger.kernel.org>, Alan Cox <a...@lxorguk.ukuu.org.uk> X-Original-Date: Wed, 4 Apr 2001 10:27:21 -0700 X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: Fabio Riccardi <fa...@chromium.com> Lines: 48 On Tue, Apr 03, 2001 at 09:21:57PM -0700, Fabio Riccardi wrote: > I was actually suspecting that the extra lines in your patch were there for a > reason :) > > A few questions: > > What is the real impact of a (slight) change in scheduling semantics? > > Under which situation one should notice a difference? I assume your questions are directed at the difference between local and global scheduling decisions. As with most things the impact of these differences depends on workload. Most multi-queue scheduler implementations make local scheduling decisions and depend on load balancing for system wide fairness. Schedulers which make global decisions handle load balancing via their global policy. The HP multi-queue implementation you are using performs no load balancing. Therefore, in a worst case situation you could have low priority tasks sharing one CPU while high priority tasks are sharing another. To be fair, I have talked to the HP people and this scheduler was never intended to be a general purpose solution. Therefore, it makes little sense to comment its merits as such. > As you state in your papers the global decision comes with a cost, > is it worth it? My primary motivation for attempting to perform the same global decisions as the current scheduler was so that meaningful comparisons could be made. Also, by using the same global decision policy I was able to avoid the issue of load balancing. In most multi-queue implementations, load balancing algorithms take considerable effort to get working in a reasonable well performing manner. > > Could you make a port of your thing on recent kernels? There is a 2.4.2 patch on the web page. I'll put out a 2.4.3 patch as soon as I get some time. -- Mike Kravetz mkrav...@sequent.com IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Importance: Normal Subject: Re: a quest for a better scheduler X-Mailer: Lotus Notes Release 5.0.5 September 22, 2000 Message-ID: <OFE7B61E08.EBBC35C3-ON85256A24.006740B8@pok.ibm.com> From: "Hubertus Franke" <fran...@us.ibm.com> Date: Wed, 04 Apr 2001 21:20:04 +0200 X-Mimetrack: Serialize by Router on D01ML244/01/M/IBM(Release 5.0.7 |March 21, 2001) at 04/04/2001 03:06:57 PM MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 67962.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod X-Original-Cc: Linux Kernel List <linux-ker...@vger.kernel.org>, lse-t...@lists.sourceforge.net X-Original-Date: Wed, 4 Apr 2001 15:06:02 -0400 X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: Mark Hahn <h...@coffee.psychology.mcmaster.ca> Lines: 65 I give you a concrete example: Running DB2 on an SMP system. In DB2 there is a processes/thread pool that is sized based on memory and numcpus. People tell me that the size of this pool is in the order of 100s for an 8-way system with reasonable sized database. These <maxagents> determine the number of agents that can simultaneously execute an SQL statement. Requests are flying in for transactions (e.g. driven by TPC-W like applications). The agents are grepped from the pool and concurrently fire the SQL transactions. Assuming that there is enough concurrency in the database, there is no reason to believe that the majority of those active agents is not effectively running. TPC-W loads have observed 100 of active transactions at a time. Ofcourse limiting the number of agents would reduce concurrently running tasks, but would limit the responsiveness of the system. Implementing a database in the kernel ala TUX doesn't seem to be the right approach either (complexity, fault containment, ...) Hope that is one example people accept. I can dig up some information on WebSphere Applications. I'd love to hear from some other applications that fall into a similar category as the above and substantiate a bit the need for 100s of running processes, without claiming that the application is broke. Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) , OS-PIC (Chair) email: fran...@us.ibm.com (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003 Mark Hahn <h...@coffee.psychology.mcmaster.ca> on 04/04/2001 02:28:42 PM To: Hubertus Franke/Watson/IBM@IBMUS cc: Subject: Re: a quest for a better scheduler > ok if the runqueue length is limited to a very small multiple of the #cpus. > But that is not what high end server systems encounter. do you have some example of this in mind? so far, noone has actually produced an example of a "high end" server that has long runqueues. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Date: Thu, 05 Apr 2001 00:30:10 +0200 From: Tim Wright <t...@splhi.com> Subject: Re: a quest for a better scheduler Message-ID: <20010404151632.A2144@kochanski> Reply-To: t...@splhi.com Mail-Followup-To: Ingo Molnar <mi...@elte.hu>, Hubertus Franke <fran...@us.ibm.com>, Mike Kravetz <mkrav...@sequent.com>, Fabio Riccardi <fa...@chromium.com>, Linux Kernel List <linux-ker...@vger.kernel.org> References: <OFB30A8B18.2E3AD16C-ON85256A24.004BD696@pok.ibm.com> <Pine.LNX.4.30.0104041518540.5382-100000@elte.hu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.3.15i In-Reply-To: <Pine.LNX.4.30.0104041518540.5382-100000@elte.hu>; from mingo@elte.hu on Wed, Apr 04, 2001 at 03:23:34PM +0200 Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 52219.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod X-Original-Cc: Hubertus Franke <fran...@us.ibm.com>, Mike Kravetz <mkrav...@sequent.com>, Fabio Riccardi <fa...@chromium.com>, Linux Kernel List <linux-ker...@vger.kernel.org> X-Original-Date: Wed, 4 Apr 2001 15:16:32 -0700 X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: Ingo Molnar <mi...@elte.hu> Lines: 59 On Wed, Apr 04, 2001 at 03:23:34PM +0200, Ingo Molnar wrote: > > On Wed, 4 Apr 2001, Hubertus Franke wrote: > > > I understand the dilemma that the Linux scheduler is in, namely > > satisfy the low end at all cost. [...] > > nope. The goal is to satisfy runnable processes in the range of NR_CPUS. > You are playing word games by suggesting that the current behavior prefers > 'low end'. 'thousands of runnable processes' is not 'high end' at all, > it's 'broken end'. Thousands of runnable processes are the sign of a > broken application design, and 'fixing' the scheduler to perform better in > that case is just fixing the symptom. [changing the scheduler to perform > better in such situations is possible too, but all solutions proposed so > far had strings attached.] > Ingo, you continue to assert this without giving much evidence to back it up. All the world is not a web server. If I'm running a large OLTP database with thousands of clients, it's not at all unreasonable to expect periods where several hundred (forget the thousands) want to be serviced by the database engine. That sounds like hundreds of schedulable entities be they processes or threads or whatever. This sort of load is regularly run on machine with 16-64 CPUs. Now I will admit that it is conceivable that you can design an application that finds out how many CPUs are available, creates threads to match that number and tries to divvy up the work between them using some combination of polling and asynchronous I/O etc. There are, however a number of problems with this approach: 1) It assumes that this is the only workload on the machine. If not it quickly becomes sub-optimal due to interactions between the workloads. This is a problem that the kernel scheduler does not suffer from. 2) It requires *every* application designer to design an effective scheduler into their application. I would submit that an effective scheduler is better situated in the operating system. 3) It is not a familiar programming paradigm to many Unix/Linux/POSIX programmers, so you have a sort of impedance mismatch going on. Since the proposed scheduler changes being talked about here have been shown to not hurt the "low end" and to dramatically improve the "high end", I fail to understand the hostility to the changes. I can understand that you do not feel that this is the correct way to architect an application, but if the changes don't hurt you, why sabotage changes that also allow a different method to work. There isn't one true way to do anything in computing. Tim -- Tim Wright - t...@splhi.com or t...@aracnet.com or twri...@us.ibm.com IBM Linux Technology Center, Beaverton, Oregon Interested in Linux scalability ? Look at http://lse.sourceforge.net/ "Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Date: Thu, 05 Apr 2001 01:10:05 +0200 From: Christopher Smith <x...@xman.org> Subject: Re: a quest for a better scheduler Message-ID: <18230000.986424894@hellman> In-Reply-To: <20010404151632.A2144@kochanski> X-Mailer: Mulberry/2.0.8 (Linux/x86 Demo) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 40219.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-03!supernews.com!bofh.it!robomod References: <20010404151632.A2144@kochanski> X-Original-Cc: Hubertus Franke <fran...@us.ibm.com>, Mike Kravetz <mkrav...@sequent.com>, Fabio Riccardi <fa...@chromium.com>, Linux Kernel List <linux-ker...@vger.kernel.org> X-Original-Date: Wed, 04 Apr 2001 15:54:54 -0700 X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: t...@splhi.com, Ingo Molnar <mi...@elte.hu> Lines: 54 --On Wednesday, April 04, 2001 15:16:32 -0700 Tim Wright <t...@splhi.com> wrote: > On Wed, Apr 04, 2001 at 03:23:34PM +0200, Ingo Molnar wrote: >> nope. The goal is to satisfy runnable processes in the range of NR_CPUS. >> You are playing word games by suggesting that the current behavior >> prefers 'low end'. 'thousands of runnable processes' is not 'high end' >> at all, it's 'broken end'. Thousands of runnable processes are the sign >> of a broken application design, and 'fixing' the scheduler to perform >> better in that case is just fixing the symptom. [changing the scheduler >> to perform better in such situations is possible too, but all solutions >> proposed so far had strings attached.] > > Ingo, you continue to assert this without giving much evidence to back it > up. All the world is not a web server. If I'm running a large OLTP > database with thousands of clients, it's not at all unreasonable to > expect periods where several hundred (forget the thousands) want to be > serviced by the database engine. That sounds like hundreds of schedulable > entities be they processes or threads or whatever. This sort of load is > regularly run on machine with 16-64 CPUs. Actually, it's not just OLTP, anytime you are doing time sharing between hundreds of users (something POSIX systems are supposed to be good at) this will happen. > Now I will admit that it is conceivable that you can design an > application that finds out how many CPUs are available, creates threads > to match that number and tries to divvy up the work between them using > some combination of polling and asynchronous I/O etc. There are, however > a number of problems with this approach: Actually, one way to semi-support this approach is to implement many-to-many threads as per the Solaris approach. This also requires significant hacking of both the kernel and the runtime, and certainly is significantly more error prone than trying to write a flexible scheduler. One problem you didn't highlight that even the above case does not happily identify is that for security reasons you may very well need each user's requests to take place in a different process. If you don't, then you have to implement a very well tested and secure user-level security mechanism to ensure things like privacy (above and beyond the time-sharing). The world is filled with a wide variety of types of applications, and unless you know two programming approaches are functionaly equivalent (and event driven/polling I/O vs. tons of running processes are NOT), you shouldn't say one approach is "broken". You could say it's a "broken" approach to building web servers. Unfortunately, things like kernels and standard libraries should work well in the general case. --Chris - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Date: Fri, 06 Apr 2001 00:50:07 +0200 Subject: Re: a quest for a better scheduler Message-ID: <20010405153841.A2452@osdlab.org> References: <20010404151632.A2144@kochanski> <18230000.986424894@hellman> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.3.15i In-Reply-To: <18230000.986424894@hellman>; from x@xman.org on Wed, Apr 04, 2001 at 03:54:54PM -0700 From: "Timothy D. Witham" <woo...@osdlab.org> Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 35006.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-02!supernews.com!nntp-relay.ihug.net! ihug.co.nz!newsfeed00.sul.t-online.de!newsfeed01.sul.t-online.de!t-online.de! excite.it!bofh.it!robomod X-Original-Cc: woo...@osdlab.org X-Original-Date: Thu, 5 Apr 2001 15:38:41 -0700 X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: Linux Kernel List <linux-ker...@vger.kernel.org> Lines: 132 I have been following this thread and thinking that everybody has some truth in what they are saying but with the absence of a repeatable test environment there really isn't a way of arriving at a data driven decision. Given the following conditions. 1)The diversity of the problem sets that developers are working on results in a real concern that another developers solution to their performance issue might result in a worsening of my performance situation. 2)Most of the developers have a given set of tests that they use when they make changes to determine if they change did what they want. 3)The Open Source Development Lab has the faculties for providing a large scale testing environment on several different configurations that remain the same over time. I propose that we work on setting up a straight forward test harness that allows developers to quickly test a kernel patch against various performance yardsticks. If we use the same set of hardware for the generation of the performance data from patch to patch there will be a correlation between the runs allowing for a real comparison of the different changes. The following should be taken only as a starting point. As for the base hardware configurations I propose: 2 way with 1 GB main memory and 2 IDE drives 4 way with 4 GB main memory and 16 disk drives 8 way with 8 GB main memory and 24 disk drives The types of applications that I have heard people express concern are: Web infrastructure: Apache TUX Jabber Corporate infrastructure: NFS raw TCP/IP performance Samba Database performance: Raw storage I/O performance OLTP workload General usage: compile speed (usually measured by kernel compile) The OSDL also has the ability to make some of the "for fee" benchmarks available for use for testing that is done onsite (network access to OSDL machines counts as onsite) so that people don't have to purchase individual copies. Things like SECMAIL2001, SPECSFS2.0 and SEPCWEB99 come to mind. Comments, suggestions, volunteers? -- Timothy D. Witham - Lab Director - woo...@osdlab.org Open Source Development Lab Inc - A non-profit corporation 15275 SW Koll Parkway - Suite H - Beaverton OR, 97006 (503)-626-2455 x11 (office) (503)-702-2871 (cell) (503)-626-2455 (fax) On Wed, Apr 04, 2001 at 03:54:54PM -0700, Christopher Smith wrote: > --On Wednesday, April 04, 2001 15:16:32 -0700 Tim Wright <t...@splhi.com> > wrote: > > On Wed, Apr 04, 2001 at 03:23:34PM +0200, Ingo Molnar wrote: > >> nope. The goal is to satisfy runnable processes in the range of NR_CPUS. > >> You are playing word games by suggesting that the current behavior > >> prefers 'low end'. 'thousands of runnable processes' is not 'high end' > >> at all, it's 'broken end'. Thousands of runnable processes are the sign > >> of a broken application design, and 'fixing' the scheduler to perform > >> better in that case is just fixing the symptom. [changing the scheduler > >> to perform better in such situations is possible too, but all solutions > >> proposed so far had strings attached.] > > > > Ingo, you continue to assert this without giving much evidence to back it > > up. All the world is not a web server. If I'm running a large OLTP > > database with thousands of clients, it's not at all unreasonable to > > expect periods where several hundred (forget the thousands) want to be > > serviced by the database engine. That sounds like hundreds of schedulable > > entities be they processes or threads or whatever. This sort of load is > > regularly run on machine with 16-64 CPUs. > > Actually, it's not just OLTP, anytime you are doing time sharing between > hundreds of users (something POSIX systems are supposed to be good at) this > will happen. > > > Now I will admit that it is conceivable that you can design an > > application that finds out how many CPUs are available, creates threads > > to match that number and tries to divvy up the work between them using > > some combination of polling and asynchronous I/O etc. There are, however > > a number of problems with this approach: > > Actually, one way to semi-support this approach is to implement > many-to-many threads as per the Solaris approach. This also requires > significant hacking of both the kernel and the runtime, and certainly is > significantly more error prone than trying to write a flexible scheduler. > > One problem you didn't highlight that even the above case does not happily > identify is that for security reasons you may very well need each user's > requests to take place in a different process. If you don't, then you have > to implement a very well tested and secure user-level security mechanism to > ensure things like privacy (above and beyond the time-sharing). > > The world is filled with a wide variety of types of applications, and > unless you know two programming approaches are functionaly equivalent (and > event driven/polling I/O vs. tons of running processes are NOT), you > shouldn't say one approach is "broken". You could say it's a "broken" > approach to building web servers. Unfortunately, things like kernels and > standard libraries should work well in the general case. > > --Chris > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- Timothy D. Witham - Lab Director - woo...@osdlab.org Open Source Development Lab Inc - A non-profit corporation 15275 SW Koll Parkway - Suite H - Beaverton OR, 97006 (503)-626-2455 x11 (office) (503)-702-2871 (cell) (503)-626-2455 (fax) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Importance: Normal Subject: Re: a quest for a better scheduler X-Mailer: Lotus Notes Release 5.0.5 September 22, 2000 Message-ID: <OFE619E772.9BE9BE00-ON85256A25.007D7363@pok.ibm.com> From: "Hubertus Franke" <fran...@us.ibm.com> Date: Fri, 06 Apr 2001 01:10:06 +0200 X-Mimetrack: Serialize by Router on D01ML244/01/M/IBM(Release 5.0.7 |March 21, 2001) at 04/05/2001 06:59:17 PM MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: robo...@news.nic.it X-Mailing-List: linux-kernel@vger.kernel.org Approved: robo...@news.nic.it (1.20) NNTP-Posting-Host: 89636.anti-phl.bofh.it Newsgroups: linux.kernel Organization: linux.*_mail_to_news_unidirectional_gateway Path: supernews.google.com!sn-xit-02!supernews.com!nntp-relay.ihug.net! ihug.co.nz!newsfeed.germany.net!newsfeed01.sul.t-online.de! newsfeed00.sul.t-online.de!t-online.de!bofh.it!robomod X-Original-Cc: Linux Kernel List <linux-ker...@vger.kernel.org> X-Original-Date: Thu, 5 Apr 2001 19:01:01 -0400 X-Original-Sender: linux-kernel-ow...@vger.kernel.org X-Original-To: "Timothy D. Witham" <woo...@osdlab.org> Lines: 200 Exellent idea.... Assuming that you have set up these environments already, what would be a real treat is to get the average runqueue length at a given time, for instance every second or so, while running some of these more sophisticated server oriented applications that you mention. From that we can see which of the various assumption regarding runqueue length is holding up, when the runqueue is not empty. This would help the current discussion trememdously as we don't seem to be able to even agree on this. Simple bincount could do. If you want a kernel patch that counts every scheduling cycle I'll write it. Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) , OS-PIC (Chair) email: fran...@us.ibm.com (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003 "Timothy D. Witham" <woo...@osdlab.org>@vger.kernel.org on 04/05/2001 06:38:41 PM Sent by: linux-kernel-ow...@vger.kernel.org To: Linux Kernel List <linux-ker...@vger.kernel.org> cc: woo...@osdlab.org Subject: Re: a quest for a better scheduler I have been following this thread and thinking that everybody has some truth in what they are saying but with the absence of a repeatable test environment there really isn't a way of arriving at a data driven decision. Given the following conditions. 1)The diversity of the problem sets that developers are working on results in a real concern that another developers solution to their performance issue might result in a worsening of my performance situation. 2)Most of the developers have a given set of tests that they use when they make changes to determine if they change did what they want. 3)The Open Source Development Lab has the faculties for providing a large scale testing environment on several different configurations that remain the same over time. I propose that we work on setting up a straight forward test harness that allows developers to quickly test a kernel patch against various performance yardsticks. If we use the same set of hardware for the generation of the performance data from patch to patch there will be a correlation between the runs allowing for a real comparison of the different changes. The following should be taken only as a starting point. As for the base hardware configurations I propose: 2 way with 1 GB main memory and 2 IDE drives 4 way with 4 GB main memory and 16 disk drives 8 way with 8 GB main memory and 24 disk drives The types of applications that I have heard people express concern are: Web infrastructure: Apache TUX Jabber Corporate infrastructure: NFS raw TCP/IP performance Samba Database performance: Raw storage I/O performance OLTP workload General usage: compile speed (usually measured by kernel compile) The OSDL also has the ability to make some of the "for fee" benchmarks available for use for testing that is done onsite (network access to OSDL machines counts as onsite) so that people don't have to purchase individual copies. Things like SECMAIL2001, SPECSFS2.0 and SEPCWEB99 come to mind. Comments, suggestions, volunteers? -- Timothy D. Witham - Lab Director - woo...@osdlab.org Open Source Development Lab Inc - A non-profit corporation 15275 SW Koll Parkway - Suite H - Beaverton OR, 97006 (503)-626-2455 x11 (office) (503)-702-2871 (cell) (503)-626-2455 (fax) On Wed, Apr 04, 2001 at 03:54:54PM -0700, Christopher Smith wrote: > --On Wednesday, April 04, 2001 15:16:32 -0700 Tim Wright <t...@splhi.com> > wrote: > > On Wed, Apr 04, 2001 at 03:23:34PM +0200, Ingo Molnar wrote: > >> nope. The goal is to satisfy runnable processes in the range of NR_CPUS. > >> You are playing word games by suggesting that the current behavior > >> prefers 'low end'. 'thousands of runnable processes' is not 'high end' > >> at all, it's 'broken end'. Thousands of runnable processes are the sign > >> of a broken application design, and 'fixing' the scheduler to perform > >> better in that case is just fixing the symptom. [changing the scheduler > >> to perform better in such situations is possible too, but all solutions > >> proposed so far had strings attached.] > > > > Ingo, you continue to assert this without giving much evidence to back it > > up. All the world is not a web server. If I'm running a large OLTP > > database with thousands of clients, it's not at all unreasonable to > > expect periods where several hundred (forget the thousands) want to be > > serviced by the database engine. That sounds like hundreds of schedulable > > entities be they processes or threads or whatever. This sort of load is > > regularly run on machine with 16-64 CPUs. > > Actually, it's not just OLTP, anytime you are doing time sharing between > hundreds of users (something POSIX systems are supposed to be good at) this > will happen. > > > Now I will admit that it is conceivable that you can design an > > application that finds out how many CPUs are available, creates threads > > to match that number and tries to divvy up the work between them using > > some combination of polling and asynchronous I/O etc. There are, however > > a number of problems with this approach: > > Actually, one way to semi-support this approach is to implement > many-to-many threads as per the Solaris approach. This also requires > significant hacking of both the kernel and the runtime, and certainly is > significantly more error prone than trying to write a flexible scheduler. > > One problem you didn't highlight that even the above case does not happily > identify is that for security reasons you may very well need each user's > requests to take place in a different process. If you don't, then you have > to implement a very well tested and secure user-level security mechanism to > ensure things like privacy (above and beyond the time-sharing). > > The world is filled with a wide variety of types of applications, and > unless you know two programming approaches are functionaly equivalent (and > event driven/polling I/O vs. tons of running processes are NOT), you > shouldn't say one approach is "broken". You could say it's a "broken" > approach to building web servers. Unfortunately, things like kernels and > standard libraries should work well in the general case. > > --Chris > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- Timothy D. Witham - Lab Director - woo...@osdlab.org Open Source Development Lab Inc - A non-profit corporation 15275 SW Koll Parkway - Suite H - Beaverton OR, 97006 (503)-626-2455 x11 (office) (503)-702-2871 (cell) (503)-626-2455 (fax) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/