The Linux Scheduler And How It Handles More Cores [Hackaday]

Sometimes you read an article headline and you find yourself re-reading it a few times before diving into the article. This was definitely the case for a recent blog post by [The HFT Guy], where the claim was made that the Linux kernel has for fifteen years now been hardlocked into not scheduling for more than 8 cores. Obviously this caused a lot of double-checking and context discovery on both Hacker News and the Level 1 Techs forum. So what is going on exactly? Did the Linux developers make an egregious error more than a decade ago that has crippled Linux performance to this day?

What the blog author takes issue with is the claim, made in the Linux kernel code and documentation, that the base time slice scales with the number of CPUs (or cores). The post points to the commit that capped the number of CPUs taken into account at 8. So far so good, even if by this point quite a few readers had already jumped in to show that their Linux systems could definitely load more than 8 cores to 100%.

As pointed out by [sirn] on the Level 1 Techs forum, this limit was intentional, as discussed on the Linux Kernel Mailing List (LKML) in November and December of 2009. Essentially, as a few commenters in the Hacker News thread also noted, the granularity of task switching (time slices per second) should be higher with fewer cores, to give the impression of concurrency. That impression matters less with more cores, since more tasks really do run at the same time. Around the 8 CPU mark the returns diminish, and the overhead of task switching becomes the bigger concern.
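To make the capped scaling concrete, here is a minimal Python sketch of the idea. It is a simplified model and not kernel code: the function names are hypothetical, and the 6 ms base latency and the three scaling modes (none, linear, logarithmic) are assumptions based on how the CFS tunables are commonly described. Only the cap at 8 CPUs comes from the discussion above.

```python
# Simplified model of a time slice that scales with CPU count, up to a cap.
# Assumptions (not from the article): the 6 ms base value and the scaling modes.

BASE_LATENCY_MS = 6.0  # assumed base scheduling latency for a single CPU
CPU_CAP = 8            # the cap discussed in the article


def scaling_factor(online_cpus: int, mode: str = "log") -> int:
    """Return the multiplier applied to the base latency."""
    if online_cpus < 1:
        raise ValueError("online_cpus must be at least 1")
    cpus = min(online_cpus, CPU_CAP)  # CPUs beyond the cap are ignored
    if mode == "none":
        return 1
    if mode == "linear":
        return cpus
    if mode == "log":
        # 1 + floor(log2(cpus)), computed with integer arithmetic
        return 1 + (cpus.bit_length() - 1)
    raise ValueError(f"unknown scaling mode: {mode}")


def scaled_latency_ms(online_cpus: int, mode: str = "log") -> float:
    """Base latency multiplied by the capped scaling factor."""
    return BASE_LATENCY_MS * scaling_factor(online_cpus, mode)


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 128):
        print(f"{n:>3} CPUs -> {scaled_latency_ms(n):.0f} ms")
```

Under these assumptions, the logarithmic mode gives 6 ms for one CPU, 18 ms for four, and 24 ms for eight, and a 128-core machine gets the same 24 ms as an 8-core one. The cap only affects how long a task may run before being switched out; it says nothing about how many cores tasks can be scheduled on.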

That means that this ‘hardcoded limit’ was put in there on purpose back in 2009, based on solid empirical evidence using many-core workstations and servers. It also shows that writing good schedulers is hard, which is why the LKML is famous for its Scheduler Wars and why you can pick alternative schedulers if you compile your own kernel. The current Completely Fair Scheduler (CFS) is also likely going to be replaced in the Linux kernel with the EEVDF scheduler as the default.