New directions for Intel
Reporting from the Intel Software Conference 2011, held in Dubrovnik this April and based in part on an interview with James Reinders, Director of Software Products.
Despite the undoubted importance of companies like Apple and Microsoft, it is Intel that has most influenced the history of the microcomputer. Back in 1971 it kick-started the whole industry with the first commercial microprocessor, and a decade later the 8088 processor became the basis for the IBM PC. Then in 1982 the release of the 80286 created an architecture that made the likes of Microsoft Windows possible. Since then each successive wave of Intel processors has brought us ever-increasing clock speeds, effortlessly allowing our software to achieve more in less time.
But then, in 2004, Intel found itself unable to increase clock speed any further. As a result the company changed tack, realising that it could increase processing speed by doing more than one thing at a time, rather than trying to do one thing after another at ever-increasing speed. However this introduced a new problem: computer programs would only be able to take advantage of multi-core processors if they were rewritten to do so.
So ‘Think Parallel’ was born, as an annual event at which Intel could promote the cause of parallel programming. I reported from an early event held in Lisbon back in 2007 (see Think Parallel). Since then I have attended each year, culminating in the 2011 event held recently in Dubrovnik.
At the Lisbon event, Intel predicted that it would be producing processors containing four cores by the following year, and 64 cores by 2011. As it has turned out, while there are server parts out there with ten cores, even high-end desktops can boast only four-core processors.
Challenged on this, Reinders agreed that there had been a change of emphasis, particularly in the smartphone and tablet arena where Intel faces stiff competition from the likes of ARM. Here the most important factor has been power efficiency, which directly affects battery life. However Reinders feels that we’re getting to the point where power consumption is “good enough”, citing Clayton Christensen’s book The Innovator’s Dilemma, which suggests that once competing companies all produce products that are ‘good enough’ on one measure, the emphasis shifts to something else. Reinders’ argument is that the market will soon shift to performance, and that dual-core chips will become standard on mobile devices in the near future.
Power efficiency is also an issue on the server, but another issue here is bandwidth, particularly where multiple cores all access a shared memory area within a single chip. This is not so much of a problem where there are perhaps ten cores involved, but it becomes a bottleneck as core count increases. Server workloads tend to be highly parallel so there is more pressure to increase the core count on the server than there is on the client.
The solution came out of a project codenamed ‘Larrabee’ which was originally intended as a design for a high-end graphics processing unit (GPU). The idea behind the MIC (Many Integrated Cores) micro-architecture is to move beyond the shared bus used by current multi-core processors and instead provide something more like a network to allow more efficient transfer of data. As Reinders points out, “supercomputers that have thousands or even tens of thousands of processors don’t have a shared bus to memory – they use something more sophisticated.” Under MIC the cores do not communicate directly with memory but instead pass requests through a 512-bit bi-directional ring that is built into the chip itself.
Intel is expecting to release the first MIC-based processor, codenamed ‘Knights Corner’ and boasting at least 50 cores on a single chip, towards the end of 2012. Reinders went on to state, “It’s really easy for me to say that within a couple of years you’ll be able to buy machines with hundreds of cores, and they won’t be terribly expensive.” However he does expect these to be used primarily as servers: “You would buy it when you needed it – you’re not going to buy one for browsing the Web.”
The main development on the client side is the placing of a specialised graphics processing core onto the chip itself, alongside the current dual or quad processing cores. As Reinders explains: “Moore’s Law is about getting more transistors onto a chip. So far we’ve used it to add more and more cores, but it turns out that there’s more value if, instead of just building more cores, we also add a graphics capability, and there’s so many transistors now that we can make this a very sophisticated graphics processor.”
Like many third-party graphics cards, the graphics core contains multiple execution cores – 12 in this case – allowing it to take advantage of the highly parallel nature of much graphics work. However sharing the same silicon means it can be connected directly to the internal shared memory, in contrast to conventional graphics cards which must communicate through the far lower bandwidth offered by the PCI Express bus.
This does mean that users won’t be able to upgrade graphics performance without upgrading the main CPU. However, Reinders doesn’t see that as a problem: “You’re going to see graphics performance that you’ve never seen before… I think we’ve got to that point in the market where it is so good that upgrading is not likely to be something that the user wants to do.” Intel competitor AMD recently purchased graphics card company ATI, and Nvidia has announced its intention to combine ARM processor cores with its graphics technology, so others are thinking along similar lines.
The Cilk approach
On the software side Intel is keen to talk about a parallel programming extension to C and C++ that it has recently introduced called Cilk Plus. This is based on the Cilk language extensions developed at Massachusetts Institute of Technology (MIT) by Professor Charles Leiserson. In an effort to bring Cilk to market, Leiserson formed Cilk Arts in 2006 which was acquired in 2009 by Intel.
For the programmer, Cilk is pretty straightforward, adding just two concepts and three new keywords. The first is the ‘parallel for’ loop, invoked with cilk_for, which indicates that the iterations within the loop can be processed in parallel. The second is the ability to spawn a function onto a separate processing thread. This involves two keywords: cilk_spawn, which indicates that a function can safely be executed in parallel, and cilk_sync, which indicates that processing should not continue until all spawned functions have completed.
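To make the three keywords concrete, here is a minimal sketch of my own (the Fibonacci function is an illustration, not an example from the interview). It requires a Cilk-capable compiler such as Intel C++ Compiler 12.0; standard compilers will not accept the cilk_* keywords.

```cpp
#include <cilk/cilk.h>   /* defines cilk_spawn, cilk_sync and cilk_for */
#include <stdio.h>

/* Recursive Fibonacci, parallelised with the two Cilk concepts the
   article describes: spawning a function onto another worker thread,
   and a parallel for loop. */
long fib(int n)
{
    if (n < 2) return n;
    long a = cilk_spawn fib(n - 1); /* may execute on another worker  */
    long b = fib(n - 2);            /* continues on the current thread */
    cilk_sync;                      /* wait for all spawned calls      */
    return a + b;
}

int main(void)
{
    long sq[100];
    cilk_for (int i = 0; i < 100; i++)  /* iterations may run in parallel */
        sq[i] = (long)i * i;
    printf("fib(30) = %ld\n", fib(30));
    return 0;
}
```

Note that the keywords only grant the runtime permission to run the work in parallel; with one worker the program executes in ordinary serial order.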
Behind the scenes, Cilk employs a ‘task-stealing’ scheduling algorithm. To explain, Reinders asked us to imagine a program that hands an eight-core processor a collection of 80 tasks that need to be processed in parallel. “The question then becomes, how do you divide up that work?” One solution, known as ‘static partitioning’, is to simply hand 10 tasks to each core, but that turns out to be inefficient as any core that finishes its workload early ends up idle. Another approach is to put all 80 tasks into a pile and let cores take from the pile as they need work. However that requires a locking mechanism to ensure that two cores don’t try to take the same task, and locks don’t scale well. What MIT discovered is that the most efficient algorithm is to scatter the work across all eight cores, each with its own task queue, and let cores steal from another’s queue if they run out of work – hence ‘task stealing’.
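The effect of stealing can be seen in a toy model. The following is a single-threaded simulation of my own devising – not Intel’s runtime, which uses concurrent deques rather than a simulated clock – using Reinders’ numbers of 8 cores and 80 tasks:

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <deque>
#include <vector>

// Single-threaded simulation of task stealing: 80 tasks are scattered
// across 8 per-core queues, and a core that drains its own queue steals
// from the far end of another core's queue. Illustrative sketch only.
constexpr int kCores = 8;
constexpr int kTasks = 80;

// cost[t] is the time task t takes; returns the makespan, i.e. the
// simulated time at which the last core finishes.
int run_simulation(const std::vector<int>& cost) {
    std::array<std::deque<int>, kCores> queues;
    for (int t = 0; t < kTasks; ++t)
        queues[t % kCores].push_back(t);         // scatter the work

    std::array<int, kCores> busy_until{};         // per-core clock
    int assigned = 0;
    for (int clock = 0; assigned < kTasks; ++clock) {
        for (int c = 0; c < kCores; ++c) {
            if (busy_until[c] > clock) continue;  // core still working
            int task = -1;
            if (!queues[c].empty()) {             // prefer own queue
                task = queues[c].front();
                queues[c].pop_front();
            } else {                              // otherwise steal
                for (int v = 0; v < kCores && task < 0; ++v) {
                    if (!queues[v].empty()) {
                        task = queues[v].back();  // take from the far end
                        queues[v].pop_back();
                    }
                }
            }
            if (task >= 0) {
                busy_until[c] = clock + cost[task];
                ++assigned;
            }
        }
    }
    return *std::max_element(busy_until.begin(), busy_until.end());
}
```

If one task costs 20 time units and the other 79 cost one unit each, static partitioning would leave the unlucky core grinding through 20 + 9 = 29 units of work while its neighbours sit idle; with stealing, the idle cores absorb that core’s remaining queue and the makespan drops to the 20-unit task itself.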
The Cilk programming extensions and the scheduler are part of Intel’s Parallel Building Blocks (PBB), which are included with Intel Parallel Studio 2011 and Cluster Studio 2011. However the keyword approach becomes more effective when the task-stealing scheduler is implemented in the compiler itself. Intel has now done this with its C++ Compiler 12.0, but as Reinders points out, it takes time for such technologies to make their way into other C and C++ compilers, such as those from Microsoft or the GNU Project. Nevertheless, Reinders assured us “it is going great; we’ve had tremendous feedback. There’s a lot of interest in using it in education, because it’s very simple to teach, and we’ll see other compilers offering it in time.”
As the name implies, Cilk Plus goes beyond Cilk, in this case because Intel has added data parallelism to the mix. Reinders explains, “We’re convinced that the C language should have some provision for handling vectors and array arithmetic, much as Fortran 90 does. Here the ‘Plus’ part refers to some new array notation that we’ve put in. We took the attitude that if we’re adding functionality to the compiler we might as well have it understand how to efficiently handle large-scale vectors and arrays.”
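The array notation Reinders mentions lets a single statement operate on whole array sections. A tiny sketch (again requiring a Cilk Plus-capable compiler; the saxpy function is my illustration, not an example from the interview):

```cpp
/* Cilk Plus array notation: x[start:length] denotes an array section,
   so one statement expresses an element-wise update that the compiler
   can vectorise. */
void saxpy(int n, float alpha, const float x[], float y[])
{
    y[0:n] = alpha * x[0:n] + y[0:n];  /* y[i] += alpha * x[i] for all i */
}
```

The whole-array form gives the compiler the same freedom to generate vector instructions that Fortran 90 array expressions do.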
Intel C++ Compiler 12.0 comes with Intel’s Parallel Studio and Parallel Composer 2011 products. You can find out more about Cilk at http://software.intel.com/en-us/articles/intel-cilk-plus/.