Limitations in processor design mean the ‘free lunch’ is over for software developers, as we come to terms with the fact that chips simply aren’t going to get much faster. Matt Nicholson went to Think Parallel, Intel’s conference held in Lisbon in April 2007, to find out more.
Originally published on DNJ Online, Jun 2007
Up until now, programmers have had it easy. Ever since the first 16-bit processors appeared in the late 1970s, processor speeds have been increasing at an exponential rate, doubling every two years or, more recently, every 18 months. The modern Intel Pentium 4 offers 10,000 times the processing speed of the 8088 found in the original IBM PC. However, the essential architecture has remained unchanged, which means that programs which ran on last year’s processors continue to run on today’s, only faster. They may need to be recompiled to take advantage of the latest technology, but in the end it has always been hard disk or network access that caused the bottleneck, and that wasn’t a coder’s problem.
However, as we found out at Think Parallel, Intel’s EMEA Channel Conference 2.0 held in Lisbon in April 2007, that’s about to change. Here Herb Sutter, Microsoft Software Architect and chair of the C++ Standards Committee, told us in the opening keynote, “Your free lunch is over.” Until recently, clock speed has increased along with processing speed, from the 4.77MHz of the 8088 to the 3.2GHz of the Pentium 4. However, Intel abandoned plans for a 4GHz processor in 2004, and since then maximum clock speeds have crept up to only around 3.8GHz.
Why is this? As James Reinders of Intel Software told us, it is not because Moore’s law, which states that the density of transistors doubles every two years, no longer applies. Intel’s current generation of processors is constructed using a 65nm process, and Intel plans to introduce a 45nm process in 2008, followed by 32nm in 2010 and 22nm in 2012. It is because faster clock speeds require too much power and generate too much heat, which cannot be dissipated fast enough. Higher transistor density can be used to build larger caches, which reduce the memory bottleneck, or to put graphics processors or network interfaces on the chip, but when it comes to cramming more processing cycles into each second, we’re rapidly approaching a brick wall.
The solution is to put more than one processing core on a single chip. Rather than attempt to do one thing after another at an ever higher speed, try doing more than one thing at the same time. Build a two-core processor and you should be able to do twice as much without increasing clock speed. Put in four cores and you should be able to do four times as much. Most new desktops, and even laptops, are now dual-core, and Intel is talking about introducing four-core processors by 2008, and 64 cores by 2011.
The concurrency catch
However, there is a catch. Most of us write programs on the assumption that instructions will be executed sequentially. Try to execute more than one instruction stream at the same time and you can run into problems, particularly if each thread is writing to and reading from the same memory locations.
Herb Sutter went on to discuss the ‘three pillars’ of parallelisation. First of all there is the move from synchronous operation, where a program makes a function call and waits for the response, to asynchronous operation, where code can get on with something else until the result is ready. Then there are concurrent collections, where the same operation needs to be performed on a collection of objects; if there are no dependencies between them, such operations can be executed in parallel. Finally there are ‘mutable shared state’ situations, where threads can find themselves treading on each other’s toes, creating deadlocks, or race conditions in which the value of a variable depends on which thread got there first. This is where parallelisation becomes hard.
Reinders emphasised the need to ‘think parallel’ from the start. He used the example of sending out invitations to a party which involves a series of tasks, starting with folding the invite, stuffing it in an envelope, sealing the envelope, looking up and writing the guest’s address, stamping the envelope and finishing by putting it in the mail.
If we had six volunteers we could speed up the task in a number of ways. We could put each person in charge of their own pipeline, but we would then have to coordinate the addressing stage to ensure that the same address didn’t end up on more than one envelope. Alternatively, we could give each person a separate task, creating a pipeline in which the first person folds an invite and passes it to the second, and so on down the line, but that could result in a bottleneck at the more time-consuming addressing stage.
A better solution would be to ‘parallelise’ the tasks that take more time and serialise the others, making the best use of the people available: perhaps allocating two people to the addressing stage while having one person handle both the folding and the stuffing stages. If another six volunteers turned up, making a dozen in all, we could get through the process even quicker by putting four people on the addressing stage, and so forth.
As Reinders pointed out, we are currently at what is arguably the most difficult point in the transition to multi-core in that most programs that are written for parallel processing run slower on single-core processors than programs written specifically for serial processing. However once multi-core processors become the norm, programmers will find that they once again benefit from a free lunch – provided they have written code that scales to the number of cores available.
Intel has tools to facilitate parallel programming at all stages of the software cycle. These include support for the OpenMP standard, which defines pragmas that you use in your source code to specify parallel operations, and for MPI (Message Passing Interface) which is aimed more at clusters:
Intel Threading Building Blocks 1.1 gives C++ developers a library of templates with names like parallel_for, parallel_sort and concurrent_queue. Math Kernel Library (MKL) 9.1 and Integrated Performance Primitives (IPP) 5.2 provide math and media-related functions. All are optimised for multi-core clients running Linux, Windows or Mac OS. MKL has a Cluster Edition and also supports Fortran. There is also the Intel MPI Library 3.0, which supports both C++ and Fortran but is currently only available for Linux.
Intel Thread Checker 3.1 finds deadlocks and data races, VTune Analyzer 9.0 is good for identifying areas that could be optimised for parallel processing, while Thread Profiler 3.1 shows how threads are actually being used as the program runs. All work with Windows and Linux and support OpenMP. Intel Trace Analyzer and Collector 7.0 helps analyse MPI programs, with support for C, C++ and Fortran on Linux.
Intel has Fortran and C++ compilers that are optimised for multi-core operation and support OpenMP. Version 10.0 of its compiler range was announced at the conference (see News section for more details). Cluster OpenMP support, which allows OpenMP to be used in cluster as well as multi-core environments, is available for the Linux versions.