Hardware acceleration of password recovery has been a hot topic for quite some time now. We were the first to adopt widely available graphics cards for this purpose, and we're proud of it. Today I'd like to share some thoughts on hardware acceleration for password recovery: its past, present, and future. I will also cover the most frequently asked questions about GPUs.
In recent years, password security in many products has improved considerably, which has led to a significant drop in password recovery efficiency. Microsoft Office is a good example: for documents created with Office 97-2003 and saved with default security settings, we can check passwords at a rate of about half a million per second on a single PC. Changing the default settings to something stronger drops this rate to tens of thousands of passwords per second. Switching to Office 2007 improves password security even further, and the password recovery rate falls to mere hundreds of passwords per second. The same trend can be seen in other products; almost everyone is slowing down password recovery today because it is good for security (one notable exception being Adobe: password recovery for its latest Acrobat 9 is orders of magnitude faster than for any previous version).
When the speed of password recovery dropped to the level where even basic dictionary attacks became difficult, password recovery vendors turned to hardware accelerators. Two accelerator families appeared in 2007: Tableau and Pico. Both are based on FPGAs, and both carry exceptionally high price tags, which has limited their adoption.
Another very important milestone of 2007 was the public release of CUDA, a technology which effectively turns a graphics card into a high-performance co-processor capable of tackling almost any type of task. The great thing about this technology is that it was designed to run on mainstream devices and is backed by millions of graphics cards already sold.
The first stable version of CUDA was released in July '07, and as early as October '07 we released an updated version of EDPR with GPU (Graphics Processing Unit) support. EDPR was probably the first commercial application to use CUDA.
But CUDA is not the only solution for doing computations on a GPU. ATI, NVIDIA's competitor and a division of AMD, offers ATI Stream, a very similar technology aimed at ATI cards. In my opinion, this technology is far less stable and far less usable than CUDA. Its potential is pretty good (ATI cards clearly outperform NVIDIA cards at compute-intensive tasks), but it seems that ATI just doesn't put much effort into developing it. Again, we were probably the first to support ATI Stream in a commercial application: this winter we released Wireless Security Auditor, which can use both NVIDIA and ATI cards.
In terms of peak performance, current GPUs are much faster than CPUs in the same price range (e.g. a high-end graphics card vs. a high-end CPU). This animation shows the evolution of the peak computing power of GPUs (green) and CPUs (blue):
The vertical axis switches to a logarithmic scale at the end to show that Moore's law still holds and performance growth is exponential. The direct consequence is that the performance gap between GPUs and CPUs is likely to widen further.
The answer to why GPUs are faster is simple: because they are designed to be faster. Designing a processor involves many trade-offs, and the priorities are simply different for CPUs and GPUs. For a CPU, flexibility is more important, i.e. it must be fairly good at all possible tasks. Most of the code running on our computers is not optimized at all, which is why modern CPUs employ various techniques to optimize code execution on the fly, including branch prediction, out-of-order execution, and more. Those units consume precious chip resources (area, transistors), leaving fewer resources for the actual execution units.
A GPU is a very different story, with performance being the main goal. It lacks many of the units present in a CPU, but it dedicates more resources to execution units and thus provides better performance. It also has a different memory hierarchy, without «automatic» cache memory. The downside is that the GPU programming model is more restrictive and needs to be understood to be exploited efficiently. Writing good code requires more effort from the developer.
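To give a feel for that programming model, here is a minimal sketch in CUDA (an illustration only, not our production code; check_candidate() is a made-up stand-in for a real hash check). Each thread derives its own password candidate from its thread index and tests it; the speedup comes from running a million such threads per launch, not from any per-thread cleverness.

```cuda
// Minimal sketch of the CUDA execution model: one thread per candidate.
#include <cstdio>
#include <cstdint>

// Hypothetical stand-in for a real check; production code would compute
// MD5/SHA-1 of the candidate and compare it with the stored hash.
__device__ bool check_candidate(uint64_t candidate, uint32_t target)
{
    uint32_t h = (uint32_t)(candidate * 2654435761u);   // toy "hash"
    return h == target;
}

__global__ void brute_force(uint64_t start, uint32_t target, uint64_t *found)
{
    // Each thread derives its own candidate from its global thread index.
    uint64_t candidate = start + blockIdx.x * (uint64_t)blockDim.x + threadIdx.x;
    if (check_candidate(candidate, target))
        *found = candidate;     // report a hit (any matching thread may write)
}

int main()
{
    uint64_t zero = 0, result = 0;
    uint64_t *d_found;
    cudaMalloc((void**)&d_found, sizeof(uint64_t));
    cudaMemcpy(d_found, &zero, sizeof(uint64_t), cudaMemcpyHostToDevice);

    // 4096 blocks x 256 threads = ~1M candidates; a real tool loops over launches.
    brute_force<<<4096, 256>>>(1, 0xDEADBEEFu, d_found);
    cudaMemcpy(&result, d_found, sizeof(uint64_t), cudaMemcpyDeviceToHost);

    printf("matching candidate (0 if none in this range): %llu\n",
           (unsigned long long)result);
    cudaFree(d_found);
    return 0;
}
```

There is no branch predictor or out-of-order engine to lean on here: if the per-thread code is written carelessly, the card will run it at a small fraction of its peak speed, which brings us to the next point.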
Those architectural peculiarities can cause slowdowns instead of the expected speedups. For example, GPUs are very sensitive to memory access patterns, and if code reads or writes memory at arbitrary locations, its performance will be severely impacted. Fortunately, most password verification schemes are built around hash functions (MD4, MD5, SHA-1, and SHA-2), which have a tiny memory footprint and thus run very efficiently on a GPU. Password verification can also employ ciphers, such as DES (Windows LM hashes) or RC4 (Microsoft Office), which are not so GPU-friendly, but there are mitigation strategies for those as well. In the case of DES, we can use a special technique known as bit-slicing to reduce the number of memory accesses. Unfortunately, this is not possible for RC4. The only options left for RC4 are to optimize the memory access pattern and to allocate its data structures in the fastest memory available, but the speedup we can get for RC4 is still not as impressive as for the hash functions.
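To illustrate the «fastest memory» point, here is a rough sketch (again, not our actual implementation) of RC4 key setup that keeps each thread's 256-byte state in on-chip shared memory. I'm assuming 5-byte keys, as in the 40-bit RC4 used by Office 97-2003 default encryption, and a thread count chosen so that a block's state fits easily into the 16 KB of shared memory per multiprocessor on current NVIDIA cards:

```cuda
#include <cstdio>

#define THREADS_PER_BLOCK 32   // 32 threads x 256 bytes of state = 8 KB per block
#define KEY_LEN 5              // 40-bit RC4 key (Office 97-2003 default encryption)

__global__ void rc4_ksa_demo(const unsigned char *keys)   // KEY_LEN bytes per thread
{
    // Per-thread RC4 state in fast on-chip shared memory instead of slow device memory.
    __shared__ unsigned char S[THREADS_PER_BLOCK][256];
    unsigned char *s = S[threadIdx.x];
    const unsigned char *key =
        keys + (blockIdx.x * THREADS_PER_BLOCK + threadIdx.x) * KEY_LEN;

    for (int i = 0; i < 256; i++)                        // standard RC4 key schedule
        s[i] = (unsigned char)i;

    unsigned char j = 0;
    for (int i = 0; i < 256; i++) {
        j = (unsigned char)(j + s[i] + key[i % KEY_LEN]);
        unsigned char t = s[i]; s[i] = s[j]; s[j] = t;   // swap S[i] and S[j]
    }
    // ...keystream generation and comparison with known plaintext would follow;
    // real code would also lay the state out to reduce shared-memory bank conflicts.
}

int main()
{
    const int blocks = 256;
    const size_t nbytes = (size_t)blocks * THREADS_PER_BLOCK * KEY_LEN;
    unsigned char *d_keys;
    cudaMalloc((void**)&d_keys, nbytes);
    cudaMemset(d_keys, 0x5A, nbytes);                    // dummy candidate keys
    rc4_ksa_demo<<<blocks, THREADS_PER_BLOCK>>>(d_keys);
    cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(d_keys);
    return 0;
}
```

Even with the state in shared memory, only a handful of RC4 instances fit on each multiprocessor, which is exactly why the RC4 speedup is modest compared to MD5 or SHA-1.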
Now we know that GPUs are significantly faster than CPUs at certain tasks, but what about other accelerators, like the commercial FPGA-based devices already mentioned? Those are really expensive, and their price/performance is nowhere near that of GPUs. I talked about this recently at Troopers'09, and here are the relevant slides (jump to slides 31-35):
I don't think FPGAs will reach the price/performance of GPUs anytime soon. GPUs are manufactured on a much larger scale than FPGAs (NVIDIA claims 100 million CUDA-capable devices have already been sold), which allows for a much lower price per unit.
Probably the only FPGA solution that can compete with GPUs in terms of price/performance (or even beat them; I haven't done precise calculations) is COPACOBANA. This state-of-the-art piece of hardware is based on Xilinx Spartan-3 FPGAs, costs about €10,000, and can brute-force a full 56-bit DES key in about a week. However, the new release of COPACOBANA is based on the much more expensive Virtex-4 FPGAs, so its price/performance is not as brilliant.
Now that we're done with FPGAs, let's compare GPUs to the IBM Cell processor, which can be found in millions of PlayStation 3 (PS3) gaming consoles. It received a lot of attention in late 2007, when Nick Breese announced that it was capable of cracking MD5 hashes at a rate of 1.4 billion passwords per second (later «improved» to 1.9 billion). That was much faster (10-20x) than the GPUs of that time, but it was also clear to anyone who can do simple math that there was an error in this benchmark. Let's see: the PS3 has 6 available CPU ‘cores’ called SPUs, and each core processes data in 128-bit registers, i.e. 24 MD5 hashes can be computed in parallel (6 cores * 4 32-bit values per register). Computing an MD5 hash consists of 64 steps, and each step requires four additions, one cyclic shift, and two to four logical operations. Call it 8 operations per step; then computing one hash value requires 64*8 = 512 cycles. The PS3 processor runs at 3.2 GHz, so it can do (3.2 GHz / 512 cycles per hash) * 24 hashes in parallel = 6.25 million * 24 = 150 million hashes/sec. Even this simple calculation shows that the results were flawed. It was later confirmed: Nick updated his presentation (jump to page 46) to reflect that the actual «cracking rate» was only 80 million, not 1.9 billion as stated before. How does this compare to GPUs? A GTX 285 can do about 580 million MD5 hashes per second, and a GTX 295 approaches one billion.
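For those who prefer code to mental arithmetic, here is the same back-of-the-envelope estimate as a few lines of plain host code; all the inputs are the numbers from the paragraph above, so it's easy to plug in your own:

```cuda
#include <cstdio>

int main()
{
    const double clock_hz     = 3.2e9;  // Cell clock frequency
    const int    spus         = 6;      // SPUs available to applications on a PS3
    const int    lanes        = 4;      // 32-bit values per 128-bit register
    const int    ops_per_step = 8;      // rough cost of one MD5 step
    const int    steps        = 64;     // MD5 steps per hash

    const double cycles_per_hash = (double)ops_per_step * steps;           // 512
    const double hashes_per_sec  = clock_hz / cycles_per_hash * spus * lanes;

    printf("estimated MD5 rate: %.0f million hashes/sec\n", hashes_per_sec / 1e6);
    return 0;
}
```

It prints 150 million hashes/sec, an order of magnitude below the originally claimed 1.9 billion.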
Nonetheless, the IBM Cell is a great computing platform, and the PS3 may be a very affordable, and thus attractive, alternative to GPUs for certain tasks, especially those that do not fit GPU constraints.
How will GPU computing look tomorrow? That's not an easy question.
Probably the most important upcoming development on the hardware side is Intel's Larrabee. It will be a graphics card built from x86 CPU cores, each based on the good old Pentium (P54C) design. The cores do not support out-of-order execution, to save chip area, but they do support 64-bit computing, and each core supports 4-way Hyper-Threading to hide memory access latencies. But the most important part is that each core is equipped with a 512-bit wide vector processing unit, so, for example, we can now compute not 4 but 16 MD5 hashes in parallel per core!
Unfortunately, it is not easy to estimate Larrabee's performance: there is no official data on the number of cores per card or their clock frequencies. The rumored numbers are 16-24 cores at 1.7-2.5 GHz. I'm not sure the P54C cores won't melt down at 2.5 GHz, but 2.0 GHz seems pretty reasonable to me. So what kind of performance can we expect? Let's take the same MD5 password recovery benchmark. My Core 2 E4500 (2 cores @ 2.2 GHz) can do 32 million MD5 hashes per second using SSE2, which translates roughly to (2.2 GHz * 2 cores * 4 values per register) / 32 million = 550 32-bit operations per hash value computed. Larrabee with 16 cores running at 2.0 GHz would then be able to compute (2.0 GHz * 16 cores * 16 values per register) / 550 = 930 million hashes/sec. That is roughly equivalent to a GTX 295. Not very impressive, especially given the amount of marketing buzz around Larrabee, but we will definitely have to wait for more official information to be revealed.
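Here is the same estimate in code form, which makes it easy to try other rumored configurations (the Core 2 figures are measured; the Larrabee core count and clock are rumors, not official data):

```cuda
#include <cstdio>

int main()
{
    // Measured reference point: Core 2 E4500 with SSE2.
    const double c2_clock = 2.2e9;   // Hz
    const int    c2_cores = 2;
    const int    c2_lanes = 4;       // 32-bit values per 128-bit SSE2 register
    const double c2_rate  = 32e6;    // measured MD5 hashes/sec

    // Implied cost of one MD5 hash in 32-bit operations (about 550).
    const double ops_per_hash = c2_clock * c2_cores * c2_lanes / c2_rate;

    // Rumored Larrabee configuration.
    const double lrb_clock = 2.0e9;  // Hz
    const int    lrb_cores = 16;
    const int    lrb_lanes = 16;     // 32-bit values per 512-bit vector register

    printf("ops per hash: %.0f\n", ops_per_hash);
    printf("estimated Larrabee MD5 rate: %.0f million hashes/sec\n",
           lrb_clock * lrb_cores * lrb_lanes / ops_per_hash / 1e6);
    return 0;
}
```

Under these assumptions it prints roughly 930 million hashes/sec.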
But even if the actual performance turns out to be twice as good as this estimate, and keeping in mind that Larrabee will have to compete with future generations of GPUs (not the current ones), I wouldn't expect it to have an easy time. NVIDIA is preparing for the battle with its very promising GT300 architecture, which will feature a MIMD computing model (as opposed to the SIMD model used by all current GPUs) and up to 512 cores.
Anyway, competition is good, and NVIDIA has finally got a worthy opponent. I'm sure the Intel vs. NVIDIA battle will make a lot of headlines next winter.
GPU computing is also undergoing a kind of standardization process. Hardware manufacturers have agreed on the OpenCL specification and are now adding support for it to their SDKs and drivers. OpenCL is a framework for creating applications that run on different GPUs. Today, GPU software developers have to write different code for ATI and NVIDIA GPUs; adopting OpenCL will allow them to write code once for all supported platforms. This will definitely reduce the time required to implement new features in GPU code and will make it easier for other manufacturers to join GPU computing: all they need to do is implement an OpenCL layer, with no need to re-invent CUDA or ATI Stream.
As you can see, GPU computing is very young and its landscape is not yet fully formed. If you are a researcher or scientist with a computationally challenging task, it may be a good idea to take a look at GPUs. And as usual, if you have any questions, you're welcome to ask!