Archive for SHA1

Sure, we’re brute-forcing, but what’s the rush?

Ironically, in the process of writing a really fast implementation of the WPA2 encryption scheme, it was necessary (or at least easier) to first write a really slow implementation.

Here is my wpa2slow Python module.

To be fair, the actual speed at which this hashes isn’t the point. The point is that I have an entirely native Python implementation of how the entire WPA2 algorithm works, with an emphasis on code being clean, easy to read, and forgoing any speed optimisations that hinder legibility. Judging from the quantity of misinformation I’ve seen while researching this topic, someone will find this project useful. A lot of people have been having trouble trying to fit all the pieces together, over the years.

At first, I included the pieces as mock objects for unit testing. Each of the pieces (SHA1, HMAC, PBKDF2, etc) were built up over a period of several months, vaguely mimicking the interface format of equivalent existing hashlib libraries. Unfortunately, all of the standard Python hash libraries are really inconsistent!

For example, to hash a string with SHA1, I’d use:

output = hashlib.sha1(str)

Makes sense. But to then hash it with HMAC-SHA1, I’d do:

output = hmac.new(secret, value, hashlib.sha1)

Which is pretty weird, in this context. It makes more sense when you realise that HMAC can work with multiple sub-algorithms (commonly SHA1 or MD5), and you can go multiple rounds with adding more salt. That’s not the case in my implementation, however, so copying the format was a mistake.

One thing that rests solely on my shoulders is the massive amount of debugging information, hacky fixes, and generally poor code I’d written while focusing on an entirely different problem.

If I had written it cleanly from the beginning, it would have been a lot less work, that was silly of me.

 

Anyway. It’s pretty decent now. Check it out, read the docs, or install it!

pip install wpa2slow

Unpacking WPA2

As discussed in a previous article, WPA2 encryption is comprised of three different algorithms layered on top of each other. I went over them in a very brief overview, so here is a more in-depth discussion on how I optimised and implemented them on an FPGA.

Also linked in that previous article, it’s worth going back and reading this presentation again. It really is a great overview of each of the three algorithms and how they fit together, without muddying the waters with the low-level details.

Disclaimer: This is my process for understanding and breaking them down. I will be using non-standard terminology, and there are certainly other ways of internalising these algorithms. My methods aren’t the only methods.

The lowest building block, SHA1, has four distinct sections stages of note: Load, Process Load, Process Buffer, and Output.

In a completely linear implementation, it would look like this:

SHA1 Linear

I’m assuming a bus width of 32 bits. The input/output blocks on the end are fixed in size, and cannot overlap. Not if we want to maintain half-duplex compatibility with the parent device. That may or may not be important, but it sure does simplify the implementation for now. 165 clock cycles total.

We have a little more wiggle room with the red blocks. Because the buffer can be processed as it is getting populated by the Process Load stage, we can merge them. Extrapolated, and using the input/output as the bottleneck, we come up with this:

SHA1 Parallel (2)

 

Notice that there’s still only one green section running at any single point along the horizontal. That’s the critical path, and dictates how many parallel operations we want to run at once.

It works out to 106 clock cycles for one cycle, assuming we run 5 hashes in parallel. So, 22 clocks per hash. Way better.

hmacpseudo

The next block up, the HMAC function, contains two SHA1 functions, which is sort of annoying. The second/outer call is also dependent on the output of the first/inner one, so I can’t parallelise that at all. To fill up the 5 SHA1 operations I have going at a time, I must also be running 5 separate HMAC operations.

There is one optimisation to be done, though, mentioned in the presentation. It’s not obvious how it’s useful in a concurrent environment, but bear with me.

Part of the algorithm requires two buffers to contain the ‘secret’ portion of the HMAC input, padded with 0x36 or 0x5c up to 64 bytes. The parent algorithm, PBKDF2, iterates two HMAC functions, 4092 times, all with the same secret variable. That can be calculated once and used for the entire loop.

If everything is done in parallel, then one would expect that the calculation could be done on-the-fly with little to no penalty. When you consider that each round of SHA1 is 64 bytes, however, you realise that a padded 64-byte input message, followed by the ‘secret’ variable requires two complete rounds of the SHA1 function to come up with the final result. But the first round of SHA1 results in the same output, every time, and can be precalculated. This turns the HMAC algo from what is effectively four SHA1 operations into only two.

 

pbkdf2pseudo

 

Note that they’ve actually written this algorithm incorrectly, there is an additional XOR to combine each stage of x1/x2 into a final result, but no matter for this discussion.

The PBKDF2 part is by far the most expensive step, because the area of most FPGAs will be too small to unroll 8192 copies of the SHA1 algorithm (which requires minimum (80 words * 4 bytes * 8 bits = 2560) bits of buffer space). One more optimization could be done, given the right conditions, and it’s a doozy:

The final, very last operation in the whole mess is to append x2 to x1. No hashing after that. That means the final output is two groups of 5 words (40 bytes total). This is the Pairwise Master Key. It is statistically unlikely that the first 20 bytes will be correct, but the second group would be wrong. This means that if we have the PMK and only need to verify it, we can do these calculations with exactly half the silicon area, or twice the speed.

Unfortunately, in this particular use-case we don’t have it, but it’s a great trick to keep up our sleeve.

 

 

So what do we have in this use-case?

 

We’ve gone from the wifi SSID and passphrase, known as the master key (MK), and from that, generated the Pre-Shared Key (PSK), which is known as the Pairwise Master Key in this particular implementation of the PBKDF2 algorithm.

To verify the PMK against the passphrase, something called the “Pairwise Key Expansion” is calculated. From the captured WPA2 packets, we have a few variables: client MAC address, AP MAC address, client nonce, AP nonce, and the Message Integrity Check (MIC), packet body.

The first four variables, along with the PMK get combined together in kind of an annoying way, and then compared against the MIC.

They just call it the pseudo-random function (PRF), and it goes kinda like this:

a = "Pairwise key expansion";
b = min(APMac, CMac) . max(APMac, CMac) . \
    min(APNonce, CNonce) . max(APNonce, CNonce);
r = "";
for(i = 0; i < 4; i++) {
    r = r . HMAC_SHA1(PMK, a . "\0" . b . chr(i));
}
return r[0:64];

 

Yeah, the actual string is used in the PRF.

This gives us the Pairwise Temporal Key (PTK), which is then combined with part of the (encrypted) body of the packet we captured:

 

mic = HMAC_SHA1(ptk[0:16], data[60:121]);

 

And then compare this to the MIC we’ve already captured! Easy, right?

No! It’s a huge pain and would add another HMAC_SHA1 block that we don’t really have room for. This will probably be implemented on the microcontroller firmware or the host software of my system.

WPA2 HMAC SHA1 PMK FPGA VHDL

Hell yeah, acronyms.

Restartin’ dis.

I’ve got the old working SHA1 encryption algorithm going. It’s up on GitHub, here. I was waffling between parallel and serial loads and outputs. SHA1 takes a 512 bit input and outputs 160 bits, and no FPGA that I have access to will have that many pins.

Here’s a fallacy I got stuck on for a while:

If I use completely parallel I/O, it’s faster. Therefore, I should try to use as much parallel communication as possible to maximise speeds.

 

The truth is, though, that most modern high-speed circuits use serial data. When you’re clocking at very high frequencies, propagation errors become very difficult to mitigate. With many data lines, you have no way of knowing when exactly data arrives at the lines. It’s better, easier, more reliable to just send one bit at a time and pump them through really fast. That’s a core principle of pipelining; you don’t even need to wait for a signal to reach its destination before you’re sending the next one.

 

With this project, I don’t think I can go faster than 100MHz with the FPGAs I have. That’s right on the edge of where efficient pipelining starts to matter, so I’ll be experimenting with a mix of serial and parallel interfaces to see how far I can stretch it.

 

Right now at this moment, though, I want to get the computationally-intensive parts of the WPA2 protocol offloaded to the FPGA. I won’t worry about speed juuust yet.

 

 

Here’s how WPA2 works, from a security analysis perspective:

 

 

Some definitions

 

  • Access Point (AP) – Usually the router, in these systems
  • SSID – AP name. This is the router name that you connect to
  • Master Key (MK) – AP password. This is the key we ultimately want to find
  • Pairwise Master Key (PMK) – This is the cryptographically hard part of the whole transaction. It’s derived from the MK and SSID
  • Pairwise Transient Key (PTK) – The cleartext portion of the authentication to verify the correct PMK
  • PBKDF2 (wiki, RFC) – Password-Based Key Derivation Function 2 – A cryptographic function, used like PBKDF2(PRF, Password, Salt, c, dkLen) where PRF is any given hash function, c is the quantity of iterations, and dkLen is the output length
  • HMAC (RFC) – Hash-based message authentication code – Intermediate function applied to the SHA1 hash
  • SHA1 (RFC) – Standard hashing function. Uses a one-way algorithm to generate non-reversible output

 

The AP has the password, from which it can derive the Pairwise Master Key (PMK). To verify that both sides of the transaction have this, a PTK is generated using the PMK and nonces (randomly generated, one-time-use numbers).

This PTK is transmitted over cleartext, and is part of the 4-way handshake.

 

So, in short, if you combine the proper PMK with the proper PTK, the whole conversation breaks wide open and you can associate with the AP.

If you’re watching the WPA traffic, then you can easily capture the 4-way handshake and therefore the PTK.

 

That means that, for my purposes anyway, the PMK is the important unknown variable, and the computationally hard part. Here‘s how you calculate it, from top to bottom:

PBKDF2:

x1 = HMAC_SHA1(MK, SSID + "1");
x2 = HMAC_SHA1(MK, SSID + "2");
f1 = x1
f2 = x2
for(i = 1; i < 4096; i++) {
    x1 = HMAC_SHA1(MK, x1);
    x2 = HMAC_SHA1(MK, x2);
    f1 ^= x1
    f2 ^= x2
}
return f1 + f2;

 

HMAC:

This is more complicated. Given HMAC(secret, value)

  •  Initialise two 64-byte variables, Bi and Bo with secret, padded with zeros on the end
  • XOR each of byte of Bi with 0x36, and XOR each of byte of Bo with 0x5C
  • Append value to Bi
  • Append SHA1(Bi) (in Ascii!) to Bo
  • Return SHA1(Bo)

In short:

HMAC_SHA1(s, v) = SHA1((s ⊕ 0x5c)||SHA1((s ⊕ 0x36)||v))

 

And finally, with the SHA1 function, I have covered it in previous posts. This portion is already done.

I need another dev kit

Recently, I got given an FPGA development kit to continue working on my SHA1 system. The end-goal is to eventually integrate that into a PCI-E card, but we’ll see how that works out.

 

The FPGA kit I have is the Digilent Basys2, and it uses a Xilinx Spartan 3E chip. The lowest-tier, 100k gate model.

 

I’ve updated the current code, here, consisting of the parallel behavioral SHA1 implementation, a serial input wrapper that parallelizes the data and acts as an interface, a simulation testbench, and a physical testbench specific to my board. Unfortunately, this code, which is the absolute simplest building block of the final project is about 600% too large for my dev board. It takes up about 12,000 logical slices, and the Spartan 3E version I have only has about 2,000. For further reference, a Papillio Pro, which is very popular right now, uses a version of the Spartan 6 family that has about 9,000 slices.

My goal right now is to, without spending a whole lot of money, to get a development board, or even a bare FPGA on a breakout board with at least 15,000 slices so that I can benchmark this implementation of the algorithm. After I do that with the first iteration, I can write a concurrent version, benchmark it, and start tweaking. I’d really like to get into low-level optimization.

Stretch goals involve getting a massive FPGA to do a lot more with, but those can start to run upwards of $150 just for the chip alone.

I don’t even need to keep these boards, I just need to use them briefly as a proof-of-concept.

Further adventures of implementing SHA1 encryption on an FPGA

A few years ago, I designed and simulated a SHA1 encryption machine using CMOS logic. I actually built everything using Altera’s FPGA tools, Quartus. With standard propagation of the 7400 series, it worked out to about 1.8MHz max clock speed to prevent race conditions.

 

Implementing the system right in VHDL and running it on the FPGA itself, I think I can get it faster. The previous system also relied on a serial workflow that limited the data to one hash at a time. If I parallelize everything, I can likely get it up to 80 time faster. A very very cursory look suggests that SHA1 has been done a million times by a million different people, but I’m not looking at their work. This isn’t about them.

After trying to figure out how I figured out the SHA1 protocol last time, it looks like the only document I used was the original RFC 3174 spec. There are a lot crappy pseudo-code and diagrams out there, but man, that paper is just so well-written, nothing compares.

I also found an old spreadsheet in my Drive account in which I’d written out the first few rounds of the hashing process.

Initially, because I’m relearning the algorithm at the same time that I’m relearning VHDL, I implemented it completely sequentially. The project is here. It’s a proof-of-concept, not actually something that is usable, at the time of writing. It will likely evolve into something cool.

Foiled Again

Pretty soon after I wrote that last post, I got sent up north for work for a long while. Which is great, don’t get me wrong, I like money as much as the next guy. It really limits my ability to physically build, though.

So I missed the contest entry deadline. Totally not surprised. I’d need another month to finish, but I’m putting this on the backburner for now.

So far, the only thing I’ve finished is the simulation. That may not sound like much, but let’s run through some numbers:

The design consists of 50 functional blocks representing (I think I counted right) 199 discrete CMOS ICs
The input, which is five words or 160 bits long, is clocked in 1 word at a time.
The output, also five words, pops out like magic 80 clocks later.
I haven’t done a proper timing analysis, but some back-of-the-envelope calculations (literally – I did this on the blank side of an envelope) give me a typical propagation delay of 545ns. This allows for a clock speed of about 1.8MHz, and tests about 22,000 combinations per second.

That’s pretty fast, and quite okay for a first draft. But it can be improved.
Right now, this circuit is set up into two cyclical stages. Only one input can be processed at a time as previous iterations of the data affects itself.
If the design were to be linearized, proper pipelining could make the design 80 times faster. There are also definitely some improvements to be had in designing for higher speed chips and in-depth timing analysis to figure out how tight you could get away with for pipelining.

But as I said; I’m killing this for now.

Shortly after I designed this, my harddrive failed and lost the original simulation files. I use cloud backup services now. Fortunately I had screencaps, though. Here is a pure hardware SHA1 encoder:

Declarative Statement Goes Here

As I sit here, riding shotgun in my work truck on the way to a job site, there is very little that I can do to be productive. I feel like I’m wasting my time.
What I can do is set some final goals for my project, and also maybe do some some musing (in the most pretentious language possible).

But first, some background:

As implied, I spend a lot of my time working on the road. That means a lot of cheap motel rooms, and a lot of unreliable internet access. When you’re stuck for the coldest six weeks of the year in the township of Bumfuck, Northern BC, internet access becomes pretty important.

As often as motel wifi is flaky, my internet-enabled devices can usually see a password-protected residential network nearby. Naturally, trying to break into someone else’s router would be immoral, but it does present an interesting problem. How easy would it be to make a portable wifi cracker?

To make a long story short, I decided to make a very fast WPA-PSK brute forcer. “Very fast” is a little misleading here, because despite potentially having the ability to test hundreds of passwords a second, breaking a properly chosen password is not something that can be done while you sit around and twiddle your thumbs. It takes a little longer. Think “aging a good cask of scotch” timeframes.

By this point, the problem has become purely academic. But that’s okay. I decided to do it anyway. But just as I settled in to write some serious VHDL code, the 7400 competition reared its beautiful head.
So I decided to do something slightly insane and drop the FPGA.

I’m designing a CMOS SHA1 encrypter. The PBKBF2 and MAC portion of WPA2’s encryption scheme won’t be done, the SHA1 will stand alone.

Insane quantities of chips aren’t without precedent for me. My final project in school was a poster child for function creep.

It didn't work.

Fig 1: This is what happens when you let the team member with Aspergers add “just one more thing”

 

As I start to get my teeth into this project, I’m coming to the realization (hope) that I don’t have 16 straight hours of soldering in my future. While there are 512 bits for the input, and 160 for a default seed, and then another 160 for the output, and also a clock source that has trigger each stage sequentially- wait. That does seem like a lot of soldering.

Fortunately, my requirements are pretty tightly defined so not a lot of function creep is possible. There’s only the problem of reliably verifying that the board works by passing the 16-word input and reading the 5-word output. I’d use an Arduino and some shift registers, but I don’t have any ATMegas with bootloaders on them. Maybe I’ll use a PIC and RS232 with a MAX232 and some custom Windows software…

Oh man.
I am in trouble.