01-21-2012, 09:42 PM
Welcome to the second hashcat development report
NOTE: If you follow the hashcat Twitter account, you already know some of this.
I will start with the most interesting thing!
On a stock-clocked hd7970 the 100 Mhash/s mark is finally broken! This applies to both programs, oclHashcat-lite and oclHashcat-plus, and even in multihash mode. Proof pic: https://hashcat.net/misc/goodbye_des.png
The hd7970 also has impressive overclocking potential. Even with the stock cooler you can easily overclock it to 1125 MHz. If I do this, the cracking speed increases to 126 Mhash/s.
What I did is optimize the descrypt / DES(Unix) / Traditional DES (or whatever you want to call this algorithm) kernel for the GCN architecture. This basically means I removed all the vector datatype code and replaced it with scalar datatypes. In the future there will be two different kernels for this algorithm: one for the old vector datatype architectures and one for the new scalar architecture. Don't worry, both oclHashcat-lite and oclHashcat-plus will automatically choose which one to use.
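To make the vector-vs-scalar point a bit more concrete, here is a purely illustrative sketch in plain C. The uint4 struct stands in for the OpenCL vector type, and the function names are mine, not taken from the actual kernels:

```c
#include <stdio.h>
#include <stdint.h>

/* Plain C stand-in for the OpenCL uint4 vector type: the old VLIW-style
   kernels push four password candidates through every operation at once. */
typedef struct { uint32_t x, y, z, w; } uint4;

/* VLIW-friendly style (hd5000/hd6000 era): one statement drives four lanes. */
static uint4 mix_vec(uint4 a, uint4 b)
{
    uint4 r = { a.x ^ b.x, a.y ^ b.y, a.z ^ b.z, a.w ^ b.w };
    return r;
}

/* GCN-friendly style (hd7970): plain scalar types, one candidate per work
   item; the hardware keeps its units busy instead of the compiler. */
static uint32_t mix_scalar(uint32_t a, uint32_t b)
{
    return a ^ b;
}

int main(void)
{
    uint4 a = { 1, 2, 3, 4 }, b = { 5, 6, 7, 8 };
    uint4 v = mix_vec(a, b);

    printf("vector lanes: %u %u %u %u, scalar: %u\n",
           (unsigned)v.x, (unsigned)v.y, (unsigned)v.z, (unsigned)v.w,
           (unsigned)mix_scalar(1, 5));
    return 0;
}
```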
To make 100% sure it works, I generated 6,000,000+ descrypt hashes out of my dictionaries. Then I let them crack while running the hashes against the dictionaries they were generated from. They were all cracked, 100%.
So why is this an outstanding performance improvement? Let's do a quick comparison. A stock-clocked hd5870 makes around 40 Mhash/s on this algorithm. The comparable next-generation card was the hd6970, which makes around 46 Mhash/s. So we had an improvement of 15%. Now the hd7970 makes 104 Mhash/s. That is an improvement of 126% over the hd6970 (or 160% over the hd5870). Improvements of around 20% are normal, but more than 100%? I think this is impressive.
If you imagine running a rig with 8x hd7970 @ 1125 MHz, it will break the 1000 Mhash/s mark. In other words, roughly 85 days to brute-force through the complete descrypt keyspace.
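If you want to check the ~85 days figure yourself, here is the back-of-the-envelope arithmetic. The assumptions are mine: the full 95-character printable keyspace at length 8 and a sustained rate somewhat below the theoretical 8x 126 Mhash/s peak.

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    double keyspace = pow(95.0, 8.0); /* 95 printable chars, length 8: ~6.6e15 */
    double rate     = 900e6;          /* assumed sustained rig rate in hash/s  */
    double days     = keyspace / rate / 86400.0;

    printf("keyspace: %.2e candidates, exhaustive search: ~%.0f days\n",
           keyspace, days);
    return 0;
}
```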
Another nice thing to mention: with this GPU, the time needed to attack so-called memory-intensive algorithms dropped to about a quarter of what it was on the last GPU generation. While memory lookups are still the best approach to slow down GPU-based cracking, I did not expect this card to be that strong in this field.
There are some more interesting "facts" about the specs of the new NVidia Kepler here: https://semiaccurate.com/2012/01/19/nvidi...ar-winner/ If they are true, we can expect even more power from that card, especially for cracking memory-intensive algorithms. NVidia has always been faster than AMD at looking up data from GPU memory in GPGPU; you can see this when it comes to multihashing.
The hd7970 was released about two weeks ago, and since then I have received more than 20 requests for access to beta versions that are able to utilize it. I am afraid the requests will keep increasing every day. This brings me to an important topic: when to release oclHashcat-lite v0.09 and oclHashcat-plus v0.08? In theory they are good to go. However, there is a major problem: there is no Linux driver.
You might think: who cares, I am on Windows? That might be true, but my development system runs Linux. As long as there is no Linux driver I cannot efficiently search for new optimizations. Currently I have to compile the kernel using the SDK that is able to produce GCN code, move the new kernel to my Windows desktop/workstation, run it there, and do trial and error.
Another problem: imagine I release the current version and the Linux driver comes out but is incompatible with the released kernels. That would require yet another release.
On the other hand, nearly all requests came from people who bought these cards just to run oclHashcat* on them. But I don't want people to get beta access just to get their hands on the latest versions. Being a beta tester means doing tests, reporting bugs, suggesting changes, helping the project, etc.
In oclHashcat-lite I have finally added the requested algorithm for "IPB2+, MyBB1.2+". The performance is pretty good for such an ugly algorithm, mostly because it looks like the guys who invented it did not think much about it.
The original scheme is: MD5(MD5($salt).MD5($pass)).
The problem with this is that, for each hash, the MD5($salt) portion can be precomputed at startup.
So, with the precomputed MD5($salt) taking the role of the salt, it becomes just: MD5($salt.MD5($pass))
We end up at nearly the same speed as the newer vBulletin scheme, which is MD5(MD5($pass).$salt). But since the plaintext for the final MD5() is always of size 64, we can do some additional optimizations that cannot be done for vBulletin.
Conclusion: MD5(MD5($salt).MD5($pass)) can be computed faster than the easier-looking MD5(MD5($pass).$salt). Ouch!
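To illustrate why the precomputation makes the scheme so cheap to attack, here is a minimal host-side sketch using OpenSSL's MD5(). The helper names and test values are mine; this is not hashcat code.

```c
#include <stdio.h>
#include <string.h>
#include <openssl/md5.h>

/* Hash 'len' bytes of 'in' and write the lowercase hex digest to 'out'. */
static void md5_hex(const char *in, size_t len, char out[33])
{
    unsigned char d[MD5_DIGEST_LENGTH];
    MD5((const unsigned char *)in, len, d);
    for (int i = 0; i < MD5_DIGEST_LENGTH; i++)
        sprintf(out + 2 * i, "%02x", d[i]);
}

int main(void)
{
    const char *salt = "Zq1";                              /* hypothetical   */
    const char *candidates[] = { "password", "letmein" };  /* test passwords */

    char salt_md5[33];
    md5_hex(salt, strlen(salt), salt_md5);  /* computed once per target hash */

    for (int i = 0; i < 2; i++) {
        char pass_md5[33], block[65], final[33];

        md5_hex(candidates[i], strlen(candidates[i]), pass_md5);

        /* The plaintext of the final MD5() is always exactly 64 hex chars. */
        memcpy(block, salt_md5, 32);
        memcpy(block + 32, pass_md5, 32);
        block[64] = '\0';

        md5_hex(block, 64, final);
        printf("%s -> %s\n", candidates[i], final);
    }
    return 0;
}
```

Compile with -lcrypto; the point is simply that only the right half of the 64-byte final block changes per candidate.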
Another major change in oclHashcat-lite is the full switch to CUDA v4.1, which is in a "productive beta stage".
Since I use my GTX 560 Ti for testing the CUDA kernels, I found out it is necessary to use vector datatypes. OK, that's nothing new. But switching to vector datatypes in combination with CUDA v4.1 brought a massive cracking performance improvement on sm_21 cards. Now my GTX 560 Ti cracks MD5 at 1660 Mhash/s. Here is a preview listing: https://pastebin.com/SKgijei1
In oclHashcat-lite, the harmonization of the --hash-type modes between oclHashcat-plus and oclHashcat-lite is done. Both programs now share the same numbers if they share the same algorithm. This change still needs to be done for hashcat CPU, but that has lower priority for me.
The same applies to the --help screen. The design is now analogous to the new one from oclHashcat-plus.
In oclHashcat-lite I found out that one of the optimizations I use in the raw SHA1 kernel can be ported to the MySQL algorithm, which is just a double-iterated SHA1. That means the first iteration is now computed using my "special" optimization. The second one, which cannot be ported, stays close to the reference implementation.
On my stock-clocked hd5970 this improved performance from 1410 Mhash/s to 1515 Mhash/s.
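For reference, the double-SHA1 construction behind MySQL's PASSWORD() (4.1 and later) is simple enough to show in a few lines. This sketch uses OpenSSL's SHA1() and has nothing to do with the kernel optimization itself:

```c
#include <stdio.h>
#include <string.h>
#include <openssl/sha.h>

int main(void)
{
    const char *pass = "password";  /* test value */
    unsigned char d1[SHA_DIGEST_LENGTH], d2[SHA_DIGEST_LENGTH];

    /* First iteration: SHA1 over the raw password bytes.  In the kernel,
       this is the iteration the ported optimization targets. */
    SHA1((const unsigned char *)pass, strlen(pass), d1);

    /* Second iteration: SHA1 over the 20 raw digest bytes of the first. */
    SHA1(d1, SHA_DIGEST_LENGTH, d2);

    /* MySQL stores this as '*' followed by the uppercase hex digest. */
    printf("*");
    for (int i = 0; i < SHA_DIGEST_LENGTH; i++)
        printf("%02X", d2[i]);
    printf("\n");

    return 0;
}
```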
In oclHashcat-plus I have added some of the missing features from todos.txt. The most important one was the backport of the GPU-based password candidate generator from oclHashcat-lite v0.08.
Many people complained about a bad Real-to-GPU ratio. If you encounter the same problem, this implementation will fix it. The current version, oclHashcat-plus v0.07, generates the plaintexts on the host (the PC) into a memory segment and handles it as if it were a dictionary loaded from disk. After that it has to copy the data to the GPU.
The new generator is a real GPU kernel, like the cracking kernels, but it does not crack any hashes. It just generates the plaintexts to be checked, directly in GPU memory. That saves CPU utilization and PCI-Express overhead, since the data no longer needs to be copied.
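The core idea is easy to show on the host side: every work item derives its own plaintext directly from its global index, so no candidate buffer ever has to cross the PCI-Express bus. A minimal sketch with a made-up charset and function name (this is not the actual kernel code):

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

static const char charset[] = "abcdefghijklmnopqrstuvwxyz0123456789";

/* Mixed-radix decomposition: candidate number -> plaintext of length 'len'. */
static void index_to_candidate(uint64_t idx, int len, char *out)
{
    const uint64_t base = strlen(charset);

    for (int i = 0; i < len; i++) {
        out[i] = charset[idx % base];
        idx /= base;
    }
    out[len] = '\0';
}

int main(void)
{
    char buf[16];

    /* On the GPU, 'gid' would be the work item's global id and the result
       would stay in GPU memory for the cracking kernel to consume. */
    for (uint64_t gid = 0; gid < 5; gid++) {
        index_to_candidate(gid, 6, buf);
        printf("work item %llu -> %s\n", (unsigned long long)gid, buf);
    }
    return 0;
}
```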
Another major change in oclHashcat-plus was the full switch to AMD APP SDK 2.6. I stuck with AMD APP SDK v2.4 in oclHashcat-plus v0.07 because it produced slightly faster code for single hashes. The reason for this can be found here: https://forums.amd.com/forum/textthread.c...id=1276470
But SDK 2.4 does a bad job in multihash mode. I did not notice this during the tests I did before releasing v0.07; sorry for that! I recompiled the kernels with SDK 2.6 and made some other optimizations that lead to faster multihash cracking than with regular oclHashcat. Thanks again to BlandyUK for reporting and confirming.
Some days ago I wrote on Twitter about my optimized register utilization in the oclHashcat-plus WPA/WPA2 kernels.
This optimization leads to a 4% cracking performance increase. I will not mystify what I did, so here it is: I split the main loop kernel into two parts. In WPA we are computing two HMAC-SHA1 digests in parallel. Splitting them into two different kernels saves registers, and since the kernel launch overhead is cheap, this gives some improvement. Now I just run both kernels in sequence.
On my hd5970 this goes from 158k to 165k, on the hd7970 from 128k to 133k.
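For anyone wondering what the two HMAC-SHA1 digests are: the 32-byte WPA PMK is PBKDF2-HMAC-SHA1(passphrase, SSID, 4096 iterations), which consists of exactly two independent PBKDF2 blocks, F(1) and F(2). Because the blocks do not depend on each other, each one can live in its own kernel and the two kernels can simply run one after the other. Here is a host-side sketch using OpenSSL; the helper name and test values are mine.

```c
#include <stdio.h>
#include <string.h>
#include <openssl/evp.h>
#include <openssl/hmac.h>
#include <openssl/sha.h>

/* One PBKDF2-HMAC-SHA1 block: U1 = HMAC(P, S || INT(i)), Uj = HMAC(P, Uj-1),
   F(i) = U1 ^ U2 ^ ... ^ Uc.  The SSID salt is at most 32 bytes. */
static void pbkdf2_block(const char *pass, const unsigned char *salt,
                         size_t salt_len, int iter, unsigned int block_index,
                         unsigned char out[SHA_DIGEST_LENGTH])
{
    unsigned char buf[32 + 4], u[SHA_DIGEST_LENGTH];
    unsigned int ulen;

    memcpy(buf, salt, salt_len);
    buf[salt_len + 0] = (block_index >> 24) & 0xff;
    buf[salt_len + 1] = (block_index >> 16) & 0xff;
    buf[salt_len + 2] = (block_index >>  8) & 0xff;
    buf[salt_len + 3] = (block_index      ) & 0xff;

    HMAC(EVP_sha1(), pass, (int)strlen(pass), buf, salt_len + 4, u, &ulen);
    memcpy(out, u, SHA_DIGEST_LENGTH);

    for (int j = 1; j < iter; j++) {
        HMAC(EVP_sha1(), pass, (int)strlen(pass), u, SHA_DIGEST_LENGTH, u, &ulen);
        for (int k = 0; k < SHA_DIGEST_LENGTH; k++)
            out[k] ^= u[k];
    }
}

int main(void)
{
    const char *pass = "passphrase";              /* test values */
    const unsigned char ssid[] = "linksys";
    size_t ssid_len = strlen((const char *)ssid);
    unsigned char f1[SHA_DIGEST_LENGTH], f2[SHA_DIGEST_LENGTH], pmk[32];

    /* In the optimized kernels, these two independent computations map to
       two separate kernels launched in sequence. */
    pbkdf2_block(pass, ssid, ssid_len, 4096, 1, f1);
    pbkdf2_block(pass, ssid, ssid_len, 4096, 2, f2);

    memcpy(pmk,      f1, 20);
    memcpy(pmk + 20, f2, 12);   /* PMK = F(1) || first 12 bytes of F(2) */

    for (int i = 0; i < 32; i++)
        printf("%02x", pmk[i]);
    printf("\n");

    return 0;
}
```

Compile with -lcrypto; for the same inputs the output should match OpenSSL's own PKCS5_PBKDF2_HMAC_SHA1() with a 32-byte key length.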
I spent some time implementing the requested features. If you want to keep track of them, here is the list: https://hashcat.net/wiki/feature_requests
- maskprocessor: Allow users to --increment while generating password lists or rules.
- maskprocessor: Allow the user to start incrementing from a chosen character length.
- maskprocessor: Enable the user to start or restart from a progress number during brute force.
- hashcat: Uppercase the first letter and every letter after a space in the same line.
- oclHashcat-plus: Uppercase the first letter and every letter after a space in the same line.
- oclHashcat-plus: Add the BSSID, rules files and hash file used to the status screen.
- oclHashcat-plus: If the hash parser rejects a hash, print the offending line number in the error message.
- oclHashcat-plus: Let the user choose their own separator char like in hashcat CPU.
- oclHashcat-plus: Add md5(md5($pass)) and call it e107.
and last but not least:
- hashcat-utils: Add a "cut -b" alternative which is able to work with negative offsets.
This last one goes back to a suggestion from chort. He wrote a nice article about how to use it and got a lot of credit for it. I also recommend reading it: https://rants.smtps.net/2012/01/Conductin...shcat-plus
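As a side note on the negative-offset idea: my reading of the request is that a negative byte position simply counts from the end of the line. A toy illustration (not the actual hashcat-utils implementation):

```c
#include <stdio.h>
#include <string.h>

/* Return the byte at position 'pos' (1-based; negative = from the end),
   or -1 if the position is out of range. */
static int byte_at(const char *line, int pos)
{
    int len = (int)strlen(line);

    if (pos > 0 && pos <= len)  return line[pos - 1];
    if (pos < 0 && -pos <= len) return line[len + pos];
    return -1;
}

int main(void)
{
    const char *word = "password123";

    printf("%c\n", byte_at(word, 1));   /* 'p'  (like cut -b 1)  */
    printf("%c\n", byte_at(word, -1));  /* '3'  (last byte)      */
    printf("%c\n", byte_at(word, -3));  /* '1'  (third from end) */
    return 0;
}
```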
So far so good, later!!