08-18-2020, 09:21 AM
Hi there,
I want to share an idea that I'm currently scripting, and I'm looking for people who want to give some input to improve it.
My problem
I can't find a good German wordlist for hash cracking. All the wordlists I've found are lacking in some way (strange, unrealistic words, not long enough, ...).
But brute-forcing human-selected passwords is really challenging because of fantasy words, words from personal context, and so on.
My Idea
If I analyze some megabytes of text in a language (say, from Wikipedia) and build a statistic of how likely each character is to follow another, then I can generate strings (I don't want to say "words") with a defined overall likelihood. By increasing the allowed overall likelihood, I can generate strings that look like German words.
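A minimal sketch of how such a statistics step could look (the alphabet, the corpus format, and the rank-based scoring here are my simplifying assumptions, not necessarily what my script does):

import re
from collections import defaultdict

# Assumed alphabet: lowercase letters plus German umlauts and ß.
ALPHABET = "abcdefghijklmnopqrstuvwxyzäöüß"

def build_costs(corpus_path):
    # Count how often each character follows each other character.
    counts = defaultdict(lambda: defaultdict(int))
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            for word in re.findall(f"[{ALPHABET}]+", line.lower()):
                for a, b in zip(word, word[1:]):
                    counts[a][b] += 1
    # Turn counts into rank costs: the most frequent successor of a
    # character costs 0, the next costs 1, and so on. A string's
    # overall score is the sum of its transition costs, so a lower
    # total means a more "German-looking" string.
    costs = {}
    for a, successors in counts.items():
        ranked = sorted(successors, key=successors.get, reverse=True)
        costs[a] = {b: rank for rank, b in enumerate(ranked)}
    return costs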
Current state
In my tests, the most likely German 8-char string is 'stendend'. Of course, this is not a German word, but it looks very "german" and it's pronounceable. By increasing the maximum allowed likelihood for the generator, it generates a lot more strings.
The 8-char strings with an overall likelihood of 1 are:
stendeng, stendere, stendind, stengend, sterende, stindend, schenden, andenden
-> a lot of nonsense here!
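The generation itself can be a simple depth-first enumeration over those transition costs. Again a sketch under the assumptions above (a cost budget stands in for whatever scoring my script really uses, and the corpus filename is hypothetical):

def generate(costs, length, max_cost):
    # Yield every string of `length` chars whose summed transition
    # cost stays within max_cost (depth-first enumeration).
    def extend(prefix, budget):
        if len(prefix) == length:
            yield prefix
            return
        for nxt, cost in costs.get(prefix[-1], {}).items():
            if cost <= budget:
                yield from extend(prefix + nxt, budget - cost)
    for first in ALPHABET:  # the first character is free in this sketch
        yield from extend(first, max_cost)

# Example: every 8-char string with an overall score of at most 1,
# i.e. strings in the spirit of the 'stendend' list above.
# for s in generate(build_costs("dewiki.txt"), 8, 1):
#     print(s)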
Some examples of calculated likelihoods for real words:
hashcat 56
firefox 64
kölnerdom 61
langestrasse 29
suppentopf 49
bierfass 35
schnapps 41
ollesding 53
hundedreck 42
So in the likelihood range of 20-70 there are a great many realistic German words that are high-potential passwords. But the lists in this range are several gigabytes large.
A list of all strings with a length of 6-8 chars and a likelihood of 0-59 is about 60 GB. Combined with some hashcat rules (capitalize, append numbers, ...) that is a lot of work for hashcat. If you go further with likelihood and word length, the list size of course grows drastically. And after a full generation of the wordlist, I'd end up with a complete brute-force list of all possible combinations (charset_size^8), but ordered by something like a hit chance.
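Because the lists get that large, one option is to not write them to disk at all and instead pipe the generator straight into hashcat: in straight mode (-a 0) hashcat reads candidates from stdin and still applies rule files. A hypothetical invocation (gen.py stands for a wrapper around the sketches above; -m 0 is just MD5 as an example, and best64.rule ships with hashcat):

python3 gen.py dewiki.txt --min-len 6 --max-len 8 --max-cost 59 \
    | hashcat -m 0 -a 0 -r rules/best64.rule hashes.txt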
State of the code
I have an ugly Python script that gets the job done. It parses an input text file to build the statistics and generates words with a defined length and likelihood. It's about 150 LOC. Of course I'm willing to share it, but it's too ugly at the moment. Right now it is nothing more than an idea which might be good.
My questions
- What do you think about this idea?
- Are there ideas for further optimizations or other approaches?
- Has anyone here done experiments in the same context?
- Every bit of feedback is welcome!
- Or, simply: do you have a good German wordlist?
So, good hunting!
PyDreamer