Thursday, 26 December 2013

Searching with Bloom Filters in Haskell


This post is primarily about my experience with the bloomfilter package in Haskell.

Coding it up

Coding it up was much of a muchness, aside from a late realisation that there's no Hashable instance for Text which caused a quick change from Text to ByteString, but aside form that, it was very easy.

Even querying the index is easy, I shall post up the code when I am back at home, since I'm with my parents over the Christmas period.

Overall, it's roughly 38 lines, but around 7 lines of that are simple module imports.


To test it, I grabbed roughly 2.5MB of data from Project Gutenberg, and simply started querying the index. I haven't done any "proper" performance testing so I've just been feeling it out with no timers.

I didn't bother forcing anything, so the first query can take a little while as the index is built, but it's nothing prohibitive. After that, every search feels instantaneous.

After building with -rtsopts, I had it generate a heap profile, which showed that the peak memory allocated with my test set was roughly 8MB, but that fell back down quite rapidly.


I suspect that a carefully used deepseq, and ByteString.Lazy could really improve the memory foot print of the system, and remove the initial "hang" when doing the first search. I shall have to investigate this at a later date.


If I could just get access to the underlying UArray Int Hash in a way that I could manipulate, I could make searching more probabalistic, but also much faster.

Similarly, I could extend the usage of a bloom filter into a similarity metric, but unfortunately, I'm not entirely sure how to use the underlying representation at this time.

I basically need access to the cardinality of the underlying representation, and the ability to union and intersect the underlying bit arrays. Once I have these (or worked out how to do it on the underlying representation) performing similarity matching becomes very simple.


Haskell's bloom filter library seems to be very fast whilst maintaining an easy to use facade.

However this is not flexible enough to really play with bloom filters in all the ways they can be very easily. Perhaps basing the underlying representation on the bitset package, or offering a "serialisable" version which can be easily manipulated would be all it takes.

Wednesday, 25 December 2013

Client-Side Similarity Matching


This post is entirely inspired by a post I had recently seen on HackerNews, specifically, "Writing a full-text search engine using Bloom filters", by Stavros Korokithakis. I then went off and did a bit of Googling, and came up with a paper by Jain, Dahlin and Tewari.

The Full-Text Search

Korokithakis' method is actually reasonably ingenious, take your documents, tokenise them, and enter them into a bloom filter. This gives a highly compact representation of the set of tokens in your document which is extremely efficient to query. Take the bloom filter, and add it to a map of DocumentId ->  BloomFilter.

To search, one tokenises the query,  iterates the values (BloomFilters) of the map querying if the query tokens appear in the bloom filter. If they do, the document is considered a hit in the search.

It doesn't take much imagination to extend the tokenising to stop-word filtering, stemming, n-grams, and all the other lovely methods used by more traditional full-text search engines to make the search more usable by humans, and to extend the "relevance" beyond hit/no hit (e.g. more hits = better is one improvement to the described scheme that I can think of).

Similarity Searching

This is where Jain, Dahlin and Tewari's paper comes in handy. They used bloom filters for detecting similar web pages when crawling a site.

Their method was to take index the web pages using a bloom filter, but using a different tokenising method, and then to simply count the proportion of matching bits in the resulting bloom filters. They then have some nice analysis to show how "similar" two documents can be considered to be if their bloom filters have a high proportion of set bits.

Onto the Client

This is very simple, you parcel up your index, but instead of searching it, you simply take the current document's already computed bloom filter (no need to send over the pesky tokenisation code!) and do a bitwise AND on the other elements of the index. The ones with the highest cardinality are the ones which are most similar to your selected article.

This is also the kind of calculation that could be done once at at index generation time.

Monday, 25 November 2013

My Experience with Temporary Blindness

How it Happened

For those who don't know me personally, I wear some rather strong glasses. I have a natural focal length of something like 13cm (That is to say, anything beyond 13 centimeters is just a blur) without my glasses. In recent years, I've switched to contact lenses, continuous wear ones to be specific. These allow me to lead a life which is reasonably free of glasses, to engage in certain sports that were just prohibitively difficult before.

Fast forwards to 12th of November, and I take my lenses out because I had some pain in my eye. The next morning, at around 0530, I work up with an unexpectedly painful eye. It had swelled shut. So bad was the pain that my other half took me to the accident and emergency department to get it looked at.

I was told that I had a serious infection of the cornea, or an ulcer, and was at risk of losing the eye. I was given muscle relaxant eye drops, antibiotic eye drops, local anesthetic at the hospital and discharged. The basic effects of the (what was confirmed at a later appointment) infection was that any movement of the eye was quite painful, including the movement of my iris (hence the muscle relaxant). Any movement, as triggered by my "good" eye will also cause me a lot of pain, so using either eye is right out.

So I was essentially stuck in a position where my least painful route for the day was to wear a blindfold, take my medications at the allocated times and take normal, over-the-counter painkillers. This was my state for two days. I could not see at all.

I took the time off work, predictably, since I ride my bike into work, this could've ended badly!

My Life, Blind

I work at the local university as a researcher, this actually sounds far more impressive than it is -- I normally just end up writing Java on a Linux box. At home, I rely on Linux Mint 14.

Overall,  the experience was patchy. I asked my other half to do a quick google to find out what the screen reading software for Linux Mint 14 was, and the result was something called "Orca". I used my limited knowledge of keyboard shortcuts to get to a terminal, run aptitude install orca, logout and in, and start Orca.

The terminal worked mostly well, except for the fact that I use zsh, which lead it to reading out a lot of superflourous stuff that you tend just filter out when reading it, but only read when you need. This is not the case with a screen reader -- everything (including my verbose prompt) is read at you all the time, after every command. At first it is very overwhelming.

I tried to use Miro to get some podcasts to pass the time,. and while using aptitude to install it was easy, using Miro was anything but. I ended up switching to my usual media player which was Banshee. Banshee required a restart to get it to pick up the fact that I was using a screen reader. It however worked ok after this. It was easy enough to navigate, however, it was very difficult to select a specific podcast episode (I could select things like, favourite songs, podcasts, all items, new items, various podcast "channels" that I'd subscribed to.

So far I've been getting through the Haskell Cast episodes  that I have wanted to listen to for a while. I plan to listen to a few more podcasts in my suddenly spare time.

Moving around my flat has actually been perfectly ok, it just took me a little longer. I mostly walked around my flat with my hands stuck out or a hand on  a wall. I could still do everything in the bathroom, but I was disbarred from my own kitchen. I didn't have to leave my house yet, except for going to the hospital, for which I had a guide.

The evening after I got back from hospital, I had a friend round to visit. luckily, she knew about my predicament and brought round a board game. It would seem counter-intuitive that a board game would be any good, but she brought round Tokyo Monster Something=or-other. It actually worked really well with my other half telling me what I had rolled, and who was currently in Tokyo and other information. I can't imagine any of my boardgames being even slightly accessible, and we definitely couldn't have played Xbox or watched something on Netflix.  But that's somewhat ok, since  haven't exactly tailored my living room for that.

Even Pidgin worked really well -- I use Google Talk at work, DukGo's XMPP server for comms with my friends, so I was well served by Pidgin and it worked flawlessly with the screen reader. In terms of accessibility, it's ease of use excelled even that of the preferences panels of Orca!


This did not stop me getting quite isolated. I couldn't pass the time on my Xbox, or watching TV, since I don't think Netflix offers audio description on it's offerings.

I couldn't use my kindle, I've noticed now that there aren't even headphones on the thing, so I couldn't read any of my books that I've bought.

Worse than this, however, was the things which did not work with a screen reader. The main culprit on Linux Mint 14 was FireFox (Fx).

I could alt-tab through windows, and have the title of each item read to me perfectly, and when I switched to Fx, I was greeted with the most deafening silence of the whole experience. I tried as much as I could remember of getting to the preferences panel blind, but I think I just probably set some obscure settings.

I ended up using Links in the terminal. If you think this is painful when sighted, try it with Orca! Some websites "worked", but it did mean sitting through a huge list of navigation (or worse, adverts) on every page load. A few pages offered a "Skip to Content" link (I think /. was one of them), which was met with absolute glee.

Sites like reddit were quite unusable, mostly for the list of subreddits over the top, and HackerNews worked; I could read headlines after waiting through an amount of navigation that was almost (but not quite) unbearable. The links that I followed were a mixed bag, and actually working Links can be a bit hit and miss, especially when you're disorientated, and you're not sure if you've managed to exit the terminal window all together or something else entirely.

When there's a lot of navigation on a page, it becomes hard to stop yourself just zoning our whilst it is read at you, and then you realise that you're in the middle of the content, and you need to backup some how.

For reference, I also tried to install Chromium, and was met with substantially the same problems -- silence.

Even making notes for this blog were not a walk in the park. I ended up using GEdit. Actually writing the document was reasonably easy. It was more difficult to save it. After Hitting Ctrl-S, I was a little overwhelmed and confused by the amount of things it said to me, but I ended up navigating via tab (Also worth noting "push button" is very difficult to discern when said by a screen reader).


I would say that prior to this issue, I was still sensitive to accessibility issues, having a sight problem means that I often end up zooming pages when my eyes are tied, or I have to wear my glasses at work. But nothing could've prepared me for how incredibly disorientating it is to be blind at a modern computer.

Overall, computers seem somewhat usable if you're not used to not having sight, and are dumped into having to learn how to use a screen reader, but if you're unfortunate, you'll be stuck cut-off from the web. This should not be possible with a distribution  like Linux Mint and a modern web browser like Firefox. I'm glad that I was able to get around, and that with some help I was able to get my podcasts playing, and write notes about my experiences,but it shows that it is overly difficult to get things to just work when you really need it.

Saturday, 14 September 2013

Let's Build a Crypto Ecosystem


Let's pretend for a moment that we didn't have such fantastic crypto primitives such as AES, SHA3 and so on, and we'd not (for some reason) never needed crypto, how would we go about building a 'modern' crypto system?

Firstly, this is meant as a basic overview of how most crypto primitives that are in use today are put together. Not the "most interesting" bits, but the pieces that do all the heavy lifting -- the symmetric ciphers.

What is a cipher suite, and what exactly is AES-128-CBC-HMAC-SHA1 ?  We'll get into what these sort things mean, but first we have to start somewhere.

Jumping Off

Block ciphers typically take as input a key and a "block". A block is just a fixed-width collection of bits. The Data Encryption Standard (DES) takes 64-bits, and the Advanced Encryption Standard (AES) takes 128-bits. The great thing about this is that they are reasonably easy to reason about, lend themselves to parallel usage (depending on the usage) and also relatively easy to implement in hardware.

For a really basic block cipher, we'll be pinching Heys' SPN cipher from his Linear & Differential Cryptanalysis Tutorial. I've chosen it because it's reasonably easy to understand, implement and the design is small and freely available.

Heys' paper does not specify a key schedule. I am going to decree that the key is 80-bits long, which gets broken into 5 16-bit subkeys to be fed wholesale into the relevant parts of the cipher.

I have put my implementation up at on a Gist. We can use this a a basic implementation for some fun work.

Note, I've used this cipher for an additional reason: to drive home the fact that you never, ever write your own crypto. That includes plugging together AES into AES-CBC mode, or making your own C-MAC from a block cipher. Just don't do it, this blog post is for demonstration purposes only.

If you write your own crypto in production, you will lose the security game, your data is not safe.

With that out of the way, let's move on.


If we want to go down the route of academic correctness, we must first convince ourselves that our fancy new cipher is a Psuedo-Random Permutation (PRP).  That basically means that an expected adversary (i.e. one with normal, classical computers) cannot distinguish the cipher from a random permutation function.

We aren't going to go into what that exactly means, but a decent way to see how good a PRP our cipher makes is to turn it into a PRNG and see how well it handles the FIPS 140-2 PRNG tests. These just throw a small battery of statistical tests at our random data, and see how it fares. For reference, let's see about /dev/urandom:

[Sat 13/09/14 21:52 BST][pts/3][x86_64/linux-gnu/3.5.0-37-generic][5.0.0]

zsh/2 936  (git)-[master]-% cat /dev/urandom | rngtest
rngtest 2-unofficial-mt.14
Copyright (c) 2004 by Henrique de Moraes Holschuh
This is free software; see the source for copying conditions.  There is NO warranty; not even for MERCHANTABIL

rngtest: starting FIPS tests...
rngtest: bits received from input: 109576192
rngtest: FIPS 140-2 successes: 5474
rngtest: FIPS 140-2 failures: 4
rngtest: FIPS 140-2(2001-10-10) Monobit: 0
rngtest: FIPS 140-2(2001-10-10) Poker: 0
rngtest: FIPS 140-2(2001-10-10) Runs: 3
rngtest: FIPS 140-2(2001-10-10) Long run: 1
rngtest: FIPS 140-2(2001-10-10) Continuous run: 0
rngtest: input channel speed: (min=1.051; avg=42.559; max=19073.486)Mibits/s
rngtest: FIPS tests speed: (min=956.476; avg=53182.979; max=71806.066)Kibits/s
rngtest: Program run time: 4499000 microseconds

A few failures are to be expected; it's supposed to be random, so "extreme events" do occasionally occur; they just should occur too regularly ;-)

Let's make our cipher into a PRNG. We're going to simply write to a file E(0,i) where i is [1..b] so we end up with ~b random bytes. A quick implementation of this is available

Let's run our small test:

[Sat 13/09/14 21:51 BST][pts/3][x86_64/linux-gnu/3.5.0-37-generic][5.0.0]

zsh/2 934  (git)-[master]-% cat randomBytes.txt | rngtest 
rngtest 2-unofficial-mt.14
Copyright (c) 2004 by Henrique de Moraes Holschuh
This is free software; see the source for copying conditions.  There is NO warranty; not even for MERCHANTABIL

rngtest: starting FIPS tests...
rngtest: entropy source exhausted!
rngtest: bits received from input: 2097152
rngtest: FIPS 140-2 successes: 0
rngtest: FIPS 140-2 failures: 104
rngtest: FIPS 140-2(2001-10-10) Monobit: 104
rngtest: FIPS 140-2(2001-10-10) Poker: 104
rngtest: FIPS 140-2(2001-10-10) Runs: 104
rngtest: FIPS 140-2(2001-10-10) Long run: 71
rngtest: FIPS 140-2(2001-10-10) Continuous run: 0
rngtest: input channel speed: (min=3.328; avg=74.461; max=19073.486)Mibits/s
rngtest: FIPS tests speed: (min=1.056; avg=30.812; max=97.813)Mibits/s
rngtest: Program run time: 93448 microseconds

Note the line "rngtest: FIPS 140-2 failures: 104" and how many failures are listed, now back to /dev/urandom. Take that as a warning; this isn't even the DieHarder tests.


A note on notation: E(k,P) means encryption a plaintext P under key k. D(k,C) means decrypting ciphertext C under key k. So for example, the statement D(k,E(k,P)) means encrypting and then decrypting a plaintext, P, under the same key. Incidentally, for any cipher, this should always result in P.

Now, assuming we have our Heys' Cipher implemented, we can just start throwing bytes at it, right? So we have plaintext P = P1, P2, ..., PN (our plaintext is composed of several plaintext "blocks", each of length 16 bits), so we can just compute C = E(k, P1), E(k, P2), ..., E(k, PN) and have a sufficiently scrambled message. Decryption in this case is obvious.

Well, you'd think, but it's actually a really, really Bad Idea. This is a cipher mode all to itself, and it is known as Electronic Code Book, or ECB.

This is not so good. We know that E(k, p) is deterministic. This means that if C1 = C, and the attacker knows they were both encrypted under the same key, then the attacker knows that E(k, P1) = E(k, P) => P1 = P. The attacker can then do all sorts of fun things, such as re-sending the same block to cause havoc.

So, how do we modify things so that we don't get this issue? First, let's modify our notation slighty, so that P[i] means the i'th block of the plaintext P, and we'll also use '+' to mean xor.

A common way so to use something called Chained Block Cipher (CBC) mode. This is, in notation, expressed as C[i] = E(k, C[i-1] + P[i]).

This basically means that for each block you wish to encrypt, you xor the result of encrypting the previous block with the current plaintext block. This works surprisingly well, but it has a fatal problem. How do you do P[1], the first block?

Well, the normal way is to just generate a completely random block and to package that up in the clear with your message and send it on. This is known as an Initialisation Vector, or IV.

There are also other issues with this, one of them is performance. You can't access C[i] until C[i-1] is computed. This mode is inherently serial, you can't get any faster by throwing processors at it, you have to simply buy a faster processor. Not good in today's CPU market.

How do you get around this? Well, another mode of operation is Counter (CTR)  mode. Conceptually, this is much simpler, and very similar to how I turned the Heys Cipher into a PRNG. It is expressed simply as C[i] = E(k, f(IV, i)) + P[i].

In words, this is encrypting an IV (often just 'nonce') with a counter of some sort. The resulting stream is then xor'd with each plaintext block. This is basically turning the block cipher into a stream cipher, and can be used as such. It should be noted that the function f in the above expression is almost always simple addition, however this was not always the case.
This has a few benefits, the most obvious being that it is easily parallelised. 

So, we can easily construct a working cipher system, right?

Well, not quite. In my next post, I'll go over authentication. Authentication is a crucial part of any modern cipher system. If you're involved in designing a cryptosystem, and you're not thinking about authentication, you probably should not be designing a cryptosystem.

Monday, 22 July 2013

Fifty Shades Follow Up

So, my previous post managed to make it to the front page of Hacker News! Woo! I essentially achieved my end goal which was to provoke some reasonably thoughtful debate on the topic of censorship of pornography.

Firstly, I'd like to thank those that took the time to read the post, and more so, I'd like to thank those who took part so vigorously in the debate, even if you didn't read the whole post ;-)

I would, however, like to address some of the recurring comments that came up on the Hacker News thread and the comments of this blog.

I would definitely like to see more intelligent discussion on the subject, even though the subject of censorship has been done to death. I personally find the idea that censorship can ever be a good thing an extremely hard pill to swallow, but bring me peer-reviewed evidence and I shall re-assess my position.

Onto the points made by the commentors.

You Didn't Read The Book

Yes! That's absolutely right, I do explain my assertion that the book contains depictions of rape, and they're not based on personally reading the book.

The assertions I have passed on were from a section towards the end of the novel where the protagonist declares that she never wanted anything from Mr. Grey, and that she felt forced to sign some form of "sex contract".

Rape issues aside, it's a blatant violation of standard BDSM practices. If a practitioner is not enjoying it and providing enthusiastic consent then someone is doing something horribly wrong.

The assertion was that this passage alone puts the whole book in a much darker context. I did not relay that assertion very well, as I wanted to focus on the outcomes of the logic and I had assumed that many of the people who'd read the book felt the same way about the acts portrayed in the novel.

It's Absurd to Claim Books will be Banned

"This is the first I've heard about books being banned. I'm extremely skeptical of this claim." --alayne

Well, clearly, you didn't read to the end of the post:

"... thus bringing this absurdist tirade to a close ..."
It is an absurdist piece -- of course it will be absurd! The piece is designed to stimulate discussion on the how far you can actually stretch the definition of "online pornography" and "depicting rape".

In a legal context, I presume this will be settled by precedence, but that's a worrying state of affairs. The discretion of a judge will be the arbiter of what's "pornographic", what's "online" and so on. Chose your battles and judges right (or alternatively, bring it to court often enough so statistically you get a friendly judge), and you'll be able to enforce your personal morality on the law.

BBC News is not a Legal Source

Definitely, and thankfully so! However, nor is the Oxford English Dictionary, which was referenced, so I don't know why people chose to run with BBC News being the problem.

That aside, I was working with what information I had to hand to demonstrate a point, not to make an actual legal assessment of the situation. I assume that something that looks vaguely like Mr. Cameron's vague comments will make it to a bill some time in the future, but it will not be as simple as BBC News has made it out to be. And I hope an actual lawyer will put out a good analysis of it at that time.

Any subject this complex, with so many tensions running high on both sides will undoubtedly have compromises and loop-holes that ensure that the legislation is primarily stifling to the general populace and is not fit for purpose.

A Clockwork Orange is not Porn

To be absolutely clear, I was referring to the book, wherein the protagonist takes two 10 year old girls from a shop, drugs and rapes them. The protagonist also rapes another woman, but my memories grow fuzzy at this point. I have not seen the film, but I gather a world of difference exists between the book and film version when it comes to the rapes. From the comments, I gather that the film may also come under this new legislation, but it's not quite as bad as the book.

I would say that is definitely true that A Clockwork Orange is not porn to me, however, depending on the wording of the law, this could go either way. There could be one set of rules for "online pornography" and one for the "depiction of rape or sexual abuse". This would lead to some difficulties with works like A Clockwork Orange.

Part of the problem is that the line between what is "Art" and what is "Pornography" is so blurred that sometimes, art is porn, and porn is art.

This is not to mention the uncomfortable issue that one man's rubbish is another's gold. Just because you don't find it "pornographic" or "titillating", it doesn't mean to say that another will not.

Fifty Shades Does Not Contain Rape

This has it's own special section, separate from above as I want to tackle something very different and actually very damaging:

"Been a while since I read it and, honestly, I wasn't really taking that much notice but I don't remember any forcible or coercive rape in FSOG." --zimpenfish

As covered above, I think that the protagonist was coerced, but that is besides my point. The comment subtly claims that rape must be forcible or coercive. Rape is sex without consent of one or more of the parties involved by my personal definition. This differs from the legal version and is much broader, but I feel it fits the "spirit" of the word better.

I feel that the kind of comment above implicitly narrows the definition of rape in a very bad way.

Work X is Also in Danger!

By my logic, almost any work that deals with rape or sexual abuse and has been distributed in any form "online" could be in danger. However, my list of works in potential danger is woefully short! I only included the ones off the top of my head early in the morning, so it's bound to be short.

Bringing the total number of works which falls into this category up helps in a few ways:

  1. I chose Fifty Shades because it's widely recognised. The more works we can get, the more recognition we get, and the more uncomfortable we can make people when they realise the amount sexual abuse, rape and child abuse is actually explored in literature.
  2. More people will have these works in their houses, their local libraries, their places of study. This strengthens the point that if everyone is a criminal, then no one is by simply increasing the number of criminals to near-total coverage.

To be honest, this is some of my favourite feedback. I love hearing people's passion surrounding various artistic works; even those which deal with difficult subjects; come rushing to the surface when given even the gentlest of prods. Makes me feel that I'm not so alone in my ravenous appetite for books!

Fifty Shades has no Merit

I am not a great fan of Fifty Shades in terms of literary merit, I'll grant that. However, Fifty Shades has done something that the British public and media have struggled with for many years. Our culture is incredibly sexualised, and yet no one seems to discuss it in a sane way.

We have a sizable kink subculture that never gets discussed, and is often left out of discussions. Fifty Shades has changed that forever. Kink is finally becoming more accepted as the norm, and discussing sex and what is good during sex is no longer the sole domain of Cosmopolitan (and similar) magazines which carry with them unbearably bad advice.

That aside, you may benefit from some old poetry.

They Will Treat Male Rape Differently

Our culture at the moment does treat male rape differently. This unfortunate state of affairs could be extended to the proposed laws, and have some odd consequences.

The film "The Shawshank Redemption" does strongly imply and discuss the consequences of male rape in a prison environment. The treatment of the matter is neither funny nor in any way less "serious" than similar treatment of women in similar situations. However, I am aware that one film does not define our culture.

However, that all aside, why does gender matter in this discussion? Why bring up this particular issue in our culture that is only tangentially related?

The issue is clearly serious, but I don't feel that it's entirely relevant to this discussion. If the law does make distinctions, fair enough, let's play ball. Until then, I don't think it's relevant in this particular discussion.

Fifty Shades of Grey Made Illegal in the UK

BBC News has today reported:

"In addition, Mr Cameron will say possessing online pornography depicting rape will be illegal, bringing England and Wales in line with Scotland."
This move; applauded by book lovers, English Literature graduates and English teachers all over the country; was made by Mr. Cameron ostensibly to protect children, but he has luckily snuck in the bill the ability to ban Fifty Shades of Grey! Thank the Tory-gods!

The Logic

Put simply, the comment from BBC News above shows that possessing any "online pornography" which "depicts rape" is now a crime. 

Fifty Shades of Grey is unquestionably pornographic. It contains multiple graphic descriptions of sexual acts, and therefore the novel as a whole can be considered pornographic.

The next question is "What does it mean to depict something". I don't have a legal dictionary on hand, but the Oxford English Dictionary has this to say about it:


[with object]
  • represent by a drawing, painting, or other art form
And given that prose can definitely be used as a form of art -- especially when portraying the fictional -- we can conclude that 50 Shades of Grey does "depict rape".

Fifty Shades of Grey may have a get-out-of-censorship-free-card in that it is not online. It is a physical book. Well, mostly. It's available for Kindle, via Amazon's Whispernet. This is definitely on-line. So it is definitely on-line pornography that depicts rape.

The more difficult question is quite a simple one: does what is depicted in the novel constitute rape? I've never read the book; however; I have had some particularly revealing passages read to me. They show that the woman in the novel did not consent to the majority of the sexual acts carried out in the book.

This puts E. L. James, Amazon and a large portion of middle class women squarely in the "illegal" category of the law.

Everyone's a Criminal

Given that many contemporary novels (and less-contemporary) novels do deal with sexual abuse and rape, some of which explore the idea of "rape fantasy" (Herein referred to by the more accepted name, "ravishment") between consenting partners means that a large portion of those with an interest in BDSM and kink will fall into this category.

That is not to mention those who simply study the novels, or get no particular pleasure out of them, and those who read the novels because they'll read almost anything will all be suddenly criminals.

Let me be absolutely clear. Joe Haldeman's "Forever War" depicts a lot of sex (Some of it could be considered to be coerced), Joe Scalzi's "Old Man's War" depicts a lot of sex (although none that I remember was coerced or forced), A Scanner Darkly has a fair bit of sex. These are SF&F books, with no focus on sexuality or pornography, yet under some stretched definitions could fall foul of this law. 

These are books that I regularly recommend to family members, usually over the age of 16 because of their violent (not sexual) content.

Also falling into firmly into this category is Burgess' "A Clockwork Orange". Once again, it's available for Kindle, it definitely depicts rape, and it's read world-wide by millions.

And it's well known that if we're all criminals, none of us are, thus bringing this absurdist tirade to a close and showing the law for what it really is: Propaganda codified as law.

Thursday, 16 May 2013

Revisiting FreeBSD & Java


Some time ago, I attempted to use FreeBSD to do a small amount of development work. I documented the frustration I encountered.

It turns out that I think I was wrong!


It seems that the usual way to get packages on FreeBSD is to use the ports collection. However, this is not the only way to get packages

FreeBSD, like its rivals, OpenBSD, NetBSD and DragonflyBSD (And many proprietary Unix systems) has a tool for adding packages -- binary packages. This was found out after some time of hanging around in the OpenBSD IRC channel, and setting up an extremely locked-down Minecraft Server. That's another blog post!

Get it Right

Simply put:

#> pkg_add -r jre

Will get you a Java Runtime environment (OpenJDK 1.6.0_32). If you need the JDK, simply call:

#> pkg_add -r jdk

Instead. Job done. No faffing about trying to compile ports, no agreeing to Oracle's draconian licensing, just pure Java.

Bonus Section

We all love build systems! Automate everything, right? Well, it turns out that Maven3 is available under FreeBSD (But not Debian 6.0, my work machine). It's a similarly difficult task for getting Maven3 installed:

#> pkg_add -r maven3

And that's it. 

It's also similarly easy to get Java running on OpenBSD, just see the pkg_add man page for OpenBSD and set your PKG_PATH to your favourite mirror.

Tuesday, 16 April 2013

Salt with your Fries?


When storing a password or other sensitive data, a salt is a must. This post deals with the nitty-gritty of salt generation.

If you're here for simple advice, use bcrypt with a nice wrapper which handles salt generation for you.

Properties of a Salt

A decent salt must satisfy two minimal requirements: 
  • Unique
  • Unpredictable 
Uniqueness is fairly obvious, if your password database uses a static salt for all passwords, then it only defeats pre-computed dictionaries and rainbow tables. It does not protect you after your database is leaked, and your attacker knows the salt -- they can simply re-compute the dictionary using the salt and often break many passwords this way. Having a different salt for each password will force the attacker to create a dictionary for every password in the database.

Unpredictability is not immediately obvious. However, take a well known open-source system that uses a sequential integer as the primary key in the database and it's salt. This clearly satisfies uniqueness. However, if an account (say, for example, admin) is always the first account on the system and therefore always uses the salt "1". An attacker can pre-compute their dictionary with salt "1", when the database is leaked, the attacker can simply look up the hash value in the dictionary and see the password.

This however, presents a major problem. If I can't predict the value of a salt, how can I be sure it will not collide & violate uniqueness? 

The first, naive solution, is to simply throw away colliding salts. However, once you have a certain number of passwords, you'll spend all your time generating salts. 

The better solution is to ignore collisions, but make them extremely unlikely. With sufficient randomness, we can also ensure that collisions happen extremely rarely, even in a system which uses sharding or other "splitting" methods. It is this solution that we will explore. 

Unique Violations

The primary strength of a salt lies with the fact that an attacker must attack each password individually, or for our purposes, almost individually.

If each password can be expected to share a salt with two or three other passwords, then this is quite bad, as the amount of passwords that an attacker can crack by trying the salts is non-zero.

Ideally, an adversary picking a salt at random from a leaked set of passwords should have a negligible chance of picking a salt which will have suffered a collision, hence attacking any single salt gives them a negligible chance of breaking more than one password with that salt.

A simple manipulation of the Birthday Paradox allows us to determine the minimum bit-length for any salt conforming to this requirement, the equation that derived is:

b = log2(n2 - n) - log2(p) - 1

Where p is the probability we're looking for, n is the expected size of the password database and b is the bit-length of the salt.

So, for our requirement on a system that's expected to use 10,000 passwords, and have an attacker not expect to have more than one collision in that set would need p set to 10-5.

b = log2(1010 - 105) - log2(10-5) - 1
b = 48.83

Or about 48-bits of entropy, minimum.  

Existing Standards

NIST and RSA both give recommendations on the minimum salt length to be used in any key derivation function. NIST (NIST-SP800-132) recommends 128-bits and RSA (PKCS #5 v2.1) recommends only 64-bits. 

Most BCrypt implementations, when asked to get a salt for themselves will return a 128-bit salt, which is more than enough, and happens to comply with NIST's recommendations.

It is probably the case that RSA's recommendations are not sufficient for use in large organisations.


Simply put, if in doubt, just use bcrypt.

However, for your organisation, you can easily calculate the minimum bit-length required to ensure the desired amount of salt collisions. I would always plonk myself down on the safe-side of town and go with NIST's recommendations of 128-bits.

You should always bear in mind that poor random number generation from your OS or other source can lead to more collisions. The post above assumes good, uniformly distributed salts.

Tuesday, 26 February 2013

Cracking Postgres Passwords


This article is a brief overview of using HashCat to recover a Postgres 8.4 password.

Hash Format

The format for Postgres is md5(secret || username) where || denotes string concatenation. 


HashCat as a tool is more traditionally associated with oclHashCat-lite and oclHashCat-plus, which delegates the work to the GPU, significantly increasing the cracking speed.

At this point, I am using HashCat because of it's simplicity and flexibility in attack modes and hashing algorithms, however, most of what is written here could be translated for a suitably configured oclHashCat-lite and run on the GPU.

Retrieving the Hashes

In Postgres, if you have read access to pg_roles, you can simply use:

SELECT username,passwd FROM pg_roles;

This will show you a result like:

   usename   |               passwd                |
 postgres    | md50f7e63611a98a9936285c7d609b044da |

The field "md50f7e63611a98a9936285c7d609b044da" is the md5 hash. Simply strip it down to just the hash (removing the "md5" from the front to), and that is what you're attacking.

Using HashCat

If you download HashCat from their site, you'll likely be able to follow these instructions exactly, but these instructions were written using HashCat v0.42.

First, you must make a hash file:

echo 0f7e63611a98a9936285c7d609b044da:postgres > hashfile

Then you may run HashCat:

./hashcat-cli32.bin -m 10 hashfile -a 3 ?a?a?a?a?a?a

The flags we're using are as follows:
  • -m 10 : Use the hashing scheme md5(pass.salt)
  • -a 3 : Use the brute-force attack mode.
  • ?a?a?a?a?a?a?a : Search all passwords up to 7 characters from the character set [a-zA-Z0-9] + symbols.


Mine is still running vs. a password of my choosing, but after around 10 minutes, HashCat reported that it was now trying to crack all 6-character passwords and would take on the order of 16 hours to try them all.

HashCat's current status is that it's trying 12.22M words on my CPU every second. I'll update this blog when real results are found. I suspect at this rate, I'll give up after a couple of days, as all 7 character passwords will take on the order of 2 months.


Having finished running HashCat on my machine with the above arguments, the search finished after a full 15 hours, however, it did not find the password, as it was longer than 6 chars. For this experiment, I mainly wanted to confirm that I could get HashCat to run and attempt to crack a password. The hash given above is for a *very* long password, that actually has more entropy than an MD5 hash!

Wednesday, 6 February 2013

Doing Passwords Wrong with PayPal

What They've Done Wrong

PayPal do a lot to argue that their system is secure, however, their outward-facing practices don't necessarily live up to their own hype. If you can't make a security conscious user feel secure about their password policy, how do you think that one might feel about them having my EMV card details?

Let's look at the problems with their passwords.

Minimum Length

PayPal's minimum length is too low at 8 chars. This gives the minimal password an entropy of approximately 38-bits. Most folks can brute-force that on their phone these days. Even DES thought that 56-bits was good enough in '77. Times have changed (no, they haven't gone backwards!)

Maximum Length

They enforce a maximum length which is too low (it exists!?): 20 chars. This gives the absolute maximum entropy of a password of around 132 bits. I like to make my passwords around 256 bits. It's 2013, let me make a passphrase that's more than 20 characters, it's not that big of a performance hit!

No Copy Paste

In my opinion, this is a huge sin. As a person who uses a password manager, I use an application to generate a very long (~257 bits of entropy), very random password. I then copy and paste this password (having never seen it) into the password field.

This allows me to easily avoid re-using passwords across multiple sites, and ensures that I always pick a strong password.

Allowing Bad Passwords

PayPal does not (client-side!) check if the password you have types is a bad password. In fact, it allows 'password' and '12345678' as passwords. Yes it labels these as "weak", but it doesn't prevent you using them.

Dense Strength Checker

The strength checker offers a false sense of security. I can make it claim an 8 character password is strong. It simply is not! Even the best 8 character password is at most 53-bits of entropy.

What They Should be Doing

  • Minimum length: 12 chars
  • No Maximum Length
  • Allow Copy/Paste for users who use password managers (such as myself)
  • DropBox keeps a list of really bad passwords, and warns against them by checking client-side.

How To Deal With This

you only (unfortunately) have two options:

  • Complain to PayPal. I already have done, and I'll post the follow up.
  • Leave PayPal.
If PayPal don't correct these simple issues, I will most likely be leaving PayPal, as I do not trust them with my details if they can't manage passwords correctly.