xml signatures and character encodings
Fri, Nov 7, 2008
While I was implementing the final piece of the puzzle for Guanxi2, I came across an interesting problem to do with verifying the signature on the UK Federation metadata: I just couldn’t get it to verify. First, let me get one problem out of the way. There’s a C14N bug in the xml-security 1.4 line, and the release I was using failed to verify the signature because of it. I upgraded to 1.4.2 and that was OK. I then refactored the test code into the main Guanxi code and it failed to verify again. Here’s why.
Handing the metadata to the parser as an InputStream works just fine: I can verify the signature on the metadata after converting to an EntitiesDescriptorDocument. However, the code I had previously used, which wrapped the stream in an InputStreamReader, fails to verify the signature.
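A minimal sketch of the two code paths, not the Guanxi code itself. It assumes the JDK's built-in DOM parser as a stand-in for XMLBeans, a made-up one-element document in place of the federation metadata, and ISO-8859-1 passed explicitly to simulate a platform default that isn't UTF-8 (a bare `new InputStreamReader(in)` would pick up whatever the platform default happens to be):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class StreamVersusReader {
    // Hypothetical stand-in for the metadata: UTF-8 XML containing one non-ASCII character.
    static final byte[] METADATA =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?><Org>Universit\u00e9</Org>"
            .getBytes(StandardCharsets.UTF_8);

    // Raw bytes go straight to the parser, which works out the encoding for itself.
    static String parseFromStream(byte[] xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new ByteArrayInputStream(xml))
                          .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // The reader turns bytes into characters before the parser ever sees them,
    // so the parser has to trust whatever charset the reader used.
    static String parseFromReader(byte[] xml, Charset readerCharset) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Reader reader = new InputStreamReader(new ByteArrayInputStream(xml), readerCharset);
            return builder.parse(new InputSource(reader))
                          .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With the stream, the element text comes back intact. With a reader that decodes using the wrong charset, the two UTF-8 bytes of the accented character arrive as two separate characters, even though the XML declaration says UTF-8.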
The more astute reader should now be rolling on the floor in paroxysms of laughter: I haven’t specified the encoding. A quick check of the charset in use yielded MacRoman, the default charset on OS X. So there are two ways to fix it: set the charset explicitly in code, or override the JVM’s default encoding for Tomcat in catalina.sh.
Obviously the code way of doing it is the only way as I don’t want people having to mess around with character encodings in their servlet container. The code knows it should be UTF-8, so it should set it accordingly.
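As a rough illustration of the in-code fix, here is a sketch that names the charset when building the reader. The class and method names are hypothetical, and `Charset.defaultCharset()` is only my assumption about how one might check what a charset-less reader would fall back to:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharset {
    // The charset an InputStreamReader built without one would use on this JVM.
    static String platformDefault() {
        return Charset.defaultCharset().name();
    }

    // Read the whole stream as UTF-8, whatever the platform default is.
    static String readUtf8(InputStream in) {
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The container-level alternative is usually a JVM option along the lines of `-Dfile.encoding=UTF-8`, though that is a guess at what the catalina.sh change looked like rather than a record of it.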
But that’s only half the story. I wanted to dig a little deeper and find out why InputStream seems to be charset agnostic, while InputStreamReader breaks the signature verification. It turns out InputStream is a handle on the raw bytes being read. I use XMLBeans for SAML processing, so passing an InputStream to EntitiesDescriptorDocument.Factory.parse() lets the underlying parser work out the encoding for itself, defaulting to UTF-8. There are various ways to work out the encoding, but in the absence of an encoding declaration it should be either UTF-8 or UTF-16. Which one can be inferred from the first few bytes of the document: a byte order mark if there is one, or how many bytes the opening “<” of the start tag occupies. Passing an InputStreamReader without a specific charset to EntitiesDescriptorDocument.Factory.parse() means the parser never sees the raw bytes; it gets characters the reader has already decoded using the platform default. As there are some exotic characters in the metadata, these get corrupted because the reader is using MacRoman, so the computed digest no longer matches the signed one and the signature fails to verify.
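To see why corrupted characters sink the digest, here is a small sketch under some loud assumptions: SHA-1 stands in for whatever digest the signature uses, ISO-8859-1 stands in for MacRoman, and re-encoding to UTF-8 is a simplification of canonicalisation (C14N output is UTF-8, but the real thing does a lot more):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

class DigestAfterDecoding {
    // Digest helper; SHA-1 is an assumption made for illustration only.
    static byte[] sha1(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-1").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Decode the signed bytes the way a reader would, then re-serialise as UTF-8,
    // which is roughly what the verifier ends up digesting.
    static byte[] digestAfterReading(byte[] signedBytes, Charset readerCharset) {
        String decoded = new String(signedBytes, readerCharset);
        return sha1(decoded.getBytes(StandardCharsets.UTF_8));
    }

    // True when the digest over the re-serialised text still matches the original bytes.
    static boolean digestSurvives(byte[] signedBytes, Charset readerCharset) {
        return Arrays.equals(sha1(signedBytes), digestAfterReading(signedBytes, readerCharset));
    }
}
```

Pure ASCII content survives either charset, which is why the problem only shows up once the metadata contains something more exotic.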
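I don’t get this problem on SAML responses coming from entities in the federation, as I Base64-decode the response and hand the parser a StringReader over the result.
Base64 encoding protects the payload over the ‘net, and decoding it gets you the original bytes that represent the SAML Response. Once those bytes have been turned into a String, StringReader gives you access to the characters directly. Strings in Java are UTF-16 internally, so the reader does no further charset decoding that could “corrupt” special characters in the response. The one place a charset still matters is the step that turns the decoded bytes into the String.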
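A sketch of that decoding path, with the caveat that it uses `java.util.Base64` (which postdates this post) and hypothetical class and method names. It assumes the response bytes are UTF-8 and says so explicitly when building the String:

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class SamlResponseDecoding {
    // Base64 text -> original bytes -> characters, naming the charset at the
    // one step where bytes become a String.
    static String decodeToString(String base64Payload) {
        byte[] raw = Base64.getDecoder().decode(base64Payload);
        return new String(raw, StandardCharsets.UTF_8);
    }

    // The reader works on characters already held in memory, so it does no
    // charset decoding of its own.
    static StringReader readerFor(String base64Payload) {
        return new StringReader(decodeToString(base64Payload));
    }
}
```

The reader returned by `readerFor` can then be passed to the parser in the same way as any other Reader.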