Description
While tracking down a bug with Unicode characters in our app, which uses Jetty, I discovered that if a client mis-encodes Unicode characters in URL request parameters with JavaScript's escape(), Jetty throws no error for the invalid encoding but produces badly broken strings.
Here's an example:
http://server/path?invalid=data‡here
(The preview tells me that JIRA is mangling the character, which is code point 8225 (U+2021), the double dagger, or %E2%80%A1 when percent-encoded as UTF-8.)
If the client generates this URL (say, for use with jQuery) but uses escape() instead of encodeURIComponent(), the resulting string is:
http://server/path?invalid=data%u2021here
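For comparison, here is a minimal Java sketch of the encoding the server expects, i.e. the UTF-8 percent-encoding that encodeURIComponent() would have produced for this value. The class and method names (EncodeExample, encodeParam) are made up for illustration and are not part of Jetty or our app.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class EncodeExample {
    // Percent-encodes a single query parameter value as UTF-8.
    // Hypothetical helper, named here only for illustration.
    static String encodeParam(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // \u2021 is the double dagger from the example above.
        System.out.println("http://server/path?invalid=" + encodeParam("data\u2021here"));
        // prints http://server/path?invalid=data%E2%80%A1here
    }
}
```

Note that URLEncoder targets form encoding, so it turns spaces into "+" rather than "%20"; for this value the result matches the %E2%80%A1 form quoted above.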
Java's URLDecoder handles this correctly with UTF-8:
URLDecoder.decode("http://server/path?invalid=data%u2021here", "UTF-8");
and throws the following:
Exception in thread "main" java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - For input string: "u2"
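A small self-contained check of that behaviour might look like the sketch below. DecodeCheck and isDecodable are hypothetical names; the only real API used is java.net.URLDecoder. The exact exception message can differ between JDK versions, so the sketch only relies on the exception type.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class DecodeCheck {
    // Returns true if the JDK decoder accepts the input as valid
    // percent-encoded UTF-8, false if it rejects a malformed escape.
    static boolean isDecodable(String s) {
        try {
            URLDecoder.decode(s, "UTF-8");
            return true;
        } catch (IllegalArgumentException e) {
            // Thrown for escapes such as "%u2" that are not two hex digits.
            return false;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isDecodable("data%E2%80%A1here")); // valid UTF-8 escapes
        System.out.println(isDecodable("data%u2021here"));    // escape()-style %uXXXX
    }
}
```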
org.mortbay.util.UrlEncoded.decodeTo(...,"UTF-8") also appears to catch the invalid escape, throwing a NumberFormatException at line 653.
However, decodeUtf8To() does not catch this error, and produces the following string:
'd', 'a', 't', 'a', '\u0002', '2', '0', '2', '1'
It appears that this can be corrected in one of the following ways:
- switching HttpURI.decodeQueryTo() to use only the safer version of the decoder
- adding checks to the internal UTF-8 decoder so that the two characters after a % must be hex digits ([0-9a-fA-F])
- having TypeUtil.convertHexDigit() throw an IllegalArgumentException (or similar) for invalid hex characters
For our purposes, we'll probably go with the first option and run a locally built patched copy for now.
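To make the third option concrete, here is a rough sketch of a strict hex-digit conversion that fails on anything outside [0-9a-fA-F]. This is not Jetty's actual TypeUtil.convertHexDigit(); StrictHex, hexDigit and decodeEscape are hypothetical names, and the real method's signature may differ.

```java
class StrictHex {
    // Converts one hex digit to its value, rejecting anything that is
    // not [0-9a-fA-F]. Hypothetical stand-in for a stricter convertHexDigit().
    static int hexDigit(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        throw new IllegalArgumentException("Invalid hex digit: '" + c + "'");
    }

    // Decodes the %XX escape whose '%' sits at index i and returns the byte value.
    static int decodeEscape(String s, int i) {
        if (s.charAt(i) != '%' || i + 2 >= s.length()) {
            throw new IllegalArgumentException("Truncated or missing escape at index " + i);
        }
        return (hexDigit(s.charAt(i + 1)) << 4) | hexDigit(s.charAt(i + 2));
    }

    public static void main(String[] args) {
        System.out.println(decodeEscape("%E2%80%A1", 0)); // first byte of the UTF-8 sequence
        try {
            decodeEscape("data%u2021here", 4);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With a check like this, the escape()-style %u2021 fails on the 'u' instead of being silently decoded into a control character.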
My further question is why the internal UTF-8 decoder exists at all. Is it for performance reasons? Legacy? To get around the UnsupportedCharsetException?
The reason we have our own URI and UTF-8 handling is part legacy, part performance, and part to avoid past bugs in the library.
I've followed your suggestion #3.
Thanks.