GeoServer
  1. GeoServer
  2. GEOS-254

WFS Dispatcher mangles character encoding

    Details

    • Type: Bug Bug
    • Status: Closed Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.3
    • Fix Version/s: 1.2.4
    • Component/s: WFS
    • Labels:
      None
    • Number of attachments :
      0

      Description

      From Artie Konin, his bug fix should be rolled in, and we should look into his suggestions:

      I was testing GeoServer's 1.2.1 WFS functionality and ended up with
      the fact that `like` and `equals` filters stops working when I post
      any windows-1251 encoded characters in the pattern.
      No need to say I set the right encoding both in XML declaration and
      in request's `Content-Type` header.

      After speding some time trying to get GeoServer logging system
      to differentiate between individual loggers, I realized that this is too
      complicated matter for me and started using old good
      System.out.println()'s in any suspicious part of source code.
      Eventually I found the offending section in WfsDispatcher's
      doPost() method.
      It uses FileWriter and then FileReader classes to save the incoming
      XML into temporary file and then reading it back. Examining Java API
      specification I learned that the above two classes are suitable when
      "the default character encoding and the default byte-buffer size
      are acceptable". In my case that assumptions about default character
      encodings lead to a sad results It looks like my non-US ASCII
      XML file content was mangled during either file writing or reading,
      or maybe even both Strangely indeed that this behaved similar both
      at Linux and my Windows machine, though at later one the default encoding
      is "windows-1251".
      So I returned to Java documentation and was lucky enough to
      circumvent the problem. The changes I made in doPost()'s code are
      below:

      Was:

      BufferedReader tempReader = new BufferedReader(request.getReader());

      // REVISIT: Should do more than sequence here
      // (In case we are running two GeoServers at once)
      // - Could we use response.getHandle() in the filename?
      // - ProcessID is traditional, I don't know how to find that in Java
      sequence++;
      temp = File.createTempFile("wfsdispatch" + sequence, "tmp");

      FileWriter out = new FileWriter(temp);
      int c;

      while ((c = tempReader.read()) != -1)

      { out.write(c); }

      tempReader.close();
      out.close();

      BufferedReader disReader = new BufferedReader(new FileReader(temp));
      BufferedReader requestReader = new BufferedReader(new FileReader(
      temp));

      Became:

      InputStream is = new BufferedInputStream(request.getInputStream());

      // REVISIT: Should do more than sequence here
      // (In case we are running two GeoServers at once)
      // - Could we use response.getHandle() in the filename?
      // - ProcessID is traditional, I don't know how to find that in Java
      sequence++;
      temp = File.createTempFile("wfsdispatch" + sequence, "tmp");

      BufferedOutputStream out = new BufferedOutputStream(
      new FileOutputStream(temp));

      int c;

      while (-1 != (c = is.read())) { out.write(c); }

      is.close();
      out.flush();
      out.close();

      String req_enc = guessRequestEncoding(request);

      BufferedReader disReader = new BufferedReader(
      new InputStreamReader(
      new FileInputStream(temp), req_enc));
      BufferedReader requestReader = new BufferedReader(
      new InputStreamReader(
      new FileInputStream(temp), req_enc));

      Where `guessRequestEncoding()` is a convenience method to resolve the
      character encoding of the XML markup contained within incoming
      request:

      protected String guessRequestEncoding(HttpServletRequest request) {

      String def_enc = "UTF-8";

      String enc = getXmlEncoding();

      if (null == enc) {

      enc = request.getHeader("Content-Type");

      if (null == enc)

      { enc = def_enc; }

      else {
      if (-1 == enc.indexOf("="))

      { enc = def_enc; }

      else

      { enc = enc.substring(enc.lastIndexOf("=") + 1).trim(); }

      }

      }

      return enc;
      }

      protected String getXmlEncoding() {

      try {

      StringWriter sw = new StringWriter(60);
      BufferedReader in = new BufferedReader(new FileReader(temp));

      int c;
      while ((-1 != (c = in.read())) && (0x3E != c))

      { sw.write(c); }

      in.close();

      Pattern p = Pattern.compile("encoding\\s*\\=\\s*\"([^\"]+)\"");
      Matcher m = p.matcher(sw.toString());
      if (m.find())

      { return m.toMatchResult().group(1); }

      else

      { return null; }

      } catch (IOException e)

      { return null; }

      }

      After above changes all is working perfectly both on Linux and
      Windows. I'm almost happy

      However I have notes considering that code:

      1. It looks like changing BufferedReader to BufferedInputStream at the
      beginning was not really necessary, as BufferedReader itself doen't
      perform any codepage conversions. But at the other hand streams are
      just more realiable from my point of view You get the data exactly
      as it comes.

      2. Both `guessRequestEncoding()` and `getXmlEncoding()` functions
      looks pretty ugly even to me But that is the best I can do with my
      current level of Java acquaintance :/

      3. Placing "Content-Type" header check before the reading the XML
      declaration should be faster, but I think that encoding specified at
      the above declaration is simply more adequate.

      4. Is there any less ugly way to extract encoding info from the
      incoming XML data? `getXmlEncoding()` looks like a pregnant mammoth

        Activity

        Hide
        Chris Holmes added a comment -

        Ok, I'm resolving this issue, but not closing it, because I can't seem to get a test to confirm it's working. I think I rolled in all the changes. I went ahead and changed BufferedReader to BufferedInputStream, I wrote the stuff awhile ago, and input stream is probably slightly better. I cleaned up guessRequestEncoding a bit, but then realized that my cleanup was really all minor style issues, the type that jalopy wont even pick up, so it was silly for me to edit it. The code may be a bit ugly, but I couldn't get it any prettier, at least not in 2 minutes, and spending more time than that is a bit silly, so it's fine. I very much agree with using the encoding in the xml header first. I did a slight change to have the Pattern as a static variable, so that it would only get compiled once, not each time it was called. But other than that everything looked great. And I found no better way to get the xml encoding for incoming xml. Seems like it should be there, somewhere in java, but I couldn't find it. Oh yeah, and there was one little thing that didn't compile right, matcher.toMatchResult() - it's a java 1.5 method, and GeoServer targets 1.4. I just did matcher.group(1), and that seemed to work, I got the encoding correctly. Let me know if there's a different way to do it.

        Show
        Chris Holmes added a comment - Ok, I'm resolving this issue, but not closing it, because I can't seem to get a test to confirm it's working. I think I rolled in all the changes. I went ahead and changed BufferedReader to BufferedInputStream, I wrote the stuff awhile ago, and input stream is probably slightly better. I cleaned up guessRequestEncoding a bit, but then realized that my cleanup was really all minor style issues, the type that jalopy wont even pick up, so it was silly for me to edit it. The code may be a bit ugly, but I couldn't get it any prettier, at least not in 2 minutes, and spending more time than that is a bit silly, so it's fine. I very much agree with using the encoding in the xml header first. I did a slight change to have the Pattern as a static variable, so that it would only get compiled once, not each time it was called. But other than that everything looked great. And I found no better way to get the xml encoding for incoming xml. Seems like it should be there, somewhere in java, but I couldn't find it. Oh yeah, and there was one little thing that didn't compile right, matcher.toMatchResult() - it's a java 1.5 method, and GeoServer targets 1.4. I just did matcher.group(1), and that seemed to work, I got the encoding correctly. Let me know if there's a different way to do it.
        Hide
        Artie Konin added a comment -

        Ok, Chris, I've tested your WfsDispatcher and it seems to be working just fine. So I guess this should be closed now.
        And sorry for this Tiger stuff in regexps, didn't noticed "since 1.5" remark Anyway my implementation of the encoding detection was more of an example

        Show
        Artie Konin added a comment - Ok, Chris, I've tested your WfsDispatcher and it seems to be working just fine. So I guess this should be closed now. And sorry for this Tiger stuff in regexps, didn't noticed "since 1.5" remark Anyway my implementation of the encoding detection was more of an example

          People

          • Assignee:
            Chris Holmes
            Reporter:
            Chris Holmes
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: