GuessEncoding / GUESSENC-7

7bit vs. UTF-8 forbidden characters detection

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Labels: None
    • Environment: all
    • Number of attachments: 0

      Description

      Hi all,

      thanks for the very useful guessencoding tool. One idea for an improvement: CharsetToolkit should offer a method that reports whether the test buffer contains only characters up to 0x7F, i.e. pure 7-bit data.

      Currently there is no way to distinguish between these two cases when enforce8bit=true:

      1. The buffer contains bytes that are invalid in UTF-8 -> returns the configured default charset
      2. The buffer contains no high-bit bytes, but 8-bit or UTF-8 characters may still follow later in the file -> returns the configured default charset, too

      In the first case I'd like to proceed with the default (8-bit) charset. In the second case I don't want to fall back to the default, because the buffer may simply have been too short. Here I'd like to give UTF-8 a chance for the whole stream.

      Switching to enforce8bit=false is no solution when the platform default charset is US-ASCII, because then:
      1. -> returns US-ASCII
      2. -> returns US-ASCII too, because enforce8bit=false

      Proposed use case for a new method would be:

      Charset used;
      if (CharsetToolkit.seems7BitOnly(f, bufferLength)) {
          used = Charset.forName("UTF-8");
      } else {
          used = CharsetToolkit.guessEncoding(f, bufferLength);
      }
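To make the request concrete, here is a minimal sketch of what such a check could look like. It is written as a standalone helper because `seems7BitOnly` does not exist in CharsetToolkit today; the class name `SevenBitCheck` and both method signatures are assumptions made purely for illustration.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of the guessencoding library.
final class SevenBitCheck {
    private SevenBitCheck() {}

    /** Returns true if none of the first {@code length} bytes has the high bit set. */
    static boolean seems7BitOnly(byte[] buffer, int length) {
        int limit = Math.min(length, buffer.length);
        for (int i = 0; i < limit; i++) {
            if ((buffer[i] & 0x80) != 0) {
                return false; // a byte above 0x7F was found
            }
        }
        return true; // also true for an empty buffer
    }

    /** Reads up to {@code bufferLength} bytes from the start of the file and checks them. */
    static boolean seems7BitOnly(File f, int bufferLength) throws IOException {
        byte[] buffer = new byte[bufferLength];
        int total = 0;
        try (InputStream in = new FileInputStream(f)) {
            int n;
            // read() may return fewer bytes than requested, so loop until full or EOF
            while (total < bufferLength
                    && (n = in.read(buffer, total, bufferLength - total)) > 0) {
                total += n;
            }
        }
        return seems7BitOnly(buffer, total);
    }
}
```

A true result only says that the sampled prefix is 7-bit clean; it makes no claim about the rest of the file, which is exactly why the caller would then prefer UTF-8 over the 8-bit default.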

      The downside is that the guessed Charset and a boolean validUTF8 status cannot be returned in a single call, so f has to be read twice.
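One possible way around the double read would be to classify the buffer once and return a three-valued result instead of a bare Charset. The sketch below is only an illustration of that idea: `BufferClassifier`, `Kind` and `classify` are hypothetical names, and the UTF-8 check uses the JDK's own `CharsetDecoder` rather than CharsetToolkit's internal logic.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the guessencoding library.
final class BufferClassifier {
    enum Kind { SEVEN_BIT_ONLY, VALID_UTF8, INVALID_UTF8 }

    private BufferClassifier() {}

    static Kind classify(byte[] buffer, int length) {
        int limit = Math.min(length, buffer.length);

        // Step 1: look for any byte with the high bit set.
        boolean highBitSeen = false;
        for (int i = 0; i < limit; i++) {
            if ((buffer[i] & 0x80) != 0) {
                highBitSeen = true;
                break;
            }
        }
        if (!highBitSeen) {
            return Kind.SEVEN_BIT_ONLY;
        }

        // Step 2: strict UTF-8 decoding. endOfInput=false means a multi-byte
        // sequence cut off at the end of the buffer is not reported as an error.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CoderResult result = decoder.decode(
                ByteBuffer.wrap(buffer, 0, limit),
                CharBuffer.allocate(limit), // UTF-8 never yields more chars than bytes
                false);
        return result.isError() ? Kind.INVALID_UTF8 : Kind.VALID_UTF8;
    }
}
```

With such a result, a caller could keep the 8-bit default for `INVALID_UTF8` and choose UTF-8 for both `SEVEN_BIT_ONLY` and `VALID_UTF8`, reading the buffer from the file only once.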

      Thank you
      Dominik

        Activity

        There are no comments yet on this issue.

          People

          • Assignee: Guillaume Laforge
          • Reporter: Dominik Krupp
