jira.codehaus.org

  • GuessEncoding
  • GUESSENC-7

7bit vs. UTF-8 forbidden characters detection

Details

  • Type: Improvement
  • Status: Open
  • Priority: Minor
  • Resolution: Unresolved
  • Labels:
    None
  • Environment:
    all

Description

Hi all,

Thanks for the very useful GuessEncoding tool. One idea for an improvement: CharsetToolkit should have a method that indicates whether the test buffer contains only characters up to 0x7F.

At the moment, with enforce8bit=true, you cannot distinguish between these two cases:

1. The buffer contains invalid UTF-8 bytes -> returns the configured default charset
2. The buffer contains no high-bit bytes, but the rest of the file may still contain 8-bit or UTF-8 characters -> also returns the configured default charset

In the first case I would like to proceed with the default (8-bit) charset. In the second case I do not want to fall back to the default, because the buffer may simply have been too short; there I would like to give UTF-8 a chance for the whole stream.

Switching to enforce8bit=false is no solution when the platform default charset is US-ASCII, because then:
1. -> returns US-ASCII
2. -> also returns US-ASCII, because enforce8bit=false

The proposed use case for a new method would be:

    Charset used;
    if (CharsetToolkit.seems7BitOnly(f, bufferLength)) {
        // Only 7-bit bytes seen so far: give UTF-8 a chance for the whole stream.
        used = Charset.forName("UTF-8");
    } else {
        used = CharsetToolkit.guessEncoding(f, bufferLength);
    }
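
A minimal sketch of what such a check could look like, assuming it only needs to scan the first bufferLength bytes of the file. The class name SevenBitCheck and the byte[] overload are hypothetical and only for illustration; seems7BitOnly is the proposed method and does not exist in CharsetToolkit today.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of GuessEncoding.
final class SevenBitCheck {

    /** Returns true if the first `length` bytes of `buffer` are all in 0x00..0x7F. */
    static boolean seems7BitOnly(byte[] buffer, int length) {
        int limit = Math.min(length, buffer.length);
        for (int i = 0; i < limit; i++) {
            // Java bytes are signed, so a set high bit shows up as a negative value.
            if (buffer[i] < 0) {
                return false;
            }
        }
        return true;
    }

    /** Reads at most `bufferLength` bytes from `f` and checks them. */
    static boolean seems7BitOnly(File f, int bufferLength) throws IOException {
        byte[] buffer = new byte[bufferLength];
        int total = 0;
        try (InputStream in = new FileInputStream(f)) {
            int n;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (total < bufferLength
                    && (n = in.read(buffer, total, bufferLength - total)) > 0) {
                total += n;
            }
        }
        return seems7BitOnly(buffer, total);
    }
}
```

An empty file or empty buffer counts as 7-bit only in this sketch, which matches the intent of not ruling out UTF-8 too early.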

The downside is that we cannot return the guessed Charset and a boolean validUTF8 status in one call, so f has to be read twice.
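
One way around the double read might be to classify the buffer once and return a three-way result instead of a Charset plus a boolean. The following is only a sketch under that assumption: BufferProbe, Kind and classify are hypothetical names, and the UTF-8 check uses the JDK's own CharsetDecoder rather than the detection logic inside CharsetToolkit.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of GuessEncoding.
final class BufferProbe {

    enum Kind { SEVEN_BIT_ONLY, VALID_UTF8, INVALID_UTF8 }

    /** Classifies the first `length` bytes of `buffer` in a single call. */
    static Kind classify(byte[] buffer, int length) {
        int limit = Math.min(length, buffer.length);

        boolean highBitSeen = false;
        for (int i = 0; i < limit; i++) {
            if (buffer[i] < 0) { // signed byte: negative means high bit set
                highBitSeen = true;
                break;
            }
        }
        if (!highBitSeen) {
            return Kind.SEVEN_BIT_ONLY;
        }

        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        // endOfInput=false: leftover bytes at the end of the buffer are not
        // treated as malformed, since the file may continue past the buffer.
        CoderResult result = decoder.decode(
                ByteBuffer.wrap(buffer, 0, limit), CharBuffer.allocate(limit), false);
        return result.isError() ? Kind.INVALID_UTF8 : Kind.VALID_UTF8;
    }
}
```

A caller could then map SEVEN_BIT_ONLY and VALID_UTF8 to UTF-8 and INVALID_UTF8 to the configured default charset, which covers both cases described above with one read of the buffer.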

Thank you
Dominik

Activity

There are no comments yet on this issue.

People

  • Assignee:
    Guillaume Laforge
  • Reporter:
    Dominik Krupp

Dates

  • Created:
    04/Feb/11 6:01 AM
  • Updated:
    04/Feb/11 6:01 AM