Details
-
Type:
Improvement
-
Status:
Open
-
Priority:
Minor
-
Resolution: Unresolved
-
Labels: None
-
Environment: all
-
Number of attachments :
Description
Hi all,
thanks for the very useful guessencoding tool. One idea for an improvement: There should be a method in CharsetToolkit that indicates that the test buffer contains only characters up to 0x7F.
Currently you can't distinguish between these two cases when enforce8bit=true:
1. The buffer contains invalid UTF-8 bytes -> returns the configured default charset
2. The buffer contains no bytes with the high bit set, but 8-bit or UTF-8 characters may follow later in the file -> also returns the configured default charset
In the first case I'd like to proceed with the default (8-bit) charset. In the second case I don't want to fall back to the default, because the buffer may simply have been too short. Here I'd like to give UTF-8 a chance for the whole stream.
Switching to enforce8bit=false is no solution if the platform default charset is US-ASCII, because then:
1. -> returns US-ASCII
2. -> also returns US-ASCII, because enforce8bit=false
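To make the request concrete, here is a minimal sketch of what such a check might look like. It is not part of CharsetToolkit: the class name `SevenBitCheck`, the use of `java.nio.file.Path` instead of a `File`, and the method bodies are all assumptions for illustration. It needs Java 11 or later for `InputStream.readNBytes(int)`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not an existing CharsetToolkit API.
final class SevenBitCheck {

    /** Returns true if none of the first {@code length} bytes has the high bit set. */
    static boolean seems7BitOnly(byte[] buffer, int length) {
        int n = Math.min(length, buffer.length);
        for (int i = 0; i < n; i++) {
            if ((buffer[i] & 0x80) != 0) {
                return false; // a byte above 0x7F was found
            }
        }
        return true;
    }

    /** Reads up to {@code bufferLength} bytes from the start of the file and checks them. */
    static boolean seems7BitOnly(Path file, int bufferLength) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = in.readNBytes(bufferLength); // may return fewer bytes for short files
            return seems7BitOnly(buffer, buffer.length);
        }
    }
}
```

A `true` result only says that the inspected prefix is pure 7-bit; it says nothing about the rest of the file, which is exactly why the caller would want to keep UTF-8 as a candidate in that case.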
A proposed use of the new method would be:
Charset used;
if (CharsetToolkit.seems7BitOnly(f, bufferLength))
{ /* body missing in the original report; presumably keep UTF-8 as a candidate for the whole stream */ }
else
{ used = CharsetToolkit.guessEncoding(f, bufferLength); }
The downside is that we cannot return the guessed Charset and a boolean validUTF8 status in one call, so f has to be read twice.
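One possible way around reading the file twice would be a single call that classifies the buffer into three outcomes instead of returning only a charset. The sketch below is an assumption about how that could look, not CharsetToolkit's actual design: `BufferProbe`, `Kind` and `classify` are hypothetical names, and the UTF-8 check uses the JDK's own `CharsetDecoder` rather than the toolkit's detection logic.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical single-pass classification of an already-read buffer.
final class BufferProbe {

    enum Kind { SEVEN_BIT_ONLY, VALID_UTF8, INVALID_UTF8 }

    static Kind classify(byte[] buffer) {
        boolean highBitSeen = false;
        for (byte b : buffer) {
            if ((b & 0x80) != 0) {
                highBitSeen = true;
                break;
            }
        }
        if (!highBitSeen) {
            return Kind.SEVEN_BIT_ONLY; // case 2: nothing decided yet, buffer may be too short
        }

        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(buffer);
        // UTF-8 never decodes to more chars than it has bytes, so this cannot overflow.
        CharBuffer out = CharBuffer.allocate(buffer.length);
        // endOfInput = false: a multi-byte sequence cut off at the end of the
        // buffer is treated as incomplete, not as malformed.
        CoderResult result = decoder.decode(in, out, false);
        return result.isError() ? Kind.INVALID_UTF8 : Kind.VALID_UTF8; // case 1 vs. real UTF-8
    }
}
```

With a result like this, the caller could fall back to the default 8-bit charset only for `INVALID_UTF8` and keep UTF-8 for the other two outcomes, using the same buffer for both decisions.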
Thank you
Dominik