
Key: BOO-708
Type: Improvement
Status: Open
Priority: Minor
Assignee: Unassigned
Reporter: Snaury
Votes: 0
Watchers: 1
Boo

Support for PEP-263 like encoding specifications

Created: 03/Apr/06 04:14 AM   Updated: 26/May/06 02:15 AM
Component/s: Parser, Compiler
Affects Version/s: 0.7.6
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
1. df-encoding.diff (8 kb)
2. encoding.diff (7 kb)
3. encoding2.diff (8 kb)
4. encoding3.diff (15 kb)
5. encoding4.diff (11 kb)
6. FileEncodingAutoDetection.cs (5 kb)
7. FileEncodingAutoDetection2.cs (5 kb)

Environment: boo-0.7.6.2160


Description
Currently Boo always opens source files as UTF-8, which is not always a good thing (at least for me). I'd like to propose a patch that allows PEP-263-style encoding specifications in source files. Unfortunately, I couldn't do it the way I wanted because of a strange bug in MS.NET: StreamReader.CurrentEncoding is always equal to the encoding passed as a parameter, even if a BOM was detected and the reader is in fact using UTF-8 or Unicode encoding. The only drawback is that every file gets opened twice (sadly, I know of no way to rewind a StreamReader, and I don't know whether loading the whole file into memory to avoid this is a good idea). It might also be good (or maybe not) to throw an exception if the encoding specified in the source file cannot be found. I didn't want to break compatibility, so if no encoding is specified, it defaults to UTF-8 as before. However, just in case, the first open is done using ASCII encoding if there is no BOM; I don't know whether that's good or bad.
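For readers unfamiliar with PEP 263, here is a minimal sketch of how such a declaration can be detected. It is written in Python rather than in the C# of the attached patches, and it is not taken from them. The regular expression is the one given in PEP 263; the function name `declared_encoding` and the choice to fall back silently on an unknown encoding name are assumptions made for illustration.

```python
import codecs
import re

# Pattern from PEP 263; it is only honoured on the first or second line.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the first two lines, or the default."""
    for line in data.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            name = match.group(1).decode("ascii")
            try:
                # Normalise aliases to the codec's canonical name.
                return codecs.lookup(name).name
            except LookupError:
                # Unknown name: fall back instead of raising (an assumption;
                # the description leaves this question open).
                return default
    return default
```

Raising instead of returning `default` in the `LookupError` branch would give the stricter behaviour the description considers.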

Comments
Snaury - 03/Apr/06 05:49 AM
Well, after reading the documentation some more, I found that there is a way to rewind a StreamReader. I also found that StreamReader does not automatically close the underlying stream, which lets me create a new instance around BaseStream with the new encoding without needing to Close the file. It also turned out not to be a .NET bug: the encoding is detected only after the first read, which I solved with Peek(). I hope this patch is better.
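The same idea (look at the start of the stream, rewind, then wrap the same stream in a reader with the chosen encoding) can be sketched in Python. This is only an analogue of the .NET approach described above, not the patch itself; the helper name `open_text` and the BOM table are assumptions for illustration.

```python
import codecs
import io

# Byte-order marks, longest first: the UTF-32 LE mark begins with the
# UTF-16 LE mark, so it has to be tested before it.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def open_text(stream, fallback="utf-8"):
    """Wrap a seekable binary stream in a text reader, honouring any BOM."""
    head = stream.read(4)   # enough bytes for the longest BOM
    stream.seek(0)          # rewind so the reader starts at the beginning
    encoding = fallback
    for bom, name in BOMS:
        if head.startswith(bom):
            encoding = name  # these codecs consume the BOM themselves
            break
    return io.TextIOWrapper(stream, encoding=encoding)
```

Unlike the StreamReader behaviour reported above, `io.TextIOWrapper` does close the wrapped stream when it is closed; calling `detach()` on the wrapper keeps the underlying stream open.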

Daniel Grunwald - 03/Apr/06 07:53 AM
See my comment on BOO-706 for auto-detecting the difference between UTF-8 without BOM and ISO-8859-1 (Latin 1).

Snaury - 03/Apr/06 08:42 AM
Hi Daniel,

I don't think such autodetection is a good thing. Imagine sources in a Windows ANSI encoding floating around the web without the encoding specified. For example, my PC has a different default encoding than most European PCs, so if I receive a source file with accented characters and compile it with autodetection as in BOO-706, I'll end up with my native characters instead of the accented ones. Specifying the encoding within the file is much better. Although I can't find it documented anywhere, StreamReader doesn't close the underlying stream upon finalization, so my second patch works pretty well. (If anyone is ever concerned about it, I have a small RestrictedStream wrapper around Stream that prevents closing of the underlying Stream if needed, but I don't think it's really necessary.) Besides, reading a whole file byte by byte is not the best idea, although I doubt there would be sources larger than 500 KB...

Take a look at my second patch; I think it would serve better.
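The problem described above is easy to reproduce. In this small Python illustration the two code pages (Windows-1252 for a Western European machine, Windows-1251 for a Cyrillic one) are assumptions picked for the example; the comment itself does not name specific code pages.

```python
# Bytes as a Western European editor would save them (Windows-1252).
raw = "caf\u00e9".encode("cp1252")   # the accented e becomes the single byte 0xE9

# Decoded with the sender's code page, the text survives.
as_western = raw.decode("cp1252")

# Decoded with a Cyrillic default code page, the same byte turns into a
# Cyrillic letter, and no error is raised to signal the mix-up.
as_cyrillic = raw.decode("cp1251")
```

Both decodes succeed, so falling back to the machine's default code page cannot tell which reading was intended.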


Daniel Grunwald - 03/Apr/06 08:49 AM
There are lots of files in Latin 1 without any encoding specified (this is the default in most text editors), and there are also lots of UTF-8 files without BOM (default for .NET StreamWriter). The compiler should be able to detect the difference between those two.
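One common way to tell these two cases apart is to attempt a strict UTF-8 decode and fall back to Latin 1 only when the bytes are not valid UTF-8. The Python sketch below shows that general technique; it is not the algorithm from BOO-706 or from the attached patches, and the function name `guess_decode` is hypothetical.

```python
def guess_decode(data: bytes):
    """Return (text, encoding), preferring UTF-8 and falling back to Latin 1."""
    try:
        # Strict decoding rejects byte sequences that are not valid UTF-8,
        # which is the case for most Latin 1 text containing accented letters.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin 1 maps every byte to a character, so this never fails.
        return data.decode("latin-1"), "latin-1"
```

Pure ASCII input is valid in both encodings and is reported as UTF-8, which is harmless because the decoded text is identical either way.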

Snaury - 03/Apr/06 10:41 AM
Hi Daniel,

I've uploaded a new patch (btw, it's against 0.7.6.2160, but it shouldn't matter much).

Now it does both: autodetection of UTF-8 similar to BOO-706, and the PEP-263-style encoding specifier. =_=


Snaury - 03/Apr/06 06:01 PM
I hope I didn't annoy anyone with so many frequent updates. Anyway, here's (I think) an improved version of my patch (encoding4.diff), as well as the class that does the actual job, as a standalone file (FileEncodingAutoDetection.cs). Currently I'm just inlining it into the files that need it, but when (and if) it goes into trunk you'll probably want to make it a standalone file, somewhere like under Boo.Lang.Compiler/IO...

Snaury - 04/Apr/06 10:08 AM
class FileEncodingAutoDetection: cleanup, error checking.

Snaury - 26/May/06 02:15 AM
df-encoding.diff: The latest patch I'm using.