groovy
  1. groovy
  2. GROOVY-2749

eachLine and splitEachLine work without encoding

    Details

    • Type: Bug Bug
    • Status: Closed Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.5
    • Fix Version/s: 1.5.6, 1.6-beta-1
    • Component/s: None
    • Labels:
      None
    • Number of attachments :
      0

      Description

      several methods where added named eachLine and splitEachLine. But on InputStream no encoding is given, thus a InputStreamReader created there uses the default encoding, which might be wrong in many case. I personally would add the methods with the encoding variant and remove the ones without encoding for InputStream and URL. not using an encoding makes sense for String, because the encoding is already done then, in fact using an ecoding there could be seen as wrong. But not using an encoding on InputStream is wrong too. It is not portable and will kick you later. This means:

      eachLine:

      InputStream.eachLine(Closure) --> should be removed
      InputStream.eachLine(String encoding, Closure) --> should be added
      URL.eachLine(Closure) --> should be removed
      URL.eachLine(String encoding, Closure) --> could be added, but I would prefer:
      URL.toInputStream() ---> I prefer that, because then eachLine from InputStream can be used

      splitEachLine:

      InputStream.splitEachLine(String sep, Closure closure) --> should be removed
      InputStream.splitEachLine(String encoding, String sep, Closure closure) --> should be added
      URL.splitEachLine(String sep, Closure closure) --> should be removed
      URL.splitEachLine(String encoding, String sep, Closure closure) --> could be added, again I prefer URL.toInputStream().. or URL.input

      then all these each method should not return void, instead they should return what the closure returns the last time it was called. This allows writing a collecting closure which has to be added.. something like this

      class Collector extends Closure {
        def closure
        def values = []
        def doCall(Object[] args) {
          if (closure!=null) closure(*args)
          values << args
          return values
        }
      }
      

        Activity

        Hide
        Paul King added a comment -

        Changes made as requested. It does beg the question though, we sometimes are going to return the last closure result. Other times we return the self object to allow chaining. It would be nice to have a consistent rule for this.

        Show
        Paul King added a comment - Changes made as requested. It does beg the question though, we sometimes are going to return the last closure result. Other times we return the self object to allow chaining. It would be nice to have a consistent rule for this.
        Hide
        Paul King added a comment -

        I'll add back in the non-encoding versions this evening. I won't default them to UTF-8, I'll just put them back the way they were.

        Show
        Paul King added a comment - I'll add back in the non-encoding versions this evening. I won't default them to UTF-8, I'll just put them back the way they were.
        Hide
        Russel Winder added a comment -

        This is/was a breaking change and should not have appeared in 1.5.5.

        Show
        Russel Winder added a comment - This is/was a breaking change and should not have appeared in 1.5.5.
        Hide
        Paul King added a comment -

        Yes, it was an oversight that they were removed. I made some additions which should have caused no problems then at the last minute, we refactored some of the additions and two existing methods were caught up in that by accident. Jochen probably didn't realise that the methods were existing when he asked me to remove/refactor them (I moved all of the eachLine methods together in the source file which no doubt made that harder to detect). I should have realised this too when I did the change. The last minute nature of this change didn't help.

        I don't see this as a major catastrophe. Just one of those things that happen sometimes. My take on this is that we really do need to have formal release candidates even on minor releases (though they could be over much shorter durations than we do for major releases).

        Show
        Paul King added a comment - Yes, it was an oversight that they were removed. I made some additions which should have caused no problems then at the last minute, we refactored some of the additions and two existing methods were caught up in that by accident. Jochen probably didn't realise that the methods were existing when he asked me to remove/refactor them (I moved all of the eachLine methods together in the source file which no doubt made that harder to detect). I should have realised this too when I did the change. The last minute nature of this change didn't help. I don't see this as a major catastrophe. Just one of those things that happen sometimes. My take on this is that we really do need to have formal release candidates even on minor releases (though they could be over much shorter durations than we do for major releases).
        Hide
        Paul King added a comment -

        The other lesson for me is that we need to up the coverage of our tests. If this change had caused a test to fail, it would have been spotted immediately and not have slipped through undetected.

        Show
        Paul King added a comment - The other lesson for me is that we need to up the coverage of our tests. If this change had caused a test to fail, it would have been spotted immediately and not have slipped through undetected.
        Hide
        Russel Winder added a comment -

        Agreed on all points

        So I guess the question is whether:

        def process = command.execute ( )
        process.in.eachLine ( doSomething )
        

        should work or not.

        Show
        Russel Winder added a comment - Agreed on all points So I guess the question is whether: def process = command.execute ( ) process.in.eachLine ( doSomething ) should work or not.
        Hide
        Guillaume Laforge added a comment -

        Despite the discussions on why System.in.readLine() was deprecated a while ago, perhaps we could also revisit this.
        By default, System.in.readLine() could be restored and use the default system encoding, and we could add variants which also explicitely take an encoding in parameter.
        System.in.readLine() is pretty handy.

        Show
        Guillaume Laforge added a comment - Despite the discussions on why System.in.readLine() was deprecated a while ago, perhaps we could also revisit this. By default, System.in.readLine() could be restored and use the default system encoding, and we could add variants which also explicitely take an encoding in parameter. System.in.readLine() is pretty handy.
        Hide
        Paul King added a comment -

        OK, added back in the two accidentally refactored-out-of-existence methods.

        For good measure, I also added InputStream.splitEachLine(String sep, Closure closure) back in as it was now a hole in the set of available methods. Caveat Emptor as for all the other methods without the charset.

        I also added in these as they can help deal with the issues around StreamDecoder usage with InputStreamReader:
        withReader(URL url, String charset, Closure closure)
        withReader(InputStream in, String charset, Closure closure)
        newReader(InputStream self, String charset)

        I am not sure how to handle System.in.readLine(). In hindsight, if I was starting from scratch with such methods in Groovy, I would probably get rid of all line processing methods on InputStreams and URLs and just have ways to get Readers from them and have all line processing in terms of readers.

        Show
        Paul King added a comment - OK, added back in the two accidentally refactored-out-of-existence methods. For good measure, I also added InputStream.splitEachLine(String sep, Closure closure) back in as it was now a hole in the set of available methods. Caveat Emptor as for all the other methods without the charset. I also added in these as they can help deal with the issues around StreamDecoder usage with InputStreamReader: withReader(URL url, String charset, Closure closure) withReader(InputStream in, String charset, Closure closure) newReader(InputStream self, String charset) I am not sure how to handle System.in.readLine() . In hindsight, if I was starting from scratch with such methods in Groovy, I would probably get rid of all line processing methods on InputStreams and URLs and just have ways to get Readers from them and have all line processing in terms of readers.
        Hide
        Alexander Veit added a comment -

        File and URL have an eachLine method. So I wonder why File and URL do not have an encoding property.

        This would be quite natural, in my opinion, since Groovy extends the java.io.File semantics to also cover the file contents.

        For File this could probably be done by implementing groovy.io.File extends java.io.File.

        Show
        Alexander Veit added a comment - File and URL have an eachLine method. So I wonder why File and URL do not have an encoding property. This would be quite natural, in my opinion, since Groovy extends the java.io.File semantics to also cover the file contents. For File this could probably be done by implementing groovy.io.File extends java.io.File.
        Hide
        Paul King added a comment -

        I guess it might be possible but eachLine etc. sets up a reader under the covers and it is that reader that has an encoding. You could just as easily do an eachByte on a File or URL in which case no encoding would be involved.

        Show
        Paul King added a comment - I guess it might be possible but eachLine etc. sets up a reader under the covers and it is that reader that has an encoding. You could just as easily do an eachByte on a File or URL in which case no encoding would be involved.
        Hide
        Alexander Veit added a comment -

        Yes, the reader is being constructed from the File via the CharsetToolkit.

        If I understand the code correctly, CharsetToolkit tries to determine the encoding from byte order marks or the file character data. This is a good approach for text/plain files even though it may fail for non-unicode charsets. If Groovy would provide File with a charset property (by deriving an own File class from java.io.File or by a generic property extension mechanism) this, if present, could override the charset choosen by CharsetToolkit.

        I do not know if the benefits of may proposal would outweigh the drawbacks of increased complexity in the Groovy runtime. So I think it needs some more discussion.

        Show
        Alexander Veit added a comment - Yes, the reader is being constructed from the File via the CharsetToolkit. If I understand the code correctly, CharsetToolkit tries to determine the encoding from byte order marks or the file character data. This is a good approach for text/plain files even though it may fail for non-unicode charsets. If Groovy would provide File with a charset property (by deriving an own File class from java.io.File or by a generic property extension mechanism) this, if present, could override the charset choosen by CharsetToolkit. I do not know if the benefits of may proposal would outweigh the drawbacks of increased complexity in the Groovy runtime. So I think it needs some more discussion.
        Hide
        blackdrag blackdrag added a comment -

        How would such a Groovy specific file class be better than the encoding property java uses to set the file default encoding?

        Show
        blackdrag blackdrag added a comment - How would such a Groovy specific file class be better than the encoding property java uses to set the file default encoding?
        Hide
        Alexander Veit added a comment -

        For the same reason as InputStreamReader has constructors with charsets.

        The default file encoding is a per-VM property whereas the files a VM processes may have many different encodings.

        Show
        Alexander Veit added a comment - For the same reason as InputStreamReader has constructors with charsets. The default file encoding is a per-VM property whereas the files a VM processes may have many different encodings.
        Hide
        blackdrag blackdrag added a comment -

        I guess I did not explain myself very good... there are different concepts of files and in Java a file is more like a reference to something that might exist on the file system. For the contents of the file you have streams and reader. Therefor the encoding is something the reader and/or stream have to handle, not the file itself. There is a property to support a default encoding, so he lazy programmer does not have to use the encoding all other the place. I must say, I personally prefer the explicit usage of the encoding, because I am often on systems using multiple encodings and relaying on the default encoding has often been proofed to be a bad idea. To say it more strict, I generally dislike the usage of a default encoding in any kind... but well, it is not really my job to educate other people here. Only, when adding a new concept to Groovy I would of course prefer one that has not this kind of thinking error built in. now your "files" differ from the Java concept. They are a concept on their own, and it seems you want to use them instead of normal files. But in this case it is nothing we have to add to java.io.File. No, not right... you want to use it in places where File is used. Right? But we can not change the bytecode for for example InputStreamReader, that if it uses a FileInputStream, that uses "our" files, that not the default encoding for Java is used, but the encoding you did set on the file. So this would only work well in the Groovy sphere.. and in that case.. why use java.io.File directly? It would be so easy to write a small File handling API for reading ext files on its own .

        So to answer your question why File and URL have no encoding property: Because they are references to things that may or may not exist, and that may or not be text data in a certain encoding. They could be binary data as well. And not to forget, that encoded text might be read as binary data too!

        From the technical side we have the problem, that we can not attach foreign per instance properties. We can also not change the bytecode of libs, just to support an idea of files that conflicts with the original idea of files in Java.

        Show
        blackdrag blackdrag added a comment - I guess I did not explain myself very good... there are different concepts of files and in Java a file is more like a reference to something that might exist on the file system. For the contents of the file you have streams and reader. Therefor the encoding is something the reader and/or stream have to handle, not the file itself. There is a property to support a default encoding, so he lazy programmer does not have to use the encoding all other the place. I must say, I personally prefer the explicit usage of the encoding, because I am often on systems using multiple encodings and relaying on the default encoding has often been proofed to be a bad idea. To say it more strict, I generally dislike the usage of a default encoding in any kind... but well, it is not really my job to educate other people here. Only, when adding a new concept to Groovy I would of course prefer one that has not this kind of thinking error built in. now your "files" differ from the Java concept. They are a concept on their own, and it seems you want to use them instead of normal files. But in this case it is nothing we have to add to java.io.File. No, not right... you want to use it in places where File is used. Right? But we can not change the bytecode for for example InputStreamReader, that if it uses a FileInputStream, that uses "our" files, that not the default encoding for Java is used, but the encoding you did set on the file. So this would only work well in the Groovy sphere.. and in that case.. why use java.io.File directly? It would be so easy to write a small File handling API for reading ext files on its own . So to answer your question why File and URL have no encoding property: Because they are references to things that may or may not exist, and that may or not be text data in a certain encoding. They could be binary data as well. And not to forget, that encoded text might be read as binary data too! From the technical side we have the problem, that we can not attach foreign per instance properties. We can also not change the bytecode of libs, just to support an idea of files that conflicts with the original idea of files in Java.
        Hide
        Alexander Veit added a comment -

        I fully agree with what you say about Java file objects as references (names) and the explicit usage of encodings with streams.

        As said above, methods like File.eachLine, File.append, and so on, seemingly extend the file semantics to also cover the contents of the file. This was the reason to think about file properties such as charset that might complete the notion of file = name + metadata + content.

        Technically a extension to java.io.File (and other classes) could be achieved e.g. by providing a class that extends java.io.File, or through a more general approach, by aggregating per instance properties to arbitrary objects. The latter could be implemented without any changes to bytecode, e.g. using weak hash maps with the object instances as keys. However, as already said, I'm not quite sure if one would like to have such constructs in the implementation of a programming language since they increase complexity and lead to additional synchronization overhead.

        Show
        Alexander Veit added a comment - I fully agree with what you say about Java file objects as references (names) and the explicit usage of encodings with streams. As said above, methods like File.eachLine, File.append, and so on, seemingly extend the file semantics to also cover the contents of the file. This was the reason to think about file properties such as charset that might complete the notion of file = name + metadata + content. Technically a extension to java.io.File (and other classes) could be achieved e.g. by providing a class that extends java.io.File, or through a more general approach, by aggregating per instance properties to arbitrary objects. The latter could be implemented without any changes to bytecode, e.g. using weak hash maps with the object instances as keys. However, as already said, I'm not quite sure if one would like to have such constructs in the implementation of a programming language since they increase complexity and lead to additional synchronization overhead.

          People

          • Assignee:
            Paul King
            Reporter:
            blackdrag blackdrag
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: