JRuby (please use github issues at http://bugs.jruby.org)
  JRUBY-6685

Encoding problem when using JRuby 1.7.0.preview1 + Nokogiri under Windows

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: JRuby 1.7.0.pre1
    • Fix Version/s: JRuby 1.7.0.pre2
    • Component/s: Encoding
    • Labels:
    • Environment:
      Windows XP Pro SP3 (running in a VMware 4 VM on OS X)
      Java(TM) SE Runtime Environment (build 1.6.0_30-b12)
      Java HotSpot(TM) Client VM (build 20.5-b03, mixed mode, sharing)
    • Number of attachments: 0

      Description

      Consider the following test snippet:

      # coding: utf-8
      
      require 'nokogiri'
      
      xml = "<foo>bar</foo>"
      puts "XML string encoding: #{xml.encoding}"
      
      doc = Nokogiri::XML::Document.parse(xml)
      puts "Parsed root name encoding: #{doc.root.name.encoding}"
      

      Running this using JRuby 1.6.7.2 (in 1.9 mode) gives the expected:

      XML string encoding: UTF-8
      Parsed root name encoding: UTF-8
      

      Running this using JRuby 1.7.0.preview1 gives:

      XML string encoding: UTF-8
      Parsed root name encoding: Windows-1252
      

      Someone mentioned in the IRC channel that it might be related to JRUBY-6679 but I'm not really sure, since it clearly works for me using JRuby 1.6.7.2.

        Activity

        Thomas E Enebo added a comment -

        I think this is the same problem as JRUBY-6694. Doing a 'new String(...)' uses file.encoding (or the locale charset), so it takes a UTF-8 string and tries to make a Windows-1252 string. Windows-1252 cannot represent every character that UTF-8 can, so you sometimes see crud, and you see the wrong encoding. Yoko, I believe this is something you need to fix in Nokogiri itself. This is what I did in a stringFor() method:

            ByteList bytes = new ByteList(value.getBytes(RubyEncoding.UTF8), UTF8Encoding.INSTANCE);
            RubyString string = RubyString.newString(runtime, bytes);
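
The lossy conversion described above can be illustrated in plain Ruby, without JRuby or Nokogiri. This is only a sketch: the helper name `to_windows_1252` is hypothetical, and the `undef: :replace` option is an assumption used to mimic a conversion that substitutes "?" for characters the target charset cannot represent.

```ruby
# Sketch: transcoding UTF-8 text into Windows-1252 can lose information.
# to_windows_1252 is a hypothetical helper, not part of JRuby or Nokogiri.
def to_windows_1252(text)
  # Characters with no Windows-1252 equivalent are replaced by "?".
  text.encode("Windows-1252", undef: :replace, replace: "?")
end

ascii    = to_windows_1252("foo")            # ASCII only: bytes unchanged
accented = to_windows_1252("caf\u00E9")      # e-acute exists in Windows-1252
japanese = to_windows_1252("\u65E5\u672C")   # no Windows-1252 equivalent

puts ascii.encoding          # the string is now tagged Windows-1252
puts accented.bytes.inspect  # e-acute becomes the single byte 0xE9 (233)
puts japanese                # both characters were replaced by "?"
```

An ASCII-only name such as `foo` survives byte for byte, which is why the report shows only the wrong encoding tag and no visible garbage.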
        
        Yoko Harada added a comment -

        Ah, I see. 1.9 mode needs two steps to convert a Java String to a RubyString.

        Currently, Nokogiri does

        RubyString string = RubyString.newString(runtime, str);
        

        only. It seems to work only for 1.8 mode.

        The code above did fix the problem. Thanks, Tom.

        I pushed the change in rev. c3953c2 on Nokogiri master.

        Luc, if you have a chance, try Nokogiri master.
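
As a rough Ruby-level analogy of the two-step conversion discussed above (obtain the UTF-8 bytes, then tag them explicitly as UTF-8), here is a minimal sketch. The helper `utf8_string_from` is hypothetical and is not the Nokogiri change itself; it only shows the idea of pairing bytes with an explicit encoding instead of relying on a platform default.

```ruby
# Hypothetical helper: build a Ruby string from raw UTF-8 bytes and tag it.
def utf8_string_from(bytes)
  raw = bytes.pack("C*")       # step 1: raw bytes, tagged ASCII-8BIT (binary)
  raw.force_encoding("UTF-8")  # step 2: tag the same bytes as UTF-8, no transcoding
end

utf8_bytes = [0x63, 0x61, 0x66, 0xC3, 0xA9]  # "cafe" with e-acute, encoded as UTF-8
name = utf8_string_from(utf8_bytes)

puts name.encoding         # UTF-8
puts name.valid_encoding?  # true
puts name == "caf\u00E9"   # true
```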

        Thomas E Enebo added a comment -

        Yoko, you can also try on any OS by doing:

        jruby -J-Dfile.coding=Windows-1252 my_file.rb
        

        I am going to resolve this, since it is an issue with 1.9 string construction in Nokogiri and JRuby is closing out blockers (this will be fixed in the next Nokogiri release).

        Yoko Harada added a comment -

        I'm pretty sure it should be "file.encoding"

        jruby -J-Dfile.encoding=Windows-1252 -Ilib ../Canna/src/snippet/JRUBY6685.rb 
        XML string encoding: UTF-8
        Parsed root name encoding: UTF-8
        

        and

        java -jar ~/Projects/nokogiri/tmp/jruby-complete.jar -J-Dfile.encoding=Windows-1252 -Ilib ../Canna/src/snippet/JRUBY6685.rb 
        warning: -J-Dfile.encoding=Windows-1252 argument ignored (launched in same VM?)
        XML string encoding: UTF-8
        Parsed root name encoding: UTF-8
        

        JRUBY6685.rb is the example given in the description.
        I used the following to run it:

        jruby -v
        jruby 1.7.0.preview2.dev (ruby-1.9.3-p203) (2012-07-25 aaceaa2) (Java HotSpot(TM) 64-Bit Server VM 1.6.0_33) [darwin-x86_64-java]
        
        java -version
        java version "1.6.0_33"
        Java(TM) SE Runtime Environment (build 1.6.0_33-b03-424-10M3720)
        Java HotSpot(TM) 64-Bit Server VM (build 20.8-b03-424, mixed mode)
        
        Luc Heinrich added a comment -

        Hi, sorry for the late follow-up (kids, vacations and stuff).

        I tried Yoko's changes in Nokogiri master and yes, everything seems to be fine now:

        % jruby -v
        jruby 1.7.0.preview2 (1.9.3p203) 2012-08-07 4a6bb0a on Java HotSpot(TM) 64-Bit Server VM 1.6.0_33-b03-424-11M3720 [darwin-x86_64]
        
        % jruby test.rb
        jruby test.rb
        Nokogiri version: 1.5.6.rc1
        XML string encoding: UTF-8
        Parsed root name encoding: UTF-8
        

        I need to test on Windows too but so far so good. Thanks!


          People

          • Assignee: Thomas E Enebo
          • Reporter: Luc Heinrich
          • Votes: 0
          • Watchers: 6
