JRuby

UTF-8 characters don't pass through hpricot gracefully since JRuby 1.1.6

Details

  • Type: Bug
  • Status: Resolved
  • Priority: Major
  • Resolution: Fixed
  • Affects Version/s: JRuby 1.2
  • Fix Version/s: None
  • Component/s: HelpWanted, Parser
  • Description:

    UTF-8 characters no longer pass gracefully through hpricot (after jruby 1.1.6)

    The following code sample, tested with UTF-8 encoding, has an input string containing a Unicode mdash:

    require 'rubygems'
    require 'hpricot'
    
    input = "<p>TUCSON, Ariz. — The driver</p>"
    puts input
    
    doc = Hpricot.parse( input )
    
    puts doc.inner_html
    
    Here is comparative output:

    % ruby ./utf8_sample_2.rb
    <p>TUCSON, Ariz. — The driver</p>
    <p>TUCSON, Ariz. — The driver</p>
    david) /opt/dist/jruby-1.1.6/bin/jruby ./utf8_sample_2.rb
    <p>TUCSON, Ariz. — The driver</p>
    <p>TUCSON, Ariz. — The driver</p>
    % /opt/dist/jruby-1.2.0/bin/jruby ./utf8_sample_2.rb
    <p>TUCSON, Ariz. — The driver</p>
    <p>TUCSON, Ariz. — The driver</p>
    % /opt/dist/jruby-1.3.0/bin/jruby ./utf8_sample_2.rb
    <p>TUCSON, Ariz. — The driver</p>
    <p>TUCSON, Ariz. — The driver</p>

    
    

    JRuby 1.2.0 and 1.3.0 show a mangled mdash (—) in the second line of output, the one produced by Hpricot.

  • Environment:
    Linux x64
    jruby 1.3.0 (ruby 1.8.6p287) (2009-06-03 5dc2e22) (Java HotSpot(TM) 64-Bit Server VM 1.6.0_02) [amd64-java]
    jruby 1.2.0 (ruby 1.8.6 patchlevel 287) (2009-03-16 rev 9419) [amd64-java]
    jruby 1.1.6 (ruby 1.8.6 patchlevel 114) (2008-12-17 rev 8388) [amd64-java]
    *** LOCAL GEMS ***
    hpricot (0.6.164)


Activity

Thomas E Enebo added a comment - 15/Jun/09 3:48 PM

This could be a parser bug so I will start there and assign it to me...

David Kellum added a comment - 16/Jun/09 2:17 PM

Thanks for picking this one up. I hope it is, but I suspect it's not the Ruby parser: the original system read the Unicode text from the file system rather than from a string baked into the Ruby source as in this test case, and the behavior is the same.

Ludwig Hähne added a comment - 04/Jul/09 11:38 AM

Hit the same problem. For some reason Hpricot tries to reencode already encoded UTF-8 characters:

JRuby 1.3.0:

irb(main):049:0> s = "ÖL"
=> "\303\226L"
irb(main):050:0> d = Hpricot(s)
=> #<Hpricot::Doc "\303\203\302\226L">

It looks as if the correct UTF-8 representation \303\226 is split into two new Unicode characters, with each byte interpreted as a Unicode code point and then encoded again, e.g. \303 => U+00C3 => C3 83
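
The pattern described above can be reproduced by hand. The following is a minimal sketch, assuming Ruby 1.9 or later (String#force_encoding and String#encode are not available in the Ruby 1.8 mode of the JRuby versions discussed here). The helper names double_encode and octal_dump are hypothetical and are not part of Hpricot or JRuby; the sketch only illustrates the suspected byte-to-code-point reinterpretation, not the actual code path inside Hpricot.

```ruby
# Treat every byte of a UTF-8 string as an ISO-8859-1 (Latin-1) code point
# and encode those code points as UTF-8 again. This mimics the suspected
# double encoding; it is not Hpricot's actual code.
def double_encode(utf8)
  utf8.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Show ASCII bytes as-is and every other byte as an octal escape,
# similar to the irb output quoted in this comment.
def octal_dump(str)
  str.bytes.map { |b| b < 0x80 ? b.chr : format('\\%03o', b) }.join
end

original = "\u00D6L" # U+00D6 (O with diaeresis) followed by "L"
puts octal_dump(original)                # \303\226L
puts octal_dump(double_encode(original)) # \303\203\302\226L
```

The second line of output matches the Hpricot::Doc content shown for JRuby 1.3.0, which is consistent with the bytes being decoded as Latin-1 somewhere along the way.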

Anthony Eden added a comment - 10/Jul/09 7:08 PM

Unfortunately I'm seeing this as well. Are there any known workarounds?

Thomas Olausson added a comment - 13/Jul/09 4:06 PM

Seems to happen in Hpricot.scan

# jruby 1.3.1
jruby -rubygems  -e "require 'hpricot'; Hpricot.scan('spinäl tap') {|x| p x}"
[:text, "spin\303\203\302\244l tap", nil, "spin\303\203\302\244l tap"]

# jruby 1.1.6
jruby -rubygems  -e "require 'hpricot'; Hpricot.scan('spinäl tap') {|x| p x}"
[:text, "spin\212l tap", nil, "spin\212l tap"]
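
The JRuby 1.3.1 bytes above fit the double-encoding pattern: U+00E4 is \303\244 in UTF-8, and \303\203\302\244 is what results when those two bytes are each encoded a second time. On affected versions the mangling can in principle be reversed after parsing. The following is only a sketch of such a stopgap, assuming Ruby 1.9 or later and UTF-8 input; repair_double_encoding is a hypothetical helper, not something proposed in this ticket or provided by Hpricot or JRuby.

```ruby
# Try to undo one round of double encoding. If every character fits in
# ISO-8859-1 and the resulting bytes form valid UTF-8, assume the string
# was double encoded and return the repaired version; otherwise return
# the input unchanged.
def repair_double_encoding(str)
  candidate = str.encode(Encoding::ISO_8859_1).force_encoding(Encoding::UTF_8)
  candidate.valid_encoding? ? candidate : str
rescue EncodingError
  # Characters outside ISO-8859-1 (or invalid input) mean the string was
  # not double encoded in the way sketched here.
  str
end

mangled = "spin\u00C3\u00A4l tap" # same bytes as \303\203\302\244 in the 1.3.1 output
repaired = repair_double_encoding(mangled)
puts repaired == "spin\u00E4l tap"                # true
puts repair_double_encoding(repaired) == repaired # true: clean text is left alone
```

The check is a heuristic: text that legitimately contains a sequence such as U+00C3 followed by U+00A4 would be altered, so this is no substitute for a fixed parser.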

Thomas Olausson added a comment - 16/Jul/09 3:32 PM

Related to http://jira.codehaus.org/browse/JRUBY-3813

Ken Mayer added a comment - 16/Jul/09 4:36 PM

We used git bisect and localized the failure to one commit. The fix is just to roll back the changes, but that probably reopens an older bug report.

Original bug report: JRUBY-2974
Our new bug report: JRUBY-3813 (has another test case which doesn't involve Hpricot)
Original commit: r9072 / 4f4595e09bcfd817a64c41b4badff5f8ebf3aa4f

Charles Oliver Nutter added a comment - 08/Aug/09 10:30 AM

Ola Bini has just rewritten most of Hpricot, updating it for the latest version. His changes have been merged into _why's repository, but not yet released. Perhaps the updated version will fix this issue?

http://wiki.github.com/why/hpricot

Josh Matthews added a comment - 24/Aug/09 2:37 PM

My experience with Ola Bini's rewrite is that it does fix this issue. It's probably safe to close this and related bugs with a bit more testing.

Charles Oliver Nutter added a comment - 25/Aug/09 4:39 AM

I would be reluctant to close this until a new (fixed) Hpricot is released...

Daniel Harrington added a comment - 25/Aug/09 4:47 AM

I'm currently testing the new version, but since it's not "officially" released yet, I'd also vote against closing the ticket now.

Matthias Brandt added a comment - 26/Aug/09 3:55 AM

I go with Daniel...

Daniel Hahn added a comment - 28/Sep/09 7:51 AM

So will this be fixed in the JRuby code or in Hpricot, or both? Seeing that _why disappeared and someone else took over, how will the new Hpricot version be released?

Charles Oliver Nutter added a comment - 05/Oct/09 3:24 PM

This is fixed in Hpricot, but I'm not sure who is responsible for releasing it. We need help to track that person down and get an Hpricot release out with the updated code.

Nick Sieger added a comment - 06/Nov/09 9:58 AM

Fixed with new Hpricot 0.8.2 release.


Dates

  • Created: 04/Jun/09 4:55 PM
  • Updated: 06/Nov/09 9:58 AM
  • Resolved: 06/Nov/09 9:58 AM