Details
-
Type:
Bug
-
Status:
Closed
-
Priority:
Major
-
Resolution: Fixed
-
Affects Version/s: JRuby 1.2
-
Fix Version/s: None
-
Component/s: HelpWanted, Parser
-
Labels: None
-
Environment: Linux x64
jruby 1.3.0 (ruby 1.8.6p287) (2009-06-03 5dc2e22) (Java HotSpot(TM) 64-Bit Server VM 1.6.0_02) [amd64-java]
jruby 1.2.0 (ruby 1.8.6 patchlevel 287) (2009-03-16 rev 9419) [amd64-java]
jruby 1.1.6 (ruby 1.8.6 patchlevel 114) (2008-12-17 rev 8388) [amd64-java]
*** LOCAL GEMS ***
hpricot (0.6.164)
-
Description
UTF-8 characters no longer pass gracefully through hpricot (after jruby 1.1.6)
The following code sample, run with UTF-8 encoding, uses an input string containing a Unicode em dash (mdash):
{code:ruby}
require 'rubygems'
require 'hpricot'
input = "<p>TUCSON, Ariz. — The driver</p>"
puts input
doc = Hpricot.parse( input )
puts doc.inner_html
{code}
Here is comparative output:
% ruby ./utf8_sample_2.rb
<p>TUCSON, Ariz. — The driver</p>
<p>TUCSON, Ariz. — The driver</p>
% /opt/dist/jruby-1.1.6/bin/jruby ./utf8_sample_2.rb
<p>TUCSON, Ariz. — The driver</p>
<p>TUCSON, Ariz. — The driver</p>
% /opt/dist/jruby-1.2.0/bin/jruby ./utf8_sample_2.rb
<p>TUCSON, Ariz. — The driver</p>
<p>TUCSON, Ariz. â The driver</p>
% /opt/dist/jruby-1.3.0/bin/jruby ./utf8_sample_2.rb
<p>TUCSON, Ariz. — The driver</p>
<p>TUCSON, Ariz. â The driver</p>
Where jruby 1.2.0 and 1.3.0 show a mangled mdash (â).
This could be a parser bug, so I will start there and assign it to myself.
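For reference, the mangled output above is consistent with the three UTF-8 bytes of the em dash being treated as three separate single-byte (ISO-8859-1 style) characters somewhere along the way. This is only a guess at the mechanism, not a confirmed root cause. The sketch below uses modern Ruby (1.9+) string APIs purely to illustrate that byte-level effect; it does not involve hpricot or JRuby internals.

```ruby
# Illustration only: an assumed mechanism, not the confirmed cause of the bug.
dash = "\u2014"        # em dash, U+2014
bytes = dash.bytes     # the UTF-8 encoding is three bytes: E2 80 94

# Treat each UTF-8 byte as its own ISO-8859-1 character, then re-encode to
# UTF-8. This mimics a byte-per-character conversion of multibyte input.
mangled = dash.dup.force_encoding("ISO-8859-1").encode("UTF-8")

puts bytes.map { |b| format("%02X", b) }.join(" ")
puts mangled.codepoints.map { |c| format("U+%04X", c) }.join(" ")
```

The first resulting code point, U+00E2, is "â". The other two (U+0080 and U+0094) are non-printing control characters, which would explain why only "â" is visible in the jruby 1.2.0 and 1.3.0 output.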