
Key: JRUBY-820
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Blocker
Assignee: Unassigned
Reporter: Raphael Valyi
Votes: 0
Watchers: 1
JRuby

Net::HTTP.get behaves differently from MRI, failing to get UTF8 properly

Created: 15/Apr/07 02:03 PM   Updated: 22/Dec/07 06:29 AM
Component/s: Core Classes/Modules
Affects Version/s: JRuby 0.9.8, JRuby 0.9.9, JRuby 1.0.0RC1, JRuby 1.0.0RC2
Fix Version/s: JRuby 1.0.0RC3

Time Tracking:
Not Specified

Environment: JRuby trunk between 13/04/07 and 15/04/07, java 1.6, Ubuntu 2.6.20-14-generic
Issue Links:
Related

Testcase included: yes


Description
Try the following both in irb and then in jirb:

require 'net/http'
rss = Net::HTTP.get(URI.parse('http://rss.wikio.fr/a_la_une.rss')) # an RSS feed with French accented characters

In MRI irb, you'll see accented characters like é, è, à...
In JRuby jirb, those are replaced by numbers.

Also, adding
require 'jcode'
$KCODE = 'u'
doesn't fix the issue. I should nonetheless mention that setting $KCODE in my Rails app makes Rails output accented characters properly as long as they come from the database, the RHTML, or the Ruby code. But my RSS import fails because of this issue.
Any idea why net/http behaves differently? Any idea for a workaround?



Raphael Valyi added a comment - 15/Apr/07 02:13 PM
Whoops, sorry,

To see the difference, the test should actually set $KCODE in both cases:

require 'jcode'
$KCODE = 'u'
require 'net/http'
rss = Net::HTTP.get(URI.parse('http://rss.wikio.fr/a_la_une.rss')) # an RSS feed with French accented characters

Then accented characters are OK in MRI but not in JRuby. Sorry for the first test, which wasn't correct.


Charles Oliver Nutter added a comment - 16/Apr/07 12:48 PM
Hmm, I tried just retrieving that rss you posted, with those two lines of code, and it appears to print out just fine.

My script:

require 'net/http'
rss = Net::HTTP.get(URI.parse('http://rss.wikio.fr/a_la_une.rss')) 
puts rss

And the following output resulted (among many rss entries): polytechnique de l'université

And this is without setting $KCODE at all. Can you produce a script that will actually output the wrong content? Perhaps you used "p" instead of "puts" and our inspect impl isn't doing what it ought?
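
To illustrate the "p" versus "puts" distinction raised above: p prints the result of inspect, and in Ruby 1.8 with $KCODE unset, inspect escaped each byte at or above 0x80 as a \nnn octal sequence, which would explain the "numbers" seen in the console. The following is a minimal sketch of that rendering, not the actual inspect implementation; octal_escape is a hypothetical helper, and the code assumes a modern Ruby (1.9 or later), which postdates this thread.

```ruby
# Hypothetical helper (not part of Ruby or JRuby): renders a string roughly
# the way Ruby 1.8's String#inspect did with $KCODE unset, escaping every
# byte >= 0x80 as a three-digit octal sequence and leaving ASCII bytes alone.
def octal_escape(str)
  str.bytes.map { |b| b < 0x80 ? b.chr : format("\\%03o", b) }.join
end

s = "h\xC3\xA9llo"     # "héllo" spelled out as its UTF-8 bytes

puts s                 # writes the bytes unchanged; a UTF-8 terminal shows the accent
puts octal_escape(s)   # prints h\303\251llo
```

The data is identical in both cases; only the display differs, which is why puts output can look correct while the console echo does not.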


Raphael Valyi added a comment - 16/Apr/07 03:34 PM
OK Charlie, sorry, the report wasn't correct, but there does seem to be a bug. Also, I'm just back from work, sorry for the delay:
rss = Net::HTTP.get(URI.parse('http://rss.wikio.fr/a_la_une.rss'))

Here, what I get in my jirb console is without the accents. That's different from the irb console, but that's not important at all; it may be only a console issue.
You were right that puts rss behaves correctly.

The bug I had actually occurs later with 'rexml/document' when processing the rss:

require 'rexml/document'
xml = REXML::Document.new rss
puts xml

Here I'm quite sure MRI will show the French accents in xml, while JRuby doesn't. Instead, I get a "?" character in place of every accented character, like:
"Coupe de l'America report?e" with JRuby instead of "Coupe de l'America reportée" with MRI

If you prefer, a much simpler test to try against MRI and JRuby is:

require 'rexml/document'; REXML::Document.new "<node>héllo accentuated world!</node>"

Any idea? I don't know how we could rename the bug then.


Raphael Valyi added a comment - 16/Apr/07 03:36 PM
Sorry again, java sucked all my brain today, the one line test is actually:
require 'rexml/document'; puts REXML::Document.new "<node>héllo accentuated world!</node>"

since puts seems to matter with JRuby. Any idea?


Charles Oliver Nutter added a comment - 16/Apr/07 03:47 PM
Thank you for the additional test. I can confirm it substitutes in the ? character. This would mean that somewhere the UTF8 handling in JRuby is not recognizing this as a valid character, and is substituting in the default "unknown" character instead.

We'll have a look at it...


Charles Oliver Nutter added a comment - 17/Apr/07 04:18 PM
REXML not parsing UTF8 properly

Charles Oliver Nutter added a comment - 16/May/07 04:48 PM
Not going to be fixed in RC2, but I bumped it up to "blocker" for 1.0. I really hope we can get it fixed before final.

Charles Oliver Nutter added a comment - 22/May/07 01:08 AM
This can actually be reduced to just the following for me using -e:
puts "héllo accentuated world!"

Ruby prints out what it should, JRuby does not.


Charles Oliver Nutter added a comment - 22/May/07 01:23 AM
If I put the code from my last comment in a file it works ok, and if I put Raphael's code in a file it does not...it seems like there's a problem in both cases.

Charles Oliver Nutter added a comment - 30/May/07 04:06 AM
I committed a fix for JRUBY-828 that may resolve this... can you check it again?

Thomas E Enebo added a comment - 30/May/07 12:00 PM
I just ran his test case and it is still \nnn all over the place in jirb.

Charles Oliver Nutter added a comment - 30/May/07 01:36 PM
The IRB and CLI issues may be a Java/terminal problem. Note the following:
class Test {
    public static void main(String[] args) {
        System.out.println(args[0]);
    }
}

And run from the command line:

~/NetBeansProjects/jruby $ java -cp . Test 'h\303\251llo'
h?llo

I'll look into the issues with rexml to see if we're closer on those, but the UTF8-from-terminal problem seems like a larger Java issue.


Thomas E Enebo added a comment - 30/May/07 03:02 PM
Just as a data point: when we specify -Dfile.encoding=ISO88591 for the above Java Test program, it does display properly. This is probably a decent clue for the command-line parsing of Unicode characters with -e.
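
One way to see why ISO-8859-1 would help: every byte value maps to exactly one ISO-8859-1 character, so decoding the terminal's UTF-8 bytes as ISO-8859-1 and encoding them back on output is lossless, and the terminal receives the original bytes. By contrast, a character that the output charset cannot represent is replaced by "?". The sketch below models both paths in Ruby (1.9 or later assumed). It illustrates a plausible mechanism only; which charsets Java actually used on the machines in this thread is not established here, and latin1_round_trip and ascii_output are hypothetical helper names.

```ruby
# Hypothetical helper: decode raw bytes as ISO-8859-1, hold them as UTF-16
# (as a Java String would), then encode back to ISO-8859-1 for output.
def latin1_round_trip(raw)
  raw.dup.force_encoding(Encoding::ISO_8859_1)
     .encode(Encoding::UTF_16BE)
     .encode(Encoding::ISO_8859_1)
     .bytes
end

# Hypothetical helper: decode raw bytes as UTF-8, then encode to US-ASCII,
# replacing any character US-ASCII cannot represent with "?".
def ascii_output(raw)
  raw.dup.force_encoding(Encoding::UTF_8)
     .encode(Encoding::US_ASCII, undef: :replace, replace: "?")
end

raw = "h\xC3\xA9llo".b                       # the UTF-8 bytes for "héllo"

puts latin1_round_trip(raw) == raw.bytes     # prints true: bytes survive intact
puts ascii_output(raw)                       # prints h?llo
```

The second path reproduces the h?llo output shown for the Java Test program, and the first is consistent with the observation that forcing ISO-8859-1 makes the text display properly on a UTF-8 terminal.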

Charles Oliver Nutter added a comment - 30/May/07 04:07 PM
I committed changes to the Java String-based implementation of MatchData that make the original rexml code all work now. Basically, the impl wasn't properly re-encoding strings as UTF8 on the way back out of MatchData, causing it to garble Unicode characters. Raphael, can you give your code a try and see if it's working better now?

Tom: If you're going to do the encoding work here, wait for Raphael to confirm the fix before resolving this. Otherwise, open a new bug for the encoding stuff.
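
For anyone wanting to check the MatchData behaviour described above without going through REXML, a small regression-style check could look like the following. This is only a sketch under assumptions: captured_suffix is a hypothetical helper, the pattern is made up for illustration, and the code targets a modern Ruby (1.9 or later) rather than the JRuby builds discussed here. The idea is that text captured by a regular expression should come back with the same UTF-8 bytes that went in.

```ruby
# Hypothetical helper: returns whatever follows "report" in the text,
# taken from the MatchData capture group, or nil when there is no match.
def captured_suffix(text)
  m = /report(\S+)/.match(text)
  m && m[1]
end

title  = "Coupe de l'America report\u00E9e"   # "reportée", with é written as an escape
suffix = captured_suffix(title)

puts suffix                   # the accented suffix
puts suffix.bytes.inspect     # prints [195, 169, 101]
puts suffix.valid_encoding?   # prints true
```

A garbled result, such as a "?" byte in place of 195 and 169, would point at the capture path rather than at the console.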


Raphael Valyi added a comment - 30/May/07 05:58 PM
Indeed, for me parsing RSS with REXML now works properly with the accented feed. Thanks a lot for this one!

Still, the console bug, where accented characters you type are not read correctly (unless maybe you specify -Dfile.encoding=ISO88591), still sucks a bit and is awkward.
A workaround for that last bug is to escape accented characters manually, but this is only a workaround.

Still, maybe we should close this REXML bug and open a new one only for the console UTF8 bug. I don't have admin rights to do that, so it's up to you. Thanks for the constant progress!


Charles Oliver Nutter added a comment - 30/May/07 05:59 PM
Tom's currently looking into the console bugs, and he'll decide whether to open a new bug or do it in this one. But it's great to hear the rexml issues are resolved!

Thomas E Enebo added a comment - 30/May/07 06:01 PM
JRUBY-1007 is the console one I am working on. Follow that for console display issues using unicode.