JRuby

Net::HTTP.get behaves differently from MRI, failing to get UTF8 properly

Details

  • Type: Bug
  • Status: Closed
  • Priority: Blocker
  • Resolution: Fixed
  • Affects Version/s: JRuby 0.9.8, JRuby 0.9.9, JRuby 1.0.0RC1, JRuby 1.0.0RC2
  • Fix Version/s: JRuby 1.0.0RC3
  • Component/s: Core Classes/Modules
  • Labels:
    None
  • Environment:
    JRuby trunk between 13/04/07 and 15/04/07, java 1.6, Ubuntu 2.6.20-14-generic
  • Testcase included:
    yes
  • Number of attachments :
    0

Description

Try the following both in irb and then in jirb:

require 'net/http'
rss = Net::HTTP.get(URI.parse('http://rss.wikio.fr/a_la_une.rss')) #some RSS feed with french accentuated characters

In MRI irb, you'll see accented characters like é, è, à...
In JRuby jirb, those are replaced by numbers.

Also, adding
require 'jcode'
$KCODE = 'u'
doesn't fix the issue. I should nonetheless mention that setting $KCODE in my Rails app makes Rails output accented characters properly, as long as they come from the database, the RHTML or the Ruby code. But my RSS import fails because of this issue.
Any idea why net/http behaves differently? Any idea for a workaround?

Activity

Raphael Valyi added a comment -

Whoops, sorry,

to see the difference, the test should actually set $KCODE in both cases:

require 'jcode'
$KCODE = 'u'
require 'net/http'
rss = Net::HTTP.get(URI.parse('http://rss.wikio.fr/a_la_une.rss')) #some RSS feed with french accentuated characters

Then accented characters are OK in MRI but not in JRuby. Sorry for the first test, which wasn't correct.

Charles Oliver Nutter added a comment -

Hmm, I tried just retrieving that rss you posted, with those two lines of code, and it appears to print out just fine.

My script:

require 'net/http'
rss = Net::HTTP.get(URI.parse('http://rss.wikio.fr/a_la_une.rss')) 
puts rss

And the following output resulted (among many rss entries): polytechnique de l'université

And this is without setting $KCODE at all. Can you produce a script that will actually output the wrong content? Perhaps you used "p" instead of "puts" and our inspect impl isn't doing what it ought?
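
The "p" versus "puts" distinction raised here can be sketched as follows. This is only an illustration on a modern Ruby (2.0 or later, where String#b exists), not the JRuby code path under discussion; the helper name raw_and_inspected is made up for the example. Ruby 1.8 without $KCODE printed octal escapes such as \303\251 from inspect, whereas a modern Ruby prints hex escapes for a binary-tagged string.

```ruby
# Returns the raw bytes of +text+ together with their inspected form.
# raw_and_inspected is a hypothetical helper, not part of any library.
def raw_and_inspected(text)
  bytes = text.b           # copy tagged as binary; the bytes are unchanged
  [bytes, bytes.inspect]   # puts writes the first form, p writes the second
end

# "\u00E9" is the letter e with an acute accent (two bytes in UTF-8).
raw, shown = raw_and_inspected("h\u00E9llo")

puts raw     # writes the original UTF-8 bytes to the terminal
puts shown   # writes an ASCII-only, escaped view of the same bytes
```

So seeing "numbers" in an irb echo does not by itself mean the string is damaged: the echo goes through inspect, while puts writes the bytes unchanged.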

Raphael Valyi added a comment -

OK Charlie, sorry the report wasn't correct, but there does seem to be a bug. Also, I'm just back from work, sorry for the delay:

rss = Net::HTTP.get(URI.parse('http://rss.wikio.fr/a_la_une.rss')) 

Here, what I get in my jirb console is without the accents. That's different from the irb console, but it's not important at all; it may only be a console issue.
And you were right: puts rss behaves correctly.

The bug I had actually occurs later with 'rexml/document' when processing the rss:

require 'rexml/document'
xml = REXML::Document.new rss
puts xml

Here I'm very sure MRI will show the French accents in xml while JRuby doesn't. Instead, I get a "?" for every accented character, like:
"Coupe de l'America report?e" with JRuby instead of "Coupe de l'America reportée" with MRI

If you prefer, a much simpler test to try against MRI and JRuby is:

require 'rexml/document'; REXML::Document.new "<node>héllo accentuated world!</node>"

Any idea? I don't know how we could rename the bug, then.

Raphael Valyi added a comment -

Sorry again, Java sucked all my brain today; the one-line test is actually:

require 'rexml/document'; puts REXML::Document.new "<node>héllo accentuated world!</node>"

Since puts seems to matter with JRuby. So, any idea?

Charles Oliver Nutter added a comment -

Thank you for the additional test. I can confirm it substitutes in the ? character. This would mean that somewhere the UTF8 handling in JRuby is not recognizing this as a valid character, and is substituting in the default "unknown" character instead.

We'll have a look at it...
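
For readers unfamiliar with the "unknown" character substitution described above, here is a minimal sketch of the same effect using String#encode on a modern Ruby. It is an analogy under stated assumptions, not the JRuby code path involved; the helper name to_ascii_with_placeholder is invented for the example.

```ruby
# Converts +text+ to US-ASCII, replacing every character the target
# charset cannot represent with "?". Hypothetical helper for illustration.
def to_ascii_with_placeholder(text)
  text.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "?")
end

# "\u00E9" is the letter e with an acute accent.
puts to_ascii_with_placeholder("Coupe de l'America report\u00E9e")
# prints: Coupe de l'America report?e
```

The output has the same shape as the "report?e" reported earlier in this thread: one "?" per character that the conversion could not map.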

Charles Oliver Nutter added a comment -

REXML not parsing UTF8 properly

Charles Oliver Nutter added a comment -

Not going to be fixed in RC2, but I bumped it up to "blocker" for 1.0. I really hope we can get it fixed before final.

Charles Oliver Nutter added a comment -

This can actually be reduced to just the following for me using -e:

puts "héllo accentuated world!"

Ruby prints out what it should, JRuby does not.

Charles Oliver Nutter added a comment -

If I put the code from my last comment in a file it works ok, and if I put Raphael's code in a file it does not...it seems like there's a problem in both cases.

Charles Oliver Nutter added a comment -

I committed a fix for JRUBY-828 that may resolve this....can you check it again?

Thomas E Enebo added a comment -

I just ran his test case and it is still \nnn all over the place in jirb.

Charles Oliver Nutter added a comment -

The IRB and CLI issues may be a Java/terminal problem. Note the following:

class Test {
    public static void main(String[] args) {
        // Echo the first command-line argument exactly as the JVM decoded it
        System.out.println(args[0]);
    }
}

And run from the command line:

~/NetBeansProjects/jruby $ java -cp . Test 'h\303\251llo'
h?llo

I'll look into the issues with rexml to see if we're closer on those, but the UTF8-from-terminal problem seems like a larger Java issue.

Thomas E Enebo added a comment -

Just as a data point: when we specify -Dfile.encoding=ISO88591 for the Java Test program above, it does display properly. This is probably a decent clue about how Unicode characters passed with -e on the command line are being parsed.
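
One possible reading of this data point, sketched in Ruby rather than Java: ISO-8859-1 maps every byte to exactly one character, so bytes decoded and re-encoded with it survive unchanged even when they are really UTF-8. The thread does not confirm that this is what the JVM was doing; decode_as is a hypothetical helper, and the snippet assumes a modern Ruby with String#b.

```ruby
# Interprets +bytes+ as text in +charset+ and returns it as UTF-8.
# Hypothetical helper for illustration only.
def decode_as(bytes, charset)
  bytes.dup.force_encoding(charset).encode("UTF-8")
end

utf8_bytes = "h\u00E9llo".b   # the UTF-8 bytes of the accented greeting

# Decoded with the right charset, the accent is intact.
puts decode_as(utf8_bytes, "UTF-8")

# Decoded as ISO-8859-1, each of the two accent bytes becomes its own
# character, which is wrong as text...
latin1_view = decode_as(utf8_bytes, "ISO-8859-1")
puts latin1_view

# ...but encoding back to ISO-8859-1 restores the original bytes exactly,
# so a UTF-8 terminal would still render the accent.
puts latin1_view.encode("ISO-8859-1").bytes == utf8_bytes.bytes
```

A charset that cannot represent those bytes at all has no such round trip, which would be consistent with the "h?llo" output shown earlier.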

Charles Oliver Nutter added a comment -

I committed changes to the Java String-based implementation of MatchData that makes the original rexml code all work now. Basically, the impl wasn't properly re-encoding strings as UTF8 on the way back out of MatchData, causing it to garble unicode characters. Raphael, can you give your code a try and see if it's working better now?

Tom: If you're going to do the encoding work here, wait for Raphael to confirm the fix before resolving this. Otherwise, open a new bug for the encoding stuff.
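
A small check in the spirit of the MatchData fix described above: text captured by a regexp should come back byte-for-byte identical to what went in. This is a sketch of a regression-style test one might run on MRI and JRuby, not the actual test added to JRuby; captured_text and the <node> pattern are assumptions made for the example.

```ruby
# Returns the text between <node> and </node>, or nil if there is no match.
# Hypothetical helper for illustration only.
def captured_text(xml)
  match = %r{<node>(.*)</node>}.match(xml)
  match && match[1]
end

original = "h\u00E9llo accentuated world!"   # \u00E9 is e with acute accent
captured = captured_text("<node>#{original}</node>")

# If strings are re-encoded correctly on the way out of MatchData,
# the captured bytes equal the original bytes.
puts captured.bytes == original.bytes
```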

Raphael Valyi added a comment -

Indeed, for me parsing RSS with REXML now works properly with the accented feed. Thanks a lot for this one!

Still, the console bug, where accented characters you type are not read correctly (unless maybe you specify -Dfile.encoding=ISO88591), still sucks a bit and is awkward.
A workaround for that last bug is to escape accented chars manually, but this is only a workaround.

Still, maybe we should close this REXML bug and open a new one just for the console UTF8 bug. I don't have admin rights to do that, so it's up to you. Thanks for the constant progress!
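
The manual-escape workaround mentioned above presumably looks something like the following, using the same octal byte values (\303\251) that appear in the Java test earlier in this thread. The method name escaped_greeting is invented for the example.

```ruby
# Spells the accented letter as two octal byte escapes, so the source
# typed at the console contains only ASCII characters.
# Hypothetical helper for illustration only.
def escaped_greeting
  "h\303\251llo accentuated world!"
end

# The escapes produce the UTF-8 bytes 0xC3 0xA9 for the accented e.
puts escaped_greeting
puts escaped_greeting.bytes.first(3).inspect   # prints: [104, 195, 169]
```

Because nothing non-ASCII is typed, the console's input decoding never gets a chance to mangle the character.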

Charles Oliver Nutter added a comment -

Tom's currently looking into the console bugs, and he'll decide whether to open a new bug or do it in this one. But it's great to hear the rexml issues are resolved!

Thomas E Enebo added a comment -

JRUBY-1007 is the console one I am working on. Follow that for console display issues using unicode.

