Hmm, I tried just retrieving the RSS feed you posted, with those two lines of code, and it appears to print out just fine.
My script:
require 'net/http'
rss = Net::HTTP.get(URI.parse('http://rss.wikio.fr/a_la_une.rss'))
puts rss
And the following output resulted (among many RSS entries):
polytechnique de l'université
And this is without setting $KCODE at all. Can you produce a script that will actually output the wrong content? Perhaps you used "p" instead of "puts" and our inspect implementation isn't doing what it ought to?

OK Charlie, sorry the report wasn't correct, but there does seem to be a bug. I'm also just back from work, sorry for the delay:
rss = Net::HTTP.get(URI.parse('http://rss.wikio.fr/a_la_une.rss'))
Here, what I get in my jirb console is without the accents. That's different from the irb console, but that's not important at all; it may be only a console issue. The bug I actually had occurs later, with 'rexml/document', when processing the RSS:
require 'rexml/document'
xml = REXML::Document.new rss
puts xml
Here I'm very sure MRI will show the French accents in xml, while JRuby doesn't have them. Instead, I get "?" characters for every accented character.
If you prefer, a much simpler test to try against MRI and JRuby is:
require 'rexml/document'; REXML::Document.new "<node>héllo accentuated world!</node>"
Any idea? I don't know how we could rename the bug then.

Sorry again, Java sucked all my brain today; the one-line test is actually:
require 'rexml/document'; puts REXML::Document.new "<node>héllo accentuated world!</node>"
Since puts seems to matter with JRuby. So any idea?

Thank you for the additional test. I can confirm it substitutes in the ? character. This would mean that somewhere the UTF8 handling in JRuby is not recognizing this as a valid character, and is substituting in the default "unknown" character instead.
We'll have a look at it...

REXML not parsing UTF8 properly
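One way to tell whether the accented character was really replaced, rather than just displayed badly by the console, is to look at the bytes instead of the printed output. The following is only a sketch: accent_bytes_intact? is a hypothetical helper name, not part of REXML or JRuby, and it only looks for the UTF-8 encoding of "é" (bytes 0xC3 0xA9).

```ruby
# Hypothetical helper (not part of REXML or JRuby): returns true when the
# string still contains the two-byte UTF-8 sequence for "é" (0xC3 0xA9).
def accent_bytes_intact?(text)
  text.bytes.each_cons(2).any? { |first, second| first == 0xC3 && second == 0xA9 }
end

# "\u00E9" is "é"; the escape keeps this file independent of its own encoding.
original = "<node>h\u00E9llo accentuated world!</node>"
# What the report describes seeing: "?" in place of the accented character.
garbled  = "<node>h?llo accentuated world!</node>"

puts accent_bytes_intact?(original)
puts accent_bytes_intact?(garbled)
```

Running the same check on the string that comes back from REXML, under both MRI and JRuby, would show whether the bytes are lost in the string itself or only on the way to the console.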
Not going to be fixed in RC2, but I bumped it up to "blocker" for 1.0. I really hope we can get it fixed before final.
This can actually be reduced to just the following for me using -e:
puts "héllo accentuated world!"
Ruby prints out what it should, JRuby does not.

If I put the code from my last comment in a file it works OK, and if I put Raphael's code in a file it does not... it seems like there's a problem in both cases.
I committed a fix for
I just ran his test case and it is still \nnn all over the place in jirb.
The IRB and CLI issues may be a Java/terminal problem. Note the following:
class Test {
    public static void main(String[] args) {
        System.out.println(args[0]);
    }
}
And run from the command line:
~/NetBeansProjects/jruby $ java -cp . Test 'h\303\251llo'
h?llo
I'll look into the issues with rexml to see if we're closer on those, but the UTF8-from-terminal problem seems like a larger Java issue.

Just as a data point: when we specify -Dfile.encoding=ISO88591 in the above java Test program, it does display properly. This is probably a decent clue to the command-line parsing of Unicode chars with -e.
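To make the byte-level picture concrete, here is a small Ruby sketch of what the two bytes \303\251 look like under different charsets. It is an illustration only, not a diagnosis of the JVM behaviour: the last step assumes the "?" comes from encoding into a charset that cannot represent "é", which is one possible mechanism and is not confirmed by the test above.

```ruby
# "\303\251" are octal escapes for the two UTF-8 bytes of "é" (0xC3 0xA9).
raw = "h\303\251llo".b

as_utf8   = raw.dup.force_encoding(Encoding::UTF_8)
as_latin1 = raw.dup.force_encoding(Encoding::ISO_8859_1)

puts as_utf8.length     # character count when the bytes are read as UTF-8
puts as_latin1.length   # character count when each byte is one Latin-1 char

# Decoding as ISO-8859-1 and encoding back to ISO-8859-1 is byte-transparent,
# so a UTF-8 terminal still receives the original two bytes.
round_trip = as_latin1.encode(Encoding::UTF_8).encode(Encoding::ISO_8859_1)
puts round_trip.bytes == raw.bytes

# Assumption for illustration: the output charset cannot represent "é",
# so the encoder substitutes "?".
ascii = as_utf8.encode(Encoding::US_ASCII, undef: :replace, replace: "?")
puts ascii
```

The byte-transparent round trip would be consistent with the observation that forcing an ISO-8859-1 file encoding makes the output display properly on a UTF-8 terminal.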
I committed changes to the Java String-based implementation of MatchData that makes the original rexml code all work now. Basically, the impl wasn't properly re-encoding strings as UTF8 on the way back out of MatchData, causing it to garble unicode characters. Raphael, can you give your code a try and see if it's working better now?
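The property this change is meant to restore can be written as a small check: text captured through MatchData should come back as valid UTF-8 with the same bytes as the matching part of the input. This is only a sketch of that check in plain Ruby, not the JRuby implementation; the pattern and variable names are chosen for illustration.

```ruby
# Input with a non-ASCII character; "\u00E9" is "é".
source   = "<node>h\u00E9llo accentuated world!</node>"
expected = "h\u00E9llo accentuated world!"

# Pull the element text out through a regular expression, so the value
# travels through MatchData on its way back to the caller.
match    = %r{<node>(.*)</node>}.match(source)
captured = match[1]

puts captured.encoding
puts captured.valid_encoding?
puts captured.bytes == expected.bytes
```

If the last line prints false on one implementation and true on another, the garbling happens on the way out of MatchData rather than in the console.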
Tom: If you're going to do the encoding work here, wait for Raphael to confirm the fix before resolving this. Otherwise, open a new bug for the encoding stuff.

Indeed, for me parsing RSS with REXML now works properly with the accented RSS feed. Thanks a lot for this one!
Still, the console bug of failing to read accented characters you type (unless specifying -Dfile.encoding=ISO88591, maybe) still sucks a bit and is awkward. Maybe we should close this REXML bug and open a new one only for the console UTF8 bug. I don't have admin rights to do that, so it's up to you. Thanks for the constant progress!

Tom's currently looking into the console bugs, and he'll decide whether to open a new bug or do it in this one. But it's great to hear the rexml issues are resolved!
To see the difference, the test should actually use $KCODE in both cases:
require 'jcode'
$KCODE = 'u'
require 'net/http'
rss = Net::HTTP.get(URI.parse('http://rss.wikio.fr/a_la_une.rss')) # some RSS feed with French accented characters
Then accented characters are OK in MRI but not in JRuby. Sorry for the first test, which wasn't correct.