Chris Williams made changes - 26/Mar/08 10:33 AM
We changed the lexer to be non-character aware because Ruby itself is not character aware (in 1.8 at least). We ended up doing this to support Ruby's byte-bucket String behavior. I believe NetBeans solves the problem by converting strings back to UTF-8 on the way out (back to the IDE). I believe they can make that assumption because they know the encoding of the characters before they get sent to be parsed. Is it possible for you to do the same thing?

I'm confused about what Thomas meant by "on the way out", but if a conversion to UTF-8 is needed, would something like the following work in the RDT source? Given a char array called fContents...

lexerSource = LexerSource.getSource("filename", new ByteArrayInputStream(String.valueOf(fContents).getBytes("UTF-8")), null, config);

I'd like to see this resolved because it's complicating packaging on Gentoo Linux.

Internally the lexer assumes all content is just a big stream of bytes. Ruby 1.8 strings require this. So if you send a UTF-8 string into the lexer, it will process that stream as if it were raw (or ISO-8859-1). If you read anything back from the generated AST, you know it originally came from a UTF-8 source. So you can just re-encode the strings you get out of the AST from ISO-8859-1 to UTF-8. That is what I meant by "on the way out".

Hmm, it's been quite a while since I reported this, so I'm a little rusty as to the exact issue. But I think essentially it's not just a matter of the AST content being mangled as a result, but also of the positions reported for the nodes being off by the extra byte lengths.

A related issue that I brought up in IRC is from this original bug report on our system: http://support.aptana.com/asap/browse/ROR-833

The Ruby source is:

module ApplicationHelper
def £
:foo
end
end

I tried a simple test to make sure none of the RDT code was interfering by mangling the source or its encoding. But it still gets a SyntaxException complaining about the pound character. Here's the test code:

String src = "$KCODE = \"utf8\"\nmodule ApplicationHelper\n def £\n :foo\n end\nend";
ParserConfiguration config = new ParserConfiguration(0, true, false);
LexerSource source = LexerSource.getSource("filename", new ByteArrayInputStream(String.valueOf(src).getBytes("UTF-8")), null, config);
RubyParserResult result = new RubyParser().getDefaultRubyParser().parse(config, source);
assertNotNull(result);

As for the last issue I commented on here, this is against JRuby 1.1.4.

Ok, the last example is not valid Ruby, so that example is not a good one for this bug (I am fairly certain we have multi-byte bugs in the lexer). This also did not work in Ruby 1.9:

% ruby -Ku ~/jruby/scripts/name.rb
/Users/enebo/jruby/scripts/name.rb:2: Invalid char `\243' in expression
/Users/enebo/jruby/scripts/name.rb:3: syntax error, unexpected tSYMBEG
:foo
^
/Users/enebo/jruby/scripts/name.rb:5: syntax error, unexpected kEND, expecting $end

Any other examples?

Are you saying it's not valid Ruby despite the fact that it works (for me, at least) in 1.8.6? That's okay, I'm just checking.

You are right... For some reason it got saved as an iso8859_1 encoded file. When I make it utf-8, things work in both Ruby 1.8.6 and 1.9. The example was fine... My editing skills were not.

Fix for the syntax error aspect of this is in commit 8237. This should make strings and identifiers work. Please report back on this about position info... something tells me this is still broken, since we return bytes from a stream and we increment accordingly.

The multibyte parsing part of this was fixed in the past. The column-offset bug remains. A few months ago we forked all IDE parsing aspects from the main JRuby project into a separate project, JRubyParser: http://kenai.com/projects/jruby-parser

I am closing this out since the first half of this (the part which affects runtime execution) has been fixed already.
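
For anyone picking up the remaining position problem, the two ideas discussed in this thread (re-encoding AST strings "on the way out", and correcting node positions that were counted in bytes) might look roughly like the plain-Java sketch below. It is only an illustration under assumptions: the class and method names (ByteBucketDemo, asByteBucket, reencode, byteOffsetToCharOffset) are hypothetical and not part of JRuby, RDT or JRubyParser, the source is assumed to be UTF-8, and the byte offset is assumed to fall on a character boundary.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of JRuby/RDT/JRubyParser.
final class ByteBucketDemo {

    // Simulates what a byte-oriented lexer hands back: every UTF-8 byte
    // of the source is widened to one char, i.e. viewed as ISO-8859-1.
    static String asByteBucket(String source) {
        byte[] utf8 = source.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // "On the way out": turn the one-char-per-byte string back into raw
    // bytes and decode them as UTF-8. Assumes the original was UTF-8.
    static String reencode(String fromAst) {
        byte[] raw = fromAst.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Converts an offset counted in UTF-8 bytes into an offset counted in
    // Java chars (UTF-16 code units) by decoding only the prefix.
    // Assumes byteOffset lies on a character boundary.
    static int byteOffsetToCharOffset(String source, int byteOffset) {
        byte[] utf8 = source.getBytes(StandardCharsets.UTF_8);
        int end = Math.max(0, Math.min(byteOffset, utf8.length));
        return new String(utf8, 0, end, StandardCharsets.UTF_8).length();
    }

    public static void main(String[] args) {
        // \u00A3 is the pound sign used in the example above.
        String src = "def \u00A3\n  :foo\nend";
        String bucket = asByteBucket(src);

        // In the byte-bucket view, indexes are byte offsets.
        int byteOffset = bucket.indexOf(":foo");

        System.out.println(reencode(bucket).equals(src));
        System.out.println(byteOffset + " -> " + byteOffsetToCharOffset(src, byteOffset));
    }
}
```

Since the pound sign takes two bytes in UTF-8, the byte offset of :foo in this sample is one larger than its character offset, which is the kind of drift described above. Decoding the prefix on every lookup is simple but linear in the offset; an IDE doing this for many nodes would probably want to precompute a byte-to-char table once per file.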
Thomas E Enebo made changes - 17/Aug/09 05:06 PM
Patch for LexerSource to allow users to send in a Reader. New ReadrLexerSource class coming in another attachment.