Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: JRuby 1.6.6
    • Fix Version/s: JRuby 1.7.0.pre1
    • Component/s: None
    • Labels:
      None
    • Environment:
    • Testcase included:
      yes
    • Number of attachments :
      0

      Description

      While running the tests for the Ruby library 'mustache' (https://github.com/defunkt/mustache), one test in particular fails:

      https://github.com/defunkt/mustache/blob/master/test/mustache_test.rb#L510-522

      JRuby dies calling StringScanner#scan_until here:

      https://github.com/defunkt/mustache/blob/master/lib/mustache/parser.rb#L231

      You can reproduce the issue with the following:

      require 'strscan'
      regex = /(^[ \t]*)?\{\{/
      text = "<h1>&#20013;&#25991; {{test}}</h1>\n\n{{> utf8_partial}}\n"
      text.force_encoding 'BINARY'
      scanner = StringScanner.new(text)
      scanner.scan_until(regex) # Fans spin up, and this method never returns.
      

      This seems to happen regardless of whether JRuby is in 1.8 or 1.9 mode. I am running this test like so:

      JRUBY_OPTS=--1.9 ruby -I"lib:test" test/mustache_test.rb -n test_utf8 -v

      I've also run it with: JRUBY_OPTS="--1.9 LC_ALL=en_US.UTF-8"

      It appears that this affects multibyte UTF-8 characters. If I replace the Chinese characters with "foo bar", then there is no problem.
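
      Since pasted non-ASCII text can be altered by the tracker, here is a sketch of the same reproduction with the two characters written as \u escapes, run side by side with the "foo bar" variant. It assumes the characters are U+4E2D and U+6587 ("中文", as quoted in the comments). The helper names and the Timeout wrapper are illustrative additions, and Timeout may not be able to interrupt a tight loop inside the regex engine on an affected JRuby.

```ruby
require 'strscan'
require 'timeout'

REGEX = /(^[ \t]*)?\{\{/

# Build the template around `middle` and relabel it as binary,
# mirroring text.force_encoding 'BINARY' in the original report.
def binary_template(middle)
  text = "<h1>#{middle} {{test}}</h1>\n\n{{> utf8_partial}}\n".dup
  text.force_encoding('BINARY')
end

# Return the scanner position after scan_until, or :timeout if the
# call did not come back within `seconds` (hypothetical safety net).
def scan_position(text, seconds = 5)
  scanner = StringScanner.new(text)
  Timeout.timeout(seconds) do
    scanner.scan_until(REGEX)
    scanner.pos
  end
rescue Timeout::Error
  :timeout
end

# Assumption: the original characters are U+4E2D U+6587.
utf8_pos  = scan_position(binary_template("\u4E2D\u6587"))
ascii_pos = scan_position(binary_template("foo bar"))

puts "multibyte input: #{utf8_pos.inspect}"
puts "ASCII input:     #{ascii_pos.inspect}"
```

      On an unaffected interpreter both calls should return promptly with an integer position; a hang or :timeout on the multibyte input only would match the behaviour described above.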

        Activity

        Scott Gonyea added a comment -

        The bug appears to live in JRuby's Joni library:

        https://github.com/jruby/joni/blob/master/src/org/joni/Matcher.java#L460-464

        Basically, the call `enc.length(bytes, s, end)` alternates between returning 1 and -1 on successive loop iterations, so `s` bounces between 3 and 4 and the loop never exits. As an example, I added some logging to the method:

        Config.log.println("entering loop...");
        do {
            Config.log.println("start: s='" + s + "', prev='" + prev + "'");
            if (matchCheck(origRange, s, prev)) return match(s);
            prev = s;
            s += enc.length(bytes, s, end);
            Config.log.println("end: enc.length='" + enc.length(bytes, s, end) + "', s='" + s + "', prev='" + prev + "'");
        } while (s < range);

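        The output is basically the following (note that each "end" line logs `enc.length` at the already-advanced `s`, not the value that was just added):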

        entering loop...
        start: s='0', prev='0'
        end: enc.length='1', s='1', prev='0'
        start: s='1', prev='0'
        end: enc.length='1', s='2', prev='1'
        start: s='2', prev='1'
        end: enc.length='1', s='3', prev='2'
        start: s='3', prev='2'
        end: enc.length='-1', s='4', prev='3'
        start: s='4', prev='3'
        end: enc.length='1', s='3', prev='4'
        start: s='3', prev='4'
        end: enc.length='-1', s='4', prev='3'
        start: s='4', prev='3'
        end: enc.length='1', s='3', prev='4'
        start: s='3', prev='4'
        end: enc.length='-1', s='4', prev='3'
        start: s='4', prev='3'
        end: enc.length='1', s='3', prev='4'
        start: s='3', prev='4'
        end: enc.length='-1', s='4', prev='3'
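
        The oscillation in the log can be imitated outside of Joni. The sketch below is only a simulation: `LENGTH_AT` is a hypothetical stand-in for `enc.length(bytes, s, end)` that returns -1 at index 4 and 1 elsewhere (the values implied by the log), and `advance` gives up after a fixed number of steps so the demo terminates. The guarded variant, which clamps the step to at least 1, is an illustration of why a non-positive length is fatal to this loop; it is not presented as the fix Joni uses.

```ruby
# Hypothetical stand-in for enc.length(bytes, s, end): 1 everywhere
# except index 4, where it returns -1, mirroring the logged values.
LENGTH_AT = ->(s) { s == 4 ? -1 : 1 }

# Walk forward the way the logged loop does. Stops when s reaches
# `range`, or after `max_steps` iterations so the demo always ends.
def advance(range, max_steps, length_at)
  s = 0
  visited = []
  max_steps.times do
    visited << s
    s += length_at.call(s)
    return [visited, :finished] if s >= range
  end
  [visited, :gave_up]
end

# Variant that treats a non-positive length as 1, so s always moves forward.
guarded = ->(s) { [LENGTH_AT.call(s), 1].max }

raw_trace, raw_status   = advance(10, 12, LENGTH_AT)
safe_trace, safe_status = advance(10, 12, guarded)

puts "unguarded: #{raw_status} after visiting #{raw_trace.inspect}"
puts "guarded:   #{safe_status} after visiting #{safe_trace.inspect}"
```

        With the unguarded length the trace keeps alternating between 3 and 4 until the step budget runs out, which is the same pattern as the log above.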

        Charles Oliver Nutter added a comment -

        Appears to be working on master. We'll have a 1.7 preview release within the next week.

        system ~/projects/jruby $ cat test.rb 
        require 'strscan'
        regex = /(^[ \t]*)?\{\{/
        text = "<h1>&#20013;&#25991; {{test}}</h1>\n\n{{> utf8_partial}}\n"
        text.force_encoding 'BINARY'
        scanner = StringScanner.new(text)
        scanner.scan_until(regex) # Fans spin up, and this method never returns.
        
        system ~/projects/jruby $ jruby test.rb 
        
        system ~/projects/jruby $ 
        
        Scott Gonyea added a comment - - edited

        That's really weird. The Chinese characters that I pasted into JIRA are now gone, replaced with "&#20013;&#25991;". Let's see if they show up here: "中文".

        edit. Sigh.

        Testing: <pre>中文</pre>

        I wonder if the <pre> block turns them into the &#20013; thing.

        Scott Gonyea added a comment -

        For the record, I still get this issue on JRuby master. I'm inclined to move this issue to GitHub, as they seem to be better at UTF-8 than JIRA.

        Scott Gonyea added a comment -

        Issue is not fixed. The text being executed is simply being butchered by JIRA.

        Charles Oliver Nutter added a comment -

        Ahh, yes. The noformat tags do monkey with unicode characters. Can you attach this as a script or confirm it's actually working on master?

        Charles Oliver Nutter added a comment -

        Or put the script into a gist/pastie.

        Scott Gonyea added a comment -

        https://github.com/jruby/jruby/issues/174

        Scott Gonyea added a comment -

        Moved issue to GitHub ( https://github.com/jruby/jruby/issues/174 )

        Scott Gonyea added a comment -

        God I suck at JIRA. Sorry for my spam.


          People

          • Assignee:
            Charles Oliver Nutter
            Reporter:
            Scott Gonyea
          • Votes:
            0
            Watchers:
            0

            Dates

            • Created:
              Updated:
              Resolved: