GRECLIPSE
  1. GRECLIPSE
  2. GRECLIPSE-805

Unicode characters cause source locations to be off in Groovy editor

    Details

    • Type: Bug Bug
    • Status: Resolved Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.2Release
    • Fix Version/s: 2.1.2Release
    • Component/s: Editor
    • Labels:
      None
    • Number of attachments :
      0

      Description

      Type this into a groovy editor:

      "\u00E9"
      jj
      

      Notice that the underlining is under '0E' instead of 'jj'. This is implying that the source locations are not correct because of the unicode character in the parens.

        Issue Links

          Activity

          Hide
          Andrew Eisenberg added a comment -

          The problem comes around from org.codehaus.groovy.antlr.UnicodeEscapingReader, which will automatically read extra characters if the appropriate escape sequence is used. So, multiple characters can be consumed by the reader even though the GroovyLexer receives a single character.

          Show
          Andrew Eisenberg added a comment - The problem comes around from org.codehaus.groovy.antlr.UnicodeEscapingReader, which will automatically read extra characters if the appropriate escape sequence is used. So, multiple characters can be consumed by the reader even though the GroovyLexer receives a single character.
          Hide
          Andrew Eisenberg added a comment -

          See also GRECLIPSE-824.

          Show
          Andrew Eisenberg added a comment - See also GRECLIPSE-824 .
          Hide
          Andrew Eisenberg added a comment -

          Not going to make 2.1.0.

          Show
          Andrew Eisenberg added a comment - Not going to make 2.1.0.
          Hide
          Andrew Eisenberg added a comment -

          Bumping this up to major since many users are hitting this.

          Show
          Andrew Eisenberg added a comment - Bumping this up to major since many users are hitting this.
          Hide
          Andrew Eisenberg added a comment -

          I have a solution locally that will work for all locations that are not on the same line as the unicode escape sequence. For example, the following will show an underline correctly:

          "\u00E9"
          jj
          

          But, the following would not. It would show the underline 5 characters earlier than it should:

          "\u00E9" + jj
          

          This will solve most of the cases that we are interested in, since a single escape sequence would not mess up locations for the entire file.

          I am trying to find a balance between minimal changes to the underlying Groovy parser and completeness of the solution. The solution I have requires a very small change. I have an idea on how to make a complete solution, but it would require too many changes to the parser to keep things simple.

          I will sketch out my solution and the complete solution below.

          Show
          Andrew Eisenberg added a comment - I have a solution locally that will work for all locations that are not on the same line as the unicode escape sequence. For example, the following will show an underline correctly: "\u00E9" jj But, the following would not. It would show the underline 5 characters earlier than it should: "\u00E9" + jj This will solve most of the cases that we are interested in, since a single escape sequence would not mess up locations for the entire file. I am trying to find a balance between minimal changes to the underlying Groovy parser and completeness of the solution. The solution I have requires a very small change. I have an idea on how to make a complete solution, but it would require too many changes to the parser to keep things simple. I will sketch out my solution and the complete solution below.
          Hide
          Andrew Eisenberg added a comment -

          Actually, I have reworked things a bit and I have a full solution in my workspace.

          Here is a sketch of the solution:

          1. Changes to UnicodeEscapingReader to keep track of the number of escape sequences found in the current line as well as the total amount found in the file (up to the current location).
          2. Changes to SourceBuffer so that when items are added to the lineEndings array, it queries the UnicodeEscapingReader instance to find the offset to add for the total number of escape sequences found in the file.
          3. This correctly computes the source locations for the lines after an escape sequence, but the escape sequence line itself is not computed correctly.
          4. So, in addition to that, create a UnicodeLexerSharedInputState subclass of LexerSharedInputState that queries the UnicodeEscapingReader for an extra offset when the getColumn() and getTokenStartColumn() methods are called.
          5. Changes to ErrorRecoveredCSTParserPlugin and AntlrParserPlugin so that UnicodeLexerSharedInputState is used, instead of LexerSharedInputState.

          There is negligible overhead to this solution. I will write some tests and then commit.

          Show
          Andrew Eisenberg added a comment - Actually, I have reworked things a bit and I have a full solution in my workspace. Here is a sketch of the solution: Changes to UnicodeEscapingReader to keep track of the number of escape sequences found in the current line as well as the total amount found in the file (up to the current location). Changes to SourceBuffer so that when items are added to the lineEndings array, it queries the UnicodeEscapingReader instance to find the offset to add for the total number of escape sequences found in the file. This correctly computes the source locations for the lines after an escape sequence, but the escape sequence line itself is not computed correctly. So, in addition to that, create a UnicodeLexerSharedInputState subclass of LexerSharedInputState that queries the UnicodeEscapingReader for an extra offset when the getColumn() and getTokenStartColumn() methods are called. Changes to ErrorRecoveredCSTParserPlugin and AntlrParserPlugin so that UnicodeLexerSharedInputState is used, instead of LexerSharedInputState . There is negligible overhead to this solution. I will write some tests and then commit.
          Hide
          Andrew Eisenberg added a comment -

          I committed the fix with regression tests.

          Show
          Andrew Eisenberg added a comment - I committed the fix with regression tests.

            People

            • Assignee:
              Andrew Eisenberg
              Reporter:
              Andrew Eisenberg
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: