JRuby

UTF-8 regular expressions aren't working

Details

  • Type: Bug Bug
  • Status: Closed Closed
  • Priority: Major Major
  • Resolution: Fixed
  • Affects Version/s: JRuby 0.9.9, JRuby 1.0.0RC1, JRuby 1.0.0RC2
  • Fix Version/s: JRuby 1.0.0RC3
  • Component/s: Core Classes/Modules
  • Labels:
    None
  • Number of attachments :
    8

Description

Regular expressions in UTF-8 (created like /.../u) don't work properly.
A test rb file is attached because I'm not sure what this form would do to the text, which is in UTF-8.
MRI works as expected.

  1. jruby-828.patch
    28/May/07 6:35 PM
    25 kB
    Charles Oliver Nutter
  2. jruby-828.patch
    17/Apr/07 4:06 AM
    15 kB
    Charles Oliver Nutter
  3. jruby-828-2.patch
    29/May/07 7:17 PM
    25 kB
    Marcin Mielzynski
  4. jruby-828-2007-05-22-1.patch
    22/May/07 7:52 AM
    3 kB
    Changshin Lee
  5. jruby-828-2007-05-23-1.patch
    23/May/07 2:52 AM
    6 kB
    Changshin Lee
  6. testregexp.rb
    17/Apr/07 3:10 AM
    0.1 kB
    Wes Nakamura
  7. testUTF8Regex.rb
    30/May/07 8:21 PM
    2 kB
    Wes Nakamura
  8. testUTF8Regex.rb
    22/May/07 4:34 PM
    1 kB
    Wes Nakamura

Issue Links

Activity

Hide
Charles Oliver Nutter added a comment -

Here's a first attempt at a patch...unfortunately it exposes an interesting problem.

The first line prints out "2", which is actually correct...if you're working with characters instead of bytes. Unfortunately, we want the byte position here, not the character position. Obviously passing the raw bytes to jregex isn't an option...it doesn't accept bytes, and it doesn't appear to like us directly casting the bytes into individual characters.

The second line print out "1", but I don't have a good explanation for that one.

So this will require a bit of thought.

Show
Charles Oliver Nutter added a comment - Here's a first attempt at a patch...unfortunately it exposes an interesting problem. The first line prints out "2", which is actually correct...if you're working with characters instead of bytes. Unfortunately, we want the byte position here, not the character position. Obviously passing the raw bytes to jregex isn't an option...it doesn't accept bytes, and it doesn't appear to like us directly casting the bytes into individual characters. The second line print out "1", but I don't have a good explanation for that one. So this will require a bit of thought.
Hide
Charles Oliver Nutter added a comment -

No progress on this since my patch...hopefully still in 1.0.

Show
Charles Oliver Nutter added a comment - No progress on this since my patch...hopefully still in 1.0.
Hide
Charles Oliver Nutter added a comment -

Not going to be in RC2. Wes, if you're out there, this is in danger of falling off 1.0...so we could use some help.

Show
Charles Oliver Nutter added a comment - Not going to be in RC2. Wes, if you're out there, this is in danger of falling off 1.0...so we could use some help.
Hide
Wes Nakamura added a comment -

I was doing some work on chars, hoping that someone would fix this since chars will probably depend on it.

I'll take a look and see if I can figure out how to fix this.

Show
Wes Nakamura added a comment - I was doing some work on chars, hoping that someone would fix this since chars will probably depend on it. I'll take a look and see if I can figure out how to fix this.
Hide
Charles Oliver Nutter added a comment -

I think this issue and JRUBY-820 may be in the same area...basically there's problems with strings that have unicode characters passing into or out of regular expressions. The issue in 820 is likely because REXML uses /u regular expressions extensively to do all its parsing, and so any unicode content in the incoming strings gets mangled.

Show
Charles Oliver Nutter added a comment - I think this issue and JRUBY-820 may be in the same area...basically there's problems with strings that have unicode characters passing into or out of regular expressions. The issue in 820 is likely because REXML uses /u regular expressions extensively to do all its parsing, and so any unicode content in the incoming strings gets mangled.
Hide
Changshin Lee added a comment -

A quick patch to this issue (only).
With this patch, the attached test case works exactly the same as MRI.

Show
Changshin Lee added a comment - A quick patch to this issue (only). With this patch, the attached test case works exactly the same as MRI.
Hide
Changshin Lee added a comment -

One problem on my patch: how can "/u" suffix be recognized when Pattern is created? I assumed flag(int value) "2" indicated the case, which is actually wrong.

Show
Changshin Lee added a comment - One problem on my patch: how can "/u" suffix be recognized when Pattern is created? I assumed flag(int value) "2" indicated the case, which is actually wrong.
Hide
Charles Oliver Nutter added a comment -

The flag should still be present when a literal regexp is created by the interpreter. We can pass it through from there. I will have a look at your patch too.

Show
Charles Oliver Nutter added a comment - The flag should still be present when a literal regexp is created by the interpreter. We can pass it through from there. I will have a look at your patch too.
Hide
Changshin Lee added a comment -

Thanks for your response. FYI, '/u' suffix is not present in CompilerHelpers.regexpLiteral's ptr argument. If it's present or CompilerHelpers.regexpLiteral's options argument provides a hint for UTF 8 (currently ReOptions has no such a option), then we can create a proper Pattern.

Show
Changshin Lee added a comment - Thanks for your response. FYI, '/u' suffix is not present in CompilerHelpers.regexpLiteral's ptr argument. If it's present or CompilerHelpers.regexpLiteral's options argument provides a hint for UTF 8 (currently ReOptions has no such a option), then we can create a proper Pattern.
Hide
Wes Nakamura added a comment -

Sorry, I've fallen a bit behind. In the meantime I've created a QED test file which tests some simple uses of UTF8 regexps. Several other methods are failing.

This is with Changshin's patch applied. I'll continue investigation.

Show
Wes Nakamura added a comment - Sorry, I've fallen a bit behind. In the meantime I've created a QED test file which tests some simple uses of UTF8 regexps. Several other methods are failing. This is with Changshin's patch applied. I'll continue investigation.
Hide
Changshin Lee added a comment -

Here's a patch to testUTF8Regex.rb (only).
This patch passes all the tests in the file.
Note that Pattern is supposed to be UTF-8. In other words, we still need to figure out how to detect /u in JRuby.

Show
Changshin Lee added a comment - Here's a patch to testUTF8Regex.rb (only). This patch passes all the tests in the file. Note that Pattern is supposed to be UTF-8. In other words, we still need to figure out how to detect /u in JRuby.
Hide
Charles Oliver Nutter added a comment -

For those of you working on patches, do look at my original JRUBY-828 patch. I believe it does add the ability to pass /u all the way through from the parser to the construction of regexp. My missing pieces were what you all are working on, making sure that regexp works correctly and strings aren't destroyed. Is there more than what's in my original patch needed to get /u passed all the way through?

Show
Charles Oliver Nutter added a comment - For those of you working on patches, do look at my original JRUBY-828 patch. I believe it does add the ability to pass /u all the way through from the parser to the construction of regexp. My missing pieces were what you all are working on, making sure that regexp works correctly and strings aren't destroyed. Is there more than what's in my original patch needed to get /u passed all the way through?
Hide
Charles Oliver Nutter added a comment -

Here's an updated jruby-828.patch that combines my original patch with Changshin's modifications. This patch allows all of testUTF8Regex.rb to pass, but it seems that the modified logic (from Changshin's patch) for String#rindex is flawed. As a result test/rubicon/test_string.rb fails its rindex tests.

If we can fix that and get any remaining failures resolved, I think this patch goes a long way toward getting UTF8 regexp working correctly.

Show
Charles Oliver Nutter added a comment - Here's an updated jruby-828.patch that combines my original patch with Changshin's modifications. This patch allows all of testUTF8Regex.rb to pass, but it seems that the modified logic (from Changshin's patch) for String#rindex is flawed. As a result test/rubicon/test_string.rb fails its rindex tests. If we can fix that and get any remaining failures resolved, I think this patch goes a long way toward getting UTF8 regexp working correctly.
Hide
Marcin Mielzynski added a comment -

this patch makes String#rindex pass the tests, though, this is another String method that should join the list in JRUBY-688 for a complete rewrite (post 1.0 presumably).

Show
Marcin Mielzynski added a comment - this patch makes String#rindex pass the tests, though, this is another String method that should join the list in JRUBY-688 for a complete rewrite (post 1.0 presumably).
Hide
Charles Oliver Nutter added a comment -

I committed Marcin's revision to the combined patch, plus a few small modifications that allowed it to pass all tests. I'm very interested in bug reporters trying out trunk to see if it works well for them.

Show
Charles Oliver Nutter added a comment - I committed Marcin's revision to the combined patch, plus a few small modifications that allowed it to pass all tests. I'm very interested in bug reporters trying out trunk to see if it works well for them.
Hide
Charles Oliver Nutter added a comment -

All cases reported in this bug are now working and testUTF8Regex.rb is included in the build tests.

Show
Charles Oliver Nutter added a comment - All cases reported in this bug are now working and testUTF8Regex.rb is included in the build tests.
Hide
Wes Nakamura added a comment -

Sorry I wasn't able to be more involved in this bug. I've updated the test file so that it contains a few more complex tests involving strings that contain characters that are 1, 2, and 3 bytes long. I've also switched the test_equals arguments so that they're in the correct order.

Show
Wes Nakamura added a comment - Sorry I wasn't able to be more involved in this bug. I've updated the test file so that it contains a few more complex tests involving strings that contain characters that are 1, 2, and 3 bytes long. I've also switched the test_equals arguments so that they're in the correct order.
Hide
Charles Oliver Nutter added a comment -

Wes, just so you know, test 11 fails under MRI too, so I've commented it out. I'll use your updated test as our regression test. Thanks!

Show
Charles Oliver Nutter added a comment - Wes, just so you know, test 11 fails under MRI too, so I've commented it out. I'll use your updated test as our regression test. Thanks!
Hide
Wes Nakamura added a comment -

Is test 11 from the updated file? I.e.:

test_equal(6, "スパムハンク".index(/ム./u))

I don't get any failures, under both jruby (trunk) or MRI (1.8.2 it looks like I'm running on this machine...).

An interesting point I came across is that #rindex only specifies the last index to begin matching - if the match extends beyond but starts before the specified index, it will match.

Show
Wes Nakamura added a comment - Is test 11 from the updated file? I.e.: test_equal(6, "スパムハンク".index(/ム./u)) I don't get any failures, under both jruby (trunk) or MRI (1.8.2 it looks like I'm running on this machine...). An interesting point I came across is that #rindex only specifies the last index to begin matching - if the match extends beyond but starts before the specified index, it will match.

People

Vote (0)
Watch (2)

Dates

  • Created:
    Updated:
    Resolved: