groovy

String minus() fails in case of regex chars in operand [minor breaking change]

Details

  • Type: Bug Bug
  • Status: Closed Closed
  • Priority: Major Major
  • Resolution: Fixed
  • Affects Version/s: 1.0
  • Fix Version/s: 1.5.2
  • Component/s: groovy-jdk
  • Labels:
    None
  • Environment:
    all
  • Number of attachments :
    0

Description

The current implementation of String.minus fails if the operand contains regular expression special characters. The current implementation looks like this:
/**

  • Remove a part of a String
    *
  • @param left a String
  • @param value a String part to remove
  • @return a String minus the part to be removed
    */
    public static String minus(String left, Object value) { String text = toString(value); return left.replaceFirst(text, ""); }

I would suggest to escape all regex special characters in the "text" String or to choose a different implementation.
Examples of wrong or strange behaviour:
groovy> println "asd" - "a
c"
groovy> go
Caught: java.util.regex.PatternSyntaxException: Illegal control escape sequence near index 2
a\c
^
at gjdk.java.lang.String_GroovyReflector.invoke(Unknown Source)
at CommandLine.run(CommandLine.groovy:1)

groovy> println "should_not_match_but_matches"-"should_...match"
groovy> go
but_matches

groovy> println "asd"-"a
ud"
groovy> go
Caught: java.util.regex.PatternSyntaxException: Illegal Unicode escape sequence near index 4
a\ud
^
at gjdk.java.lang.String_GroovyReflector.invoke(Unknown Source)
at CommandLine.run(CommandLine.groovy:1)

Activity

Hide
Andreas Sahlbach added a comment -

hmm...some of my examples above are messed up, because Jira replaces my double-backslashes with newlines. So here they come again:

groovy> println "asd" - "a\\c"
groovy> go
Caught: java.util.regex.PatternSyntaxException: Illegal control escape sequence near index 2
a\c
^
at gjdk.java.lang.String_GroovyReflector.invoke(Unknown Source)
at CommandLine.run(CommandLine.groovy:1)

groovy> println "should_not_match_but_matches"-"should_...match"
groovy> go
but_matches

groovy> println "asd"-"a\\ud"
groovy> go
Caught: java.util.regex.PatternSyntaxException: Illegal Unicode escape sequence near index 4
a\ud
^
at gjdk.java.lang.String_GroovyReflector.invoke(Unknown Source)
at CommandLine.run(CommandLine.groovy:1)
Show
Andreas Sahlbach added a comment - hmm...some of my examples above are messed up, because Jira replaces my double-backslashes with newlines. So here they come again:
groovy> println "asd" - "a\\c"
groovy> go
Caught: java.util.regex.PatternSyntaxException: Illegal control escape sequence near index 2
a\c
^
at gjdk.java.lang.String_GroovyReflector.invoke(Unknown Source)
at CommandLine.run(CommandLine.groovy:1)

groovy> println "should_not_match_but_matches"-"should_...match"
groovy> go
but_matches

groovy> println "asd"-"a\\ud"
groovy> go
Caught: java.util.regex.PatternSyntaxException: Illegal Unicode escape sequence near index 4
a\ud
^
at gjdk.java.lang.String_GroovyReflector.invoke(Unknown Source)
at CommandLine.run(CommandLine.groovy:1)
Hide
Paul King added a comment -

Now escapes regex meta-characters.

Show
Paul King added a comment - Now escapes regex meta-characters.
Hide
Jim White added a comment -

I just found this issue because of a report by a user discovering the unexpected regex behavior of minus(String, Object).

I object to making a breaking change to the behavior of this (or any) public API at a bug fix release level (like 1.5.2). At a minimum this type of change must wait until a minor release like 1.6.

But I'm not sure I'd agree with this change at all. I do agree that this behavior is at first a surprise, but I suspect the original implementor intended this and it is consistent with Groovy's support of regex. Certainly it has been this way since 1.0 and is documented (although the "see also" makes the assumption that the reader is a Java expert or will click the link).

I do believe the documentation should be updated to make the current behavior clearer. And if it were to change in the future, then the release notes are going to need to make a big notice about this breaking change.

And lastly, if you're gonna yank the regex behavior, don't use String.replaceFirst at all since it is the worst possible performance (quote a regex pattern, compile the pattern, then do regex replacement)! If you want to string literal replacement you use String.indexOf(String).

Show
Jim White added a comment - I just found this issue because of a report by a user discovering the unexpected regex behavior of minus(String, Object). I object to making a breaking change to the behavior of this (or any) public API at a bug fix release level (like 1.5.2). At a minimum this type of change must wait until a minor release like 1.6. But I'm not sure I'd agree with this change at all. I do agree that this behavior is at first a surprise, but I suspect the original implementor intended this and it is consistent with Groovy's support of regex. Certainly it has been this way since 1.0 and is documented (although the "see also" makes the assumption that the reader is a Java expert or will click the link). I do believe the documentation should be updated to make the current behavior clearer. And if it were to change in the future, then the release notes are going to need to make a big notice about this breaking change. And lastly, if you're gonna yank the regex behavior, don't use String.replaceFirst at all since it is the worst possible performance (quote a regex pattern, compile the pattern, then do regex replacement)! If you want to string literal replacement you use String.indexOf(String).
Hide
Paul King added a comment - - edited

In Groovy 1.0, the Javadoc made no mention of the regex-style handling under the covers for the DGM#minus(String, Object) method (nor did GinA as far as I can find). In 1.5.0 and 1.5.1, the Javadoc is not very clear but mentions this regex behavior. We should clarify this doco in 1.5.2 if we can. The proposed behavior for 1.6.0 is that if the remove target is a pattern, regex-like behavior will occur. If target is not a pattern, the text will simply be literally removed from the String. This results in the following behavior:

assert 'abcda.ce' - /a.c/ == 'abcde'
assert 'abcda.ce' - ~/a.c/ == 'da.ce'
Show
Paul King added a comment - - edited In Groovy 1.0, the Javadoc made no mention of the regex-style handling under the covers for the DGM#minus(String, Object) method (nor did GinA as far as I can find). In 1.5.0 and 1.5.1, the Javadoc is not very clear but mentions this regex behavior. We should clarify this doco in 1.5.2 if we can. The proposed behavior for 1.6.0 is that if the remove target is a pattern, regex-like behavior will occur. If target is not a pattern, the text will simply be literally removed from the String. This results in the following behavior:
assert 'abcda.ce' - /a.c/ == 'abcde'
assert 'abcda.ce' - ~/a.c/ == 'da.ce'
Hide
Paul King added a comment -

Reopened until we decide what to do for 1.5.2/1.6. For 1.5.2 we need at least some doco/clarification and not merge any 1.6 only behavior on to that branch.

Show
Paul King added a comment - Reopened until we decide what to do for 1.5.2/1.6. For 1.5.2 we need at least some doco/clarification and not merge any 1.6 only behavior on to that branch.
Hide
Paul King added a comment - - edited

Leaving this issue to cover just the 1.5.2 case, the 1.6 case is now covered in a subtask.
Just to summarize options at this stage:

  1. leave code on branch as is but remove misleading '@see String#replaceFirst(String,String)' from javadoc as this is unintentionally leaking implementation (update this issue as won't fix for 1.5.2 - leaves desired behavior ambiguous for 1.5.x)
  2. include r10049 and r10349 on branch (close this issue - minor breaking change will be noted in release notes - another minor breaking change will be noted in 1.6 release notes)
  3. include r10350 on branch (builds on previous two which will also be added) (close this issue - minor breaking change will be noted in release notes and 1.5.2 behavior will be aligned with 1.6+ behavior). This is my preferred solution.
Show
Paul King added a comment - - edited Leaving this issue to cover just the 1.5.2 case, the 1.6 case is now covered in a subtask. Just to summarize options at this stage:
  1. leave code on branch as is but remove misleading '@see String#replaceFirst(String,String)' from javadoc as this is unintentionally leaking implementation (update this issue as won't fix for 1.5.2 - leaves desired behavior ambiguous for 1.5.x)
  2. include r10049 and r10349 on branch (close this issue - minor breaking change will be noted in release notes - another minor breaking change will be noted in 1.6 release notes)
  3. include r10350 on branch (builds on previous two which will also be added) (close this issue - minor breaking change will be noted in release notes and 1.5.2 behavior will be aligned with 1.6+ behavior). This is my preferred solution.
Hide
Olivier Armand added a comment -

Browsing those patches, I found that RegexUtils.quote seems to have an incorrect behavior in some cases.

System.out.println(RegexUtils.quote("\\")); // ouputs \Q\\E
System.out.println(RegexUtils.quote("\\\\E")); // outputs \Q\\E\\E\Q\E
Show
Olivier Armand added a comment - Browsing those patches, I found that RegexUtils.quote seems to have an incorrect behavior in some cases.
System.out.println(RegexUtils.quote("\\")); // ouputs \Q\\E
System.out.println(RegexUtils.quote("\\\\E")); // outputs \Q\\E\\E\Q\E
Hide
Jim White added a comment -

Those are the same results as Pattern.quote:

monospaced
import java.util.regex.Pattern

assert Pattern.quote('
') == /\Q
E/
assert Pattern.quote('\\\\E') == /\Q\\E\\E\Q\E/

println Pattern.quote('
')
println Pattern.quote("\\\\E")
monospaced

And speaking of RegexUtils it also needs to implement quoteReplacement for Matcher.quoteReplacement. But I guess that's for another JIRA...

Show
Jim White added a comment - Those are the same results as Pattern.quote: monospaced import java.util.regex.Pattern assert Pattern.quote('
') == /\Q
E/ assert Pattern.quote('\\\\E') == /\Q\\E\\E\Q\E/ println Pattern.quote('
') println Pattern.quote("\\\\E") monospaced And speaking of RegexUtils it also needs to implement quoteReplacement for Matcher.quoteReplacement. But I guess that's for another JIRA...
Hide
Olivier Armand added a comment -

You're right, forget my comment.

Show
Olivier Armand added a comment - You're right, forget my comment.
Hide
Guillaume Laforge added a comment -

Should be fixed after the merge of trunk into the branch.

Show
Guillaume Laforge added a comment - Should be fixed after the merge of trunk into the branch.

People

Vote (0)
Watch (5)

Dates

  • Created:
    Updated:
    Resolved: