Optimizing regular expressions in Java

2015/08/27 13:20

If you've struggled with regular expressions that took hours to match when you needed them to complete in seconds, this article is for you. Java developer Cristian Mocanu explains where and why the regex pattern-matching engine tends to stall, then shows you how to make the most of backtracking rather than getting lost in it, how to optimize greedy and reluctant quantifiers, and why possessive quantifiers, independent grouping, and lookarounds are your friends.

Writing a regular expression is more than a skill -- it's an art.

-- Jeffrey Friedl

In this article I introduce some of the common weaknesses in regular expressions using the default java.util.regex package. I explain why backtracking is both the foundation of pattern matching with regular expressions and a frequent bottleneck in application code, why you should exercise caution when using greedy and reluctant quantifiers, and why it is essential to benchmark your regex optimizations. I then introduce several techniques for optimizing regular expressions, and discuss what happens when I run my new expressions through the Java pattern-matching engine.

For the purpose of this article I assume that you already have some experience using regular expressions and are most interested in learning how to optimize them in Java code. Topics covered include simple and automated optimization techniques as well as how to optimize greedy and reluctant quantifiers using possessive quantifiers, independent grouping, and lookarounds. See the Resources section for an introduction to regular expressions in Java.

I use double quotes ("") to delimit regular expressions and input strings; X, Y, and Z to denote sub-expressions or portions of a regular expression; and a, b, c, d (and so on) to denote single characters.

The Java pattern-matching engine and backtracking

The java.util.regex package uses a type of pattern-matching engine called a Nondeterministic Finite Automaton, or NFA. It's called nondeterministic because while trying to match a regular expression on a given string, each character in the input string might be checked several times against different parts of the regular expression. This is a widely used type of engine also found in .NET, PHP, Perl, Python, and Ruby. It puts great power into the hands of the programmer, offering a wide range of quantifiers and other special constructs such as lookarounds, which I'll discuss later in the article.

At heart, the NFA uses backtracking. Usually there isn't only one way to apply a regular expression on a given string, so the pattern-matching engine will try to exhaust all possibilities until it declares failure. To better understand the NFA and backtracking, consider the following example:

Regular expression: "sc(ored|ared|oring)x"
Input string: "scared"

First, the engine will look for "sc" and find it immediately as the first two characters in the input string. It will then try to match "ored" starting from the third character in the input string. That won't match, so it will go back to the third character and try "ared". This will match, so it will go forward and try to match "x". Finding no match there, it will go back again to the third character and search for "oring". This won't match either, and so it will go back to the second character in the input string and try to search for another "sc". Upon reaching the end of the input string it will declare failure.
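The walk-through above can be checked with a few lines of Java. This is only a minimal sketch: the class name BacktrackingDemo and the helper found() are made up for illustration, and the second input, "scaredx", is an assumed variation that lets the expression succeed.

```java
import java.util.regex.Pattern;

class BacktrackingDemo {
    // The expression from the example above, compiled once.
    static final Pattern SCORED = Pattern.compile("sc(ored|ared|oring)x");

    // True if the expression is found anywhere in the input.
    static boolean found(String input) {
        return SCORED.matcher(input).find();
    }

    public static void main(String[] args) {
        // "ared" matches, but the trailing "x" does not, so the engine
        // backtracks through the remaining alternatives and finally fails.
        System.out.println(found("scared"));   // false
        // With the "x" present, the second alternative leads to a match.
        System.out.println(found("scaredx"));  // true
    }
}
```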

Optimization tips for backtracking

With the above example you've seen how the NFA uses backtracking for pattern matching, and you've also discovered one of the problems with backtracking. Even in the simple example above the engine had to backtrack several times while trying to match the input string to the regular expression. It's easy to imagine what could happen to your application performance if backtracking got out of hand. An important part of optimizing a regular expression is minimizing the amount of backtracking that it does.

The Java pattern-matching engine has several optimizations at its disposal and can apply them automatically. I will discuss some of them later in the article. Unfortunately you can't rely on the engine to optimize your regular expressions all the time. In the above example, the regular expression is actually matched pretty fast, but in many cases the expression is too complex and the input string too large for the engine to optimize.

Because of backtracking, regular expressions encountered in real-world application scenarios can sometimes take hours to completely match. Worse, it takes much longer for the engine to declare that a regular expression did not match an input string than it does to find a successful match. This is an important fact to remember. Whenever you want to test the speed of a regular expression, test it mostly on strings that it does not match. Among those, especially use strings that almost match, because those take the longest to complete.

Now let's consider some of the ways you can optimize your regular expressions for backtracking.

Simple ways to optimize regular expressions

Later in the article I'll get into the more involved ways you can optimize regular expressions in Java. To start, though, here are a few simple optimizations that could save you time:

  • If you will use a regular expression more than once in your program, be sure to compile the pattern using Pattern.compile() instead of the more direct Pattern.matches(). Not compiling the regular expression can be costly if Pattern.matches() is used over and over again with the same expression, for example in a loop, because the matches() method will re-compile the expression every time it is used. Also remember that you can re-use the Matcher object for different input strings by calling the method reset().
  • Beware of alternation. Regular expressions like "(X|Y|Z)" have a reputation for being slow, so watch out for them. First of all, the order of alternation counts, so place the more common options in the front so they can be matched faster. Also, try to extract common patterns; for example, instead of "(abcd|abef)" use "ab(cd|ef)". The latter is faster because the NFA will try to match ab and won't try any of the alternatives if it doesn't find it. (In this case there are only two alternatives. If there were many alternatives the gains in speed would be more impressive.) Alternation really can slow down your programs. The expression ".*(abcd|efgh|ijkl).*" was three times slower in my test than using three calls to String.indexOf(), one for each alternative in the regular expression.
  • Capturing groups incur a small time penalty each time you use them. If you don't really need to capture the text inside a group, always use non-capturing groups. For example, use "(?:X)" instead of "(X)".
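The three tips above can be combined in one short sketch. The class name SimpleOptimizations and the method countMatches() are hypothetical. The pattern is the "ab(cd|ef)" example from the list, written with a non-capturing group because the captured text is not needed here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleOptimizations {
    // Compiled once, with the common prefix "ab" factored out of the
    // alternation and a non-capturing group instead of a capturing one.
    static final Pattern AB = Pattern.compile("ab(?:cd|ef)");

    // Counts the inputs that match in full, reusing one Matcher via reset().
    static int countMatches(String[] inputs) {
        Matcher m = AB.matcher("");
        int count = 0;
        for (String input : inputs) {
            if (m.reset(input).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] inputs = {"abcd", "abef", "abxx", "cd"};
        System.out.println(countMatches(inputs)); // 2
    }
}
```

A Matcher is not thread-safe, so a reused instance like this should stay confined to one thread. The compiled Pattern can be shared freely.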

Let the engine do the work for you

As I mentioned before, the java.util.regex engine can optimize a regular expression several ways when it is compiled. For example, if the regular expression contains a string that must be present in the input string (or else the whole expression won't match), the engine can sometimes search that string first and report a failure if it doesn't find a match, without checking the entire regular expression.

Another very useful way to automatically optimize a regular expression is to have the engine check the length of the input string against the expected length according to the regular expression. For example, the expression "\d{100}" is internally optimized such that if the input string is not 100 characters in length, the engine will report a failure without evaluating the entire regular expression.

Using benchmarks

After you have identified a possible improvement of a regular expression, even if you are certain that it will improve the speed, make a benchmark and compare the results against the previous expression. If the engine was able to internally optimize the previous expression better than the new one, it could lead to unexpected performance penalties.

For instance, the Java regex engine was not able to optimize the expression ".*abc.*". I expected it would search for "abc" in the input string and report a failure very quickly, but it didn't. On the same input string, using String.indexOf("abc") was three times faster than my improved regular expression. It seems that the engine can optimize this expression only when the known string is right at its beginning or at a predetermined position inside it. For example, if I re-write the expression as ".{100}abc.*" the engine will match it more than ten times faster. Why? Because now the mandatory string "abc" is at a known position inside the string (there should be exactly one hundred characters before it).
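A rough comparison of this kind can be sketched as follows. Everything here is an assumption made for illustration: the class RegexBenchmark, its helper names, the input size, and the iteration counts. Timings taken with System.nanoTime() are only indicative, because JIT compilation and garbage collection distort small measurements. A dedicated harness such as JMH gives more trustworthy numbers.

```java
import java.util.function.BooleanSupplier;
import java.util.regex.Pattern;

class RegexBenchmark {
    static final Pattern CONTAINS_ABC = Pattern.compile(".*abc.*");

    // Keeps results observable so the JIT is less likely to discard the work.
    static volatile int sink;

    static boolean viaRegex(String input) {
        return CONTAINS_ABC.matcher(input).matches();
    }

    static boolean viaIndexOf(String input) {
        return input.indexOf("abc") >= 0;
    }

    // Builds an input that almost matches: n 'x' characters followed by "ab".
    static String almostMatching(int n) {
        StringBuilder sb = new StringBuilder(n + 2);
        for (int i = 0; i < n; i++) {
            sb.append('x');
        }
        return sb.append("ab").toString();
    }

    // Runs the task repeatedly and returns the elapsed wall-clock nanoseconds.
    static long timeNanos(BooleanSupplier task, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            if (task.getAsBoolean()) {
                sink++;
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        String input = almostMatching(10_000);
        // Warm-up pass so both variants are measured after JIT compilation.
        timeNanos(() -> viaRegex(input), 1_000);
        timeNanos(() -> viaIndexOf(input), 1_000);
        System.out.println("regex:   " + timeNanos(() -> viaRegex(input), 1_000) + " ns");
        System.out.println("indexOf: " + timeNanos(() -> viaIndexOf(input), 1_000) + " ns");
    }
}
```

The input deliberately almost matches, following the earlier advice to benchmark mostly on strings that fail.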

Whenever you write complex regular expressions, try to find a way to write them such that the regex engine will be able to recognize and optimize for these particular situations. For instance, don't hide mandatory strings inside groupings or alternations because the engine won't be able to recognize them. When possible, it is also helpful to specify the lengths of the input strings that you want to match, as shown in the example above.

Optimizing greedy and reluctant quantifiers

You have some basic ideas of how to optimize your regular expressions, as well as some of the ways you can let the regex engine do the work for you. Now let's talk about optimizing greedy and reluctant quantifiers. A greedy quantifier such as "*" or "+" will first try to match as many characters as possible from an input string, even if this means that the input string will not have sufficient characters left in it to match the rest of the regular expression. If this happens, the greedy quantifier will backtrack, returning characters until an overall match is found or until there are no more characters. A reluctant (or lazy) quantifier, on the other hand, will first try to match as few characters in the input string as possible.

So for example, say you want to optimize a sub-expression like ".*a". If the character a is located near the end of the input string it is better to use the greedy quantifier "*". If the character is located near the beginning of the input string it would be better to use the reluctant quantifier "*?" and change the sub-expression to ".*?a". Generally, I've noticed that the lazy quantifier is a little faster than its greedy counterpart.

Another tip is to be specific when writing a regular expression. Use general sub-constructs like ".*" sparingly because they can backtrack a lot, especially when the rest of the expression can't match the input string. For example, if you want to retrieve everything between two occurrences of the character a in an input string, instead of using "a(.*)a", it's much better to use "a([^a]*)a".
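The following sketch shows what each of the three variants captures on the same input. The class GreedyVsReluctant, the helper between(), and the sample string are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Returns group 1 of the first match, or null when nothing matches.
    static String between(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "xa123a456a";
        // Greedy: runs to the end of the input, then gives characters back.
        System.out.println(between("a(.*)a", input));     // 123a456
        // Reluctant: grows one character at a time until an "a" follows.
        System.out.println(between("a(.*?)a", input));    // 123
        // Negated class: can never run past the next "a".
        System.out.println(between("a([^a]*)a", input));  // 123
    }
}
```

Note that the negated character class captures the text up to the next a, not the last one. It is a drop-in replacement for the greedy form only when the input holds a single pair of a characters, or when the first pair is the one you want.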

Possessive quantifiers and independent grouping

Possessive quantifiers and independent grouping are the most useful operators for optimizing regular expressions. Use them whenever you can to dramatically improve the execution time of your expressions. Possessive quantifiers are denoted by the extra "+" sign, such as in the expression "X?+", "X*+", "X++". The notation for an independent grouping is "(?>X)".

I have successfully used both possessive quantifiers and independent grouping to reduce the execution time of regular expressions from a few minutes to a few seconds. Both operators disable the backtracking behavior of the pattern-matching engine for the group to which they are applied. They will try to match their expression as any greedy quantifier would, but once they have matched it, they will not give back what they have matched, even if this causes the overall regular expression to fail.

The difference between them is subtle. You can see it best by comparing the possessive quantifier "(X)*+" and the independent grouping "(?>X)*". In the former case, the possessive quantifier will disable backtracking for both the X sub-expression and the "*" quantifier. In the latter case, only backtracking for the X sub-expression will be disabled, while the "*" operator, being outside the group, is not affected by the independent grouping and is free to backtrack.
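To make the difference concrete, here is a small sketch with X replaced by the single character a and a trailing a appended to each expression. The class PossessiveVsIndependent and the sample input "aaa" are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

class PossessiveVsIndependent {
    // One-off checks, so the convenience method Pattern.matches() is acceptable here.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String input = "aaa";
        // Plain greedy: "a*" gives back one character so the final "a" can match.
        System.out.println(matches("a*a", input));      // true
        // Possessive: "(a)*+" keeps all three characters, so the final "a" fails.
        System.out.println(matches("(a)*+a", input));   // false
        // Independent group inside the loop: "*" is still free to backtrack.
        System.out.println(matches("(?>a)*a", input));  // true
        // Independent group around the loop: behaves like the possessive form.
        System.out.println(matches("(?>a*)a", input));  // false
    }
}
```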

How would you optimize this regular expression?

Now let's consider an optimization example. Say you're trying to match the sub-expression "[^a]*a" on a long input string containing only the character b repeated many times. This expression will fail because the input string does not contain any instances of the character a. Because the pattern engine doesn't know this, it will try to match the expression "[^a]*". Because "*" is a greedy quantifier, it will grab all the characters until the end of the input string, and then it will backtrack, giving back one character at a time in the search for a match.

The expression will fail only when it can't backtrack anymore, which can take some time. Worse, because the "[^a]*" grabbed all characters that weren't a, even backtracking is useless.

The solution is to change the expression "[^a]*a" to "[^a]*+a" using the possessive quantifier "*+". This new expression fails faster because once it has tried to match all the characters that are not a it doesn't backtrack; instead it fails right there.
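A quick check, sketched below, confirms that the two expressions agree on their results and differ only in how much work a failure costs. The class FailFast, its helpers, and the input length of 5,000 are hypothetical choices.

```java
import java.util.regex.Pattern;

class FailFast {
    static final Pattern GREEDY = Pattern.compile("[^a]*a");
    static final Pattern POSSESSIVE = Pattern.compile("[^a]*+a");

    // Builds a string made of n copies of the character b.
    static String onlyBs(int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append('b');
        }
        return sb.toString();
    }

    static boolean matches(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        String input = onlyBs(5_000);
        // Both fail; the greedy form first gives back every b one at a time.
        System.out.println(matches(GREEDY, input));      // false
        System.out.println(matches(POSSESSIVE, input));  // false
        // Both still succeed when the a is present.
        System.out.println(matches(GREEDY, "bbba"));     // true
        System.out.println(matches(POSSESSIVE, "bbba")); // true
    }
}
```

The rewrite is safe here because "[^a]*" can never consume an a, so nothing it would give back could help the rest of the expression match.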

Lookaround constructs

If you want to write a regular expression that matches any character except some, you could easily write something like "[^abc]*" which means: Match any characters except a or b or c. But what if you wanted it to match strings like "cab" or "cba", but not "abc"?

For this you could use the lookaround constructs. The java.util.regex package has four of them:

  • Positive lookahead: "(?=X)"
  • Negative lookahead: "(?!X)"
  • Positive lookbehind: "(?<=X)"
  • Negative lookbehind: "(?<!X)"

The word positive in this case means that you want the expression to match, while the word negative means that you don't want the expression to match. Lookahead means that you want to search to the right of your current position in the input string. Lookbehind means that you want to search to the left. Remember that the lookaround constructs only peek forward or backward; they don't actually change the current position in the input string. That said, you could use something like "((?!abc).)*" using the negative lookahead operator "?!" to match any sequence of characters but not "abc" in the given order.
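Both points can be tried out in a short sketch. The class LookaroundDemo and its methods are hypothetical, and the second pattern, "a(?=b)", is an assumed example added to show that a lookahead does not consume input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Any sequence of characters in which "abc" never starts.
    static final Pattern NOT_ABC = Pattern.compile("((?!abc).)*");
    // An a that is followed by b, without consuming the b.
    static final Pattern A_BEFORE_B = Pattern.compile("a(?=b)");

    static boolean withoutAbc(String input) {
        return NOT_ABC.matcher(input).matches();
    }

    // Returns the end offset of the first "a(?=b)" match, or -1 if there is none.
    static int endOfAFollowedByB(String input) {
        Matcher m = A_BEFORE_B.matcher(input);
        return m.find() ? m.end() : -1;
    }

    public static void main(String[] args) {
        System.out.println(withoutAbc("cab"));  // true
        System.out.println(withoutAbc("cba"));  // true
        System.out.println(withoutAbc("abc"));  // false
        // The match ends after the a: the lookahead only peeked at the b.
        System.out.println(endOfAFollowedByB("ab"));  // 1
    }
}
```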

Lookarounds in practice

Lookaround constructs help you to be more specific when writing regular expressions, which can have a big effect on matching performance. Listing 1 shows a very common example: using a regular expression to match HTML fields.

Listing 1. Matching HTML fields

Regular expression: "<img.*src=(\S*)/>"
Input string 1: "<img border=1 src=image.jpg />"
Input string 2: "<img src=src=src=src= .... many src= ... src=src="

With the regular expression in Listing 1, the goal is to match the contents of the "src" attribute from an HTML image tag. I deliberately simplified the expression, assuming that there will be no other attributes after "src", so that I could focus on its performance aspects.

Why not be lazy?

You might be thinking that I could have used the reluctant quantifier ".*?" to optimize the regular expression in Listing 1. In fact, "<img.*?src=(.*)/>" would easily match the first-encountered "src=". This solution works for cases where the regular expression matches. If it didn't match the input string, however, it would start to backtrack and would take just as long to match as the greedy quantifier. Remember to always test your regular expressions using non-matching strings first!

The expression is fast enough when matching input string 1, but it takes a very long time to declare failure on input string 2, with the time growing steeply as the input string gets longer. It fails because there is no "/>" at the end of the input string. To optimize this expression, look at the first ".*" construct. It is supposed to match any attributes that come before "src" but is too generic and matches too much. In fact, the construct should only match attributes other than "src".

The rewritten expression "<img((?!src=).)*src=(\S*)/>" will handle a large, non-matching string almost a hundred times faster than the previous one!
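The two expressions can be compared side by side with the sketch below. The class ImgSrcDemo and its helpers are made up for illustration, and manySrc() builds a non-matching input in the style of input string 2. One assumption to note: the matching input is written without a space before "/>", because "\S*" cannot cross whitespace.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcDemo {
    // The expression from Listing 1 and the lookahead-based rewrite.
    static final Pattern GENERIC = Pattern.compile("<img.*src=(\\S*)/>");
    static final Pattern SPECIFIC = Pattern.compile("<img((?!src=).)*src=(\\S*)/>");

    // Returns the requested group of the first match, or null if there is none.
    static String src(Pattern pattern, int group, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group(group) : null;
    }

    // Builds "<img src=src=...src=" with the given number of "src=" repeats.
    static String manySrc(int repeats) {
        StringBuilder sb = new StringBuilder("<img ");
        for (int i = 0; i < repeats; i++) {
            sb.append("src=");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String tag = "<img border=1 src=image.jpg/>";
        System.out.println(src(GENERIC, 1, tag));   // image.jpg
        // The rewrite adds a group, so the src value moves to group 2.
        System.out.println(src(SPECIFIC, 2, tag));  // image.jpg
        // Neither expression matches an input that has no "/>".
        String bad = manySrc(200);
        System.out.println(src(GENERIC, 1, bad));   // null
        System.out.println(src(SPECIFIC, 2, bad));  // null
    }
}
```

Making the first group non-capturing, as in "(?:(?!src=).)*", would keep the src value in group 1 and avoid the small capturing penalty mentioned earlier.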

A note about the StackOverflowError

Sometimes the regex Pattern class will throw a StackOverflowError. This is a manifestation of the known bug #5050507, which has been in the java.util.regex package since Java 1.4. The bug is here to stay because it has "won't fix" status. This error occurs because the Pattern class compiles a regular expression into a small program which is then executed to find a match. This program is used recursively, and when too many recursive calls are made the error occurs. See the description of the bug for more details. It seems to be triggered mostly by the use of alternations.

If you encounter this error, try to rewrite the regular expression or split it into several sub-expressions and run them separately. The latter technique can also sometimes even increase performance.
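One possible rewrite, sketched here as an assumption and not as the fix recommended in the bug report, is to replace an alternation between single characters with a character class. The class AlternationRewrite and the expression "(?:a|b)*" are hypothetical examples. Whether the alternation form actually overflows depends on the JVM stack size and the Java version, so the sketch catches the error without relying on it.

```java
import java.util.regex.Pattern;

class AlternationRewrite {
    // Repeated alternation: the kind of expression that may recurse deeply.
    static final Pattern ALTERNATION = Pattern.compile("(?:a|b)*");
    // The same language expressed as a character class.
    static final Pattern CHAR_CLASS = Pattern.compile("[ab]*");

    // Builds "abab...ab" with the given number of pairs.
    static String repeatAb(int pairs) {
        StringBuilder sb = new StringBuilder(pairs * 2);
        for (int i = 0; i < pairs; i++) {
            sb.append("ab");
        }
        return sb.toString();
    }

    static boolean matchesWithCharClass(String input) {
        return CHAR_CLASS.matcher(input).matches();
    }

    public static void main(String[] args) {
        String input = repeatAb(100_000);
        try {
            System.out.println("alternation: " + ALTERNATION.matcher(input).matches());
        } catch (StackOverflowError e) {
            // May or may not happen, depending on the JVM and its stack size.
            System.out.println("alternation: StackOverflowError");
        }
        System.out.println("char class: " + matchesWithCharClass(input));
    }
}
```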

In conclusion

Regular expressions shouldn't take hours to match, especially for applications that only have seconds to spare. In this article I've introduced some of the weak points of the java.util.regex package and shown you how to work around them. Simple bottlenecks like backtracking just require a little finesse whereas culprits like greedy and reluctant quantifiers require more careful consideration. In some cases you can replace them completely, in others you simply have to "lookaround" them. Either way, you've learned some good tricks for coaxing speed out of your regular expressions.

Let me know what you think about the workarounds I've proposed, and be sure to share your optimizing tips with other JavaWorld readers in the <a href="http://www.javaworld.com/javaforums/newpost.php?Cat=0&Board=112069">discussion thread about optimizing regular expressions in Java</a>.

Cristian Mocanu is a Java team leader at 1&1 Internet AG, Romania. He is a Sun Certified Programmer, Business Component Developer, and Architect with more than five years of experience working with enterprise Java.

Learn more about this topic
