Investigating a performance issue with a regex

 
 
  • Gérald Barré

This post is part of the series 'Crash investigations and code reviews'. Be sure to check out the rest of the blog posts of the series!

Regexes are very useful for extracting information from a string. In the following example, the regex extracts a name and a version from a string such as WhatEverReference("abc", "1.0.0"). The string can contain multiple references anywhere in it, and we need to get all name-version pairs contained in the string.

C#
using System;
using System.Text.RegularExpressions;

internal static class Program
{
    // Extracts NAME and VERSION from calls such as WhatEverReference("abc", "1.0.0")
    private static readonly Regex regex = new Regex(
        @".*Reference\s*\(\s*[$@]*?""(?<NAME>[\w\.-]+?)""\s*,\s*[$@]*""(?<VERSION>[\w\.-]+?)"".*",
        RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture,
        TimeSpan.FromSeconds(2));

    private static void Main(string[] args)
    {
        var str = @"
This is a reference WhatEverReference(""abc"", ""1.0.0"")
This is another one Reference ( ""def"", ""2.0.0"" )";

        // Print every name-version pair found in the input
        foreach (Match match in regex.Matches(str))
        {
            Console.WriteLine($"{match.Groups["NAME"].Value}@{match.Groups["VERSION"].Value}");
        }
    }
}

The regex is valid and captures the expected values. The problem is that regex evaluation is slow, taking a few milliseconds on a 20 kB string. In our case, we may need to scan a few hundred files while the user is waiting for results in a GUI application, so we cannot afford to wait several seconds.

We only need the data from the named groups; the full matched string (i.e. Match.Value) is not useful. However, the leading and trailing .* make every match span the whole line, which is unnecessary work for the regex engine. The output below illustrates this:

C#
var str = @"
This is a reference WhatEverReference(""abc"", ""1.0.0"")
This is another one Reference ( ""def"", ""2.0.0"" )";

foreach (Match match in regex.Matches(str))
{
    Console.WriteLine(match.Value);
}

// Output:
// This is a reference WhatEverReference("abc", "1.0.0")
// This is another one Reference ( "def", "2.0.0" )

The solution is to remove the leading and trailing .* from the regex, so the engine only matches what is needed:

C#
// Before: with the leading and trailing ".*"
new Regex(@".*Reference\s*\(\s*[$@]*?""(?<NAME>[\w\.-]+?)""\s*,\s*[$@]*""(?<VERSION>[\w\.-]+?)"".*", RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture, TimeSpan.FromSeconds(2));
// After: without the leading and trailing ".*"
new Regex(@"Reference\s*\(\s*[$@]*?""(?<NAME>[\w\.-]+?)""\s*,\s*[$@]*""(?<VERSION>[\w\.-]+?)""", RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture, TimeSpan.FromSeconds(2));
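Before swapping patterns, it is worth checking that the named groups still return the same values. The following is a minimal sketch of such a check, not code from the original project: the ReferencePatterns class and its Extract helper are hypothetical names introduced here, and the one-line input is made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ReferencePatterns
{
    private const RegexOptions Options =
        RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture;

    // Pattern from the post, with the leading and trailing ".*"
    public static readonly Regex WithWildcards = new Regex(
        @".*Reference\s*\(\s*[$@]*?""(?<NAME>[\w\.-]+?)""\s*,\s*[$@]*""(?<VERSION>[\w\.-]+?)"".*",
        Options, TimeSpan.FromSeconds(2));

    // Same pattern without the leading and trailing ".*"
    public static readonly Regex Trimmed = new Regex(
        @"Reference\s*\(\s*[$@]*?""(?<NAME>[\w\.-]+?)""\s*,\s*[$@]*""(?<VERSION>[\w\.-]+?)""",
        Options, TimeSpan.FromSeconds(2));

    // Hypothetical helper: returns "name@version" for every match, in order.
    public static string[] Extract(Regex regex, string input) =>
        regex.Matches(input)
             .Cast<Match>()
             .Select(m => $"{m.Groups["NAME"].Value}@{m.Groups["VERSION"].Value}")
             .ToArray();

    public static void Main()
    {
        // Sample input from the post: one reference per line
        var sample = @"
This is a reference WhatEverReference(""abc"", ""1.0.0"")
This is another one Reference ( ""def"", ""2.0.0"" )";

        Console.WriteLine(string.Join(", ", Extract(WithWildcards, sample))); // abc@1.0.0, def@2.0.0
        Console.WriteLine(string.Join(", ", Extract(Trimmed, sample)));       // abc@1.0.0, def@2.0.0

        // Made-up input: two references on the same line
        var oneLine = @"Reference(""a"", ""1.0"") Reference(""b"", ""2.0"")";

        Console.WriteLine(string.Join(", ", Extract(WithWildcards, oneLine))); // b@2.0
        Console.WriteLine(string.Join(", ", Extract(Trimmed, oneLine)));       // a@1.0, b@2.0
    }
}
```

On the sample from the post, both patterns return the same pairs. When two references share a line, the greedy leading .* swallows the first one, so only the trimmed pattern returns both pairs. The trimmed pattern therefore also seems closer to the stated goal of finding every reference in the string.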

On a 20 kB file, the difference is dramatic. In a .NET Core 3.1 application, removing those 4 characters improves performance by 1000x. The recent .NET 5 regex performance improvements also help mitigate the impact of inefficient regex patterns.
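To get a rough feel for the difference on your own machine, a Stopwatch comparison like the one below can help. This is only a sketch under stated assumptions: the RegexTiming class, the BuildInput and Measure helpers, and the synthetic input are all hypothetical and do not come from the original measurements, and a single Stopwatch run is much less reliable than a dedicated benchmarking tool such as BenchmarkDotNet.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

public static class RegexTiming
{
    // Builds a synthetic input of at least `approximateLength` characters
    // (an assumption: the real files scanned in the post are not shown).
    public static string BuildInput(int approximateLength)
    {
        var sb = new StringBuilder();
        var i = 0;
        while (sb.Length < approximateLength)
        {
            sb.Append("Some unrelated text that does not contain a match. ");
            if (++i % 10 == 0)
            {
                // End every tenth sentence with a reference and a line break
                sb.Append($"WhatEverReference(\"pkg{i}\", \"1.0.{i}\")\n");
            }
        }
        return sb.ToString();
    }

    // Runs the pattern once over the input and returns the match count and elapsed time.
    public static (int Matches, TimeSpan Elapsed) Measure(string pattern, string input)
    {
        var regex = new Regex(
            pattern,
            RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture,
            TimeSpan.FromSeconds(30));

        var stopwatch = Stopwatch.StartNew();
        var count = regex.Matches(input).Count; // Count forces all matches to be evaluated
        stopwatch.Stop();
        return (count, stopwatch.Elapsed);
    }

    public static void Main()
    {
        const string core =
            @"Reference\s*\(\s*[$@]*?""(?<NAME>[\w\.-]+?)""\s*,\s*[$@]*""(?<VERSION>[\w\.-]+?)""";
        var input = BuildInput(20_000);

        // Warm-up run so that one-time regex construction costs affect both measurements less
        Measure(core, input);

        var slow = Measure(".*" + core + ".*", input);
        var fast = Measure(core, input);

        Console.WriteLine($"With .*:    {slow.Matches} matches in {slow.Elapsed.TotalMilliseconds} ms");
        Console.WriteLine($"Without .*: {fast.Matches} matches in {fast.Elapsed.TotalMilliseconds} ms");
    }
}
```

The absolute numbers depend on the runtime version, the regex options, and the shape of the input (long lines without any match are the worst case for the leading .*), so treat the output as an indication only.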

In conclusion, make sure to capture only what you need in your regex!
