I regularly use regular expressions as a declarative means for string validation or parsing.
However, regular expressions and matching can become unreadable quickly.
Luckily there is a simple way to improve readability of both the regex itself and the code handling its matches.
Let’s say we want to parse some function and parameters from a string according to a custom defined syntax.
The syntax says that the function for division is called Div
and that the first parameter is the dividend and the second parameter is the divisor.
To express a division of 4 by 2, the string would have to look like this: Div(4, 2).
We would like to check if that string in fact represents a Div function and then extract the two arguments from it.
We choose to implement this with a regular expression ^Div\(([0-9]+?), ([0-9]+?)\)$ (the literal parentheses of the call are escaped), match the string against it and receive a Match object.
How do we get the 4 and the 2?
In this case match.Groups[1] contains 4 and match.Groups[2] contains 2, whereas match.Groups[0] contains the whole string.
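To make the numbering concrete, here is a minimal sketch that prints every numbered group for the example input. It assumes C# 9 or later (top-level statements and target-typed new); the loop is only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Same pattern as in the text; the parentheses of the Div call are escaped.
Regex divRegex = new(@"^Div\(([0-9]+?), ([0-9]+?)\)$");
Match match = divRegex.Match("Div(4, 2)");

// Groups[0] is the whole match; the numbered capture groups start at index 1.
for (int i = 0; i < match.Groups.Count; i++)
{
    Console.WriteLine($"Groups[{i}] = {match.Groups[i].Value}");
}
// Prints:
// Groups[0] = Div(4, 2)
// Groups[1] = 4
// Groups[2] = 2
```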
The code looks something like this:
    Regex divRegex = new(@"^Div\(([0-9]+?), ([0-9]+?)\)$");
    if (divRegex.Match("Div(4, 2)") is { Success: true, Groups: var groups })
    {
        string dividendStr = groups[1].Value;
        string divisorStr = groups[2].Value;
    }
    else
    {
        throw new ArgumentException("string was not a valid Div function syntax");
    }
The solution so far features a Regex where we have to squint to see the two parameters and still do not know which is which.
We have to consult the code that handles the Match.
We can see from the variable names which is the dividend and which is the divisor, but we still have to remember how Match indexes work and then mentally subtract one from the group index to get the parameter index in the Div function.
That is a lot of work for the reader.
It is also not obvious whether the above code is correct or not.
Both can be improved with a small change: named capture groups.
From what I can see from the regexes on the internet, few people use this feature and the documentation lumps it together with everything else related to groups.
Instead of just writing (.*), you can write (?<MyGroup>.*) to capture a named group and then have it available as match.Groups["MyGroup"]; no futzing with indices required.
The code above can thus be transformed into this:
    Regex divRegex = new(@"^Div\((?<Dividend>[0-9]+?), (?<Divisor>[0-9]+?)\)$");
    if (divRegex.Match("Div(4, 2)") is { Success: true, Groups: var groups })
    {
        string dividendStr = groups["Dividend"].Value;
        string divisorStr = groups["Divisor"].Value;
    }
    else
    {
        throw new ArgumentException("string was not a valid Div function syntax");
    }
We can immediately see from the Regex alone which parameter is which, without even looking at the matching code.
This is an important improvement, because the regex is now a self-contained specification of the function syntax (a single source of truth, if you will).
The reader only has to read the declarative regex, instead of having to resort to reading imperative code.
The matching code benefits, too.
It is now very obvious that every group was read into a corresponding variable and the code is thus correct.
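As a rough sketch of how this might look when wrapped up for reuse, the snippet below puts the named-group regex into a small helper. ParseDiv and its tuple return type are hypothetical names introduced here for illustration, and the sketch assumes C# 9 or later and arguments small enough to fit into an int.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of the original code): parses "Div(a, b)" into two integers.
(int Dividend, int Divisor) ParseDiv(string input)
{
    Regex divRegex = new(@"^Div\((?<Dividend>[0-9]+?), (?<Divisor>[0-9]+?)\)$");
    if (divRegex.Match(input) is { Success: true, Groups: var groups })
    {
        // The group names in the regex line up with the tuple element names.
        // Assumption: the digits fit into an int; int.Parse throws OverflowException otherwise.
        return (int.Parse(groups["Dividend"].Value), int.Parse(groups["Divisor"].Value));
    }
    throw new ArgumentException("string was not a valid Div function syntax");
}

var (dividend, divisor) = ParseDiv("Div(4, 2)");
Console.WriteLine(dividend / divisor); // prints 2
```

Because the names appear both in the pattern and at the point of use, swapping the two parameters by accident would be visible in a single line rather than hidden behind indices.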
In conclusion: type a bit more to name your capture groups when using Regex. It will go a long way toward making your code more maintainable and your team more productive.