? Matched Expression Does Not Print in Perl

I decided to name this post after the Stack Overflow question I posted, because it shows how a simple question led me down the rabbit hole that is regular expressions. For those who are familiar with Perl and regexes, this might be trivial; for those who are not, this might be boring. I don't think there is any position in between. Anyway, let's get started.

I had this question when I was trying to use Perl to filter text, mainly in the form

printf "Some text\n" | perl -pe 's/text/fun/'

to get

Some fun

I should mention up front that I was using Perl as part of a larger Bash script, not writing a standalone Perl script.

So, we have a string of text

This is a string.

However, sometimes we get

This is a test string.

We want to match

This string,

but when the second string appears, we want to match the test word as well. So, the regular expression has to match

This (test) string

Sounds straightforward, right?

How do we set out to print an optionally matched substring?

In our case, the substring is test, and it is optionally matched. Prior to this problem, I was using this

s/.*(This).*(string).*/$1 $2/

I "selected" the matched substrings as groups, and then printed only those substrings, leaving the rest behind. It worked. To make a group match zero or one time, we use the ? operator. This was exactly my case with the test substring, so naturally, it made sense to do this, right?

s/.*(This).*(test)?.*(string).*/$1 $2 $3/

Except that it didn't work.

The above expression gave me

This  string

A sad, double-spaced emptiness in between the first and second substring. Clearly something went wrong. I will skip the trial and error and walk you through how we end up with the final result.

One of the issues with my expression was that it used .* everywhere. In a regular expression, .* matches any character (except a newline) zero or more times, and it does so greedily. Regex engines match by a process known as backtracking: a greedy quantifier swallows as much of the input as it can, then gives it back character by character until the next part of the pattern can match. In my expression, the .* after (This) swallows the rest of the line, so there is nothing left in the input. Since the next part, (test)? with its question mark, is optional, matching nothing is acceptable, and Perl goes on its merry way to the .*(string).* part. Here the match is mandatory, so the engine backtracks from the end of the line, giving characters back only until string can match. By then test has already been swallowed by the .* after (This), and the optional group stays empty.

So, we use non-greedy matching then, right? We replace .* with .*? and everything will be fine. Turns out no. We still get the same string

This  string

This is because a non-greedy match takes the shortest string it can, which here is nothing. The (test)? group then tries to match right after This, where the input reads " is a test string." rather than test, but that is fine because the group is optional, so it matches nothing as well. The last .*? then stretches just far enough to reach string, swallowing test along the way. In the end, this does not work either. It turned out my expression needed a rewrite, not a small fix. This is the solution I learnt from the amazing folks over at Stack Overflow.

s/.*?(This)(?:(?!test).)*(test)?.*?(string).*/$1 $2 $3/

I hope I have not bored you by this point, because there is a lot to decipher from this solution. The first part is

(?!test).

(?! ... ) is a negative lookahead. The match fails at the current position if test comes next in the input, and the lookahead itself consumes nothing. The . after the lookahead is a single-character wildcard. Combining with the outer parentheses,

(?: ... )*

we get a very interesting interaction. (?: ... ) is a non-capturing group: it groups the pattern without saving what it matched into $1, $2 and so on. The * means the group can repeat zero or more times. Combining all three pieces into

(?:(?!test).)*

we get the following: consume characters one at a time, as long as the pattern test does not start at the current position, without capturing any of them. When test is found in the input, the group stops right in front of it, leaving the pattern intact as the next thing to be matched. The next part of the expression is our (test)? group, which matches the perfectly placed pattern! Our optional capturing group finally works, and the rest is history. We finally get our highly anticipated

This test string

Thank you for sticking around until the end. I learnt a great deal about how regex engines behave, and about many more features of regular expressions, from this small question. Was diving into the rabbit hole worth it? Hell yes. Thanks again for reading.

P.S. I also learnt a lot about HTML for this post. All hail divs.


"How lucky am I to have something that makes saying goodbye so hard."
- Winnie the Pooh