Regexps part 4; finishing the tutorial.

Right, I hope you guys didn’t think I was only going to do a small tribute entry today, did you?

So, it has been a while since I posted parts one, two and three, and you may want to check those out before reading this part. As promised in the previous part I’ll explain some of the “special matching characters”, go a little deeper into using substitutions, and go over the regexp example patches which can be found on the regexp reference page as well as the ones in the regexp “help patch” (alt-click on the regexp object to see this one).

Recap

Regexps are basically search templates made up of several characters which together can represent certain text. We divide these characters into “matching characters” and “control characters” (not official naming):

Matching characters I explained / mentioned so far:

  • The caret sign ‘^’ represents the start of a line of text.
  • The dollar sign ‘$’ represents the end of a line of text.
  • The dot ‘.’ matches any kind of single character.

Control characters I explained so far:

  • The asterisk ‘*’ indicates that the preceding character may appear any number of times (0 or more).
  • The question mark ‘?’ indicates that the preceding character is optional (appears 0 or 1 time).
  • A “pipe” character ‘|’ indicates an alternative. The regexp should match either everything before it OR everything after it.
  • The round brackets ‘()’ indicate that we’re grouping or nesting. We’re actually making a regexp inside the main regexp.
  • The square brackets ‘[]’ indicate that we’re using a collection of characters; either by defining a range or by using a group. All collection members are matched individually.
  • The curly brackets ‘{}’ specify a count: the number of times the preceding character needs to occur.
  • The backslash ‘\’ makes the parser (“parse master”) treat the following character literally; in other words the character ‘escapes’ parsing and as such is picked up ‘as is’.
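To make the recap a bit more concrete, here is a minimal sketch of these characters at work. Mind that this is Python and not Max: Python’s “re” module understands the same basic characters, but the sample words are made up for this sketch, and in Python a raw string (r"...") only needs one backslash.

```python
import re

# Sample words made up for this sketch.
lines = ["color", "colour", "colouur", "cat", "dog", "a.b", "axb"]

# '?' makes the preceding character optional (0 or 1 time).
optional_u = [s for s in lines if re.search(r"^colou?r$", s)]

# '*' allows the preceding character 0 or more times.
star_u = [s for s in lines if re.search(r"^colou*r$", s)]

# '{2}' is an exact count: the 'u' must occur exactly twice.
double_u = [s for s in lines if re.search(r"^colou{2}r$", s)]

# '|' means OR, '()' groups the alternatives, '^' and '$' anchor the line.
cat_or_dog = [s for s in lines if re.search(r"^(cat|dog)$", s)]

# '\.' is a literal dot, while a bare '.' matches any single character.
literal_dot = [s for s in lines if re.search(r"^a\.b$", s)]
any_char = [s for s in lines if re.search(r"^a.b$", s)]

print(optional_u, star_u, double_u, cat_or_dog, literal_dot, any_char)
```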

Advanced matching

So far I’ve shown you guys how you can match text by setting up a regexp (‘template’) which either searches for specific matches or those defined by certain properties such as an x amount of characters at a certain location, or a word at a specific place which needs to have a certain letter or number in it.

But what if all we knew about a certain text snippet is that it will consist of 3 letters and 2 numbers, in that order, but we don’t know what it’ll be all about? Yet we’d still want to get all occurrences which meet this description (so 3 letters followed by 2 numbers). From what I told you so far this seems to be impossible, but fortunately for us it’s not. I mentioned it before; regexps are very extensive and versatile, which is also the reason I use them very often.

Apart from all the “control characters” I explained so far there are a number of special cases. So far a “control character” was basically an ordinary character like a question mark or an asterisk: things you might also find in common sentences. Yet, as is common in certain environments, there are also a couple of ‘special additions’ to the set. Keep in mind though that although everything I explained so far can easily be used in environments other than Max or Max for Live, these special additions may very well be different (or even expanded) elsewhere.

Differentiate between numbers and letters

Example of using extended control characters.

  • \\d – Matches a decimal digit (0 to 9).
  • \\D – Matches any character which is not a decimal digit.
  • \\w – Matches an alphanumeric character (a letter, a digit or the underscore).
  • \\W – Matches a non-alphanumeric character.
  • \\s – Matches a white space character.
  • \\S – Matches a non-white space character.

And there you have it. Basically the lowercase version matches a certain group of characters whereas the UPPER case version negates that match.

So where \\d will match a decimal character, \\D will match everything but a decimal character. So basically it is a way to rule out the occurrence of any decimal characters.

The reason why these control characters are preceded by two backslashes should be obvious by now: Max will erase a single backslash by default. So you need two, where the first basically ‘escapes’ the second backslash (as explained in part 3).
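The “3 letters followed by 2 numbers” question can be sketched like this. It’s Python rather than a Max patch, so a raw string needs only a single backslash, and both the sentence and the exact pattern are assumptions of mine (the screenshot may well use a different regexp).

```python
import re

# Hypothetical sentence for this sketch.
text = "I own an APC40, but not the APC20 or APC60."

# [A-Za-z]{3} = exactly three letters, \d{2} = exactly two decimal digits.
pattern = r"[A-Za-z]{3}\d{2}"

matches = re.findall(pattern, text)

# \d and \D are each other's opposite.
digits = re.findall(r"\d", "APC40")      # only the decimal digits
non_digits = re.findall(r"\D", "APC40")  # everything but the decimal digits

print(matches, digits, non_digits)
```

Python simply returns all three words in one list here; the comma in the sentence is treated like any other character.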

Beware of oddities!

When using regexps you should always be wary of unexpected behavior. For example, in the screenshot above you see 3 words which match the regexp, yet only 2 are shown in the message box. What happened to the first one?

I assure you that it was found. But due to the comma (“,”) in that sentence Max somehow cut the line in two parts and treated the comma as if it was the end of the first line. So now we end up with getting 2 lists out of the middle outlet. First one is “APC40” which is then followed by “APC20 APC60”. You can see as much when using the ‘print’ object to see all the generated output.

Advanced substitutions

Last time I showed you how to use the @substitute attribute to replace the text matched by the regexp with something else. So looking at the example above: if I wanted to replace any occurrence of “APC60” with “APC40” I could simply use “regexp APC60 @substitute APC40”.
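For comparison, the same plain substitution as a small Python sketch. The sentence is made up, and re.sub() is Python’s counterpart here, not something Max itself uses.

```python
import re

# Hypothetical sentence for this sketch.
text = "I own an APC20 and an APC60."

# Replace every occurrence of the literal text "APC60" with "APC40".
result = re.sub(r"APC60", "APC40", text)

print(result)
```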

But what if I only want to replace part of a word instead of replacing the whole word?

This is where grouping comes in handy. I showed you that by using brackets ‘()’ you could group or nest a regexp so that you could basically perform 2 matches within one regexp. When you’re using substitutions however you can also refer to these groups which allows you to perform very specific replacements or even text movements if you’d want to.

Example on using grouping with substitution.

Here you can see the grouping in action. My regexp consists of 2 groups; one matching Synth and one matching Fan. You can immediately see the difference in behavior when looking at outlets 2 (back references) and 3 (substrings). Outlet 2 got a list which consists of 2 words whereas outlet 3 found one single word.

As you can see we are looking for “SynthFan” but in 2 parts. When you define your substitution string you can actually use “%x” variables which will match the group number in the regexp. So as shown above I’m looking for “Synth” and “Fan” occurring right after each other. I then want to replace the found string by the first group, followed by a space which is then followed by the word “user”. The space is also the reason why I needed to use quotes in my substitution string.

So… The regexp found “SynthFan” and due to the @substitute attribute it gets ready to replace the word. The first thing in my substitution string is %1, which refers to the first group. This is the word “Synth”. As such the first part of the replacement is “Synth”. Then there’s a space and the word ‘user’. These get appended to the replacement string and thus we end up replacing “SynthFan” with “Synth user”.
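Here is roughly the same idea as a Python sketch. Two assumptions to keep in mind: the sentence is made up, and where Max writes %1 for the first group, Python’s re.sub() writes \1.

```python
import re

# Hypothetical sentence for this sketch.
text = "My alias is SynthFan."

# Two groups: one matching "Synth", one matching "Fan".
pattern = r"(Synth)(Fan)"

m = re.search(pattern, text)
groups = m.groups()     # the two groups, comparable to the back references
whole_match = m.group(0)  # the entire match, comparable to the substring

# \1 puts the first group back, followed by a space and the word "user".
result = re.sub(pattern, r"\1 user", text)

print(groups, whole_match, result)
```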

Oh dear..

I just thought of something. I have the sentence all wrong! This was supposed to go to Synthtopia to explain that my alias there (“SynthFan”) is actually ShelLuser which is the alias I use here and on the Ableton forums. But not the other way around since they don’t know ShelLuser.

Easily solved with a quick regexp:

Example on how to swap words using regexps.

Yes; the ‘a’ is a bit awkward in this sentence but I only noticed that after I set up, copied and pasted the above picture in this post. So please ignore the somewhat strange sentence and focus on the example of swapping 2 words in the same sentence.

What is happening here? Well, I set up my regexp to match the literal word “ShelLuser”, which is then followed by any number of characters (possibly none). After those characters I match the literal word “SynthFan”. As you can see, the key point here is that I’ve grouped all three parts.

So now all I basically have to do is turn them around. Always keep in mind that you’re not really replacing parts of a match; you’re always replacing the entire match found by the regexp. But using group referrals allows you to “put back” certain matches into the substitution. And that leaves us with the result above. The sentence is completely turned around: first the 3rd group is placed, which matched “SynthFan”. Then the second group is placed, which matched everything between the words “ShelLuser” and “SynthFan”, so the rest of the sentence. And finally “ShelLuser” is added, thus forming the sentence you can see coming out of the first outlet.
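The swap can be sketched in Python as well. The sentence below is my own stand-in (not the one from the picture), and \3\2\1 is Python’s spelling of what Max writes as %3%2%1.

```python
import re

# Hypothetical sentence for this sketch.
text = "ShelLuser is also known as SynthFan"

# Three groups: the first name, everything in between, the second name.
pattern = r"(ShelLuser)(.*)(SynthFan)"

# The whole match is replaced, but the groups are put back in reverse order.
swapped = re.sub(pattern, r"\3\2\1", text)

print(swapped)
```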

And there you have it!

This concludes my 4 part tutorial on regular expressions. I think I have covered everything which I mentioned but left untouched in the previous parts. All but one thing, that is: I also promised to go over the examples from the regexp refpage. If you do spot something I missed please let me know in the comment section.

Examples explained

First Max regexps example.

This one is actually quite simple, and I hope that by following my tutorial you too come to this same conclusion now. Oh; perhaps needless to say but I added the message object so that the output is visible.

First we have a collection which consists of the characters ‘f’ and ‘p’. Remember; all members of a collection are matched individually. So we’re looking for a word which begins with the letter f or p. Then we need to match the literal text “la”. Then another collection is defined which contains the special control character \\w, which matches any alphanumeric character. The * behind that specifies that the previous collection might occur 0 or more times. As someone mentioned by e-mail, the last collection wasn’t strictly necessary; they could also have used “[fp]la\\w*” instead. But you have to admit that using a new collection makes reading this regexp a whole lot easier.
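A quick way to convince yourself that both spellings behave the same is a small Python sketch. The test words are made up, and Python needs only one backslash inside a raw string.

```python
import re

# Hypothetical test words for this sketch.
text = "flag plate slab fla"

# With the extra collection around \w, as in the refpage example.
with_collection = re.findall(r"[fp]la[\w]*", text)

# Without the collection, as suggested by e-mail.
without_collection = re.findall(r"[fp]la\w*", text)

print(with_collection, without_collection)
```

Note that “slab” is skipped: it contains “la”, but not preceded by an f or a p.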

Number 2:

How to split a word.

This one should now be easily debugged as well. I think the unpack object might be more interesting than the regexp 😉  And yes, before I continue; this regexp could have been trimmed down as well. The use of collections in here wasn’t really necessary and quite frankly I think that in this case it makes the regexp harder to read.

But… the regexp consists of 3 groups. The first matches any single alphanumeric character and so does the second group. The third however matches 0 or more decimal digits. So basically the original string “fb003” is split apart into ‘f’, ‘b’ and ‘003’, which can be seen in the message object I added. Because the last part is treated as a number, the leading zeros were removed.
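Sketched in Python, the split looks like this. The pattern is my reconstruction from the description above (three groups, each using a collection), so treat it as an assumption and not as the literal refpage regexp.

```python
import re

# Assumed reconstruction: two single alphanumeric characters, then 0 or more digits.
pattern = r"([\w])([\w])([\d]*)"

m = re.match(pattern, "fb003")
first, second, third = m.groups()

# Converting the third part to a number is what drops the leading zeros.
as_number = int(third)

print(first, second, third, as_number)
```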

Number 3:

Max example on using substitutions

Another easy one…  Some people got confused because of the order which was given but there is no need for that. Nothing changes in the regexp so for the outcome it doesn’t really matter which message box you click first.

Ok, so we have a regexp which consists of 2 groups. The first matches either the word “paint” or “frame”. This word is then followed by a second group which matches either the word “rect” or “oval”. As such the regexp can match words like paintoval, paintrect, frameoval and framerect. Then a substitution string is defined which consists of the word “paint” followed by the second group.

Now although it appears as if nothing happened with the first message box, that is not the case. The first part of the word (“paint”) matches the first group of the regexp. The second part (“oval”) matches the second group. So, what is happening here is that “paintoval” gets replaced by the word “paint”, which is then followed by the second group of the regexp, which matched “oval”. So we’re replacing a word with the exact same word. But don’t be fooled; a replacement has been made here.

The second message should be easy as well now. “frame” matches the first group, “rect” matches the second group, and so the whole word is replaced by the literal word “paint” which is then followed by group 2. In this case that group matched “rect”, so we get “paintrect”.
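The same substitution, sketched in Python with all four possible words. Again \2 is Python’s way of writing the second group reference.

```python
import re

# Group 1: "paint" or "frame". Group 2: "rect" or "oval".
pattern = r"(paint|frame)(rect|oval)"

words = ["paintoval", "paintrect", "frameoval", "framerect"]

# Replace the whole match with the literal word "paint" followed by group 2.
results = [re.sub(pattern, r"paint\2", w) for w in words]

print(results)
```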

I think this about covers it. Apart from one last example which can be found in a sub-help patch. I’m including this one because several people asked me about it by e-mail:

A weird use of regexp.

The question people asked me: “What is that i: doing there?”.

Answer: causing trouble. This appears to be an undocumented feature and I suspect that it’s not merely “i:” but “?i:” which we need to consider. One effect is that it keeps the output out of the second outlet (‘back references’). The other is that the regexp matches both lower and UPPER case; see the match with “toto.WAV”. This match doesn’t occur once you remove the “?i:”.

And this effect can be reproduced:

Using an undocumented feature with regexp.

But quite frankly I’d recommend not using this particular setup due to its strange nature. That, and because it’s an undocumented feature, which means there’s no telling if it’s still there when Max 6 comes out.
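For what it’s worth, both effects line up with how Perl-style regular expressions treat “(?i:…)”: a group which is matched case-insensitively and which doesn’t produce a back reference. Whether Max means exactly that is an assumption on my part, but Python (3.6 and up) shows the same two effects. The pattern below is a made-up stand-in for the one in the help patch.

```python
import re

# (?i:wav) = match "wav" in any case, without capturing it as a group.
insensitive = re.search(r"toto\.(?i:wav)", "toto.WAV")
matched_text = insensitive.group(0)
captured = insensitive.groups()  # no groups captured

# An ordinary group is case-sensitive and does capture.
sensitive_upper = re.search(r"toto\.(wav)", "toto.WAV")
sensitive_lower = re.search(r"toto\.(wav)", "toto.wav")

print(matched_text, captured, sensitive_upper, sensitive_lower.groups())
```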

 

And there you have it. This concludes my tutorial on regular expressions in Max / Max for Live. I hope you had fun following it and found it useful.

If you have any more questions about this or previous parts please don’t hesitate to ask them using the comment sections.


One comment

  1. Great post!

    I see a typo on advanced substing; you have 2 times apc60 while one should be 40.

    And thanks to your posts I can solve your problem with the a too! LOL

    regexp (ShelLuser)(.*)a\\s(SynthFan) @substitute %3%2%1

    What I do not understand is the dot at the end. That should not be there I think?
