Regexps part 3; processing text in Max/M4L

Welcome to part 3 of my regexp tutorial.

I know it’s WAY overdue, sorry for the major delay. That’s one of the caveats of being self employed. One moment you have some time to spare; the next you’re working your butt off without much free time left.

In this part I’ll be tying up some loose ends from the previous parts. After that we’ll dive straight into Max/M4L to take a closer look at the inner workings of the regexp object, and check out some of the regexp extensions which are specific to Max and M4L.

It won’t be the last part. By popular demand I’ll go over some of the examples in the regexp help patch and reference page and explain their results in my next post. Some of you asked me to do this because, even though you had a good hunch, you would still like to know for sure why certain patches react the way they do.

Mail

I got an e-mail from someone who told me that some of the things I’ve mentioned and explained so far didn’t match the regexp reference page. For example, he wondered about my explanation of the asterisk ‘*’, since the reference page says that the asterisk means it matches 0 times. I noticed that too, and I somewhat recognize the statement from the Perl regexp examples (the ‘manual page’). However, I don’t know why they put it in because it’s simply not entirely correct, as can be seen here:

Showing an issue with the asterisk.

As you can see here I’m looking for “shel*”. According to the official example the asterisk should mean that it can match 0 times. In the sentence I’ve used here the word “shel” is present (so the ‘l’ matches 1 time) yet the word “she” is also present (so the ‘l’ matches 0 times); and as you can see both match. In other words; the asterisk stands for matching 0 or more times, not merely 0 times. I do stand corrected myself here since I was under the false impression that it was 1 or more times.
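For those who’d like to check this outside of Max, here is a minimal sketch in Python. Keep in mind that Python’s “re” module is a different regexp engine than the one inside the Max regexp object, and that the sentence is made up for this example (it is not the one from the screenshot). The asterisk itself should behave the same way, though.

```python
import re

# Made-up test sentence; it contains "she", "shel" and "shell".
sentence = "she sells a shel on the shell"

# "shel*" means: "she" followed by zero or more "l" characters.
matches = re.findall(r"shel*", sentence)
print(matches)  # ['she', 'shel', 'shell']
```

So the ‘l’ is matched 0 times (“she”), 1 time (“shel”) and 2 times (“shell”).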

Thanks for the comments, I appreciate the feedback!

Recap

First an overview of the aspects I’ve addressed so far in parts one and two with the correct ‘*’ meaning:

Regexps are basically search templates which contain several characters which together can represent certain text. We divide these characters into “matching characters” and “control characters” (not official naming):

Matching characters I explained / mentioned so far:

  • The caret sign ‘^’ represents the start of a line of text.
  • The dollar sign ‘$’ represents the end of a line of text.
  • The dot ‘.’ matches any kind of single character.

Control characters I explained so far:

  • The asterisk ‘*’ indicates that the preceding character might appear several (0 or more) times.
  • The question mark ‘?’ indicates that the preceding character is optional (appears 0 or 1 time).
  • A “pipe” character ‘|’ indicates a choice (“either/or”). The regexp should match everything before it OR everything after it.
  • The round brackets ‘()’ indicate that we’re grouping or nesting. We’re actually making a regexp inside the main regexp.
  • The square brackets ‘[]’ indicate that we’re using a collection
    of characters; either by defining a range or by using a group. All
    collection members are matched individually.
  • The curly brackets ‘{}’ specify a count: the number of times the preceding character needs to occur.
  • The backslash ‘\’ makes the parser (“parse master”) treat the
    following character literally, in other words the character ‘escapes’ parsing and as such is picked up literally.
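To make the recap a bit more tangible, here is a rough sketch in Python which tries every character from the lists above once. This is only an illustration: Python’s “re” module is not the engine used by the Max regexp object, and all the test texts are made up for this example.

```python
import re

# Each entry: (what is being shown, pattern, made-up text that should match).
examples = [
    ("caret: start of the text",     r"^apc",            "apc40 rocks"),
    ("dollar: end of the text",      r"apc40$",          "I own an apc40"),
    ("dot: any single character",    r"apc.0",           "apc20"),
    ("asterisk: 0 or more",          r"shel*",           "she"),
    ("question mark: 0 or 1",        r"synths?",         "synth"),
    ("pipe: either/or",              r"fan|user",        "user"),
    ("round brackets: grouping",     r"synth(fan|user)", "synthuser"),
    ("square brackets: collection",  r"apc[24]0",        "apc40"),
    ("curly brackets: count",        r"apc[0-9]{2}",     "apc20"),
    ("backslash: literal character", r"3\.5",            "3.5"),
]

# re.search returns a match object on success and None otherwise.
results = [(desc, re.search(pattern, text) is not None)
           for desc, pattern, text in examples]

for desc, matched in results:
    print(f"{desc:30} -> {matched}")

# Without the backslash the dot matches any character, so "385" matches too.
unescaped = re.search(r"3.5", "385") is not None
escaped = re.search(r"3\.5", "385") is not None
print(unescaped, escaped)
```

Every line of the first loop should report a match. The last line shows the difference the backslash makes: the unescaped dot accepts “385”, the escaped one does not.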

Advanced regexps

There are 2 matching characters which I have only mentioned but haven’t addressed yet and I know there are a few of you who seem rather impatient to see these mentioned. I’m talking about the caret (‘^’) and dollar (‘$’) characters.

Now, I left these out of my tutorial so far because there are many people out there who have a completely wrong (better put: incomplete) idea about them. Some believe that the caret ‘^’ sign basically always marks the beginning of a regexp. And as such the ‘$’ would indicate the end. However, this isn’t always the case.

And that’s the reason why I haven’t explained these two before: I was hoping that if you tried some of the things I previously mentioned for yourself, you’d notice that regexps work just fine without either of these two characters.

So what do these characters stand for?

As previously mentioned, the caret sign ‘^’ marks the beginning of a line of text whereas the dollar sign ‘$’ marks the end. Another way to think of the dollar sign, although a little bit more technical, is as the spot right before “CR/LF”, which stands for “carriage return / linefeed” (hex values 0D/0A). I’m not going into detail here, but that’s what you would usually come across at the end of a line of text. Note that the dollar sign only marks a position; it doesn’t match the CR/LF characters themselves.

Up until now we’ve been looking at creating a specific match for a single word. For example my previously mentioned search for my alternative nick: “synth(fan|user)?”. If you were to feed this regexp with the sentence “My nick on synthtopia is synthfan” it will end up as a match:

Example of using caret and dollar signs.

So why would we ever want to use indicators to mark the beginning or end of a line of text?

Simple… Because of the repetition effect. Basically a regexp object won’t stop looking once it has found a match but will continue finding matches until it has run out of data to check. You can see one example of this behavior in the screenshot at the beginning, where I addressed the ‘*’ oddity.

Another example of it can be seen to the left; the regexp will match “synthfan”, “synthuser” but also “synth” since the last part is optional because of the question mark ‘?’.

As such we have 2 matches here, as you can see by the global outcome of the regexp; it found the words “synth” and “synthfan”.

So how did this happen? Because our regexp matches the word “synth” it found this word in the word “synthtopia”. After it found this match it then continued checking the rest of the sentence and eventually came across “synthfan” which also matched. That’s why you see 2 items coming out of the middle outlet.
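The same repetition effect can be reproduced outside of Max. Below is a small Python sketch using the sentence and regexp from above. Python’s “re” module is used here only as a stand-in, so this shows the general regexp behavior rather than what the Max object does internally.

```python
import re

sentence = "My nick on synthtopia is synthfan"
pattern = r"synth(fan|user)?"

# finditer keeps searching after the first hit;
# group(0) is the full text of each match.
matches = [m.group(0) for m in re.finditer(pattern, sentence)]
print(matches)  # ['synth', 'synthfan']
```

The first item comes from inside “synthtopia”, the second one is the nick at the end of the sentence.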

Overcoming repetition

One of the main reasons why we have so-called “control characters” is to control the way in which our regexp matches. Suppose we have to go over several sentences which can start with either the word “apc40” or “apc20” and we want to know exactly which word is at the beginning. Then we might use something like this:

Regexp example showing the caret sign.

Here you can see that we’re looking for the word “apc” followed by 2 digits, each in the range 0 to 9. The use of the caret ‘^’ sign here means that the match should be located at the beginning of the sentence.

If we were to omit the caret sign then we’d end up with 2 matches; both apc40 and apc20 would match and because of that the middle outlet would send both “apc40” and “apc20” right after one another.

Obviously the same method can be applied to limit the match only for the word “apc20”. In order to do that we’d need to remove the caret sign and instead add a dollar sign ‘$’ to the end of the regexp. So something like: “apc[0-9]{2}$”. Then we’d be matching the word “apc20” and nothing else.
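Here is a short Python sketch of the three variants next to each other. The sentence is an assumption on my part (any sentence which starts with “apc40” and ends with “apc20” will do), and Python’s “re” module merely stands in for the Max regexp object.

```python
import re

# Made-up sentence: starts with "apc40" and ends with "apc20".
sentence = "apc40 is the big brother of the apc20"

no_anchor = re.findall(r"apc[0-9]{2}", sentence)    # match anywhere
start_only = re.findall(r"^apc[0-9]{2}", sentence)  # only at the start
end_only = re.findall(r"apc[0-9]{2}$", sentence)    # only at the end

print(no_anchor)   # ['apc40', 'apc20']
print(start_only)  # ['apc40']
print(end_only)    # ['apc20']
```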

There is one last control character which I haven’t addressed yet and that is the plus ‘+’ sign:

  • A plus ‘+’ sign indicates that the preceding character appears at least one (1 or more) time(s).

So basically if you need to be sure that something is there (at least 1 time) then you’d want to use the plus sign.
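The difference with the asterisk is easiest to see side by side. A minimal Python sketch, again with a made-up test sentence and with Python’s “re” module standing in for the Max regexp object:

```python
import re

sentence = "she shel shell"

# Asterisk: the "l" may be absent, so "she" matches too.
star = re.findall(r"shel*", sentence)
# Plus: at least one "l" is required, so "she" drops out.
plus = re.findall(r"shel+", sentence)

print(star)  # ['she', 'shel', 'shell']
print(plus)  # ['shel', 'shell']
```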

A closer look at the regexp object

Example which shows all the outlets of the regexp object.

Here you see the regexp object, with labels I added that give the name / meaning of every outlet. Let’s go over them one by one in detail (Max order)…

Dumpout

An outlet which quite frankly still manages to elude me, simply because there’s nothing coming out of it no matter what I do. I know C74 did something with this outlet during the last update, but apart from that….

Unmatched

When working with a regexp you’re basically trying to match something. So whenever you’ve set up a regexp which doesn’t match, or only partly matches, the part which doesn’t match will be sent out through this outlet.

It is important here to note that this behavior appears to be a little inconsistent here and there. Sometimes you can end up with one part which matches your regexp while the entire part is also sent out as not matching, see some of the previous examples (even in this post).

As a rule of thumb I never (or hardly) rely on the output of this particular outlet.

Substrings

This outlet basically provides the data which matches your regexp. An important detail here is that you should be well aware that in general a regexp repeats itself until there is no more data to match. So in the case where your regexp matches several times all of the matched data will be sent out through this outlet, which I already displayed in some previous examples.

Backreferences

This outlet sends out all the data which came from “nested regexps”. So, everything which you put between brackets “()”. Note that you’re not limited to using only one part between brackets. You can use several regexps like this, and all of their output will be sent out through this outlet in the form of a list.
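Other regexp tools know the same idea, usually under the name “capture groups”. As a rough illustration, here is a Python sketch with two bracketed parts in one regexp. The pattern and input text are made up for this example, and the exact way the Max object formats its list may well differ from what Python prints.

```python
import re

# Two bracketed parts: the fixed word and the optional ending.
pattern = r"(synth)(fan|user)"
text = "synthuser"

match = re.search(pattern, text)
full_match = match.group(0)  # the whole match
groups = match.groups()      # one item per bracketed part

print(full_match)  # synthuser
print(groups)      # ('synth', 'user')
```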

Substitutions

This is a part which I haven’t covered in the tutorials yet. So far all we’ve been doing was setting up a regexp which would then match a certain text pattern, which would then be sent out through either the “substrings” or “unmatched” outlets. Another very powerful feature of regexps is the process of substitution.

I think by now it should be easy to guess what this outlet can do. Basically it allows you to exchange (‘replace’) the text which matches your regexp for something else.

Example of how to use substitutions in regexps.

Here you see an example of a substitution. Because it’s an attribute I’m using the @ character here, but there is also an undocumented option to perform these substitutions. I’m going to address that in my next post.

So; what is happening here?

We’re looking for “shell” which can then be followed by either ‘fan’ or ‘user’. And we want to substitute the found results with the word “test”.

Now, the 3 different result outputs might be confusing but it’s all really perfectly logical (I sound like Mr. Spock :)).

The leftmost outlet will send out the full substituted version of the input. So we get the output where “shelluser” has been replaced with “test”. The second outlet (‘backreferences’) gives us whatever the ‘nested’ regexp matched, so in this case ‘user’. Finally the ‘substrings’ outlet sends out the exact match of our regexp, which in this case is ‘shelluser’.
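The three results can be mimicked in Python as a rough sketch. Two assumptions here: the input sentence is made up, and the pattern “shell(fan|user)” is my reading of the description above rather than a copy of the patch. Python’s “re.sub” plays the role of the substitution; it is not how the Max object works internally.

```python
import re

pattern = r"shell(fan|user)"
text = "my nick is shelluser"  # made-up input

# Substitution: every match is replaced with "test".
substituted = re.sub(pattern, "test", text)

match = re.search(pattern, text)
backreference = match.group(1)  # what the bracketed part matched
substring = match.group(0)      # the full match

print(substituted)    # my nick is test
print(backreference)  # user
print(substring)      # shelluser
```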

Coming up next

In my next post I’ll address some of the examples in the ‘help patches’ and I’ll go a little deeper into substitutions.

Once again, sorry for the heavy delays between posts so far. But in the end this is but a hobby for me, and when it comes to hobby vs. work I don’t really have much of a choice. And believe me; being self employed means that sometimes your work does not end at 17:00.

5 comments

  1. Thanks for your reaction!

    There is one last part in the making; there I’ll go in depth on some of the examples in the regexp help patches as well as finish up on anything I may have forgotten in the previous 3 parts.

    But most of the basics concerning regexps have been explained so far.

  2. Thanks for your reaction !

    There is one last part upcoming where I’ll dive more into the examples shown by C’74 in their help patches, but most of the explanation about regexps should be completed by now (maybe apart from some loose ends I overlooked, those will also be addressed).

    ( posted twice because I accidentally removed my first post by clicking “spam” a little too quickly )
