Regexps part 2; processing text in Max/M4L

Sorry, this post came out a bit later than I planned, but you know what they say: better late than never. The reason for the delay is twofold. First, the purchase of a new microphone: the Samson G-Track, a USB condenser mic which also includes an audio interface. Quite extensive indeed, but more about that in a later post.

But mostly the delay was caused by the end of my computer chair as I know it. It’s years old (I bought it around 1988, I think), doesn’t go up and down as well anymore, and when my girlfriend came to sit on my lap this weekend it marked the beginning of the end for the chair. In other words: I’m not sitting as comfortably as usual, which is bound to affect my writing. That obviously doesn’t mean I won’t try to keep to the schedule; it simply takes more time and effort.

Quick recap

In the first part of my tutorial I explained some of the basics of regexps. To make this a bit easier to follow I’ll do a quick recap:

Regexps are basically search templates which contain several characters which together can represent certain text. We divide these characters into “matching characters” and “control characters” (not official naming).

Matching characters I showed last time:

  • The caret sign ‘^’ represents the start of a line of text.
  • The dollar sign ‘$’ represents the end of a line of text.
  • The dot ‘.’ matches any kind of single character.

Control characters I showed last time:

  • The asterisk ‘*’ indicates that the preceding character can appear any
    number of times (0 or more).
  • The question mark ‘?’ indicates that the preceding character is optional
    (appears 0 or 1 time).
  • A “pipe” character ‘|’ indicates a choice (an “or”). The regexp matches
    either everything before it OR everything behind it.
  • The round brackets ‘()’ indicate that we’re grouping or nesting. We’re
    actually making a regexp inside the main regexp.
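
To make the recap a bit more tangible, here is a small sketch of the characters listed above. Mind you: this uses Python’s re module and not the Max regexp object (that choice is mine, purely so the examples can be run as text), and the sample words are made up. These basic characters behave the same way in most regexp flavours.

```python
import re

# Each entry: (pattern, text that should match, text that should not).
examples = [
    (r"^shel", "shelluser", "bombshell"),   # ^ anchors at the start
    (r"user$", "shelluser", "username"),    # $ anchors at the end
    (r"a.c", "abc", "ac"),                  # . needs exactly one character
    (r"^ab*c$", "ac", "adc"),               # * allows zero or more 'b'
    (r"^colou?r$", "color", "colouur"),     # ? makes the 'u' optional
    (r"^(cat|dog)$", "dog", "cow"),         # | picks one alternative
]


def matches(pattern, text):
    """Return True when the pattern is found somewhere in the text."""
    return re.search(pattern, text) is not None


# One (good, bad) pair of booleans per example.
results = [(matches(p, good), matches(p, bad)) for p, good, bad in examples]

for (pattern, good, bad), outcome in zip(examples, results):
    print(pattern, good, bad, outcome)
```

Every line should report (True, False): the first text matches, the second one doesn’t.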

Homework solution

I also asked about this regexp: “synth(fan|user)?”.

It’s relatively simple: first we have “synth”, which is always matched. This is followed by a regexp in round brackets, so this part gets processed individually. The regexp is basically “fan|user”, which matches either the word ‘fan’ or ‘user’. The entire part between brackets is followed by a question mark, which normally means that the preceding character can appear 0 or 1 time.

In this case however it will affect the entire “sub regexp”. Because of the brackets the “sub part” will actually be treated as a single (separate) part. So the question mark will apply to its entire outcome.

As such the regexp will match “synth”, “synthfan” or “synthuser” and nothing else.
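
If you want to check the homework outside of Max, here is a minimal sketch in Python (again an assumption on my side; it is not the Max regexp object). I use fullmatch() so that the whole text has to fit the regexp, which is what “and nothing else” boils down to. The list of candidate words is made up.

```python
import re

pattern = re.compile(r"synth(fan|user)?")

candidates = ["synth", "synthfan", "synthuser", "synthfanuser", "fan"]

# fullmatch() only succeeds when the entire string fits the pattern.
accepted = [text for text in candidates if pattern.fullmatch(text)]

print(accepted)  # ['synth', 'synthfan', 'synthuser']
```

Keep in mind that an unanchored search would still find “synth” inside a longer text such as “synthfanuser”; it is the whole-string check which rejects it here.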

Part 2

Today I’ll be going over some more advanced search (matching) options, go over some aspects which are specific for the Max regexp object and last we’ll take a closer look at the inner workings of the regexp object itself (but not everything).

In my next post (which will most likely be the last) we’ll dive even deeper and look at how we can use regexps to change text (better put: how to perform substitutions) as well as fix any loose ends which I may have overlooked.

Collections

So far we have been using regexps by literally defining the stuff we were looking for. For example, the last regexp I showed you was used to look for either the APC40 or APC20 but basically we fully specified all the parts we were looking for: “apc(2|4)0”.

Now, let’s assume that Akai has fully expanded the APC series to address as many customers as possible. From the highly portable APC10 right up to the massive APC60 (which, funnily enough, is more real than you might know). So how are we going to write a regexp to cover all this?

With the knowledge we have now it would look like this: “apc(1|2|3|4|5|6)0”. Fortunately for us this can be done much more efficiently by using what I like to call collections:

  • The square brackets ‘[]’ indicate that we’re using a collection of characters; either by defining a range or by using a group. All collection members are matched individually.

So in the previous example we don’t have to type out all of the numbers we want to match. Instead we’ll set up a collection which matches the range from 1 to 6. This is done like so: “apc[1-6]0”. Here we look for “apc”, which is then followed by a collection defining a range from 1 to 6, which in turn is followed by a 0.

This regexp will easily match apc10, apc30 but obviously will stop matching when you feed it apc70 or apca0.
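
A quick way to see the range at work is this sketch, written in Python rather than Max (my assumption, just to have something runnable). The model names are the imaginary ones from the text.

```python
import re

pattern = re.compile(r"apc[1-6]0")

models = ["apc10", "apc30", "apc60", "apc70", "apca0"]

# Keep only the names which fit the pattern as a whole.
matched = [name for name in models if pattern.fullmatch(name)]

print(matched)  # ['apc10', 'apc30', 'apc60']
```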

Demonstrating collections in regexp.

 

Now, the reason why I’m referring to this as a collection is not only because we can use ranges, but because everything between square brackets will be matched individually. The range is simply the most obvious example. In short: “a|b|c” is basically the same as: “[abc]”. The first is a regexp which tells us “a or b or c” while the latter is a collection of characters to match. In this case one of the three; a or b or c.
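
To show that both notations really pick up the same characters, here is a small Python sketch (not Max; the sample text is made up). Both regexps are asked for every match they can find in the same text.

```python
import re

alternation = re.compile(r"a|b|c")   # "a or b or c"
collection = re.compile(r"[abc]")    # one character out of a, b, c

text = "abcd"

found_by_alternation = alternation.findall(text)
found_by_collection = collection.findall(text)

print(found_by_alternation)  # ['a', 'b', 'c']
print(found_by_collection)   # ['a', 'b', 'c']
```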

At this point you may wonder why I even bothered with the harder to use “a|b|c”. That’s because this approach is used with grouping, and grouping provides very specific advantages. I will address these more in depth in my next post; for now, don’t assume that you can simply pick either one of these approaches and forget all about the other. Sorry; it’s not as simple as that.

You can also easily combine ranges with individual characters. Suppose that Akai expands even further on their APC series, besides the apc40, apc10 and apc60 they now bring out a special model: the apcx0. The x standing for eXtended of course..  So how will we match this?

Simple; by adding the new entry to the collection we already defined earlier: “apc[x1-6]0”. Here we defined a collection which consists of “x” and the range of 1 to 6. Whenever there is a hyphen ‘-’ between two characters inside a collection it will automatically be picked up as a range between the characters before and behind it. Everything else will be treated individually.
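
The same check as before, now with the extended collection. This is once more a Python sketch and not a Max patch; “apc-0” is a made up name, added to show that the hyphen itself is not a member of the collection here.

```python
import re

pattern = re.compile(r"apc[x1-6]0")

models = ["apcx0", "apc40", "apc70", "apc-0"]

# 'x' is matched individually, '1-6' is treated as a range.
matched = [name for name in models if pattern.fullmatch(name)]

print(matched)  # ['apcx0', 'apc40']
```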

Another good example where ranges can come in handy…  Have you noticed how I used lower case writing throughout all of my examples so far?  Even though the official writing of the Akai Ableton controller is actually APC40 or APC20. That’s because regexps, even those in Max, are case sensitive. Yes, I should have thought about mentioning this in my previous post since it managed to confuse a follower of these tutorials.

In my first examples I showed you “a.b” where the dot ‘.’ stands for any kind of character. Suppose we want to make sure that we don’t get stuff like “a1c” but only letters. Better yet: “abc” is just as good as “aDc”. So we want UPPER and lower cased letters, but nothing else. Here the collection can once again help us out; we simply define 2 ranges of letters, one lower and one uppercase. So: “a[a-zA-Z]c”. While you cannot simply put two regexps next to each other as alternatives (we need to nest them), you can do exactly that with ranges inside a collection. So here we have a, which is followed by a collection which consists of all the lower case letters from a to z and the uppercase letters from A to Z. Then this collection is followed by c. So this will match “abc” as well as “aBc”, “aXc” or “axc” and so on.
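
Here is a short sketch of the two ranges in one collection, which also shows the case sensitivity I mentioned. As before this is Python and not Max (my assumption), and the test words are made up.

```python
import re

pattern = re.compile(r"a[a-zA-Z]c")

words = ["abc", "aBc", "aXc", "a1c", "ABC"]

# Only the middle character may be upper OR lower case;
# the surrounding 'a' and 'c' are still case sensitive.
matched = [word for word in words if pattern.fullmatch(word)]

print(matched)  # ['abc', 'aBc', 'aXc']
```

Note how “ABC” is rejected: the collection only covers the middle character, so the a and c on the outside still have to be lower case.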

Counts

Now that we have collections, it would be awesome if we could specify an amount of matches. And yes, we can..  In my first example I started looking for “shelluser”. And I ended up with looking for “shel.*”, the word shel followed by any kind and amount of other characters.

Now; suppose we already knew we were looking for “ShelLuser” but given the possible odd way in which I use upper and lower cases couldn’t really put our finger on it. The first S is a capital; that’s common with names. But the rest..  At least we know it consists of 8 more characters which could be either lower or capital letters.

How to match that?  This is probably not the best example to use but I have my excuse written above (which, all pun put aside is meant seriously).

Let’s try to match S followed by 8 letters which can be either lower or CAPITAL. How to do that?  By specifying an amount:

  • The curly brackets ‘{}’ specify the “count rate”: the number of times the preceding character (or collection, or group) needs to occur.

So..   “S[a-zA-Z]{8}”. Here we have a capital S, followed by a collection which consists of all the letters ranging from a to z and A to Z (lower and capital). Then we have an amount specification of 8. In other words; after the first S the collection can match 8 times. Since the collection consists of all lower and upper cased characters it will match 8 letters, independent of case, behind the letter S.
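
To see the count rate in action, here is a minimal Python sketch (not Max; the list of spellings is made up by me). Only names which consist of a capital S followed by exactly 8 letters should get through.

```python
import re

pattern = re.compile(r"S[a-zA-Z]{8}")

names = ["ShelLuser", "Shelluser", "SHELLUSER",
         "shelluser", "Sheluser", "Shell1ser"]

# A capital S, then exactly 8 letters of either case.
matched = [name for name in names if pattern.fullmatch(name)]

print(matched)  # ['ShelLuser', 'Shelluser', 'SHELLUSER']
```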

And yes, this applies throughout the regexp material. People who like to try and “translate” examples to those they already became familiar with; the “count rate” can easily apply to previous examples as well.

I previously mentioned “shel.*” which will match “shelluser”. However, because of the * this will basically match anything which comes behind the word “shel” as stated above. Even stuff like “shelbutIamnotREALLYhim” would match.

To match ‘shelluser’ more closely you could also use count rate: “shel.{5}”. So; the word shel, which is then followed by any other character which can appear 5 times. Again; this isn’t perfect since it will match shelNOTME easily as well, but hopefully it gives you a good idea.
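
The difference between the two approaches shows up nicely when you run both against the same texts. This is a Python sketch (my assumption, not a Max patch), and I let each regexp cover the whole text so that the count actually matters.

```python
import re

loose = re.compile(r"shel.*")     # shel plus anything, any length
strict = re.compile(r"shel.{5}")  # shel plus exactly 5 characters

texts = ["shelluser", "shelNOTME", "shelbutIamnotREALLYhim", "shel"]

loose_hits = [t for t in texts if loose.fullmatch(t)]
strict_hits = [t for t in texts if strict.fullmatch(t)]

print(loose_hits)   # all four texts
print(strict_hits)  # ['shelluser', 'shelNOTME']
```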

aaaaah.... we’re getting there.

That “aaaaah....” text is matched by “a{5}h.{4}”, right?  Can you spot what is going wrong here?  (yes, I’m aware I’m repeating my previous approach a bit).

We wanted to match a literal dot ‘.’ 4 times, but if you give this a try you’ll notice that it’s not working out as you expected: the regexp also matches something like “aaaaahabcd”. This is because the dot is also a so called “matching character”. It will match any character, and in this particular example it would do so 4 times.

Somehow we need to be able to tell the system that it should stop using the dot as a matching character but instead use it literally.

Escaping

To this end we need to use an “escape”.

The most asked question here is “what are we escaping from?”.  Well, easy; the parse master.

A little more seriously put; we’re escaping the parsing process by telling it that it should ignore the upcoming character and instead treat it as any other character. In this case it would treat the matching character as if it were a regular character.

  • The backslash ‘\’ makes the parser (“parse master”) treat the following character literally, in other words the character ‘escapes’ parsing and as such is picked up literally.

But unfortunately things aren’t always easy. A Max “caveat” (not really, but I’m not going into details here) is that it will automatically remove a single backslash ‘\’ character. Now what?

Simple; just consider the backslash a “Max control character” and escape it. While removing the backslash may seem peculiar, Max does live up to the global (but unwritten) rules and as such you can simply escape the escape character. Sounds confusing?  Simply put: use the backslash twice.

So our previous example would actually need to look like this: “a{5}h\\.{4}”. First the letter a, which needs to be matched 5 times. Followed by the letter h, which is then followed by a literal dot ‘.’. It’s literal because it has been escaped. And because we’re using Max the escape has also been escaped. That dot is then matched 4 times.

Thus resulting in it matching “aaaaah....”. 5 times an ‘a’, followed by an ‘h’ which is then followed by 4 times a dot.
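
Here is a sketch of the escaped versus the unescaped dot, in Python instead of Max (my assumption once more). Python has a comparable quirk to the one described above: in a regular (non raw) string you also have to double the backslash, although the reason behind it differs from what Max does. The test text “aaaaahabcd” is made up.

```python
import re

unescaped = re.compile(r"a{5}h.{4}")   # dot = any character
escaped = re.compile(r"a{5}h\.{4}")    # dot = a literal dot

texts = ["aaaaah....", "aaaaahabcd"]

unescaped_hits = [t for t in texts if unescaped.fullmatch(t)]
escaped_hits = [t for t in texts if escaped.fullmatch(t)]

print(unescaped_hits)  # ['aaaaah....', 'aaaaahabcd']
print(escaped_hits)    # ['aaaaah....']

# Without the r prefix the backslash has to be doubled,
# which results in exactly the same regexp.
doubled = re.compile("a{5}h\\.{4}")
print(doubled.pattern == escaped.pattern)  # True
```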

Next time

Sorry, but I am going to keep this shorter than I anticipated. I know that my forecast above doesn’t match any more but right now I only care about finishing up a decent post and getting a new chair!  This is quite uncomfortable.

So next time I’ll explain the regexp object more in depth (all of the inlets and outlets) and also go over the specifics of the object (the reference page CAN be confusing indeed). I’ll also explain “control characters” which can match specific characters (like a decimal digit (or its opposite)) and also explain substitution (so actually changing an incoming list stream).

Once again sorry for the somewhat sparse post, more will come next time!
