Regexps part 1; processing text in Max/M4L

It’s only fairly recently that I discovered the “regexp” object in Max. It has changed the way I work with M4L quite drastically, since there are many situations in which you need to process text (“lists”) in some way. Obviously Max itself provides plenty of objects for that, for example if, select, route and maybe even trigger.

However, being very familiar with Unix(-like) environments, one of the things I’m decently fluent in is regular expressions. These give you much more control over what you want to do with text, partly because the expression itself lets you apply some logic. It’s been my experience that once you’re able to use this stuff you may well start to ignore certain out-of-the-box solutions entirely. Getting the name of a file? While you can use strippath for that, a regexp is also easily set up, with the advantage that a regexp can be made to support spaces in file names, whereas strippath requires “” around the whole path (its main purpose is use in combination with dropfile or opendialog, which do just that).

So, when I realized how much I’ve been using this, also to help out others on both the Ableton and Cycling ’74 forums, I wondered if some kind of tutorial might be helpful. Not the kind of technical mumbo jumbo you can find everywhere, but something not too deep yet deep enough to be useful, hopefully easy to understand and totally aimed at Max / M4L.

So here you go…

Regular expressions – some (global) history

Now, I’m not going too deep into this but I do think that it is important to have a rough idea where this stuff comes from. Knowing a little bit about the reasoning behind it may very well help you understand why certain things work in the way they do.

The whole idea of regular expressions (“regexps” for short) is based on pattern matching and basically stems from scientific studies (which I won’t address). One of the first implementations of these studies on a computer was QED, a text editor which started to use regular expressions to set up searches in text files. We’re roughly talking around 1970 now. Because of their flexibility regexps also found their way into Unix systems. Unfortunately, as with so many things, there is no real standard. Many Unix programs (vi, perl, sed, awk) use regexps, yet each also introduces some behavior which is specific to that program. However, slowly but steadily the perl implementation has been picked up by many and is treated as some form of standard.

So what is all this regexp stuff anyway?

A regular expression is basically a bunch of characters (dots ‘.’, dollar signs ‘$’, question marks ‘?’, asterisks ‘*’) which represent some kind of “text pattern”. Simply put: a regexp is a pattern which can be used as a “search string” to see if it matches certain text.

An example..  Say you’re wondering who wrote this blog (please tell me it isn’t so! :)). You hit “control-f” on your browser to use “Find”. This works in both Internet Explorer and Firefox (sorry, I have no idea about others like Safari). Now, you do know my alias begins with “shel” but what was the rest again?

So, as a search string you’re using “shel” and hit find.

And here the browser will come up with “eggshell”, “shelling”, “shelly”, “crabshell”, “shelution” (not everyone can spell), or maybe even the “shel” in “command shell”. And when you’re just about to give up hitting “next” all the time you finally reach the bottom where it says “ShelLuser” (yay, that’s me!).

Surely we should be able to do better than that?  I mean; we knew it started with “shel”. So it would have been useful if we somehow could make the find ignore everything which ended with shel (stuff like ‘crabshell’ or ‘eggshell’).

So what we’d need is some kind of way to tell the browser that we want to place some restrictions on our search. So; using some kind of character string which would tell it that it should find words which start with ‘shel’ and end with other stuff. For example something like: “shel*” where the * basically means “anything”.

Unfortunately this example doesn’t work too well in a browser since the browser will most likely pick up * as a literal character. In other words: it will start looking for “shel*”. But I do hope it gives you an idea about the reasoning behind all this. Also note that such a search command probably will work out in a text processor like Word or OpenOffice Writer.

So, this is the part where regexps can shine. In the example above “shel*” can be considered a simplified form of regexp: an expression which represents any kind and any amount of characters appearing after the word “shel”. In a real regexp the same idea is spelled slightly differently, which is what the rest of this post works towards.

Basic regular expressions

Ok, now that we have a global idea as to what a regexp actually is, it’s time to look into how we can put them to good use. As mentioned before I’m fully focusing on Max and M4L here. To begin using regexps we first need to know some dry theory (sorry!).

As I explained above a regexp is basically a pattern made from several characters which can represent certain text. But in order to use them we first need to know what each character actually represents. And yes; there are a lot of them. Don’t worry, we’ll start very simple and I’ll provide examples as well.

  • The caret sign ‘^’ represents the start of a line of text.
  • The dollar sign ‘$’ represents the end of a line of text.
  • The dot ‘.’ matches any kind of single character.

These characters represent others. However, if we want to use these to try our previous search action again we’ll soon discover that we’re lacking options. I mean, now I can say I want something to begin with “shel”. Then I can use a . character but this only represents 1 other single character.

In other words, a regexp which looks like this: “a.c” basically means “a which is followed by any other character which is then followed by c”. So it would match: “abc”, “acc”, “apc”, “aac”, “a1c” and so on.
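If you want to check this outside of Max, here is a minimal sketch in Python. It assumes that Python’s built-in re module treats a pattern this simple the same way as the Max regexp object does, so take it as an illustration and not as a description of Max itself. The list of test words is made up for the occasion.

```python
import re

# "a.c": an 'a', then any single character, then a 'c'.
pattern = re.compile(r"a.c")

# Hypothetical test words; the last two are there to show what does NOT fit.
candidates = ["abc", "acc", "apc", "aac", "a1c", "ac", "abbc"]

# fullmatch() only succeeds when the whole string fits the pattern.
matches = [text for text in candidates if pattern.fullmatch(text)]

print(matches)  # ['abc', 'acc', 'apc', 'aac', 'a1c']
```

Note that “ac” and “abbc” are rejected: the dot stands for exactly one character, no less and no more.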

But in our previous example we wanted to search for “shel” which was then followed by several other characters. So how does that work?

Simple: next to what I like to call “matching characters” we also have so-called “control characters”. Note that this isn’t official wording, it’s just what I like to call these. So, some examples of “control characters”:

  • The asterisk ‘*’ indicates that the preceding character can appear any number of times (0 or more).
  • The question mark ‘?’ indicates that the preceding character is optional (appears 0 or 1 time).
  • A “pipe” character ‘|’ indicates a choice. The regexp should match everything before it OR everything after it (sounds complicated, see the example below).

So now we have the required tools in our “regexp toolbox” to make the previous search work. If we want to search for “shel<something>” all we need to do is indicate that “shel” can be followed by any number of characters. Doesn’t really matter which.

SO: “shel.*”.  First “shel”, followed by a dot ‘.’ which basically means “any single character can come here” which is then followed by an asterisk ‘*’ which means “the previous character can appear 0 or more times”.
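A small sketch in Python again (same assumption as before: that Python’s re module behaves like the Max regexp object for patterns this basic). The word list is taken from the browser example earlier on, written in lowercase because regexps are case sensitive by default. It also shows where the caret ‘^’ from the list above comes in: without it, the pattern is happy to match in the middle of a word.

```python
import re

words = ["eggshell", "shelling", "shelly", "crabshell", "shelution", "shelluser"]

# Without an anchor, "shel.*" may be found anywhere inside a word,
# so "eggshell" and "crabshell" still count as hits.
found_anywhere = [w for w in words if re.search(r"shel.*", w)]

# With the caret in front, "shel" has to sit at the very start.
starts_with_shel = [w for w in words if re.search(r"^shel.*", w)]

print(found_anywhere)
print(starts_with_shel)  # ['shelling', 'shelly', 'shelution', 'shelluser']
```

So if you really only want the words which begin with “shel”, “^shel.*” is the safer spelling.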

More examples

When looking at the first example of “a.c” we can expand a bit on this. Let’s say that we want to look for “abc” which can be followed by another character, but it doesn’t have to. So “abc” or “abcd” or “abcp” or “abce” and so on. For that we simply use the ‘?’ character.

SO: “abc.?”. First “abc”, followed by a dot ‘.’ which basically means “any single character can come here” which is then followed by a question mark ‘?’ which means “the previous character can appear 0 or 1 time”.

Thus we end up with abc which can then be optionally followed by any other character.
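The same thing as a quick Python sketch (an illustration only, assuming Python’s re module agrees with Max here; the test words are made up):

```python
import re

# "abc.?": "abc", optionally followed by one more character of any kind.
pattern = re.compile(r"abc.?")

candidates = ["abc", "abcd", "abcp", "abce", "abcde", "ab"]

# fullmatch() requires the whole string to fit the pattern.
matches = [text for text in candidates if pattern.fullmatch(text)]

print(matches)  # ['abc', 'abcd', 'abcp', 'abce']
```

“abcde” is rejected because the ‘?’ allows at most one extra character, and “ab” because the “abc” part itself is not optional.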

And finally the pipe ‘|’. This is a little more difficult and also the last I’ll be explaining in this part since I want to keep it simple for now…

The character itself is simple, it basically means “or”.  For example: “abc|def”. This basically means that we’re searching for “abc” or “def”.

And obviously you can apply more logic to it..  Our example above (the a.c) can easily fit in. You can make it as easy or as complex as you want. So: “a.c|def” is perfectly usable, but slightly more complex. Now we’re looking for “a which is followed by any other character which is followed by c OR a full match of def”.

In other words; apart from “abc”, “apc”, “azc”, “a3c” this string will also always match “def”.
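Sketched in Python (same assumption as the earlier sketches: that the re module and Max agree on this basic syntax; the test words are hypothetical):

```python
import re

# "a.c|def": either a, any single character, c ... OR the literal text "def".
pattern = re.compile(r"a.c|def")

candidates = ["abc", "apc", "azc", "a3c", "def", "dec", "abf"]

# fullmatch() requires the whole string to fit one of the two alternatives.
matches = [text for text in candidates if pattern.fullmatch(text)]

print(matches)  # ['abc', 'apc', 'azc', 'a3c', 'def']
```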

And as you may notice this also means that we can use more ways to search for something.

Let’s say we want to search for “apc” which can then be followed by 20 or 40 (representing the apc40 or apc20 obviously).

We could make this easy on ourselves and use: “apc.0”. SO: “apc” followed by a ‘.’ which can be any other character which is then followed by a 0.

However, this can give us unwanted results because the dot stands for “anything”. Not only does this match with “apc40” it will also easily match “apcz0” or “apcc0” or “apc10” and so on.
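You can see how generous the dot is with a short Python sketch (illustration only, assuming Python’s re module behaves like Max for this pattern):

```python
import re

# "apc.0": "apc", any single character, then a 0.
pattern = re.compile(r"apc.0")

candidates = ["apc40", "apc20", "apcz0", "apcc0", "apc10"]

# Every single candidate fits, including the ones we never wanted.
matches = [text for text in candidates if pattern.fullmatch(text)]

print(matches == candidates)  # True
```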

A better approach would be to tell the regexp that we wanted to match either a 4 or a 2 and nothing else.

From here the logical approach would be: “apc2|40”, yet this poses a problem…  Can you spot it ?

Let’s see, what does this setup actually mean…

First we have “apc2”, so it’s not unthinkable that the regexp will match this. Then we have the pipe ‘|’ which means or, and finally we have “40”.  SO…  Instead of telling our regexp that we wanted to match either 2 or 4 we actually told it to match either “apc2” or “40”.

Not what we wanted to achieve…
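A small Python sketch makes the problem visible by showing which piece of text was actually matched (assuming, as in the other sketches, that Python’s re module splits the alternatives the same way Max does; the inputs are made up):

```python
import re

# The pipe splits the WHOLE expression: "apc2" on one side, "40" on the other.
pattern = re.compile(r"apc2|40")

inputs = ["apc20", "apc40", "40", "xyz40"]

# For each input, record the exact piece of text the pattern matched.
matched_text = {}
for text in inputs:
    match = pattern.search(text)
    matched_text[text] = match.group(0) if match else None

print(matched_text)
# {'apc20': 'apc2', 'apc40': '40', '40': '40', 'xyz40': '40'}
```

So “apc20” only matches because of its first four characters, and anything containing “40” counts as a hit, apc or not.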

So how to tell this regexp that it should treat the 2|4 separately ?  We somehow need to take this expression “out” of the main expression and make it get matched on its own….

This is where “grouping” comes in, and it’s also the last thing I’ll be teaching you in this post (to prevent this from becoming an endless essay; don’t worry, I won’t stop here).

Grouping

  • The round brackets ‘()’ indicate that we’re grouping or nesting. We’re actually making a regexp inside the main regexp.

SO, going right back to our previous problem.. Matching either apc40 or apc20..

This would best be set up as follows: “apc(2|4)0”.

SO: “apc” to begin with, this should be logical by now. Then we insert a new separate regexp which basically consists of “2|4”. This means that it will match either the character 2 or the character 4. Which is then followed by a 0.
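Here is the grouped version as a Python sketch (an illustration under the same assumption as before, namely that Python’s re module handles brackets like the Max regexp object does). Besides checking which words fit, it also pulls out what the part between the brackets matched.

```python
import re

# "apc(2|4)0": "apc", then either a 2 or a 4, then a 0.
pattern = re.compile(r"apc(2|4)0")

candidates = ["apc20", "apc40", "apc10", "apcz0", "40"]

# For every full match, remember what the bracketed part matched.
group_digits = {}
for text in candidates:
    match = pattern.fullmatch(text)
    if match:
        group_digits[text] = match.group(1)

print(group_digits)  # {'apc20': '2', 'apc40': '4'}
```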

How to experiment in Max

(Figure: Regular Expression example.)

Obviously you’ll need to use the regexp object. The only problem is that this object has several outlets which by themselves can be confusing.

I’ll explain all of them in more detail in part 2 of my post, but for now let’s globally focus on 3 of them which can be seen here.

(don’t forget that you can hover over outlets to see their description)

From right to left (Max style) we have unmatched. This is an important outlet which will send out data the very moment that your regexp doesn’t match it. Try pushing “shelluser” and you’ll see.

Next is substrings. This can have several meanings but for now we’ll say that this will output the outcome of the regexp search. So if a string matches then it will be sent out through here.

And finally (for now) we have backreferences. Remember where I explained that brackets would give you a regexp “inside” a regexp? Well, the outcome of such an “inner regexp” will be sent out through this outlet. In this case the outcome of the 2|4 regexp.
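If it helps to think about these three outlets in text form, here is a loose analogy in Python. To be clear: this is not how the Max object works internally, the function run_regexp and the three dictionary keys are names I made up for this sketch, and it only mirrors the simplified description given above.

```python
import re


def run_regexp(pattern, text):
    """Hypothetical stand-in for the three outlets described above."""
    match = re.search(pattern, text)
    if match is None:
        # Nothing matched: the input goes out of the "unmatched" outlet.
        return {"substring": None, "backreferences": [], "unmatched": text}
    return {
        "substring": match.group(0),              # the text that matched
        "backreferences": list(match.groups()),   # what each (...) group matched
        "unmatched": None,
    }


hit = run_regexp(r"apc(2|4)0", "apc40")
miss = run_regexp(r"apc(2|4)0", "shelluser")

print(hit)   # {'substring': 'apc40', 'backreferences': ['4'], 'unmatched': None}
print(miss)  # {'substring': None, 'backreferences': [], 'unmatched': 'shelluser'}
```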

Homework

I really hope this post gives you a global and rough idea about regexps. We’re not done, but because I can well imagine that this stuff is pretty daunting for someone new to it I decided to spread this across posts.

Of course I’d appreciate feedback, and also don’t hesitate to ask about stuff which you don’t understand. Remember: there are no stupid questions. The only thing which is stupid is not asking a question if you don’t understand something.

SO I’ll leave you guys with some “homework”. Care to answer this in the comment section?  No worry if someone already responded before you; just jump in and share your idea(s).

What does this regexp do: “synth(fan|user)?” and also why does it do what it does?  (I won’t agree with a mere “it matches this word”).

And there you have it..  I hope this is useful to someone.

Next part will be out in 1 or 2 weeks.


4 comments

  1. Holy SH*T dude!

    That is pretty awesome, I got max but so far did not do much with it but this looks pretty cool.

    Have u put the patch online somewhere? Would like to download (im lazy, I know).

    and it works too!!! Just did your 1st regexp to try and ur right. very cool.

    subscribed

  2. wow… and now i suddenly understood regexps? good post, i actually understand this now!

    u skipped $ and ^? is that for later? I hope it is, I saw it used much too!

    I need to think about ur homework, LOL. Can we try it in m4l?! LOL

    or is that cheating?

  3. Nice post!

    read lots on this but never really got it. this looks pretty easy so far, i finally think i get the purpose…

    @homework

    I think it matches synthfa or synthuse but also synthfan and synthuser. Why: first it does “synth” then it does the other regexp. But the ? makes it so that the last character is optional.

    but i think thats wrong (my answer) cause you never said ? could delete..

    gonna try this for real now LOL

  4. beat me!

    my answer was almost the same, i think that the ? may work on the whole other regexp. so “synth”, “synthfan” and “synthuser” bcause its also optional (0 or one).

    Can we share the a+ ? LOL
