Let’s take a moment to talk about regular expressions, or regex as the cool kids call it. As a software developer you probably already know about them, or at least you should. Either way, we will start with a short overview of what they are and how they are used, and after that we will go into the specifics of using NSRegularExpression, the Cocoa class that brings regex support to iOS.

What are regular expressions?

Regular expressions come from a whole branch of mathematics, formal language theory. There is a very long and very extensive theory describing what they are and what properties they have. Luckily enough, I am not going to go into details on that, mainly because I don’t want to bore you to death with the second paragraph of my post. Also, for your daily programming needs, you will not need to know all that; it should be enough to know roughly what they are and how to use them. Intuitively, a regular expression represents a search pattern that matches certain portions of text. In fact, when you enter a word in any search bar (on any device), you are effectively typing in a very simple regular expression. You specify a keyword and it matches every location in the text (a web page for instance) where that particular sequence of letters is encountered. That’s essentially what regex does, however it also allows you to construct far more complex patterns, like “All lines containing fewer than 80 characters” or “All words in quotes”. This ability is very powerful even though you might not realize it at first. System administrators use it to list specific files or to filter logs for relevant information. In programming, regular expressions can be used for complex text processing and analysis. The sky’s the limit with regular expressions.

How to construct regular expressions?

In this section we are going to discuss how to create regular expression statements. For now, we are not going to go into Objective-C specifics. What you read here can be applied to most programming languages: they share the same core rules, so the patterns below should behave much the same whether you use Bash, Perl or Objective-C. Keep in mind, though, that dialects differ in the details (not every tool supports lookahead, for example).

For the most part, regular expression statements seem very intimidating initially with their strange syntax, but once you get used to it, they are not that bad.

As I mentioned, a regular expression is a string containing a pattern that is matched against portions of the text it is applied to. The string is a combination of “literal” characters and special ones that stand for any text with certain properties. You can think of the matching process as a series of substitutions that, if successful, determine that a certain substring can be included in the results. For instance, the “.” character will match any character. If used in a regex like “te.t”, the framework will start looking for sections in the form of “te_t” and will allow the third character to be anything. Results might be words like “test”, “text” and “teet”. However, note that the word “tet” will NOT match, because “.” specifies that it doesn’t matter which character is there, as long as there is something. If you want to include “tet” in the results you have to use another piece of regex magic. But more on that later. Now, let’s see a few more options:

“\”:

If you want to use a regex symbol literally (without its special meaning, like the “.” in the previous example), you can prefix it with “\” to “escape” it.

“.”:

Matches any single character, but as already discussed, does not match if there is no character at all. By default it also does not match a line break.

“^”:

Matches the beginning of the line. Line-oriented tools such as grep process each line separately. Many libraries, NSRegularExpression included, treat the input as one string, and there “^” matches only the very beginning of the input unless a multi-line option is turned on. It is important to have that in mind.

“$”:

Matches the end of the line, that is, the position just before the “\n” character (the line break itself is not part of the match).

“{n,m}”:

Indicates that the character or group to the immediate left can be matched n to m times. For instance, “a{2,3}” will match “aa” and “aaa”. “a{2,}” can also be used to indicate “two or more”.

“|”:

Or. Matches either the expression on its left or the expression on its right.

“*”:

Matches the character or group to the immediate left zero or more times. Note that, by default, regular expressions are greedy: they will match as many characters as possible and won’t stop as soon as a match is found.

“+”:

Matches the character or group to the immediate left one or more times. Note that, by default, regular expressions are greedy: they will match as many characters as possible.

“?”:

Matches the character or group to the immediate left one or zero times. Again, because regular expressions are greedy, “?” will match one, instead of zero times if both are available.

“\A”:

Matches the beginning of the input text.

“\z”:

Matches the end of the input text.

“(…)”:

Creates a group with the expression inside the parentheses. It is used in conjunction with operators like “*” and “?” so that they apply to more than one character.

“(?!…)”:

Matches only if the expression inside the parentheses does not match at the current position. It only looks ahead and does not consume any characters itself.

“[…]”:

Matches any single character in the set (any character that is between the brackets).

“[^…]”:

Matches any single character that is not in the brackets.
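If you would like to try the operators above right away, here is a minimal sketch that jumps ahead a little and uses NSRegularExpression (covered in detail later in this post). RXAllMatches is a hypothetical helper name of my own, not part of Cocoa; it simply returns every substring the pattern matches.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not a Cocoa API): returns every substring of `text`
// that is matched by `pattern`, or an empty array if the pattern is invalid.
static NSArray<NSString *> *RXAllMatches(NSString *pattern, NSString *text)
{
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return @[];
    }

    NSMutableArray<NSString *> *found = [NSMutableArray array];
    NSRange whole = NSMakeRange(0, text.length);
    // Each result carries the range of one match inside `text`.
    for (NSTextCheckingResult *match in [regex matchesInString:text options:0 range:whole]) {
        [found addObject:[text substringWithRange:match.range]];
    }
    return found;
}
```

Under these assumptions, RXAllMatches(@"te.t", @"test text teet tet") should give “test”, “text” and “teet”, and RXAllMatches(@"colou?r", @"color colour colr") should give “color” and “colour”.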

There are, of course, other operators, but let’s not get carried away (I’ll leave them as homework :) ). The operators above are more than enough to get you started and if you someday need anything else, you can always look it up.

With the boring stuff out of the way, let’s get started with some examples.

Finding all single-line comments in source code:

This example represents one of the simpler tasks you can accomplish using regular expressions: matching everything, provided that it contains several exact characters. More specifically, in this example we need to match all characters on a given line, starting with a “//” sequence. Any ideas…? Anyway, this is the solution:

“//.*”

So, how does that work? First of all we explicitly say that we want the results to begin with “//”. After that the “.*” expression specifies that we want to include everything until the end of the line. Remember that “.” will match any character and the “*” will match the character to the left zero or more times. That means any character after the “//” sequence will be included. Since regex is greedy, the matching will not stop until it encounters a character that no longer satisfies the pattern. Also, as mentioned, “.” does not match a line break by default, so each result ends just before the closest “\n” character.
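To see the comment pattern in action from Objective-C, here is a small sketch. RXLineComments is a hypothetical helper name I made up for illustration. Note that this naive pattern will also pick up a “//” inside a string literal or a URL, so treat it as a demonstration rather than a real parser.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: collects the text of every "//" comment in `source`.
static NSArray<NSString *> *RXLineComments(NSString *source)
{
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"//.*"
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Could not create regex: %@", error);
        return @[];
    }

    NSMutableArray<NSString *> *comments = [NSMutableArray array];
    // "." does not cross line breaks by default, so each match ends with its line.
    [regex enumerateMatchesInString:source
                            options:0
                              range:NSMakeRange(0, source.length)
                         usingBlock:^(NSTextCheckingResult *match, NSMatchingFlags flags, BOOL *stop) {
        if (match != nil) {
            [comments addObject:[source substringWithRange:match.range]];
        }
    }];
    return comments;
}
```

For the input @"int a = 1; // counter\n// whole line\nint b;" this should return “// counter” and “// whole line”.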

Finding all instances of the word “Obj-C” or “Objective-C”:

In this example, we are going to try using the logical OR in order to match two different patterns at the same time. This is fairly straight-forward and you should be able to figure it out for yourself, but here it is:

“(Obj-C|Objective-C)”

The “|” character here indicates that a string has to match either operand in order to match the pattern. The parentheses are not mandatory here, but most of the time they are used because the alternatives are part of a bigger expression and the parentheses mark where the “or” statement begins and ends.
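As a quick illustration of the “or” pattern, the sketch below counts how often either spelling appears in a text. RXCountObjCMentions is a hypothetical name; numberOfMatchesInString:options:range: is the NSRegularExpression method that does the counting.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: counts mentions of "Obj-C" or "Objective-C" in `text`.
static NSUInteger RXCountObjCMentions(NSString *text)
{
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"(Obj-C|Objective-C)"
                                                  options:0
                                                    error:NULL];
    if (regex == nil) {
        return 0; // The pattern is a constant, so this should not happen.
    }
    return [regex numberOfMatchesInString:text
                                  options:0
                                    range:NSMakeRange(0, text.length)];
}
```

RXCountObjCMentions(@"Obj-C is short for Objective-C.") should return 2.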

Finding opening XML tags:

We will have to match all substrings starting with “<” and ending with “>”, containing a tag name and a series of attributes. So the following should be enough, right?

“<.+>”

That was easy, wasn’t it? Well… no… What will happen to a string like:

“<tag1 attr1=”true” <tag2/>>”

It certainly does not look like an XML tag, yet, our regular expression is going to match it. OK, let’s try again. An XML tag is a substring that starts with “<” and ends with “>” and does not contain other instances of “<” and “>”. That’s a little more complicated, isn’t it? Truth is that many real world problems require just that. So let’s resolve the issue. We need to add an additional restriction in our pattern. We have to accept all character sequences that do not include other tags. Now scroll up a little and read the “(?!…)” section once again. It allows us to specify a pattern that we don’t want present in the final result. Using it, we complete the task by using the following pattern:

“<((?!<).)*>”

That’s better! The day is sunny once again and… wait! What if I had to match the following string:

“<tag1 attr=”chicken”> 3 > 2 <tag2 enabled=”false”>”

The first match is “<tag1 attr=”chicken”> 3 >”, stray text included. But why??? Well, as I hinted multiple times, regular expressions are greedy. They will match as much text as possible. So when processing reaches “<tag1 attr=”chicken”>” the pattern already matches; however, the algorithm keeps adding characters to the result for as long as the text still conforms to the regular expression, and then backs up only as far as the last “>” it has seen. In this case that is the “>” after the “3”, which is why the match runs past the end of the tag. We need to figure out a way to stop the regex after the first closing symbol. To do that, we will use yet another piece of magic: “*?”. Intuitively, this says “Match the preceding expression zero or more times, but prefer fewer repetitions”. Note that the “?” here does not carry its usual “one or zero times” meaning: placed directly after a quantifier such as “*”, it turns that quantifier from greedy to lazy, so it stops as soon as the rest of the pattern successfully matches.

So let’s see what we end up with:

“<((?!<).)*?>”
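The difference between the greedy and the lazy version is easiest to see side by side. The following is a minimal sketch, assuming a hypothetical helper RXFirstMatch that returns the first match of a pattern (or nil), applied to a text in which a stray “>” follows the tag.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the first substring of `text` matched by
// `pattern`, or nil if the pattern is invalid or nothing matches.
static NSString *RXFirstMatch(NSString *pattern, NSString *text)
{
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:0
                                                    error:NULL];
    if (regex == nil) {
        return nil;
    }
    NSTextCheckingResult *match =
        [regex firstMatchInString:text options:0 range:NSMakeRange(0, text.length)];
    return match ? [text substringWithRange:match.range] : nil;
}
```

With text set to @"<tag1 attr=\"chicken\"> 3 > 2 <tag2 enabled=\"false\">", the greedy pattern @"<((?!<).)*>" should return the tag plus the trailing “ 3 >”, while the lazy pattern @"<((?!<).)*?>" should return only the tag itself.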

But wait! There’s more! Although this will work fine for the most part, there is a little detail we need to consider. I already mentioned that, by default, “.” does not match a line break, so a single match cannot run past the end of a line. So if we had an XML with several lines like this:

“<tag1 enabled=”true”

required=”false”/> Hello <tag2 />”

The regex will return only one result, “<tag2 />”, because the first tag contains a line break that “.” refuses to cross (unless you tell it to, but more on that later). So if this is important to you, you will have to modify your expression. Let’s see how (I promise there will be no more surprises):

“(^((?!(<|>)).)*>|<((?!<).)*?(>|$))”

Wow! That escalated quickly. Let me explain. What’s new here is that we attempt to handle situations where the closing/opening tag character is not on the same line. That means text from the beginning of the line to the first closing symbol (provided that it does not contain “<”) as well as text from the last “<” to the end of the line if a closing “>” is missing. You should already know that “^” matches the beginning of a line and “$” matches the end of it. (In a library that treats the whole input as one string, they only do so when its multi-line option is enabled; in Cocoa that option is NSRegularExpressionAnchorsMatchLines.) The rest is nothing new as a concept, so you should be able to figure it out… eventually.
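If you want to check this monster from code rather than from a text editor, here is a hedged sketch. RXTagFragments is a hypothetical helper name, and the NSRegularExpressionAnchorsMatchLines option is my assumption about what is needed so that “^” and “$” apply to every line instead of only to the whole input.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns complete tags plus the per-line pieces of a
// tag that is split across two lines.
static NSArray<NSString *> *RXTagFragments(NSString *xml)
{
    NSString *pattern = @"(^((?!(<|>)).)*>|<((?!<).)*?(>|$))";
    NSError *error = nil;
    // AnchorsMatchLines makes "^" and "$" match at the start/end of each line.
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionAnchorsMatchLines
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Could not create regex: %@", error);
        return @[];
    }

    NSMutableArray<NSString *> *fragments = [NSMutableArray array];
    for (NSTextCheckingResult *match in
         [regex matchesInString:xml options:0 range:NSMakeRange(0, xml.length)]) {
        [fragments addObject:[xml substringWithRange:match.range]];
    }
    return fragments;
}
```

For a two-line input such as @"<tag1 enabled=\"true\"\nrequired=\"false\"/> Hello <tag2 />" this should produce three results: the first half of tag1, its second half, and “<tag2 />”. Stitching the halves back together is left to the caller.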

You might have noticed that the last expression handles a tag with two lines, but what about three… or four? But let’s not get carried away.

So what seemed like an easy task turned out to be a lot harder than expected. We started with a simple expression with four characters and ended up with a huge pattern that might give you a headache just looking at it. Truth is, in my experience, this is not an uncommon thing to happen while using regex. Not that I want to discourage you, I just want you to be prepared, because regular expressions are coming for you… and there is nothing you can do to stop them! Sooner or later every programmer has to face them.

Regular expressions in iOS (NSRegularExpression)

Before I begin, I would like to share a bit of personal advice. When you start using regular expressions in your source code, in my experience, debugging is not enough. You have to be sure your patterns are correct before you start using them in your program. There is no easy way to determine what’s wrong in the matching process so it would be difficult to find errors by conventional methods. I would recommend that you fire up your favorite text editor and test your expressions there – it’s a lot more visual and you can quickly see what the problem is. I personally use TextMate, but you can probably use whatever editor you currently have. Unfortunately, the built-in TextEdit application does not support regex, but Xcode does.

Figure 2: Regular Expressions in TextMate

Just press CMD + F to go to the “find” menu, click on the magnifying glass, choose “Edit find options…” and select “Regular Expression” as “Matching Style”. You can even use grep if you are not afraid of the terminal.

Figure 1: Regular Expressions in Xcode

 

Now that we covered some regex basics, it is time to dig into Cocoa’s implementation – the NSRegularExpression class. Let’s start with an example and afterwards I will explain what everything does:

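NSError* regexError = nil;
// Compile the pattern; "." is allowed to cross line breaks so tags may span lines.
NSRegularExpression* regex = [NSRegularExpression regularExpressionWithPattern:@"<((?!<).)*?>"
                                                                       options:NSRegularExpressionCaseInsensitive|NSRegularExpressionDotMatchesLineSeparators
                                                                         error:&regexError];

// A nil result is the failure signal; the error only describes what went wrong.
if (regex == nil)
{
    NSLog(@"Regex creation failed with error: %@", [regexError description]);
    return nil;
}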

NSArray* matches = [regex matchesInString:self.htmlString 
                                  options:NSMatchingWithoutAnchoringBounds 
                                    range:NSMakeRange(0, self.htmlString.length)];

So, the code above will apply the regular expression we discussed in the previous section, using Objective-C. You will see that since all the heavy lifting of creating the pattern is done, applying it is relatively simple. Not surprisingly, first you create an instance of the NSRegularExpression class and after that you apply it to an NSString. There is not much more to it than that.

Let’s see how NSRegularExpression objects are created. In the example above, we use a class method in order to get an autoreleased (if you’re not using ARC) regular expression object. You have to supply a string containing the pattern. As far as I know, there is no other way to supply the regex to the object (unlike NSPredicate, where there is a way to specify conditions without hardcoding a string). The second parameter lets you set options that apply to the search. In this case, we want our matching process to ignore character case, so we are using NSRegularExpressionCaseInsensitive. This is the standard option you are most likely to see in tutorials and is probably enough for your day-to-day regex needs. However, in our case, we are going to add another option: NSRegularExpressionDotMatchesLineSeparators. Please note that in the code snippet, we didn’t use the final version of the regular expression from the previous section. We used the one before that, the one that didn’t account for XML tags spanning two lines. The problem there was that “.” does not match line separators by default, so the pattern couldn’t handle results that contain new line characters. By adding the NSRegularExpressionDotMatchesLineSeparators option, we allow “.” to match them too, so matching continues even after it reaches the end of the line.
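To make the effect of NSRegularExpressionDotMatchesLineSeparators visible, here is a small sketch that runs the same pattern with and without the option. RXFindTags and its spanLines parameter are hypothetical names used only for this illustration.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: finds tags in `xml`. When `spanLines` is YES, "." is
// allowed to match line separators, so a tag may continue on the next line.
static NSArray<NSString *> *RXFindTags(NSString *xml, BOOL spanLines)
{
    NSRegularExpressionOptions options = NSRegularExpressionCaseInsensitive;
    if (spanLines) {
        options |= NSRegularExpressionDotMatchesLineSeparators;
    }

    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"<((?!<).)*?>"
                                                  options:options
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Regex creation failed with error: %@", error);
        return @[];
    }

    NSMutableArray<NSString *> *tags = [NSMutableArray array];
    for (NSTextCheckingResult *match in
         [regex matchesInString:xml options:0 range:NSMakeRange(0, xml.length)]) {
        [tags addObject:[xml substringWithRange:match.range]];
    }
    return tags;
}
```

Given @"<tag1 enabled=\"true\"\nrequired=\"false\"/> Hello <tag2 />", calling RXFindTags with spanLines set to NO should find only “<tag2 />”, while YES should find both tags, the first one including its line break.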

Note: Always keep in mind that, unless you say otherwise, “.” stops at line breaks and “^” and “$” anchor to the whole input, because this might cause frustration when a piece of text does not match, especially in larger input texts. When you test your code, you are more likely to use shorter, single-line examples where everything is fine, but as soon as your code encounters some real-world input, it starts failing for no apparent reason. Consider yourself warned!

The last parameter that is required in order to create an NSRegularExpression can be used to obtain a reference to an error that occurred while creating the regex. Now, I don’t see how your users will react to an “Unable to create regex” error, but at the very least, you can log the error in the console for troubleshooting.

With the regular expression created, it is time to apply it to some text. In this example, we obtain an array containing all results, but the API provides a whole range of methods to fit your needs: finding matches, number of matches, first match, replacing matches… you name it. You can find the complete list on the NSRegularExpression reference page.
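As one example of those other methods, here is a minimal sketch of replacing matches. RXStripTags is a hypothetical helper name; stringByReplacingMatchesInString:options:range:withTemplate: is the NSRegularExpression method doing the work, and the empty template simply deletes every match.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns `html` with everything that looks like a tag
// removed. Not a real HTML sanitizer, just a replacement demo.
static NSString *RXStripTags(NSString *html)
{
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"<((?!<).)*?>"
                                                  options:NSRegularExpressionDotMatchesLineSeparators
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Regex creation failed with error: %@", error);
        return html;
    }
    // Replace every match with the empty string.
    return [regex stringByReplacingMatchesInString:html
                                           options:0
                                             range:NSMakeRange(0, html.length)
                                      withTemplate:@""];
}
```

RXStripTags(@"<b>Hello</b> <i>world</i>") should return “Hello world”.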

The rest is fairly simple. You supply the text you want to use as input, specify the range of the string you want to restrict the matching to, as well as some additional settings (the NSMatchingOptions flags, explained in the same reference).
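One detail about the range parameter that is easy to miss: the ranges you get back are expressed relative to the whole string, not to the sub-range you searched. The sketch below, with the hypothetical helper RXMatchLocations, returns the start index of every match found inside a given range.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the start index (as NSNumber) of every match
// of `pattern` found inside `range` of `text`.
static NSArray<NSNumber *> *RXMatchLocations(NSString *pattern, NSString *text, NSRange range)
{
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:0
                                                    error:NULL];
    if (regex == nil) {
        return @[];
    }

    NSMutableArray<NSNumber *> *locations = [NSMutableArray array];
    for (NSTextCheckingResult *match in [regex matchesInString:text options:0 range:range]) {
        // match.range is in the coordinates of the full string.
        [locations addObject:@(match.range.location)];
    }
    return locations;
}
```

For the text @"<a> <b> <c>" and the tag pattern @"<((?!<).)*?>", searching NSMakeRange(4, 7) should report locations 4 and 8, while searching the whole string should report 0, 4 and 8.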

Conclusion

It’s been a long post, but we (or at least some of us) finally made it. Regular expressions in Objective-C. You might feel intimidated by them and I don’t blame you. But understanding them takes time. And it certainly takes more than one blog post. That’s the reason I wrote it in the first place. Personally, I cannot remember how many I’ve read and I still have a long way to go. It is always good to learn about everybody’s take on regular expressions. And this is mine… I hope you enjoyed this article and found it useful.

Thanks for reading!
