String Patterns: Difference between revisions
>NXTBoy Knocked out a heading level |
Revision as of 21:39, 17 January 2012
What are String Patterns?
String patterns are, in essence, just strings. What makes them different from ordinary strings, then? String patterns are strings that use special combinations of characters. These character combinations are generally used with functions in the string library, such as 'string.match' and 'string.gsub', to do interesting things with strings. For instance, with string patterns you can do something like this:
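As a minimal sketch (the list of names is made up for illustration), string.gmatch combined with a pattern can walk through a list of people stored in a single string:

```lua
-- Hypothetical data: a list of people kept in one string, no table needed.
local people = "Alice, Bob, Carol"

-- "%a+" matches one or more letters, so each iteration yields one name.
for name in string.gmatch(people, "%a+") do
    print(name)
end
```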
But what makes the code above so cool? Perhaps you've wanted to make a list of people without using a table, or maybe you need to parse a string. String patterns can help do this!
As said before, string patterns are strings that look a little different and are used for a different purpose than ordinary strings. Here we will look at the basic parts that make up a string pattern and what each of them means.
In these examples, we will use the string.match function.
Simple matching
Guess what? You already know some string patterns! Any string that contains no magic characters (these are listed further down) is a pattern that matches itself.
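A small sketch of literal matching, using made-up strings: string.match returns the matched text, or nil when the pattern is not found.

```lua
-- A pattern with no special characters simply matches itself.
print(string.match("Hello world", "world"))  --> world

-- If the text does not occur in the string, the result is nil.
print(string.match("Hello world", "moon"))   --> nil
```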
Character Classes
There's only so far we can go by using this kind of pattern matching. Sometimes, we want to match any of a set of characters. Here's an example:
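One possible illustration, with strings invented for the purpose: %d stands for any digit and %a for any letter, so the same pattern matches different text in different strings.

```lua
-- "%d%a" matches any digit immediately followed by any letter.
print(string.match("Room 7b", "%d%a"))   --> 7b
print(string.match("Seat 3F", "%d%a"))   --> 3F

-- "." matches any single character at all.
print(string.match("abc", "."))          --> a
```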
The following table shows the meaning of each character class:
Pattern | Meaning | Example matches |
---|---|---|
. | Any character | #32kas321fslk#?@34 |
%a | Uppercase or lowercase letter | aBcDeFgHiJkLmNoPqRsTuVwXyZ |
%l | Lowercase letter | abcdefghijklmnopqrstuvwxyz |
%u | Uppercase letter | ABCDEFGHIJKLMNOPQRSTUVWXYZ |
%p | Punctuation character | #^;,. |
%w | Alphanumeric characters - either a letter or a digit | aBcDeFgHiJkLmNoPqRsTuVwXyZ0123456789 |
%d | Digits | 0123456789 |
%s | Whitespace character | space, \t, \n, and \r |
%c | Control character | non-printing characters such as \0 and \n |
%x | Hexadecimal (base 16) digits | 0123456789ABCDEFabcdef |
%z | The NUL character | \0 |
Any non-magic character (one that is not among ^$()%.[]*+-?) represents itself in a pattern. To search for a literal magic character, precede it with a percent sign. For example, to look for a percent symbol, use %%.
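A short sketch of escaping, using made-up strings: without the percent sign, a magic character keeps its special meaning.

```lua
-- "%%" matches a literal percent sign.
print(string.match("50% off", "%d+%%"))  --> 50%

-- An unescaped "." matches any character, so "_txt" is accepted...
print(string.match("my_txt", ".txt"))    --> _txt

-- ...while "%." only matches a real dot, which this string lacks.
print(string.match("my_txt", "%.txt"))   --> nil
```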
One of the things you might notice about the character classes above is that they are all lowercase. Making them uppercase reverses their effect. For instance, %s represents whitespace, but %S represents any non-whitespace character. %l represents a lowercase letter, while %L represents its complement: any character except a lowercase letter. Let's look at this example, which matches a digit followed by four non-digits:
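A minimal sketch with an invented string; %d is a digit and each %D is any character that is not a digit:

```lua
-- One digit followed by exactly four non-digits.
print(string.match("abc1defg", "%d%D%D%D%D"))  --> 1defg

-- No digit here is followed by four non-digits, so there is no match.
print(string.match("12345", "%d%D%D%D%D"))     --> nil
```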
Quantifiers
Character classes allow you to match any one character from a category. Quantifiers allow you to match a variable number of those characters.
Pattern | Meaning |
---|---|
? | Match 0 or 1 of the preceding character specifier |
* | Match 0 or more of the preceding character specifier |
+ | Match 1 or more of the preceding character specifier |
- | Match 0 or more of the preceding character specifier, as few as possible |
+
Let's say you have a string that contains a number, such as "It costs 100 tix", and you want to extract the number. If you know how many digits the number has, you could use the pattern %d%d%d which would match three digits in a row. But what happens if you don't know how many digits there are? For this, you can use quantifiers. In this example, the + quantifier is suitable.
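A sketch of both approaches; the second string is an extra made-up case showing where the fixed-length pattern falls short:

```lua
-- Three digits in a row works only when the number has at least three digits.
print(string.match("It costs 100 tix", "%d%d%d"))  --> 100
print(string.match("It costs 25 tix", "%d%d%d"))   --> nil

-- "%d+" matches one or more digits, whatever the length.
print(string.match("It costs 100 tix", "%d+"))     --> 100
print(string.match("It costs 25 tix", "%d+"))      --> 25
```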
Now how does this work exactly? As we know, a character class followed by a '+' matches one or more repetitions. In this example, that means the pattern matches the first run of digits it finds, stopping at the first non-digit or at the end of the string.
*
The difference between + and * is that + matches 1 or more characters, while * matches 0 or more. This means that if the character class followed by this quantifier does not appear in the string, the match can still succeed, because no occurrences are required.
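A small sketch, assuming the pattern %d%p*%d and two made-up strings, one with punctuation between the digits and one without:

```lua
-- First example: two punctuation characters sit between the digits.
print(string.match("1..2", "%d%p*%d"))  --> 1..2

-- Second example: no punctuation at all, and * still allows a match.
print(string.match("12", "%d%p*%d"))    --> 12

-- With + at least one punctuation character is required.
print(string.match("12", "%d%p+%d"))    --> nil
```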
As you can see, it matches a digit, punctuation characters (if there are any), and then another digit. If you had used +, the second example would have returned nil, because + requires at least one match. The * pattern is very useful when you have something in the string that is optional.
-
Unlike * and +, - matches the shortest possible sequence. For example, if you have a path name, and you want to retrieve a part of the string between /s, then you can use the - item. This example shows you the difference you'd get if you used '-' compared to the '*' item.
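A sketch of the comparison, using a hypothetical path; the parentheses are a capture, so only the text between the slashes is returned:

```lua
local path = "/usr/local/bin/lua"  -- hypothetical path name

-- ".-" takes as few characters as possible: it stops at the second "/".
print(string.match(path, "/(.-)/"))  --> usr

-- ".*" takes as many as possible: it stops only at the last "/".
print(string.match(path, "/(.*)/"))  --> usr/local/bin
```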
From the example, you see that the - found the shortest possible sequence and stopped at the second /, while the * matched the longest sequence and stopped only at the last / in the string.
?
The '?' pattern item is quite different from the others because it matches only 0 or 1 occurrence of the preceding character class. It is used to make a character in the string optional. This makes it a bit like the '*' item, except that instead of matching 0 or more occurrences, it matches at most 1.
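A minimal sketch with three made-up strings, assuming the pattern %d%p?%d (a digit, an optional punctuation character, then another digit):

```lua
-- One dot between the digits: "?" matches it.
print(string.match("1.2", "%d%p?%d"))    --> 1.2

-- No punctuation: "?" matches nothing and the digits still match.
print(string.match("12", "%d%p?%d"))     --> 12

-- Two dots: "?" can cover only one of them, so no match can start at "1".
-- The search moves on and succeeds right after the dots.
print(string.match("1..23", "%d%p?%d"))  --> 23
```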
From the example, you can see that the '?' item matches either 0 or 1 occurrence. In the first string, there is a single dot between the numbers, which this pattern item matches. In the second string, there is no punctuation at all, so the punctuation is skipped over. Finally, in the third example, the '?' item can match the first dot but not the second, so no match is possible starting at the first digit. In this case, that part of the string is skipped over and a match is found immediately after the two dots.
Sets
Sets are used when a single character class cannot do the whole job. For instance, you might want to match both lowercase letters (%l) as well as punctuation characters (%p) using a single class. So how would we do this? Let's take a look at this example:
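A short sketch with an invented string; the set [%l%p] accepts a character if it is either a lowercase letter or a punctuation character:

```lua
-- One or more characters, each lowercase OR punctuation.
print(string.match("hello, WORLD", "[%l%p]+"))  --> hello,

-- Without the brackets this is a sequence: a lowercase letter, then punctuation.
print(string.match("hello, WORLD", "%l%p"))     --> o,
```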
As you can see from the example, sets are defined by the '[' and ']' around them. You can also see that the classes for lowercase letters and punctuation are contained within. This means that the set acts as a class that represents both lowercase letters and punctuation, unlike %l%p, which would match the sequence of a lowercase letter followed by a punctuation character.
You aren't restricted to using only character classes, though! You can also use normal characters to add to the set. Also, you can specify a range of characters with the '-' symbol. Let's see how this works in the following example:
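A sketch using string.gmatch on two made-up strings, s1 and s2: the first set mixes a character class with a literal character, and the second uses a range.

```lua
local s1 = "The quick_brown fox"
local s2 = "Room 101, floor 3"

-- [%a_] is any letter or a literal underscore.
for word in string.gmatch(s1, "[%a_]+") do
    print(word)  -- The, quick_brown, fox (one per line)
end

-- [0-9] is the range of characters from "0" through "9".
for number in string.gmatch(s2, "[0-9]+") do
    print(number)  -- 101, then 3
end
```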
From the example, you can see how string.gmatch picked pieces out of strings s1 and s2 using the string patterns. And yet, there's still one last thing you can do. As with character classes, sets have complements.
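A minimal sketch of a complemented set, assuming a made-up input string; the pieces that match are joined back together so the effect is easy to see:

```lua
local s = "Hel29lo World 1"  -- hypothetical input
local pieces = {}

-- [^%s1-9] is any character that is NOT whitespace and NOT a digit from 1 to 9.
for piece in string.gmatch(s, "[^%s1-9]+") do
    pieces[#pieces + 1] = piece
end

print(table.concat(pieces))  --> HelloWorld
```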
This pattern is the complement of [%s1-9]. As seen from the example, the complement of a set is defined by using the '^' character at the beginning of the set. All this does is reverse the meaning of the set. As you can see from this example, the spaces, the number 29 in the middle of 'Hello', and the 1 at the end were removed.
Captures
Captures are used to get the pieces of a string that match a particular part of a pattern. Captures are defined by parentheses around that part. For instance, (%a%s) is a capture for a letter followed by a space character. When a capture is matched, the matching text is stored for future use. Let's look at this example:
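One way this might look, with a made-up date string: when a pattern contains captures, string.match returns the captured pieces instead of the whole match.

```lua
-- Three captures: day, month name, and year.
local day, month, year = string.match("17 January 2012", "(%d+) (%a+) (%d+)")

print(day)    --> 17
print(month)  --> January
print(year)   --> 2012
```

Note that the captured values are strings, so day and year would need tonumber before being used in arithmetic.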
Now what happens if you want to get a list by using captures? You can use string.gmatch to do this.
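A sketch of iterating over captures, assuming a made-up string of key=value pairs; the loop variables key and val receive the first and second capture of each match:

```lua
local settings = "name=Bob, age=12"  -- hypothetical input

-- Each match of "(%w+)=(%w+)" yields two captures per iteration.
for key, val in string.gmatch(settings, "(%w+)=(%w+)") do
    print(key, val)
end
-- First iteration:  name  Bob
-- Second iteration: age   12
```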
Note that 'key' and 'val' actually refer to capture 1 and capture 2. The names do not matter, but it is still good practice to choose relevant ones. As you can see, string.gmatch iterated through all the matches in the string and returned only the captures. That is what captures are for: picking out a certain part of the string to use.
A final thing you can do with captures is leave them empty. An empty capture, written (), captures the current position in the string. This means that, unlike non-empty captures, it returns a number instead of a string. Look at this example:
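A minimal sketch with a made-up string. string.find returns the start and end of the whole match first, followed by any captures:

```lua
local s = "Hello world!"

-- Ordinary captures return the matched characters.
local _, _, first, last = string.find(s, "(H).-(!)")
print(first, last)            -- H and !

-- Empty captures return positions (numbers) instead.
local start, finish, pos1, pos2 = string.find(s, "()H.-()!")
print(start, finish)          -- 1 and 12: bounds of the whole match
print(pos1, pos2)             -- 1 and 12: where "H" and "!" are located
print(type(pos1))             --> number
```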
From the example, once a match was found, string.find returned, after the start and end indices of the match, the positions of the first and second captures in the string instead of the characters 'H' and '!'.