User:Merlin11188/Draft: Difference between revisions
Revision as of 21:11, 11 July 2011
Patterns
Classes
Character Class:
A character class is used to represent a set of characters. The following are character classes and their representations:
- x — Where x is any character other than the magic characters (^$()%.[]*+-?), x represents itself
- . — Represents all characters (#32kas321fslk#?@34)
- %a — Represents all letters (aBcDeFgHiJkLmNoPqRsTuVwXyZ)
- %c — Represents all control characters (all ascii characters below 32 and ascii character 127)
- %d — Represents all base-10 digits (0-9)
- %l — Represents all lower-case letters (abcdefghijklmnopqrstuvwxyz)
- %p — Represents all punctuation characters (#^;,.) etc.
- %s — Represents all space characters
- %u — Represents all upper-case letters (ABCDEFGHIJKLMNOPQRSTUVWXYZ)
- %w — Represents all alpha-numeric characters (aBcDeFgHiJkLmNoPqRsTuVwXyZ0123456789)
- %x — Represents all hexadecimal digits (0123456789ABCDEFabcdef)
- %z — Represents the character with representation 0 (the null character)
- %x — Represents (where x is any non-alphanumeric character) the character x. This is the standard way to escape the magic characters. Any punctuation character (even a non-magic one) can be preceded by a '%' when used to represent itself in a pattern. So, a literal percent sign in a pattern is written "%%"
Here's an example:
    String="Ha! You'll never find any of these (323414123114452) numbers inside me!"
    print(string.match(String, "%d")) -- Find a digit character
Output:
    3
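The escape rule from the list above can be sketched as follows. This is only an illustration; the string and the variable name `price` are made up for the example.

```lua
-- Hypothetical sample string containing several magic characters.
local price = "Total: 19.99 (50% off)"

-- An unescaped '.' matches any character, so it matches the very first one.
print(price:match("."))         --> T

-- '%.' matches only a literal period.
print(price:match("%d+%.%d+"))  --> 19.99

-- '%%' matches a literal percent sign.
print(price:match("%d+%%"))     --> 50%

-- '%(' and '%)' match literal parentheses.
print(price:match("%(.-%)"))    --> (50% off)
```

Without the % escapes, the parentheses would be read as a capture and the period as "any character", so the patterns would mean something quite different.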
An upper-case version of any of these classes results in the complement of that class. For instance, %A will represent all non-letter characters. You can even combine them! Here's another example:
    Martian="141341432431413415072343E334141241312"
    print(Martian:match("%D%d")) -- Find any non-digit character immediately followed by a digit.
Output:
    E3
Modifiers
In Lua, modifiers are used for repetitions and optional parts. That's where they're useful; you can get more than one character at a time:
- + — 1 or more repetitions
- * — 0 or more repetitions
- - — (minus sign) also 0 or more repetitions
- ? — optional (0 or 1 occurrence)
I'll start with the simplest one: the ?. This makes the character class optional: if a matching character is there, it matches one occurrence of it; if not, it matches nothing and still succeeds. That sounds complex, but is actually really simple, so here's an example:
    stringToMatch="Once upon a time, in a land far, far away..."
    print(stringToMatch:match("%a?")) -- Find a letter, but it doesn't have to be there.
    print(stringToMatch:match("%d?")) -- Find a number, but it doesn't have to be there.
Output:
    O -- O, in Once.
      -- An empty string, because the digit didn't need to be there, so the match succeeded without matching anything.
The + symbol used after a character class requires at least one instance of that class, and will get the longest string of that class. Here's an example:
    stringToMatch="Once upon a time, in a land far, far away..."
    print(stringToMatch:match("%a+")) -- Finds the first letter, then matches letters until a non-letter character
    print(stringToMatch:match("%d+")) -- Finds the first number, then matches numbers until a non-number character
Output:
    Once
    nil -- Nil, because the pattern required the digit to be there, but it wasn't, which returns nil.
The * symbol used after a character class is like a combination of the + and ? modifiers. It matches the longest sequence of the character class, but it doesn't have to be there. Here's an example of it matching a floating-point (decimal) number, without requiring the decimal:
    numPattern="%d+%.?%d*"
    --[[ Requires at least one digit, and if there's a decimal point, get it
    (remember: a period is a magic character, so you have to escape it with the % sign),
    and if there are digits after the decimal point, grab them. ]]
    local num1="21608347 is an integer, a whole number, and a natural number!"
    local num2="2034782.014873 is a decimal number!"
    print(num1:match(numPattern))
    print(num2:match(numPattern))
Output:
    21608347 -- Grabbed a whole number, because there wasn't a decimal point or numbers after the decimal point
    2034782.014873 -- Grabbed the floating-point number, because it had a decimal and numbers after it
The - symbol used after a character class is like the * symbol; there's only one difference, actually: It matches the shortest sequence of the character class. Here's an example showing the difference:
    String="((3+4)+3+4)+2"
    print(String:match("%(.*%)")) -- Find a (, then match all characters (the . represents all characters) until the LAST ).
    print(String:match("%(.-%)")) -- Find a (, then match all characters until the FIRST ).
Output:
    ((3+4)+3+4) -- Grabbed everything from the first parenthesis to the last closing parenthesis
    ((3+4) -- Grabbed everything from the first parenthesis to the first closing parenthesis
Sets
- [set] represents the class which is the union of all characters in the set. You define a set with brackets, like [%a%d]. A range of characters may be specified by separating the end characters of the range with a '-'. All classes described above may also be used as components in set. All other characters in a set represent themselves. For example, [%w_] (or [_%w]) represents all alphanumeric characters plus the underscore, [0-7] represents the octal digits, and [0-7%l%-] represents the octal digits plus the lowercase letters plus the '-' character.
The interaction between ranges and classes is not defined. Therefore, patterns like [%a-z] or [a-%%] have no meaning.
- [^set] represents the complement of set, where set is interpreted as above.
The definitions of letter, space, and other character groups depend on the current locale. In particular, the class [a-z] may not be equivalent to %l. In a proper locale, the latter form includes letters such as 'ç' and 'ã'. You should always use the latter form, unless you have a strong reason to do otherwise: It is simpler, more portable, and slightly more efficient.
    Vowel="[AEIOUaeiou]" -- Match a vowel, upper-case or lower-case
    Consonant="[^AEIOUaeiou]" -- The complement of the vowel set: matches ANY non-vowel character, not just consonants
    OctalDigit="[0-7]" -- Match an octal digit. Octal digits: 0,1,2,3,4,5,6,7
    stringToMatch="I have several vowels and consonants, and I'm followed by an octal number: 0231356701"
    print(stringToMatch:match(Vowel))
    print(stringToMatch:match(Consonant))
    print(stringToMatch:match(OctalDigit))
Output:
    I -- First vowel
      -- This is a space; it was the first non-vowel character (after the I).
    0 -- First octal digit, late in the string.
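Here's one more sketch that combines classes, ranges, and escaped characters inside sets, along the lines of the [%w_] and [0-7%l%-] examples described above. The sample string and the variable name `line` are invented for illustration.

```lua
-- Hypothetical sample line; any similar text would do.
local line = "max_speed = 0755 -- octal, low-priority"

-- A class plus a plain character: alphanumerics and the underscore.
print(line:match("[%w_]+"))    --> max_speed

-- A range: the octal digits.
print(line:match("[0-7]+"))    --> 0755

-- A complemented set: first character that is not alphanumeric, '_', or a space.
print(line:match("[^%w_%s]"))  --> =

-- A class plus an escaped '-', anchored to the end of the string.
print(line:match("[%l%-]+$"))  --> low-priority
```

Note that the '-' is escaped as %- so that it is read as a literal minus sign rather than as part of a range.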
Pattern Items
Alright, now it's time to explain what a pattern item is. A pattern item may be:
- a single character class, which matches any single character in the class;
- a single character class followed by '*', which matches 0 or more repetitions of characters in the class. These repetition items will always match the longest possible sequence;
- a single character class followed by '+', which matches 1 or more repetitions of characters in the class. These repetition items will always match the longest possible sequence;
- a single character class followed by '-', which also matches 0 or more repetitions of characters in the class. Unlike '*', these repetition items will always match the shortest possible sequence;
- a single character class followed by '?', which matches 0 or 1 occurrence of a character in the class;
- %n, for n between 1 and 9; such item matches a substring equal to the n-th captured string (see below);
- %bxy, where x and y are two distinct characters; such item matches strings that start with x, end with y, and where the x and y are balanced. This means that, if one reads the string from left to right, counting +1 for an x and -1 for a y, the ending y is the first y where the count reaches 0. For instance, the item %b() matches expressions with balanced parentheses.
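The last two items in the list, %n and %bxy, are easier to see in action. Here's a small sketch; the strings and the variable names (`expr`, `text`, `quoteChar`, `inner`) are made up for the example.

```lua
-- %b() matches from the first '(' to the ')' that balances it.
local expr = "f(a, g(b, c)) + h(d)"
print(expr:match("%b()"))  --> (a, g(b, c))

-- %1 matches whatever the first capture matched.
-- Here the first capture is the opening quote character, so %1 only
-- accepts the same kind of quote as the closing one.
local text = "say 'hello' and \"bye\""
local quoteChar, inner = text:match("(['\"])(.-)%1")
print(quoteChar, inner)    --> '	hello
```

A plain "%(.-%)" would have stopped at the first closing parenthesis and returned (a, g(b, c) instead, which is why %b is the better tool for nested delimiters.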
A pattern cannot contain embedded zeros. Use %z instead.
Pattern:
A pattern is a sequence of pattern items. A '^' at the beginning of a pattern anchors the match at the beginning of the string. A '$' at the end of a pattern anchors the match at the end of the string. At other positions, '^' and '$' have no special meaning and represent themselves. Here's an example of a pattern:
    local Pattern="[%w%s%p]*" -- Get the longest sequence containing alpha-numeric characters, punctuation marks, and spaces.
    local Pattern2="^%a+" -- The string has to start with a sequence of letters.
    x="Hello, my name is Merlin!"
    print(x:match(Pattern))
    print(x:match(Pattern2))
Output:
    Hello, my name is Merlin! -- The entire string contained only alpha-numeric characters, punctuation marks, and spaces!
    Hello -- Matched only the letters at the start of the string.
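The '$' anchor, and what happens when '^' and '$' show up in the middle of a pattern, can be sketched the same way. The strings and variable names (`file`, `caret`) are invented for this example.

```lua
local file = "report.final.txt"

-- '$' at the end: the match has to finish at the end of the string.
print(file:match("%a+$"))     --> txt

-- '^' at the start: the match has to begin at the start of the string.
print(file:match("^%a+"))     --> report

-- Both anchors: the WHOLE string has to be letters, and it isn't.
print(file:match("^%a+$"))    --> nil

-- Anywhere else, '^' and '$' are ordinary characters.
local caret = "2^10 costs $5"
print(caret:match("%d^%d+"))  --> 2^10
print(caret:match("$%d"))     --> $5
```

Using both anchors together is a handy way to check that an entire string fits a pattern, rather than just some part of it.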
Captures
A pattern may contain sub-patterns enclosed in parentheses; they describe captures. When a match succeeds, the substrings of the subject string that match the captures are stored (captured) for future use. Captures are numbered according to their left parentheses. For instance, in the pattern "(a*(.)%w(%s*))", the part of the string matching "a*(.)%w(%s*)" is stored as the first capture (and therefore has number 1); the character matching "." is captured with number 2, and the part matching "%s*" has number 3. Whaaaaat??? Here's a simpler example:
    local number="55"
    print(number:find("%d%d")) -- Find returns the location of the match, not the match itself
    print(number:find("(%d%d)"))
Output:
    1 2 -- The first digit is at number:sub(1,1) and the second digit is at number:sub(2,2)
    1 2 55 -- The 55 is captured and returned.
In the second call, the parentheses mark a capture of one digit immediately followed by another. A capture makes the function return the captured substring in addition to whatever it normally returns (for find, the start and end positions of the match). Whatever is inside the parentheses is the sub-pattern whose match gets captured. Here, %d%d was inside the parentheses, so the substring it matched, 55, was returned along with the 1 and the 2 that find always returns.
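To see the capture numbering at work, here's a sketch that runs the nested pattern "(a*(.)%w(%s*))" on a made-up string, followed by a more everyday use of captures. The subject strings and variable names are assumptions chosen for illustration.

```lua
-- Captures are numbered by the position of their opening parenthesis.
local whole, single, spaces = ("aab1  x"):match("(a*(.)%w(%s*))")

-- Brackets are added around each capture so the spaces are visible.
print(("[%s] [%s] [%s]"):format(whole, single, spaces))  --> [aab1  ] [b] [  ]

-- A more typical use: pull two pieces out of a "key = value" line.
local key, value = ("name = Merlin"):match("(%w+)%s*=%s*(%w+)")
print(key, value)  --> name	Merlin
```

When a pattern has captures, match returns the captures instead of the whole match, in numbered order.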
As a special case, the empty capture () captures the current string position (a number). For instance, if we apply the pattern "()aa()" on the string "flaaap", there will be two captures: 3 and 5.
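The "flaaap" case can be tried directly. In this sketch the variable names (`s`, `first`, `afterLast`) are just illustrative.

```lua
local s = "flaaap"

-- Each empty capture returns a position (a number), not a substring.
local first, afterLast = s:match("()aa()")
print(first, afterLast)  --> 3	5

-- The second position is one past the end of the match,
-- so subtract 1 to get the matched text back out with sub.
print(s:sub(first, afterLast - 1))  --> aa
```

Position captures are useful when you need to know where something was found as well as what was found.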