This post is going to explore the different ways of parsing complex expressions in Lua.
Examples are intended to be used in a script file or a plugin “script” section, not in “send to script” in MUSHclient. The reason for that is that “send to script” would need “%” and “\” symbols to be doubled to work correctly.
Note: I will use the abbreviation regexp for a regular expression and LPEG for an LPEG construction.
Also note: Strictly speaking, in Lua they are patterns rather than regexps; however, they are close enough that I’ll stick to the word “regexp” here.
Regular expressions help match on complex lines. A normal string compare suffices for things like:
The door is closed.
But if there is variable data, you need to be able to use some sort of wildcards, eg.
You killed the kobold and got 10 experience.
In MUSHclient triggers and aliases, you can use “simple” wildcards like this:
You killed the * and got * experience.
However internally they get turned into a regexp, so I won’t cover them here.
We will assume that you have loaded the “tprint” (table print) module like this:
require "tprint"
Let’s try to match the example line:
You killed the kobold and got 10 experience.
We’ll make that test string into a Lua variable:
target = "You killed the kobold and got 10 experience."
print (string.match (target, "You killed the .+ and got .+ experience%.")) --> Output: You killed the kobold and got 10 experience.
In regexps, the “.” matches any single character, and the “+” means “one or more” of the item it follows.
The final period has to be “escaped” because we want to literally match a period, not “anything”. In Lua regexps, the “%” character escapes the character after it.
The output is the entire matching text. If there was no match we would get nil as a result.
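Because a failed match returns nil, a script would normally test the result before using it. Here is a minimal sketch of that; the helper name describeKill is made up for illustration:

```lua
-- Hypothetical helper: reports whether a line is a "kill" message.
function describeKill (line)
  local matched = string.match (line, "You killed the .+ and got .+ experience%.")
  if matched then
    return "matched: " .. matched
  end
  return "no match"   -- string.match returned nil
end -- describeKill

print (describeKill ("You killed the kobold and got 10 experience."))
print (describeKill ("The door is closed."))   --> no match
```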
The same match using a PCRE regexp looks like this:
line = rex.new ("You killed the .+ and got .+ experience\\.")
s, e = line:match (target)
print (s, e)--> Output: 1 44
In this case we know the start and end matching columns. If there was no match we would get nil as a result.
The final period has to be “escaped” because we want to literally match a period, not “anything”. In PCRE regexps, the “\” character escapes the character after it. Since it is in a Lua string we have to double it, because otherwise Lua interprets that as escaping the next symbol (the period). If you are using a regexp inside a trigger or alias “match” field you don’t need to double the backslashes.
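One way to avoid doubling the backslashes in a script file is Lua's long-bracket string syntax, which does not process escapes. This small sketch only compares the two spellings as plain strings (the variable names are mine), so it does not need the rex library:

```lua
-- An ordinary quoted string needs "\\" to hold a single backslash.
quoted = "You killed the .+ and got .+ experience\\."

-- A long-bracket string takes the backslash literally.
bracketed = [[You killed the .+ and got .+ experience\.]]

print (quoted == bracketed)   --> true
print (bracketed)             --> You killed the .+ and got .+ experience\.
```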
Things get a little more complex with LPEG. First, let’s pull in some functions and table items as local variables, to save typing:
require "lpeg" -- get the LPEG module - not needed for MUSHclient which has it built in
-- save typing function names with "lpeg" in front of them:
local P, V, Cg, Ct, Cc, S, R, C, Cf, Cb, Cs =
lpeg.P, lpeg.V, lpeg.Cg, lpeg.Ct, lpeg.Cc, lpeg.S, lpeg.R, lpeg.C, lpeg.Cf, lpeg.Cb, lpeg.Cs
-- character classes
lpeg.locale (lpeg) -- get digit, alpha, etc.
local alpha, cntrl, digit, graph, lower, punct, space, upper, alnum, xdigit =
lpeg.alpha, lpeg.cntrl, lpeg.digit, lpeg.graph, lpeg.lower, lpeg.punct,
lpeg.space, lpeg.upper, lpeg.alnum, lpeg.xdigit
Now, the “P” function makes a pattern according to what you supply it. The simple case is a string literal, so that P"foo" matches “foo”.
Matching the variable words like “kobold” and “10” is a bit more complex. We can use the pattern P(1) to match a single character (any character). But we might have more than one character (indeed, “kobold” is 6 characters). Now in LPEG we can do this to match “one or more” characters:
P(1)^1
The trouble is, that is a “greedy” match so it will consume the rest of the line.
Since LPEG does not do backtracking, we have to do this a different way.
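We either need to:
1. Match something more specific than “any character” (eg. only letters, or only digits); or
2. Match any character, provided it is not the start of the text that comes next.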
We will use method #1 first, and match alpha (letters) (for “kobold”) and digits (numbers) (for the experience):
line = P"You killed the " * alpha^1 * " and got " * digit^1 * P" experience."
print (lpeg.match (line, target))--> Output: 45
LPEG returns the first column past the match. If there was no match we would get nil as a result.
In LPEG:
The “*” operator concatenates, that is P"a" * P"b" looks for “a” followed by “b”.
The “+” operator expresses an “or” condition, that is P"a" + P"b" would match on “a” and failing that, “b”.
The “-” operator signifies a negative look-ahead assertion. That is, P(1) - P"b" first checks that the next character is not “b” and then matches any character (so, any character except “b” in this case). However, alpha^1 - P"book" would match one or more alpha characters, provided that the current position does not match “book”.
The ^0 operator expresses the concept of “zero or more” of the item it follows.
The ^1 operator expresses the concept of “one or more” of the item it follows.
If you want 10 or more it would be ^10, and so on.
The ^-n operator matches on “at most” n items. So for example, P"a"^-2 would match, at most, two lots of the letter “a”.
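To see those operators in action, here is a short sketch with made-up test strings. It builds its own digit class with R so that it stands alone, and each result is the position just past the match (or nil):

```lua
local lpeg = require "lpeg"
local P, R = lpeg.P, lpeg.R

local digit = R"09"   -- stand-alone digit class (not the locale one)

-- "*" is a sequence: "a" then "b"
print (lpeg.match (P"a" * P"b", "abc"))        --> 3
-- "+" is an ordered choice: "a", otherwise "b"
print (lpeg.match (P"a" + P"b", "bcd"))        --> 2
-- "-" matches the left side only if the right side does not match here
print (lpeg.match ((P(1) - P"b")^1, "aaab"))   --> 4
-- ^0 is zero or more, so it succeeds even with no digits
print (lpeg.match (digit^0, "abc"))            --> 1
-- ^1 is one or more, so it fails with no digits
print (lpeg.match (digit^1, "abc"))            --> nil
-- ^-2 is at most two
print (lpeg.match (P"a"^-2, "aaaa"))           --> 3
```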
Now we’ll try method #2, and look for “anything that is not ‘and got’”. That effectively matches the word “kobold”.
line = P"You killed the " * (1 - P" and got")^1 * " and got " * digit^1 * P" experience."
print (lpeg.match (line, target))
However this looks a bit weird. It makes more sense to decide what you need to match on (eg. some letters) rather than skip everything that is not what follows.
That approach is tedious because we need to put “and got” twice into the expression. If we need to change that to “and received” then we have to change two places. So, we can make a helper function to do it for us:
function upto (what)
return ((P(1) - P(what))^1) * P(what)
end -- upto
line = P("You killed the ") * upto(" and got ") * upto(" experience.")
The upto function takes a pattern. At each position it checks that the pattern does not match there and, if so, consumes one character. It repeats that until it reaches the target pattern, which it then matches. So in other words, it consumes all the characters up to the stopping pattern, and then the stopping pattern itself.
An alternative would be to put the word you don’t want first, and then look for one character, like this:
function upto (what)
return ((-P(what) * P(1))^1) * P(what)
end -- upto
The LPEG “re” (regular expression) module lets you describe LPEG in a more “regexp” way, like this:
require "re"
line = re.compile [[
'You killed the ' %a+ ' and got ' %d+ ' experience.'
]]
print (lpeg.match (line, target))--> Output: 45
The underlying matching is still the same as LPEG, so you need to explicitly describe what you want to match (eg. “%a+” for the word “kobold”).
Both PCRE and Lua regular expressions can match “greedily” or not. What does “greedy matching” do? Consider wanting to match on the regexp:
a+
And imagine an input of:
aaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbb
Greedy matching would be:
aaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbb
^^^^^^^^^^^^^^^^^^
match
Non-greedy matching would be:
aaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbb
^
match
Greedy matching matches as much as it can while still satisfying that part of the regexp. Non-greedy matching matches as little as it can.
print (string.match ("aaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbb", "a+"))--> Output: aaaaaaaaaaaaaaaaaa
print (string.match ("aaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbb", "a-"))--> Output: (nothing)
In this case, “as little as it can” is nothing at all! In Lua the “-” symbol means “zero or more, non-greedy” so in this case the minimum is nothing.
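Non-greedy matching becomes useful once something follows the repeated item. A small sketch with a made-up test string (not from the MUD output above):

```lua
text = "[one] and [two]"

-- greedy: ".*" runs on to the last "]" on the line
print (string.match (text, "%[(.*)%]"))   --> one] and [two
-- non-greedy: ".-" stops at the first "]"
print (string.match (text, "%[(.-)%]"))   --> one
```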
line = rex.new ("a+")
s, e = line:match ("aaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbb")
print (s, e)--> Output: 1 18
line = rex.new ("a+?")
s, e = line:match ("aaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbb")
print (s, e)--> Output: 1 1
With PCRE regexps you append a “?” to the match counter to get non-greedy, so in this case we get a single “a” (because we have to match on one or more).
LPEG only does greedy matches.
It’s all very well matching on a string like:
You killed the kobold and got 10 experience.
But what if you want to know what the variable words are (“kobold” and “10”)?
This is where “captures” are useful. You place symbols in the regexp to tell it which matching parts you want returned.
mobName, experience = string.match (target, "You killed the (.+) and got (.+) experience%.")
print (mobName)
print (experience)--> Output: kobold
10
Things we want to capture are placed in round brackets. If we need to match on round brackets we have to put a “%” in front of them.
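Captures always come back as strings, even when they look like numbers. This sketch uses the more specific classes “%a+” (letters) and “%d+” (digits) and converts the experience with tonumber; adding 5 is just an arbitrary sum to show that arithmetic now works:

```lua
target = "You killed the kobold and got 10 experience."

-- %a+ is one or more letters, %d+ is one or more digits
mobName, experience = string.match (target, "You killed the (%a+) and got (%d+) experience%.")

experience = tonumber (experience)   -- convert the captured string to a number
print (mobName)          --> kobold
print (experience + 5)   --> 15
```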
line = rex.new ("You killed the (.+) and got (.+) experience\\.")
s, e, matches = line:match (target)
print (s, e)
tprint (matches)--> Output: 1 44
1="kobold"
2="10"
Things we want to capture are placed in round brackets. If we need to match on round brackets we have to put a “\” in front of them.
In other words, we got the same start and end columns as before, but also a table of captures, where capture #1 is the first set of round brackets, and capture #2 is the second set.
line = P"You killed the " * C(alpha^1) * " and got " * C(digit^1) * P" experience."
mobName, experience = lpeg.match (line, target)
print (mobName)
print (experience)--> Output: kobold
10
That format returns each capture as a result from the lpeg.match() call. If you want a table of matches surround the pattern by a Ct() function call:
line = Ct (P"You killed the " * C(alpha^1) * " and got " * C(digit^1) * P" experience.")
tprint (lpeg.match (line, target))--> Output: 1="kobold"
2="10"
This required two changes. First we put C( ) around things we want to capture. Second, we put Ct( ) around the whole expression. The C (capture) functions mark those parts as needing to be captured. The Ct (capture table) function places the captures into a table.
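If you would rather have named fields than numbered ones, the Cg (group capture) function can give a capture a name inside Ct. This is a minimal sketch: the field names “mob” and “xp” are my own choice, and the character classes are built with R so that the example stands alone:

```lua
local lpeg = require "lpeg"
local P, R, C, Cg, Ct = lpeg.P, lpeg.R, lpeg.C, lpeg.Cg, lpeg.Ct

local alpha = R ("az", "AZ")   -- stand-alone letter class
local digit = R "09"           -- stand-alone digit class

target = "You killed the kobold and got 10 experience."

-- Cg (pattern, name) inside Ct stores the capture under that name.
-- "/ tonumber" passes the captured digits through tonumber.
line = Ct (P"You killed the " * Cg (C (alpha^1), "mob") *
           " and got " * Cg (C (digit^1) / tonumber, "xp") * P" experience.")

result = lpeg.match (line, target)
print (result.mob)   --> kobold
print (result.xp)    --> 10
```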
Now, using the method of stopping on “and got” we can modify our upto helper function slightly to return a capture:
function upto (what)
return C((P(1) - P(what))^1) * P(what)
end -- upto
line = P("You killed the ") * upto(" and got ") * upto(" experience.")
mobName, experience = lpeg.match (line, target)
print (mobName)
print (experience)--> Output: kobold
10
line = re.compile [[
'You killed the ' { %a+ } ' and got ' { %d+ } ' experience.'
]]
mobName, experience = lpeg.match (line, target)
print (mobName)
print (experience)--> Output: kobold
10
The above returns the captures as results from the lpeg.match() call.
If the mob name could consist of any characters (except " and got " of course) then you can use the concept described earlier of looking for not the string " and got " followed by a single character, and repeat that until " and got " is reached, like this:
line = re.compile [[ 'You killed the '
{(!' and got ' .)*}
' and got '
{[0-9]+}
' experience.'
]]
If you want a table returned, do this:
line = re.compile [[
{| 'You killed the ' { %a+ } ' and got ' { %d+ } ' experience.' |}
]]
tprint (lpeg.match (line, target))--> Output: 1="kobold"
2="10"
The {…} syntax replicates the lpeg.C (capture) call. The {|…|} syntax replicates the lpeg.Ct (capture table) call.
A normal Lua or PCRE regexp is not anchored.
For example, in Lua:
target = "You saw a dog and a cat"
print (string.match (target, "dog"))--> Output: dog
That matched “dog” even though it wasn’t at the start of the line.
LPEG would not match in that situation.
target = "You saw a dog and a cat."
print (lpeg.match ("dog", target))--> Output: nil
To anchor to the start of the line, we put a “^” symbol at the start, for example:
target = "You saw a dog and a cat"
print (string.match (target, "^dog"))--> Output: nil
However it matches “You saw”:
target = "You saw a dog and a cat"
print (string.match (target, "^You saw"))--> Output: You saw
To anchor to the end of the line, we put a “$” symbol at the end, for example:
target = "You saw a dog and a cat"
print (string.match (target, "dog$"))--> Output: nil
However it matches “cat”:
target = "You saw a dog and a cat"
print (string.match (target, "cat$"))--> Output: cat
To match the exact regular expression you use both:
target = "You saw a dog and a cat"
print (string.match (target, "^dog$"))--> Output: nil
However:
target = "dog"
print (string.match (target, "^dog$"))--> Output: dog
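Using both anchors is a handy way to validate a whole string. A small sketch, where the helper name isAllDigits is hypothetical:

```lua
-- Hypothetical helper: true only if the whole string consists of digits.
function isAllDigits (s)
  return string.match (s, "^%d+$") ~= nil
end -- isAllDigits

print (isAllDigits ("12345"))      --> true
print (isAllDigits ("123 gold"))   --> false
print (isAllDigits (""))           --> false
```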
LPEG is already anchored at the start, so how to anchor at the end? We add a pattern of P(-1) to the end, which is the same as:
"" - P(1)
In other words, it matches the empty string, providing there is nothing following the empty string. This can only happen at the end of the line.
target = "You saw a dog and a cat"
print (lpeg.match ("You saw a" * P(-1), target))--> Output: nil
target = "You saw a dog and a cat"
print (lpeg.match (P"You saw a dog and a cat" * P(-1), target))--> Output: 24
Or to see the matching string add a capture around the pattern:
print (lpeg.match (C("You saw a dog and a cat" * P(-1)), target)) --> Output: You saw a dog and a cat
Similarly to what you do with LPEG, you can anchor an re pattern by finishing with “!.”. For example:
require "re"
target = "You saw a dog and a cat"
print (lpeg.match (re.compile ("'You saw a' !."), target))--> Output: nil
require "re"
target = "You saw a dog and a cat"
print (lpeg.match (re.compile ("'You saw a dog and a cat' !."), target))--> Output: 24
function anywhere (p)
return lpeg.P { p + 1 * lpeg.V(1) }
end
print (lpeg.match (anywhere ("dog"), target))--> Output: 14
LPEG can still search for a pattern anywhere in the line, and the helper function anywhere (above) accomplishes this. It actually sets up a “grammar”, which is what you get when you give LPEG a table (note the curly braces).
The grammar could be written like this:
grammar = {
[1] = p + 1 * V(1) -- rule #1
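}
So, the grammar has one rule, named 1.
Looking at the grammar, we can see that it first tries to match p at the current position. If that fails, it matches one character (the “1”) and then applies rule 1 again (the V(1)) from the next position.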
Effectively you could say it recurses, and tries to match one position in from the start. If that fails, it repeats, and matches two positions in, and so on until it runs out of things to match, or gets a match.
This might sound slow, trying the pattern over and over, but in most cases the attempt to match (on “dog” in this case) fails immediately on the first letter, so only one character needs to be tested at each position.
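A variation, along the lines of the search example in the LPEG manual, wraps the pattern in position captures (lpeg.Cp) so that you get back where the match starts as well as where it ends. The name anywherePos is just for illustration:

```lua
local lpeg = require "lpeg"
local P, V, Cp = lpeg.P, lpeg.V, lpeg.Cp

-- Like "anywhere", but returns the start position and the position after the match.
function anywherePos (p)
  local I = Cp ()   -- position capture: matches nothing, returns the current position
  return P { I * p * I + 1 * V(1) }
end -- anywherePos

target = "You saw a dog and a cat"
print (lpeg.match (anywherePos ("dog"), target))   --> 11 14
```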
If you wanted to capture the matching word you can add a capture to the anywhere function, eg.
function anywhere (p)
return lpeg.P { C(p) + 1 * lpeg.V(1) }
end
target = "You see 666 dogs and a cat"
print (lpeg.match (anywhere (digit^1), target))--> Output: 666
If you want to scan the line for the pattern, but have it anchored to the end, then we can add “* P(-1)” to the end of the pattern, like this:
function anywhere (p)
return lpeg.P { C(p) + 1 * lpeg.V(1) }
end
target = "You see 666 dogs and a cat"
print (lpeg.match (anywhere ("cat" * P(-1)), target))--> Output: cat
For more complex strings you can make a “grammar” - that is, a set of rules for parsing the line.
Take this for example:
You see exits leading north, up, down, west and south
We can break that down into parts like this:
Directions <- "north" | "south" | "east" | "west" | "up" | "down"
CommaDirections <- Directions (", " Directions)* " and "
ExitLine <- "You see exits leading " CommaDirections? Directions
In the notation above “|” means “or”, “*” means “zero or more” and “?” means “zero or one”.
We can express that grammar in LPEG like this:
exitgrammar = P {
"ExitLine", --> this tells LPEG which rule to process first
Directions = C (P"north" + "south" + "east" + "west" + "up" + "down"),
CommaDirections = V"Directions" * (", " * V"Directions")^0 * " and ",
ExitLine = "You see exits leading " * V"CommaDirections"^-1 * V"Directions",
}
result = lpeg.match (Ct (exitgrammar), "You see exits leading north, up, down, west and south")
tprint (result)
This gives a table of matches, like this:
1="north"
2="up"
3="down"
4="west"
5="south"
The same grammar as above can be expressed in a more natural way using the ‘re’ module:
require "re"
exits = re.compile[[
ExitLine <- {| "You see exits leading " CommaDirections? Directions |}
CommaDirections <- Directions (", " Directions)* " and "
Directions <- { "north" / "south" / "east" / "west" / "up" / "down" }
]]
tprint ( exits:match ("You see exits leading north, up, down, west and south") )
This gives results like this:
1="north"
2="up"
3="down"
4="west"
5="south"

| Syntax | Description |
|---|---|
| ( p ) | grouping |
| 'string' | literal string |
| "string" | literal string |
| [class] | character class |
| . | any character |
| %name | pattern defs[name] or a pre-defined pattern |
| name | non terminal |
| <name> | non terminal |
| {} | position capture |
| { p } | simple capture |
| {: p :} | anonymous group capture |
| {:name: p :} | named group capture |
| {~ p ~} | substitution capture |
| {\| p \|} | table capture |
| =name | back reference |
| p ? | optional match |
| p * | zero or more repetitions |
| p + | one or more repetitions |
| p^num | exactly num repetitions |
| p^+num | at least num repetitions |
| p^-num | at most num repetitions |
| p -> 'string' | string capture |
| p -> "string" | string capture |
| p -> num | numbered capture |
| p -> name | function/query/string capture equivalent to p / defs[name] |
| p => name | match-time capture equivalent to lpeg.Cmt(p, defs[name]) |
| & p | and predicate |
| ! p | not predicate |
| p1 p2 | concatenation |
| p1 / p2 | ordered choice |
| (name <- p)+ | grammar |
And now for a fancier example. I wanted to match lines with colour codes in them (like Aardwolf uses) but not include the colour codes in word matching. For example, the word “jumped” should match even if preceded by “@x2” (as in “@x2jumped”).
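To do this I made up a grammar where we had a rule for the colour codes. A colour code can be either “@x” followed by up to three digits (eg. “@x2” or “@x009”), or “@” followed by a single colour letter (eg. “@r” or “@W”).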
The “word” to match on was just alphabetic (ie. “%a+”) but you could include underscores or numbers if you wanted.
require "re"
local target = "the quick @r@g@Wbrown@g, fox@x2jumped, @x009over the lazy frog helicopter"
local grammar = re.compile[[
line <- {| (wordWithColour+ / .)* |}
wordWithColour <- colourCode* {} {word} colourCode*
word <- %a+
colourCode <- "@" (("x" %d^-3) / colourLetters)
colourLetters <- [bBcCrRmMgGwWyYdD]
]]
-- run grammar on target text
local resultTable = grammar:match (target)
tprint (resultTable)
The output is a table of positions and matching words, like this:
1=1
2="the"
3=5
4="quick"
5=17
6="brown"
7=26
8="fox"
9=32
10="jumped"
11=45
12="over"
13=50
14="the"
15=54
16="lazy"
17=59
18="frog"
19=64
20="helicopter"
You can then do something with that (like substitute another word for the matching ones).
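You can control the number of repetitions with the “^” operator: “p^num” matches exactly num repetitions, “p^+num” at least num, and “p^-num” at most num.
So in this case “%d^-3” matches a maximum of 3 digits.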
Expanding on the above example, a modified version calls a function for matching words.
In this example the grammar calls gotWord (notice the table as the second argument to re.compile). When a pattern matches a word, gotWord is called which then optionally substitutes a different word. The entire match is then put into the result table, which can be concatenated to reconstruct the original line, with substitutions.
The rule “%a+ -> gotWord” means that matches get sent to the gotWord function, and whatever it returns is used as the final capture value.
require "re"
print (string.rep ("-", 60))
-- words they want replaced
local wantedReplacements = {
['quick'] = 'slow',
['jumped'] = 'hopped',
['brown'] = 'green',
['the'] = "THE",
['helicopter'] = 'bus',
-- and so on
} -- end of wantedReplacements
function gotWord (x)
return wantedReplacements [x] or x
end -- gotWord
local target = "the quick @r@g@Wbrown@g, @@fox@x2jumped, @x009over the lazy frog helicopter"
local grammar = re.compile ([[
line <- {| (wordWithColour+ / {.} )* |}
wordWithColour <- colourCode* word colourCode*
word <- %a+ -> gotWord
colourCode <- { ("@" (("x" %d^-3) / colourLetters)) }
colourLetters <- [bBcCrRmMgGwWyYdD]
]], { gotWord = gotWord } )
-- run grammar on target text
result = grammar:match (target)
-- debug
require "tprint"
tprint (result)
print (table.concat (result))
Output is:
1="THE"
2=" "
3="slow"
4=" "
5="@r"
6="@g"
7="@W"
8="green"
9="@g"
10=","
11=" "
12="@"
13="@"
14="fox"
15="@x2"
16="hopped"
17=","
18=" "
19="@x009"
20="over"
21=" "
22="THE"
23=" "
24="lazy"
25=" "
26="frog"
27=" "
28="bus"
THE slow @r@g@Wgreen@g, @@fox@x2hopped, @x009over THE lazy frog bus
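For comparison, plain string.gsub with a table as the replacement can do simple word swaps, but it knows nothing about colour codes. A rough sketch using the same test line and the same replacements:

```lua
wantedReplacements = {
  quick = "slow", jumped = "hopped", brown = "green",
  the = "THE", helicopter = "bus",
}

target = "the quick @r@g@Wbrown@g, @@fox@x2jumped, @x009over the lazy frog helicopter"

-- gsub looks each matched word up in the table; words with no entry are left alone
result = string.gsub (target, "%a+", wantedReplacements)
print (result)
--> THE slow @r@g@Wbrown@g, @@fox@x2hopped, @x009over THE lazy frog bus
```

Notice that “brown” was not replaced here: the colour letter in “@W” is glued to the word, so “%a+” sees “Wbrown”. That is the problem the colourCode rule in the grammar solves.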
The function lpeg.Cs returns a string with the values for captures replacing what they capture. We can call a function to do the replacements, for example:
pattern = lpeg.R"am"
pattern = lpeg.Cs((pattern / string.upper + 1)^0)
print (pattern:match ("the quick brown fox jumped over the lazy dog"))
Output is:
tHE quICK Brown Fox JuMpED ovEr tHE LAzy DoG
In this case the pattern matches letters in the range “a” to “m”. The second line repeatedly matches the pattern, or advances one character (if there is no match).
We can do a similar thing using the “re” module:
require "re"
pattern = re.compile ("{~ ([a-m] -> upper / .)* ~}", { upper = string.upper } )
print (pattern:match ("the quick brown fox jumped over the lazy dog"))
In this case the “{~ … ~}” sequence indicates a substitution capture (like lpeg.Cs). Inside that we look for the set “a” to “m” and if found send it to the function “upper” (supplied in a table as the second argument to re.compile). Otherwise we skip one character and repeat.
We can supply our own function for transforming a match on the pattern. It takes arguments, one for each capture (in this case there are two captures):
lpeg.locale (lpeg) -- get digit, alpha, etc.
-- match on alphas followed by digits (eg. abc123) and capture each
pattern = lpeg.C (lpeg.alpha^1) * lpeg.C (lpeg.digit^1)
function f (a, b)
return b .. a
end -- f
pattern = lpeg.Cs((pattern / f + 1)^0)
print (pattern:match ("I am testing abc123 and def567"))
Output is:
I am testing 123abc and 567def
The function “f” reverses the two captures, so that “abc123” becomes “123abc”.
We can do a similar thing using the “re” module:
require "re"
pattern = re.compile ("{~ ( ( {%alpha+} {%digit+} ) -> reverse / .)* ~}",
{ reverse = function (a, b) return b .. a end } )
print (pattern:match ("I am testing abc123 and def567"))
The pattern again contains a substitution capture sequence: “{~ … ~}”. Inside that we look for one or more alpha characters (%alpha) which are the first capture (indicated by the braces) followed by one or more digit characters (%digit) which are the second capture. If found, they are passed to the “reverse” function to have the order reversed. If not, we skip one character and try again.
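For a simple case like this one, ordinary Lua patterns can do the same job: string.gsub accepts a function as the replacement and passes it each capture as a separate argument. A minimal sketch for comparison:

```lua
text = "I am testing abc123 and def567"

-- the two captures (letters, then digits) arrive as arguments a and b
result = string.gsub (text, "(%a+)(%d+)", function (a, b) return b .. a end)
print (result)   --> I am testing 123abc and 567def
```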