"Because friends don't let friends write parsers by hand"
NPeg is a pure Nim pattern matching library. It provides macros to compile patterns and grammars (PEGs) to Nim procedures which will parse a string and collect selected parts of the input. PEGs are not unlike regular expressions, but offer more power and flexibility, and have fewer ambiguities. (More about PEGs on Wikipedia)
Some use cases where NPeg is useful are configuration or data file parsers, robust protocol implementations, input validation, lexing of programming languages or domain specific languages.
Some NPeg highlights:
Grammar definitions and Nim code can be freely mixed. Nim code is embedded using the normal Nim code block syntax, and does not disrupt the grammar definition.
NPeg-generated parsers can be used both at run and at compile time.
NPeg offers various methods for tracing, optimizing and debugging your parsers.
NPeg can parse sequences of any data type, also making it suitable as a stage-two parser for lexed tokens.
NPeg can draw cool diagrams.
Here is a simple example showing the power of NPeg: the macro peg compiles a grammar definition into a parser object, which is used to match a string and place the key-value pairs into the Nim table words:
import npeg, strutils, tables
type Dict = Table[string, int]
let parser = peg("pairs", d: Dict):
  pairs <- pair * *(',' * pair) * !1
  word <- +Alpha
  number <- +Digit
  pair <- >word * '=' * >number:
    d[$1] = parseInt($2)
var words: Table[string, int]
doAssert parser.match("one=1,two=2,three=3,four=4", words).ok
echo words
Output:
{"two": 2, "three": 3, "one": 1, "four": 4}
A brief explanation of the above code:
The macro peg is used to create a parser object, which uses pairs as the initial grammar rule to match. The variable d of type Dict will be available inside the code blocks of the parser for storing the parsed data.
The rule pairs matches one pair, followed by zero or more (*) occurrences of a comma followed by a pair.
The rules word and number match a sequence of one or more (+) alphabetic characters or digits, respectively. The Alpha and Digit rules are pre-defined rules matching the character classes {'A'..'Z','a'..'z'} and {'0'..'9'}.
The rule pair matches a word, followed by an equals sign (=), followed by a number.
The word and number in the pair rule are captured with the > operator. The Nim code fragment below this rule is executed for every match, and stores the captured word and number in the words Nim table.
The patt() and peg() macros can be used to compile parser functions:
patt() creates a parser from a single anonymous pattern.
peg() allows the definition of a set of (potentially recursive) rules making up a complete grammar.
The result of these macros is an object of the type Parser which can be used to parse a subject:
proc match(p: Parser, s: string): MatchResult
proc matchFile(p: Parser, fname: string): MatchResult
The above match functions return an object of the type MatchResult:
MatchResult = object
  ok: bool
  matchLen: int
  matchMax: int
  ...
ok: A boolean indicating if the matching succeeded without error. Note that a successful match does not imply that all of the subject was matched, unless the pattern explicitly matches the end-of-string.
matchLen: The number of input bytes of the subject that successfully matched.
matchMax: The highest index into the subject that was reached during parsing, even if matching was backtracked or did not succeed. This offset is usually a good indication of the location where the matching error occurred.
The string captures made during the parsing can be accessed with:
proc captures(m: MatchResult): seq[string]
A simple pattern can be compiled with the patt macro.
For example, the pattern below splits a string by white space:
let parser = patt *(*' ' * > +(1-' '))
echo parser.match(" one two three ").captures
Output:
@["one", "two", "three"]
The patt macro can take an optional code block, which is used as a code block capture for the pattern:
var key, val: string
let p = patt >+Digit * "=" * >+Alpha:
  (key, val) = ($1, $2)
assert p.match("15=fifteen").ok
echo key, " = ", val
The peg macro provides a method to define (recursive) grammars. The first argument is the name of the initial pattern, followed by a list of named patterns. Patterns can now refer to other patterns by name, allowing for recursion:
let parser = peg "ident":
  lower <- {'a'..'z'}
  ident <- *lower
doAssert parser.match("lowercaseword").ok
The order in which the grammar patterns are defined affects the generated parser. Although NPeg could always reorder, this is a design choice to give the user more control over the generated parser:
when a pattern P1 refers to pattern P2 which is defined before P1, P2 will be inlined in P1. This increases the generated code size, but generally improves performance.
when a pattern P1 refers to pattern P2 which is defined after P1, P2 will be generated as a subroutine which gets called from P1. This will reduce code size, but might also result in a slower parser.
The NPeg syntax is similar to normal PEG notation, but some changes were made to allow the grammar to be properly parsed by the Nim compiler:
Prefix instead of suffix notation for *, +, - and ?.
| is used instead of / for ordered choice, because of operator precedence.
The explicit * infix operator is used for sequences.
NPeg patterns and grammars can be composed from the following parts:
Atoms:
0 # matches always and consumes nothing
1 # matches any character
n # matches exactly n characters
'x' # matches literal character 'x'
"xyz" # matches literal string "xyz"
i"xyz" # matches literal string, case insensitive
{'x'..'y'} # matches any character in the range from 'x'..'y'
{'x','y','z'} # matches any character from the set
Operators:
P1 * P2 # concatenation
P1 | P2 # ordered choice
P1 - P2 # matches P1 if P2 does not match
(P) # grouping
!P # matches everything but P
&P # matches P without consuming input
?P # matches P zero or one times
*P # matches P zero or more times
+P # matches P one or more times
@P # search for P
P[n] # matches P n times
P[m..n] # matches P m to n times
Precedence operators:
P ^ N # P is left associative with precedence N
P ^^ N # P is right associative with precedence N
String captures:
>P # Captures the string matching P
Back references:
R("tag", P) # Create a named reference for pattern P
R("tag") # Matches the given named reference
Error handling:
E"msg" # Raise an execption with the given message
In addition to the above, NPeg provides the following built-in shortcuts for common atoms, corresponding to POSIX character classes:
Alnum <- {'A'..'Z','a'..'z','0'..'9'}, # Alphanumeric characters
Alpha <- {'A'..'Z','a'..'z'}, # Alphabetic characters
Blank <- {' ','\t'}, # Space and tab
Cntrl <- {'\x00'..'\x1f','\x7f'}, # Control characters
Digit <- {'0'..'9'}, # Digits
Graph <- {'\x21'..'\x7e'}, # Visible characters
Lower <- {'a'..'z'}, # Lowercase characters
Print <- {'\x21'..'\x7e',' '}, # Visible characters and spaces
Space <- {'\9'..'\13',' '}, # Whitespace characters
Upper <- {'A'..'Z'}, # Uppercase characters
Xdigit <- {'A'..'F','a'..'f','0'..'9'}, # Hexadecimal digits
Atoms are the basic building blocks for a grammar, describing the parts of the subject that should be matched.
Integer literal: 0 / 1 / n
The int literal atom n matches exactly n number of bytes. 0 always matches, but does not consume any data.
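For example, a minimal sketch, assuming a bare integer atom is accepted by the patt macro as listed in the parts overview above:

import npeg

# Matches any three bytes, regardless of their value
let three = patt 3
doAssert three.match("abc").ok
doAssert not three.match("ab").ok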
Character and string literals: 'x' / "xyz" / i"xyz"
Characters and strings are matched literally. If a string is prefixed with i, it will be matched case insensitively.
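For example, a small sketch of a case-insensitive literal match:

import npeg

# i"select" matches the keyword in any case
let p = patt i"select" * ' ' * '*'
doAssert p.match("SELECT *").ok
doAssert p.match("select *").ok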
Character sets: {'x','y'}
Character set notation is similar to native Nim. A set consists of zero or more comma separated characters or character ranges.
{'x'..'y'} # matches any character in the range from 'x'..'y'
{'x','y','z'} # matches any character from the set 'x', 'y', and 'z'
The set syntax {} is flexible and can take multiple ranges and characters in one expression, for example {'0'..'9','a'..'f','A'..'F'}.
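For example, a sketch that matches one or more hexadecimal digits using such a set:

import npeg

# One or more characters from the hexadecimal character set
let hex = patt +{'0'..'9','a'..'f','A'..'F'}
doAssert hex.match("CAFE42").ok
doAssert not hex.match("xyz").ok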
NPeg provides various prefix, infix and suffix operators. These operators combine or transform one or more patterns into expressions, building larger patterns.
Concatenation: P1 * P2
o──[P1]───[P2]──o
The pattern P1 * P2 returns a new pattern that matches only if first P1 matches, followed by P2.
For example, "foo" * "bar" would only match the string "foobar".
Ordered choice: P1 | P2
o─┬─[P1]─┬─o
╰─[P2]─╯
The pattern P1 | P2 tries to first match pattern P1. If this succeeds, matching will proceed without trying P2. Only if P1 can not be matched, NPeg will backtrack and try to match P2 instead. Once either P1 or P2 has matched, the choice will be final ("committed"), and no more backtracking will be possible for this choice.
For example, ("foo" | "bar") * "fizz" would match both "foofizz" and "barfizz".
NPeg optimizes the | operator for characters and character sets: the pattern 'a' | 'b' | 'c' will be rewritten to a character set {'a','b','c'}.
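A small runnable sketch of ordered choice:

import npeg

let p = patt ("foo" | "bar") * "fizz"
doAssert p.match("foofizz").ok
doAssert p.match("barfizz").ok
doAssert not p.match("bazfizz").ok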
Difference: P1 - P2
The pattern P1 - P2 matches P1 only if P2 does not match. This is equivalent to !P2 * P1:
━━━━
o──[P2]─»─[P1]──o
NPeg optimizes the - operator for characters and character sets: the pattern {'a','b','c'} - 'b' will be rewritten to the character set {'a','c'}.
Grouping: (P)
Parentheses are used to group patterns, similar to normal arithmetic expressions.
Not-predicate: !P
━━━
o──[P]──o
The pattern !P returns a pattern that matches only if the input does not match P. In contrast to most other patterns, this pattern does not consume any input.
A common usage for this operator is the pattern !1, meaning "only succeed if there is not a single character left to match" - which is only true at the end of the string.
And-predicate: &P
━━━
━━━
o──[P]──o
The pattern &P matches only if the input matches P, but will not consume any input. This is equivalent to !!P, which is why the railroad diagram shows a double negation - not very pretty, unfortunately.
Optional: ?P
╭──»──╮
o─┴─[P]─┴─o
The pattern ?P matches if P can be matched zero or one times, so it essentially succeeds whether P matches or not.
For example, ?"foo" * "bar" matches both "foobar" and "bar".
Match zero or more times: *P
╭───»───╮
o─┴┬─[P]─┬┴─o
╰──«──╯
The pattern *P tries to match as many occurrences of pattern P as possible - this operator always behaves greedily.
For example, *"foo" * "bar" matches "bar", "foobar", "foofoobar", etc.
Match one or more times: +P
o─┬─[P]─┬─o
╰──«──╯
The pattern +P matches P at least once, but possibly more times. It is equivalent to P * *P - this operator always behaves greedily.
Search: @P
This operator searches for pattern P using an optimized implementation. It is equivalent to s <- *(1 - P) * P, which can be read as "try to match as many characters as possible not matching P, and then match P":
╭─────»─────╮
│ ━━━ │
o─┴┬─[P]─»─1─┬┴»─[P]──o
╰────«────╯
Note that this operator does not allow capturing the skipped data up to the match; if this is required you can manually construct a grammar to do this, as shown in the sketch below.
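A sketch of both variants; the rule name s and the extra prefix capture are just for illustration:

import npeg

# Built-in search: skip ahead until "bar" matches
let find = patt @"bar"
doAssert find.match("foobar").ok

# A hand-rolled variant that also captures the skipped prefix
let findPrefix = peg "s":
  s <- > +(1 - "bar") * > "bar"
doAssert findPrefix.match("foobar").captures == @["foo", "bar"]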
Match exactly n times: P[n]
The pattern P[n] matches P exactly n times.
For example, "foo"[3] only matches the string "foofoofoo":
o──[P]─»─[P]─»─[P]──o
Match m to n times: P[m..n]
The pattern P[m..n] matches P at least m and at most n times.
For example, "foo"[1..3] matches "foo", "foofoo" and "foofoofoo":
╭──»──╮ ╭──»──╮
o──[P]─»┴─[P]─┴»┴─[P]─┴─o
Note: This is an experimental feature, the implementation or API might change in the future.
Precedence operators allow for the construction of "precedence climbing" or "Pratt" parsers with NPeg. The main use for this feature is building parsers for programming languages that follow the usual precedence and associativity rules of arithmetic expressions.
Left associative with precedence N: P ^ N
     <1<
  o──[P]──o
Right associative with precedence N: P ^^ N
     >1>
  o──[P]──o
During parsing NPeg keeps track of the current precedence level of the parsed expression - the default is 0 if no precedence has been assigned yet. When the ^ operator is matched, one of the following three cases applies:
P ^ N where N > 0 and N is higher than the current precedence: in this case the current precedence is set to N and parsing of pattern P continues.
P ^ N where N > 0 and N is lower than or equal to the current precedence: parsing will fail and backtrack.
P ^ 0: resets the current precedence to 0 and continues parsing. The main use case for this is parsing sub-expressions in parentheses.
The heart of a Pratt parser in NPeg would look something like this:
exp <- prefix * *infix
parenExp <- ( "(" * exp * ")" ) ^ 0
prefix <- number | parenExp
infix <- {'+','-'} * exp ^ 1 |
         {'*','/'} * exp ^ 2 |
         {'^'} * exp ^^ 3
More extensive documentation will be added later; for now take a look at the example in tests/precedence.nim.
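For illustration, here is a runnable sketch of the grammar above with an assumed number rule; this is not the tests/precedence.nim example, just a minimal version without captures:

import npeg

let p = peg "exp":
  number   <- +{'0'..'9'}
  exp      <- prefix * *infix
  parenExp <- ( "(" * exp * ")" ) ^ 0
  prefix   <- number | parenExp
  infix    <- {'+','-'} * exp ^ 1 |
              {'*','/'} * exp ^ 2 |
              {'^'}     * exp ^^ 3

doAssert p.match("3*(4+15)+2").ok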
╭╶╶╶╶╶╮
s o────[P]────o
╰╶╶╶╶╶╯
NPeg supports a number of ways to capture data when parsing a string. The various capture methods are described here, including a concise example.
The capture examples below build on the following small PEG, which parses a comma separated list of key-value pairs:
const data = "one=1,two=2,three=3,four=4"
let parser = peg "pairs":
  pairs <- pair * *(',' * pair) * !1
  word <- +Alpha
  number <- +Digit
  pair <- word * '=' * number
let r = parser.match(data)
The basic method for capturing is marking parts of the peg with the capture prefix >. During parsing NPeg keeps track of all matches, properly discarding any matches which were invalidated by backtracking. Only when parsing has fully succeeded does it create a seq[string] of all matched parts, which is then available as the captures of the MatchResult.
In the example, the > capture prefix is added to the word and number rules, causing the matched words and numbers to be appended to the resulting capture seq[string]:
let parser = peg "pairs":
  pairs <- pair * *(',' * pair) * !1
  word <- +Alpha
  number <- +Digit
  pair <- >word * '=' * >number
let r = parser.match(data)
The resulting list of captures is now:
@["one", "1", "two", "2", "three", "3", "four", "4"]
Code block captures offer the most flexibility for accessing matched data in NPeg. This allows you to define a grammar with embedded Nim code for handling the data during parsing.
Note that for code block captures, the Nim code gets executed during parsing, even if the match is part of a pattern that fails and is later backtracked.
When a grammar rule ends with a colon :, the next indented block in the grammar is interpreted as Nim code, which gets executed when the rule has been matched. Any string captures that were made inside the rule are available to the Nim code in the injected variable capture[] of type seq[Capture]:
type Capture = object
  s*: string # The captured string
  si*: int # The index of the captured string in the subject
The total subject matched by the code block rule is available in capture[0]. Any additional explicit > string captures made by the rule or any of its child rules will be available as capture[1], capture[2], ...
For convenience there is syntactic sugar available in the code block capture blocks:
The variables $0 to $9 are rewritten to capture[n].s and can be used to access the captured strings. The $ operator uses the usual Nim precedence, thus these variables might need parentheses or different ordering in some cases; for example $1.parseInt should be written as parseInt($1).
The variables @0 to @9 are rewritten to capture[n].si and can be used to access the offset in the subject of the matched captures.
Example:
let p = peg foo:
  foo <- >(1 * >1) * 1:
    echo "$0 = ", $0
    echo "$1 = ", $1
    echo "$2 = ", $2
echo p.match("abc").ok
Will output
$0 = abc
$1 = ab
$2 = b
Code block captures consume all embedded string captures, so these captures will no longer be available after matching.
A code block capture can also produce captures by calling the push(s: string) function from the code block. Note that this is an experimental feature and that the API might change in future versions.
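A minimal sketch of push(), replacing each word capture with an uppercased version; the rule names are illustrative:

import npeg, strutils

let p = peg "line":
  line <- word * *(',' * word) * !1
  word <- >+Alpha:
    push(toUpperAscii($1))

echo p.match("foo,bar").captures   # --> @["FOO", "BAR"]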
The example has been extended to capture each word and number with the > string capture prefix. When the pair rule is matched, the attached code block is executed, which adds the parsed key and value to the words table.
from strutils import parseInt
var words = initTable[string, int]()
let parser = peg "pairs":
  pairs <- pair * *(',' * pair) * !1
  word <- +Alpha
  number <- +Digit
  pair <- >word * '=' * >number:
    words[$1] = parseInt($2)
let r = parser.match(data)
After the parsing has finished, the words table will contain:
{"two": 2, "three": 3, "one": 1, "four": 4}
Code block captures can be used for additional validation of a captured string: the code block can call the functions fail() or validate(bool) to indicate if the match should succeed or fail. Failing matches are handled as if the capture itself failed and will result in the usual backtracking. When the fail() or validate() functions are not called, the match will succeed implicitly.
For example, the following rule will check if a passed number is a valid uint8 number:
uint8 <- >Digit[1..3]:
  let v = parseInt($1)
  validate v >= 0 and v <= 255
The following grammar will cause the whole parse to fail when the error rule matches:
error <- 0:
  fail()
Note: The Nim code block runs within the NPeg parser context and in theory could access its internal state - this could be used to create custom validator/matcher functions that can inspect the subject string, do lookahead or lookback, and adjust the subject index to consume input. At the time of writing, NPeg lacks a formal API or interface for this though, and I am not sure yet what this should look like. If you are interested in doing this, contact me so we can discuss the details.
NPeg allows passing data of a specific type to the match() function; this value is then available inside code blocks as a variable. This removes the need for global variables for storing or retrieving captured data.
The syntax for defining a generic grammar is as follows:
peg(name, identifier: Type)
For example, the above parser can be rewritten using a generic parser as such:
type Dict = Table[string, int]
let parser = peg("pairs", userdata: Dict):
  pairs <- pair * *(',' * pair) * !1
  word <- +Alpha
  number <- +Digit
  pair <- >word * '=' * >number:
    userdata[$1] = parseInt($2)
var words: Dict
let r = parser.match(data, words)
Backreferences allow NPeg to match an exact string that matched earlier in the grammar. This can be useful to match repetitions of the same word, or for example to match so-called here-documents in programming languages.
For this, NPeg offers the R operator with the following two uses:
The R(name, P) pattern creates a named reference for pattern P which can be referred to by name in other places in the grammar.
The pattern R(name) matches the contents of the named reference that has earlier been stored with the R(name, P) pattern.
For example, the following rule will only match a string which has the same character in the first and last position:
patt R("c", 1) * *(1 - R("c")) * R("c") * !1
The first part of the rule, R("c", 1), will match any character and store it in the named reference c. The second part will match a sequence of zero or more characters that do not match reference c, followed by reference c.
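As a runnable sketch, the pattern above can be verified like this:

import npeg

let p = patt R("c", 1) * *(1 - R("c")) * R("c") * !1
doAssert p.match("abca").ok        # first and last character are the same
doAssert not p.match("abcd").ok    # first and last character differ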
Repetitive inlining of rules might cause a grammar to grow too large, resulting in a huge executable size and slow compilation. NPeg tries to mitigate this in two ways:
Patterns that are too large will not be inlined, even if the above ordering rules apply.
NPeg checks the size of the total grammar, and if it thinks it is too large it will fail compilation with the error message NPeg: grammar too complex.
Check the section "Compile-time configuration" below for more details about too complex grammars.
The parser size and performance depend on many factors; when performance and/or code size matters, it pays to experiment with different orderings and measure the results.
When in doubt, check the generated parser instructions by compiling with the -d:npegTrace or -d:npegDotDir flags - see the section Tracing and Debugging for more information.
At this time the upper limit is 4096 rules; this might become a configurable number in a future release.
For example, the following grammar will not compile because recursive inlining will cause it to expand to a parser with more than 4^6 = 4096 rules:
let p = peg "z":
  f <- 1
  e <- f * f * f * f
  d <- e * e * e * e
  c <- d * d * d * d
  b <- c * c * c * c
  a <- b * b * b * b
  z <- a * a * a * a
The fix is to change the order of the rules so that instead of inlining, NPeg will use a calling mechanism:
let p = peg "z":
  z <- a * a * a * a
  a <- b * b * b * b
  b <- c * c * c * c
  c <- d * d * d * d
  d <- e * e * e * e
  e <- f * f * f * f
  f <- 1
When in doubt, check the generated parser instructions by compiling with the -d:npegTrace flag - see the section Tracing and Debugging for more information.
When building more complex grammars you may find yourself duplicating certain constructs in patterns over and over again. To avoid code repetition (DRY), NPeg provides a simple mechanism to allow the creation of parameterized rules. In good Nim-fashion these rules are called "templates". Templates are defined just like normal rules, but have a list of arguments, which are referred to in the rule. Technically, templates just perform a basic search-and-replace operation: every occurrence of a named argument is replaced by the exact pattern passed to the template when called.
For example, consider the following grammar:
numberList <- +Digit * *( ',' * +Digit)
wordList <- +Alpha * *( ',' * +Alpha)
This snippet uses a common pattern twice for matching lists: p * *( ',' * p). This matches pattern p, followed by zero or more occurrences of a comma followed by pattern p. For example, numberList will match the string 1,22,3.
The above example can be parameterized with a template like this:
commaList(item) <- item * *( ',' * item )
numberList <- commaList(+Digit)
wordList <- commaList(+Alpha)
Here the template commaList is defined, and any occurrence of its argument item will be replaced with the pattern passed when calling the template. This template is used to define the more complex patterns numberList and wordList.
Templates may invoke other templates recursively; for example the above can even be further generalized:
list(item, sep) <- item * *( sep * item )
commaList(item) <- list(item, ',')
numberList <- commaList(+Digit)
wordList <- commaList(+Alpha)
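A runnable sketch of the commaList template, assuming templates can be defined alongside regular rules inside a peg block:

import npeg

let p = peg "doc":
  commaList(item) <- item * *( ',' * item )
  doc <- commaList(+Digit) * !1

doAssert p.match("1,22,3").ok
doAssert not p.match("1,,3").ok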
For simple grammars it is usually fine to build all patterns from scratch from atoms and operators, but for more complex grammars it makes sense to define reusable patterns as basic building blocks.
For this, NPeg keeps track of a global library of patterns and templates. The grammar macro can be used to add rules or templates to this library. All patterns in the library will be stored with a qualified identifier in the form libraryname.patternname, by which they can be referred to at a later time.
For example, the following fragment defines three rules in the library with the name number. The rules will be stored in the global library and are referred to in the peg by their qualified names number.dec, number.hex and number.oct:
grammar "number":
  dec <- {'1'..'9'} * *{'0'..'9'}
  hex <- i"0x" * +{'0'..'9','a'..'f','A'..'F'}
  oct <- '0' * *{'0'..'9'}
let p = peg "line":
  line <- int * *("," * int)
  int <- number.dec | number.hex | number.oct
let r = p.match("123,0x42,0644")
NPeg offers a number of pre-defined libraries for your convenience; these can be found in the npeg/lib directory. A library can be imported with the regular Nim import statement; all rules defined in the imported file will then be added to NPeg's global pattern library. For example:
import npeg/lib/uri
To allow the user to add custom captures to imported grammars or rules, it is possible to override or shadow an existing rule in a grammar.
Overriding will replace the rule from the library with the provided new rule, allowing the caller to change parts of an imported grammar. An overridden rule is allowed to reference the original rule by name, which will cause the new rule to shadow the original rule. This will effectively rename the original rule and replace it with the newly defined rule, which will call the original referred rule.
For example, the following snippet will reuse the grammar from the uri library and capture some parts of the URI in a Nim object:
import npeg/lib/uri
type Uri = object
  host: string
  scheme: string
  path: string
  port: int
var myUri: Uri
let parser = peg "line":
  line <- uri.URI
  uri.scheme <- >uri.scheme: myUri.scheme = $1
  uri.host <- >uri.host: myUri.host = $1
  uri.port <- >uri.port: myUri.port = parseInt($1)
  uri.path <- >uri.path: myUri.path = $1
echo parser.match("http://nim-lang.org:8080/one/two/three")
echo myUri # --> (host: "nim-lang.org", scheme: "http", path: "/one/two/three", port: 8080)
Note: This is an experimental feature, the implementation or API might change in the future.
NPeg was originally designed to parse strings like a regular PEG engine, but has since evolved into a generic parser that can parse any subject of type openArray[T]. This section describes how to use this feature.
The peg() macro must be passed an additional argument specifying the base type T of the subject; the generated parser will then parse a subject of type openArray[T]. When not given, the default type is char, and the parser parses openArray[char], or more typically, string.
When matching non-strings, some of the usual atoms like strings or character sets do not make sense in a grammar; instead the grammar uses literal atoms. Literals can be specified in square brackets and are interpreted as any Nim code: [foo], [1+1] or ["foo"] are all valid literals.
When matching non-strings, captures will be limited to only a single element of the base type, as this makes more sense when parsing a token stream.
For an example of this feature check the example in tests/lexparse.nim - this implements a classic parser with separate lexing and parsing stages.
Unlike regular expressions, PEGs are always matched in anchored mode only: the defined pattern is matched from the start of the subject string. For example, the pattern "bar" does not match the string "foobar".
To search for a pattern in a stream, a construct like this can be used:
p <- "bar"
search <- p | 1 * search
The above grammar first tries to match pattern p, or if that fails, matches any character (1) and recurses back to itself. Because searching is a common operation, NPeg provides the built-in @P operator for this.
PEGs do not care what is in the subject string after the matching succeeds. For example, the rule "foo" happily matches the string "foobar". To make sure the pattern matches the end of the string, this has to be made explicit in the pattern.
The idiomatic notation for this is !1, meaning "only succeed if there is not a single character left to match" - which is only true at the end of the string.
The lookahead (&) and not (!) operators may not consume any input, and make sure that after matching, the internal parsing state of the parser is reset to what it was before the operator was started, including the state of the captures. This means that any captures made inside a & or ! block are also discarded. It is possible however to capture the contents of a non-consuming block with a code block capture, as these are always executed, even when the parser state is rolled back afterwards.
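A sketch of this technique, peeking at the first three characters from inside an and-predicate; the rule and variable names are illustrative:

import npeg

var peeked: string

let p = peg "line":
  line <- &peek * "foobar"
  peek <- >3:
    peeked = $1    # runs during parsing, even though &peek consumes nothing

doAssert p.match("foobar").ok
doAssert peeked == "foo"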
NPeg offers a number of ways to handle errors while parsing a subject string:
The ok field in the MatchResult indicates if the parser was successful: when the complete pattern has been matched this value will be set to true; if the complete pattern did not match the subject, the value will be false.
In addition to the ok field, the matchMax field indicates the maximum offset into the subject the parser was able to match. If the matching succeeded, matchMax equals the total length of the subject; if the matching failed, the value of matchMax is usually a good indication of where in the subject string the error occurred.
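A minimal sketch of inspecting matchMax after a failed match:

import npeg

let p = patt "foo" * "bar"
let r = p.match("fooBAR")
doAssert not r.ok
echo r.matchMax   # offset where matching got stuck, just past "foo"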
When, during matching, the parser reaches an E"message" atom in the grammar, NPeg will raise an NPegException exception with the given message. The typical use case for this atom is to combine it with the ordered choice | operator to generate helpful error messages. The following example illustrates this:
let parser = peg "list":
  list <- word * *(comma * word) * eof
  eof <- !1
  comma <- ','
  word <- +{'a'..'z'} | E"word"
echo parser.match("one,two,three,")
The rule word looks for a sequence of one or more letters (+{'a'..'z'}). If this can not be matched, the E"word" alternative matches instead, raising an exception:
Error: unhandled exception: Parsing error at #14: expected "word" [NPegException]
The NPegException type contains the same two fields as MatchResult to indicate where in the subject string the match failed: matchLen and matchMax:
let a = patt 4 * E"boom"
try:
  doAssert a.match("12345").ok
except NPegException as e:
  echo "Parsing failed at position ", e.matchMax
NPeg does not support left recursion (this applies to PEGs in general). For example, the rule
A <- A / 'a'
will cause an infinite loop because it allows for left-recursion of the non-terminal A.
Similarly, the grammar
A <- B / 'a' A
B <- A
is problematic because it is mutually left-recursive through the non-terminal B.
Note that loops of patterns that can match the empty string will not result in the expected behavior. For example, the rule *0 will cause the parser to stall and go into an infinite loop.
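Left-recursive rules of the form A <- A x / y can usually be rewritten with repetition instead; a minimal sketch:

import npeg

# expr <- expr '+' num / num  rewritten as  expr <- num * *('+' * num)
let p = peg "expr":
  num  <- +Digit
  expr <- num * *('+' * num) * !1

doAssert p.match("1+2+3").ok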
NPeg has no built-in support for Unicode or UTF-8; instead it is simply able to parse UTF-8 documents just like any other string. NPeg comes with a simple UTF-8 grammar library which should simplify common operations like matching a single code point or character class. The following grammar splits a UTF-8 document into separate characters/glyphs by using the utf8.any rule:
import npeg/lib/utf8
let p = peg "line":
  line <- +char
  char <- >utf8.any
let r = p.match("γνωρίζω")
echo r.captures() # --> @["γ", "ν", "ω", "ρ", "ί", "ζ", "ω"]
When compiled with -d:npegGraph, NPeg will dump syntax diagrams (also known as railroad diagrams) for all parsed rules.
Syntax diagrams are sometimes helpful to understand or debug a grammar, or to get more insight into a grammar's complexity.
╭─────────»──────────╮
│ ╭─────»──────╮│
╭╶╶╶╶╶╶╶╶╶╶╮ │ │ ━━━━ ││ ╭╶╶╶╶╶╶╶╮
inf o──"INF:"─»───[number]───»┴─","─»┴┬─[lf]─»─1─┬┴┴»─[lf]─»───[url]────o
╰╶╶╶╶╶╶╶╶╶╶╯ ╰────«─────╯ ╰╶╶╶╶╶╶╶╯
Optional parts (?) are indicated by a forward arrow overhead.
Not-predicates (!) are overlined in red. Note that the diagram does not make it clear that the input for not-predicates is not consumed.
NPeg can generate a graphical representation of a grammar to show the relations between rules. The generated output is a .dot file which can be processed by the Graphviz tool to generate an actual image file.
When compiled with -d:npegDotDir=<PATH>, NPeg will generate a .dot file for each grammar in the code and write it to the given directory.
Edge colors represent the rule relation: grey=inline, blue=call, green=builtin.
Rule colors represent the relative size/complexity of a rule: black=<10, orange=10..100, red=>100.
Large rules result in larger generated code and slow compile times. Rule size can generally be decreased by changing the rule order in a grammar to allow NPeg to call rules instead of inlining them.
When compiled with -d:npegTrace, NPeg will dump its intermediate representation of the compiled PEG, and will dump a trace of the execution during matching. These traces can be used for debugging or optimization of a grammar.
For example, the following program:
let parser = peg "line":
  space <- ' '
  line <- word * *(space * word)
  word <- +{'a'..'z'}
discard parser.match("one two")
will output the following intermediate representation at compile time. From the IR it can be seen that the space rule has been inlined in the line rule, but that the word rule has been emitted as a subroutine which gets called from line:
line:
0: line opCall 6 word word
1: line opChoice 5 *(space * word)
2: space opStr " " ' '
3: line opCall 6 word word
4: line opPartCommit 2 *(space * word)
5: opReturn
word:
6: word opSet '{'a'..'z'}' {'a' .. 'z'}
7: word opSpan '{'a'..'z'}' +{'a' .. 'z'}
8: opReturn
At runtime, the following trace is generated. The trace consists of a number of columns:
0| 0|one two |line |call -> word:6 |
6| 0|one two |word |set {'a'..'z'} |
7| 1|ne two |word |span {'a'..'z'} |
8| 3| two | |return |
1| 3| two |line |choice -> 5 |
2| 3| two | space |chr " " |*
3| 4|two |line |call -> word:6 |*
6| 4|two |word |set {'a'..'z'} |*
7| 5|wo |word |span {'a'..'z'} |*
8| 7| | |return |*
4| 7| |line |pcommit -> 2 |*
2| 7| | space |chr " " |*
| 7| | |fail |*
5| 7| | |return (done) |
The exact meaning of the IR instructions is not discussed here.
NPeg has a number of configurable settings which can be configured at compile time by passing flags to the compiler. The default values should be ok in most cases, but if you ever run into one of these limits you are free to configure them to your liking:
-d:npegPattMaxLen=N This is the maximum allowed length of NPeg's internal representation of a parser, before it gets translated to Nim code. The reason to check for an upper limit is that some grammars can grow exponentially by inlining of patterns, resulting in slow compile times and oversized executables. (default: 4096)
-d:npegInlineMaxLen=N This is the maximum allowed length of a pattern to be inlined. Inlining generally results in a faster parser, but also increases code size. It is valid to set this value to 0; in that case NPeg will never inline patterns and use a calling mechanism instead, which will result in the smallest code size. (default: 50)
-d:npegRetStackSize=N Maximum allowed depth of the return stack for the parser. The default value should be high enough for practical purposes; the stack depth is only limited to detect invalid grammars. (default: 1024)
-d:npegBackStackSize=N Maximum allowed depth of the backtrace stack for the parser. The default value should be high enough for practical purposes; the stack depth is only limited to detect invalid grammars. (default: 1024)
-d:npegGcsafe This is a workaround for the case where NPeg needs to be used from a {.gcsafe.} context when using threads. This will mark the generated matching function as {.gcsafe.}.
NPeg has a number of compile time flags to enable tracing and debugging of the generated parser:
-d:npegTrace: Enable compile time and run time tracing. Please refer to the section 'Tracing' for more details.
-d:npegGraph: Dump syntax diagrams of all parsed rules at compile time.
These flags are meant for debugging NPeg itself, and are typically not useful to the end user:
-d:npegDebug: Enable more debug info. Meant for NPeg development debugging purposes only.
-d:npegExpand: Dump the generated Nim code for all parsers defined in the program. Meant for NPeg development debugging purposes only.
The NPeg syntax is similar, but not exactly the same as the official PEG syntax: it uses some different operators, and prefix instead of postfix operators. The reason for this is that the NPeg grammar is parsed by a Nim macro in order to allow code block captures to embed Nim code, which puts some limitations on the available syntax. Also, NPeg's operators are chosen so that they have the right precedence for PEGs.
Almost, but not quite. Although PEGs and EBNF look quite similar, there are some subtle but important differences which do not allow a literal translation from EBNF to PEG. Notable differences are left recursion and ordered choice. Also, see "From EBNF to PEG" by Roman R. Redziejowski.
let parser = peg "line":
  exp <- term * *( ('+'|'-') * term)
  term <- factor * *( ('*'|'/') * factor)
  factor <- +{'0'..'9'} | ('(' * exp * ')')
  line <- exp * !1
doAssert parser.match("3*(4+15)+2").ok
The following PEG defines a complete parser for the JSON language - it will not produce any captures, but simply traverses and validates the document:
let s = peg "doc":
  S <- *Space
  jtrue <- "true"
  jfalse <- "false"
  jnull <- "null"
  unicodeEscape <- 'u' * Xdigit[4]
  escape <- '\\' * ({ '{', '"', '|', '\\', 'b', 'f', 'n', 'r', 't' } | unicodeEscape)
  stringBody <- ?escape * *( +( {'\x20'..'\xff'} - {'"'} - {'\\'}) * *escape)
  jstring <- ?S * '"' * stringBody * '"' * ?S
  minus <- '-'
  intPart <- '0' | (Digit-'0') * *Digit
  fractPart <- "." * +Digit
  expPart <- ( 'e' | 'E' ) * ?( '+' | '-' ) * +Digit
  jnumber <- ?minus * intPart * ?fractPart * ?expPart
  doc <- JSON * !1
  JSON <- ?S * ( jnumber | jobject | jarray | jstring | jtrue | jfalse | jnull ) * ?S
  jobject <- '{' * ( jstring * ":" * JSON * *( "," * jstring * ":" * JSON ) | ?S ) * "}"
  jarray <- "[" * ( JSON * *( "," * JSON ) | ?S ) * "]"
let doc = """ {"jsonrpc": "2.0", "method": "subtract", "params": [42, 23], "id": 1} """
doAssert s.match(doc).ok
The following example shows how to use code block captures. The defined grammar will parse an HTTP response document and extract structured data from the document into a Nim object:
import npeg, strutils, tables
type
  Request = object
    proto: string
    version: string
    code: int
    message: string
    headers: Table[string, string]
# HTTP grammar (simplified)
let parser = peg("http", userdata: Request):
  space <- ' '
  crlf <- '\n' * ?'\r'
  url <- +(Alpha | Digit | '/' | '_' | '.')
  eof <- !1
  header_name <- +(Alpha | '-')
  header_val <- +(1-{'\n'}-{'\r'})
  proto <- >+Alpha:
    userdata.proto = $1
  version <- >(+Digit * '.' * +Digit):
    userdata.version = $1
  code <- >+Digit:
    userdata.code = parseInt($1)
  msg <- >(+(1 - '\r' - '\n')):
    userdata.message = $1
  header <- >header_name * ": " * >header_val:
    userdata.headers[$1] = $2
  response <- proto * '/' * version * space * code * space * msg
  headers <- *(header * crlf)
  http <- response * crlf * headers * eof
# Parse the data and print the resulting table
const data = """
HTTP/1.1 301 Moved Permanently
Content-Length: 162
Content-Type: text/html
Location: https://nim.org/
"""
var request: Request
let res = parser.match(data, request)
echo request
The resulting data:
(
proto: "HTTP",
version: "1.1",
code: 301,
message: "Moved Permanently",
headers: {
"Content-Length": "162",
"Content-Type":
"text/html",
"Location": "https://nim.org/"
}
)
More examples can be found in tests/examples.nim.
Here are some things I'd like to have implemented one day. Some are hard and require me to better understand what I'm doing first. In no particular order:
Design and implement a proper API for code block captures. The current API feels fragile and fragmented (capture[], $1/$2, fail(), validate()), and does not offer solid primitives to make custom match functions yet; something better should be in place before NPeg goes v1.0.
Resuming/streaming: The current parser is almost ready to be invoked multiple times, resuming parsing where it left off - this should allow parsing of (infinite) streams. The only problem not solved yet is how to handle captures: when a block of data is parsed it might contain data which must later be available to collect the capture. Not sure how to handle this yet.
Memoization: I guess it would be possible to add (limited) memoization to improve performance, but no clue where to start yet.
Parallelization: I wonder if parsing can be parallelized: when reaching an ordered choice, multiple threads should be able to try to parse each individual choice. I do see problems with captures here, though.
I'm not happy about the {.gcsafe.} workaround. I'd be happy to hear any ideas on how to improve this.