【acwj】02， Introduction to Parsing 解析器介绍

楚嘉纳

2023-12-01

搬运自https://github.com/DoctorWkt/acwj，一个介绍如何使用C语言编写一个可自举的类C语言编译器的说明。进行了粗略的翻译。

【acwj】02， Introduction to Parsing 解析器介绍

In this part of our compiler writing journey, I’m going to introduce the basics of a parser. As I mentioned in the first part, the job of the parser is to recognise the syntax and structural elements of the input and ensure that they conform to the grammar of the language.

在我们的编译器编写之旅的这个部分，我将介绍解析器——编译器基础之一。如同我在第一部分提及的，解析器的工作是识别输入的语法和结构元素，并确保它们符合该语言的语法。

We already have several language elements that we can scan in, i.e. our tokens:

我们已经有一些可以扫描进来的语言元素，即我们的标记：

the four basic maths operators: *, /, + and -
decimal whole numbers which have 1 or more digits 0 … 9

Now let’s define a grammar for the language that our parser will recognise.

现在让我们定义该语言（输入）的语法，以便我们的解析器能识别

BNF: Backus-Naur Form 巴克斯范式

You will come across the use of BNF at some point if you get into dealing with computer languages. I will just introduce enough of the BNF syntax here to express the grammar we want to recognise.

如果你开始解决计算机语言，那么你将对巴克斯范式有一些印象。我将介绍足够的BNF语法来表达我们想识别的语法。

We want a grammar to express maths expressions with whole numbers. Here is the BNF description of the grammar:

我们需要一个用整数来表达数学表达式的语法。以下是BNF对该语法的描述：

expression: number
          | expression '*' expression
          | expression '/' expression
          | expression '+' expression
          | expression '-' expression
          ;

number:  T_INTLIT
         ;

The vertical bars separate options in the grammar, so the above says:

An expression could be just a number, or
An expression is two expressions separated by a ‘*’ token, or
An expression is two expressions separated by a ‘/’ token, or
An expression is two expressions separated by a ‘+’ token, or
An expression is two expressions separated by a ‘-‘ token
A number is always a T_INTLIT token

竖线代表语法中的不同选项，因此以上（的BNF）说的是：

一个expression只能是一个number，或者
一个expression是两个用*标记分隔的expression，或者
一个expression是两个用/标记分隔的expression，或者
一个expression是两个用+标记分隔的expression，或者
一个expression是两个用-标记分隔的expression，或者
一个number只能是一个T_INTLIT标记

It should be pretty obvious that the BNF definition of the grammar is recursive: an expression is defined by referencing other expressions. But there is a way to *bottom-out” the recursion: when an expression turns out to be a number, this is always a T_INTLIT token and thus not recursive.

很显然，BNF对于语法的定义是递归的：一个表达式通过引用其他表达式进行定义。但是应当有一个方法来跳出递归：当一个表达式是一个number的时候，它永远是一个T_INTLIT标记，从而结束了递归。

In BNF, we say that “expression” and “number” are non-terminal symbols, as they are produced by rules in the grammar. However, T_INTLIT is a terminal symbol as it is not defined by any rule. Instead, it is an already-recognised token in the language. Similarly, the four maths operator tokens are terminal symbols.

在BNF里，我们说“expression”和“number”是 非终止符，因为它们都由语法的规则创造的（意思是它们是用来表达语法的规则的，不属于语言的内容）。同时，T_INTLIT是一个终止符，因为它不是由规则定义的，而是一个语言中已经可以识别的标记。同样的，这四个运算符也是终结符。

S->A|S*S|S/S|S+S|S-S
A->0|1|...|9

或

S->A|A*S|A/S|A+S|A-S
A->0|1|...|9

Recursive Descent Parsing递归下降解析

Given that the grammar for our language is recursive, it makes sense for us to try and parse it recursively. What we need to do is to read in a token, then look ahead to the next token. Based on what the next token is, we can then decide what path we need to take to parse the input. This may require us to recursively call a function that has already been called.

考虑到我们的语言的语法是递归的，我们可以试着递归地解析它。我们需要做的是读进一个标记，然后查看下一个标记。然后，我们可以基于下一个标记决定我们应当如何解析输入。这可能需要我们递归地调用一个已经调用了的函数。

In our case, the first token in any expression will be a number and this may be followed by maths operator. After that there may only be a single number, or there may be the start of a whole new expression. How can we parse this recursively?

在我们的例子中，任何表达式中的第一个标记将是一个数字，而下一个标记可能是一个数学表达式。之后，可能只有一个数字，或者是一个全新的表达式的开始。我们如何递归地解析它？

We can write pseudo-code that looks like this:

我们可以编写如下所示的伪代码：

function expression() {
  Scan and check the first token is a number. Error if it's not
  Get the next token
  If we have reached the end of the input, return, i.e. base case

  Otherwise, call expression()
}

Let’s run this function on the input 2 + 3 - 5 T_EOF where T_EOF is a token that reflects the end of the input. I will number each call to expression().

让我们用输入2 + 3 - 5 T_EOF来运行这个代码。此处T_EOF也是一个标记，它指示输入的结束。我将会给expression()的每一次调用进行标号

expression0:
  Scan in the 2, it's a number
  Get next token, +, which isn't T_EOF
  Call expression()

    expression1:
      Scan in the 3, it's a number
      Get next token, -, which isn't T_EOF
      Call expression()

        expression2:
          Scan in the 5, it's a number
          Get next token, T_EOF, so return from expression2

      return from expression1
  return from expression0

Yes, the function was able to recursively parse the input 2 + 3 - 5 T_EOF.

是的，该函数能够递归地解析输入2 + 3 - 5 T_EOF

Of course, we haven’t done anything with the input, but that isn’t the job of the parser. The parser’s job is to recognise the input, and warn of any syntax errors. Someone else is going to do the semantic analysis of the input, i.e. to understand and perform the meaning of this input.

当然，我们还没有对输入进行任何处理，但是它不是解析器的工作。解析器的工作是识别输入，并对任何语法错误法处警告。一些其他要做的是做输入的语义分析，即，去理解、执行该输入的意义。

Later on, you will see that this isn’t actually true. It often makes sense to intertwine the syntax analysis and semantic analysis.
在这之后，你会发现这并不是完全正确的。将语法分析与语义分析交织在一起通常是有意义的。

Abstract Syntax Trees AST抽象语法树

To do the semantic analysis, we need code that either interprets the recognised input, or translates it to another format, e.g. assembly code. In this part of the journey, we will build an interpreter for the input. But to get there, we are first going to convert the input into an abstract syntax tree, also known as an AST.

为了进行语义分析，我们需要解释解释已经识别的输入，或将其转换为另一种格式的代阿，例如汇编代码。在旅程的这一部分，我们将为输入构建一个解释器。但是为了达到这个目的，我们需要先将输入转换为一个抽象语法树，也称之为AST。

I highly recommend you read this short explanation of ASTs:
我极力推荐您阅读以下关于AST的简单解释
Leveling Up One’s Parsing Game With ASTs by Vaidehi Joshi

It’s well written and really help to explain the purpose and structure of ASTs. Don’t worry, I’ll be here when you get back.

它写得很好，而且对解释AST的意图和结构很有帮助。不用担心，我会在这里等你回来。

The structure of each node in the AST that we will build is described in defs.h:

AST的每一个结点的结构定义在defs.h中：

// AST node types
enum {
  A_ADD, A_SUBTRACT, A_MULTIPLY, A_DIVIDE, A_INTLIT
};

// Abstract Syntax Tree structure
struct ASTnode {
  int op;                               // "Operation" to be performed on this tree
  struct ASTnode *left;                 // Left and right child trees
  struct ASTnode *right;
  int intvalue;                         // For A_INTLIT, the integer value
};

Some AST nodes, like those with op values A_ADD and A_SUBTRACT have two child ASTs that are pointed to by left and right. Later on, we will add or subtract the values of the sub-trees.

一些AST结点，例如op值为A_ADD和A_SUBTRACT有两个子树，一个指向left，另一个指向right。之后，我们将添加或者削减子树的值。

Alternatively, an AST node with the op value A_INTLIT represents an integer value. It has no sub-tree children, just a value in the intvalue field.

可选的是，一个op值为A_INTLIT的AST结点表示一个整数值。它没有子树，只在intvalue字段有一个值。

Building AST Nodes and Trees 构建AST结点和树

The code in tree.c has the functions to build ASTs. The most general function, mkastnode(), takes values for all four fields in an AST node. It allocates the node, populates the field values and returns a pointer to the node:

tree.c中的代码有构建AST树的函数。最常用的函数，mkastnode()，给一个AST结点的四个字段都赋上值。它分配空间给该结点，填充字段值，并返回指向结点的指针：

// Build and return a generic AST node
struct ASTnode *mkastnode(int op, struct ASTnode *left,
                          struct ASTnode *right, int intvalue) {
  struct ASTnode *n;

  // Malloc a new ASTnode
  n = (struct ASTnode *) malloc(sizeof(struct ASTnode));
  if (n == NULL) {
    fprintf(stderr, "Unable to malloc in mkastnode()\n");
    exit(1);
  }
  // Copy in the field values and return it
  n->op = op;
  n->left = left;
  n->right = right;
  n->intvalue = intvalue;
  return (n);
}

Given this, we can write more specific functions that make a leaf AST node (i.e. one with no children), and make an AST node with a single child:

给出这个后，我们能写出更多的创建AST叶结点的特殊函数，然后创建一个只有一个子结点的AST结点

// Make an AST leaf node
struct ASTnode *mkastleaf(int op, int intvalue) {
  return (mkastnode(op, NULL, NULL, intvalue));
}

// Make a unary AST node: only one child
struct ASTnode *mkastunary(int op, struct ASTnode *left, int intvalue) {
  return (mkastnode(op, left, NULL, intvalue));

Purpose of the AST AST的目的

We are going to use an AST to store each expression that we recognise so that, later on, we can traverse it recursively to calculate the final value of the expression. We do want to deal with the precedence of the maths operators. Here is an example.

我们将使用一个AST来存储我们识别出来的每个表达式，以便以其后我们能将其递归地遍历它来得到表达式的最终值。我们也想要处理运算符的优先级。以下是一个例子：

Consider the expression 2 * 3 + 4 * 5. Now, multiplication has higher precedence that addition. Therefore, we want to bind the multiplication operands together and perform these operations before we do the addition.

考虑表达式2 * 3 + 4 * 5。现在，乘法比加法有更高的优先级。因此，我们想要将乘法运算绑定在一起，并在我们执行加发前先执行这些运算

If we generated the AST tree to look like this:

如果我们创建了一个AST，它看起来是这样子的：

          +
         / \
        /   \
       /     \
      *       *
     / \     / \
    2   3   4   5

then, when traversing the tree, we would perform 2*3 first, then 4*5. Once we have these results, we can then pass them up to the root of the tree to perform the addition.

然后，当遍历这个树的时候，我们将先运算2 * 3，然后再运算4 * 5。一旦我们得到了这些结果，我们旧可以将它们传递给树的根部来执行加法。

A Naive Expression Parser 朴素表达式解析器

Now, we could re-use the token values from our scanner as the AST node operation values, but I like to keep the concept of tokens and AST nodes separate. So, to start with, I’m going to have a function to map the token values into AST node operation values. This, along with the rest of the parser, is in expr.c:

现在，我们能重新使用扫描器中的标记值作为AST结点的操作值，但是我喜欢将标记和AST结点的概念区分开来。因此，一开始，我将使用一个函数来映射标记值和AST结点操作值。这些和解析器剩下的部分都在expr.c中。

// Convert a token into an AST operation.
int arithop(int tok) {
  switch (tok) {
    case T_PLUS:
      return (A_ADD);
    case T_MINUS:
      return (A_SUBTRACT);
    case T_STAR:
      return (A_MULTIPLY);
    case T_SLASH:
      return (A_DIVIDE);
    default:
      fprintf(stderr, "unknown token in arithop() on line %d\n", Line);
      exit(1);
  }
}

The default statement in the switch statement fires when we can’t convert the given token into an AST node type. It’s going to form part of the syntax checking in our parser.

当无法将给定的标记转换成AST结点类型时，switch语句中的default语句将被即法。它将成为语法解析器中语法检查的一部分。

We need a function to check that the next token is an integer literal, and to build an AST node to hold the literal value. Here it is:

我们需要一个函数来检查下一个标记是否为整数字面值，并构建一个AST节点来保存字面值。以下：

// Parse a primary factor and return an
// AST node representing it.
static struct ASTnode *primary(void) {
  struct ASTnode *n;

  // For an INTLIT token, make a leaf AST node for it
  // and scan in the next token. Otherwise, a syntax error
  // for any other token type.
  switch (Token.token) {
    case T_INTLIT:
      n = mkastleaf(A_INTLIT, Token.intvalue);
      scan(&Token);
      return (n);
    default:
      fprintf(stderr, "syntax error on line %d\n", Line);
      exit(1);
  }
}

This assumes that there is a global variable Token, and that it already has the most recent token scanned in from the input. In data.h:

这假设存在一个全局变量Token，并且它已经从输入扫描了最新的Token。在data.h中：

extern_ struct token    Token;

and in main():
在main.c中

  scan(&Token);                 // Get the first token from the input
  n = binexpr();                // Parse the expression in the file

Now we can write the code for the parser:
现在我们可以写解析器的代码了：

// Return an AST tree whose root is a binary operator
struct ASTnode *binexpr(void) {
  struct ASTnode *n, *left, *right;
  int nodetype;

  // Get the integer literal on the left.
  // Fetch the next token at the same time.
  left = primary();

  // If no tokens left, return just the left node
  if (Token.token == T_EOF)
    return (left);

  // Convert the token into a node type
  nodetype = arithop(Token.token);

  // Get the next token in
  scan(&Token);

  // Recursively get the right-hand tree
  right = binexpr();

  // Now build a tree with both sub-trees
  n = mkastnode(nodetype, left, right, 0);
  return (n);
}

Notice that nowhere in this naive parser code is there anything to deal with different operator precedence. As it stands, the code treats all operators as having equal precedence. If you follow the code as it parses the expression 2 * 3 + 4 * 5, you will see that it builds this AST:

请注意，在这个朴素的解析器代码中，没有任何地方可以处理不同的运算符优先级。目前，代码将所有运算符视为具有同等优先级。如果在解析表达式2*3+4*5时遵循代码，您将看到它构建了以下AST：

This is definitely not correct. It will multiply 4*5 to get 20, then do 3+20 to get 23 instead of doing 2*3 to get 6.

这绝对是不对的。它将乘以4*5得到20，然后用3+20得到23，而不是用2*3得到6。

So why did I do this? I wanted to show you that writing a simple parser is easy, but getting it to also do the semantic analysis is harder.

那我为什么要这么做？我想向您展示，编写一个简单的解析器很容易，但让它同时进行语义分析则更难。

Interpreting the Tree 解析语法树

Now that we have our (incorrect) AST tree, let’s write some code to interpret it. Again, we are going to write recursive code to traverse the tree. Here’s the pseudo-code:

现在我们有了（不正确的）AST树，让我们编写一些代码来解释它。同样，我们将编写递归代码来遍历树。以下是伪代码：

interpretTree:
  First, interpret the left-hand sub-tree and get its value
  Then, interpret the right-hand sub-tree and get its value
  Perform the operation in the node at the root of our tree
  on the two sub-tree values, and return this value

首先，解释左边的子树并得到它的值

然后，解释右边的子树并得到它的值

在树的根节点中执行该操作

并返回此值

Going back to the correct AST tree:

回顾正确的AST树：

          +
         / \
        /   \
       /     \
      *       *
     / \     / \
    2   3   4   5

the call structure would look like:

调用结构应当如下：

interpretTree0(tree with +):
  Call interpretTree1(left tree with *):
     Call interpretTree2(tree with 2):
       No maths operation, just return 2
     Call interpretTree3(tree with 3):
       No maths operation, just return 3
     Perform 2 * 3, return 6

  Call interpretTree1(right tree with *):
     Call interpretTree2(tree with 4):
       No maths operation, just return 4
     Call interpretTree3(tree with 5):
       No maths operation, just return 5
     Perform 4 * 5, return 20

  Perform 6 + 20, return 26

Code to Interpret the Tree 解析语法树的代码

This is in interp.c and follows the above pseudo-code:

这些代码在interp.c中，根据以上的伪代码写的的：

// Given an AST, interpret the
// operators in it and return
// a final value.
int interpretAST(struct ASTnode *n) {
  int leftval, rightval;

  // Get the left and right sub-tree values
  if (n->left)
    leftval = interpretAST(n->left);
  if (n->right)
    rightval = interpretAST(n->right);

  switch (n->op) {
    case A_ADD:
      return (leftval + rightval);
    case A_SUBTRACT:
      return (leftval - rightval);
    case A_MULTIPLY:
      return (leftval * rightval);
    case A_DIVIDE:
      return (leftval / rightval);
    case A_INTLIT:
      return (n->intvalue);
    default:
      fprintf(stderr, "Unknown AST operator %d\n", n->op);
      exit(1);
  }
}

Again, the default statement in the switch statement fires when we can’t interpret the AST node type. It’s going to form part of the sematic checking in our parser.

同样，当我们无法解释AST节点类型时，switch语句中的默认语句将激发。它将成为语法分析器中语义检查的一部分。

Building the Parser 构建解析器

There is some other code here and the, like the call to the interpreter in main():
这里还有一些其他代码，比如main（）中对解释器的调用：

  scan(&Token);                 // Get the first token from the input
  n = binexpr();                // Parse the expression in the file
  printf("%d\n", interpretAST(n));      // Calculate the final result
  exit(0);

You can now build the parser by doing:

你可以通过如下步骤构建解析器：

$ make
cc -o parser -g expr.c interp.c main.c scan.c tree.c

I’ve provided several input files for you to test the parser on, but of course you can create your own. Remember, the calculated results are incorrect, but the parser should detect input errors like consecutive numbers, consecutive operators, and a number missing at the end of the input. I’ve also added some debugging code to the interpreter so you can see which AST tree nodes get evaluated in which order:

我已经提供了几个输入文件供您测试解析器，但是您当然可以创建自己的输入文件。请记住，计算结果是不正确的，但是解析器应该检测输入错误，如连续数字、连续运算符和输入末尾缺少的数字。我还向解释器添加了一些调试代码，这样您就可以看到哪些AST树节点是按什么顺序计算的：

$ cat input01
2 + 3 * 5 - 8 / 3

$ ./parser input01
int 2
int 3
int 5
int 8
int 3
8 / 3
5 - 2
3 * 3
2 + 9
11

$ cat input02
13 -6+  4*
5
       +
08 / 3

$ ./parser input02
int 13
int 6
int 4
int 5
int 8
int 3
8 / 3
5 + 2
4 * 7
6 + 28
13 - 34
-21

$ cat input03
12 34 + -56 * / - - 8 + * 2

$ ./parser input03
unknown token in arithop() on line 1

$ cat input04
23 +
18 -
45.6 * 2
/ 18

$ ./parser input04
Unrecognised character . on line 3

$ cat input05
23 * 456abcdefg

$ ./parser input05
Unrecognised character a on line 1

Conclusion and What’s Next 总结展望

A parser recognises the grammar of the language and checks that the input to the compiler conforms to this grammar. If it doesn’t, the parser should print out an error message. As our expression grammar is recursive, we have chosen to write a recursive descent parser to recognise our expressions.

解析器识别语言的语法，并检查编译器的输入是否符合该语法。如果不符合，解析器应该打印一条错误消息。由于表达式语法是递归的，我们选择编写一个递归下降解析器来识别表达式。

Right now the parser works, as shown by the above output, but it fails to get the semantics of the input right. In other words, it doesn’t calculate the correct value of the expressions.

现在解析器可以工作了，如上面的输出所示，但它无法正确获取输入的语义。换句话说，它不会计算表达式的正确值。

In the next part of our compiler writing journey, we will modify the parser so that it also does the semantic analysis of the expressions to get the right maths results.

在编译器编写过程的下一部分中，我们将修改解析器，以便它也对表达式进行语义分析，以获得正确的数学结果。

【acwj】02， Introduction to Parsing 解析器介绍

【acwj】02， Introduction to Parsing 解析器介绍

BNF: Backus-Naur Form 巴克斯范式

Recursive Descent Parsing递归下降解析

Abstract Syntax Trees AST抽象语法树

Building AST Nodes and Trees 构建AST结点和树

Purpose of the AST AST的目的

A Naive Expression Parser 朴素表达式解析器

Interpreting the Tree 解析语法树

Code to Interpret the Tree 解析语法树的代码

Building the Parser 构建解析器

Conclusion and What’s Next 总结展望

相关阅读

相关文章

相关问答

相关文档