当前位置: 首页 > 工具软件 > acwj > 使用案例 >

【acwj】03, Operator Precedence 运算符优先级

卢枫涟
2023-12-01

搬运自https://github.com/DoctorWkt/acwj,一个介绍如何使用C语言编写一个可自举的类C语言编译器的说明。进行了粗略的翻译。

【acwj】03, Operator Precedence 运算符优先级

We saw in the previous part of our compiler writing journey that a parser doesn’t necessarily enforce the semantics of our language. It only enforces the syntax and structural rules of the grammar.

在编译器编写过程的前一部分中,我们看到解析器不一定强制执行我们语言的语义。它只执行语法的语法和结构规则。

We ended up with code that calculates the wrong value of expressions like 2 * 3 + 4 * 5, because the code created an AST that looks like:

我们最终得到的代码计算了错误的表达式值,如2*3+4*5,因为该代码创建了一个如下所示的AST:

     *
    / \
   2   +
      / \
     3   *
        / \
       4   5

而不是

          +
         / \
        /   \
       /     \
      *       *
     / \     / \
    2   3   4   5

To solve this, we have to add code to our parser to perform operator precedence. There are (at least) two ways of doing this:

  • Making the operator precedence explicit in the language’s grammar
  • Influencing the existing parser with an operator precedence table

为了解决这个问题,我们不得不往我们的解析器里增加代码来实现运算符优先级。至少有两种方法能实现:

  • 使运算优先级在语言的语法中显示表达
  • 使用运算符优先级表修正现有解析器

Making the Operator Precedence Explicit 使运算优先级显示表达

Here is our grammar from the last part of the journey:

以下是我们上一个部分的语法:

expression: number
          | expression '*' expression
          | expression '/' expression
          | expression '+' expression
          | expression '-' expression
          ;

number:  T_INTLIT
         ;

Note that there is no differentiation between any of the four maths operators. Let’s tweak the grammar so that there is a difference:

注意,四个数学运算符之间没有区别。让我们调整语法,使其发生改变:

expression: additive_expression
    ;

additive_expression:
      multiplicative_expression
    | additive_expression '+' multiplicative_expression
    | additive_expression '-' multiplicative_expression
    ;

multiplicative_expression:
      number
    | number '*' multiplicative_expression
    | number '/' multiplicative_expression
    ;

number:  T_INTLIT
         ;
S->A
A->B|A+B|A-B
B->C|C*B|C/B
C->0|1|...|9

We now have two types of expressions: additive expressions and multiplicative expressions. Note that the grammar now forces the numbers to be part of multiplicative expressions only. This forces the ‘*’ and ‘/’ operators to bind more tightly to the numbers on either side, thus having higher precedence.

我们现在有两种类型的表达式:加法表达式和乘法表达式。注意,语法现在强制数字只作为乘法表达式的一部分。这迫使“*”和“/”运算符与两边的数字绑定得更紧密,从而具有更高的优先级。

Any additive expression is actually either a multiplicative expression by itself, or an additive (i.e. multiplicative) expression followed by a ‘+’ or ‘-‘ operator then another multiplicative expression. The additive expression is now at a much lower predencence than the multiplicative expression.

任何加法表达式实际上要么是一个乘法表达式本身,要么是一个加法(即乘法)表达式,后跟一个“+”或“-”运算符,然后是另一个乘法表达式。加法表达式现在比乘法表达式的优先级低得多。

Doing The Above in the Recursive Descent Parser 在递归下降解析器中执行上述操作

How do we take the above version of our grammar and implement it into our recursive descent parser? I’ve done this in the file expr2.c and I’ll cover the code below.

我们如何将上述版本的语法实现到递归下降解析器中?我已经在expr2.c文件中完成了这项工作,我将介绍下面的代码。

The answer is to have a multiplicative_expr() function to deal with the ‘*’ and ‘/’ operators, and an additive_expr() function to deal with the lower precedence ‘+’ and ‘-‘ operators.

答案是使用multiplicative_expr() 来处理“*”和“/”运算符,使用additive_expr() 来处理优先级较低的“+”和“-”运算符。

Both functions are going to read in something and an operator. Then, while there are following operators at the same precedence, each function will parse some more of the input and combine the left and right halves with the first operator.

这两个函数都将读入一些内容和一个运算符。然后,虽然后面跟着具有相同优先级的运算符,但每个函数将解析更多的输入,并将左半部和右半部与第一个运算符组合。

However, additive_expr() will have to defer to the higher-precedence multiplicative_expr() function. Here is how this is done.

但是,加法函数必须遵从更高优先级的乘法函数。这是如何做到的?

additive_expr()

// Return an AST tree whose root is a '+' or '-' binary operator
struct ASTnode *additive_expr(void) {
  struct ASTnode *left, *right;
  int tokentype;

  // Get the left sub-tree at a higher precedence than us
  left = multiplicative_expr();

  // If no tokens left, return just the left node
  tokentype = Token.token;
  if (tokentype == T_EOF)
    return (left);

  // Loop working on token at our level of precedence
  while (1) {
    // Fetch in the next integer literal
    scan(&Token);

    // Get the right sub-tree at a higher precedence than us
    right = multiplicative_expr();

    // Join the two sub-trees with our low-precedence operator
    left = mkastnode(arithop(tokentype), left, right, 0);

    // And get the next token at our precedence
    tokentype = Token.token;
    if (tokentype == T_EOF)
      break;
  }

  // Return whatever tree we have created
  return (left);
}

Right at the beginning, we immediately call multiplicative_expr() in case the first operator is a high-precedence ‘*’ or ‘/’. That function will only return when it encounters a low-precedence ‘+’ or ‘-‘ operator.

在开始时,如果第一个运算符是高优先级的“*”或“/”,我们立即调用multiplicative_expr()。该函数仅在遇到低优先级“+”或“-”运算符时返回。

Thus, when we hit the while loop, we know we have a ‘+’ or ‘-‘ operator. We loop until there are no tokens left in the input, i.e. when we hit the T_EOF token.

因此,当我们点击while循环时,我们知道我们有一个“+”或“-”操作符。我们循环直到输入中没有剩余的标记,即当我们碰到T_EOF标记时。

Inside the loop, we call multiplicative_expr() again in case any future operators are higher precedence than us. Again, this will return when they are not.

在循环内部,我们再次调用multiplicative_expr() ,以防后面的运算符优先级高于我们。同样,当它们(后面的运算符)(优先级)没有(当前运算符优先级高)时将返回。

Once we have a left and right sub-tree, we can combine them with the operator we got the last time around the loop. This repeats, so that if we had the expression 2 + 4 + 6, we would end up with the AST tree:

一旦我们有了一个左右的子树,我们就可以将它们与上一次循环中得到的操作符结合起来。这会重复,因此如果我们有表达式 2 + 4 + 6,我们将得到AST树:

       +
      / \
     +   6
    / \
   2   4

But if multiplicative_expr() had its own higher precedence operators, we would be combining sub-trees with multiple nodes in them.

但是,如果multiplicative_expr() 有它自己的更高优先级的操作符,我们将把子树与其中的多个节点结合起来。

multiplicative_expr()

// Return an AST tree whose root is a '*' or '/' binary operator
struct ASTnode *multiplicative_expr(void) {
  struct ASTnode *left, *right;
  int tokentype;

  // Get the integer literal on the left.
  // Fetch the next token at the same time.
  left = primary();

  // If no tokens left, return just the left node
  tokentype = Token.token;
  if (tokentype == T_EOF)
    return (left);

  // While the token is a '*' or '/'
  while ((tokentype == T_STAR) || (tokentype == T_SLASH)) {
    // Fetch in the next integer literal
    scan(&Token);
    right = primary();

    // Join that with the left integer literal
    left = mkastnode(arithop(tokentype), left, right, 0);

    // Update the details of the current token.
    // If no tokens left, return just the left node
    tokentype = Token.token;
    if (tokentype == T_EOF)
      break;
  }

  // Return whatever tree we have created
  return (left);
}

The code is similar to additive_expr() except that we get to call primary() to get real integer literals! We also only loop when we have operators at our high precedence level, i.e. ‘*’ and ‘/’ operators. As soon as we hit a low precedence operator, we simply return the sub-tree that we’ve built to this point. This goes back to additive_expr() to deal with the low precedence operator.

这段代码与additive_expr()类似,只是我们需要调用primary()来获取真正的整数文本!我们也只在有高优先级的运算符时循环,即“*”和“/”运算符。只要我们找到一个低优先级的操作符,我们就可以简单地返回到这一点上所构建的子树。这返回到additive_expr() 来处理低优先级运算符。

Drawbacks of the Above 上述措施的缺点

The above way of constructing a recursive descent parser with explicit operator precedence can be inefficient because of all the function calls needed to reach the right level of precedence. There also has to be functions to deal with each level of operator precedence, so we end up with lots of lines of code.

上述构造具有显式运算符优先级的递归下降解析器的方法可能效率低下,因为所有函数调用都需要达到正确的优先级级别,还必须有函数来处理每个级别的运算符优先级。因此我们最终会得到很多行代码。

The Alternative: Pratt Parsing 替代方案:Pratt解析

One way to cut down on the amount of code is to use a Pratt parser which has a table of precedence values associated with each token instead of having functions that replicate the explicit precedence in the grammar.

减少代码量的一种方法是使用Pratt解析器,该解析器具有与每个标记相关联的优先级值表,而不是使用复制语法中显式优先级的函数。

At this point I highly recommend that you read Pratt Parsers: Expression Parsing Made Easy by Bob Nystrom. Pratt parsers still make my head hurt, so read as much as you can and get comfortable with the basic concept.

此时,我强烈建议您阅读Pratt Parsers:Bob Nystrom简化的表达式解析。普拉特语法分析器仍然让我头疼,所以尽可能多地阅读并熟悉基本概念。

expr.c: Pratt Parsing

I’ve implemented Pratt parsing in expr.c which is a drop-in replacement for expr2.c. Let’s start the tour.

Firstly, we need some code to determine the precedence levels for each token:

我已经在expr.c中实现了Pratt解析,它是expr2.c的替代品。让我们开始吧。

首先,我们需要一些代码来确定每个标记的优先级:

// Operator precedence for each token
static int OpPrec[] = { 0, 10, 10, 20, 20,    0 };
//                     EOF  +   -   *   /  INTLIT

// Check that we have a binary operator and
// return its precedence.
static int op_precedence(int tokentype) {
  int prec = OpPrec[tokentype];
  if (prec == 0) {
    fprintf(stderr, "syntax error on line %d, token %d\n", Line, tokentype);
    exit(1);
  }
  return (prec);
}

Higher numbers (e.g. 20) mean a higher precedence than lower numbers (e.g. 10).

更大的数字(例如20)比起更低的数字(例如10)意味着更高的优先级

Now, you might ask: why have a function when you have a look-up table called OpPrec[]? The answer is: to spot syntax errors.

现在,你可能会问:为什么需要一个单独的函数来查询优先级呢?答案是:为了发现语法错误。

Consider an input that looks like 234 101 + 12. We can scan in the first two tokens. But if we simply used OpPrec[] to get the precedence of the second 101 token, we wouldn’t notice that it isn’t an operator. Thus, the op_precedence() function enforces the correct grammar syntax.

考虑这样的输入:234 101 + 12。我们能扫描前两个标记,但是当我们简单地用OpPrec来获取第二个标记101的优先级时,我们没有办法注意到它是不是一个操作符。因此,使用函数op_precedence()来执行语法检查。

Now, instead of having a function for each precedence level, we have a single expression function that uses the table of operator precedences:

现在,我们没有为每个优先级设置函数,而是使用一个表达式函数,该函数使用运算符优先级表:

// Return an AST tree whose root is a binary operator.
// Parameter ptp is the previous token's precedence.
struct ASTnode *binexpr(int ptp) {
  struct ASTnode *left, *right;
  int tokentype;

  // Get the integer literal on the left.
  // Fetch the next token at the same time.
  left = primary();

  // If no tokens left, return just the left node
  tokentype = Token.token;
  if (tokentype == T_EOF)
    return (left);

  // While the precedence of this token is
  // more than that of the previous token precedence
  while (op_precedence(tokentype) > ptp) {
    // Fetch in the next integer literal
    scan(&Token);

    // Recursively call binexpr() with the
    // precedence of our token to build a sub-tree
    right = binexpr(OpPrec[tokentype]);

    // Join that sub-tree with ours. Convert the token
    // into an AST operation at the same time.
    left = mkastnode(arithop(tokentype), left, right, 0);

    // Update the details of the current token.
    // If no tokens left, return just the left node
    tokentype = Token.token;
    if (tokentype == T_EOF)
      return (left);
  }

  // Return the tree we have when the precedence
  // is the same or lower
  return (left);
}

Firstly, note that this is still recursive like the previous parser functions. This time, we receive the precedence level of the token that was found before we got called. main() will call us with the lowest precedence, 0, but we will call ourselves with higher values.

首先,请注意,这仍然是递归的,就像前面的解析器函数一样。这一次,我们收到了在调用之前找到的令牌的优先级。main()将使用最低优先级0调用我们,但我们将使用更高的值调用我们自己。

You should also spot that the code is quite similar to the multiplicative_expr() function: read in an integer literal, get the operator’s token type, then loop building a tree.

您还应该注意到,该代码与multiplicative_expr() 函数非常相似:读入整数文本,获取运算符的标记类型,然后循环构建树。

The difference is the loop condition and body:

不同之处在于循环条件和循环体:

multiplicative_expr():
  while ((tokentype == T_STAR) || (tokentype == T_SLASH)) {
    scan(&Token); right = primary();

    left = mkastnode(arithop(tokentype), left, right, 0);

    tokentype = Token.token;
    if (tokentype == T_EOF) return (left);
  }

binexpr():
  while (op_precedence(tokentype) > ptp) {
    scan(&Token); right = binexpr(OpPrec[tokentype]);

    left = mkastnode(arithop(tokentype), left, right, 0);

    tokentype = Token.token;
    if (tokentype == T_EOF) return (left);
  }

With the Pratt parser, when the next operator has a higher precedence than our current token, instead of just getting the next integer literal with primary(), we call ourselves with binexpr(OpPrec[tokentype]) to raise the operator precedence.

对于Pratt解析器,当下一个运算符的优先级高于当前标记时,我们使用binexpr(OpPrec[tokentype]) 来提高运算符的优先级,而不是使用primary(),来获取下一个整数文本。

Once we hit a token at our precedence level or lower, we will simply:

一旦我们命中优先级相同或更低级别的标记,我们只需:

  return (left);

This will either be a sub-tree with lots of nodes and operators at a higher precedence that the operator that called us, or it might be a single integer literal for an operator at the same predence as us.

这可能是一个子树,其中包含大量节点和运算符,优先级高于调用我们的运算符,也可能是与我们优先级相同的运算符的单个整数字面值。

Now we have a single function to do expression parsing. It uses a small helper function to enforce the operator precedence, and thus implements the semantics of our language.

现在我们有一个函数来进行表达式解析。它使用一个小的辅助函数来强制运算符优先级,从而实现我们语言的语义。

Putting Both Parsers Into Action 将两个解析器都投入运行

You can make two programs, one with each parser:
你可以同时编译两个程序:

$ make parser                                        # Pratt Parser
cc -o parser -g expr.c interp.c main.c scan.c tree.c

$ make parser2                                       # Precedence Climbing
cc -o parser2 -g expr2.c interp.c main.c scan.c tree.c

You can also test both parsers with the same input files from the previous part of our journey:

也可以使用和之前的部分一样的测试文件来进行测试

$ make test
(./parser input01; \
 ./parser input02; \
 ./parser input03; \
 ./parser input04; \
 ./parser input05)
15                                       # input01 result
29                                       # input02 result
syntax error on line 1, token 5          # input03 result
Unrecognised character . on line 3       # input04 result
Unrecognised character a on line 1       # input05 result

$ make test2
(./parser2 input01; \
 ./parser2 input02; \
 ./parser2 input03; \
 ./parser2 input04; \
 ./parser2 input05)
15                                       # input01 result
29                                       # input02 result
syntax error on line 1, token 5          # input03 result
Unrecognised character . on line 3       # input04 result
Unrecognised character a on line 1       # input05 result

Conclusion and What’s Next 总结展望

It’s probably time to step back a bit and see where we’ve got to. We now have:

  • a scanner that recognises and returns the tokens in our language
  • a parser that recognises our grammar, reports syntax errors and builds an Abstract Syntax Tree
  • a precedence table for the parser that implements the semantics of our language
  • an interpreter that traverses the Abstract Syntax Tree depth-first and calculates the result of the expression in the input

也许是时候退一步,看看我们该做什么了。我们现在有:

  • 识别并返回我们语言中的标记的扫描仪
  • 识别语法、报告语法错误并构建抽象语法树的解析器
  • 实现我们语言语义的解析器的优先表
  • 深度优先遍历抽象语法树并计算输入中表达式结果的解释器

What we don’t have yet is a compiler. But we are so close to making our first compiler!

我们还没有一个编译器。但是我们离制作第一个编译器已经很近了!

In the next part of our compiler writing journey, we will replace the interpreter. In its place, we will write a translator that generates x86-64 assembly code for each AST node that has a maths operator. We will also generate some assembly preamble and postamble to support the assembly code that the generator outputs.

在编译器编写过程的下一部分,我们将替换解释器。我们将编写一个转换器,为每个具有数学运算符的AST节点生成x86-64汇编代码。我们还将生成一些汇编前导码和后导码,以支持生成器输出的汇编代码。

 类似资料: