搬运自https://github.com/DoctorWkt/acwj,一个介绍如何使用C语言编写一个可自举的类C语言编译器的说明。进行了粗略的翻译。
It ‘s time to add some “proper” statements to the grammar of our language. I want to be able to write lines of code like this:
是时候在我们的语言语法中添加一些“适当的”语句了。我希望能够像这样编写代码:
print 2 + 3 * 5;
print 18 - 6/3 + 4*2;
Of course, as we are ignoring whitespace, there’ s no necessity that all the tokens for one statement are on the same line.Each statement starts with the keyword print
and is terminated with a semicolon.So these are going to become new tokens in our language.
当然,由于我们忽略了空格,一条语句的所有标记不一定都在同一行。每条语句都以关键字print开头,以分号结尾。因此,这些将成为我们语言中的新标记。
We’ve already seen the BNF notation for expressions. Now let’s define the BNF syntax for the above types of statements:
我们已经看到了expression的BNF表达,现在让我们为statements定义BNF语法
statements: statement
| statement statements
;
statement: 'print' expression ';'
;
An input file consists of several statements. They are either one statement, or a statement followed by more statements. Each statement starts with the keyword print
, then one expression, then a semicolon.
一个输入文件由几个语句组成。它们要么是一条语句,要么是一条语句后跟多条语句。每个语句都以关键字print开头,然后是一个表达式,然后是分号。
Before we can get to the code that parses the above syntax, we need to add a few more bits and pieces to the existing code. Let’s start with the lexical scanner.
在开始解析上述语法的代码之前,我们需要向现有代码中添加更多的片段。让我们从词法扫描器开始。
Adding a token for semicolons will be easy. Now, the print
keyword. Later on, we’ll have many keywords in the language, plus identifiers for our variables, so we’ll need to add some code which helps us to deal with them.
为分号添加标记将很容易。现在,着重于print
关键字。稍后,我们将在语言中有许多关键字,以及变量的标识符,因此我们需要添加一些代码来帮助我们处理它们。
In scan.c
, I’ve added this code which I’ve borrowed from the SubC compiler. It reads in alphanumeric characters into a buffer until it hits a non-alphanumeric character.
在scan.c
中,我添加了从SubC编译器中借用的代码。它将字母数字字符读入缓冲区,直到碰到非字母数字字符。
// Scan an identifier from the input file and
// store it in buf[]. Return the identifier's length
static int scanident(int c, char *buf, int lim) {
int i = 0;
// Allow digits, alpha and underscores
while (isalpha(c) || isdigit(c) || '_' == c) {
// Error if we hit the identifier length limit,
// else append to buf[] and get next character
if (lim - 1 == i) {
printf("identifier too long on line %d\n", Line);
exit(1);
} else if (i < lim - 1) {
buf[i++] = c;
}
c = next();
}
// We hit a non-valid character, put it back.
// NUL-terminate the buf[] and return the length
putback(c);
buf[i] = '\0';
return (i);
}
We also need a function to recognise keywords in the language. One way would be to have a list of keywords, and to walk the list and strcmp()
each one against the buffer from scanident()
. The code from SubC has an optimisation: match against the first letter before doing the strcmp()
. This speeds up the comparison against dozens of keywords. Right now we don’t need this optimisation but I’ve put it in for later:
我们还需要一个函数来识别语言中的关键字。一种方法是使用一个关键字列表,并根据ScanIdentit()
的缓冲区遍历列表和strcmp()
。SubC的代码有一个优化:在执行strcmp()
之前匹配第一个字母。这加快了与几十个关键字的比较。现在我们不需要这种优化,但我已将其放在后面:
// Given a word from the input, return the matching
// keyword token number or 0 if it's not a keyword.
// Switch on the first letter so that we don't have
// to waste time strcmp()ing against all the keywords.
static int keyword(char *s) {
switch (*s) {
case 'p':
if (!strcmp(s, "print"))
return (T_PRINT);
break;
}
return (0);
}
Now, at the bottom of the switch statement in scan()
, we add this code to recognise semicolons and keywords:
现在,在scan()
中switch语句的底部,我们添加以下代码来识别分号和关键字:
case ';':
t->token = T_SEMI;
break;
default:
// If it's a digit, scan the
// literal integer value in
if (isdigit(c)) {
t->intvalue = scanint(c);
t->token = T_INTLIT;
break;
} else if (isalpha(c) || '_' == c) {
// Read in a keyword or identifier
scanident(c, Text, TEXTLEN);
// If it's a recognised keyword, return that token
if (tokentype = keyword(Text)) {
t->token = tokentype;
break;
}
// Not a recognised keyword, so an error for now
printf("Unrecognised symbol %s on line %d\n", Text, Line);
exit(1);
}
// The character isn't part of any recognised token, error
printf("Unrecognised character %c on line %d\n", c, Line);
exit(1);
I’ve also added a global Text
buffer to store the keywords and identifiers:
我还添加了一个全局文本缓冲区Text
来存储关键字和标识符:
#define TEXTLEN 512 // Length of symbols in input
extern_ char Text[TEXTLEN + 1]; // Last identifier scanned
Up to now our input files have contained just a single expression; therefore, in our Pratt parser code in binexpr()
(in expr.c
), we had this code to exit the parser:
到目前为止,我们的输入文件只包含一个表达式;因此,在binexpr()
中的Pratt解析器代码(在expr.c
中)中,我们用以下代码退出解析器:
// If no tokens left, return just the left node
tokentype = Token.token;
if (tokentype == T_EOF)
return (left);
With our new grammar, each expression is terminated by a semicolon. Thus, we need to change the code in the expression parser to spot the T_SEMI
tokens and exit the expression parsing:
在我们的新语法中,每个表达式都以分号结尾。因此,我们需要更改表达式解析器中的代码,以T_SEMI
找到分号并退出表达式解析:
// Return an AST tree whose root is a binary operator.
// Parameter ptp is the previous token's precedence.
struct ASTnode *binexpr(int ptp) {
struct ASTnode *left, *right;
int tokentype;
// Get the integer literal on the left.
// Fetch the next token at the same time.
left = primary();
// If we hit a semicolon, return just the left node
tokentype = Token.token;
if (tokentype == T_SEMI)
return (left);
while (op_precedence(tokentype) > ptp) {
...
// Update the details of the current token.
// If we hit a semicolon, return just the left node
tokentype = Token.token;
if (tokentype == T_SEMI)
return (left);
}
}
I want to keep the generic code generator in gen.c
separate from the CPU-specific code in cg.c
. That also means that the rest of the compiler should only ever call the functions in gen.c
, and only gen.c
should call the code in cg.c
.
我想将gen.c
中的通用代码生成器与cg.c
中的CPU特定代码分开。这也意味着编译器的其余部分应该只调用gen.c
中的函数,并且只有gen.c
能调用cg.c
中的代码。
To this end, I’ve defined some new “front-end” functions in gen.c
:
在最后,我gen.c
中定义了一些新的“前端”函数
void genpreamble() { cgpreamble(); }
void genpostamble() { cgpostamble(); }
void genfreeregs() { freeall_registers(); }
void genprintint(int reg) { cgprintint(reg); }
We have a new file stmt.c
. This will hold the parsing code for all the main statements in our language. Right now, we need to parse the BNF grammar for statements which I gave up above. This is done with this single function. I’ve converted the recursive definition into a loop:
我们有一个新文件stmt.c
。这将保存我们语言中所有主要语句的解析代码。现在,我们需要为我在上面放弃的语句解析BNF语法。这是通过这个单一函数完成的。我已将递归定义转换为循环:
// Parse one or more statements
void statements(void) {
struct ASTnode *tree;
int reg;
while (1) {
// Match a 'print' as the first token
match(T_PRINT, "print");
// Parse the following expression and
// generate the assembly code
tree = binexpr(0);
reg = genAST(tree);
genprintint(reg);
genfreeregs();
// Match the following semicolon
// and stop if we are at EOF
semi();
if (Token.token == T_EOF)
return;
}
}
In each loop, the code finds a T_PRINT token. It then calls binexpr()
to parse the expression. Finally, it finds the T_SEMI token. If a T_EOF token follows, we break out of the loop.
在每个循环中,这段从T_PRINT
标记开始寻找。然后,其调用binexpr()
来解析该表达式。最后,它找到T_SEMI
token。如果最后跟随的是T_EOF
标记,则我们跳出这个循环
After each expression tree, the code in gen.c
is called to convert the tree into assembly code and to call the assembly printint()
function to print out the final value.
在每个表达式树之后,调用gen.c
中的代码将树转换为汇编代码,并调用汇编的printint()
函数打印最终值。
There are a couple of new helper functions in the above code, which I’ve put into a new file, misc.c
:
上面的代码中有一对辅助函数,我将其放在了misc.c
中:
// Ensure that the current token is t,
// and fetch the next token. Otherwise
// throw an error
void match(int t, char *what) {
if (Token.token == t) {
scan(&Token);
} else {
printf("%s expected on line %d\n", what, Line);
exit(1);
}
}
// Match a semicon and fetch the next token
void semi(void) {
match(T_SEMI, ";");
}
These form part of the syntax checking in the parser. Later on, I’ll add more short functions to call match()
to make our syntax checking easier.
它们构成解析器中语法检查的一部分。稍后,我将添加更多的短函数来调用match()
,以简化语法检查。
main()
main函数的修改main()
used to call binexpr()
directly to parse the single expression in the old input files. Now it does this:
main
函数之前直接调用binexpr()
函数来解析单表达式。现在这么做:
scan(&Token); // Get the first token from the input
genpreamble(); // Output the preamble
statements(); // Parse the statements in the input
genpostamble(); // Output the postamble
fclose(Outfile); // Close the output file and exit
exit(0);
That’s about it for the new and changed code. Let’s give the new code a whirl. Here is the new input file, input01
:
print 12 * 3;
print
18 - 2
* 4; print
1 + 2 +
9 - 5/2 + 3*5;
Yes I’ve decided to check that we have have tokens spread out across multiple lines. To compile and run the input file, do a make test
:
是的,我决定检查一下我们是否有分散在多行的标记。要编译并运行输入文件,请执行make test
:
$ make test
cc -o comp1 -g cg.c expr.c gen.c main.c misc.c scan.c stmt.c tree.c
./comp1 input01
cc -o out out.s
./out
36
10
25
We’ve added our first “real” statement grammar to our language. I’ve defined it in BNF notation, but it was easier to implement it with a loop and not recursively. Don’t worry, we’ll go back to doing recursive parsing soon.
Along the way we had to modify the scanner, add support for keywords and identifiers, and to more cleanly separate the generic code generator and the CPU-specific generator.
In the next part of our compiler writing journey, we will add variables to the language. This will require a significant amount of work.
我们已经在我们的语言中添加了第一个“真实”语句语法。我已经用BNF符号定义了它,但是用循环实现它更容易,而不是递归实现。别担心,我们很快就会回到递归解析。
在此过程中,我们必须修改扫描器,添加对关键字和标识符的支持,并更清晰地分离通用代码生成器和特定于CPU的生成器。
在编译器编写过程的下一部分中,我们将向语言中添加变量。这将需要大量的工作。