【acwj】01, Introduction to Lexical Scanning 词法扫描介绍

况承福

2023-12-01

搬运自https://github.com/DoctorWkt/acwj，一个介绍如何使用C语言编写一个可自举的类C语言编译器的说明。进行了粗略的翻译。

【acwj】01, Introduction to Lexical Scanning 词法扫描介绍

We start our compiler writing journey with a simple lexical scanner. As I mentioned in the previous part, the job of the scanner is to identify the lexical elements, or tokens, in the input language.

我们从一个简单的词法扫描器开始我们的编译器编写之旅。就像我之前提到的，扫描器的工作是在输入语言中分辨词法元素，或者说标记。

We will start with a language that has only five lexical elements:

我们将从一个只有五个词法元素的语言开始：

the four basic maths operators: *, /, + and -
四个基本的数学运算符
decimal whole numbers which have 1 or more digits 0 … 9
数字0～9

Each token that we scan is going to be stored in this structure (from defs.h):

每个我们扫描得到的标记存储在这样的结构里（defs.h）：

// Token structure
struct token {
  int token;
  int intvalue;
};

where the token field can be one of these values (from defs.h):

这里的token从以下值中的取（defs.h）：

// Tokens
enum {
  T_PLUS, T_MINUS, T_STAR, T_SLASH, T_INTLIT
};

When the token is a T_INTLIT (i.e. an integer literal), the intvalue field will hold the value of the integer that we scanned in.

当token的值是T_INILIT时（即字面意义的整数），intvalue字段将存储我们扫描得到的数字。

Functions in `scan.c` scan.c中的函数

The scan.c file holds the functions of our lexical scanner. We are going to read in one character at a time from our input file. However, there will be times when we need to “put back” a character if we have read too far ahead in the input stream. We also want to track what line we are currently on so that we can print the line number in our debug messages. All of this is done by the next() function:

文件scan.c中存放了我们的词法扫描器。我们将从输入文件中一次读取一个字符。但是，当我们在输入流中读得太远时，有时候我们需要“放回”一个字符。我们也需要追踪我们所在的行，以便我们能在debug信息中打印出行号。所有的这些都在next()函数中完成。

// Get the next character from the input file.
static int next(void) {
  int c;

  if (Putback) {                // Use the character put
    c = Putback;                // back if there is one
    Putback = 0;
    return c;
  }

  c = fgetc(Infile);            // Read from input file
  if ('\n' == c)
    Line++;                     // Increment line count
  return c;
}

The Putback and Line variables are defined in data.h along with our input file pointer:

变量Putback和Line，还有我们的输入文件指针Infile定义在data.h中。

extern_ int     Line;
extern_ int     Putback;
extern_ FILE    *Infile;

All C files will include this where extern_ is replaced with extern. But main.c will remove the extern_; hence, these variables will “belong” to main.c.

所有的C文件都将包含data.h，其中extern_将会被替换为extern。但是main.c中将会移除extern_，因此，这些变量是main.c独有的。（即main.c在其他所有文件之前先定义了这些变量，然后其他文件再用extern对这些变量进行了修饰）

Finally, how do we put a character back into the input stream? Thus:

最后，我们要怎么将一个字符放回到输入流中呢？像这样子：

// Put back an unwanted character
static void putback(int c) {
  Putback = c;
}

Ignoring Whitespace 忽略空白字符

We need a function that reads and silently skips whitespace characters until it gets a non-whitespace character, and returns it. Thus:

我们需要一个函数来忽略空白字符，直到其获取了一个非空白字符，并返回。像这样子：

// Skip past input that we don't need to deal with, 
// i.e. whitespace, newlines. Return the first
// character we do need to deal with.
static int skip(void) {
  int c;

  c = next();
  while (' ' == c || '\t' == c || '\n' == c || '\r' == c || '\f' == c) {
    c = next();
  }
  return (c);
}

Scanning Tokens: `scan()` 扫描标记：scan()

So now we can read characters in while skipping whitespace; we can also put back a character if we read one character too far ahead. We can now write our first lexical scanner:

因此，现在我们能在跳过空白的同时读取字符；如果我们读的太多的时候，我们也能放回一个字符。现在，我们可以写我们的第一个词法扫描器了：

// Scan and return the next token found in the input.
// Return 1 if token valid, 0 if no tokens left.
int scan(struct token *t) {
  int c;

  // Skip whitespace
  c = skip();

  // Determine the token based on
  // the input character
  switch (c) {
  case EOF:
    return (0);
  case '+':
    t->token = T_PLUS;
    break;
  case '-':
    t->token = T_MINUS;
    break;
  case '*':
    t->token = T_STAR;
    break;
  case '/':
    t->token = T_SLASH;
    break;
  default:
    // More here soon
  }

  // We found a token
  return (1);
}

That’s it for the simple one-character tokens: for each recognised character, turn it into a token. You may ask: why not just put the recognised character into the struct token? The answer is that later we will need to recognise multi-character tokens such as == and keywords like if and while. So it will make life easier to have an enumerated list of token values.

这就是简单的单字符标记：对于每一个已识别的字符，将其转化为标记。你可能问：为什么不将识别了的标记值放入struct token中（而是记录其枚举值）？理由是，之后我们将需要识别多字符标记，例如==；还有一些关键字，例如if、while。因此，拥有一个枚举列表的标记值将会使得接下来的开发更简单。

Integer Literal Values 整数字面值

In fact, we already have to face this situation as we also need to recognise integer literal values like 3827 and 87731. Here is the missing default code from the switch statement:

实际上，我们已经不得不面对这个局面（指识别多字符标记），因为我们也需要识别整数字面值，例如3827、87731等。以下是switch中空缺的default的代码

  default:

    // If it's a digit, scan the
    // literal integer value in
    if (isdigit(c)) {
      t->intvalue = scanint(c);
      t->token = T_INTLIT;
      break;
    }

    printf("Unrecognised character %c on line %d\n", c, Line);
    exit(1);

Once we hit a decimal digit character, we call the helper function scanint() with this first character. It will return the scanned integer value. To do this, it has to read each character in turn, check that it’s a legitimate digit, and build up the final number. Here is the code:

一旦我们识别到了一个十进制字符，我们调用辅助函数scanint()并将该十进制字符作为参数传递。它将返回扫描了的整数值。为了完成这个工作，它需要轮流读取每一个字符，并检查是否是一个合法的十进制数，然后创建最终的数字。以下是代码：

// Scan and return an integer literal
// value from the input file. Store
// the value as a string in Text.
static int scanint(int c) {
  int k, val = 0;

  // Convert each character into an int value
  while ((k = chrpos("0123456789", c)) >= 0) {
    val = val * 10 + k;
    c = next();
  }

  // We hit a non-integer character, put it back.
  putback(c);
  return val;
}

We start with a zero val value. Each time we get a character in the set 0 to 9 we convert this to an int value with chrpos(). We make val 10 times bigger and then add this new digit to it.

我们从值为0的val开始，每次获取集合0到9中一个字符，并将其转化为一个int值。然后将val乘以10，并将新的数字加进去。

For example, if we have the characters 3, 2, 8, we do:

例如，如果我们有字符3，2，8，我们这么做：

val= 0 * 10 + 3, i.e. 3
val= 3 * 10 + 2, i.e. 32
val= 32 * 10 + 8, i.e. 328

Right at the end, did you notice the call to putback(c)? We found a character that’s not a decimal digit at this point. We can’t simply discard it, but luckily we can put it back in the input stream to be consumed later.

在代码的最后，你注意到函数putback(c)的调用了吗？我们在这个点找到了一个不是十进制数的字符。我们不能简单地忽视它，但是幸运的是我们可以将其放回到输入流中，以便下次使用。

You may also ask at this point: why not simply subtract the ASCII value of ‘0’ from c to make it an integer? The answer is that, later on, we will be able to do chrpos("0123456789abcdef") to convert hexadecimal digits as well.

你可能会在这个点提问：为什么不直接把ASCII码的值减去0作为十进制字符的值？理由是，我们接下来将同样使用chrpos("0123456789abcdef")来转换十六进制的数字。

Here’s the code for chrpos():

以下是chrpos()的代码：

// Return the position of character c
// in string s, or -1 if c not found
static int chrpos(char *s, int c) {
  char *p;

  p = strchr(s, c);
  return (p ? p - s : -1);
}

And that’s it for the lexical scanner code in scan.c for now.

词法扫描器的代码在scan.c中。

Putting the Scanner to Work 让扫描器工作

The code in main.c puts the above scanner to work. The main() function opens up a file and then scans it for tokens:

main.c中的代码让以上扫描器的代码工作。main函数打开了一个文件，然后扫描标记：

void main(int argc, char *argv[]) {
  ...
  init();
  ...
  Infile = fopen(argv[1], "r");
  ...
  scanfile();
  exit(0);
}

And scanfile() loops while there is a new token and prints out the details of the token:

scanfile()循环直至没有新标记，并打印标记的细节

// List of printable tokens
char *tokstr[] = { "+", "-", "*", "/", "intlit" };

// Loop scanning in all the tokens in the input file.
// Print out details of each token found.
static void scanfile() {
  struct token T;

  while (scan(&T)) {
    printf("Token %s", tokstr[T.token]);
    if (T.token == T_INTLIT)
      printf(", value %d", T.intvalue);
    printf("\n");
  }
}

Some Example Input Files 一些输入文件的例子

I’ve provided some example input files so you can see what tokens the scanner finds in each file, and what input files the scanner rejects.

我提供了一些输入文件样例，以便你能看到哪些标记是扫描器在文件中找到的，而哪些标记是扫描器拒绝的。

$ make
cc -o scanner -g main.c scan.c

$ cat input01
2 + 3 * 5 - 8 / 3

$ ./scanner input01
Token intlit, value 2
Token +
Token intlit, value 3
Token *
Token intlit, value 5
Token -
Token intlit, value 8
Token /
Token intlit, value 3

$ cat input04
23 +
18 -
45.6 * 2
/ 18

$ ./scanner input04
Token intlit, value 23
Token +
Token intlit, value 18
Token -
Token intlit, value 45
Unrecognised character . on line 3

Conclusion and What’s Next 总结和接下来的工作

We’ve started small and we have a simple lexical scanner that recognises the four main maths operators and also integer literal values. We saw that we needed to skip whitespace and put back characters if we read too far into the input.

我们已经开始了一点点工作，完成了一个简单的次法扫描器，它能识别4个简单的数学运算和整数字面符。我们发现，如果我们在输入中读得过多，我们需要跳过空白并放回字符。

Single character tokens are easy to scan, but multi-character tokens are a bit harder. But at the end, the scan() function returns the next token from the input file in a struct token variable:

单个字符标记能容易地扫描，但是多字符标记更难一点。但是最后，scan()函数返回了struct token变量中输入文件中的下一个标记：

struct token {
  int token;
  int intvalue;
};

在我们编译器编写之旅的下一个部分，我们将建立一个递归下降解析器来解释输入文件的语法，并计算和打印每个文件的最终值。

【acwj】01, Introduction to Lexical Scanning 词法扫描介绍

【acwj】01, Introduction to Lexical Scanning 词法扫描介绍

Functions in `scan.c` scan.c中的函数

Ignoring Whitespace 忽略空白字符

Scanning Tokens: `scan()` 扫描标记：scan()

Integer Literal Values 整数字面值

Putting the Scanner to Work 让扫描器工作

Some Example Input Files 一些输入文件的例子

Conclusion and What’s Next 总结和接下来的工作

相关阅读

相关文章

相关问答

相关文档

【acwj】01, Introduction to Lexical Scanning 词法扫描介绍

【acwj】01, Introduction to Lexical Scanning 词法扫描介绍

Functions in scan.c scan.c中的函数

Ignoring Whitespace 忽略空白字符

Scanning Tokens: scan() 扫描标记：scan()

Integer Literal Values 整数字面值

Putting the Scanner to Work 让扫描器工作

Some Example Input Files 一些输入文件的例子

Conclusion and What’s Next 总结和接下来的工作

相关阅读

相关文章

相关问答

相关文档

Functions in `scan.c` scan.c中的函数

Scanning Tokens: `scan()` 扫描标记：scan()