编译器之词法分析器（Lexical Analyzer）

伍嘉

2023-12-01

定义

（来自维基百科）
词法分析（英语：lexical analysis）是计算机科学中将字符序列转换为标记（token）序列的过程。进行词法分析的程序或者函数叫作词法分析器（lexical analyzer，简称lexer），也叫扫描器（scanner）。词法分析器一般以函数的形式存在，供语法分析器调用。

任务描述

为下面指定的简单编程语言创建一个词法分析器。程序应从文件和/或stdin读取输入，并将输出写入文件和/或stdout。如果使用的语言具有lexer模块/库/类，那么提供两种版本的解决方案将是很好的选择：一个不带lexer模块，另一个带lexer模块。

支持以下标记
操作符

Name	Common name	Character sequence
Op_multiply	multiply	*
Op_divide	divide	/
Op_mod	mod	%
Op_add	plus	+
Op_subtract	minus	-
Op_negate	unary minus	-
Op_less	less than	<
Op_lessequal	less than or equal	<=
Op_greater	greater than	>
Op_greaterequal	greater than or equal	>=
Op_equal	equal	==
Op_notequal	not equal	!=
Op_not	unary not	!
Op_assign	assignment	=
Op_and	logical and	&&
Op_or	logical or	¦¦

符号

Name	Common name	Character
LeftParen	left parenthesis	(
RightParen	right parenthesis	)
LeftBrace	left brace	{
RightBrace	right brace	}
Semicolon	semi-colon	;
Comma	comma	,

关键字

Name	Character sequence
Keyword_if	if
Keyword_else	else
Keyword_while	while
Keyword_print	print
Keyword_putc	putc

标识符和字面量
这些与之前的标记不同，都有与之关联的值。

Name	Common name	Format description	Format regex	Value
Identifier	identifier	one or more letter/number/underscore characters, but not starting with a number	`[_a-zA-Z][_a-zA-Z0-9]*`	as is
Integer	integer literal	one or more digits	`[0-9]+`	as is, interpreted as a number
Integer	char literal	exactly one character (anything except newline or single quote) or one of the allowed escape sequences, enclosed by single quotes	`‘([^’\n]`	\n
String	string literal	zero or more characters (anything except newline or double quote), enclosed by double quotes	`“[^”\n]*"`	the characters without the double quotes and with escape sequences converted

对于char和string字面量，支持使用\n转义序列来表示换行符
对于char和string字面量，要表示反斜杠，使用\
不支持其他特殊序列。这意味着：
- 字符字面量不能表示单引号字符（值39）
- 字符串字面量不能表示包含双引号字符的字符串

零宽标记

Name	Location
End_of_input	when the end of the input stream is reached

空格

任意两个标记之间都允许零个或多个空格字符或用 / * … * / 括起来的注释，以下说明除外。
“最长标识匹配”用于解决冲突（例如将 <= 匹配为单个标识而不是 < 和 = 两个标识）
两个具有字母数字字符或下划线的标记之间必须有空格。
- keywords, identifiers, integer literals.
- 例如 ifprint 被识别为一个标识符，而不是关键字 if 和 print
- 例如 42fred 是无效的，既不会被识别为数字也不会被识别为标识符
标记内不允许使用空格（字符和字符串属于值的一部分除外）
- 例如 & & 是无效的，不会被解释为&&操作符

例如以下两个程序片段是等效的：

if ( p /* meaning n is prime */ ) {
    print ( n , " " ) ;
    count = count + 1 ; /* number of primes found so far */
}

if(p){print(n," ");count=count+1;}

所有标记名称

End_of_input  Op_multiply   Op_divide     Op_mod       Op_add     Op_subtract
Op_negate     Op_not        Op_less       Op_lessequal Op_greater Op_greaterequal
Op_equal      Op_notequal   Op_assign     Op_and       Op_or      Keyword_if
Keyword_else  Keyword_while Keyword_print Keyword_putc LeftParen  RightParen
LeftBrace     RightBrace    Semicolon     Comma        Identifier Integer
String

输出格式
程序输出应该是一系列的行，每行包括以下用空格分隔的字段：

标识开始位置的行号
标识开始位置的列号
标识名
标识值 (只对于Identifier, Integer, String)
字段之间的空格数取决于自己，最好对齐

诊断功能
以下错误情况需要捕获：

Error	Example
Empty character constant	`‘’`
Unknown escape sequence.	`\r`
Multi-character constant.	`‘xx’`
End-of-file in comment. Closing comment characters not found.
End-of-file while scanning string literal. Closing string character not found.
End-of-line while scanning string literal. Closing string character not found before end-of-line.
Unrecognized character.	`\|`
Invalid number. Starts like a number, but ends in non-numeric characters.	`123abc`

测试用例

测试用例一

输入

/*
  Hello world
 */
print("Hello, World!\n");

输出：

    4      1 Keyword_print
    4      6 LeftParen
    4      7 String         "Hello, World!\n"
    4     24 RightParen
    4     25 Semicolon
    5      1 End_of_input

测试用例二

输入

/*
  Show Ident and Integers
 */
phoenix_number = 142857;
print(phoenix_number, "\n");

输出

    4      1 Identifier     phoenix_number
    4     16 Op_assign
    4     18 Integer         142857
    4     24 Semicolon
    5      1 Keyword_print
    5      6 LeftParen
    5      7 Identifier     phoenix_number
    5     21 Comma
    5     23 String         "\n"
    5     27 RightParen
    5     28 Semicolon
    6      1 End_of_input

测试用例三

输入

/*
  All lexical tokens - not syntactically correct, but that will
  have to wait until syntax analysis
 */
/* Print   */  print    /* Sub     */  -
/* Putc    */  putc     /* Lss     */  <
/* If      */  if       /* Gtr     */  >
/* Else    */  else     /* Leq     */  <=
/* While   */  while    /* Geq     */  >=
/* Lbrace  */  {        /* Eq      */  ==
/* Rbrace  */  }        /* Neq     */  !=
/* Lparen  */  (        /* And     */  &&
/* Rparen  */  )        /* Or      */  ||
/* Uminus  */  -        /* Semi    */  ;
/* Not     */  !        /* Comma   */  ,
/* Mul     */  *        /* Assign  */  =
/* Div     */  /        /* Integer */  42
/* Mod     */  %        /* String  */  "String literal"
/* Add     */  +        /* Ident   */  variable_name
/* character literal */  '\n'
/* character literal */  '\\'
/* character literal */  ' '

输出

    5     16   Keyword_print
    5     40   Op_subtract
    6     16   Keyword_putc
    6     40   Op_less
    7     16   Keyword_if
    7     40   Op_greater
    8     16   Keyword_else
    8     40   Op_lessequal
    9     16   Keyword_while
    9     40   Op_greaterequal
   10     16   LeftBrace
   10     40   Op_equal
   11     16   RightBrace
   11     40   Op_notequal
   12     16   LeftParen
   12     40   Op_and
   13     16   RightParen
   13     40   Op_or
   14     16   Op_subtract
   14     40   Semicolon
   15     16   Op_not
   15     40   Comma
   16     16   Op_multiply
   16     40   Op_assign
   17     16   Op_divide
   17     40   Integer             42
   18     16   Op_mod
   18     40   String          "String literal"
   19     16   Op_add
   19     40   Identifier      variable_name
   20     26   Integer             10
   21     26   Integer             92
   22     26   Integer             32
   23      1   End_of_input

测试用例四

输入

/*** test printing, embedded \n and comments with lots of '*' ***/
print(42);
print("\nHello World\nGood Bye\nok\n");
print("Print a slash n - \\n.\n");

输出

    2      1 Keyword_print
    2      6 LeftParen
    2      7 Integer            42
    2      9 RightParen
    2     10 Semicolon
    3      1 Keyword_print
    3      6 LeftParen
    3      7 String          "\nHello World\nGood Bye\nok\n"
    3     38 RightParen
    3     39 Semicolon
    4      1 Keyword_print
    4      6 LeftParen
    4      7 String          "Print a slash n - \\n.\n"
    4     33 RightParen
    4     34 Semicolon
    5      1 End_of_input

代码实现

C实现
 C#实现
 Go实现
 Java实现
 JavaScript实现
 Python实现

编译器之词法分析器（Lexical Analyzer）

定义

任务描述

代码实现

相关阅读

相关文章

相关问答

相关文档