(来自维基百科)
词法分析(英语:lexical analysis)是计算机科学中将字符序列转换为标记(token)序列的过程。进行词法分析的程序或者函数叫作词法分析器(lexical analyzer,简称lexer),也叫扫描器(scanner)。词法分析器一般以函数的形式存在,供语法分析器调用。
为下面指定的简单编程语言创建一个词法分析器。 程序应从文件和/或stdin读取输入,并将输出写入文件和/或stdout。 如果使用的语言具有lexer模块/库/类,那么提供两种版本的解决方案将是很好的选择:一个不带lexer模块,另一个带lexer模块。
支持以下标记
操作符
Name | Common name | Character sequence |
---|---|---|
Op_multiply | multiply | * |
Op_divide | divide | / |
Op_mod | mod | % |
Op_add | plus | + |
Op_subtract | minus | - |
Op_negate | unary minus | - |
Op_less | less than | < |
Op_lessequal | less than or equal | <= |
Op_greater | greater than | > |
Op_greaterequal | greater than or equal | >= |
Op_equal | equal | == |
Op_notequal | not equal | != |
Op_not | unary not | ! |
Op_assign | assignment | = |
Op_and | logical and | && |
Op_or | logical or | ¦¦ |
符号
Name | Common name | Character |
---|---|---|
LeftParen | left parenthesis | ( |
RightParen | right parenthesis | ) |
LeftBrace | left brace | { |
RightBrace | right brace | } |
Semicolon | semi-colon | ; |
Comma | comma | , |
关键字
Name | Character sequence |
---|---|
Keyword_if | if |
Keyword_else | else |
Keyword_while | while |
Keyword_print | |
Keyword_putc | putc |
标识符和字面量
这些与之前的标记不同,都有与之关联的值。
Name | Common name | Format description | Format regex | Value |
---|---|---|---|---|
Identifier | identifier | one or more letter/number/underscore characters, but not starting with a number | [_a-zA-Z][_a-zA-Z0-9]* | as is |
Integer | integer literal | one or more digits | [0-9]+ | as is, interpreted as a number |
Integer | char literal | exactly one character (anything except newline or single quote) or one of the allowed escape sequences, enclosed by single quotes | ‘([^’\n] | \n |
String | string literal | zero or more characters (anything except newline or double quote), enclosed by double quotes | “[^”\n]*" | the characters without the double quotes and with escape sequences converted |
零宽标记
Name | Location |
---|---|
End_of_input | when the end of the input stream is reached |
空格
例如以下两个程序片段是等效的:
if ( p /* meaning n is prime */ ) {
print ( n , " " ) ;
count = count + 1 ; /* number of primes found so far */
}
if(p){print(n," ");count=count+1;}
所有标记名称
End_of_input Op_multiply Op_divide Op_mod Op_add Op_subtract
Op_negate Op_not Op_less Op_lessequal Op_greater Op_greaterequal
Op_equal Op_notequal Op_assign Op_and Op_or Keyword_if
Keyword_else Keyword_while Keyword_print Keyword_putc LeftParen RightParen
LeftBrace RightBrace Semicolon Comma Identifier Integer
String
输出格式
程序输出应该是一系列的行,每行包括以下用空格分隔的字段:
诊断功能
以下错误情况需要捕获:
Error | Example |
---|---|
Empty character constant | ‘’ |
Unknown escape sequence. | \r |
Multi-character constant. | ‘xx’ |
End-of-file in comment. Closing comment characters not found. | |
End-of-file while scanning string literal. Closing string character not found. | |
End-of-line while scanning string literal. Closing string character not found before end-of-line. | |
Unrecognized character. | | |
Invalid number. Starts like a number, but ends in non-numeric characters. | 123abc |
测试用例
输入
/*
Hello world
*/
print("Hello, World!\n");
输出:
4 1 Keyword_print
4 6 LeftParen
4 7 String "Hello, World!\n"
4 24 RightParen
4 25 Semicolon
5 1 End_of_input
输入
/*
Show Ident and Integers
*/
phoenix_number = 142857;
print(phoenix_number, "\n");
输出
4 1 Identifier phoenix_number
4 16 Op_assign
4 18 Integer 142857
4 24 Semicolon
5 1 Keyword_print
5 6 LeftParen
5 7 Identifier phoenix_number
5 21 Comma
5 23 String "\n"
5 27 RightParen
5 28 Semicolon
6 1 End_of_input
输入
/*
All lexical tokens - not syntactically correct, but that will
have to wait until syntax analysis
*/
/* Print */ print /* Sub */ -
/* Putc */ putc /* Lss */ <
/* If */ if /* Gtr */ >
/* Else */ else /* Leq */ <=
/* While */ while /* Geq */ >=
/* Lbrace */ { /* Eq */ ==
/* Rbrace */ } /* Neq */ !=
/* Lparen */ ( /* And */ &&
/* Rparen */ ) /* Or */ ||
/* Uminus */ - /* Semi */ ;
/* Not */ ! /* Comma */ ,
/* Mul */ * /* Assign */ =
/* Div */ / /* Integer */ 42
/* Mod */ % /* String */ "String literal"
/* Add */ + /* Ident */ variable_name
/* character literal */ '\n'
/* character literal */ '\\'
/* character literal */ ' '
输出
5 16 Keyword_print
5 40 Op_subtract
6 16 Keyword_putc
6 40 Op_less
7 16 Keyword_if
7 40 Op_greater
8 16 Keyword_else
8 40 Op_lessequal
9 16 Keyword_while
9 40 Op_greaterequal
10 16 LeftBrace
10 40 Op_equal
11 16 RightBrace
11 40 Op_notequal
12 16 LeftParen
12 40 Op_and
13 16 RightParen
13 40 Op_or
14 16 Op_subtract
14 40 Semicolon
15 16 Op_not
15 40 Comma
16 16 Op_multiply
16 40 Op_assign
17 16 Op_divide
17 40 Integer 42
18 16 Op_mod
18 40 String "String literal"
19 16 Op_add
19 40 Identifier variable_name
20 26 Integer 10
21 26 Integer 92
22 26 Integer 32
23 1 End_of_input
输入
/*** test printing, embedded \n and comments with lots of '*' ***/
print(42);
print("\nHello World\nGood Bye\nok\n");
print("Print a slash n - \\n.\n");
输出
2 1 Keyword_print
2 6 LeftParen
2 7 Integer 42
2 9 RightParen
2 10 Semicolon
3 1 Keyword_print
3 6 LeftParen
3 7 String "\nHello World\nGood Bye\nok\n"
3 38 RightParen
3 39 Semicolon
4 1 Keyword_print
4 6 LeftParen
4 7 String "Print a slash n - \\n.\n"
4 33 RightParen
4 34 Semicolon
5 1 End_of_input