问题：

ANTLR4行注释和文本解析问题

刘选

2023-03-14

我正在编写c头样式文件的解析器，并面临正确的行注释处理问题。

CustomLexer。g4级

lexer grammar CustomLexer;

SPACES          : [ \r\n\t]+ -> skip;
COMMENT_START   : '//' -> pushMode(COMMENT_MODE);
PRAGMA          : '#pragma';
SECTION         : '@section';
DEFINE          : '#define';
UNDEF           : '#undef';
IF              : '#if';
ELIF            : '#elif';
ELSE            : '#else';
IFDEF           : '#ifdef';
IFNDEF          : '#ifndef';
ENDIF           : '#endif';
ENABLED         : 'ENABLED';
DISABLED        : 'DISABLED';
EITHER          : 'EITHER';
ANY             : 'ANY';
DEFINED         : 'defined';
BOTH            : 'BOTH';
BOOLEAN_LITERAL :  'true' | 'false';
STRING          : '"' .*? '"';
HEXADECIMAL     : '0x' ([a-fA-F0-9])+;
LITERAL_SUFFIX  : 'L'|'u'|'U'|'Lu'|'LU'|'uL'|'UL'|'f'|'F';
IDENTIFIER      : [a-zA-Z_] [a-zA-Z_0-9]*;
BLOCK_COMMENT   : '/**' .*? '*/';
NUMBER          : ('-')? Int ('.' Digit*)? | '0';
CHAR_SEQUENCE   : [a-zA-Z_] [a-zA-Z_0-9.]*;
ARRAY_SEQUENCE  : '{' .*?  '}';
OPAREN          : '(';
CPAREN          : ')';
OBRACE          : '{';
CBRACE          : '}';
ADD             : '+';
SUBTRACT        : '-';
MULTIPLY        : '*';
DIVIDE          : '/';
MODULUS         : '%';
OR              : '||';
AND             : '&&';
EQUALS          : '==';
NEQUALS         : '!=';
GTEQUALS        : '>=';
LTEQUALS        : '<=';
GT              : '>';
LT              : '<';
EXCL            : '!';
QMARK           : '?';
COLON           : ':';
COMA            : ',';
OTHER           : .;

fragment Int    : [0-9] Digit* | '0';
fragment Digit  : [0-9];

mode COMMENT_MODE;
  COMMENT_MODE_DEFINE     : '#define' -> type(DEFINE), popMode;
  COMMENT_MODE_SECTION    : '@section' -> type(SECTION), popMode;
  COMMENT_MODE_IF         : '#if' -> type(IF), popMode;
  COMMENT_MODE_ENDIF      : '#endif' -> type(ENDIF), popMode;
  COMMENT_MODE_LINE_BREAK : [\r\n]+ -> skip, popMode;
  
  COMMENT_MODE_PART       : ~[\r\n];

CustomParser。g4：

parser grammar CustomParser;

options { tokenVocab=CustomLexer; }

compilationUnit
 : statement* EOF
 ;

statement
 : comment? pragmaDirective
 | comment? defineDirective
 | comment? undefDirective
 | comment? ifDirective
 | comment? ifdefDirective
 | comment? ifndefDirective
 | sectionLineComment
 | comment
 ;

pragmaDirective
 :   PRAGMA char_sequence
 ;

subDirectives
 : ifDirective+
 | ifdefDirective+
 | ifndefDirective+
 | defineDirective+
 | undefDirective+
 | comment+
 ;

ifdefDirective
 : IFDEF IDENTIFIER subDirectives+ ENDIF
 ;

ifndefDirective
 : IFNDEF IDENTIFIER subDirectives+ ENDIF
 ;

ifDirective
 : ifStatement elseIfStatement* elseStatement? ENDIF
 ;

ifStatement
 : IF expression (subDirectives)*
 ;

elseIfStatement
 : ELIF expression (subDirectives)*
 ;

elseStatement
 : ELSE (subDirectives)*
 ;

defineDirective
 : BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER BOOLEAN_LITERAL info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER (char_sequence COMA?)+ info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER OPAREN? NUMBER LITERAL_SUFFIX? CPAREN? info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER HEXADECIMAL info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER STRING info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER OBRACE? (ARRAY_SEQUENCE COMA?)+ CBRACE? info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER expression info_comment?
 | BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER info_comment?
 ;

undefDirective
 : BLOCK_COMMENT? COMMENT_START? UNDEF IDENTIFIER info_comment?;

sectionLineComment
 : COMMENT_START COMMENT_MODE_PART? SECTION char_sequence
 ;

comment
 : BLOCK_COMMENT
 | line_comment+
 ;

expression
 : simpleExpression
 | customExpression
 | enabledExpression
 | disabledExpression
 | bothExpression
 | eitherExpression
 | anyExpression
 | definedExpression
 | comparisonExpression
 | arithmeticExpression
 ;

arithmeticExpression
 : arithmeticExpression  (MULTIPLY | DIVIDE) arithmeticExpression
 | arithmeticExpression (ADD | SUBTRACT) arithmeticExpression
 | OPAREN arithmeticExpression CPAREN
 | expressionIdentifier
 ;

comparisonExpression
 : comparisonExpression (EQUALS | NEQUALS | GTEQUALS | LTEQUALS | GT | LT) comparisonExpression
 | comparisonExpression (AND | OR) comparisonExpression
 | EXCL? OPAREN comparisonExpression CPAREN
 | eitherExpression
 | enabledExpression
 | bothExpression
 | anyExpression
 | definedExpression
 | disabledExpression
 | customExpression
 | simpleExpression
 | expressionIdentifier
 ;

enabledExpression : EXCL? OPAREN? ENABLED OPAREN IDENTIFIER CPAREN CPAREN?;
disabledExpression : EXCL? OPAREN? DISABLED OPAREN IDENTIFIER CPAREN CPAREN?;
bothExpression : EXCL? OPAREN? BOTH OPAREN identifiers identifiers CPAREN CPAREN?;
eitherExpression : EXCL? OPAREN? EITHER OPAREN identifiers+ CPAREN CPAREN?;
anyExpression : EXCL? OPAREN? ANY OPAREN identifiers+ CPAREN CPAREN?;
definedExpression : EXCL? OPAREN? DEFINED OPAREN IDENTIFIER CPAREN CPAREN?;
customExpression : EXCL? IDENTIFIER OPAREN IDENTIFIER CPAREN;
simpleExpression : EXCL? IDENTIFIER;
expressionIdentifier : IDENTIFIER | NUMBER;

identifiers
 : IDENTIFIER COMA?
 ;

line_comment
 : COMMENT_START COMMENT_MODE_PART*
 ;

info_comment
 : COMMENT_START COMMENT_MODE_PART*
 ;

char_sequence
 : CHAR_SEQUENCE
 | IDENTIFIER
 ;

它可以很好地处理我在头文件中的95%的指令和评论，但很少有场景仍然没有得到正确处理：

1.行注释

输入：

//1
//#define ID1 //2

这是令牌列表：

01. compilationUnit
02.  statement:2
03.    comment:2
04.      line_comment
05.        COMMENT_START: "//"
06.        COMMENT_MODE_PART: "1"
07.      line_comment
08.        COMMENT_START: "//"
09.    defineDirective:8
10.      DEFINE: "#define"
11.      IDENTIFIER: "ID1"
12.      info_comment
13.        COMMENT_START: "//"
14.        COMMENT_MODE_PART: "2"
15.<EOF>

我想实现第07行的令牌是第09行令牌的一部分，并解析为COMMENT\u START令牌

2、用文本定义指令

其他定义规则正常工作，但：

#define USER_DESC_2 "abc " DEF "ABC2 \" M100 (100) 
#define USER_GCODE_2 "M140 S" STRINGIFY(PREHEAT_1_TEMP_BED) "\nM104 S" STRINGIFY(PREHEAT_1_TEMP_HOTEND)

这些“定义”指令正在解析一个异常

如果您能帮助我解决这两个问题，或者对如何优化我的lexer/解析器提出建议，我将不胜感激。

提前感谢！

===========================================更新=========================================第一个测试用例：

输入：

//1
//#define ID1 //2

当前结果：

01. compilationUnit
02.  statement:2
03.    comment:2
04.      line_comment
05.        COMMENT_START: "//"
06.        COMMENT_MODE_PART: "1"
07.      line_comment
08.        COMMENT_START: "//"
09.    defineDirective:8
10.      DEFINE: "#define"
11.      IDENTIFIER: "ID1"
12.      info_comment
13.        COMMENT_START: "//"
14.        COMMENT_MODE_PART: "2"
15.<EOF>

预期结果：

01. compilationUnit
02.  statement:2
03.    comment:2
04.      line_comment
05.        COMMENT_START: "//"
06.        COMMENT_MODE_PART: "1"
07.    defineDirective:8
08.      COMMENT_START: "//"  
09.      DEFINE: "#define"
10.      IDENTIFIER: "ID1"
11.      info_comment
12.        COMMENT_START: "//"
13.        COMMENT_MODE_PART: "2"
14.<EOF>

第二个测试用例：

输入：

#define USER_DESC_2 "Preheat for " PREHEAT_1_LABEL

当前结果：

01.compilationUnit
02. statement:2
03.  defineDirective:5
04.   DEFINE: "#define"
05.   IDENTIFIER: "USER_DESC_2"
06.   STRING: "\"Preheat for \""
07.  IDENTIFIER: "PREHEAT_1_LABEL"
<EOF>

预期结果：

01.compilationUnit
02. statement:2
03.  defineDirective:5
04.   DEFINE: "#define"
05.   IDENTIFIER: "USER_DESC_2"
06.   STRING: "\"Preheat for \" PREHEAT_1_LABEL"
<EOF>

在预期结果中，STRING表示结果文本。在这里，我真的不知道增强字符串Lexer标记定义或引入新的解析规则来覆盖这种情况是否更好

晋骏喆

2023-03-14

混合这篇文章、你之前的问题和巴特的回答，假设定义指令是在形式上

optional_// #define IDENTIFIER replacement_value optional_line_comment

并且给定输入文件input.txt

/**
 * BLOCK COMMENT
 */
#pragma once
//#pragma once

/**
 * BLOCK COMMENT
 */
#define CONFIGURATION_H_VERSION 12345

#define IDENTIFIER abcd
#define IDENTIFIER_1 abcd
#define IDENTIFIER_1 abcd.dd

#define IDENTIFIER_2 true // Line
#define IDENTIFIER_20 {ONE, TWO} // Line
#define IDENTIFIER_20_30   { 1, 2, 3, 4 }
#define IDENTIFIER_20_30_A   [ 1, 2, 3, 4 ]
#define DEFAULT_A 10.0

//================================================================
//============================= INFO =============================
//================================================================

/**
 * SEPARATE BLOCK COMMENT
 */

// Line 1
// Line 2
//

//======================= this is a section ======================
// @section test

// Line 3
#define IDENTIFIER_TWO "(ONE, TWO, THREE)" // Line 4
//#define IDENTIFIER_3 Version.h // Line 5

// Line 6
#define IDENTIFIER_THREE

//1
//#define ID1 //2

#define USER_DESC_2 "Preheat for " PREHEAT_1_LABEL

#define USER_DESC_2 "abc " DEF "ABC2 \" M100 (100) 
#define USER_GCODE_2 "M140 S" STRINGIFY(PREHEAT_1_TEMP_BED) "\nM104 S" STRINGIFY(PREHEAT_1_TEMP_HOTEND)

如果我已经很好地理解了你的两个问题，那么语法必须为每个指令或没有后跟指令的注释生成一个语句。指令前面可以有注释，注释成为语句的一部分。一个指令可以被注释掉，后面跟着一个内联行注释（即，在同一行上）。

语法Header. g4（无痕迹）：

grammar Header;

compilationUnit
    @init {System.out.println("Last update 1253");}
    :   ( statement {System.out.println("Statement found : `" + $statement.text + "`");}
        )* EOF
    ;

statement
    :   comment? pragma_directive
    |   comment? define_directive
    |   section
    |   comment
    ;

pragma_directive
     :   PRAGMA char_sequence
     ;

define_directive
    :   define_identifier replacement_comment[$define_identifier.statement_line]
    ;
    
define_identifier returns [int statement_line]
    :   LINE_COMMENT_DELIMITER? DEFINE {$statement_line = getCurrentToken().getLine();} IDENTIFIER
    ;

replacement_comment [int statement_line]
    :   anything+ line_comment?
    |   {getCurrentToken().getLine() == $statement_line}? line_comment
    |   {getCurrentToken().getLine() != $statement_line}?
    ;

section
    :   LINE_COMMENT_DELIMITER OTHER? SECTION char_sequence
    ;

comment
    :   BLOCK_COMMENT
    |   line_comment
    |   SEPARATOR ( IDENTIFIER | EQUALS )*
    ;

line_comment
    :   LINE_COMMENT_DELIMITER anything*
    ;

anything
    :   IDENTIFIER
    |   CHAR_SEQUENCE 
    |   STRING
    |   NUMBER
    |   OTHER
    ;

char_sequence
    :   CHAR_SEQUENCE
    |   IDENTIFIER
    ;
 
LINE_COMMENT_DELIMITER : '//' ;
PRAGMA        : '#pragma';
SECTION       : '@section';
DEFINE        : '#define';
STRING        : '"' .*? '"';
EQUALS        : '='+ ;
SEPARATOR     : LINE_COMMENT_DELIMITER EQUALS ;
IDENTIFIER    : [a-zA-Z_] [a-zA-Z_0-9]*;
CHAR_SEQUENCE : [a-zA-Z_] [a-zA-Z_0-9.]*;
NUMBER        : [0-9.]+ ;
BLOCK_COMMENT : '/**' .*? '*/';
WS            : [ \t]+ -> channel(HIDDEN) ;
NL            : (   '\r' '\n'?
                  | '\n'
                ) -> channel(HIDDEN) ;
OTHER         : . ;

执行：

$ export CLASSPATH=".:/usr/local/lib/antlr-4.9-complete.jar"
$ alias a4='java -jar /usr/local/lib/antlr-4.9-complete.jar'
$ alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Header.g4 
$ javac Header*.java
$ grun Header compilationUnit -tokens input.txt
[@0,0:23='/**\n * BLOCK COMMENT\n */',<BLOCK_COMMENT>,1:0]
[@1,24:24='\n',<NL>,channel=1,3:3]
[@2,25:31='#pragma',<'#pragma'>,4:0]
[@3,32:32=' ',<WS>,channel=1,4:7]
[@4,33:36='once',<IDENTIFIER>,4:8]
[@5,37:37='\n',<NL>,channel=1,4:12]
...
[@84,315:321='#define',<'#define'>,19:0]
[@85,322:322=' ',<WS>,channel=1,19:7]
[@86,323:340='IDENTIFIER_20_30_A',<IDENTIFIER>,19:8]
[@87,341:343='   ',<WS>,channel=1,19:26]
[@88,344:344='[',<OTHER>,19:29]
[@89,345:345=' ',<WS>,channel=1,19:30]
[@90,346:346='1',<NUMBER>,19:31]
[@91,347:347=',',<OTHER>,19:32]
...
[@139,644:668='//=======================',<SEPARATOR>,34:0]
[@140,669:669=' ',<WS>,channel=1,34:25]
[@141,670:673='this',<IDENTIFIER>,34:26]
...
[@257,1103:1102='<EOF>',<EOF>,51:0]
Last update 1253
Statement found : `/**
 * BLOCK COMMENT
 */
#pragma once`
Statement found : `//#pragma once`
...
Statement found : `#define DEFAULT_A 10.0`
...
Statement found : `// Line 2`
Statement found : `//`
...
Statement found : `//#define IDENTIFIER_3 Version.h // Line 5`
Statement found : `// Line 6
#define IDENTIFIER_THREE`
Statement found : `//1
//#define ID1 //2`
Statement found : `#define USER_DESC_2 "Preheat for " PREHEAT_1_LABEL`
Statement found : `#define USER_DESC_2 "abc " DEF "ABC2 \" M100 (100)`
Statement found : `#define USER_GCODE_2 "M140 S" STRINGIFY(PREHEAT_1_TEMP_BED) "\nM104 S" STRINGIFY(PREHEAT_1_TEMP_HOTEND)`

语法Header\u跟踪。g4（带跟踪）：

grammar Header_trace;

compilationUnit
    @init {System.out.println("Last update 1137");}
    :   statement[this.getRuleNames() /* parser rule names */]* EOF
    ;

statement [String[] rule_names]
    locals [String rule_name, int start_line, int end_line]
    @after { System.out.print("The next statement is a " + $rule_name);
             $start_line = $start.getLine();
             $end_line   = $stop.getLine();
             if ($start_line == $end_line)
                 System.out.print(" on line " + $start_line);
             else
                 System.out.print(" on lines " + $start_line + " to " + $end_line);
             System.out.println(" : ");
             System.out.println("`" + $text + "`");
           }
    :   comment? pragma_directive [rule_names] {$rule_name = $pragma_directive.rule_name;}
    |   comment? define_directive [rule_names] {$rule_name = $define_directive.rule_name;}
    |   section [rule_names]                   {$rule_name = $section.rule_name;}
    |   comment_only [rule_names]              {$rule_name = $comment_only.rule_name;}
     // comment_only can be replaced by comment when the trace is removed
    ;

pragma_directive [String[] rule_names] returns [String rule_name]
     :   PRAGMA char_sequence
            { $rule_name = rule_names[$ctx.getRuleIndex()]; }
     ;

define_directive [String[] rule_names] returns [String rule_name]
    locals [String dir_rule_name, int statement_line = 0]
    @init {$dir_rule_name = rule_names[_localctx.getRuleIndex()];}
    :   define_identifier replacement_comment[$dir_rule_name, $define_identifier.statement_line]
            { $rule_name = $replacement_comment.rule_name; }
    ;
    
define_identifier returns [int statement_line]
    :   LINE_COMMENT_DELIMITER? DEFINE {$statement_line = getCurrentToken().getLine();} IDENTIFIER
    ;

replacement_comment [String dir_rule_name, int statement_line] returns [String rule_name]
    :   any+=anything+ line_comment?
            { $rule_name = $dir_rule_name + " with replacement value";
              System.out.print("          anything matched : " );
              if ($any.size() > 0)
                  for (AnythingContext r : $any)
                      System.out.print(r.getText());
              else
                  System.out.print("(nothing)");

              System.out.println();
            }
    |   {getCurrentToken().getLine() == $statement_line}?
        line_comment
            { $rule_name = $dir_rule_name + " WITHOUT replacement value and with inline line comment"; }
    |   {getCurrentToken().getLine() != $statement_line}?
            { $rule_name = $dir_rule_name + " WITHOUT replacement value"; }
    ;

section [String[] rule_names] returns [String rule_name]
    :   LINE_COMMENT_DELIMITER OTHER? SECTION char_sequence
            { $rule_name = rule_names[$ctx.getRuleIndex()]; }
    ;

comment_only [String[] rule_names] returns [String rule_name]
    :   comment
            { $rule_name = rule_names[$ctx.getRuleIndex()]; }
    ;

comment
    :   BLOCK_COMMENT
    |   line_comment
    |   SEPARATOR ( IDENTIFIER | EQUALS )*
    ;

line_comment
    :   LINE_COMMENT_DELIMITER anything*
    ;

anything
    :   IDENTIFIER
    |   CHAR_SEQUENCE 
    |   STRING
    |   NUMBER
    |   OTHER
    ;

char_sequence
    :   CHAR_SEQUENCE
    |   IDENTIFIER
    ;
 
LINE_COMMENT_DELIMITER : '//' ;
PRAGMA        : '#pragma';
SECTION       : '@section';
DEFINE        : '#define';
STRING        : '"' .*? '"';
EQUALS        : '='+ ;
SEPARATOR     : LINE_COMMENT_DELIMITER EQUALS ;
IDENTIFIER    : [a-zA-Z_] [a-zA-Z_0-9]*;
CHAR_SEQUENCE : [a-zA-Z_] [a-zA-Z_0-9.]*;
NUMBER        : [0-9.]+ ;
BLOCK_COMMENT : '/**' .*? '*/';
WS            : [ \t]+ -> channel(HIDDEN) ;
NL            : (   '\r' '\n'?
                  | '\n'
                ) -> channel(HIDDEN) ;
OTHER         : .;

执行：

$ a4 Header_trace.g4 
$ javac Header*.java
$ grun Header_trace compilationUnit -tokens input.txt
[@0,0:23='/**\n * BLOCK COMMENT\n */',<BLOCK_COMMENT>,1:0]
[@1,24:24='\n',<NL>,channel=1,3:3]
[@2,25:31='#pragma',<'#pragma'>,4:0]
[@3,32:32=' ',<WS>,channel=1,4:7]
[@4,33:36='once',<IDENTIFIER>,4:8]
[@5,37:37='\n',<NL>,channel=1,4:12]
...
[@257,1103:1102='<EOF>',<EOF>,51:0]
Last update 1137
The next statement is a pragma_directive on lines 1 to 4 : 
`/**
 * BLOCK COMMENT
 */
#pragma once`
...
          anything matched : 10.0
The next statement is a define_directive with replacement value on line 20 : 
`#define DEFAULT_A 10.0`
The next statement is a comment_only on line 22 : 
`//================================================================`
...
The next statement is a comment_only on line 31 : 
`// Line 2`
The next statement is a comment_only on line 32 : 
`//`
...
          anything matched : Version.h
The next statement is a define_directive with replacement value on line 39 : 
`//#define IDENTIFIER_3 Version.h // Line 5`
The next statement is a define_directive WITHOUT replacement value on lines 41 to 42 : 
`// Line 6
#define IDENTIFIER_THREE`
The next statement is a define_directive WITHOUT replacement value and with inline line comment on lines 44 to 45 : 
`//1
//#define ID1 //2`
          anything matched : "Preheat for "PREHEAT_1_LABEL
The next statement is a define_directive with replacement value on line 47 : 
`#define USER_DESC_2 "Preheat for " PREHEAT_1_LABEL`
...

碰巧多亏了LINE_COMMENT_DELIMITER？，就像您在定义指令规则的开头对COMMENT_START？所做的那样，并且因为//之后没有特殊标记，所以在遇到行注释分隔符时不再需要切换到模式COMMENT_MODE。

第一种方法有一个困难：

define_directive
    :   LINE_COMMENT_DELIMITER? DEFINE IDENTIFIER anything+ line_comment?
    |   LINE_COMMENT_DELIMITER? DEFINE {$statement_line = getCurrentToken().getLine();}
        IDENTIFIER same_line_line_comment[$statement_line]
    |   LINE_COMMENT_DELIMITER? DEFINE IDENTIFIER

same_line_line_comment [int statement_line]
    :   {getCurrentToken().getLine() == $statement_line}?
        line_comment

以下几行

// Line 6
#define IDENTIFIER_THREE

//1

使用第二个选项而不是第三个选项进行分析：

compare statement line 42 with comment line 44
line 44:0 rule same_line_line_comment failed predicate: {getCurrentToken().getLine() == $statement_line}?
The next statement is a define_directive WITHOUT replacement value and with inline line comment on lines 41 to 42 : 
`// Line 6
#define IDENTIFIER_THREE`

尽管子规则“same\u line\u line\u comment”被一个假值保护，但语义谓词没有任何效果。FailedPredicateException不受欢迎，跟踪消息错误。这可能与查找可见谓词有关。

解决方案是将#define指令的处理分为一个固定部分define\u identifier规则和一个可变部分replacement\u comment规则和语义谓词（为了在解析决策中有效，必须将其放在备选方案的开头）。

ANTLR4行注释和文本解析问题

共有1个答案

相关问答

相关文章

相关阅读

相关工具

相关文档