[Solved] ANTLR4 - How to match something until two characters match?

Question

Asked by Stefan S on December 13, 2021 (source).

Flutter:

Framework • revision 18116933e7 (8 weeks ago) • 2021-10-15 10:46:35 -0700
Engine • revision d3ea636dc5
Tools • Dart 2.14.4

ANTLR4:
antlr4: ^4.9.3

I would like to implement a simple tool that formats text like in the following definition: https://www.motoslave.net/sugarcube/2/docs/#markup-style

So basically each __ is the start of an underlined text and the next __ is the end.

I ran into problems with the following input:

^^subscript=^^

Shell: line 1:13 token recognition error at '^'
Shell: line 1:14 extraneous input '<EOF>' expecting {'==', '//', '''', '__', '~~', '^^', TEXT}

MyLexer.g4:


STRIKETHROUGH : '==';
EMPHASIS : '//';
STRONG : '\'\'';
UNDERLINE : '__';
SUPERSCRIPT : '~~';
SUBSCRIPT : '^^';

TEXT
 : ( ~[<[$=/'_^~] | '<' ~'<' | '=' ~'=' | '/' ~'/' | '\'' ~'\'' | '_' ~'_' | '~' ~'~' | '^' ~'^' )+
;

MyParser.g4:


options {
    tokenVocab=SugarCubeLexer;
    //language=Dart;
}

parse
 : block EOF
 ;

block
 : statement*
 ;

statement
 : strikethroughStyle
 | emphasisStyle
 | strongStyle
 | underlineStyle
 | superscriptStyle
 | subscriptStyle
 | unstyledStatement
 ;

unstyledStatement
 : plaintext
 ;

strikethroughStyle
 : STRIKETHROUGH (emphasisStyle | strongStyle | underlineStyle | superscriptStyle | subscriptStyle | unstyledStatement)* STRIKETHROUGH
 ;

emphasisStyle
 : EMPHASIS (strikethroughStyle | strongStyle | underlineStyle | superscriptStyle | subscriptStyle | unstyledStatement)* EMPHASIS
 ;

strongStyle
 : STRONG (strikethroughStyle | emphasisStyle | underlineStyle | superscriptStyle | subscriptStyle | unstyledStatement)* STRONG
 ;

underlineStyle
 : UNDERLINE (strikethroughStyle | emphasisStyle | strongStyle | superscriptStyle | subscriptStyle | unstyledStatement)* UNDERLINE
 ;

superscriptStyle
 : SUPERSCRIPT (strikethroughStyle | emphasisStyle | strongStyle | underlineStyle | subscriptStyle | unstyledStatement)* SUPERSCRIPT
 ;

subscriptStyle
 : SUBSCRIPT (strikethroughStyle | emphasisStyle | strongStyle | underlineStyle | superscriptStyle | unstyledStatement)* SUBSCRIPT
 ;

plaintext
 : TEXT
 ;

I would be super happy for any help. Thanks

Answer

Question answered by Mike C (source).

It's your TEXT rule:

TEXT
    : (
        ~[<[$=/'_^~]
        | '<' ~'<'
        | '=' ~'='
        | '/' ~'/'
        | '\'' ~'\''
        | '_' ~'_'
        | '~' ~'~'
        | '^' ~'^'
    )+
    ;

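You can't write a lexer rule in ANTLR the way you're trying to (i.e. "a '^' unless it's followed by another '^'"). ~'^' means "any character that's not ^", and the lexer consumes that character as part of the token. So '=' ~'=' matches the two characters =^, taking the first ^ of your closing ^^ with it, and the remaining ^ can no longer be matched as '^^'.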

If you run your input through grun with the -tokens option, you'll see that the TEXT token swallows everything through the end of the line:

[@0,0:1='^^',<'^^'>,1:0]
[@1,2:14='subscript=^^\n',<TEXT>,1:2]
[@2,15:14='<EOF>',<EOF>,2:0]
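The token dump above can be reproduced without ANTLR. The following is only a rough Python emulation, not ANTLR itself: the regex is a hand transcription of the TEXT rule, and tokenize is a hypothetical helper that approximates the lexer's "longest match wins, ties go to the earlier rule" behaviour and its error recovery.

```python
import re

# Delimiter rules from the question's lexer, in definition order.
DELIMITERS = [
    ("STRIKETHROUGH", "=="),
    ("EMPHASIS", "//"),
    ("STRONG", "''"),
    ("UNDERLINE", "__"),
    ("SUPERSCRIPT", "~~"),
    ("SUBSCRIPT", "^^"),
]

# Hand transcription of the original TEXT rule. Each two-character
# alternative such as '=' ~'=' consumes BOTH characters.
TEXT = re.compile(
    r"(?:[^<\[$=/'_^~]|<[^<]|=[^=]|/[^/]|'[^']|_[^_]|~[^~]|\^[^^])+"
)


def tokenize(text):
    """Return (tokens, errors); an approximation of the lexer, not ANTLR."""
    tokens, errors, pos = [], [], 0
    while pos < len(text):
        best = None  # (token type, matched text)
        for name, literal in DELIMITERS:
            if text.startswith(literal, pos):
                best = (name, literal)
                break
        m = TEXT.match(text, pos)
        # TEXT is defined last, so it only wins when strictly longer.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = ("TEXT", m.group())
        if best is None:
            # Roughly what ANTLR reports as "token recognition error".
            errors.append((pos, text[pos]))
            pos += 1
            continue
        tokens.append(best)
        pos += len(best[1])
    return tokens, errors


if __name__ == "__main__":
    print(tokenize("^^subscript=^^\n"))  # input with a trailing newline
    print(tokenize("^^subscript=^^"))    # input without one
```

With the trailing newline, TEXT takes =^ and then ^ plus the newline, which is the single TEXT token in the dump. Without it, the last ^ has nothing to pair with, which corresponds to the error at column 13 in the question.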

Try something like this:

grammar MyParser;

parse: block EOF;

block: statement*;

statement
    : STRIKETHROUGH statement STRIKETHROUGH # Strikethrough
    | EMPHASIS statement EMPHASIS           # Emphasis
    | STRONG statement STRONG               # Strong
    | UNDERLINE statement UNDERLINE         # Underline
    | SUPERSCRIPT statement SUPERSCRIPT     # SuperScript
    | SUBSCRIPT statement SUBSCRIPT         # Subscript
    | plaintext                             # unstyledStatement
    ;

plaintext: TEXT+;

STRIKETHROUGH: '==';
EMPHASIS:      '//';
STRONG:        '\'\'';
UNDERLINE:     '__';
SUPERSCRIPT:   '~~';
SUBSCRIPT:     '^^';

TEXT: .;
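To see why this version works, here is a small Python sketch that mimics the grammar above. It is an illustration under assumptions, not generated ANTLR code: lex, parse_statement and parse are hypothetical names, and the hand-written parser is a simple greedy one rather than ANTLR's prediction algorithm. The point is that every two-character delimiter becomes its own token and anything else becomes a one-character TEXT token, so a lone = or ^ can never swallow part of a delimiter.

```python
# Two-character delimiters from the suggested grammar.
DELIMITER_TOKENS = {
    "==": "STRIKETHROUGH",
    "//": "EMPHASIS",
    "''": "STRONG",
    "__": "UNDERLINE",
    "~~": "SUPERSCRIPT",
    "^^": "SUBSCRIPT",
}


def lex(text):
    """One token per delimiter pair, otherwise one TEXT token per character."""
    tokens, pos = [], 0
    while pos < len(text):
        pair = text[pos:pos + 2]
        if pair in DELIMITER_TOKENS:
            tokens.append((DELIMITER_TOKENS[pair], pair))
            pos += 2
        else:
            tokens.append(("TEXT", text[pos]))  # mirrors TEXT: .;
            pos += 1
    return tokens


def parse_statement(tokens, pos):
    """statement: DELIM statement DELIM | TEXT+ ; returns (tree, next pos)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    kind = tokens[pos][0]
    if kind == "TEXT":
        end = pos
        while end < len(tokens) and tokens[end][0] == "TEXT":
            end += 1
        return "".join(t for _, t in tokens[pos:end]), end
    inner, after = parse_statement(tokens, pos + 1)
    if after >= len(tokens) or tokens[after][0] != kind:
        raise SyntaxError(f"expected closing {kind}")
    return (kind, inner), after + 1


def parse(text):
    """parse: statement* EOF"""
    tokens, pos, block = lex(text), 0, []
    while pos < len(tokens):
        tree, pos = parse_statement(tokens, pos)
        block.append(tree)
    return block


if __name__ == "__main__":
    print(parse("^^subscript=^^"))
    print(parse("==//x//=="))
```

Note that each styled alternative wraps exactly one statement, so mixed content such as ^^a//b//c^^ is rejected by this sketch; if you need that, changing the inner statement to statement* in the grammar is the likely fix.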