First steps with ANTLR4 in C #

texte de grammaire csharp

1. What is ANTLR?

ANTLR4 is a lexer / parser generator. A parser takes a text input, and according to a grammar, extracts the recognized lexicons terms (action of the lexer), and produce an AST: Abstract Syntax Tree (action of the parser).

We can have parsers without grammars (the PEG: Parsing Expression Grammar, which are based on syntax extraction rules) and parsers with grammar that differ in their complexity, as indicated by the hierarchy of Chomsky:

  • enumerable recursively (type 0). All formal grammars: ie formed by terminal, non-terminal symbols and production rules that define non-terminal symbols). included all other types of grammars.
  • recursive languages (type 0-1), , a mixture of type 0 and 1 (excluding Chomsky’s ranking)
  • context-sensitive (type 1): the production rules are written in the form αΑβ → αγβ with Α un non-terminal and α, β et γ some terminals and/or some non-terminals, with non-empty γ
  • context free (type 2): production rules in the form Α → γ with Α non-terminal and γ terminal and/or non-terminal. This is the case for the main programming languages, although name resolution is context-sensitive because of scopes. A subset of these grammars is used to make more efficient parsers easier: the LL parsers (Left to right Leftmost derivation)
  • regular languages (type 3) : the production rules are restricted by a terminal symbol on the left and on the right, possibly followed by a non-terminal. These languages ​​are obtained by regular expressions

2. Why ANTLR?

  • supports a context free grammar (type 2)
  • supports grammars of type LL (*), ie a recursive grammar on the left which is not limited to a number k of tokens to identify a non-terminal
  • is based on a standard and practical Extended Backus-Naur Form EBNF (Extended Backus-Naur Form) grammar variant
  • offers a large number of grammars that can be used as a basis for writing new grammars
  • proposes the visitor and the listener templates to browse an AST
  • proposes an enriched grammar with for example some actions that are methods written in the target language, semantic predicates, channels (block of another language in the language), etc … that allow to increase the power of description of grammars LL
  • has an extension for Visual Studio
  • produces a C # lexer / parser
  • is an open source project that is alive and well used

3. For what purpose ?

  • read / process / execute / translate text or structured binaries
  • build a programming language
  • build a Data Design Language (DDL)
  • build a domain-specific language: DSL (Domain Specific Language)
  • build a high-level language like Ubiquitous Language in the Domain Driven Design (DDD) approach
  • build tools and frameworks that revolve around languages

4. Implementation in Visual Studio with C#

The following examples can be used on Visual Studio 2010 and +. They are presented here under Visual Studio 2015.

4.1 Installing the ANTLR language support

From the extension manager, install ANTLR Language support :

install 2015 antlr language support
This adds support for the grammars g3 and g4 (syntax highlighting, intellisense), and the properties panel of the grammar files. It is necessary to reference the ANTLR source codes that are generated in obj/ in order to make Intellisense work on the generated code.

4.2. New project C# ANTLR

For example, we add a new console application project (classic desktop) from the context menu of the solution explorer (a library would be more appropriate in a larger context). We target the .NET 4.6.2 framework

4.3. Adding the ANTLR runtime to the project

From the NUGET package manager console, add the ANTLR runtime to the new project:

install antlrv4 runtime

this runtime corresponds to the target .NET framework, another is available ( Standard ) for target CORE 2.

4.4. Addition of the ANTLR C # generator to the project

From the NUGET package manager console, add the lexer / parser generator C # ANTLR:

install antlrv4 code generator

Without that, no generated code!

4.5. Adding a grammar

G4 grammars are available at: https://github.com/antlr/grammars-v4

  • in the project, add a grammar file from the context menu of the solution explorer: add > new element > ANTLR4 Combined Grammar
  • for the tests, we will use a simple grammar of arithmetic expression. In the empty grammar file that has just been added to the project, copy the contents of the example file: calculator.g4 available at: https://github.com/antlr/grammars-v4/tree/master/calculator
  • in the file properties, set Generate Listener to True and Generate Visitor to True to generate all aspects of the lexer / parser
  • the ANTLR generator activates each time the grammar file is saved
  • the generated program elements are visible in obj/:

antlr visitor and listener

4.6. Grammar test

The Main method of the project receives the typical code that allows to parse a text according to grammar calculator.g4:

while (true)
{
    try
    {
        Write("enter calculator text: ");
        var str = ReadLine();
        var inputStream = new AntlrInputStream(str);
        var lexer = new calculatorLexer(inputStream);
        var commonTokenStream = new CommonTokenStream(lexer);
        var parser = new calculatorParser(commonTokenStream);
        var ectx = parser.equation();
        WriteLine("parse tree (LISP style): " + ectx.ToStringTree(parser));
    } catch (Exception Ex)
    {
        Console.Error.WriteLine(Ex.Message);
    }
}

examples in the console:

enter calculator text: hello world
line 1:6 missing {'>', '<', '='} at 'world'
parse tree (LISP style): (equation (expression (multiplyingExpression (powExpression (signedAtom (atom (variable hello)))))) relop (expression (multiplyingExpression (powExpression (signedAtom (atom (variable world)))))))

enter calculator text: y=x*x+x-4
parse tree (LISP style): (equation (expression (multiplyingExpression (powExpression (signedAtom (atom (variable y)))))) (relop =) (expression (multiplyingExpression (powExpression (signedAtom (atom (variable x)))) * (powExpression (signedAtom (atom (variable x))))) + (multiplyingExpression (powExpression (signedAtom (atom (variable x))))) - (multiplyingExpression (powExpression (signedAtom (atom (scientific 4)))))))

enter calculator text: d=(sin(a)+cos(a))*r
parse tree (LISP style): (equation (expression (multiplyingExpression (powExpression (signedAtom (atom (variable d)))))) (relop =) (expression (multiplyingExpression (powExpression (signedAtom (atom ( (expression (multiplyingExpression (powExpression (signedAtom (func (funcname sin) ( (expression (multiplyingExpression (powExpression (signedAtom (atom (variable a)))))) ))))) + (multiplyingExpression (powExpression (signedAtom (func (funcname cos) ( (expression (multiplyingExpression (powExpression (signedAtom (atom (variable a)))))) )))))) )))) * (powExpression (signedAtom (atom (variable r)))))))

enter calculator text:

4.7. Using the parse tree

Now that the grammar, lexer and parser are in place in the C# project, all that remains is to exploit the result: the Parse Tree. The parse tree is a tree of the terms of the rules of production, it is not a logical AST because it also contains information that is used to parser. This is a CST (Concrete Syntax Tree).
Nevertheless it allows to do all the work.

4.7.1. From the parse tree to the logical AST

We add a method to represent the tree-shaped AST in the console:

void Explore(ParserRuleContext ctx, int indentLevel = 0)
{
    var ruleName = calculatorParser.ruleNames[ctx.RuleIndex];
    var sep = "".PadLeft(indentLevel);
    WriteLine(sep + ruleName);
    sep = "".PadLeft(indentLevel+4);
    foreach ( var c in ctx.children )
    {
        if (c is ParserRuleContext)
            Explore((ParserRuleContext)c, indentLevel + 4);
        else
            WriteLine(sep + c.ToString());
    }
}

We obtain the tree view in the console, for the example expression: y=x*x+2

enter calculator text: y=x*x+2
parse tree (AST LISP style): (equation (expression (multiplyingExpression (powExpression (signedAtom (atom (variable y)))))) (relop =) (expression (multiplyingExpression (powExpression (signedAtom (atom (variable x)))) * (powExpression (signedAtom (atom (variable x))))) + (multiplyingExpression (powExpression (signedAtom (atom (scientific 2)))))))

equation
    expression
        multiplyingExpression
            powExpression
                signedAtom
                    atom
                        variable
                            y
    relop
        =
    expression
        multiplyingExpression
            powExpression
                signedAtom
                    atom
                        variable
                            x
            *
            powExpression
                signedAtom
                    atom
                        variable
                            x
        +
        multiplyingExpression
            powExpression
                signedAtom
                    atom
                        scientific
                            2

We immediately see that the AST is very polluted. A simple way to produce a logical AST is to remove nodes that have only one child, because they correspond to non-terminal rules that are only used for parser but do not describe the logic of the syntax of the expression analyzed.

The ExploreAST method does this work:

void ExploreAST(ParserRuleContext ctx, int indentLevel = 0)
{
    var ruleName = calculatorParser.ruleNames[ctx.RuleIndex];
    var sep = "".PadLeft(indentLevel);
    bool keepRule = ctx.ChildCount > 1;
    if (keepRule)
        WriteLine(sep + ruleName);
    foreach (var c in ctx.children)
    {
        if (c is ParserRuleContext)
            ExploreAST((ParserRuleContext)c, indentLevel + ((keepRule) ? 4 : 0));
        else
        {
            var sep2 =
                "".PadLeft(indentLevel+((keepRule)?4:0));
            WriteLine(sep2 + c.ToString());
        }
    }
}

We obtain as follows:

equation
    y
    =
    expression
        multiplyingExpression
            x
            *
            x
        +
        2

this AST can then be read as a ternary tree, each node having three children: in order the left operand, the operator and the right operand. This is specific to this particular grammar. The transformation does not necessarily work in all cases. We will eventually use an explicit transformation of the parse tree to another AST model

3 thoughts on “First steps with ANTLR4 in C #

  1. Hello.
    Thank you for this really thorough article.
    I understand now how to create a parser for a grammar.
    What i’d like to do now is to be able to create / edit files in Visual Studio that use my custom DSL. I would need to implement syntax highlighting, navigation, validation (errors, warnings…).
    Do you know which tools / strategy I should use ?

    The Modeling SDK for Visual Studio seems to only provide tools to make graphical (designer) DSLs, and no “text based” DSLs.

    Thanks in advance for your help

  2. Hello Myster DD,
    Thanks you very much for your feed back. I am glad this post was helpfull for you.

    Unfortunately the Antrl lexer/parser generator purpose is not to build directly such functionality in Visual Studio, but rather to develop your own text analyser / editor / compiler.
    Effectively, the Modeling SDK for Visual Studio is suitable to create the business models (business class diagrams) the DSL can manipulate and to generate the code of a corresponding business framework. This is called a vertical language. If you wish to create a technical DSL (an horizontal language) it is not appropriate.

    Nevertheless there is a solution to achieve your goal: according that you have already made a lexer/parser for your DSL, you may code a VSIX module for visual studio that implements a text editor for your specific language.

    You will find good examples at:

    – Matt Duffield’s blog : shows how to code an ‘Editor Classifier’ that implements syntax coloring associated with a file extension.
    https://mattduffield.wordpress.com/2012/07/31/writing-a-brightscript-syntax-highlight-extension-for-visual-studio-2010/

    – Micheal’s coding spot : a tutorial about Visual Studio extensions dealing about enhancing various functionalities of Visual Studio
    https://michaelscodingspot.com/visual-studio-2017-extension-development-tutorial-part-1/

    You will have to integrate your Antlr lexer/parser in the extension to provide the text analysis mecanism.
    If you walktrought the parse tree (CST), you can found interesting data from the abstract syntax nodes indicating the indexes of the tokens in the analyzed text:
    – RuleExpressionContext for a non terminal rule
    owns properties Start.StartIndex,Start.StopIndex,Stop.StartIndex,Stop.StopIndex
    – ITerminalNode for a terminal rule
    owns properties Symbol.StartIndex,Symbol.StopIndex

    Hoping this will be usefull for you

    1. Thank you very much for your detailed answer ! I’ll try to follow the samples and get something done. I’ve done a similar thing for Eclipse in Java, using the XText framework, which provides everything needed to create a plugin for the IDE. Too bad there’s no equivalent for Visual Studio.

Leave a Reply