Creating a Parser for Boolean Expressions

06/11/2014

Gérald Barré

.NET

The idea for this article came from a question on Stackoverflow. The question is about the possibility of writing the following authorization rule in ASP.NET Core:

[Authorize(Roles = "Expert && (ReadWrite || ReadOnly)")]

This syntax is not supported by the Authorize attribute. It is still possible to achieve the same result by using the attribute twice:

[Authorize(Roles = "Expert")]
[Authorize(Roles = "ReadWrite, ReadOnly")]

But that's not much fun. In this post, I will show you how to support the first syntax (and more) by writing a small parser for this language.

The goal is to parse the content of the string and extract the relevant information: operators and role names, while respecting operator precedence and parentheses. The result is a syntax tree representing the expression:

Before we get to that point, let's start with the basics. A language, without going into the theory, is:

An alphabet: a set of valid characters
Words: a coherent set of characters from the alphabet
A grammar: a coherent set of words. For those interested, there are different types of grammars. Noam Chomsky, a prominent linguist, classified them into 4 categories, known as the Chomsky hierarchy

To parse our language, we must begin by defining these 3 elements.

The alphabet is composed of:

Letters [a-zA-Z],
Characters &, |, ( and ),
And spaces.

There are 3 types of words:

The operators: && and ||,
The parentheses ( and ),
Roles: set of letters [a-zA-Z] +.

The context-free grammar is the following:

S:: = Expression
Expression:: = SubExpression
             | SubExpression '&&' Expression
             | SubExpression '||' Expression
SubExpression:: = '(' Expression ')'
                | RoleName
RoleName:: = [a-zA-Z]

Let's take the example of the expression above to understand how it derives from grammar:

Expression
SubExpression && Expression
RoleName      && SubExpression
'Expert'      && (Expression)
'Expert'      && (SubExpression || Expression)
...
'Expert'      && ('ReadWrite'   || 'ReadOnly')

#The code

The first step is to extract the tokens (the individual words from the input string). We iterate through the string character by character and collect the tokens into a list:

List<string> tokens = new List<string>();
StringBuilder sb = new StringBuilder();

for (int i = 0; i < text.Length; i++)
{
    char c = text[i];
    switch (c)
    {
        case ' ':
            if (sb.Length == 0)
                continue;

            sb.Append(c);
            break;

        case '(':
        case ')':
            if (sb.Length > 0)
            {
                tokens.Add(sb.ToString());
                sb.Clear();
            }

            tokens.Add(c.ToString(CultureInfo.InvariantCulture));
            break;

        case '&':
        case '|':
            if (sb.Length != 0)
            {
                char prev = sb[sb.Length - 1];
                if (c == prev) // && or ||
                {
                    sb.Remove(sb.Length - 1, 1); // remove last char
                    tokens.Add(sb.ToString().Trim());
                    sb.Clear();
                    tokens.Add(c == '&' ? "&&" : "||");
                    break;
                }
            }

            sb.Append(c);
            break;

        default:
            sb.Append(c);
            break;
    }
}

if (sb.Length > 0)
{
    tokens.Add(sb.ToString());
}

This produces a list of tokens, but we still need to verify that they form a valid expression according to the grammar. To do this, we transcribe the grammar directly into C#:

1 rule => 1 method
1 '|' or '&' => if / else if

static Node Parse(string[] tokens)
{
    int index = 0;
    return ParseExp(tokens, ref index); // Rule 1
}

static Node ParseExp(string[] tokens, ref int index)
{
    Node leftExp = ParseSubExp(tokens, ref index);
    if (index >= tokens.Length) // last token => Rule 2
        return leftExp;

    string token = tokens[index];
    if (token == "&&") // Rule 3
    {
        index++;
        Node rightExp = ParseExp(tokens, ref index);
        return new AndNode(leftExp, rightExp);
    }
    else if (token == "||") // Rule 4
    {
        index++;
        Node rightExp = ParseExp(tokens, ref index);
        return new OrNode(leftExp, rightExp);
    }
    else
    {
        throw new Exception("Expected '&&' or '||' or EOF");
    }
}

static Node ParseSubExp(string[] tokens, ref int index)
{
    string token = tokens[index];
    if (token == "(") // Rule 5
    {
        index++;
        Node node = ParseExp(tokens, ref index);
        if (tokens[index] != ")")
            throw new Exception("Expected ')'");

        index++; // Skip ')'
        return node;
    }
    else
    {
        index++;
        return new RoleNode(token); // Rule 6
    }
}

The code mirrors the grammar directly. We start at the grammar entry point, which calls the Expression rule, which in turn calls SubExpression, and then selects the appropriate rule based on the current token.

In the end, we get a tree representing the original expression, which we can then evaluate. The full code, including the evaluator, is available on GitHub: https://gist.github.com/meziantou/10603804.

In 130 lines of code, we can write a parser capable of understanding a Boolean expression (OR, AND, XOR, NOT, and operator precedence). For those who want to go further:

This is an LL parser. This type of parser is fairly limited. For more complex needs, there are others: LALR, LR, GLR.
This parser could be improved, for example by streaming tokens instead of extracting them all upfront. This allows processing large expressions without degrading performance.
Writing a parser is fun once or twice for small grammars. After that, you may want to look at tools such as ANTLR or Gold Parser that generate parsers for you.

The theory of formal languages is a fascinating subject. Feel free to explore it further 😉

Do you have a question or a suggestion about this post? Contact me!

Follow me:

Enjoy this blog?

💖 Sponsor on GitHub