AASHAN

Creating a Programming Language - Part 2: The Lexer

Mon Feb 27 2023 · 7 min read

Untangling the Code

As we have already scratched the surface, let's see what a lexer might look like. At its core, a lexer must be able to distinguish between tokens. For example, 22 * 234 should always produce three tokens: one each for 22, * and 234.

We can divide all the possible tokens into three categories. The first category covers the actual text that might be present in the code. This consists of AlphaNumericToken for representing variables and reserved words, NumericToken for representing any number present in the code, BadToken for representing any erroneous input, and EndOfLineToken, EndOfFileToken and WhiteSpaceToken, which are self-explanatory and are mostly ignored by the other components.

The second category consists of operator tokens. This should include all the arithmetic operators that you want your compiler to recognize. Operators can be multi-character or even alphanumeric; it generally does not matter. For the sake of simplicity, however, we will only consider single-character operators like +, -, * and / as our primary operators.

Finally, the third category consists of the tokens that represent an expression. These expressions can contain multiple other tokens depending on the type. Since our programming language is fairly simple, we can get by with BinaryExpression, ParenthesizedExpression, VariableAssignmentExpression and UnaryExpression for our purpose.

And I again take this moment to remind everyone reading this that the source code is available on GitHub if anyone wishes to try out Balance.

The Lexer

Now, let's dive into the fun part. A lexer should be able to hold all the tokens found in the input text, and it should be able to peek into a certain area of the original text to find out certain information. This can be achieved using a pointer that always points to the current position where the text is being read. The read head (the pointer) reads only a certain section of the input text; once that section is read and tokenized, the pointer advances to the next position.

The lexer should also be able to peek at possible future characters in order to resolve ambiguous scenarios. For example, let's say we have two tokens: + for adding two numbers and ++ for incrementing a value. Let's also suppose the input text is 22++. Since we are reading sequentially from start to end, the lexer should not emit two + tokens but a single ++ token, which can only be determined if the lexer has the ability to peek at further characters in the input text.
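Balance only recognizes single-character operators, but to make the ++ scenario concrete, here is a hypothetical standalone sketch of how one peek ahead resolves the ambiguity (scanPlus and the token names are illustrative, not part of Balance):

```typescript
// Sketch: decide between '+' and '++' by peeking one character ahead.
// `text` is the input and `position` points at the current '+' character.
function scanPlus(text: string, position: number): { kind: string; length: number } {
    // Look ahead without consuming; '\0' signals end of input.
    const peek = (offset: number): string => {
        const index = position + offset;
        return index >= text.length ? '\0' : text[index];
    };

    // If the very next character is also '+', consume both as one token.
    if (peek(1) === '+') {
        return { kind: 'PlusPlusToken', length: 2 };
    }
    return { kind: 'PlusToken', length: 1 };
}
```

For the input 22++ with the read head on the first +, the peek sees the second + and the lexer emits a single two-character token instead of two one-character tokens.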

Now we can see that the basic wireframe of a lexer will look something like this:


class Lexer {

    // Read head: always points at the next character to be read.
    protected position: number = 0;

    // Collects diagnostics produced while tokenizing.
    protected errors: string[] = [];

    constructor(protected text: string) {}

    // The character currently under the read head.
    public current = (): string => {
        return this.peek(0);
    }

    // Advance the read head by one character.
    public next = (): void => {
        this.position++;
    }

    // Look ahead without consuming; '\0' signals end of input.
    public peek = (offset: number): string => {
        const position = this.position + offset;
        if (position >= this.text.length) {
            return '\0';
        }

        return this.text[position];
    }

    public nextToken = () => {
        // Logic to extract token goes here
    }
}

Of course, this is just the basic structure; the original Balance lexer can be found here at GitHub.

The nextToken method

Now that we have the structure of the lexer out of the way, we need to actually tokenize the input string into individual tokens. This can be achieved by completing the nextToken method that we left empty earlier. Let's see how we can do that.

Well, thinking about it, it is fairly straightforward. We look at each character; if it is an alphanumeric character, we peek forward until we find something that is not alphanumeric, then we stop and wrap the scanned value in a new AlphaNumericToken. It's the same for NumericToken and any of the operator tokens that we discussed earlier. So the code becomes something like this:
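The snippet below relies on a SyntaxToken class and character predicates (isDigit, isWhiteSpace, isAlphaNumeric) that are not shown here. A minimal sketch of what they could look like follows; the actual Balance versions live in the GitHub repository and may differ:

```typescript
// Hypothetical sketches of the helpers that nextToken depends on.
type SyntaxKind = number; // stand-in for a SyntaxKind enum of token kinds

class SyntaxToken {
    constructor(
        public kind: SyntaxKind,   // which kind of token this is
        public position: number,   // where the token starts in the source text
        public text: string,       // the raw text of the token
        public value: unknown      // e.g. the parsed number for a NumberToken
    ) {}
}

// Character predicates used by the scanning loops.
const isDigit = (c: string): boolean => c >= '0' && c <= '9';
const isWhiteSpace = (c: string): boolean => /\s/.test(c);
const isAlphaNumeric = (c: string): boolean => /[A-Za-z0-9_]/.test(c);
```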

public nextToken = (): SyntaxToken => {

    if (this.position >= this.text.length) {
        return new SyntaxToken(SyntaxKind.EndOfFileToken, this.position, '\0', null);
    }

    if (isDigit(this.current())) {
        const start = this.position;
        while (isDigit(this.current())) {
            this.next();
        }
        const text = this.text.substring(start, this.position);
        const value = parseInt(text);
        if (isNaN(value) || !isFinite(value)) {
            this.errors.push(`The number ${text} is not a valid number at ${this.position}!`);
        }
        return new SyntaxToken(SyntaxKind.NumberToken, start, text, value);
    }

    if (isWhiteSpace(this.current())) {
        const start = this.position;
        while (isWhiteSpace(this.current())) {
            this.next();
        }
        const text = this.text.substring(start, this.position);
        return new SyntaxToken(SyntaxKind.WhiteSpaceToken, start, text, text);
    }

    if (isAlphaNumeric(this.current())) {
        const start = this.position;
        while (isAlphaNumeric(this.current())) {
            this.next();
        }
        const text = this.text.substring(start, this.position);
        return new SyntaxToken(SyntaxKind.AlphaNumericToken, start, text, text);
    }

    const current = this.current();
    let token: SyntaxToken;
    switch (current) {
        case '+':
            token = new SyntaxToken(SyntaxKind.PlusToken, this.position, '+', null);
            this.next();
            return token;
        case '-':
            token = new SyntaxToken(SyntaxKind.MinusToken, this.position, '-', null);
            this.next();
            return token;
        case '*':
            token = new SyntaxToken(SyntaxKind.AsteriskToken, this.position, '*', null);
            this.next();
            return token;
        case '/':
            token = new SyntaxToken(SyntaxKind.SlashToken, this.position, '/', null);
            this.next();
            return token;
        case '(':
            token = new SyntaxToken(SyntaxKind.OpenParenthesisToken, this.position, '(', null);
            this.next();
            return token;
        case ')':
            token = new SyntaxToken(SyntaxKind.CloseParenthesisToken, this.position, ')', null);
            this.next();
            return token;
        case '%':
            token = new SyntaxToken(SyntaxKind.PercentageToken, this.position, '%', null);
            this.next();
            return token;
        case '=':
            token = new SyntaxToken(SyntaxKind.EqualsToken, this.position, '=', null);
            this.next();
            return token;
        case ';':
            token = new SyntaxToken(SyntaxKind.EndOfLineToken, this.position, ';', null);
            this.next();
            return token;
        default:
            this.next();
            // substring takes an end index, not a length, so slice up to the
            // current position to capture the single offending character.
            return new SyntaxToken(SyntaxKind.BadToken, this.position - 1, this.text.substring(this.position - 1, this.position), null);
    }
}

This should give us all the tokens supported by our programming language, and this is where the lexer's task comes to an end.
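To see the whole scanning loop in action, here is a condensed, self-contained version of the same idea (a plain tokenize function with a reduced token set; names are illustrative, not the Balance API), applied to the 22 * 234 example from the beginning:

```typescript
// A condensed sketch of the scanning loop: turn an input string into
// (kind, text) pairs using the same peek-until-the-run-ends strategy.
function tokenize(input: string): Array<{ kind: string; text: string }> {
    const tokens: Array<{ kind: string; text: string }> = [];
    const isDigit = (c: string) => c >= '0' && c <= '9';
    const isWhiteSpace = (c: string) => /\s/.test(c);
    let position = 0;

    while (position < input.length) {
        const start = position;
        if (isDigit(input[position])) {
            // Consume the whole run of digits as one number token.
            while (position < input.length && isDigit(input[position])) position++;
            tokens.push({ kind: 'NumberToken', text: input.substring(start, position) });
        } else if (isWhiteSpace(input[position])) {
            while (position < input.length && isWhiteSpace(input[position])) position++;
            tokens.push({ kind: 'WhiteSpaceToken', text: input.substring(start, position) });
        } else if (input[position] === '*') {
            position++;
            tokens.push({ kind: 'AsteriskToken', text: '*' });
        } else {
            // Anything unrecognized becomes a bad token, as in nextToken above.
            position++;
            tokens.push({ kind: 'BadToken', text: input.substring(start, position) });
        }
    }
    return tokens;
}
```

Running tokenize("22 * 234") yields a NumberToken, a WhiteSpaceToken, an AsteriskToken, another WhiteSpaceToken and a final NumberToken, which is exactly the token stream the parser will consume in the next part.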
