DeveloperBreeze

A Journey into Compiler Construction in 2025

Table of Contents

  1. Overview & Motivation
  2. Prerequisites & Setup
  3. Designing Our DSL
  4. Building the Lexer
  5. Writing a Recursive-Descent Parser
  6. Constructing the AST
  7. Generating LLVM IR
  8. Optimization and Code Generation
  9. Integrating the Runtime and Testing
  10. Conclusion & Next Steps

1. Overview & Motivation

Modern applications demand flexible, domain-focused solutions. A domain-specific language (DSL) lets domain experts express complex logic concisely, and a compiler for it can optimize for the domain’s workloads. Building on LLVM gives you mature optimization and code generation infrastructure, so your language can compile to efficient native code. In this tutorial, we’ll implement a simple mathematical expression language (with room for future extensions) and compile its expressions to LLVM IR.

2. Prerequisites & Setup

Tools & Libraries:

  • C++ Compiler: A modern C++ compiler with C++17 support (e.g., GCC 10+, Clang 12+).
  • CMake: For building and managing your project.
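  • LLVM: Version 15 or later, with the libraries and headers installed. The optimization code in Section 7 uses the legacy pass manager, which recent LLVM releases are phasing out for IR optimization, so check that the pass-creation functions it calls are available in your version.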
  • Optional: Git for version control and sample project management.

Project Structure Example:

mydsl/
├── CMakeLists.txt
├── include/
│   └── DSL/
│       ├── Lexer.h
│       ├── Parser.h
│       └── AST.h
└── src/
    ├── Lexer.cpp
    ├── Parser.cpp
    ├── AST.cpp
    ├── CodeGen.cpp
    └── main.cpp

Your CMakeLists.txt should find and link LLVM libraries. For example:

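cmake_minimum_required(VERSION 3.15)
project(MyDSL LANGUAGES CXX)

# Build as C++17, as listed in the prerequisites.
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

find_package(LLVM REQUIRED CONFIG)
message(STATUS "Found LLVM ${LLVM_PACKAGE_VERSION}")
message(STATUS "Using LLVMConfig.cmake in: ${LLVM_DIR}")

include_directories(${LLVM_INCLUDE_DIRS})
separate_arguments(LLVM_DEFINITIONS_LIST NATIVE_COMMAND ${LLVM_DEFINITIONS})
add_definitions(${LLVM_DEFINITIONS_LIST})

set(SOURCE_FILES
    src/main.cpp
    src/Lexer.cpp
    src/Parser.cpp
    src/AST.cpp
    src/CodeGen.cpp
)

add_executable(mydsl ${SOURCE_FILES})

# Make the project's own headers visible so that "DSL/Lexer.h" resolves.
target_include_directories(mydsl PRIVATE include)

# instcombine and scalaropts provide the passes used in optimizeModule.
llvm_map_components_to_libnames(llvm_libs support core irreader instcombine scalaropts nativecodegen)
target_link_libraries(mydsl ${llvm_libs})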

3. Designing Our DSL

For this tutorial, we’ll create a DSL for evaluating mathematical expressions. Our language will support:

  • Numeric literals (e.g., 42, 3.14)
  • Basic arithmetic operators: +, -, *, /
  • Parentheses for grouping expressions

Later, you can extend it to include variables, functions, or even control flow constructs.

Example DSL Code:

(3 + 4) * (5 - 2) / 2

4. Building the Lexer

The lexer (or tokenizer) converts a stream of characters into a sequence of tokens. Each token represents a logical unit, such as a number or operator.
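To make the idea concrete, here is a minimal, self-contained sketch of what tokenizing produces. It is not the Lexer class built in this section: the describeTokens helper and its label strings are hypothetical names chosen for illustration, and it only recognizes digit runs with an optional fractional part.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: maps source text to human-readable token labels.
// It illustrates the character-to-token mapping only, not the tutorial's Lexer.
std::vector<std::string> describeTokens(const std::string& src) {
    auto isDigitAt = [&src](std::size_t i) {
        return i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]));
    };

    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < src.size()) {
        if (std::isspace(static_cast<unsigned char>(src[i]))) {
            ++i;                               // whitespace separates tokens but is not one
        } else if (isDigitAt(i)) {
            std::size_t start = i;             // digits, then optionally '.' and more digits
            while (isDigitAt(i)) ++i;
            if (i < src.size() && src[i] == '.') {
                ++i;
                while (isDigitAt(i)) ++i;
            }
            out.push_back("Number(" + src.substr(start, i - start) + ")");
        } else {
            switch (src[i]) {                  // single-character tokens
                case '+': out.push_back("Plus"); break;
                case '-': out.push_back("Minus"); break;
                case '*': out.push_back("Asterisk"); break;
                case '/': out.push_back("Slash"); break;
                case '(': out.push_back("LParen"); break;
                case ')': out.push_back("RParen"); break;
                default:  out.push_back("Invalid"); break;
            }
            ++i;
        }
    }
    out.push_back("EndOfFile");                // explicit end marker for the parser
    return out;
}
```

For the input (3 + 4.5) * 2 this yields LParen, Number(3), Plus, Number(4.5), RParen, Asterisk, Number(2), EndOfFile. Whitespace never appears in the output, which is why the parser never has to think about it.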

Header: Lexer.h

#ifndef DSL_LEXER_H
#define DSL_LEXER_H

#include <cstddef>
#include <string>

enum class TokenType {
    Number,
    Plus,
    Minus,
    Asterisk,
    Slash,
    LParen,
    RParen,
    EndOfFile,
    Invalid
};

struct Token {
    TokenType type;
    std::string text;
    double value; // Only valid for Number tokens.
};

class Lexer {
public:
    explicit Lexer(const std::string& input);
    Token getNextToken();

private:
    const std::string input;
    size_t pos = 0;
    char currentChar();

    void advance();
    void skipWhitespace();
    Token number();
};

#endif // DSL_LEXER_H

Implementation: Lexer.cpp

#include "DSL/Lexer.h"
#include <cctype>
#include <cstdlib>

Lexer::Lexer(const std::string& input) : input(input) {}

char Lexer::currentChar() {
    if (pos < input.size()) {
        return input[pos];
    }
    return '\0';
}

void Lexer::advance() {
    pos++;
}

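void Lexer::skipWhitespace() {
    // Cast to unsigned char: passing a negative char to <cctype> functions is undefined.
    while (std::isspace(static_cast<unsigned char>(currentChar()))) {
        advance();
    }
}

Token Lexer::number() {
    size_t start = pos;
    bool seenDot = false;
    bool seenDigit = false;
    // Accept digits and at most one decimal point, so "1.2.3" is not read as one number.
    while (true) {
        char c = currentChar();
        if (std::isdigit(static_cast<unsigned char>(c))) {
            seenDigit = true;
        } else if (c == '.' && !seenDot) {
            seenDot = true;
        } else {
            break;
        }
        advance();
    }
    std::string numStr = input.substr(start, pos - start);
    if (!seenDigit) {
        return { TokenType::Invalid, numStr, 0 }; // A lone "." is not a number.
    }
    double value = std::strtod(numStr.c_str(), nullptr);
    return { TokenType::Number, numStr, value };
}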

Token Lexer::getNextToken() {
    skipWhitespace();

    char current = currentChar();

    if (current == '\0') {
        return { TokenType::EndOfFile, "", 0 };
    }
    if (std::isdigit(static_cast<unsigned char>(current)) || current == '.') {
        return number();
    }

    Token token;
    token.text = std::string(1, current);
    token.value = 0;
    switch (current) {
        case '+': token.type = TokenType::Plus; break;
        case '-': token.type = TokenType::Minus; break;
        case '*': token.type = TokenType::Asterisk; break;
        case '/': token.type = TokenType::Slash; break;
        case '(': token.type = TokenType::LParen; break;
        case ')': token.type = TokenType::RParen; break;
        default: token.type = TokenType::Invalid; break;
    }
    advance();
    return token;
}

5. Writing a Recursive-Descent Parser

We’ll implement a recursive-descent parser that constructs an Abstract Syntax Tree (AST) from tokens. Our grammar is defined with standard operator precedence:

expression → term (('+' | '-') term)*
term       → factor (('*' | '/') factor)*
factor     → Number | '(' expression ')'
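One way to see how this grammar encodes precedence and left associativity is to evaluate expressions directly while parsing, with one function per rule. The sketch below is an illustration under stated assumptions, separate from the Parser class that follows: MiniParser and evaluate are hypothetical names, the code computes a double instead of building an AST, it works on characters instead of tokens, and it only accepts numbers that start with a digit.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical one-function-per-rule evaluator for the grammar above.
struct MiniParser {
    const std::string& src;
    std::size_t pos = 0;

    bool digitHere() const {
        return pos < src.size() && std::isdigit(static_cast<unsigned char>(src[pos]));
    }

    // Skips whitespace and returns the next character, or '\0' at end of input.
    char peek() {
        while (pos < src.size() && std::isspace(static_cast<unsigned char>(src[pos]))) ++pos;
        return pos < src.size() ? src[pos] : '\0';
    }

    // factor → Number | '(' expression ')'
    double factor() {
        if (peek() == '(') {
            ++pos;
            double v = expression();
            if (peek() != ')') throw std::runtime_error("expected ')'");
            ++pos;
            return v;
        }
        if (!digitHere()) throw std::runtime_error("expected a number or '('");
        std::size_t start = pos;
        while (digitHere()) ++pos;
        if (pos < src.size() && src[pos] == '.') {
            ++pos;
            while (digitHere()) ++pos;
        }
        return std::stod(src.substr(start, pos - start));
    }

    // term → factor (('*' | '/') factor)*
    double term() {
        double v = factor();
        for (char c = peek(); c == '*' || c == '/'; c = peek()) {
            ++pos;
            double rhs = factor();
            v = (c == '*') ? v * rhs : v / rhs;   // fold left to right
        }
        return v;
    }

    // expression → term (('+' | '-') term)*
    double expression() {
        double v = term();
        for (char c = peek(); c == '+' || c == '-'; c = peek()) {
            ++pos;
            double rhs = term();
            v = (c == '+') ? v + rhs : v - rhs;   // fold left to right
        }
        return v;
    }
};

// Evaluates a whole input string and rejects anything left over after the expression.
double evaluate(const std::string& text) {
    MiniParser p{text};
    double v = p.expression();
    if (p.peek() != '\0') throw std::runtime_error("unexpected trailing input");
    return v;
}
```

With this sketch, evaluate("8 - 3 - 2") returns 3 rather than 7 because the loop in expression folds operands left to right, and evaluate("2 + 3 * 4") returns 14 because term consumes 3 * 4 before expression sees the +. The parser below has the same shape, except that each rule returns a tree node instead of a number.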

Header: Parser.h

#ifndef DSL_PARSER_H
#define DSL_PARSER_H

#include "Lexer.h"
#include "AST.h"
#include <memory>

class Parser {
public:
    explicit Parser(Lexer& lexer);
    // Parses one complete expression and rejects any trailing tokens.
    std::unique_ptr<ASTNode> parseExpression();

private:
    Lexer& lexer;
    Token currentToken;

    void eat(TokenType type);
    std::unique_ptr<ASTNode> factor();
    std::unique_ptr<ASTNode> term();
    std::unique_ptr<ASTNode> expression();
};

#endif // DSL_PARSER_H

Implementation: Parser.cpp

#include "DSL/Parser.h"
#include <stdexcept>
#include <utility>

Parser::Parser(Lexer& lexer) : lexer(lexer) {
    currentToken = lexer.getNextToken();
}

// Consumes the current token if it has the expected type, otherwise reports an error.
void Parser::eat(TokenType type) {
    if (currentToken.type == type) {
        currentToken = lexer.getNextToken();
    } else {
        throw std::runtime_error("Unexpected token: " + currentToken.text);
    }
}

// factor → Number | '(' expression ')'
std::unique_ptr<ASTNode> Parser::factor() {
    if (currentToken.type == TokenType::Number) {
        auto node = std::make_unique<NumberExprAST>(currentToken.value);
        eat(TokenType::Number);
        return node;
    } else if (currentToken.type == TokenType::LParen) {
        eat(TokenType::LParen);
        auto node = expression();
        eat(TokenType::RParen);
        return node;
    }
    throw std::runtime_error("Invalid syntax in factor.");
}

// term → factor (('*' | '/') factor)*
std::unique_ptr<ASTNode> Parser::term() {
    auto node = factor();
    while (currentToken.type == TokenType::Asterisk || currentToken.type == TokenType::Slash) {
        TokenType op = currentToken.type;
        eat(op);
        auto right = factor();
        node = std::make_unique<BinaryExprAST>(op, std::move(node), std::move(right));
    }
    return node;
}

// expression → term (('+' | '-') term)*
std::unique_ptr<ASTNode> Parser::expression() {
    auto node = term();
    while (currentToken.type == TokenType::Plus || currentToken.type == TokenType::Minus) {
        TokenType op = currentToken.type;
        eat(op);
        auto right = term();
        node = std::make_unique<BinaryExprAST>(op, std::move(node), std::move(right));
    }
    return node;
}

std::unique_ptr<ASTNode> Parser::parseExpression() {
    auto node = expression();
    // Input such as "1 2" or "(1 + 2))" must not be silently accepted.
    if (currentToken.type != TokenType::EndOfFile) {
        throw std::runtime_error("Unexpected token after expression: " + currentToken.text);
    }
    return node;
}

6. Constructing the AST

Our AST nodes will represent numeric literals and binary operations. Later, these nodes are traversed to generate LLVM IR.
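Before LLVM enters the picture, it can help to look at the tree shape in isolation. The sketch below is a simplified stand-in, not the tutorial's AST.h: Expr, Num, Bin, and the helper functions are hypothetical names, operators are stored as plain characters instead of TokenType, and the virtual eval method interprets the tree directly where the real nodes emit IR through codegen.

```cpp
#include <memory>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-ins for ASTNode, NumberExprAST, and BinaryExprAST.
struct Expr {
    virtual ~Expr() = default;
    virtual double eval() const = 0;        // interpret the subtree
    virtual std::string show() const = 0;   // fully parenthesized text form
};

struct Num : Expr {
    double value;
    explicit Num(double v) : value(v) {}
    double eval() const override { return value; }
    std::string show() const override {
        std::ostringstream os;
        os << value;                        // prints 3 for 3.0 and 4.5 for 4.5
        return os.str();
    }
};

struct Bin : Expr {
    char op;
    std::unique_ptr<Expr> lhs, rhs;         // each node owns its children
    Bin(char op, std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
        : op(op), lhs(std::move(l)), rhs(std::move(r)) {}
    double eval() const override {
        double l = lhs->eval();
        double r = rhs->eval();
        switch (op) {
            case '+': return l + r;
            case '-': return l - r;
            case '*': return l * r;
            case '/': return l / r;
        }
        throw std::logic_error("unknown operator");
    }
    std::string show() const override {
        return "(" + lhs->show() + " " + op + " " + rhs->show() + ")";
    }
};

// Small builders to keep tree construction readable.
std::unique_ptr<Expr> num(double v) { return std::make_unique<Num>(v); }
std::unique_ptr<Expr> bin(char op, std::unique_ptr<Expr> l, std::unique_ptr<Expr> r) {
    return std::make_unique<Bin>(op, std::move(l), std::move(r));
}

// The tree a left-associative parser would build for (3 + 4) * (5 - 2) / 2.
std::unique_ptr<Expr> sampleTree() {
    return bin('/',
               bin('*', bin('+', num(3), num(4)), bin('-', num(5), num(2))),
               num(2));
}
```

Here sampleTree()->show() gives (((3 + 4) * (5 - 2)) / 2), which makes the left-associative nesting of * and / visible, and sampleTree()->eval() returns 10.5. A small interpreter like this can also serve as a reference when checking what the generated code computes.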

Header: AST.h

#ifndef DSL_AST_H
#define DSL_AST_H

#include "Lexer.h" // TokenType
#include <memory>
#include <utility>
#include <llvm/IR/IRBuilder.h>
#include <llvm/IR/Value.h>

// Base class for all expression nodes.
class ASTNode {
public:
    virtual ~ASTNode() = default;
    virtual llvm::Value* codegen(llvm::IRBuilder<>& builder) = 0;
};

// Expression for numeric literals.
class NumberExprAST : public ASTNode {
public:
    explicit NumberExprAST(double value) : value(value) {}
    llvm::Value* codegen(llvm::IRBuilder<>& builder) override;

private:
    double value;
};

// Expression for a binary operator.
class BinaryExprAST : public ASTNode {
public:
    BinaryExprAST(TokenType op, std::unique_ptr<ASTNode> lhs,
                  std::unique_ptr<ASTNode> rhs)
        : op(op), lhs(std::move(lhs)), rhs(std::move(rhs)) {}
    llvm::Value* codegen(llvm::IRBuilder<>& builder) override;

private:
    TokenType op;
    std::unique_ptr<ASTNode> lhs, rhs;
};

#endif // DSL_AST_H

Implementation: AST.cpp

#include "DSL/AST.h"
#include <llvm/IR/Constants.h>
#include <llvm/IR/LLVMContext.h>

llvm::Value* NumberExprAST::codegen(llvm::IRBuilder<>& builder) {
    return llvm::ConstantFP::get(builder.getContext(), llvm::APFloat(value));
}

llvm::Value* BinaryExprAST::codegen(llvm::IRBuilder<>& builder) {
    llvm::Value* L = lhs->codegen(builder);
    llvm::Value* R = rhs->codegen(builder);
    if (!L || !R) return nullptr;

    switch (op) {
        case TokenType::Plus:
            return builder.CreateFAdd(L, R, "addtmp");
        case TokenType::Minus:
            return builder.CreateFSub(L, R, "subtmp");
        case TokenType::Asterisk:
            return builder.CreateFMul(L, R, "multmp");
        case TokenType::Slash:
            return builder.CreateFDiv(L, R, "divtmp");
        default:
            return nullptr;
    }
}

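Note: We use LLVM’s IRBuilder to simplify the creation of IR instructions. By default IRBuilder also folds constants as it goes, so an expression made only of literals, like our example, comes out as a single constant rather than a chain of fadd and fmul instructions. The LLVM C++ API changes between releases, so adjust the calls to match your installed version.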


7. Generating LLVM IR

Now, we integrate our AST with LLVM to generate IR. Our goal is to compile the DSL expression to a function that computes and returns a double.

Implementation: CodeGen.cpp

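#include "DSL/AST.h"
#include <llvm/IR/BasicBlock.h>
#include <llvm/IR/Function.h>
#include <llvm/IR/IRBuilder.h>
#include <llvm/IR/LegacyPassManager.h>
#include <llvm/IR/Module.h>
#include <llvm/IR/Verifier.h>
#include <llvm/Support/raw_ostream.h>
#include <llvm/Transforms/InstCombine/InstCombine.h> // createInstructionCombiningPass
#include <llvm/Transforms/Scalar.h>                  // createReassociatePass, createCFGSimplificationPass
#include <llvm/Transforms/Scalar/GVN.h>              // createGVNPass
#include <iostream>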

llvm::Function* generateFunction(llvm::LLVMContext& context, llvm::Module& module, ASTNode* root) {
    // Create function type: double ().
    llvm::FunctionType* funcType = llvm::FunctionType::get(llvm::Type::getDoubleTy(context), false);
    llvm::Function* function = llvm::Function::Create(funcType, llvm::Function::ExternalLinkage, "main_expr", module);

    llvm::BasicBlock* block = llvm::BasicBlock::Create(context, "entry", function);
    llvm::IRBuilder<> builder(block);

    llvm::Value* retVal = root->codegen(builder);
    if (!retVal) {
        std::cerr << "Error generating code for the expression." << std::endl;
        // Remove the half-built function so the module stays valid.
        function->eraseFromParent();
        return nullptr;
    }

    builder.CreateRet(retVal);
    if (llvm::verifyFunction(*function, &llvm::errs())) {
        function->eraseFromParent();
        return nullptr;
    }

    return function;
}

void optimizeModule(llvm::Module& module) {
    llvm::legacy::PassManager passManager;
    // Add some basic optimization passes.
    // In a production compiler, you'd add many more!
    passManager.add(llvm::createInstructionCombiningPass());
    passManager.add(llvm::createReassociatePass());
    passManager.add(llvm::createGVNPass());
    passManager.add(llvm::createCFGSimplificationPass());
    passManager.run(module);
}

8. Optimization and Code Generation

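After generating LLVM IR, we optimize it using LLVM’s pass managers. The optimizeModule function in Section 7 adds four standard passes: instruction combining, reassociation, global value numbering (GVN), and control-flow-graph simplification. It uses the legacy pass manager; recent LLVM releases steer IR optimization toward the new pass manager (built with llvm::PassBuilder), so consider migrating if you target the latest versions. Once the DSL gains variables and control flow, these passes have more to work with, and you can add custom passes of your own.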

To compile the IR to machine code, consider using LLVM’s JIT (via LLVM’s ORC JIT APIs) or emitting an object file that can be linked into a larger application.


9. Integrating the Runtime and Testing

Main Entry Point: main.cpp

#include "DSL/Lexer.h"
#include "DSL/Parser.h"
#include "DSL/AST.h"
#include <llvm/IR/Function.h>
#include <llvm/IR/LLVMContext.h>
#include <llvm/IR/Module.h>
#include <llvm/Support/TargetSelect.h>
#include <llvm/Support/raw_ostream.h> // llvm::outs
#include <iostream>
#include <memory>
#include <string>

extern llvm::Function* generateFunction(llvm::LLVMContext& context, llvm::Module& module, ASTNode* root);
extern void optimizeModule(llvm::Module& module);

int main() {
    // Initialize LLVM.
    llvm::InitializeNativeTarget();
    llvm::InitializeNativeTargetAsmPrinter();
    llvm::LLVMContext context;
    llvm::Module module("MyDSLModule", context);

    std::string input;
    std::cout << "Enter an expression: ";
    std::getline(std::cin, input);

    Lexer lexer(input);
    Parser parser(lexer);
    std::unique_ptr<ASTNode> astRoot;
    try {
        astRoot = parser.parseExpression();
    } catch (const std::exception& ex) {
        std::cerr << "Parsing error: " << ex.what() << std::endl;
        return 1;
    }

    llvm::Function* func = generateFunction(context, module, astRoot.get());
    if (!func) {
        std::cerr << "Failed to generate LLVM function." << std::endl;
        return 1;
    }

    optimizeModule(module);

    // For demonstration, print the LLVM IR.
    module.print(llvm::outs(), nullptr);

    // In a full implementation, you could now JIT compile and execute the function.
    return 0;
}

This basic runtime lets you enter a mathematical expression, compiles it into LLVM IR, optimizes the code, and prints the IR. Expanding this further, you can use LLVM’s JIT compilation APIs to execute the code on the fly, integrate debugging information, or even embed the DSL into larger systems.


10. Conclusion & Next Steps

In this tutorial, you learned how to build a DSL from scratch with modern C++ and LLVM:

  • Lexing & Parsing: Tokenizing input and building an AST using recursive-descent parsing.
  • AST & Code Generation: Creating an AST that maps directly to LLVM IR, enabling advanced optimizations.
  • Optimization & Execution: Leveraging LLVM’s optimization passes and setting the stage for JIT compilation.

Next Steps:

  • Enhance the DSL: Add support for variables, functions, and control flow constructs.
  • Improve Error Handling: Develop a robust error recovery strategy in your parser.
  • Integrate JIT Execution: Use LLVM’s ORC JIT to compile and run your DSL expressions dynamically.
  • Experiment with Optimizations: Explore custom optimization passes and advanced LLVM features available in 2025.

This project is just the beginning. Compiler construction is a deep field with many avenues for research and innovation. Happy coding, and enjoy pushing the boundaries of language design and performance!
