A Journey into Compiler Construction in 2025
Table of Contents
- Overview & Motivation
- Prerequisites & Setup
- Designing Our DSL
- Building the Lexer
- Writing a Recursive-Descent Parser
- Constructing the AST
- Generating LLVM IR
- Optimization and Code Generation
- Integrating the Runtime and Testing
- Conclusion & Next Steps
1. Overview & Motivation
Modern applications demand flexible, domain-focused solutions. A DSL allows domain experts to express complex logic concisely and can be optimized for performance. Leveraging LLVM gives you access to state-of-the-art optimization and code generation tools, so you can create languages that compile to highly efficient native code. In this tutorial, we’ll implement a simple mathematical expression language (with potential for future extensions) that computes results in real time.
2. Prerequisites & Setup
Tools & Libraries:
- C++ Compiler: A modern C++ compiler with C++17 support (e.g., GCC 10+, Clang 12+).
- CMake: For building and managing your project.
- LLVM: Version 15+ (or later). Ensure you have the LLVM libraries and headers installed.
- Optional: Git for version control and sample project management.
Project Structure Example:
mydsl/
├── CMakeLists.txt
├── include/
│ └── DSL/
│ ├── Lexer.h
│ ├── Parser.h
│ └── AST.h
└── src/
├── Lexer.cpp
├── Parser.cpp
├── AST.cpp
├── CodeGen.cpp
└── main.cpp
Your CMakeLists.txt should find and link the LLVM libraries. For example:
cmake_minimum_required(VERSION 3.15)
project(MyDSL)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
find_package(LLVM REQUIRED CONFIG)
message(STATUS "Found LLVM ${LLVM_PACKAGE_VERSION}")
message(STATUS "Using LLVMConfig.cmake in: ${LLVM_DIR}")
# Add the project's own include/ directory so the "DSL/..." headers resolve.
include_directories(include ${LLVM_INCLUDE_DIRS})
add_definitions(${LLVM_DEFINITIONS})
set(SOURCE_FILES
src/main.cpp
src/Lexer.cpp
src/Parser.cpp
src/AST.cpp
src/CodeGen.cpp
)
add_executable(mydsl ${SOURCE_FILES})
llvm_map_components_to_libnames(llvm_libs support core irreader nativecodegen)
target_link_libraries(mydsl ${llvm_libs})
3. Designing Our DSL
For this tutorial, we’ll create a DSL for evaluating mathematical expressions. Our language will support:
- Numeric literals (e.g., 42, 3.14)
- Basic arithmetic operators: +, -, *, /
- Parentheses for grouping expressions
Later, you can extend it to include variables, functions, or even control flow constructs.
Example DSL Code:
(3 + 4) * (5 - 2) / 2
Evaluated left to right, this works out to 7 * 3 / 2 = 10.5.
4. Building the Lexer
The lexer (or tokenizer) converts a stream of characters into a sequence of tokens. Each token represents a logical unit, such as a number or operator.
Header: Lexer.h
#ifndef DSL_LEXER_H
#define DSL_LEXER_H
#include <string>
#include <vector>
enum class TokenType {
Number,
Plus,
Minus,
Asterisk,
Slash,
LParen,
RParen,
EndOfFile,
Invalid
};
struct Token {
TokenType type;
std::string text;
double value; // Only valid for Number tokens.
};
class Lexer {
public:
Lexer(const std::string& input);
Token getNextToken();
private:
const std::string input;
size_t pos = 0;
char currentChar();
void advance();
void skipWhitespace();
Token number();
};
#endif // DSL_LEXER_H
Implementation: Lexer.cpp
#include "DSL/Lexer.h"
#include <cctype>
#include <cstdlib>
Lexer::Lexer(const std::string& input) : input(input) {}
char Lexer::currentChar() {
if (pos < input.size()) {
return input[pos];
}
return '\0';
}
void Lexer::advance() {
pos++;
}
void Lexer::skipWhitespace() {
while (std::isspace(static_cast<unsigned char>(currentChar()))) {
advance();
}
}
Token Lexer::number() {
// Consume digits and dots; a malformed literal such as "1.2.3" is not rejected
// here, and strtod simply stops parsing at the second dot.
size_t start = pos;
while (std::isdigit(static_cast<unsigned char>(currentChar())) || currentChar() == '.') {
advance();
}
std::string numStr = input.substr(start, pos - start);
double value = std::strtod(numStr.c_str(), nullptr);
return { TokenType::Number, numStr, value };
}
Token Lexer::getNextToken() {
skipWhitespace();
char current = currentChar();
if (current == '\0') {
return { TokenType::EndOfFile, "", 0 };
}
if (std::isdigit(static_cast<unsigned char>(current)) || current == '.') {
return number();
}
Token token;
token.text = std::string(1, current);
token.value = 0;
switch (current) {
case '+': token.type = TokenType::Plus; break;
case '-': token.type = TokenType::Minus; break;
case '*': token.type = TokenType::Asterisk; break;
case '/': token.type = TokenType::Slash; break;
case '(': token.type = TokenType::LParen; break;
case ')': token.type = TokenType::RParen; break;
default: token.type = TokenType::Invalid; break;
}
advance();
return token;
}
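As a quick sanity check, here is a minimal standalone driver (not one of the project files) that tokenizes the sample expression and prints each token's text:
#include "DSL/Lexer.h"
#include <iostream>

int main() {
    // Expected output: ( 3 + 4 ) * ( 5 - 2 ) / 2
    Lexer lexer("(3 + 4) * (5 - 2) / 2");
    for (Token tok = lexer.getNextToken();
         tok.type != TokenType::EndOfFile;
         tok = lexer.getNextToken()) {
        std::cout << tok.text << ' ';
    }
    std::cout << '\n';
    return 0;
}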
5. Writing a Recursive-Descent Parser
We’ll implement a recursive-descent parser that constructs an Abstract Syntax Tree (AST) from the tokens. Our grammar is defined with standard operator precedence:
expression → term ((‘+’ | ‘-’) term)*
term → factor ((‘*’ | ‘/’) factor)*
factor → Number | ‘(’ expression ‘)’
Because term sits below expression in the grammar, * and / bind more tightly than + and -; for example, 3 + 4 * 5 is parsed as 3 + (4 * 5).
Header: Parser.h
#ifndef DSL_PARSER_H
#define DSL_PARSER_H
#include "Lexer.h"
#include "AST.h"
#include <memory>
class Parser {
public:
Parser(Lexer& lexer);
std::unique_ptr<ASTNode> parseExpression();
private:
Lexer& lexer;
Token currentToken;
void eat(TokenType type);
std::unique_ptr<ASTNode> factor();
std::unique_ptr<ASTNode> term();
};
#endif // DSL_PARSER_H
Implementation: Parser.cpp
#include "DSL/Parser.h"
#include <stdexcept>
Parser::Parser(Lexer& lexer) : lexer(lexer) {
currentToken = lexer.getNextToken();
}
void Parser::eat(TokenType type) {
if (currentToken.type == type) {
currentToken = lexer.getNextToken();
} else {
throw std::runtime_error("Unexpected token: " + currentToken.text);
}
}
std::unique_ptr<ASTNode> Parser::factor() {
if (currentToken.type == TokenType::Number) {
auto node = std::make_unique<NumberExprAST>(currentToken.value);
eat(TokenType::Number);
return node;
} else if (currentToken.type == TokenType::LParen) {
eat(TokenType::LParen);
auto node = parseExpression();
eat(TokenType::RParen);
return node;
}
throw std::runtime_error("Invalid syntax in factor.");
}
std::unique_ptr<ASTNode> Parser::term() {
auto node = factor();
while (currentToken.type == TokenType::Asterisk || currentToken.type == TokenType::Slash) {
TokenType op = currentToken.type;
eat(op);
auto right = factor();
node = std::make_unique<BinaryExprAST>(op, std::move(node), std::move(right));
}
return node;
}
std::unique_ptr<ASTNode> Parser::parseExpression() {
auto node = term();
while (currentToken.type == TokenType::Plus || currentToken.type == TokenType::Minus) {
TokenType op = currentToken.type;
eat(op);
auto right = term();
node = std::make_unique<BinaryExprAST>(op, std::move(node), std::move(right));
}
return node;
}
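One caveat: parseExpression stops at the first token it cannot consume, so input like 3 + 4 ) parses successfully and the trailing ) is silently ignored. If you want to reject trailing input, a small helper along these lines works; note that expectEnd is a hypothetical addition, not part of the header above, and would also need a public declaration in Parser.h:
// Hypothetical helper; declare it in Parser.h as:  void expectEnd();
// Throws if anything other than EndOfFile remains after parsing.
void Parser::expectEnd() {
    if (currentToken.type != TokenType::EndOfFile) {
        throw std::runtime_error("Unexpected trailing token: " + currentToken.text);
    }
}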
6. Constructing the AST
Our AST nodes will represent numeric literals and binary operations. Later, these nodes are traversed to generate LLVM IR.
Header: AST.h
#ifndef DSL_AST_H
#define DSL_AST_H
#include "Lexer.h" // for TokenType, reused as the binary operator tag
#include <memory>
#include <llvm/IR/Value.h>
#include <llvm/IR/IRBuilder.h>
// Base class for all expression nodes.
class ASTNode {
public:
virtual ~ASTNode() = default;
virtual llvm::Value* codegen(llvm::IRBuilder<>& builder) = 0;
};
// Expression for numeric literals.
class NumberExprAST : public ASTNode {
public:
NumberExprAST(double value) : value(value) {}
llvm::Value* codegen(llvm::IRBuilder<>& builder) override;
private:
double value;
};
// Expression for a binary operator.
class BinaryExprAST : public ASTNode {
public:
BinaryExprAST(TokenType op, std::unique_ptr<ASTNode> lhs,
std::unique_ptr<ASTNode> rhs)
: op(op), lhs(std::move(lhs)), rhs(std::move(rhs)) {}
llvm::Value* codegen(llvm::IRBuilder<>& builder) override;
private:
TokenType op;
std::unique_ptr<ASTNode> lhs, rhs;
};
#endif // DSL_AST_H
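To make the node layout concrete, here is a hand-built tree (purely illustrative, not part of the project sources) matching what the parser produces for 3 + 4 * 5:
#include "DSL/AST.h"
#include <memory>

// 3 + (4 * 5): the multiplication becomes the right child of the addition node.
auto tree = std::make_unique<BinaryExprAST>(
    TokenType::Plus,
    std::make_unique<NumberExprAST>(3.0),
    std::make_unique<BinaryExprAST>(
        TokenType::Asterisk,
        std::make_unique<NumberExprAST>(4.0),
        std::make_unique<NumberExprAST>(5.0)));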
Implementation: AST.cpp
#include "DSL/AST.h"
#include <llvm/IR/Constants.h>
#include <llvm/IR/LLVMContext.h>
llvm::Value* NumberExprAST::codegen(llvm::IRBuilder<>& builder) {
return llvm::ConstantFP::get(builder.getDoubleTy()->getContext(), llvm::APFloat(value));
}
llvm::Value* BinaryExprAST::codegen(llvm::IRBuilder<>& builder) {
llvm::Value* L = lhs->codegen(builder);
llvm::Value* R = rhs->codegen(builder);
if (!L || !R) return nullptr;
switch (op) {
case TokenType::Plus:
return builder.CreateFAdd(L, R, "addtmp");
case TokenType::Minus:
return builder.CreateFSub(L, R, "subtmp");
case TokenType::Asterisk:
return builder.CreateFMul(L, R, "multmp");
case TokenType::Slash:
return builder.CreateFDiv(L, R, "divtmp");
default:
return nullptr;
}
}
Note: We use LLVM’s IRBuilder to simplify the creation of IR instructions. Adjust the code if the LLVM API has evolved in the version you are using.
7. Generating LLVM IR
Now, we integrate our AST with LLVM to generate IR. Our goal is to compile the DSL expression to a function that computes and returns a double.
Implementation: CodeGen.cpp
#include "DSL/AST.h"
#include <llvm/IR/Module.h>
#include <llvm/IR/Verifier.h>
#include <llvm/IR/LegacyPassManager.h>
#include <llvm/IR/Function.h>
#include <llvm/Support/TargetSelect.h>
#include <llvm/IR/IRBuilder.h>
#include <memory>
#include <iostream>
llvm::Function* generateFunction(llvm::LLVMContext& context, llvm::Module& module, ASTNode* root) {
// Create function type: double ().
llvm::FunctionType* funcType = llvm::FunctionType::get(llvm::Type::getDoubleTy(context), false);
llvm::Function* function = llvm::Function::Create(funcType, llvm::Function::ExternalLinkage, "main_expr", module);
llvm::BasicBlock* block = llvm::BasicBlock::Create(context, "entry", function);
llvm::IRBuilder<> builder(block);
llvm::Value* retVal = root->codegen(builder);
if (!retVal) {
std::cerr << "Error generating code for the expression." << std::endl;
return nullptr;
}
builder.CreateRet(retVal);
if (llvm::verifyFunction(*function, &llvm::errs())) {
function->eraseFromParent();
return nullptr;
}
return function;
}
void optimizeModule(llvm::Module& module) {
llvm::legacy::PassManager passManager;
// Add some basic optimization passes.
// In a production compiler, you'd add many more!
passManager.add(llvm::createInstructionCombiningPass());
passManager.add(llvm::createReassociatePass());
passManager.add(llvm::createGVNPass());
passManager.add(llvm::createCFGSimplificationPass());
passManager.run(module);
}
8. Optimization and Code Generation
After generating LLVM IR, we optimize it using LLVM’s pass managers. The optimizeModule function above demonstrates adding a few standard optimization passes through the legacy pass manager. In 2025, you might incorporate cutting-edge passes or even machine learning–based tuning for further enhancements.
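One caveat: recent LLVM releases have been retiring the legacy middle-end pass wrappers in favor of the new pass manager. If the create*Pass calls above are unavailable in your LLVM version, a roughly equivalent sketch using PassBuilder (the name optimizeModuleNewPM is our own, not an LLVM API) looks like this:
#include <llvm/Passes/PassBuilder.h>
#include <llvm/IR/Module.h>

// Run the standard -O2 pipeline over the module using the new pass manager.
void optimizeModuleNewPM(llvm::Module& module) {
    llvm::LoopAnalysisManager LAM;
    llvm::FunctionAnalysisManager FAM;
    llvm::CGSCCAnalysisManager CGAM;
    llvm::ModuleAnalysisManager MAM;

    llvm::PassBuilder PB;
    PB.registerModuleAnalyses(MAM);
    PB.registerCGSCCAnalyses(CGAM);
    PB.registerFunctionAnalyses(FAM);
    PB.registerLoopAnalyses(LAM);
    PB.crossRegisterProxies(LAM, FAM, CGAM, MAM);

    llvm::ModulePassManager MPM =
        PB.buildPerModuleDefaultPipeline(llvm::OptimizationLevel::O2);
    MPM.run(module, MAM);
}
Either variant leaves the module ready for code generation.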
To compile the IR to machine code, consider using LLVM’s JIT (via LLVM’s ORC JIT APIs) or emitting an object file that can be linked into a larger application.
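As a rough sketch of the JIT route (ORC details vary across LLVM releases, and the runExpression wrapper below is our own, not an LLVM API), LLJIT can take ownership of the module containing main_expr and execute it directly:
#include <llvm/ExecutionEngine/Orc/LLJIT.h>
#include <llvm/ExecutionEngine/Orc/ThreadSafeModule.h>
#include <llvm/Support/TargetSelect.h>
#include <llvm/Support/Error.h>
#include <memory>

// Takes ownership of the context and module that "main_expr" was emitted into.
double runExpression(std::unique_ptr<llvm::LLVMContext> context,
                     std::unique_ptr<llvm::Module> module) {
    llvm::InitializeNativeTarget();
    llvm::InitializeNativeTargetAsmPrinter();

    auto jit = llvm::cantFail(llvm::orc::LLJITBuilder().create());
    llvm::cantFail(jit->addIRModule(
        llvm::orc::ThreadSafeModule(std::move(module), std::move(context))));

    // On LLVM 16+ lookup() returns an ExecutorAddr; on LLVM 15 it returns a
    // JITEvaluatedSymbol, where you would call getAddress() and cast instead.
    auto sym = llvm::cantFail(jit->lookup("main_expr"));
    auto* fn = sym.toPtr<double (*)()>();
    return fn();
}
Because ORC takes ownership, main.cpp would need to allocate the LLVMContext and Module with std::make_unique rather than on the stack before handing them to a function like this.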
9. Integrating the Runtime and Testing
Main Entry Point: main.cpp
#include "DSL/Lexer.h"
#include "DSL/Parser.h"
#include "DSL/AST.h"
#include <llvm/IR/LLVMContext.h>
#include <llvm/IR/Module.h>
#include <llvm/Support/TargetSelect.h>
#include <iostream>
#include <memory>
extern llvm::Function* generateFunction(llvm::LLVMContext& context, llvm::Module& module, ASTNode* root);
extern void optimizeModule(llvm::Module& module);
int main() {
// Initialize LLVM.
llvm::InitializeNativeTarget();
llvm::InitializeNativeTargetAsmPrinter();
llvm::LLVMContext context;
llvm::Module module("MyDSLModule", context);
std::string input;
std::cout << "Enter an expression: ";
std::getline(std::cin, input);
Lexer lexer(input);
Parser parser(lexer);
std::unique_ptr<ASTNode> astRoot;
try {
astRoot = parser.parseExpression();
} catch (const std::exception& ex) {
std::cerr << "Parsing error: " << ex.what() << std::endl;
return 1;
}
llvm::Function* func = generateFunction(context, module, astRoot.get());
if (!func) {
std::cerr << "Failed to generate LLVM function." << std::endl;
return 1;
}
optimizeModule(module);
// For demonstration, print the LLVM IR.
module.print(llvm::outs(), nullptr);
// In a full implementation, you could now JIT compile and execute the function.
return 0;
}
This basic runtime lets you enter a mathematical expression, compiles it into LLVM IR, optimizes the code, and prints the IR. For a constant expression like our example, don't be surprised if the printed IR is just a single ret of a constant: IRBuilder constant-folds operations whose operands are constants, so the whole computation collapses at IR-construction time. Expanding this further, you can use LLVM’s JIT compilation APIs to execute the code on the fly, integrate debugging information, or even embed the DSL into larger systems.
10. Conclusion & Next Steps
In this tutorial, you learned how to build a DSL from scratch with modern C++ and LLVM:
- Lexing & Parsing: Tokenizing input and building an AST using recursive-descent parsing.
- AST & Code Generation: Creating an AST that maps directly to LLVM IR, enabling advanced optimizations.
- Optimization & Execution: Leveraging LLVM’s optimization passes and setting the stage for JIT compilation.
Next Steps:
- Enhance the DSL: Add support for variables, functions, and control flow constructs.
- Improve Error Handling: Develop a robust error recovery strategy in your parser.
- Integrate JIT Execution: Use LLVM’s ORC JIT to compile and run your DSL expressions dynamically.
- Experiment with Optimizations: Explore custom optimization passes and advanced LLVM features available in 2025.
This project is just the beginning. Compiler construction is a deep field with many avenues for research and innovation. Happy coding, and enjoy pushing the boundaries of language design and performance!