Reference

Assembler Phases

The Keurnel Assembler transforms assembly source code into executable machine code through a series of well-defined phases. Each phase performs specific transformations and validations.

Pipeline Overview

The Keurnel Assembler follows a traditional multi-pass architecture, where each phase builds upon the output of the previous phase. This modular design ensures clean separation of concerns and allows for easier debugging and optimization.

1. Pre-processing
2. Lexical Analysis
3. Parsing (Syntax Analysis)
4. Semantic Analysis
5. Code Generation
6. Linking & Output

Phase Details

1. Pre-processing — Implemented

The pre-processing phase prepares the source code for assembly by handling directives, macros, and file inclusions.

2. Lexical Analysis — Implemented

The lexer tokenizes the pre-processed source code into a stream of tokens for the parser. It is architecture-agnostic and uses an ArchitectureProfile for classification.

3. Parsing (Syntax Analysis) — Implemented

The parser analyzes the token stream to build an Abstract Syntax Tree (AST) representing the program structure.

4. Semantic Analysis — Implemented

Semantic analysis validates the logical correctness of the assembly program beyond syntax.

5. Code Generation — Planned

The code generator translates the validated AST into machine code instructions.

6. Linking & Output — Planned

The final phase combines all code segments and produces the executable binary output.

Pre-processor

Processing Pipeline

The pre-processor executes in a specific order to ensure correct dependency resolution:

1. Handle Includes
2. Macro Table Build
3. Collect Calls
4. Expand Macros
5. Symbol Table
6. Conditionals

1. Include File Processing

The %include directive allows you to split your assembly code across multiple files. The pre-processor handles includes through a three-pass algorithm for validation and replacement.

Internal Data Structure:

// PreProcessingInclusion - tracks each include directive
PreProcessingInclusion {
    IncludedFilePath string  // Path of the included file
    LineNumber       int     // Line number for error reporting
}

Three-Pass Processing:

  • Pass 1
    Collection & Validation— Collects all %include "path" directives. Validates that only .kasm files are included (any other extension causes a pre-processing error)
  • Pass 2
    Duplicate Detection— Detects duplicate include directives. A file may only be included once; duplicates are a pre-processing error
  • Pass 3
    Content Replacement— Replaces each directive with file content wrapped in traceability comments: ; FILE: path and ; END FILE: path
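The first two passes can be sketched in Go under stated assumptions: collectIncludes and includeRe are illustrative names, not the assembler's actual API, and pass 3 (content replacement) is omitted.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// PreProcessingInclusion mirrors the structure described above.
type PreProcessingInclusion struct {
	IncludedFilePath string
	LineNumber       int
}

var includeRe = regexp.MustCompile(`%include\s+"([^"]+)"`)

// collectIncludes performs passes 1 and 2: it gathers every %include
// directive, rejects non-.kasm paths, and rejects duplicates.
func collectIncludes(source string) ([]PreProcessingInclusion, error) {
	var found []PreProcessingInclusion
	seen := map[string]bool{}
	for i, line := range strings.Split(source, "\n") {
		m := includeRe.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		path := m[1]
		if !strings.HasSuffix(path, ".kasm") {
			return nil, fmt.Errorf("line %d: only .kasm files may be included: %q", i+1, path)
		}
		if seen[path] {
			return nil, fmt.Errorf("line %d: duplicate include: %q", i+1, path)
		}
		seen[path] = true
		found = append(found, PreProcessingInclusion{IncludedFilePath: path, LineNumber: i + 1})
	}
	return found, nil
}

func main() {
	incs, err := collectIncludes("%include \"lib/macros.kasm\"\nmov rax, 1\n")
	fmt.Println(incs, err)
}
```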

Supported file types:

.kasm — Assembly source files
; Include shared macros
%include "lib/macros.kasm"

; Relative path support
%include "../common/utils.kasm"

Validation Rules:

  • Only .kasm files may be included
  • Each file can only be included once (duplicates cause errors)
  • File content is wrapped in boundary comments for traceability

2. Macro Expansion & Substitution

Macros allow you to define reusable code blocks that are expanded inline during pre-processing. The Keurnel pre-processor uses a multi-pass approach: first detecting macros, building a macro table, collecting all calls, and then performing the expansion.

Internal Data Structures:

// MacroParameter - represents a single macro parameter
MacroParameter {
    Name string  // Parameter name (e.g., "paramA", "paramB")
}

// MacroCall - tracks each invocation of a macro
MacroCall {
    Name       string    // Name of the macro being called
    Arguments  []string  // Arguments in order they are provided
    LineNumber int       // Line number for error reporting
}

// Macro - complete macro definition
Macro {
    Name       string                     // Macro identifier
    Parameters map[string]MacroParameter  // Parameters indexed by name
    Body       string                     // Code to expand
    Calls      []MacroCall                // All invocations found
}

Processing Stages:

  • 1
    PreProcessingMacroTable()— Extracts all macro definitions into a table indexed by name. Internally uses PreProcessingHasMacros() to detect macros via regex pattern %macro\s+\w+\s*\d*. Parses parameter count and captures body until %endmacro
  • 2
    PreProcessingColectMacroCalls()— Finds all macro invocations in source, validates argument count matches parameter count, and stores line numbers for error reporting
  • 3
    PreProcessingReplaceMacroCalls()— Performs actual expansion by replacing %1, %2, etc. with arguments and inserting ; MACRO: name comment
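The expansion step can be approximated with a short sketch; expandMacro is a hypothetical helper, not the real PreProcessingReplaceMacroCalls, but it illustrates the placeholder substitution and the traceability comment:

```go
package main

import (
	"fmt"
	"strings"
)

// expandMacro substitutes each %N placeholder in the macro body with the
// N-th argument and prepends the "; MACRO: name" traceability comment.
func expandMacro(name, body string, args []string) string {
	out := body
	// Replace higher-numbered placeholders first so %10 is not
	// clobbered by an earlier %1 substitution.
	for i := len(args) - 1; i >= 0; i-- {
		out = strings.ReplaceAll(out, fmt.Sprintf("%%%d", i+1), args[i])
	}
	return fmt.Sprintf("; MACRO: %s\n%s", name, out)
}

func main() {
	body := "push %1\nmov rdi, %2\ncall print_value\npop %1"
	fmt.Println(expandMacro("PRINT_REG", body, []string{"rax", "format_hex"}))
}
```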

Syntax Reference:

  • %macro NAME N— Begins macro definition with N parameters (N is optional, defaults to 0)
  • %endmacro— Ends the macro definition block
  • %1, %2, ...— Positional parameter placeholders replaced during expansion

Error Handling:

If the number of arguments in a macro call doesn't match the expected parameter count, the assembler will panic with a detailed error message including the macro name, expected count, actual count, and line number.

Example: Macro definition and expansion
; Define a macro with 2 parameters
%macro PRINT_REG 2
    push %1
    mov rdi, %2
    call print_value
    pop %1
%endmacro

; Usage in source code
PRINT_REG rax, format_hex
PRINT_REG rbx, format_dec

; After pre-processing expands to:
; MACRO: PRINT_REG
push rax
mov rdi, format_hex
call print_value
pop rax

; MACRO: PRINT_REG
push rbx
mov rdi, format_dec
call print_value
pop rbx

3. Constant Definitions & Symbol Table

The pre-processor builds a symbol table for use in conditional assembly. Symbols can be defined explicitly with %define or implicitly through macro definitions.

Three-Pass Symbol Table Building:

  • Pass 1
    Collection & Validation— Collects all %define SYMBOL_NAME directives. Validates that symbol names are non-empty valid identifiers
  • Pass 2
    Duplicate Detection— Detects duplicate %define directives. A symbol may only be defined once; duplicates are a pre-processing error
  • Pass 3
    Macro Integration— Adds all macro names from the macro table as defined symbols, so %ifdef/%ifndef can test for macro existence

Definition syntax:

  • %define SYMBOL— Defines a symbol (sets it to true in symbol table)

Symbol Table Behavior:

  • Symbols are stored as map[string]bool — presence indicates definition
  • Macro names are automatically added to the symbol table
  • Use %ifdef MACRO_NAME to check if a macro is defined
Example: Symbol definitions and macro detection
; Define symbols for conditional assembly
%define DEBUG
%define PLATFORM_X86_64

; Define a macro - automatically added to symbol table
%macro LOG_MSG 1
    push rdi
    mov rdi, %1
    call print_string
    pop rdi
%endmacro

; Check if macro exists before using
%ifdef LOG_MSG
    LOG_MSG "Application started"
%else
    ; No logging macro available
%endif

; Platform-specific code
%ifdef PLATFORM_X86_64
    ; 64-bit specific instructions
    mov rax, [rbp + 8]
%endif

4. Conditional Assembly Directives

Conditional directives allow you to include or exclude code based on symbol definitions. The pre-processor uses a two-pass algorithm with stack-based block matching.

Internal Data Structure:

// conditionalBlock - tracks a complete conditional section
conditionalBlock {
    ifDirective string  // "ifdef" or "ifndef"
    symbol      string  // symbol being tested
    ifStart     int     // byte offset of %ifdef/%ifndef start
    ifEnd       int     // byte offset of %ifdef/%ifndef end
    elseStart   int     // byte offset of %else start (-1 if absent)
    elseEnd     int     // byte offset of %else end (-1 if absent)
    endifStart  int     // byte offset of %endif start
    endifEnd    int     // byte offset of %endif end
    lineNumber  int     // line number for error reporting
}

Two-Pass Processing:

  • Pass 1
    Structural Validation— Uses a stack to match every %ifdef/%ifndef with its %endif. Validates at most one %else per block. Panics on structural errors
  • Pass 2
    Evaluation & Replacement— Evaluates each block against the defined symbols map. Processes blocks in reverse order to preserve byte offsets during replacement
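Pass 1's stack-based matching can be sketched as below. validateConditionals is a hypothetical helper that works line by line and returns an error instead of panicking; the real implementation records byte offsets for the replacement pass.

```go
package main

import (
	"fmt"
	"strings"
)

// validateConditionals matches every %ifdef/%ifndef with its %endif
// using a stack, and allows at most one %else per block.
func validateConditionals(source string) error {
	type frame struct {
		line    int
		hasElse bool
	}
	var stack []frame
	for i, line := range strings.Split(source, "\n") {
		switch word := strings.Fields(line); {
		case len(word) == 0:
		case word[0] == "%ifdef" || word[0] == "%ifndef":
			stack = append(stack, frame{line: i + 1})
		case word[0] == "%else":
			if len(stack) == 0 {
				return fmt.Errorf("line %d: %%else without matching %%ifdef/%%ifndef", i+1)
			}
			if stack[len(stack)-1].hasElse {
				return fmt.Errorf("line %d: duplicate %%else", i+1)
			}
			stack[len(stack)-1].hasElse = true
		case word[0] == "%endif":
			if len(stack) == 0 {
				return fmt.Errorf("line %d: %%endif without matching %%ifdef/%%ifndef", i+1)
			}
			stack = stack[:len(stack)-1]
		}
	}
	if len(stack) != 0 {
		return fmt.Errorf("line %d: unclosed conditional", stack[len(stack)-1].line)
	}
	return nil
}

func main() {
	fmt.Println(validateConditionals("%ifdef A\n%else\n%endif"))
}
```

The stack is what makes nested conditionals work: an inner %endif pops only the innermost open block.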

Available directives:

%ifdef SYMBOL— Include if symbol is defined
%ifndef SYMBOL— Include if symbol is NOT defined
%else— Alternative branch (optional, max one per block)
%endif— End conditional block (required)

Error Conditions:

  • %else without matching %ifdef/%ifndef
  • Duplicate %else within the same conditional block
  • %endif without matching %ifdef/%ifndef
  • Unclosed %ifdef/%ifndef (no matching %endif)
Example: Conditional assembly with symbol table
%define DEBUG

%ifdef DEBUG
    ; Debug-only code - included because DEBUG is defined
    call log_registers
    call dump_stack
%else
    ; Production code - excluded
%endif

%ifndef RELEASE
    ; Include assertions in non-release builds
    %include "debug/assertions.kasm"
%endif

; Nested conditionals are supported
%ifdef FEATURE_A
    %ifdef FEATURE_B
        ; Code when both FEATURE_A and FEATURE_B are defined
    %endif
%endif

Lexical Analysis (Lexer) — In Progress

The lexer (tokenizer) transforms a pre-processed .kasm source string into an ordered sequence of tokens. Each token carries a type, literal value, and source location. The lexer sits between the pre-processor and the parser in the assembly pipeline.

Architecture-Agnostic Design

The lexer does not hardcode any register names, instruction mnemonics, or keywords. Instead, it receives an ArchitectureProfile at construction time that supplies these sets for the target architecture. This means the same lexer can tokenize source code for x86_64, ARM, RISC-V, or any future architecture without modification.

pre-processed source
        │
        ▼
┌──────────────────────────────────────────────┐
│              Lexer                            │
│  LexerNew(input, profile) → Start() → []Token│
│                                              │
│  ┌────────────────────┐                      │
│  │ ArchitectureProfile│ ← injected at        │
│  │  · Registers()     │   construction       │
│  │  · Instructions()  │                      │
│  │  · Keywords()      │                      │
│  └────────────────────┘                      │
└──────────────────────┬───────────────────────┘
                       │ ordered token slice
                       ▼
                   parser input

Architecture Profile Interface

An ArchitectureProfile represents a validated, immutable vocabulary for a specific hardware architecture. It provides three lookup maps for registers, instructions, and keywords.

Interface Definition:

type ArchitectureProfile interface {
    // Registers returns the set of recognised register names (lower-case).
    Registers() map[string]bool
    // Instructions returns the set of recognised instruction mnemonics (lower-case).
    Instructions() map[string]bool
    // Keywords returns the set of reserved language keywords (lower-case).
    Keywords() map[string]bool
}

Profile Properties:

  • All maps store lower-case keys — classification is case-insensitive
  • Maps are guaranteed non-nil — no nil guards needed before lookup
  • Profile is immutable after construction — safe for concurrent use
  • Each lookup is O(1) using pre-built maps

Built-in Profiles:

NewX8664Profile() returns an ArchitectureProfile populated with the x86_64 register set, instruction set, and default keywords. Additional profiles for ARM64 and RISC-V may be added in the future.
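Implementing the interface for a new architecture is mechanical. The sketch below uses a hypothetical toyProfile with a deliberately tiny vocabulary; the real NewX8664Profile is built the same way with full register and instruction sets.

```go
package main

import "fmt"

// ArchitectureProfile is the interface shown above.
type ArchitectureProfile interface {
	Registers() map[string]bool
	Instructions() map[string]bool
	Keywords() map[string]bool
}

// toyProfile is an illustrative two-register profile; all keys are
// stored lower-case and the maps are never nil.
type toyProfile struct {
	registers, instructions, keywords map[string]bool
}

func (p toyProfile) Registers() map[string]bool    { return p.registers }
func (p toyProfile) Instructions() map[string]bool { return p.instructions }
func (p toyProfile) Keywords() map[string]bool     { return p.keywords }

// NewToyProfile constructs the immutable vocabulary up front,
// giving O(1) lookups thereafter.
func NewToyProfile() ArchitectureProfile {
	return toyProfile{
		registers:    map[string]bool{"r0": true, "r1": true},
		instructions: map[string]bool{"mov": true, "add": true},
		keywords:     map[string]bool{"namespace": true},
	}
}

func main() {
	p := NewToyProfile()
	fmt.Println(p.Registers()["r0"], p.Instructions()["jmp"])
}
```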

Lexer Construction

The lexer is constructed via LexerNew(input, profile) which accepts the pre-processed source string and an ArchitectureProfile. Construction is infallible — any valid string (including empty) is accepted.

Construction Semantics:

// LexerNew is the sole constructor
lexer := LexerNew(preProcessedSource, NewX8664Profile())

// After construction:
// - Position and ReadPosition start at 0
// - Line starts at 1
// - Column starts at 0 (incremented to 1 on first char)
// - Tokens slice is initialized as empty, non-nil
// - First character is loaded via readChar()

tokens := lexer.Start() // Perform tokenization

State After Construction:

Position— Starts at 0
ReadPosition— Starts at 0
Line— Starts at 1 (1-based)
Column— Starts at 0, becomes 1 on first char
Ch— Holds first character (or 0/NUL if empty)
Tokens— Empty, non-nil slice

Token Types

Every token emitted by the lexer is classified into exactly one token type. The classification is determined by the character context at the point of consumption, combined with the ArchitectureProfile lookup tables.

  • Whitespace (TokenWhitespace) — Spaces, tabs, \r, \n. Never emitted.
  • Comment (TokenComment) — ; to end of line. Never emitted.
  • Directive (TokenDirective) — %-prefixed word (e.g. %define, %include)
  • Instruction (TokenInstruction) — Known mnemonic from profile (e.g. mov, add, syscall)
  • Register (TokenRegister) — Known register from profile (e.g. rax, x0, a0)
  • Immediate (TokenImmediate) — Decimal (42) or hex (0xFF) numeric literal
  • String (TokenString) — "…" delimited literal (quotes not stored)
  • Keyword (TokenKeyword) — Reserved keyword from profile (e.g. namespace)
  • Identifier (TokenIdentifier) — Other words, labels (_start:), or single punctuation

Note:

Whitespace and comments are consumed but never emitted — the token stream contains only semantically meaningful tokens.

Word Classification

Words (contiguous sequences of letters, digits, underscores, and dots) are classified using the ArchitectureProfile via case-insensitive lookup. The original casing is preserved in the literal.

Classification Priority:

  • 1
    Context Check— If previous token is TokenKeyword, classify as TokenIdentifier regardless of lookup (prevents namespace mov from classifying mov as instruction)
  • 2
    Label Check— If : follows the word, append it and classify as TokenIdentifier (e.g. _start:)
  • 3
    Register Lookup— If lower-cased word matches profile.Registers() → TokenRegister
  • 4
    Instruction Lookup— If lower-cased word matches profile.Instructions() → TokenInstruction
  • 5
    Keyword Lookup— If lower-cased word matches profile.Keywords() → TokenKeyword
  • 6
    Fallback— Otherwise → TokenIdentifier
Example: Word classification
; Input source
namespace myModule    ; "namespace" → Keyword, "myModule" → Identifier
mov rax, rbx         ; "mov" → Instruction, "rax"/"rbx" → Register
_start:              ; "_start:" → Identifier (label)
add eax, 42          ; "add" → Instruction, "eax" → Register, "42" → Immediate
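The priority order above can be expressed as a small function. classifyWord is an illustrative sketch: the lexer's context check and label lookahead are passed in as flags here rather than derived from lexer state.

```go
package main

import (
	"fmt"
	"strings"
)

type TokenType int

const (
	TokenIdentifier TokenType = iota
	TokenRegister
	TokenInstruction
	TokenKeyword
)

// classifyWord applies the classification priority: context and label
// checks first, then register, instruction, and keyword lookups, with
// identifier as the fallback. Lookups are case-insensitive.
func classifyWord(word string, prevWasKeyword, isLabel bool,
	registers, instructions, keywords map[string]bool) TokenType {
	if prevWasKeyword || isLabel {
		return TokenIdentifier // context/label checks win over profile lookups
	}
	lower := strings.ToLower(word)
	switch {
	case registers[lower]:
		return TokenRegister
	case instructions[lower]:
		return TokenInstruction
	case keywords[lower]:
		return TokenKeyword
	}
	return TokenIdentifier
}

func main() {
	regs := map[string]bool{"rax": true}
	instrs := map[string]bool{"mov": true}
	keys := map[string]bool{"namespace": true}
	fmt.Println(classifyWord("MOV", false, false, regs, instrs, keys) == TokenInstruction)
	fmt.Println(classifyWord("mov", true, false, regs, instrs, keys) == TokenIdentifier)
}
```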

Tokenization Process (Start)

Start() performs a single-pass, left-to-right scan of the input and returns an ordered slice of tokens. It is the sole public method that drives tokenization and is guaranteed to be infallible.

Guarantees:

  • Single pass — no backtracking or multi-pass scanning
  • Infallible — cannot fail or panic on any input
  • Complete coverage — every character is handled by exactly one branch
  • Graceful termination — stops when Ch equals 0 (NUL)
  • Accurate positions — each token carries correct Line and Column values

Important:

Start() may be called only once per Lexer instance. Calling it again would re-scan from the exhausted position and return an empty slice.

Token Structure

Each token is a value type carrying four fields. Tokens are safe to copy, compare, and store without aliasing concerns.

Token Fields:

type Token struct {
    Type    TokenType  // Classification (TokenRegister, TokenInstruction, etc.)
    Literal string     // Verbatim text from source (without delimiters for strings)
    Line    int        // 1-based line number where token starts
    Column  int        // 1-based column number where token starts
}

TokenType Methods:

Identifier(), Directive(), Instruction(), Register(), Immediate(), StringLiteral(), Whitespace(), Comment()

x86_64 Register Set

The x86_64 profile includes the following register names. All entries are lower-case in the lookup table; classification is case-insensitive.

64-bit General Purpose:

rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp, r8–r15

32-bit General Purpose:

eax, ebx, ecx, edx, esi, edi, ebp, esp, r8d–r15d

16-bit Registers:

ax, bx, cx, dx, si, di, bp, sp

8-bit Registers:

al, bl, cl, dl, ah, bh, ch, dh, sil, dil, bpl, spl

Segment Registers:

cs, ds, es, fs, gs, ss

Special Registers:

rip, eip, rflags, eflags

x86_64 Instruction Categories

The x86_64 profile includes a comprehensive set of instruction mnemonics organized by category.

Data Transfer:

mov, movzx, movsx, lea, push, pop, xchg

Arithmetic:

add, sub, mul, imul, div, idiv, inc, dec, neg

Bitwise / Shift:

and, or, xor, not, shl, shr, sal, sar, rol, ror

Comparison:

cmp, test

Control Flow:

jmp, je, jne, jz, jnz, jg, jge, jl, jle, ja, jae, jb, jbe, call, ret, syscall, int

System / Misc:

nop, hlt, cli, sti, loop, loope, loopne

Architecture & File Layout

The lexer is split across multiple files within v0/internal/kasm. Each file owns a single concern, ensuring modifications to one area don't affect others.

  • lexer.go — Lexer struct, LexerNew, Start, scanning methods
  • token.go — Token struct definition
  • token_types.go — TokenType enum and convenience methods
  • architecture_profile.go — ArchitectureProfile interface, defaultKeywords()
  • profile_x86_64.go — NewX8664Profile(), the concrete x86_64 profile

Adding a New Architecture:

Adding support for a new architecture requires only two steps: (1) Create a profile file implementing ArchitectureProfile, and (2) Wire the profile in the orchestrator. No changes to the lexer core are required.

Parsing (Syntax Analysis)

The parser transforms an ordered sequence of Token values (produced by the lexer) into a structured Abstract Syntax Tree (AST). Each AST node represents a syntactic construct in the .kasm language — an instruction with its operands, a label declaration, a namespace block, a use import, or a directive. The parser sits between the lexer and the semantic analyser / code-generation stages in the assembly pipeline.

Architecture-Agnostic Design

The parser is architecture-agnostic: it does not validate instruction mnemonics, register names, or operand counts. It recognises the shape of constructs (e.g. "instruction followed by operands separated by commas") but defers validation to a later semantic-analysis pass. Because the parser operates on token types — not literal values — the same parser handles any architecture for which a lexer profile exists.

lexer output ([]Token)
        │
        ▼
┌──────────────────────────────────────────────────────────┐
│                       Parser                              │
│  ParserNew(tokens) → Parse() → (*Program, []ParseError)  │
└──────────────────────┬────────────────────────────────────┘
                       │ AST + diagnostics
                       ▼
              semantic analysis / code generation

Parser Construction

A Parser represents a ready-to-parse consumer of a token slice. If a Parser value exists, it is guaranteed to hold a valid token slice and initialised position state. There is no uninitialised or partially-constructed state.

Parser Struct:

type Parser struct {
    Position int          // Current index into the Tokens slice
    Tokens   []Token      // The input token slice from the lexer
    errors   []ParseError // Accumulated parse errors
}

Construction Requirements:

  • ParserNew(tokens) is the sole constructor — accepts the []Token slice from Lexer.Start()
  • Infallible — cannot fail. An empty slice produces an empty Program; nil treated as empty
  • Position starts at 0, pointing to the first token
  • Token slice stored by reference — parser does not copy or modify tokens
Example: Parser construction
// Lexer produces token slice
tokens := lexer.Start()

// Parser consumes the token slice
parser := ParserNew(tokens)

// Parse returns AST and any errors
program, errors := parser.Parse()

Parsing Process (Parse)

Parse() performs a single left-to-right pass over the token slice and returns a *Program AST and a slice of ParseError values. It is the sole public method that drives parsing.

Parse() Guarantees:

  • Complete consumption — consumes entire token slice, stopping when Position reaches end
  • Progress guarantee — each loop branch consumes at least one token (no infinite loops)
  • Partial results — returns all successfully parsed nodes even when errors occur
  • Source positions — each error carries Line and Column from originating token
  • Single use — may be called only once per Parser instance

Error Handling:

The parser does not abort on the first error — it continues parsing to report as many issues as possible. If no errors occurred, the error slice is empty (not nil).

Token Consumption Helpers

The parser advances through the token slice using a set of helper methods. All advancement goes through these helpers — bounds-checking is centralised and out-of-bounds access is impossible.

  • current() — Returns the token at Position, or a sentinel zero-value Token if at/past the end
  • peek() — Returns the token at Position + 1 without advancing; sentinel if no next token
  • advance() — Increments Position by one and returns the token at the previous position; sentinel if at end
  • expect(tokenType) — If the current token matches, consume and return it; otherwise record a ParseError (no advance)
  • isAtEnd() — Returns true when Position is at/past the token slice length
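The sentinel pattern can be sketched as below. This is an illustrative reconstruction of the cursor mechanics only: expect() and the error slice are omitted.

```go
package main

import "fmt"

type Token struct {
	Type    int
	Literal string
	Line    int
	Column  int
}

// sentinel is the zero-value Token returned at or past the end of input,
// so callers never index out of bounds.
var sentinel Token

// parser sketches the bounds-checked cursor described above.
type parser struct {
	Position int
	Tokens   []Token
}

func (p *parser) isAtEnd() bool { return p.Position >= len(p.Tokens) }

func (p *parser) current() Token {
	if p.isAtEnd() {
		return sentinel
	}
	return p.Tokens[p.Position]
}

func (p *parser) peek() Token {
	if p.Position+1 >= len(p.Tokens) {
		return sentinel
	}
	return p.Tokens[p.Position+1]
}

func (p *parser) advance() Token {
	if p.isAtEnd() {
		return sentinel
	}
	tok := p.Tokens[p.Position]
	p.Position++
	return tok
}

func main() {
	p := &parser{Tokens: []Token{{Literal: "mov"}, {Literal: "rax"}}}
	fmt.Println(p.advance().Literal, p.current().Literal)
}
```

Centralising the bounds check in these four methods is what makes out-of-bounds access impossible elsewhere in the parser.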

AST Node Types

Every construct in the .kasm language maps to exactly one AST node type. The parser produces a flat list of top-level statements inside a Program. Because .kasm is a line-oriented assembly language, there is no nested expression tree — operands are leaves, not recursive sub-expressions.

Statement Types:

type Program struct {
    Statements []Statement  // Ordered slice in source order
}

// Statement kinds:
type InstructionStmt struct {
    Mnemonic string     // Instruction name (e.g., "mov", "add")
    Operands []Operand  // Zero or more operands
    Line, Column int    // Source position
}

type LabelStmt struct {
    Name string         // Label name WITHOUT trailing ":"
    Line, Column int    // Source position
}

type NamespaceStmt struct {
    Name string         // Namespace identifier
    Line, Column int    // Source position
}

type UseStmt struct {
    ModuleName string   // Module to import
    Line, Column int    // Source position
}

type DirectiveStmt struct {
    Literal string      // Full directive including "%" prefix
    Args []Token        // Argument tokens
    Line, Column int    // Source position
}
  • InstructionStmt — An instruction mnemonic followed by zero or more operands, e.g. mov rax, 1
  • LabelStmt — A label declaration (identifier ending in :), e.g. _start:
  • NamespaceStmt — A namespace keyword followed by a name, e.g. namespace myModule
  • UseStmt — A use instruction followed by a module name, e.g. use stdio
  • DirectiveStmt — A pre-processor directive that survived into the token stream, e.g. %section .text

Operand Types

An Operand represents a single argument to an instruction. Operands are not recursive — there are no sub-expressions. Each operand is one of the following kinds:

Operand Types:

type RegisterOperand struct {
    Name string         // Register name (original casing preserved)
    Line, Column int
}

type ImmediateOperand struct {
    Value string        // Numeric literal as string ("42", "0xFF")
    Line, Column int
}

type IdentifierOperand struct {
    Name string         // Symbolic reference (label name, data symbol)
    Line, Column int
}

type StringOperand struct {
    Value string        // String content (delimiters already stripped)
    Line, Column int
}

type MemoryOperand struct {
    Components []MemoryComponent  // Base, displacement, index, operators
    Line, Column int
}
  • RegisterOperand (TokenRegister) — rax, r8, eax
  • ImmediateOperand (TokenImmediate) — 42, 0xFF, 0b1010
  • IdentifierOperand (TokenIdentifier) — label, msg, data_ptr
  • StringOperand (TokenString) — "Hello", "World\n"
  • MemoryOperand (composite) — [rbp], [rax + 8], [rbx + rcx*4]

Memory Operand Parsing:

Memory operands are enclosed in [ and ]. The parser consumes the opening bracket, collects inner tokens (base register, optional displacement, optional index), and consumes the closing bracket. An unterminated [ produces a ParseError.
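That bracket handling can be sketched over a plain slice of token literals; parseMemoryOperand is a hypothetical helper, simplified from the real parser, which works on typed Token values.

```go
package main

import "fmt"

// parseMemoryOperand consumes "[", collects inner component tokens, and
// requires a closing "]". An unterminated bracket yields an error, as
// described above.
func parseMemoryOperand(tokens []string, pos int) (components []string, next int, err error) {
	if pos >= len(tokens) || tokens[pos] != "[" {
		return nil, pos, fmt.Errorf("expected '['")
	}
	pos++ // consume opening bracket
	for pos < len(tokens) && tokens[pos] != "]" {
		components = append(components, tokens[pos])
		pos++
	}
	if pos >= len(tokens) {
		return nil, pos, fmt.Errorf("unterminated memory operand: missing ']'")
	}
	return components, pos + 1, nil // consume closing bracket
}

func main() {
	comps, _, err := parseMemoryOperand([]string{"[", "rbp", "+", "8", "]"}, 0)
	fmt.Println(comps, err)
}
```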

Statement Dispatch

The main parsing loop inspects the current token's type to determine which parsing method to invoke. Because each token type maps to at most one statement kind, dispatch is a simple switch — there is no ambiguity.

Dispatch Rules:

  • 1
    TokenInstruction → parse as InstructionStmt (or UseStmt if literal is use)
  • 2
    TokenIdentifier with trailing : → parse as LabelStmt
  • 3
    TokenIdentifier without : → operand outside instruction context, parse error, recover
  • 4
    TokenKeyword → dispatch by literal: namespace → NamespaceStmt; unknown → error
  • 5
    TokenDirective → parse as DirectiveStmt
  • 6
    TokenRegister, TokenImmediate, TokenString outside instruction → parse error (operand without instruction)
  • 7
    Any other token at top level → parse error, advance past token

Instruction Parsing

When the parser encounters a TokenInstruction, it collects the instruction's operands. Operands are separated by , tokens — the comma is consumed but not stored.

Instruction Parsing Rules:

  • Consume instruction token, record literal as mnemonic
  • Consume zero or more operands separated by commas
  • Parser accepts any number of operands — operand-count validation is a semantic concern
  • If literal (case-insensitive) is use, delegate to UseStmt parsing
Example: Various instruction forms
; Zero operands
ret                ; InstructionStmt { Mnemonic: "ret", Operands: [] }
syscall            ; InstructionStmt { Mnemonic: "syscall", Operands: [] }
nop                ; InstructionStmt { Mnemonic: "nop", Operands: [] }

; One operand
push rax           ; InstructionStmt { Mnemonic: "push", Operands: [RegisterOperand{rax}] }
jmp _exit          ; InstructionStmt { Mnemonic: "jmp", Operands: [IdentifierOperand{_exit}] }

; Two operands
mov rax, 60        ; InstructionStmt { Mnemonic: "mov", Operands: [Register, Immediate] }
add rbx, [rsp]     ; InstructionStmt { Mnemonic: "add", Operands: [Register, Memory] }

Error Handling and Recovery

The parser must be resilient. A syntax error in one statement must not prevent parsing of subsequent statements. Because .kasm is line-oriented, recovery is straightforward — skip to the next statement boundary.

ParseError Structure:

type ParseError struct {
    Message string  // Human-readable error description
    Line    int     // 1-based line number
    Column  int     // 1-based column number
}

Error Handling Guarantees:

  • No panics — malformed sequences, empty slices, and unexpected tokens are handled gracefully
  • Error accumulation — multiple errors may be reported in a single Parse() call
  • Recovery strategy — advance past tokens until a recognisable statement start is found
  • Source order — errors are returned in the order they were encountered

Common Parse Errors:

  • namespace without a following identifier
  • use without a following module name
  • Unclosed memory operand ([ without matching ])
  • Operand without preceding instruction
  • Unknown token at top level

Architecture & File Layout

The parser lives in v0/kasm alongside the lexer and token definitions. Because the parser consumes Token and TokenType from the same package, no cross-package import is required for the core data types.

  • parsing.go — Parser struct, ParserNew, Parse, parsing methods
  • ast.go — AST node types (Program, Statement, Operand, etc.)
  • parse_error.go — ParseError type definition

Separation of Concerns:

  • Parser does not import any architecture-specific package — operates on token types only
  • AST nodes in ast.go separate from parsing logic for reusability
  • ParseError is a plain data struct, not an error interface

Semantic Analysis

The semantic analyser validates a *Program AST (produced by the parser) against the rules of the .kasm language and the target architecture. It detects errors that are syntactically legal but semantically invalid — unknown instructions, wrong operand counts, mismatched operand types, duplicate labels, unresolved symbol references, and namespace violations. The semantic analyser sits between the parser and the code-generation stage in the assembly pipeline.

Architecture-Aware Design

The semantic analyser is architecture-aware: it receives an architecture description (instruction groups with their variants) at construction time and uses it to validate instruction operands. Because the architecture description is injected, the same analyser logic handles any architecture for which instruction metadata exists.

parser output (*Program AST)
        │
        ▼
┌──────────────────────────────────────────────────────────────────┐
│                     Semantic Analyser                             │
│  AnalyserNew(program, instructions) → Analyse() → []SemanticError│
│                                                                  │
│  ┌─────────────────────────────┐                                 │
│  │  Instruction metadata       │ ← injected at construction      │
│  │  (groups, variants, operand │                                 │
│  │   types)                    │                                 │
│  └─────────────────────────────┘                                 │
└──────────────────────┬───────────────────────────────────────────┘
                       │ validated AST + diagnostics
                       ▼
                 code generation

Analyser Construction

An Analyser represents a ready-to-validate consumer of a *Program AST. If an Analyser value exists, it is guaranteed to hold a valid program reference and initialised internal state.

Analyser Struct:

type Analyser struct {
    program      *Program                   // The AST to analyse
    instructions map[string]Instruction     // Instruction lookup (upper-case keys)
    labels       map[string]labelDecl       // Label name → declaration location
    namespaces   map[string]namespaceDecl   // Namespace name → declaration location
    modules      map[string]useDecl         // Module name → import location
    errors       []SemanticError            // Accumulated semantic errors
}

// Helper types for tracking declarations
type labelDecl struct {
    Name string
    Line, Column int
}

type namespaceDecl struct {
    Name string
    Line, Column int
}

type useDecl struct {
    Name string
    Line, Column int
}

Construction Requirements:

  • AnalyserNew(program, instructions) is the sole constructor
  • Infallible — cannot fail. Empty program produces zero errors; nil treated as empty
  • Instruction table must provide O(1) lookup via map[string]Instruction
  • Internal tables (labels, namespaces, modules) initialised as empty during construction
Example: Analyser construction
// Parser produces AST
program, parseErrors := parser.Parse()

// Build instruction table from architecture groups
instructions := buildInstructionTable(x86_64Groups)

// Analyser validates the AST
analyser := AnalyserNew(program, instructions)
semanticErrors := analyser.Analyse()

Analysis Process (Analyse)

Analyse() performs a single left-to-right pass over the Program.Statements slice and returns a []SemanticError slice. It is the sole public method that drives analysis.

Two-Phase Analysis:

  • Phase 1
    Collection— Gather all label declarations and namespace declarations into lookup tables so that forward references can be resolved
  • Phase 2
    Validation— Validate every statement against the collected tables and the instruction metadata

Analyse() Guarantees:

  • Single pass per statement — visits every statement exactly once, in source order
  • Read-only — does not modify the AST (inspects and records diagnostics only)
  • Multi-error reporting — continues analysing to report as many issues as possible
  • Forward reference supportjmp label before label: resolves correctly
  • Single use — may be called only once per Analyser instance

Forward References:

Because .kasm allows forward references (e.g. jmp label before label: is declared), the collection phase must complete before the validation phase begins.

Instruction Validation

When the analyser encounters an InstructionStmt, it must validate the mnemonic and its operands against the architecture's instruction metadata.

Mnemonic Validation:

  • Lookup mnemonic (case-insensitive) in instruction table
  • If not found: "unknown instruction '<mnemonic>'"

Operand Count Validation:

  • If instruction has variants, check operand count matches at least one variant
  • If no match: "instruction '<mnemonic>' expects <n> operand(s), got <m>"
  • If no variants defined, skip count validation (allows partial metadata)

Operand Type Mapping:

AST Node Kind        Semantic Type
RegisterOperand      "register"
ImmediateOperand     "immediate"
MemoryOperand        "memory"
IdentifierOperand    "identifier" (compatible with "relative", "far")
StringOperand        "string"

Operand Type Validation:

  • Use Instruction.FindVariant(operandTypes...) to match variant
  • If no match: "no variant of '<mnemonic>' accepts operands (<type1>, <type2>, ...)"
Example: Instruction validation
; Valid - matches variant (register, immediate)
mov rax, 60          ; ✓ Found: MOV r64, imm32

; Invalid - unknown instruction
xyz rax, rbx         ; ✗ Error: "unknown instruction 'xyz'"

; Invalid - wrong operand count
push rax, rbx        ; ✗ Error: "instruction 'push' expects 1 operand(s), got 2"

; Invalid - wrong operand types
mov 42, rax          ; ✗ Error: "no variant of 'mov' accepts operands (immediate, register)"

Label Validation

Labels are declaration-site identifiers. The analyser must ensure they are unique within their scope and that all references can be resolved.

Duplicate Label Detection:

  • Maintain label table (map of name → location)
  • First declaration accepted
  • Second+ produces error with original location

Undefined Reference Detection:

  • Check every IdentifierOperand
  • Run after all labels collected (phase 2)
  • Non-instruction identifiers not checked
Example: Label validation
; Forward reference - valid
jmp _exit            ; ✓ Resolved in phase 2

_start:              ; ✓ First declaration
    mov rax, 60

_start:              ; ✗ Error: "duplicate label '_start', previously declared at 4:1"
    nop

_exit:               ; ✓ Resolves the forward reference
    syscall

jmp undefined        ; ✗ Error: "undefined reference to 'undefined'"

Namespace Validation

Namespaces group related code under a name. The analyser validates namespace declarations for uniqueness.

Validation Rules:

  • Record namespace name when NamespaceStmt encountered
  • If duplicate: "duplicate namespace '<name>', previously declared at <line>:<column>"
  • Name must be valid identifier (non-empty, doesn't start with digit)

Future Extension:

Future versions may introduce namespace-scoped label resolution (e.g. namespace.label). The namespace table is preserved for downstream stages that implement scoped resolution.

Use Statement Validation

use imports a module by name. The analyser validates the module reference and detects duplicates.

Validation Rules:

  • Record module name when UseStmt encountered
  • If duplicate: "duplicate use of module '<name>', previously imported at <line>:<column>"
  • Module name must be valid identifier (non-empty)

Note:

Module resolution (locating the module's source file or compiled artefact) is not the analyser's responsibility. The analyser validates the statement and records it — a later linker or module resolver consumes the information.

Directive Validation

Directives that survive into the AST (not consumed by the pre-processor) are captured as DirectiveStmt nodes.

Validation Rules:

  • If directive not recognised: "unrecognised directive '<literal>'"
  • Pre-processor consumes: %include, %macro, %endmacro, %define, %ifdef, %ifndef, %else, %endif
  • Any surviving directive is either undefined or a user error

Future Directives:

Future language-level directives (e.g. %section, %align) will be recognised and validated with their arguments.

Immediate Value Validation

ImmediateOperand values are stored as verbatim strings by the parser. The analyser validates they represent legal numeric values.

Validation Rules:

  • Decimal: one or more digits (0-9)
  • Hexadecimal: 0x or 0X followed by hex digits
  • If invalid: "invalid immediate value '<value>'"

Overflow Detection:

Overflow detection is optional in the initial implementation. When implemented, the analyser will warn (not error) when an immediate exceeds the maximum value for the instruction's operand size.

Memory Operand Validation

MemoryOperand nodes contain a Components slice of raw tokens. The analyser validates the structure of the memory reference.

Validation Rules:

  • Non-empty: must contain at least one component
  • Base must be register or identifier: first non-operator component cannot be immediate
  • Valid operators only: only + and - allowed
  • Displacement: registers, immediates, or identifiers after operators
Example: Memory operand validation
; Valid memory operands
mov rax, [rbp]           ; ✓ Register base
mov rax, [rbp + 8]       ; ✓ Register + immediate displacement
mov rax, [rsp - 16]      ; ✓ Register - immediate displacement
mov rax, [data_ptr]      ; ✓ Identifier base

; Invalid memory operands
mov rax, []              ; ✗ Error: "empty memory operand"
mov rax, [42]            ; ✗ Error: "memory operand base must be a register or identifier, got immediate"
mov rax, [rbp * 2]       ; ✗ Error: "invalid operator '*' in memory operand"

Validation Summary

The following table summarises all validation checks performed by the semantic analyser.

Check                    Statement        Error Condition
Unknown instruction      InstructionStmt  Mnemonic not in table
Operand count mismatch   InstructionStmt  No variant matches count
Operand type mismatch    InstructionStmt  No variant matches types
Duplicate label          LabelStmt        Name already declared
Undefined reference      InstructionStmt  Identifier not in label table
Duplicate namespace      NamespaceStmt    Name already declared
Duplicate use            UseStmt          Module already imported
Unrecognised directive   DirectiveStmt    Literal not in recognised set
Invalid immediate        InstructionStmt  Cannot parse as number
Empty memory operand     InstructionStmt  Components slice empty
Invalid memory base      InstructionStmt  First component is immediate
Invalid memory operator  InstructionStmt  Operator not + or -

Architecture & File Layout

The semantic analyser lives in v0/kasm alongside the parser, lexer, and AST definitions.

File               Responsibility
semantic.go        Analyser struct, AnalyserNew, Analyse, validation methods
semantic_error.go  SemanticError type definition

Dependencies:

  • Imports v0/architecture for Instruction and InstructionVariant types
  • Does not import architecture-specific packages — receives instruction table via constructor
  • SemanticError is a plain data struct (like ParseError)

Development Roadmap

The Keurnel Assembler is under active development. The Pre-processing, Lexical Analysis, Parsing, and Semantic Analysis phases are implemented; Code Generation and Linking & Output remain planned and are being developed iteratively to ensure a robust and efficient assembly pipeline.

Phase status legend: Implemented · In Progress · Planned