ADR-003: Fixed Transform Pipeline

Date: 2025-12-16

Status

Implemented - December 16, 2025 (Now Default)

Implementation Summary

The fixed transform pipeline is fully implemented and is now the default and only content processing system in DocBuilder.

Deliverables:

  • ✅ New pipeline package (internal/hugo/pipeline/)
  • ✅ Document type replacing Page/PageShim
  • ✅ Generator and Transform function types
  • ✅ 3 generators (main/repo/section indexes)
  • ✅ 11 transforms (parse FM, normalize indexes, build FM, extract title, strip heading, rewrite links/images, keywords, metadata, edit link, serialize)
  • ✅ Comprehensive unit tests (71 passing in internal/hugo)
  • ✅ Old system removed - Transform registry, patch system, and all legacy code deleted

Migration Complete (December 16, 2025):

  • Removed internal/hugo/transforms/ directory (24 files, registry-based system)
  • Removed internal/hugo/fmcore/ directory (3 files, patch merge system)
  • Removed visualize command and Page/PageShim abstractions
  • Simplified integration code (content_copy.go: 216→13 lines)
  • Net code reduction: -6,233 lines (-88%)

Test Status:

  • Pipeline unit tests: PASS (6 test functions, 12 sub-tests)
  • Hugo package tests: PASS (71 tests)
  • Full short test suite: PASS (all packages)
  • golangci-lint: 0 issues

Context

Current Architecture

At the time of this decision, DocBuilder’s content transformation system used a registry-based, dependency-ordered pipeline with a front matter patching system:

Components:

  • TransformRegistry: Global registry where transforms register themselves via init()
  • Transform interface: Requires Name(), Priority(), DependsOn(), and Apply(page *Page) Patch
  • Dependency resolution: Topological sort based on DependsOn() declarations
  • Patch system: Three merge modes (MergeDeep, MergeReplace, MergeSetIfMissing) with priority-based ordering (see the sketch after this list)
  • Protected keys: Reserved front matter fields that block MergeDeep patches
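
For reference, the patch shapes involved look roughly like this (a sketch reconstructed from the merge modes above and the example transform below, not the exact source):

// Sketch of the old patch types; field names follow the example transform below.
type MergeMode int

const (
    MergeDeep         MergeMode = iota // deep-merge into existing front matter
    MergeReplace                       // overwrite keys, bypassing protected-key checks
    MergeSetIfMissing                  // only set keys that are absent
)

type Patch struct {
    Mode               MergeMode
    Priority           int            // higher priority wins on conflicting keys
    FrontMatterUpdates map[string]any // front matter keys to merge
}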

Current transforms:

  1. front_matter_builder_v2 (priority 50): Initializes base front matter
  2. extract_index_title (priority 55): Extracts H1 from README/index files
  3. strip_heading: Removes first H1 from content
  4. relative_link_rewriter: Fixes relative markdown links
  5. image_link_rewriter: Fixes image paths
  6. Various metadata injectors (repo info, edit links, etc.)

Example transform:

func init() {
    RegisterTransform(&ExtractIndexTitle{})
}

type ExtractIndexTitle struct{}

func (t *ExtractIndexTitle) Name() string { return "extract_index_title" }
func (t *ExtractIndexTitle) Priority() int { return 55 }
func (t *ExtractIndexTitle) DependsOn() []string { return []string{"front_matter_builder_v2"} }

func (t *ExtractIndexTitle) Apply(pg *Page) Patch {
    extractedTitle := extractH1Title(pg.Content) // extraction helper elided in this ADR
    return Patch{
        Mode:     fmcore.MergeReplace,  // Required to override protected "title" key
        Priority: 55,
        FrontMatterUpdates: map[string]any{
            "title": extractedTitle,
        },
    }
}

Problems with Current Architecture

  1. Hidden complexity: Dependencies and execution order are not obvious from reading the code

  2. Non-local reasoning: Understanding transform behavior requires checking:

    • Registration order in init()
    • Declared dependencies in DependsOn()
    • Priority values across multiple transforms
    • Protected key system in patching logic
    • Merge mode semantics (MergeDeep vs MergeReplace)
  3. Debugging difficulty:

    • Recent bug: extract_index_title extracted the correct title, but its patch was silently blocked by protected keys
    • Required temporary debug logging to discover the issue
    • Solution was non-obvious: change MergeDeep to MergeReplace
  4. Indirection overhead:

    • Registry pattern adds abstraction without benefit
    • Topological sort runs on every build
    • Patch merging adds cognitive overhead
  5. False flexibility:

    • Users cannot configure transforms dynamically
    • Registry/dependency system suggests extensibility we don’t support
    • Added complexity without delivering value
  6. Maintenance burden:

    • Adding transforms requires understanding registration, priorities, dependencies, and patch semantics
    • Easy to introduce subtle bugs (wrong merge mode, missing dependency, priority conflicts)

Key Insight

DocBuilder is greenfield and we control the pipeline. We don’t need dynamic transform registration or user-configurable pipelines. We need a solid, predictable pipeline for our specific use case.

Decision

Replace the registry-based, patch-driven pipeline with a fixed, explicit transform pipeline.

Core Principles

  1. Fixed execution order: Transforms are called in explicit sequence defined in code
  2. Direct mutation: Transforms modify Document directly (no patching)
  3. No dynamic registration: No init() registry, no dependency declarations
  4. Simple interfaces: Transform = function that modifies a document
  5. Transparent behavior: Reading the pipeline code shows exact execution order

New Architecture

Core Interfaces:

// FileTransform modifies a document in the pipeline.
// Can optionally return new documents to inject into the pipeline.
// New documents will be queued and processed through ALL transforms from the beginning.
type FileTransform func(doc *Document) ([]*Document, error)

// FileGenerator creates new documents based on analysis of discovered documents.
// Generators run before transforms to create missing files (e.g., _index.md).
type FileGenerator func(ctx *GenerationContext) ([]*Document, error)

// GenerationContext provides access to discovered files for analysis.
type GenerationContext struct {
    Discovered []*Document  // All discovered files from repositories
    Config     *config.Config
}

// Document represents a file being processed through the pipeline.
type Document struct {
    // Content is the markdown body (transformed in-place)
    Content string
    
    // FrontMatter is the YAML front matter (modified directly)
    FrontMatter map[string]any
    
    // Metadata for transforms to use
    Path         string  // Hugo content path (e.g., "repo-name/section/file.md")
    IsIndex      bool    // True if this is _index.md or README.md
    Repository   string  // Source repository name
    SourceCommit string  // Git commit SHA
    SourceURL    string  // Repository URL for edit links
    Generated    bool    // True if this was generated (not discovered)
}

Pipeline Execution:

// processContent runs the complete content processing pipeline.
func (g *Generator) processContent(discovered []*Document) ([]*Document, error) {
    // Phase 1: Generation - Create missing files
    generators := []FileGenerator{
        generateMainIndex,           // 1. Create site _index.md
        generateRepositoryIndexes,   // 2. Create repo _index.md files
        generateSectionIndexes,      // 3. Create section _index.md files
    }
    
    ctx := &GenerationContext{
        Discovered: discovered,
        Config:     g.config,
    }
    
    var generated []*Document
    for _, generator := range generators {
        docs, err := generator(ctx)
        if err != nil {
            return nil, fmt.Errorf("generation failed: %w", err)
        }
        generated = append(generated, docs...)
    }
    
    // Combine discovered + generated
    allDocs := append(discovered, generated...)
    
    // Phase 2: Transformation - Process all documents
    transforms := []FileTransform{
        computeBaseFrontMatter,      // 1. Initialize FrontMatter from file
        extractIndexTitle,           // 2. Extract H1 title from index files
        stripHeading,                // 3. Strip H1 if appropriate
        rewriteRelativeLinks,        // 4. Fix markdown links
        rewriteImageLinks,           // 5. Fix image paths
        generateFromKeywords,        // 6. Create new files based on keywords (e.g., @glossary)
        addRepositoryMetadata,       // 7. Add repo/commit/source metadata
        addEditLink,                 // 8. Generate edit URL
    }
    
    // Process documents iteratively - newly generated docs go through all transforms
    processedDocs := make([]*Document, 0, len(allDocs))
    queue := append([]*Document{}, allDocs...)
    
    for len(queue) > 0 {
        doc := queue[0]
        queue = queue[1:]
        
        // Run all transforms on this document
        for _, transform := range transforms {
            newDocs, err := transform(doc)
            if err != nil {
                return nil, fmt.Errorf("transform failed for %s: %w", doc.Path, err)
            }
            
            // Prevent generated documents from creating new documents
            if len(newDocs) > 0 && doc.Generated {
                return nil, fmt.Errorf(
                    "generated document %s attempted to create new documents (transforms should not generate from generated docs)",
                    doc.Path,
                )
            }
            
            // Queue new documents for full transform pipeline
            if len(newDocs) > 0 {
                queue = append(queue, newDocs...)
            }
        }
        
        processedDocs = append(processedDocs, doc)
    }
    
    return processedDocs, nil
}
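
For orientation, the call site reduces to something like the following (a sketch with assumed names — Repository, discoverDocuments, writeDocument, and g.contentDir are illustrative, not the actual wiring in content_copy.go):

// Sketch of how processContent is driven; helper names are assumptions.
func (g *Generator) buildContent(repos []Repository) error {
    docs := discoverDocuments(repos) // hypothetical discovery step
    processed, err := g.processContent(docs)
    if err != nil {
        return fmt.Errorf("content pipeline: %w", err)
    }
    for _, doc := range processed {
        if err := writeDocument(g.contentDir, doc); err != nil { // hypothetical writer
            return err
        }
    }
    return nil
}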

Example Generator (Creates New Files):

// generateSectionIndexes creates _index.md for sections that don't have one.
func generateSectionIndexes(ctx *GenerationContext) ([]*Document, error) {
    // Group discovered files by section
    sections := make(map[string][]*Document)
    for _, doc := range ctx.Discovered {
        section := filepath.Dir(doc.Path)
        sections[section] = append(sections[section], doc)
    }
    
    var generated []*Document
    for section, docs := range sections {
        // Check if index already exists
        hasIndex := false
        for _, doc := range docs {
            if doc.IsIndex {
                hasIndex = true
                break
            }
        }
        
        if !hasIndex {
            // Generate missing index
            indexDoc := &Document{
                Path:        filepath.Join(section, "_index.md"),
                IsIndex:     true,
                Generated:   true,
                Content:     generateIndexContent(section, docs),
                FrontMatter: map[string]any{
                    "title": titleCase(filepath.Base(section)),
                    "type":  "docs",
                },
            }
            generated = append(generated, indexDoc)
        }
    }
    
    return generated, nil
}
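
generateIndexContent and titleCase are elided above; a plausible minimal sketch (names taken from the generator, behavior assumed rather than taken from the actual implementation):

// Assumed helper implementations, for illustration only.
func generateIndexContent(section string, docs []*Document) string {
    var b strings.Builder
    b.WriteString("Pages in this section:\n\n")
    for _, d := range docs {
        b.WriteString("- " + filepath.Base(d.Path) + "\n")
    }
    return b.String()
}

func titleCase(s string) string {
    // "api-guides" -> "Api Guides": split on separators, capitalize each word.
    words := strings.FieldsFunc(s, func(r rune) bool { return r == '-' || r == '_' })
    for i, w := range words {
        if w != "" {
            words[i] = strings.ToUpper(w[:1]) + w[1:]
        }
    }
    return strings.Join(words, " ")
}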

Example Transform (Modifies Existing Files):

// extractIndexTitle extracts the first H1 heading as the title for index files.
// Only applies if no text exists before the H1.
func extractIndexTitle(doc *Document) ([]*Document, error) {
    if !doc.IsIndex {
        return nil, nil  // Only process index files, no new docs
    }
    
    h1Pattern := regexp.MustCompile(`(?m)^# (.+)$`)
    loc := h1Pattern.FindStringIndex(doc.Content)
    if loc == nil {
        return nil, nil  // No H1 found, no new docs
    }
    
    // Check for text before H1
    textBeforeH1 := strings.TrimSpace(doc.Content[:loc[0]])
    if textBeforeH1 != "" {
        return nil, nil  // Use filename as title, no new docs
    }
    
    // Extract title and set directly
    matches := h1Pattern.FindStringSubmatch(doc.Content)
    doc.FrontMatter["title"] = matches[1]
    
    return nil, nil  // Modified doc in-place, no new docs
}

Example Transform (Generates New Files Based on Keywords):

// generateFromKeywords scans for special keywords and generates related files.
// Example: an @glossary tag creates a glossary page from all terms.
//
// If this transform returns new documents while processing a Generated document,
// the pipeline will return an error automatically - no need to check here.
func generateFromKeywords(doc *Document) ([]*Document, error) {
    var newDocs []*Document
    
    // Check for @glossary marker
    if strings.Contains(doc.Content, "@glossary") {
        // Extract all glossary terms from this document
        terms := extractGlossaryTerms(doc.Content)
        
        if len(terms) > 0 {
            // Generate glossary document
            // This will go through ALL transforms: front matter, link rewriting, etc.
            glossaryDoc := &Document{
                Path:        filepath.Join(doc.Repository, "glossary.md"),
                IsIndex:     false,
                Generated:   true,  // Mark as generated
                Content:     renderGlossary(terms),
                FrontMatter: map[string]any{
                    "title":      "Glossary",
                    "type":       "docs",
                    "generated":  true,
                    "source_doc": doc.Path,
                },
                Repository:   doc.Repository,
                SourceCommit: doc.SourceCommit,
                SourceURL:    doc.SourceURL,
            }
            
            newDocs = append(newDocs, glossaryDoc)
        }
        
        // Remove @glossary marker from original content
        doc.Content = strings.ReplaceAll(doc.Content, "@glossary", "")
    }
    
    // Check for other keywords...
    // if strings.Contains(doc.Content, "@api-reference") { ... }
    
    return newDocs, nil
}
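
extractGlossaryTerms and renderGlossary are also elided; one plausible sketch, assuming (purely for illustration) that terms are written as "**Term**: definition" lines:

// Assumed convention for illustration: terms appear as "**Term**: definition" lines.
var termPattern = regexp.MustCompile(`(?m)^\*\*(.+?)\*\*: (.+)$`)

func extractGlossaryTerms(content string) map[string]string {
    terms := make(map[string]string)
    for _, m := range termPattern.FindAllStringSubmatch(content, -1) {
        terms[m[1]] = m[2]
    }
    return terms
}

func renderGlossary(terms map[string]string) string {
    names := make([]string, 0, len(terms))
    for name := range terms {
        names = append(names, name)
    }
    sort.Strings(names) // deterministic output regardless of map iteration order
    var b strings.Builder
    for _, name := range names {
        fmt.Fprintf(&b, "## %s\n\n%s\n\n", name, terms[name])
    }
    return b.String()
}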

Migration Path

Phase 1: Create New Pipeline (Parallel)

  1. Define Document, FileTransform, FileGenerator, GenerationContext types
  2. Create processContent() with generation + transform phases
  3. Convert existing index generation logic to generators
  4. Convert existing transforms to new interface (one by one)
  5. Add comprehensive tests for new pipeline

Phase 2: Switch Over

  1. Update copyContentFiles() to use new pipeline
  2. Run integration tests to verify behavior
  3. Fix any discrepancies

Phase 3: Cleanup

  1. Remove old Transform interface
  2. Remove TransformRegistry
  3. Remove topological sort logic
  4. Remove patch system (Patch, MergeMode, protected keys)
  5. Remove old transform files

Phase 4: Documentation

  1. Update copilot instructions
  2. Document transform pipeline in architecture docs
  3. Add examples for adding new transforms

Consequences

Positive

✅ Predictable: Execution order is explicit in code
✅ Debuggable: Set a breakpoint in the pipeline and step through transforms sequentially
✅ Testable: Individual transforms/generators or the full pipeline are easy to test (see the sketch after this list)
✅ Maintainable: No magic, no hidden dependencies, no indirection
✅ Fast: No registry lookups, no topological sorting, no patch merging
✅ Simple onboarding: New developers see the exact transform order immediately
✅ Reliable: A fixed pipeline means consistent, reproducible behavior
✅ Separation of concerns: Generation (creating files) is separate from transformation (modifying files)
✅ Dynamic generation: Transforms can create new files based on content analysis (keywords, patterns, etc.)
✅ Composable: New documents flow through remaining transforms automatically
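
As a concrete illustration of testability, a unit test for extractIndexTitle might look like this (a sketch; the actual tests live in pipeline_test.go and may differ):

func TestExtractIndexTitle(t *testing.T) {
    doc := &Document{
        Path:        "repo/_index.md",
        IsIndex:     true,
        Content:     "# Getting Started\n\nWelcome.",
        FrontMatter: map[string]any{},
    }

    newDocs, err := extractIndexTitle(doc)
    if err != nil {
        t.Fatalf("extractIndexTitle: %v", err)
    }
    if len(newDocs) != 0 {
        t.Fatalf("expected no new documents, got %d", len(newDocs))
    }
    if got := doc.FrontMatter["title"]; got != "Getting Started" {
        t.Errorf("title = %q, want %q", got, "Getting Started")
    }
}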

Negative

⚠️ Less flexible: Cannot dynamically add/remove transforms (but we don’t need this)
⚠️ Migration effort: Need to convert all existing transforms

Neutral

  • Pipeline is now explicitly ordered instead of dependency-ordered
  • Transforms mutate directly instead of returning patches
  • Code location becomes important (pipeline defined in generator.go)

Alternatives Considered

1. Keep Current System, Fix Bugs

Description: Continue using registry + patches, improve documentation

Rejected because:

  • Doesn’t address root cause (unnecessary complexity)
  • Bug was symptom of overly complex system
  • Future maintainers will face same issues

2. Plugin Architecture

Description: Make transforms truly pluggable with user configuration

Rejected because:

  • Massive scope increase
  • Users don’t need this flexibility
  • Introduces security/stability risks
  • Not aligned with project goals

3. Middleware Pattern

Description: Chain of responsibility with explicit next() calls

Rejected because:

  • More complex than simple function list
  • Doesn’t add value for our use case
  • Makes testing harder (mocking next())

Implementation Plan

Completed December 16, 2025

Phase 1: Core Pipeline (Completed)

  • Created internal/hugo/pipeline/ package
  • Implemented Document type with front matter and content fields
  • Built Processor with two-phase execution (generators → transforms)
  • Added queue-based processing for dynamic document generation

Phase 2: Transforms Migration (Completed)

  • Converted all 10 essential transforms to FileTransform functions
  • Implemented 3 generators for index file creation
  • Removed dependency on registry, patches, and Page abstraction
  • All transforms use direct mutation pattern

Phase 3: Integration (Completed)

  • Created copyContentFilesPipeline() integration function
  • Added environment variable feature flag (DOCBUILDER_NEW_PIPELINE=1)
  • Maintained backward compatibility with old system
  • Updated copilot instructions

Phase 4: Testing & Validation (Completed)

  • Unit tests for all generators and transforms
  • Edge case coverage (empty FM, no FM, malformed FM)
  • Integration via feature flag tested
  • All tests passing, linter clean

Remaining Work (separate from this ADR; per the Implementation Summary above, the old-system removal and default cutover have since been completed):

  • Remove old registry/patch system
  • Update golden test expectations (theme system issue)
  • Make new pipeline the default
  • Documentation updates

Actual effort: 1 day (vs estimated 3-5 days)

Implementation Details

File Structure

internal/hugo/pipeline/
├── document.go          # Document type, NewDocumentFromDocFile
├── processor.go         # Processor with ProcessContent
├── generators.go        # generateMainIndex, generateRepositoryIndex, generateSectionIndex
├── transforms.go        # All 10 transforms
└── pipeline_test.go     # Comprehensive unit tests

Key Design Decisions

  1. Direct Mutation: Documents are modified in-place, no patch merging
  2. Type Safety: Compile-time verification of transform signatures
  3. Queue-Based: Generators can add new documents during processing
  4. Stateless Transforms: Pure functions with no global state
  5. Feature Flag: Environment variable enables new pipeline without code changes
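
The flag in item 5 was a plain environment lookup along these lines (a sketch; the flag has since been removed now that the pipeline is the default and only system):

// useNewPipeline reports whether the migration-era feature flag is set.
func useNewPipeline() bool {
    return os.Getenv("DOCBUILDER_NEW_PIPELINE") == "1"
}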

Open Questions

All questions resolved during implementation:

  1. Error handling: ✅ Transforms return errors, pipeline fails fast
  2. Transform state: ✅ Pass context via RepositoryMetadata parameter
  3. Partial failures: ✅ Fail fast on first error (single-pass pipeline)
  4. Testing strategy: ✅ Both unit tests per transform and integration tests
  5. Front matter parsing: ✅ Handle edge cases (empty FM, no FM, malformed FM; see the sketch after this list)
  6. Generator ordering: ✅ All generators run before any transforms
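
For item 5, the edge-case handling amounts to treating a missing or empty front matter block as an empty map and surfacing YAML errors; a minimal sketch using gopkg.in/yaml.v3 (the parser actually used in document.go may differ):

// parseFrontMatter splits raw file content into front matter and body.
// Sketch only: handles the "no FM", "empty FM", and "malformed FM" cases.
func parseFrontMatter(raw string) (map[string]any, string, error) {
    const delim = "---\n"
    if !strings.HasPrefix(raw, delim) {
        return map[string]any{}, raw, nil // no front matter block at all
    }
    rest := raw[len(delim):]
    var fmText, body string
    if strings.HasPrefix(rest, delim) { // empty block: "---\n---\n..."
        body = rest[len(delim):]
    } else {
        i := strings.Index(rest, "\n"+delim)
        if i < 0 {
            return nil, "", fmt.Errorf("unterminated front matter block")
        }
        fmText, body = rest[:i], rest[i+len("\n"+delim):]
    }
    var fm map[string]any
    if err := yaml.Unmarshal([]byte(fmText), &fm); err != nil {
        return nil, "", fmt.Errorf("malformed front matter: %w", err)
    }
    if fm == nil {
        fm = map[string]any{} // empty YAML unmarshals to a nil map
    }
    return fm, body, nil
}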

References

  • Issue: “README H1 duplicate headers” (revealed patch system complexity)
  • ADR-002: In-Memory Content Pipeline (established single-pass architecture)
  • Copilot Instructions: Transform pipeline section (needs update)
  • Style Guide: Function naming conventions (already compatible)

Decision Makers

  • @inful (Lead Developer)

Notes

This refactor aligns with DocBuilder’s greenfield status and aggressive refactoring posture. We’re optimizing for clarity and maintainability over theoretical flexibility we don’t need.