ADR-002: Fix Index Stage Pipeline Bypass
Status
Accepted - Implemented 2025-12-13
Context
Current Architecture (Mostly Correct)
DocBuilder already implements an in-memory content pipeline with proper separation of concerns:
- Discovery Stage (stageDiscoverDocs): Reads files from source repositories into memory once
- Transform Stage (copyContentFiles):
  - Loads content into memory (file.LoadContent())
  - Runs dependency-ordered transform pipeline (front matter, link rewriting, etc.)
  - Processes content entirely in memory via Page struct
  - Writes final transformed output to disk once
- Index Stage (stageIndexes): Generates repository/section index pages
- Hugo Render (stageRunHugo): Runs Hugo on the prepared content tree
The transform pipeline is already in-memory and works correctly with:
- Single source read during discovery
- In-memory transformation via Page struct
- Dependency-based transform ordering
- Single write after all transforms complete
- Front matter patching and merging
- Link rewriting through the pipeline
The Actual Problem: Index Stage Bypass
One specific issue exists in /internal/hugo/indexes.go, where README.md files are promoted to _index.md.
The useReadmeAsIndex function:
- Re-reads the source README.md file from disk (bypassing transformed content)
- Manually parses and manipulates front matter (duplicating transform logic)
- Overwrites the already-transformed file at the index location
Impact: When README.md is promoted to _index.md, transformations applied by the pipeline (especially link rewrites) are lost because the index stage writes the original untransformed content.
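The effect of the bypass can be illustrated with a small, self-contained sketch. It is not the code from indexes.go: rewriteLinks and buggyPromote are hypothetical stand-ins for the pipeline's link rewriting and for the index stage's re-read of the source file.

```go
package main

import (
	"fmt"
	"strings"
)

// rewriteLinks is a hypothetical stand-in for the pipeline's link-rewriting
// transform; the real transform handles far more cases.
func rewriteLinks(src string) string {
	return strings.ReplaceAll(src, "(./guide.md)", "(./guide/)")
}

// buggyPromote mimics the bypass: it ignores the transformed output and
// returns the original source as the content of _index.md.
func buggyPromote(source, transformed string) string {
	_ = transformed // the transformed content is never consulted
	return source
}

func main() {
	source := "See the [guide](./guide.md)."
	transformed := rewriteLinks(source)
	index := buggyPromote(source, transformed)

	fmt.Println("pipeline output:", transformed)
	fmt.Println("index content:  ", index) // still contains the unrewritten link
}
```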
Why This Happened
The index stage was written before the transform pipeline was fully established. It predates the current dependency-based transform system and operates on the assumption that it needs to read source files directly.
Decision
Fix the index stage to use already-transformed content instead of re-reading source files. This is a targeted fix that eliminates the pipeline bypass without requiring a full architectural refactor.
Core Insight
The existing architecture is already correct - we don’t need to refactor the pipeline. We only need to:
- Capture transformed content after the pipeline runs
- Make it available to the index stage
- Stop re-reading source files in index generation
Minimal Changes Required
Change 1: Add field to track transformed content
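A minimal sketch of the new field, assuming a trimmed-down DocFile. Only TransformedBytes and its []byte type come from this ADR; the Path field is a placeholder for whatever the real struct in internal/docs/discovery.go already carries.

```go
package main

import "fmt"

// DocFile is a reduced stand-in for the struct in internal/docs/discovery.go.
type DocFile struct {
	// Path is a hypothetical placeholder for the existing fields.
	Path string

	// TransformedBytes holds the file content after the transform pipeline
	// has run. It stays nil for assets and for files not yet transformed.
	TransformedBytes []byte
}

func main() {
	f := DocFile{Path: "README.md"}
	fmt.Println(f.TransformedBytes == nil) // true until the transform stage fills it in
}
```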
Change 2: Capture transformed content in copyContentFiles
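The capture step might look roughly like the sketch below. The types are simplified, runTransforms is a hypothetical placeholder for the dependency-ordered pipeline, and the sources map stands in for content already loaded during discovery; only the idea of storing p.Raw in TransformedBytes is taken from this ADR.

```go
package main

import (
	"fmt"
	"strings"
)

type DocFile struct {
	Path             string // hypothetical placeholder field
	TransformedBytes []byte
}

// Page is a reduced stand-in for the in-memory page type; Raw holds its content.
type Page struct {
	Raw []byte
}

// runTransforms is a hypothetical placeholder for the real transform pipeline.
func runTransforms(p *Page) {
	p.Raw = []byte(strings.ReplaceAll(string(p.Raw), ".md)", "/)"))
}

// copyContentFiles sketches the capture. It indexes into the slice so the
// assignment lands on the caller's element rather than on a loop copy.
func copyContentFiles(files []DocFile, sources map[string][]byte) {
	for i := range files {
		f := &files[i]
		if !strings.HasSuffix(f.Path, ".md") {
			continue // assets skip the pipeline and keep a nil TransformedBytes
		}
		p := &Page{Raw: sources[f.Path]}
		runTransforms(p)
		f.TransformedBytes = p.Raw // capture for later stages
		// The existing single write of p.Raw to the output tree would follow here.
	}
}

func main() {
	files := []DocFile{{Path: "README.md"}, {Path: "logo.png"}}
	sources := map[string][]byte{"README.md": []byte("[guide](guide.md)")}
	copyContentFiles(files, sources)
	fmt.Println(string(files[0].TransformedBytes))
	fmt.Println(files[1].TransformedBytes == nil)
}
```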
Change 3: Fix index generation to use transformed content
Architecture After Fix
Key principle: Transform pipeline remains authoritative. Index stage becomes a pure consumer of transformed content.
Consequences
Positive
- Pipeline Integrity: All content flows through transform pipeline with no bypasses
- Bug Fix: README.md → _index.md conversion preserves link rewrites and other transforms
- Eliminates Duplicate Logic: Front matter parsing happens only in transform pipeline
- Minimal Changes: roughly 15-20 lines of code vs. full refactor
- Low Risk: Doesn’t change core architecture, just fixes data flow
- Better Testability: Can verify transformed content is used consistently
- Future-Proof: Makes it easier to add new transforms knowing they’ll apply everywhere
Negative
- Minimal Memory Overhead: Adds TransformedBytes field to DocFile
  - Mitigation: Negligible impact (content already in memory during transform)
  - Only populated for markdown files, not assets
- Pass-by-value consideration: docFiles slice must be passed by reference or returned
  - Current code already uses []docs.DocFile slice which shares backing array
  - May need to ensure mutations are visible across function boundaries
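The pass-by-value concern comes down to how Go ranges over a slice of structs. The sketch below uses a reduced DocFile and two hypothetical helpers to show the difference: assigning through the range value changes a copy, while assigning through the index writes into the shared backing array.

```go
package main

import "fmt"

type DocFile struct {
	Path             string // hypothetical placeholder field
	TransformedBytes []byte
}

// setByValue ranges over copies of each element, so the assignment is lost.
func setByValue(files []DocFile) {
	for _, f := range files {
		f.TransformedBytes = []byte("transformed")
	}
}

// setByIndex writes through the shared backing array, so the caller sees it.
func setByIndex(files []DocFile) {
	for i := range files {
		files[i].TransformedBytes = []byte("transformed")
	}
}

func main() {
	files := []DocFile{{Path: "README.md"}}

	setByValue(files)
	fmt.Println(files[0].TransformedBytes == nil) // true

	setByIndex(files)
	fmt.Println(string(files[0].TransformedBytes)) // transformed
}
```

Appending to the slice inside a callee is a separate case: the caller only sees new elements if the slice is returned or passed by pointer.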
Trade-offs Avoided
By not doing a full refactor, we avoid:
- Rewriting working transform pipeline
- Changing stage interfaces
- Updating all transform implementations
- Extensive test updates
- Risk of introducing new bugs in working code
Implementation Plan
Phase 1: Foundation (Day 1-2)
Files Modified: 1 file, 2 lines
- Add TransformedBytes []byte field to DocFile struct in internal/docs/discovery.go
- Add godoc comment explaining field purpose
- Run tests to ensure no breakage from schema change
Acceptance: Field compiles, tests pass
Phase 2: Capture Transformed Content (Day 2-3)
Files Modified: 1 file, ~3 lines
- In internal/hugo/content_copy.go, after the transform pipeline completes, store the transformed content (p.Raw) in the file's TransformedBytes field
- Ensure this happens inside the loop that processes each file
- Add debug logging to verify field is populated
Acceptance: TransformedBytes populated for markdown files after pipeline
Testing:
- Add test to verify TransformedBytes matches p.Raw
- Verify assets skip this (only markdown files)
Phase 3: Fix Index Generation (Day 3-5)
Files Modified: 1 file, ~10-15 lines
3a. Modify useReadmeAsIndex function:
- Replace os.ReadFile(readmeSourcePath) with file.TransformedBytes
- Remove manual front matter parsing (already in transformed content)
- Add validation that TransformedBytes is populated
- Simplify logic: just copy transformed bytes to index location
3b. Update calling code:
- Ensure useReadmeAsIndex receives DocFile with TransformedBytes
- Pass full DocFile instead of just paths where needed
Acceptance: README.md promoted to _index.md preserves transforms
Testing:
- Test README.md with relative links becomes _index.md with rewritten links
- Test README.md with added front matter from pipeline is preserved
- Test multiple repositories with README files
Phase 4: Integration Testing (Day 5-7)
Files Modified: Test files only
- Run existing TestPipelineReadmeLinks (should now pass)
- Add test for front matter preservation in README → _index.md
- Test with Relearn theme configuration
- Test edge cases:
- README without front matter
- README in subdirectories
- Repositories without README files
- Run full integration test suite
Acceptance: All existing tests pass, README transforms preserved
Phase 5: Documentation & Cleanup (Day 7-8)
- Update CONTENT_TRANSFORMS.md to document that transforms apply to all files including index promotions
- Add comments in indexes.go explaining why we use TransformedBytes
- Update this ADR status to “Accepted”
- Add CHANGELOG entry for bug fix
Optional Cleanup (can be separate PR):
- Remove now-unused manual front matter parsing in index stage
- Consolidate duplicate path resolution logic
- Add metrics for transform pipeline coverage
Timeline Summary
- Total Effort: 1-2 weeks (with testing)
- Code Changes: ~20 lines across 3 files (discovery.go, content_copy.go, indexes.go)
- Test Changes: ~50-100 lines for comprehensive coverage
- Risk Level: Low (targeted fix, no architectural changes)
Rollback Plan
If issues discovered:
- Immediate: Revert useReadmeAsIndex to read from disk (restore 1 function)
- Short-term: Add feature flag to toggle between old/new behavior
- Long-term: Keep TransformedBytes field for future use, fix bugs incrementally
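The short-term feature flag could be as small as one branch around the content lookup. The sketch below is only an illustration: indexContent, the useTransformed flag, and the readSource callback (standing in for the legacy disk read) are all hypothetical names, not part of the current code.

```go
package main

import (
	"errors"
	"fmt"
)

type DocFile struct {
	Path             string // hypothetical placeholder field
	TransformedBytes []byte
}

// indexContent picks the bytes for _index.md. With the flag on it uses the
// transformed content; with the flag off it falls back to the legacy read.
func indexContent(file DocFile, useTransformed bool, readSource func(path string) ([]byte, error)) ([]byte, error) {
	if useTransformed {
		if len(file.TransformedBytes) == 0 {
			return nil, errors.New("transformed content missing for " + file.Path)
		}
		return file.TransformedBytes, nil
	}
	return readSource(file.Path)
}

func main() {
	legacy := func(path string) ([]byte, error) {
		return []byte("source of " + path), nil
	}
	f := DocFile{Path: "README.md", TransformedBytes: []byte("transformed")}

	for _, flag := range []bool{true, false} {
		b, err := indexContent(f, flag, legacy)
		fmt.Println(flag, string(b), err)
	}
}
```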
Success Criteria
- README.md files promoted to _index.md preserve all transforms
- Links in README → _index.md are correctly rewritten
- Front matter patches from pipeline are present in index files
- No regression in existing functionality
- All tests pass
- No performance degradation
References
- Transform pipeline implementation
- Index generation
- DocFile struct
- Transform pipeline design
- Page struct with in-memory processing
- BuildState architecture
Related Issues
- README.md link rewriting bypass when promoted to _index.md
- Front matter patches not applied to index files
- Duplicate front matter parsing logic in index stage
Notes
Discovery Process
This ADR was created after investigating why README.md files lost transform pipeline changes when promoted to _index.md. Initial analysis suggested the entire pipeline needed refactoring, but deeper investigation revealed:
- The transform pipeline already works correctly - it processes content in-memory with proper dependency ordering
- The bug is isolated - only the index generation stage bypasses the pipeline
- The fix is minimal - capture and reuse transformed content instead of re-reading sources
Key Learnings
- Don’t assume the architecture is broken - investigate thoroughly before proposing large refactors
- The codebase already implements best practices - in-memory processing, dependency resolution, staged execution
- Targeted fixes are often better - 20 lines beats rewriting thousands
Future Enhancements
This fix enables:
- Confidence that all transforms apply universally
- Easier debugging (single authoritative transformed content)
- Future optimization: avoid duplicate writes for README/index cases
Created: 2025-12-13
Updated: 2025-12-13 (revised after codebase analysis)
Author: Development Team
Decision: Proposed → Implementation Ready