Module Indexing Feature#
Overview#
The Module Indexing feature provides fast, local lookup of Terraform modules using document IDs from vector search results. It creates and maintains a module_index.json file that maps vector search IDs to their corresponding module summary JSON files and metadata.
Problem Solved#
When performing vector searches, you receive a document ID that uniquely identifies a module. Previously, you would need to either: 1. Use the vector database to retrieve the full module 2. Manually reconstruct the filename from repository/ref/path information 3. Search through all JSON files to find the matching module
The Module Indexing feature solves this by maintaining a simple, fast index that lets you: - Look up any module by its ID instantly - Query modules by provider, tags, or repository - Rebuild the index from existing JSON files - Get statistics about your indexed modules
How It Works#
Index Structure#
The module_index.json file maintains a mapping of document IDs to module metadata:
{
"index_version": "1.0",
"generated_at": "2025-10-28T12:00:00Z",
"total_modules": 150,
"modules": {
"a1b2c3d4e5f6...": {
"id": "a1b2c3d4e5f6...",
"repository": "https://github.com/terraform-aws-modules/terraform-aws-vpc",
"ref": "v5.0.0",
"path": ".",
"summary_file": "output/terraform-aws-vpc_v5.0.0_src.json",
"provider": "aws",
"providers": "aws,hashicorp",
"tags": ["aws", "vpc", "network"],
"last_indexed": "2025-10-28T12:00:00Z"
}
}
}
Document ID Generation#
Document IDs are generated using SHA256 hash of {repository}:{ref}:{path}:
- Deterministic: Same module always produces same ID
- Consistent: Matches the ID returned by vector search
- Safe: Valid for use as database/filesystem identifiers
Usage#
Automatic Index Generation#
When you run ingest, the index is automatically created and maintained:
terraform-ingest ingest config.yaml
# Automatically creates/updates module_index.json
CLI Commands#
Rebuild Index#
Rebuild the index from all JSON files in the output directory:
terraform-ingest index rebuild --output-dir ./output
View Statistics#
Get index statistics:
terraform-ingest index stats
Output:
📊 Module Index Statistics:
Total Modules: 150
Unique Providers: 5
Providers: aws, google, azurerm, consul, vault
Unique Tags: 23
Index File: ./output/module_index.json
Look Up Module by ID#
Retrieve module information using a vector search ID:
terraform-ingest index lookup a1b2c3d4e5f6...
Output:
📦 Module: a1b2c3d4e5f6...
Repository: https://github.com/terraform-aws-modules/terraform-aws-vpc
Ref: v5.0.0
Path: .
Provider: aws
Summary File: output/terraform-aws-vpc_v5.0.0_src.json
Tags: aws, vpc, network
Get as JSON:
terraform-ingest index lookup a1b2c3d4e5f6... --json
Search by Provider#
Find all modules for a specific provider:
terraform-ingest index by_provider aws
Or get as JSON:
terraform-ingest index by_provider aws --json
Search by Tag#
Find modules with a specific tag:
terraform-ingest index by_tag vpc
Programmatic Usage#
Python API#
from terraform_ingest.indexer import ModuleIndexer
# Initialize the indexer
indexer = ModuleIndexer("./output")
# Add a module to the index
from terraform_ingest.models import TerraformModuleSummary
doc_id = indexer.add_module(module_summary)
# Look up a module
module = indexer.get_module(doc_id)
# Get the path to the summary file
summary_path = indexer.get_module_summary_path(doc_id)
with open(summary_path) as f:
summary_data = json.load(f)
# Search by provider
aws_modules = indexer.search_by_provider("aws")
# Search by tag
vpc_modules = indexer.search_by_tag("vpc")
# Search by repository
my_modules = indexer.search_by_repository("my-org")
# Get statistics
stats = indexer.get_stats()
print(f"Total modules: {stats['total_modules']}")
print(f"Providers: {stats['providers']}")
# Rebuild from files
count = indexer.rebuild_from_files()
print(f"Indexed {count} modules")
# Save the index
indexer.save()
Integration with Vector Search#
After performing a vector search:
from terraform_ingest.embeddings import VectorDBManager
from terraform_ingest.indexer import ModuleIndexer
vector_db = VectorDBManager(config)
indexer = ModuleIndexer("./output")
# Search for modules
results = vector_db.search("AWS VPC module for networking")
# For each result, look up the full module details
for result in results:
doc_id = result["id"]
index_entry = indexer.get_module(doc_id)
summary_path = indexer.get_module_summary_path(doc_id)
print(f"Found: {index_entry['repository']} ({index_entry['ref']})")
print(f"Summary: {summary_path}")
MCP Tool Integration#
The module index is exposed via MCP (Model Context Protocol) tool for AI agents:
get_module_by_index_id(doc_id, output_dir="./output")
Retrieves full module information by its index document ID.
# Use the MCP tool programmatically
from terraform_ingest.mcp_service import get_module_by_index_id
# Get full module details from vector search result
vector_result = vector_db.search("VPC module")[0]
doc_id = vector_result["id"]
module_summary = get_module_by_index_id(doc_id)
print(f"Module: {module_summary['repository']}")
print(f"Variables: {[v['name'] for v in module_summary['variables']]}")
print(f"Outputs: {[o['name'] for o in module_summary['outputs']]}")
Example Workflow with Vector Search:
- AI agent performs vector search for VPC modules
- Gets results with document IDs
- Calls
get_module_by_index_id(doc_id)to retrieve full module details - Analyzes variables, outputs, and README to generate Terraform code
# Typical AI agent workflow
search_results = search_modules_vector("VPC with subnets and NAT gateway")
for result in search_results:
doc_id = result["id"]
module = get_module_by_index_id(doc_id)
# Now the agent has full module information
print(f"Repository: {module['repository']}")
print(f"Version: {module['ref']}")
print(f"Required Inputs: {[v['name'] for v in module['variables'] if v['required']]}")
print(f"Available Outputs: {[o['name'] for o in module['outputs']]}")
# Use this to generate Terraform code...
Module Index Structure#
ModuleIndexer Class#
Main class for managing the module index.
Methods:
add_module(summary: TerraformModuleSummary) -> str: Add/update module, returns IDget_module(doc_id: str) -> Optional[Dict]: Get module entry by IDremove_module(doc_id: str) -> bool: Remove module from indexget_module_summary_path(doc_id: str) -> Optional[Path]: Get path to summary JSONsearch_by_provider(provider: str) -> List[Dict]: Find modules by providersearch_by_tag(tag: str) -> List[Dict]: Find modules by tagsearch_by_repository(repository: str) -> List[Dict]: Find modules by repositorylist_all() -> List[Dict]: Get all module entriesrebuild_from_files() -> int: Rebuild from JSON filesclear() -> None: Clear all entriessave() -> None: Save index to fileget_stats() -> Dict: Get index statistics
Integration with TerraformIngest#
The indexer is automatically initialized and used by TerraformIngest:
from terraform_ingest.ingest import TerraformIngest
ingester = TerraformIngest.from_yaml("config.yaml")
summaries = ingester.ingest() # Automatically creates index
# Access the indexer
module = ingester.indexer.get_module(doc_id)
stats = ingester.indexer.get_stats()
Advanced Usage#
Custom Index Location#
Specify a custom index filename:
indexer = ModuleIndexer(
output_dir="./output",
index_filename="my_custom_index.json"
)
Batch Operations#
Rebuild and get statistics:
# Rebuild from files
count = indexer.rebuild_from_files()
# Get comprehensive stats
stats = indexer.get_stats()
# Export all entries
all_entries = indexer.list_all()
Filtering and Pagination#
Search with multiple filters:
# Get AWS modules
aws_modules = indexer.search_by_provider("aws")
# Get network-related AWS modules
network_modules = [
m for m in aws_modules
if "network" in m.get("tags", [])
]
# Get from specific repository
vpc_modules = [
m for m in aws_modules
if "vpc" in m["repository"].lower()
]
Performance Characteristics#
- Index Creation: O(n) where n is number of modules
- Lookup by ID: O(1) dictionary lookup
- Search by Provider: O(n) linear scan with string matching
- Memory Usage: ~1-2KB per module entry
- File Size: ~100-200KB for 1000 modules
For 100+ modules, consider: - Loading index once and reusing in memory - Implementing caching for frequent searches - Rebuilding index periodically as new modules are added
Troubleshooting#
Index File Not Found#
Ensure ingestion completed successfully:
terraform-ingest ingest config.yaml
ls -la ./output/module_index.json
Module Not Found in Index#
Rebuild the index:
terraform-ingest index rebuild
Stale Index#
The index is updated automatically during ingest. To force a rebuild:
rm ./output/module_index.json
terraform-ingest index rebuild
Vector Search ID Not in Index#
Ensure: 1. Vector search was performed after modules were indexed 2. You're using the correct output directory 3. The module still exists in the JSON files
Examples#
Example 1: Find Module from Vector Search#
# Perform vector search
results = vector_db.search("VPC with subnets", n_results=5)
# Look up first result
best_match = results[0]
module_info = indexer.get_module(best_match["id"])
print(f"Best match: {module_info['repository']}")
print(f"Version: {module_info['ref']}")
# Get the full summary
summary_path = indexer.get_module_summary_path(best_match["id"])
with open(summary_path) as f:
summary = json.load(f)
print(f"Inputs: {[v['name'] for v in summary['variables']]}")
Example 2: Organize Modules by Provider#
indexer = ModuleIndexer("./output")
# Group all modules by provider
by_provider = {}
for module in indexer.list_all():
provider = module["provider"]
if provider not in by_provider:
by_provider[provider] = []
by_provider[provider].append(module)
# Print summary
for provider, modules in sorted(by_provider.items()):
print(f"{provider}: {len(modules)} modules")
for m in modules:
print(f" • {m['repository']} ({m['ref']})")
Example 3: Export Index for Analysis#
import json
indexer = ModuleIndexer("./output")
# Export as CSV for spreadsheet analysis
modules = indexer.list_all()
with open("module_export.csv", "w") as f:
f.write("Repository,Ref,Path,Provider,Tags\n")
for m in modules:
f.write(f"{m['repository']},{m['ref']},{m['path']},{m['provider']},{' '.join(m['tags'])}\n")
See Also#
- Vector Database Guide - Learn about vector embeddings
- MCP Service - Expose modules to AI agents
- Architecture - System architecture overview