Claude.Bricks

As-Built Document

AI-300 Operations Platform • 4 ML Scenarios • Bicep + Azure ML SDK v2

What is Claude.Bricks?

Claude.Bricks is two things at once. It is a system that generates modular LEGO buildings using Claude Code and PowerShell scripts that output LDraw (.ldr) files. It is also a working Azure ML platform that covers the Microsoft AI-300 (Operationalizing Machine Learning and Generative AI Solutions) exam objectives.

The project started as a DP-100 study lab and has been retooled for AI-300. The shift in emphasis is significant: AI-300 cares more about provisioning, versioning, promoting, evaluating, monitoring, and governing ML and GenAI systems than about building models from scratch.

The LEGO domain gives every scenario real data to work with. Images of rendered buildings feed the classifier. Structured .ldr files drive anomaly detection. Part-usage statistics power clustering. Natural language descriptions generate building specifications through a RAG pipeline. Four ML scenarios, all backed by Bicep provisioning, GitHub Actions CI/CD, and operational tooling (endpoints, evaluation, drift monitoring).

AI-300 Exam Domain Coverage

Domain 1: MLOps Infrastructure (15-20%)
  • AML workspace, datastores, compute targets provisioned via Bicep (16 modules)
  • Data assets, environments, and components registered via SDK scripts
  • Bicep + Azure CLI deployment automated through GitHub Actions
  • Shared AML registry for cross-workspace asset promotion
  • Git-managed source control for all infrastructure and ML code
Domain 2: ML Model Lifecycle (25-30%)
  • MLflow experiment tracking across 3 training scenarios
  • Model registration with versioning and metadata
  • Blue/green managed online endpoints with safe rollout and rollback
  • Drift detection with PSI/KS metrics and automatic retraining triggers
  • Training pipelines with metric-gated quality checks in CI/CD
Domain 3: GenAIOps Infrastructure (20-25%)
  • Foundry hub and project deployed via Bicep
  • Foundation model deployment through Foundry model catalog
  • Prompt versioning in Git with system prompts and RAG templates
  • AI Search integration for retrieval-augmented generation
  • Prompt variant comparison via experiment framework
Domain 4: GenAI Quality Assurance (10-15%)
  • 15-row evaluation dataset with LEGO domain test cases
  • Quality metrics: groundedness, relevance, coherence, fluency
  • Custom buildability evaluator for domain-specific validation
  • Automated eval workflow blocks deployment on threshold failure
  • Safety evaluation for harmful content detection
Domain 5: GenAI Optimization (10-15%)
  • RAG tuning matrix (chunk size, overlap, search mode, top-k)
  • Fine-tuning with synthetic data generation
  • Embedding model selection and comparison
  • Hybrid search combining vector, keyword, and semantic retrieval
  • A/B testing framework with relevance metrics

End-to-End Pipeline

Provision (Bicep + CLI) → Register (Assets + Components) → Train / Evaluate (MLflow + Quality Gates) → Deploy (Blue/Green + Smoke Tests) → Monitor (Drift + Observability)

What This Document Covers

Table of Contents

1. About Claude.Bricks

2. Architecture Overview

3. Infrastructure as Code

4. CI/CD with GitHub Actions

5. Asset Registration

6. Scenario 1: Facade Classification

7. Scenario 2: Structural Validation

8. Scenario 3: Pattern Extraction

9. Scenario 4: GenAI Spec Generator

10. Endpoint Operations

11. Evaluation and Quality Gates

12. Prompt Engineering

13. Observability and Monitoring

14. Data Flow Overview

15. Cost Breakdown

16. Teardown

17. How Models Work Together

2. Architecture Overview

Claude.Bricks: Complete Resource Map
CORE PLATFORM
  • ML Workspace: mlw-claudebricks-dev-v2
  • Storage: stclaudebricksdev
  • Key Vault: kv-claudebricks-dev
  • ACR: acrclaudebricksdev
  • App Insights: appi-claudebricks-dev
  • Log Analytics: law-claudebricks-dev
COMPUTE
  • Compute Instance: DS2_v2, dev only
  • CPU Cluster: 0-2 nodes, DS3_v2
  • GPU Cluster: 0-1 node, NC4as_T4_v3
AI SERVICES (CONDITIONAL)
  • Azure OpenAI: GPT-4o + embeddings
  • AI Search: Basic, hybrid search
  • Custom Vision: optional
AI FOUNDRY
  • Foundry Hub: hub-claudebricks-dev
  • Foundry Project: proj-claudebricks-dev
STORAGE CONTAINERS
  • facade-images • ldr-files • reference-models • training-data • inference-logs
SHARED (CROSS-ENVIRONMENT)
  • AML Registry: reg-claudebricks — shares models, environments, and components between dev and test workspaces.
Dev environment shown above. Test environment uses a separate resource group (rg-claudebricks-test) with the same resource layout. The shared AML registry spans both.

3. Infrastructure as Code

All Azure resources are defined declaratively in code, reviewed in pull requests, and deployed through automation. The project maintains two IaC implementations: Bicep (primary) and Terraform (alternate).

Bicep (Primary Path)

The Bicep implementation lives in deploymentcode/bicep/ and follows a module-per-resource pattern. Each Azure resource type gets its own .bicep file under modules/, and a top-level main.bicep orchestrates them with conditional deployment flags.

Bicep Module Inventory

#  | Module                 | Resource                | Purpose
1  | storage-account.bicep  | Storage Account         | Blob containers for training data, models, images
2  | key-vault.bicep        | Key Vault               | Secrets, connection strings, API keys
3  | acr.bicep              | Container Registry      | Custom environment images
4  | log-analytics.bicep    | Log Analytics Workspace | Central logging and diagnostics
5  | app-insights.bicep     | Application Insights    | Endpoint telemetry, latency tracking
6  | ml-workspace.bicep     | AML Workspace           | Core ML workspace (depends on 1-5)
7  | compute-cluster.bicep  | Compute Cluster         | CPU/GPU training clusters
8  | compute-instance.bicep | Compute Instance        | Dev VM for notebook work
9  | openai.bicep           | Azure OpenAI            | GPT-4o-mini deployment (conditional)
10 | ai-search.bicep        | AI Search               | Vector index for RAG (conditional)
11 | custom-vision.bicep    | Custom Vision           | Training + prediction (conditional)
12 | foundry-hub.bicep      | AI Foundry Hub          | Foundry management layer (conditional)
13 | foundry-project.bicep  | AI Foundry Project      | Scenario 4 Foundry workspace (conditional)
14 | aml-registry.bicep     | AML Registry            | Cross-workspace model promotion
15 | rbac.bicep             | Role Assignments        | Service principal and managed identity roles
16 | diagnostics.bicep      | Diagnostic Settings     | Route resource logs to Log Analytics

Deployment Command Flow

Wrapper scripts in scripts/deploy/ handle login, subscription selection, resource group creation, and deployment in one command. The flow follows a validate-preview-deploy pattern.

Bicep Deployment Pipeline
  1. az bicep build — lint and compile
  2. az deployment group validate — schema + dependency check
  3. az deployment group what-if — preview changes
  4. az deployment group create — apply to resource group
Each step must pass before the next runs. The what-if preview shows exactly what will be created, modified, or deleted before any changes are applied.

Scripts: scripts/deploy/deploy-infra.sh (bash), scripts/deploy/deploy-infra.ps1 (PowerShell), scripts/deploy/whatif.sh (preview only), scripts/deploy/teardown.sh (cleanup).

Module Dependency Chain

Bicep Deployment Order
LAYER 1: FOUNDATION (PARALLEL)
  • Storage — blob containers
  • Key Vault — secrets store
  • ACR — container images
  • Log Analytics — central logging
App Insights — depends on Log Analytics
AML Workspace — depends on Storage, Key Vault, ACR, App Insights
Compute — CPU Cluster (training), GPU Cluster (optional), Compute Instance (dev VM)
LAYER 5: OPTIONAL (CONDITIONAL FLAGS)
  • OpenAI • AI Search • Custom Vision • Foundry Hub
AML Registry (cross-workspace) • RBAC (role assignments) • Diagnostics (log routing)
Bicep resolves dependencies automatically via resource references. Resources in the same layer deploy in parallel. Conditional flags control optional services.

Terraform (Alternate Path)

A complete Terraform implementation exists in deploymentcode/terraform/ and an earlier version in tf/. It provisions the same resources as Bicep using HashiCorp Configuration Language. Both paths produce equivalent Azure environments.

Terraform remains in the repository for reference and for teams that prefer the HashiCorp ecosystem. The Bicep path is documented as primary because it integrates natively with Azure CLI and requires no external tooling beyond az.

Environment Strategy

Two resource groups separate dev and test workloads. Parameter files in deploymentcode/bicep/parameters/ control what differs between environments.

Parameter             | Dev (rg-claudebricks-dev)   | Test (rg-claudebricks-test)
Compute cluster SKU   | Standard_DS3_v2 (low cost)  | Standard_DS3_v2
Compute min nodes     | 0                           | 0
Compute max nodes     | 2                           | 4
OpenAI deployment     | Enabled                     | Enabled
AI Search             | Enabled                     | Enabled
Custom Vision         | Enabled                     | Disabled (cost savings)
Foundry Hub           | Enabled                     | Enabled
Storage redundancy    | LRS                         | LRS
Key Vault soft delete | 7 days                      | 90 days
Diagnostic logging    | Minimal                     | Full

The CI/CD pipeline deploys to dev first, waits for manual approval, then deploys to test with the production-like parameter set.

4. CI/CD with GitHub Actions

Five GitHub Actions workflows automate the full lifecycle: infrastructure provisioning, model training, model deployment, GenAI evaluation, and drift detection. All workflows authenticate via OIDC federated credentials (no stored secrets).

infra.yml
Infrastructure Deployment

Lint, validate, and deploy Bicep templates to dev and test environments with a manual approval gate between them.

Trigger: Push to deploymentcode/bicep/**
Auth: OIDC federated credential
Steps:
  1. Bicep lint (az bicep build)
  2. Validate (az deployment group validate)
  3. What-if preview
  4. Deploy to dev
  5. Manual approval gate
  6. Deploy to test
train.yml
ML Training Pipeline

Submit an AML training job, wait for completion, pull metrics, and register the model if quality thresholds are met.

Trigger: Push to scenario scripts or manual dispatch
Steps:
  1. Submit AML training job via SDK
  2. Wait for job completion
  3. Pull MLflow metrics
  4. Gate on threshold (accuracy >= 0.85)
  5. Register model if passing
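The gating step in train.yml can be expressed as a small predicate over the MLflow metrics pulled in step 3. A minimal sketch — function and variable names here are illustrative, not the workflow's actual code:

```python
def passes_quality_gate(metrics: dict, thresholds: dict) -> bool:
    """True only when every gated metric meets its minimum threshold.
    A metric missing from the run counts as a failure."""
    return all(metrics.get(name, 0.0) >= floor
               for name, floor in thresholds.items())

# Gate matching the workflow's accuracy >= 0.85 check; any additional
# gated metrics would simply be more entries in this dict.
GATES = {"accuracy": 0.85}
```

The workflow registers the model only when this predicate returns True, so a run that silently drops a metric fails closed rather than open.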
deploy-model.yml
Model Deployment (Blue/Green)

Deploy a registered model as a canary, run smoke tests, shift traffic progressively, and promote or rollback based on health checks.

Trigger: Manual dispatch only
Steps:
  1. Deploy as canary (0% traffic)
  2. Smoke test the new deployment
  3. Shift 10% traffic
  4. Health check window (5 min)
  5. Promote to 100% or rollback
eval-rag.yml
GenAI Evaluation

Run evaluation datasets against the RAG pipeline, score quality metrics, and block deployment if thresholds are not met.

Trigger: Push to prompts/** or eval/**
Metrics: groundedness, relevance, coherence, fluency
Steps:
  1. Load eval dataset from eval/datasets/
  2. Run against deployed model
  3. Score quality metrics
  4. Compare against thresholds
  5. Block if any metric fails
drift-check.yml
Drift Detection

Run a weekly scheduled check on input data distributions. If drift exceeds the PSI threshold, trigger the training pipeline automatically.

Trigger: Weekly cron schedule
Metric: Population Stability Index (PSI)
Steps:
  1. Submit drift detection AML job
  2. Check PSI against threshold
  3. Trigger retrain via repository dispatch if needed
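The threshold check in step 2 maps a PSI value onto the zones described in the observability section (below 0.2 normal, 0.2-0.4 warning, above 0.4 critical). A minimal sketch of that decision — the function name is an assumption, not the workflow's actual code:

```python
def drift_action(psi: float) -> str:
    """Map a PSI value to an action using the 0.2 / 0.4 zone
    boundaries: normal, warning, or retrain (critical)."""
    if psi < 0.2:
        return "normal"
    if psi <= 0.4:
        return "warning"
    return "retrain"
```

Only the "retrain" outcome fires the repository dispatch that kicks off train.yml.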

5. Asset Registration and Management

After Bicep provisions the infrastructure, Python scripts register ML-specific assets into the workspace. This is a four-layer approach that separates infrastructure provisioning from ML asset management.

Asset Registration Layers
Layer 1
Workspace Provisioning
Bicep deploys AML workspace,
storage, compute, KV
Layer 2
Asset Registration
Data assets, datastores,
environments via SDK
Layer 3
Component Registration
Reusable pipeline steps
as versioned components
Layer 4
Registry Publication
Promote assets to shared
AML registry
Each layer depends on the previous. Bicep handles Layer 1 (Azure resources). Python SDK scripts handle Layers 2 through 4 (ML-specific assets).

Registration Scripts

Script                  | Layer | What It Registers
register_data_assets.py | 2     | 4 data assets: facade-images:1 (uri_folder), ldr-files:1 (uri_folder), ldr-validation-features:1 (uri_file), reference-models:1 (uri_folder)
register_envs.py        | 2     | 2 environments: claudebricks-sklearn:1 (scikit-learn, Pillow, MLflow), claudebricks-clustering:1 (scikit-learn, UMAP, HDBSCAN, MLflow)
register_components.py  | 3     | 3 components: extract-features, run-clustering, evaluate-model (reusable pipeline steps with defined inputs/outputs)
publish_to_registry.py  | 4     | Copies selected assets (models, environments, components) from the workspace to the shared AML registry for cross-workspace use

Shared AML Registry

The AML registry (claudebricks-registry) provides a central catalog that both dev and test workspaces can pull from. Models registered in dev can be promoted to the registry, then consumed by test without retraining. The same applies to environments and components.

This enables a clean promotion path: train in dev, validate in dev, publish to registry, consume in test, deploy to production endpoints.

6. Scenario 1: Facade Style Classification


Image Classification Pipeline

Train a Random Forest classifier on facade images to categorize architectural styles (historic, modern, industrial, commercial, residential). Deployed as a managed online endpoint with blue/green deployment capability.

Scenario 1: End-to-End ML Pipeline
PHASE 1: DATA PREPARATION
  • Blob Storage (facade-images/) → prepare_data.py (upload + version) → data asset facade-images:1
PHASE 2: TRAINING
  • train_job.py submits a command job to the CPU cluster
  • train.py trains a RandomForest on 64x64 RGB images with MLflow autologged metrics
  • MLflow experiment: scenario1-facade
  • register_model.py registers facade-classifier:1 (MLflow format)
PHASE 4: DEPLOY + INFERENCE
  • deploy_endpoint.py creates the managed endpoint facade-classifier-endpoint (blue/green, DS3_v2)
  • score.py: POST image → label
Images flow from Blob Storage through a versioned data asset to the training cluster. The trained model is registered in MLflow, then deployed to a managed online endpoint with blue/green traffic splitting for real-time inference.

Blue/Green Deployment

Scenario 1 is the primary demonstration of safe deployment. The endpoint supports two named deployments (blue and green) with traffic splitting controlled by percentage.

  1. deploy_canary.py — deploy v2 as "green" at 0% traffic
  2. smoke_test.py — latency, success rate, prediction confidence
  3. promote_deployment.py — shift 10% → 100% progressive rollout
If any step fails: rollback_deployment.py reverts to blue at 100%.

Drift monitoring runs weekly on this scenario via the drift-check.yml workflow. If the Population Stability Index (PSI) exceeds the threshold, a retraining job is triggered automatically.

Scripts

Script                 | Purpose
prepare_data.py        | Upload facade images to blob, create versioned data asset
train.py               | Training logic (runs on cluster): load images, train RandomForest, MLflow autolog
train_job.py           | Submit training as AML command job
register_model.py      | Register best model from experiment run
deploy_endpoint.py     | Create endpoint and initial "blue" deployment (100% traffic)
deploy_canary.py       | Deploy new model version as "green" (0% traffic)
promote_deployment.py  | Shift traffic percentage to new deployment
rollback_deployment.py | Revert all traffic to previous deployment
smoke_test.py          | Validate endpoint health (latency, accuracy, schema)
score.py               | Client-side scoring example

Azure Resources Consumed by Scenario 1

Resource                                    | Created By              | Purpose
Blob container facade-images                | Bicep                   | Training image storage
Datastore facade_images                     | register_data_assets.py | ML workspace pointer to blob
Data asset facade-images:1                  | prepare_data.py         | Versioned uri_folder reference
Environment claudebricks-sklearn:1          | register_envs.py        | Training runtime
CPU/GPU Cluster                             | Bicep                   | Training compute
MLflow experiment                           | Auto-created            | Metric and artifact tracking
Registered model facade-classifier:1        | register_model.py       | Best trained model
Managed endpoint facade-classifier-endpoint | deploy_endpoint.py      | Real-time inference (DS3_v2)

7. Scenario 2: Structural Validation (Anomaly Detection)


Anomaly Detection on LDraw Files

Parse .ldr files into numeric features, then use Isolation Forest or Random Forest to flag structurally unsafe LEGO builds. Deployed as a managed endpoint for real-time validation.

Scenario 2: Feature Engineering + Anomaly Detection Pipeline
  • .ldr files — raw LEGO models, pass/fail labeled
  • feature_engineering.py extracts four features: overhang ratio, collision count, height-to-base ratio, layer density
  • features.csv is registered as data asset ldr-validation-features:1 (uri_file)
  • train.py runs on the CPU cluster: IsolationForest (unsupervised) or RandomForest (supervised), with MLflow autolog
  • register_model.py registers ldr-validator:1
  • deploy_endpoint.py creates ldr-validator-endpoint
Scoring input / output:
  POST { "overhang_ratio": 0.3, "collision_count": 2, "height_to_base_ratio": 4.5, "layer_density": 0.8 }
  → { "prediction": "pass" }
Raw .ldr files are parsed into 4 numeric features. An anomaly detection model (Isolation Forest or Random Forest) classifies structures as pass/fail. Deployed as a managed endpoint consumed by the Claude.Bricks validator agent.

Operational Lifecycle

The same blue/green deployment pattern from Scenario 1 applies here. When the model is retrained on new labeled .ldr files, a canary deployment is created and validated before traffic is shifted.

Feature engineering is the critical first step. The 4 numeric features (overhang ratio, collision count, height-to-base ratio, layer density) are derived from parsing the LDraw line format. This parsing logic lives in feature_engineering.py and must handle the full range of LDraw part types.
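To make the parsing step concrete: LDraw type-1 lines place a part at an x/y/z position (with Y pointing down), and features like height-to-base ratio fall out of those positions. A simplified sketch — the real feature_engineering.py handles full part geometry and all four features, and these function names are assumptions:

```python
def parse_parts(ldr_text: str) -> list:
    """Extract (x, y, z, part_file) tuples from LDraw type-1 (part
    placement) lines; type-0 comment/meta lines are skipped."""
    parts = []
    for line in ldr_text.splitlines():
        fields = line.split()
        # Type-1 format: 1 <colour> x y z <9 rotation terms> part.dat
        if len(fields) >= 15 and fields[0] == "1":
            x, y, z = float(fields[2]), float(fields[3]), float(fields[4])
            parts.append((x, y, z, fields[14]))
    return parts

def height_to_base_ratio(parts: list) -> float:
    """One of the four features: vertical extent over the larger
    footprint side. LDraw's Y axis points down, so height is the Y range."""
    xs = [p[0] for p in parts]
    ys = [p[1] for p in parts]
    zs = [p[2] for p in parts]
    height = max(ys) - min(ys)
    base = max(max(xs) - min(xs), max(zs) - min(zs))
    return height / base if base else 0.0
```

Collision count and overhang ratio require comparing part bounding boxes against each other, which is where most of the parsing effort actually goes.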

Azure Resources Consumed by Scenario 2

Resource                               | Created By              | Purpose
Blob container ldr-files               | Bicep                   | Raw .ldr file storage
Datastore ldr_files                    | register_data_assets.py | ML workspace pointer to blob
Data asset ldr-validation-features:1   | feature_engineering.py  | Extracted numeric features (uri_file)
Environment claudebricks-sklearn:1     | register_envs.py        | Training runtime
CPU Cluster                            | Bicep                   | Training compute
MLflow experiment                      | Auto-created            | Metric and artifact tracking
Registered model ldr-validator:1       | register_model.py       | Anomaly detection model
Managed endpoint ldr-validator-endpoint | deploy_endpoint.py     | Real-time inference (DS3_v2)

8. Scenario 3: Pattern Extraction (Clustering)


Multi-Step Pipeline: Extract, Cluster, Analyze

Discover common building patterns across reference .ldr models using part-usage statistics and clustering (KMeans/HDBSCAN). This is a batch pipeline with no real-time endpoint. The pipeline is defined using the @dsl.pipeline decorator for component-based orchestration.

Scenario 3: Azure ML Pipeline (2-Step, Component-Based)
Pipeline: scenario3-pattern-extraction
  • Input: reference-models (uri_folder)
  • Step 1: extract_features (extract_stats.py) — part counts by category; dimensions, roof type, window ratio; outputs a stats CSV ("features")
  • Step 2: run_clustering (train.py --n-clusters 5) — standardize features, KMeans or HDBSCAN, MLflow logs cluster labels
  • Output: registered model clustering-model:1 (cluster labels + centroids)
  • Compute: cpu-cluster (Standard_DS3_v2) • Environment: claudebricks-clustering:1
Key difference: this is a PIPELINE job, not a single command job, and no endpoint is needed (batch analysis).
A two-step Azure ML Pipeline: Step 1 extracts part-usage features from .ldr files, Step 2 runs clustering. The @dsl.pipeline decorator chains steps, passing the features output from Step 1 into Step 2. Components are registered for reuse across pipeline runs.

Component-Based Pipeline

Unlike Scenarios 1 and 2 (which use single command jobs), Scenario 3 uses the @dsl.pipeline decorator to define a multi-step pipeline. Each step is a registered component with typed inputs and outputs. This means Step 1's output (the features CSV) is automatically passed to Step 2 as input.

The pipeline is submitted via pipeline_job.py, which defines the step graph and submits it to AML. Both steps run on the same compute cluster but could be configured to use different compute targets if needed.

Because this is a batch analysis (no real-time scoring), there is no managed endpoint. Results are stored as MLflow artifacts and the clustering model is registered for later use (for example, to classify new reference models into discovered pattern groups).

Azure Resources Consumed by Scenario 3

Resource                                   | Created By              | Purpose
Blob container reference-models            | Bicep                   | Reference .ldr file storage
Datastore reference_models                 | register_data_assets.py | ML workspace pointer to blob
Data asset reference-models:1              | extract_stats.py        | Input data (uri_folder)
Environment claudebricks-clustering:1      | register_envs.py        | Training runtime (sklearn, hdbscan)
CPU Cluster                                | Bicep                   | Pipeline compute
Pipeline job scenario3-pattern-extraction  | pipeline_job.py         | 2-step extract + cluster pipeline
MLflow experiment                          | Auto-created            | Metric and artifact tracking
Registered model clustering-model:1        | train.py                | Cluster labels + centroids

9. Scenario 4: GenAI Spec Generator

This scenario generates LEGO building specifications from natural language descriptions. It combines retrieval-augmented generation (RAG) with optional fine-tuning. There are two implementation paths: the original Azure OpenAI + Prompt Flow approach, and a Foundry-native path that aligns with AI-300 GenAIOps objectives.

Path A: Azure OpenAI + Prompt Flow

The original implementation. Uses Azure OpenAI directly with Prompt Flow for orchestration. Good for understanding the underlying mechanics.

Retrieve from AI Search index
Generate via Azure OpenAI (GPT-4o)
Optional: Fine-tune for domain-specific model
Scripts: prepare_training_data.py, upload_training_data.py, fine_tune_job.py, deploy_model.py, rag_index_setup.py, evaluate_flow.py
Prompt Flow: flow.dag.yaml, retrieve.py, generate.py

Path B: Foundry-Native AI-300

The AI-300 aligned approach. Uses Foundry hub/project for model deployment, evaluation, monitoring, and tracing.

Deploy foundation model via Foundry catalog
Evaluate with Foundry built-in metrics
Monitor with continuous tracing and alerts
Scripts: foundry/create_project.py, foundry/deploy_model.py, foundry/configure_index.py, foundry/run_evaluation.py, foundry/configure_monitoring.py, foundry/trace_analysis.py

Resources Consumed

Resource              | Purpose
Azure OpenAI          | GPT-4o for generation, text-embedding-3-small for vectorization
AI Search             | Vector + hybrid search index for RAG retrieval
Foundry Hub + Project | GenAI operations, evaluation, monitoring (Path B)
Storage Account       | Training data, inference logs
App Insights          | Telemetry, tracing, latency metrics

10. Endpoint Operations

Managed online endpoints serve trained models for real-time inference. Claude.Bricks uses blue/green deployment for safe rollout. The pattern applies to any scenario, but Scenario 1 (facade classification) is the primary demonstration.

Blue/Green Architecture

facade-classifier-endpoint
  • blue (v1) — 100% traffic
  • green (v2) — 0% traffic

Deployment Flow

  1. Deploy canary — deploy_canary.py
  2. Smoke test — smoke_test.py
  3. Shift 10% — promote_deployment.py
  4. Health check — monitor errors
  5. Promote to 100% or rollback

Smoke Test Criteria

Check                  | Threshold
Response latency (p95) | < 2 seconds
HTTP 200 success rate  | > 99%
Response schema        | Contains expected fields (prediction, confidence)
Prediction confidence  | Above scenario-specific threshold
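The smoke-test criteria collapse into a single predicate over one measurement window. A minimal sketch of the check smoke_test.py performs — signature and names here are illustrative:

```python
REQUIRED_FIELDS = {"prediction", "confidence"}

def smoke_test_passes(p95_latency_s: float, success_rate: float,
                      sample_response: dict, min_confidence: float) -> bool:
    """Apply the four smoke-test criteria: p95 latency under 2 s,
    success rate over 99%, expected response schema, and prediction
    confidence above the scenario-specific floor."""
    return (p95_latency_s < 2.0
            and success_rate > 0.99
            and REQUIRED_FIELDS <= sample_response.keys()
            and sample_response.get("confidence", 0.0) >= min_confidence)
```

Any single failing criterion fails the whole smoke test, which keeps the canary at 0% traffic.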

Rollback triggers: latency spike above 2x baseline, error rate exceeds 1%, accuracy regression on validation set, or manual operator decision. See runbooks/endpoint-deployment.md for the full operational playbook.

11. Evaluation and Quality Gates

Every change to prompts, evaluation datasets, or Scenario 4 code triggers an automated evaluation run through the eval-rag.yml GitHub Actions workflow. Deployment is blocked if quality metrics fall below defined thresholds.

Evaluation Dataset

15 test cases in eval/datasets/lego-spec-generator.jsonl.

  • Building types: bakery, hotel, townhouse, fire station, cinema, pub, library, canal house, and more
  • Styles: Victorian, Art Deco, Modern, Georgian, Parisian, Tudor, Japanese, Brutalist
  • Difficulty: 5 easy, 5 medium, 5 hard
  • Per row: prompt, expected traits, golden keywords, known bad patterns
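For illustration, a row with those four fields might look like the following. This is a hypothetical example in JSONL form; the field names are assumptions, not the actual schema of lego-spec-generator.jsonl:

```python
import json

# One hypothetical evaluation row; every key here is illustrative.
row = {
    "prompt": "Design a Victorian bakery with a corner entrance.",
    "expected_traits": ["bay windows", "ornate cornice"],
    "golden_keywords": ["Victorian", "bakery"],
    "known_bad_patterns": ["floating walls", "unsupported overhang"],
    "difficulty": "medium",
}
line = json.dumps(row)  # one line of the .jsonl file
```

Each line is an independent JSON object, so the dataset can be streamed row by row during evaluation.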

Custom Buildability Evaluator

Domain-specific validator (eval/evaluators/buildability.py) that ties evaluation back to the LEGO domain.

  • Output includes physical dimensions (width, depth, height)?
  • References plausible brick/part names?
  • Specifies a color palette?
  • Structural elements physically possible?
  • Output follows expected section format?

Score: 0.0 to 1.0 based on checks passed.
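The scoring shape is the fraction of checks passed. A simplified sketch of such an evaluator — it covers four of the five checks with naive regexes (the structural-possibility check needs real geometry), and none of this is the actual code in eval/evaluators/buildability.py:

```python
import re

def buildability_score(spec: str) -> float:
    """Score a generated spec 0.0-1.0 as the fraction of domain
    checks passed. Checks here are deliberately simplistic stand-ins."""
    checks = [
        # Physical dimensions present (e.g. "16 x 16 x 24")?
        bool(re.search(r"\d+\s*x\s*\d+\s*x\s*\d+", spec)),
        # References plausible brick/part vocabulary?
        bool(re.search(r"\b(brick|plate|tile|slope)\b", spec, re.I)),
        # Specifies a color palette?
        bool(re.search(r"\b(palette|colou?r)\b", spec, re.I)),
        # Follows the expected section format?
        "Dimensions" in spec or "## " in spec,
    ]
    return sum(checks) / len(checks)
```

Because the score is a plain fraction, it slots directly into the same threshold gate (0.70) as the Foundry metrics.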

Quality Metrics and Thresholds

Minimum scores required to pass the evaluation gate. Deployment is blocked if any metric falls below its threshold.

  • Fluency ≥ 0.85 (Foundry evaluator)
  • Coherence ≥ 0.80 (Foundry evaluator)
  • Relevance ≥ 0.75 (Foundry evaluator)
  • Groundedness ≥ 0.70 (Foundry evaluator)
  • Buildability ≥ 0.70 (custom evaluator)
  • Safety — 100% pass rate (safety evaluator)

Evaluation Scripts

Script                        | Purpose
eval/run_evaluation.py        | Main evaluation runner. Scores all test cases against the deployed model.
eval/run_prompt_experiment.py | A/B testing across prompt variant combinations (system x RAG matrix).
eval/run_rag_tuning.py        | RAG configuration optimization (chunk size, overlap, search mode, top-k).

12. Prompt Engineering

All prompts are versioned in Git under the prompts/ directory. Changes to prompts trigger the evaluation workflow automatically, so regressions are caught before deployment.

Prompt Directory Structure

prompts/
  system/v1.txt         # baseline system prompt
  system/v2.txt         # structured output format
  rag/v1.jinja2         # simple context template
  rag/v2.jinja2         # grounding rules + citations
  few-shot/v1.jsonl     # example set A
  few-shot/v2.jsonl     # example set B
  CHANGELOG.md          # version history with rationale

System Prompt v1

Free-form instructions. Asks the model to generate detailed LEGO building specifications covering dimensions, colors, architectural features, and construction notes. No enforced output structure.

System Prompt v2

Adds mandatory output sections: Dimensions, Color Palette, Architectural Features, Floor Plans, Construction Notes, Structural Validation. Includes explicit grounding instructions to base recommendations on retrieved reference materials.

RAG Template v1

Simple context concatenation. Appends retrieved reference materials before the user query. No explicit grounding rules.

RAG Template v2

Adds numbered references with relevance scores, five explicit grounding rules, inline citation requirements, and prompt version metadata for traceability.

Experiment Framework

eval/run_prompt_experiment.py creates a matrix of all system prompt x RAG template combinations. For each combination, it runs the full evaluation dataset and measures quality metrics, token usage, latency, and estimated cost. The output identifies the best-performing combination by weighted composite score.

Example: 2 system prompts x 2 RAG templates = 4 combinations, each evaluated against 15 test cases = 60 evaluation runs.
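The matrix expansion is a plain Cartesian product. A minimal sketch of how such a matrix can be enumerated — names are illustrative, not the internals of eval/run_prompt_experiment.py:

```python
from itertools import product

def experiment_matrix(system_prompts, rag_templates, eval_rows):
    """Enumerate every system-prompt x RAG-template combination and
    the individual evaluation runs that combination implies."""
    combos = list(product(system_prompts, rag_templates))
    runs = [(s, r, row) for (s, r) in combos for row in eval_rows]
    return combos, runs

combos, runs = experiment_matrix(
    ["system/v1.txt", "system/v2.txt"],
    ["rag/v1.jinja2", "rag/v2.jinja2"],
    range(15),  # the 15-row evaluation dataset
)
# 2 x 2 = 4 combinations; 4 x 15 = 60 evaluation runs
```

Adding a third system prompt or RAG template grows the run count multiplicatively, which is why the matrix is kept small and the composite score is used to prune losers early.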

13. Observability and Monitoring

The observability stack tracks both classical ML and GenAI workloads. Application Insights handles telemetry collection, Log Analytics stores structured queries, and custom scripts detect drift and trend failure modes.

Telemetry Client

deploymentcode/scripts/common/telemetry.py wraps the Application Insights SDK. Every GenAI request automatically captures prompt version, model version, and trace ID. The wrapper keeps instrumentation out of business logic.

Core Functions:

  • track_request logs inbound request with latency, status, custom dimensions
  • track_dependency logs outbound calls (LLM, AI Search, storage)
  • track_metric logs numeric values (tokens, cost, eval scores)
  • track_exception logs errors with full stack trace and context
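The wrapper's value is that every event carries the same custom dimensions without the call sites repeating them. A minimal stand-in for that pattern — the class and sink interface are assumptions for illustration, not the actual telemetry.py code:

```python
import time
from contextlib import contextmanager

class Telemetry:
    """Every tracked event is stamped with the same custom dimensions
    (e.g. prompt version, model version, trace id). The sink is any
    callable, so an App Insights client can be swapped in."""
    def __init__(self, sink, **dimensions):
        self.sink = sink
        self.dimensions = dimensions

    def track_metric(self, name, value):
        self.sink({"type": "metric", "name": name, "value": value,
                   **self.dimensions})

    @contextmanager
    def track_dependency(self, name):
        # Times the wrapped outbound call (LLM, AI Search, storage).
        start = time.perf_counter()
        try:
            yield
        finally:
            self.sink({"type": "dependency", "name": name,
                       "duration_ms": (time.perf_counter() - start) * 1000,
                       **self.dimensions})
```

Usage: construct one instance per request with that request's dimensions, then call track_* freely; instrumentation stays out of the business logic.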

What Gets Captured

Metric                    | Source            | Storage
Identity and versioning:
  Prompt version          | Application code  | Custom dimensions
  Model name / version    | Deployment config | Custom dimensions
  Trace ID                | Foundry tracing   | App Insights + Foundry
Latency:
  Retrieval latency (ms)  | AI Search call    | Dependency
  Generation latency (ms) | LLM call          | Dependency
  Total response latency  | End-to-end        | Request
Cost and tokens:
  Input token count       | LLM response      | Custom metric
  Output token count      | LLM response      | Custom metric
  Estimated cost ($)      | Token pricing     | Custom metric
Quality and errors:
  Evaluation score        | Eval pipeline     | Custom metric
  Error type / reason     | Exception handler | Exceptions

Drift Detection

For Scenario 1 (facade classification). Compares current inference distributions against a training-time baseline using statistical tests.

Training data (source distribution) → create_baseline.py (baseline profile) → weekly scheduled job → detect_drift.py (compare) → drift metrics (PSI, KS, JS) → alert_config.py (alert / retrain)

PSI Threshold Zones

  • Normal: PSI < 0.2
  • Warning: 0.2 ≤ PSI ≤ 0.4
  • Critical (retrain): PSI > 0.4

The critical threshold triggers automatic retraining via the drift-check.yml GitHub Actions workflow.

  • PSI (Population Stability Index) — overall distribution shift
  • KS (Kolmogorov-Smirnov) — maximum CDF divergence
  • JS (Jensen-Shannon) — symmetric divergence
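PSI itself is a short computation: bin both distributions the same way, then sum (actual − expected) × ln(actual / expected) over the bins. A minimal sketch, assuming predefined bin edges (detect_drift.py's actual binning strategy may differ):

```python
import math

def psi(expected, actual, breakpoints):
    """Population Stability Index over shared bin edges.
    expected/actual are raw samples; breakpoints are the bin edges."""
    def fractions(values):
        counts = [0] * (len(breakpoints) + 1)
        for v in values:
            counts[sum(v > b for b in breakpoints)] += 1
        # Small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-4) for c in counts]
    p = fractions(expected)
    q = fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(p, q))
```

Identical distributions score 0; a shift of 30 percentage points of mass between two bins already lands in the critical (> 0.4) zone.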

KQL Dashboard Queries

File                       | Purpose
avg-latency-by-prompt.kql  | Avg/p50/p95/p99 latency by prompt version
token-cost-by-day.kql      | Daily token consumption and estimated USD cost
groundedness-pass-rate.kql | Quality metric pass rates by day
error-rate-trend.kql       | Error rate by exception type
top-failure-modes.kql      | Top 10 failure categories by frequency

Failure Trending

monitoring/feedback/trend_failures.py reads evaluation results and categorizes failures across 12 categories: missing dimensions, invalid parts, structural issues, grounding failure, hallucinated references, incomplete specs, wrong color codes, orientation errors, scale mismatches, missing submodels, token limit overruns, and prompt ambiguity. Weekly aggregation feeds into prompt iteration priorities. The most frequent failure category in a given week becomes the top target for the next prompt revision cycle.

14. Data Flow Overview

Five distinct data paths move through the platform. Each path has different latency expectations and storage targets.

1. Training Data Flow
   Raw data (images, .ldr, stats) → Blob containers → versioned data assets → training jobs (compute cluster) → MLflow experiment tracking → model registry (versioned models)

2. Inference Data Flow
   Client request → managed endpoint → model prediction → response + log

3. Evaluation Data Flow
   Test dataset (eval/datasets/) → eval runner (run_evaluation.py) → quality metrics (6 scores) → results (eval/results/) → Actions gate (pass/fail)

4. Prompt Data Flow
   Prompt files (prompts/) → Git version control → eval runner → experiment results

5. Monitoring Data Flow
   Inference logs → drift job (PSI/KS/JS) → App Insights → alert → retrain

15. Cost Breakdown

Estimated monthly costs for a single dev environment with moderate usage. All prices are approximate and based on East US 2 region pricing as of early 2026.

Fixed Monthly (Dev)

  • AI Search (Basic): ~$75
  • Log Analytics (per-GB): ~$5
  • ACR (Basic): ~$5
  • Storage Account (LRS): ~$2
  • Key Vault (Standard): ~$0.50
  • Foundry Hub + Project, AML Workspace: $0 (included)
  Total fixed monthly: ~$87.50

Variable Costs

  • Compute Instance (DS2_v2), ~4 hr/day: ~$30-60
  • GPU Cluster (NC4as_T4_v3), scale-to-zero: ~$0-50
  • CPU Cluster (DS3_v2), scale-to-zero: ~$0-20
  • Azure OpenAI, moderate usage: ~$5-20
  Total variable: ~$35-150

Total Estimate

~$120-240/month for dev with moderate use

Note: A test environment roughly doubles the fixed costs. Deploy test only when validating promotion flows. Scale-to-zero compute keeps variable costs near zero when idle.

16. Teardown

Both IaC paths support a full teardown of all deployed resources.

Primary (Bicep)

./deploymentcode/bicep/scripts/teardown.sh dev

Alternate (Terraform)

cd deploymentcode/terraform && terraform destroy

What Happens on Teardown

  • Key Vault enters soft-deleted state (recoverable for 90 days by default). Purge manually if you need the name back immediately.
  • OpenAI deployments are deleted with the resource group. Model deployments do not persist independently.
  • AI Search indexes are not recoverable. Re-index from source data after redeployment.
  • Storage data is lost unless backed up externally. The teardown script does not create snapshots.
  • MLflow experiments and run history are stored in the workspace. Deleting the workspace deletes all experiment data.

17. How Models Work Together

The four ML scenarios connect to the LEGO design workflow through managed endpoints. Each endpoint serves a specific role in the feedback loop between design and validation.

Design Workflow

User prompt (natural language) → Claude Code (AI agents) → PowerShell generation script → .ldr file (LDraw output) → stud.io (3D render) → ML feedback (optional)

Integration Points

Endpoint                      | Scenario | What It Does
facade-classifier-endpoint    | 1        | Classifies building style from rendered image
structural-validator-endpoint | 2        | Flags anomalous structural patterns
pattern-extractor-endpoint    | 3        | Identifies building cluster from part usage
spec-generator-endpoint       | 4        | Generates building spec from natural language

Configuration

Endpoint URLs and keys are stored in .env for local development and Key Vault for deployed environments. For production, use managed identity to avoid storing keys entirely. The SDK authenticates via DefaultAzureCredential, which picks up the managed identity automatically.

Current Status

Capability                      | Status
Bicep infrastructure deployment | Operational
GitHub Actions CI/CD            | Operational
Scenario 1-3 training scripts   | Operational
Scenario 4 RAG pipeline         | Operational
Foundry-native GenAI path       | Operational
Managed online endpoints        | Operational
Evaluation quality gates        | Operational
Prompt versioning               | Operational
Drift detection                 | Operational
Observability dashboard         | Operational
LEGO design integration         | In Progress
Production data collection      | Planned