Claude.Bricks

As-Built Document

AI-300 Operations Platform • 4 ML Scenarios • Bicep + Azure ML SDK v2

What is Claude.Bricks?

Claude.Bricks is two things at once. It is a system that generates modular LEGO buildings using Claude Code and PowerShell scripts that output LDraw (.ldr) files. It is also a working Azure ML platform that covers the Microsoft AI-300 (Operationalizing Machine Learning and Generative AI Solutions) exam objectives.

The project started as a DP-100 study lab and has been retooled for AI-300. The shift in emphasis is significant: AI-300 cares more about provisioning, versioning, promoting, evaluating, monitoring, and governing ML and GenAI systems than about building models from scratch.

The LEGO domain gives every scenario real data to work with. Images of rendered buildings feed the classifier. Structured .ldr files drive anomaly detection. Part-usage statistics power clustering. Natural language descriptions generate building specifications through a RAG pipeline. Four ML scenarios, all backed by Bicep provisioning, GitHub Actions CI/CD, and operational tooling (endpoints, evaluation, drift monitoring).

AI-300 Exam Domain Coverage

Domain 1: MLOps Infrastructure (15-20%)
  • AML workspace, datastores, compute targets provisioned via Bicep (16 modules)
  • Data assets, environments, and components registered via SDK scripts
  • Bicep + Azure CLI deployment automated through GitHub Actions
  • Shared AML registry for cross-workspace asset promotion
  • Git-managed source control for all infrastructure and ML code
Domain 2: ML Model Lifecycle (25-30%)
  • MLflow experiment tracking across 3 training scenarios
  • Model registration with versioning and metadata
  • Blue/green managed online endpoints with safe rollout and rollback
  • Drift detection with PSI/KS metrics and automatic retraining triggers
  • Training pipelines with metric-gated quality checks in CI/CD
Domain 3: GenAIOps Infrastructure (20-25%)
  • Foundry hub and project deployed via Bicep
  • Foundation model deployment through Foundry model catalog
  • Prompt versioning in Git with system prompts and RAG templates
  • AI Search integration for retrieval-augmented generation
  • Prompt variant comparison via experiment framework
Domain 4: GenAI Quality Assurance (10-15%)
  • 15-row evaluation dataset with LEGO domain test cases
  • Quality metrics: groundedness, relevance, coherence, fluency
  • Custom buildability evaluator for domain-specific validation
  • Automated eval workflow blocks deployment on threshold failure
  • Safety evaluation for harmful content detection
Domain 5: GenAI Optimization (10-15%)
  • RAG tuning matrix (chunk size, overlap, search mode, top-k)
  • Fine-tuning with synthetic data generation
  • Embedding model selection and comparison
  • Hybrid search combining vector, keyword, and semantic retrieval
  • A/B testing framework with relevance metrics

End-to-End Pipeline

Provision (Bicep + CLI) → Register (Assets + Components) → Train / Evaluate (MLflow + Quality Gates) → Deploy (Blue/Green + Smoke Tests) → Monitor (Drift + Observability)

What This Document Covers

Table of Contents

1. About Claude.Bricks

2. Architecture Overview

3. Infrastructure as Code

4. CI/CD with GitHub Actions

5. Asset Registration

6. Scenario 1: Facade Classification

7. Scenario 2: Structural Validation

8. Scenario 3: Pattern Extraction

9. Scenario 4: GenAI Spec Generator

10. Endpoint Operations

11. Evaluation and Quality Gates

12. Prompt Engineering

13. Observability and Monitoring

14. Data Flow Overview

15. Cost Breakdown

16. Teardown

17. How Models Work Together

2. Architecture Overview

Claude.Bricks: Complete Resource Map
CORE PLATFORM
  • ML Workspace: mlw-claudebricks-dev-v2
  • Storage: stclaudebricksdev
  • Key Vault: kv-claudebricks-dev
  • ACR: acrclaudebricksdev
  • App Insights: appi-claudebricks-dev
  • Log Analytics: law-claudebricks-dev
COMPUTE
  • Compute Instance: DS2_v2, dev only
  • CPU Cluster: 0-2 nodes, DS3_v2
  • GPU Cluster: 0-1 node, NC4as_T4_v3
AI SERVICES (CONDITIONAL)
  • Azure OpenAI: GPT-4o + embeddings
  • AI Search: Basic, hybrid search
  • Custom Vision: optional
AI FOUNDRY
  • Foundry Hub: hub-claudebricks-dev
  • Foundry Project: proj-claudebricks-dev
STORAGE CONTAINERS
  • facade-images • ldr-files • reference-models • training-data • inference-logs
SHARED (CROSS-ENVIRONMENT)
  • AML Registry: reg-claudebricks — shares models, environments, and components between dev and test workspaces.
Dev environment shown above. Test environment uses a separate resource group (rg-claudebricks-test) with the same resource layout. The shared AML registry spans both.

3. Infrastructure as Code

All Azure resources are defined declaratively in code, reviewed in pull requests, and deployed through automation. The project maintains two IaC implementations: Bicep (primary) and Terraform (alternate).

Bicep (Primary Path)

The Bicep implementation lives in deploymentcode/bicep/ and follows a module-per-resource pattern. Each Azure resource type gets its own .bicep file under modules/, and a top-level main.bicep orchestrates them with conditional deployment flags.

Bicep Module Inventory

#  | Module                 | Resource                | Purpose
1  | storage-account.bicep  | Storage Account         | Blob containers for training data, models, images
2  | key-vault.bicep        | Key Vault               | Secrets, connection strings, API keys
3  | acr.bicep              | Container Registry      | Custom environment images
4  | log-analytics.bicep    | Log Analytics Workspace | Central logging and diagnostics
5  | app-insights.bicep     | Application Insights    | Endpoint telemetry, latency tracking
6  | ml-workspace.bicep     | AML Workspace           | Core ML workspace (depends on 1-5)
7  | compute-cluster.bicep  | Compute Cluster         | CPU/GPU training clusters
8  | compute-instance.bicep | Compute Instance        | Dev VM for notebook work
9  | openai.bicep           | Azure OpenAI            | GPT-4o-mini deployment (conditional)
10 | ai-search.bicep        | AI Search               | Vector index for RAG (conditional)
11 | custom-vision.bicep    | Custom Vision           | Training + prediction (conditional)
12 | foundry-hub.bicep      | AI Foundry Hub          | Foundry management layer (conditional)
13 | foundry-project.bicep  | AI Foundry Project      | Scenario 4 Foundry workspace (conditional)
14 | aml-registry.bicep     | AML Registry            | Cross-workspace model promotion
15 | rbac.bicep             | Role Assignments        | Service principal and managed identity roles
16 | diagnostics.bicep      | Diagnostic Settings     | Route resource logs to Log Analytics

Deployment Command Flow

Wrapper scripts in scripts/deploy/ handle login, subscription selection, resource group creation, and deployment in one command. The flow follows a validate-preview-deploy pattern.

Bicep Deployment Pipeline
  1. az bicep build — lint and compile
  2. az deployment group validate — schema + dependency check
  3. az deployment group what-if — preview changes
  4. az deployment group create — apply to resource group
Each step must pass before the next runs. The what-if preview shows exactly what will be created, modified, or deleted before any changes are applied.

Scripts: scripts/deploy/deploy-infra.sh (bash), scripts/deploy/deploy-infra.ps1 (PowerShell), scripts/deploy/whatif.sh (preview only), scripts/deploy/teardown.sh (cleanup).

Module Dependency Chain

Bicep Deployment Order
LAYER 1: FOUNDATION (PARALLEL)
  • Storage — blob containers
  • Key Vault — secrets store
  • ACR — container images
  • Log Analytics — central logging
App Insights — depends on Log Analytics
AML Workspace — depends on Storage, Key Vault, ACR, App Insights
Compute — CPU Cluster (training), GPU Cluster (optional), Compute Instance (dev VM)
LAYER 5: OPTIONAL (CONDITIONAL FLAGS)
  • OpenAI • AI Search • Custom Vision • Foundry Hub
AML Registry (cross-workspace) • RBAC (role assignments) • Diagnostics (log routing)
Bicep resolves dependencies automatically via resource references. Resources in the same layer deploy in parallel. Conditional flags control optional services.

Terraform (Alternate Path)

A complete Terraform implementation exists in deploymentcode/terraform/ and an earlier version in tf/. It provisions the same resources as Bicep using HashiCorp Configuration Language. Both paths produce equivalent Azure environments.

Terraform remains in the repository for reference and for teams that prefer the HashiCorp ecosystem. The Bicep path is documented as primary because it integrates natively with Azure CLI and requires no external tooling beyond az.

Environment Strategy

Two resource groups separate dev and test workloads. Parameter files in deploymentcode/bicep/parameters/ control what differs between environments.

Parameter             | Dev (rg-claudebricks-dev)   | Test (rg-claudebricks-test)
Compute cluster SKU   | Standard_DS3_v2 (low cost)  | Standard_DS3_v2
Compute min nodes     | 0                           | 0
Compute max nodes     | 2                           | 4
OpenAI deployment     | Enabled                     | Enabled
AI Search             | Enabled                     | Enabled
Custom Vision         | Enabled                     | Disabled (cost savings)
Foundry Hub           | Enabled                     | Enabled
Storage redundancy    | LRS                         | LRS
Key Vault soft delete | 7 days                      | 90 days
Diagnostic logging    | Minimal                     | Full

The CI/CD pipeline deploys to dev first, waits for manual approval, then deploys to test with the production-like parameter set.

4. CI/CD with GitHub Actions

Five GitHub Actions workflows automate the full lifecycle: infrastructure provisioning, model training, model deployment, GenAI evaluation, and drift detection. All workflows authenticate via OIDC federated credentials (no stored secrets).

infra.yml
Infrastructure Deployment

Lint, validate, and deploy Bicep templates to dev and test environments with a manual approval gate between them.

Trigger: Push to deploymentcode/bicep/**
Auth: OIDC federated credential
Steps:
  1. Bicep lint (az bicep build)
  2. Validate (az deployment group validate)
  3. What-if preview
  4. Deploy to dev
  5. Manual approval gate
  6. Deploy to test
train.yml
ML Training Pipeline

Submit an AML training job, wait for completion, pull metrics, and register the model if quality thresholds are met.

Trigger: Push to scenario scripts or manual dispatch
Steps:
  1. Submit AML training job via SDK
  2. Wait for job completion
  3. Pull MLflow metrics
  4. Gate on threshold (accuracy >= 0.85)
  5. Register model if passing
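The gating step in train.yml can be expressed as a small predicate over the MLflow metrics pulled in step 3. A minimal sketch — function and variable names here are illustrative, not the workflow's actual code:

```python
def passes_quality_gate(metrics: dict, thresholds: dict) -> bool:
    """True only when every gated metric meets its minimum threshold.
    A metric missing from the run counts as a failure."""
    return all(metrics.get(name, 0.0) >= floor
               for name, floor in thresholds.items())

# Gate matching the workflow's accuracy >= 0.85 check; any additional
# gated metrics would simply be more entries in this dict.
GATES = {"accuracy": 0.85}
```

The workflow registers the model only when this predicate returns True, so a run that silently drops a metric fails closed rather than open.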
deploy-model.yml
Model Deployment (Blue/Green)

Deploy a registered model as a canary, run smoke tests, shift traffic progressively, and promote or rollback based on health checks.

Trigger: Manual dispatch only
Steps:
  1. Deploy as canary (0% traffic)
  2. Smoke test the new deployment
  3. Shift 10% traffic
  4. Health check window (5 min)
  5. Promote to 100% or rollback
eval-rag.yml
GenAI Evaluation

Run evaluation datasets against the RAG pipeline, score quality metrics, and block deployment if thresholds are not met.

Trigger: Push to prompts/** or eval/**
Metrics: groundedness, relevance, coherence, fluency
Steps:
  1. Load eval dataset from eval/datasets/
  2. Run against deployed model
  3. Score quality metrics
  4. Compare against thresholds
  5. Block if any metric fails
drift-check.yml
Drift Detection

Run a weekly scheduled check on input data distributions. If drift exceeds the PSI threshold, trigger the training pipeline automatically.

Trigger: Weekly cron schedule
Metric: Population Stability Index (PSI)
Steps:
  1. Submit drift detection AML job
  2. Check PSI against threshold
  3. Trigger retrain via repository dispatch if needed
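The threshold check in step 2 maps a PSI value onto the zones described in the observability section (below 0.2 normal, 0.2-0.4 warning, above 0.4 critical). A minimal sketch of that decision — the function name is an assumption, not the workflow's actual code:

```python
def drift_action(psi: float) -> str:
    """Map a PSI value to an action using the 0.2 / 0.4 zone
    boundaries: normal, warning, or retrain (critical)."""
    if psi < 0.2:
        return "normal"
    if psi <= 0.4:
        return "warning"
    return "retrain"
```

Only the "retrain" outcome fires the repository dispatch that kicks off train.yml.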

5. Asset Registration and Management

After Bicep provisions the infrastructure, Python scripts register ML-specific assets into the workspace. This is a four-layer approach that separates infrastructure provisioning from ML asset management.

Asset Registration Layers
Layer 1
Workspace Provisioning
Bicep deploys AML workspace,
storage, compute, KV
Layer 2
Asset Registration
Data assets, datastores,
environments via SDK
Layer 3
Component Registration
Reusable pipeline steps
as versioned components
Layer 4
Registry Publication
Promote assets to shared
AML registry
Each layer depends on the previous. Bicep handles Layer 1 (Azure resources). Python SDK scripts handle Layers 2 through 4 (ML-specific assets).

Registration Scripts

Script                  | Layer | What It Registers
register_data_assets.py | 2     | 4 data assets: facade-images:1 (uri_folder), ldr-files:1 (uri_folder), ldr-validation-features:1 (uri_file), reference-models:1 (uri_folder)
register_envs.py        | 2     | 2 environments: claudebricks-sklearn:1 (scikit-learn, Pillow, MLflow), claudebricks-clustering:1 (scikit-learn, UMAP, HDBSCAN, MLflow)
register_components.py  | 3     | 3 components: extract-features, run-clustering, evaluate-model (reusable pipeline steps with defined inputs/outputs)
publish_to_registry.py  | 4     | Copies selected assets (models, environments, components) from the workspace to the shared AML registry for cross-workspace use

Shared AML Registry

The AML registry (claudebricks-registry) provides a central catalog that both dev and test workspaces can pull from. Models registered in dev can be promoted to the registry, then consumed by test without retraining. The same applies to environments and components.

This enables a clean promotion path: train in dev, validate in dev, publish to registry, consume in test, deploy to production endpoints.

6. Scenario 1: Facade Style Classification


Image Classification Pipeline

Train a Random Forest classifier on facade images to categorize architectural styles (historic, modern, industrial, commercial, residential). Deployed as a managed online endpoint with blue/green deployment capability.

Scenario 1: End-to-End ML Pipeline
PHASE 1: DATA PREPARATION
  • Blob Storage (facade-images/) → prepare_data.py (upload + version) → data asset facade-images:1
PHASE 2: TRAINING
  • train_job.py submits a command job to the CPU cluster
  • train.py trains a RandomForest on 64x64 RGB images with MLflow autologged metrics
  • MLflow experiment: scenario1-facade
  • register_model.py registers facade-classifier:1 (MLflow format)
PHASE 4: DEPLOY + INFERENCE
  • deploy_endpoint.py creates the managed endpoint facade-classifier-endpoint (blue/green, DS3_v2)
  • score.py: POST image → label
Images flow from Blob Storage through a versioned data asset to the training cluster. The trained model is registered in MLflow, then deployed to a managed online endpoint with blue/green traffic splitting for real-time inference.

Blue/Green Deployment

Scenario 1 is the primary demonstration of safe deployment. The endpoint supports two named deployments (blue and green) with traffic splitting controlled by percentage.

  1. deploy_canary.py — deploy v2 as "green" at 0% traffic
  2. smoke_test.py — latency, success rate, prediction confidence
  3. promote_deployment.py — shift 10% → 100% progressive rollout
If any step fails: rollback_deployment.py reverts to blue at 100%.

Drift monitoring runs weekly on this scenario via the drift-check.yml workflow. If the Population Stability Index (PSI) exceeds the threshold, a retraining job is triggered automatically.

Scripts

Script                 | Purpose
prepare_data.py        | Upload facade images to blob, create versioned data asset
train.py               | Training logic (runs on cluster): load images, train RandomForest, MLflow autolog
train_job.py           | Submit training as AML command job
register_model.py      | Register best model from experiment run
deploy_endpoint.py     | Create endpoint and initial "blue" deployment (100% traffic)
deploy_canary.py       | Deploy new model version as "green" (0% traffic)
promote_deployment.py  | Shift traffic percentage to new deployment
rollback_deployment.py | Revert all traffic to previous deployment
smoke_test.py          | Validate endpoint health (latency, accuracy, schema)
score.py               | Client-side scoring example

Azure Resources Consumed by Scenario 1

Resource                                    | Created By              | Purpose
Blob container facade-images                | Bicep                   | Training image storage
Datastore facade_images                     | register_data_assets.py | ML workspace pointer to blob
Data asset facade-images:1                  | prepare_data.py         | Versioned uri_folder reference
Environment claudebricks-sklearn:1          | register_envs.py        | Training runtime
CPU/GPU Cluster                             | Bicep                   | Training compute
MLflow experiment                           | Auto-created            | Metric and artifact tracking
Registered model facade-classifier:1        | register_model.py       | Best trained model
Managed endpoint facade-classifier-endpoint | deploy_endpoint.py      | Real-time inference (DS3_v2)

7. Scenario 2: Structural Validation (Anomaly Detection)


Anomaly Detection on LDraw Files

Parse .ldr files into numeric features, then use Isolation Forest or Random Forest to flag structurally unsafe LEGO builds. Deployed as a managed endpoint for real-time validation.

Scenario 2: Feature Engineering + Anomaly Detection Pipeline
  • .ldr files — raw LEGO models, pass/fail labeled
  • feature_engineering.py extracts four features: overhang ratio, collision count, height-to-base ratio, layer density
  • features.csv is registered as data asset ldr-validation-features:1 (uri_file)
  • train.py runs on the CPU cluster: IsolationForest (unsupervised) or RandomForest (supervised), with MLflow autolog
  • register_model.py registers ldr-validator:1
  • deploy_endpoint.py creates ldr-validator-endpoint
Scoring input / output:
  POST { "overhang_ratio": 0.3, "collision_count": 2, "height_to_base_ratio": 4.5, "layer_density": 0.8 }
  → { "prediction": "pass" }
Raw .ldr files are parsed into 4 numeric features. An anomaly detection model (Isolation Forest or Random Forest) classifies structures as pass/fail. Deployed as a managed endpoint consumed by the Claude.Bricks validator agent.

Operational Lifecycle

The same blue/green deployment pattern from Scenario 1 applies here. When the model is retrained on new labeled .ldr files, a canary deployment is created and validated before traffic is shifted.

Feature engineering is the critical first step. The 4 numeric features (overhang ratio, collision count, height-to-base ratio, layer density) are derived from parsing the LDraw line format. This parsing logic lives in feature_engineering.py and must handle the full range of LDraw part types.
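To make the parsing step concrete: LDraw type-1 lines place a part at an x/y/z position (with Y pointing down), and features like height-to-base ratio fall out of those positions. A simplified sketch — the real feature_engineering.py handles full part geometry and all four features, and these function names are assumptions:

```python
def parse_parts(ldr_text: str) -> list:
    """Extract (x, y, z, part_file) tuples from LDraw type-1 (part
    placement) lines; type-0 comment/meta lines are skipped."""
    parts = []
    for line in ldr_text.splitlines():
        fields = line.split()
        # Type-1 format: 1 <colour> x y z <9 rotation terms> part.dat
        if len(fields) >= 15 and fields[0] == "1":
            x, y, z = float(fields[2]), float(fields[3]), float(fields[4])
            parts.append((x, y, z, fields[14]))
    return parts

def height_to_base_ratio(parts: list) -> float:
    """One of the four features: vertical extent over the larger
    footprint side. LDraw's Y axis points down, so height is the Y range."""
    xs = [p[0] for p in parts]
    ys = [p[1] for p in parts]
    zs = [p[2] for p in parts]
    height = max(ys) - min(ys)
    base = max(max(xs) - min(xs), max(zs) - min(zs))
    return height / base if base else 0.0
```

Collision count and overhang ratio require comparing part bounding boxes against each other, which is where most of the parsing effort actually goes.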

Azure Resources Consumed by Scenario 2

Resource                               | Created By              | Purpose
Blob container ldr-files               | Bicep                   | Raw .ldr file storage
Datastore ldr_files                    | register_data_assets.py | ML workspace pointer to blob
Data asset ldr-validation-features:1   | feature_engineering.py  | Extracted numeric features (uri_file)
Environment claudebricks-sklearn:1     | register_envs.py        | Training runtime
CPU Cluster                            | Bicep                   | Training compute
MLflow experiment                      | Auto-created            | Metric and artifact tracking
Registered model ldr-validator:1       | register_model.py       | Anomaly detection model
Managed endpoint ldr-validator-endpoint | deploy_endpoint.py     | Real-time inference (DS3_v2)

8. Scenario 3: Pattern Extraction (Clustering)


Multi-Step Pipeline: Extract, Cluster, Analyze

Discover common building patterns across reference .ldr models using part-usage statistics and clustering (KMeans/HDBSCAN). This is a batch pipeline with no real-time endpoint. The pipeline is defined using the @dsl.pipeline decorator for component-based orchestration.

Scenario 3: Azure ML Pipeline (2-Step, Component-Based)
Pipeline: scenario3-pattern-extraction
  • Input: reference-models (uri_folder)
  • Step 1: extract_features (extract_stats.py) — part counts by category; dimensions, roof type, window ratio; outputs a stats CSV ("features")
  • Step 2: run_clustering (train.py --n-clusters 5) — standardize features, KMeans or HDBSCAN, MLflow logs cluster labels
  • Output: registered model clustering-model:1 (cluster labels + centroids)
  • Compute: cpu-cluster (Standard_DS3_v2) • Environment: claudebricks-clustering:1
Key difference: this is a PIPELINE job, not a single command job, and no endpoint is needed (batch analysis).
A two-step Azure ML Pipeline: Step 1 extracts part-usage features from .ldr files, Step 2 runs clustering. The @dsl.pipeline decorator chains steps, passing the features output from Step 1 into Step 2. Components are registered for reuse across pipeline runs.

Component-Based Pipeline

Unlike Scenarios 1 and 2 (which use single command jobs), Scenario 3 uses the @dsl.pipeline decorator to define a multi-step pipeline. Each step is a registered component with typed inputs and outputs. This means Step 1's output (the features CSV) is automatically passed to Step 2 as input.

The pipeline is submitted via pipeline_job.py, which defines the step graph and submits it to AML. Both steps run on the same compute cluster but could be configured to use different compute targets if needed.

Because this is a batch analysis (no real-time scoring), there is no managed endpoint. Results are stored as MLflow artifacts and the clustering model is registered for later use (for example, to classify new reference models into discovered pattern groups).

Azure Resources Consumed by Scenario 3

Resource                                   | Created By              | Purpose
Blob container reference-models            | Bicep                   | Reference .ldr file storage
Datastore reference_models                 | register_data_assets.py | ML workspace pointer to blob
Data asset reference-models:1              | extract_stats.py        | Input data (uri_folder)
Environment claudebricks-clustering:1      | register_envs.py        | Training runtime (sklearn, hdbscan)
CPU Cluster                                | Bicep                   | Pipeline compute
Pipeline job scenario3-pattern-extraction  | pipeline_job.py         | 2-step extract + cluster pipeline
MLflow experiment                          | Auto-created            | Metric and artifact tracking
Registered model clustering-model:1        | train.py                | Cluster labels + centroids

9. Scenario 4: GenAI Spec Generator

This scenario generates LEGO building specifications from natural language descriptions. It combines retrieval-augmented generation (RAG) with optional fine-tuning. There are two implementation paths: the original Azure OpenAI + Prompt Flow approach, and a Foundry-native path that aligns with AI-300 GenAIOps objectives.

Path A: Azure OpenAI + Prompt Flow

The original implementation. Uses Azure OpenAI directly with Prompt Flow for orchestration. Good for understanding the underlying mechanics.

Retrieve from AI Search index
Generate via Azure OpenAI (GPT-4o)
Optional: Fine-tune for domain-specific model
Scripts: prepare_training_data.py, upload_training_data.py, fine_tune_job.py, deploy_model.py, rag_index_setup.py, evaluate_flow.py
Prompt Flow: flow.dag.yaml, retrieve.py, generate.py

Path B: Foundry-Native AI-300

The AI-300 aligned approach. Uses Foundry hub/project for model deployment, evaluation, monitoring, and tracing.

Deploy foundation model via Foundry catalog
Evaluate with Foundry built-in metrics
Monitor with continuous tracing and alerts
Scripts: foundry/create_project.py, foundry/deploy_model.py, foundry/configure_index.py, foundry/run_evaluation.py, foundry/configure_monitoring.py, foundry/trace_analysis.py

Resources Consumed

Resource              | Purpose
Azure OpenAI          | GPT-4o for generation, text-embedding-3-small for vectorization
AI Search             | Vector + hybrid search index for RAG retrieval
Foundry Hub + Project | GenAI operations, evaluation, monitoring (Path B)
Storage Account       | Training data, inference logs
App Insights          | Telemetry, tracing, latency metrics

10. Endpoint Operations

Managed online endpoints serve trained models for real-time inference. Claude.Bricks uses blue/green deployment for safe rollout. The pattern applies to any scenario, but Scenario 1 (facade classification) is the primary demonstration.

Blue/Green Architecture

facade-classifier-endpoint
  • blue (v1) — 100% traffic
  • green (v2) — 0% traffic

Deployment Flow

  1. Deploy canary — deploy_canary.py
  2. Smoke test — smoke_test.py
  3. Shift 10% — promote_deployment.py
  4. Health check — monitor errors
  5. Promote to 100% or rollback

Smoke Test Criteria

Check                  | Threshold
Response latency (p95) | < 2 seconds
HTTP 200 success rate  | > 99%
Response schema        | Contains expected fields (prediction, confidence)
Prediction confidence  | Above scenario-specific threshold
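The smoke-test criteria collapse into a single predicate over one measurement window. A minimal sketch of the check smoke_test.py performs — signature and names here are illustrative:

```python
REQUIRED_FIELDS = {"prediction", "confidence"}

def smoke_test_passes(p95_latency_s: float, success_rate: float,
                      sample_response: dict, min_confidence: float) -> bool:
    """Apply the four smoke-test criteria: p95 latency under 2 s,
    success rate over 99%, expected response schema, and prediction
    confidence above the scenario-specific floor."""
    return (p95_latency_s < 2.0
            and success_rate > 0.99
            and REQUIRED_FIELDS <= sample_response.keys()
            and sample_response.get("confidence", 0.0) >= min_confidence)
```

Any single failing criterion fails the whole smoke test, which keeps the canary at 0% traffic.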

Rollback triggers: latency spike above 2x baseline, error rate exceeds 1%, accuracy regression on validation set, or manual operator decision. See runbooks/endpoint-deployment.md for the full operational playbook.

11. Evaluation and Quality Gates

Every change to prompts, evaluation datasets, or Scenario 4 code triggers an automated evaluation run through the eval-rag.yml GitHub Actions workflow. Deployment is blocked if quality metrics fall below defined thresholds.

Evaluation Dataset

15 test cases in eval/datasets/lego-spec-generator.jsonl.

  • Building types: bakery, hotel, townhouse, fire station, cinema, pub, library, canal house, and more
  • Styles: Victorian, Art Deco, Modern, Georgian, Parisian, Tudor, Japanese, Brutalist
  • Difficulty: 5 easy, 5 medium, 5 hard
  • Per row: prompt, expected traits, golden keywords, known bad patterns
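For illustration, a row with those four fields might look like the following. This is a hypothetical example in JSONL form; the field names are assumptions, not the actual schema of lego-spec-generator.jsonl:

```python
import json

# One hypothetical evaluation row; every key here is illustrative.
row = {
    "prompt": "Design a Victorian bakery with a corner entrance.",
    "expected_traits": ["bay windows", "ornate cornice"],
    "golden_keywords": ["Victorian", "bakery"],
    "known_bad_patterns": ["floating walls", "unsupported overhang"],
    "difficulty": "medium",
}
line = json.dumps(row)  # one line of the .jsonl file
```

Each line is an independent JSON object, so the dataset can be streamed row by row during evaluation.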

Custom Buildability Evaluator

Domain-specific validator (eval/evaluators/buildability.py) that ties evaluation back to the LEGO domain.

  • Output includes physical dimensions (width, depth, height)?
  • References plausible brick/part names?
  • Specifies a color palette?
  • Structural elements physically possible?
  • Output follows expected section format?

Score: 0.0 to 1.0 based on checks passed.
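The scoring shape is the fraction of checks passed. A simplified sketch of such an evaluator — it covers four of the five checks with naive regexes (the structural-possibility check needs real geometry), and none of this is the actual code in eval/evaluators/buildability.py:

```python
import re

def buildability_score(spec: str) -> float:
    """Score a generated spec 0.0-1.0 as the fraction of domain
    checks passed. Checks here are deliberately simplistic stand-ins."""
    checks = [
        # Physical dimensions present (e.g. "16 x 16 x 24")?
        bool(re.search(r"\d+\s*x\s*\d+\s*x\s*\d+", spec)),
        # References plausible brick/part vocabulary?
        bool(re.search(r"\b(brick|plate|tile|slope)\b", spec, re.I)),
        # Specifies a color palette?
        bool(re.search(r"\b(palette|colou?r)\b", spec, re.I)),
        # Follows the expected section format?
        "Dimensions" in spec or "## " in spec,
    ]
    return sum(checks) / len(checks)
```

Because the score is a plain fraction, it slots directly into the same threshold gate (0.70) as the Foundry metrics.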

Quality Metrics and Thresholds

Minimum scores required to pass the evaluation gate. Deployment is blocked if any metric falls below its threshold.

  • Fluency ≥ 0.85 (Foundry evaluator)
  • Coherence ≥ 0.80 (Foundry evaluator)
  • Relevance ≥ 0.75 (Foundry evaluator)
  • Groundedness ≥ 0.70 (Foundry evaluator)
  • Buildability ≥ 0.70 (custom evaluator)
  • Safety — 100% pass rate (safety evaluator)

Evaluation Scripts

Script                        | Purpose
eval/run_evaluation.py        | Main evaluation runner. Scores all test cases against the deployed model.
eval/run_prompt_experiment.py | A/B testing across prompt variant combinations (system x RAG matrix).
eval/run_rag_tuning.py        | RAG configuration optimization (chunk size, overlap, search mode, top-k).

12. Prompt Engineering

All prompts are versioned in Git under the prompts/ directory. Changes to prompts trigger the evaluation workflow automatically, so regressions are caught before deployment.

Prompt Directory Structure

prompts/
  system/v1.txt         # baseline system prompt
  system/v2.txt         # structured output format
  rag/v1.jinja2         # simple context template
  rag/v2.jinja2         # grounding rules + citations
  few-shot/v1.jsonl     # example set A
  few-shot/v2.jsonl     # example set B
  CHANGELOG.md          # version history with rationale

System Prompt v1

Free-form instructions. Asks the model to generate detailed LEGO building specifications covering dimensions, colors, architectural features, and construction notes. No enforced output structure.

System Prompt v2

Adds mandatory output sections: Dimensions, Color Palette, Architectural Features, Floor Plans, Construction Notes, Structural Validation. Includes explicit grounding instructions to base recommendations on retrieved reference materials.

RAG Template v1

Simple context concatenation. Appends retrieved reference materials before the user query. No explicit grounding rules.

RAG Template v2

Adds numbered references with relevance scores, five explicit grounding rules, inline citation requirements, and prompt version metadata for traceability.

Experiment Framework

eval/run_prompt_experiment.py creates a matrix of all system prompt x RAG template combinations. For each combination, it runs the full evaluation dataset and measures quality metrics, token usage, latency, and estimated cost. The output identifies the best-performing combination by weighted composite score.

Example: 2 system prompts x 2 RAG templates = 4 combinations, each evaluated against 15 test cases = 60 evaluation runs.
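The matrix expansion is a plain Cartesian product. A minimal sketch of how such a matrix can be enumerated — names are illustrative, not the internals of eval/run_prompt_experiment.py:

```python
from itertools import product

def experiment_matrix(system_prompts, rag_templates, eval_rows):
    """Enumerate every system-prompt x RAG-template combination and
    the individual evaluation runs that combination implies."""
    combos = list(product(system_prompts, rag_templates))
    runs = [(s, r, row) for (s, r) in combos for row in eval_rows]
    return combos, runs

combos, runs = experiment_matrix(
    ["system/v1.txt", "system/v2.txt"],
    ["rag/v1.jinja2", "rag/v2.jinja2"],
    range(15),  # the 15-row evaluation dataset
)
# 2 x 2 = 4 combinations; 4 x 15 = 60 evaluation runs
```

Adding a third system prompt or RAG template grows the run count multiplicatively, which is why the matrix is kept small and the composite score is used to prune losers early.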

13. Observability and Monitoring

The observability stack tracks both classical ML and GenAI workloads. Application Insights handles telemetry collection, Log Analytics stores structured queries, and custom scripts detect drift and trend failure modes.

Telemetry Client

deploymentcode/scripts/common/telemetry.py wraps the Application Insights SDK. Every GenAI request automatically captures prompt version, model version, and trace ID. The wrapper keeps instrumentation out of business logic.

Core Functions:

  • track_request logs inbound request with latency, status, custom dimensions
  • track_dependency logs outbound calls (LLM, AI Search, storage)
  • track_metric logs numeric values (tokens, cost, eval scores)
  • track_exception logs errors with full stack trace and context
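The wrapper's value is that every event carries the same custom dimensions without the call sites repeating them. A minimal stand-in for that pattern — the class and sink interface are assumptions for illustration, not the actual telemetry.py code:

```python
import time
from contextlib import contextmanager

class Telemetry:
    """Every tracked event is stamped with the same custom dimensions
    (e.g. prompt version, model version, trace id). The sink is any
    callable, so an App Insights client can be swapped in."""
    def __init__(self, sink, **dimensions):
        self.sink = sink
        self.dimensions = dimensions

    def track_metric(self, name, value):
        self.sink({"type": "metric", "name": name, "value": value,
                   **self.dimensions})

    @contextmanager
    def track_dependency(self, name):
        # Times the wrapped outbound call (LLM, AI Search, storage).
        start = time.perf_counter()
        try:
            yield
        finally:
            self.sink({"type": "dependency", "name": name,
                       "duration_ms": (time.perf_counter() - start) * 1000,
                       **self.dimensions})
```

Usage: construct one instance per request with that request's dimensions, then call track_* freely; instrumentation stays out of the business logic.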

What Gets Captured

Metric                    | Source            | Storage
Identity and versioning:
  Prompt version          | Application code  | Custom dimensions
  Model name / version    | Deployment config | Custom dimensions
  Trace ID                | Foundry tracing   | App Insights + Foundry
Latency:
  Retrieval latency (ms)  | AI Search call    | Dependency
  Generation latency (ms) | LLM call          | Dependency
  Total response latency  | End-to-end        | Request
Cost and tokens:
  Input token count       | LLM response      | Custom metric
  Output token count      | LLM response      | Custom metric
  Estimated cost ($)      | Token pricing     | Custom metric
Quality and errors:
  Evaluation score        | Eval pipeline     | Custom metric
  Error type / reason     | Exception handler | Exceptions

Drift Detection

For Scenario 1 (facade classification). Compares current inference distributions against a training-time baseline using statistical tests.

Training data (source distribution) → create_baseline.py (baseline profile) → weekly scheduled job → detect_drift.py (compare) → drift metrics (PSI, KS, JS) → alert_config.py (alert / retrain)

PSI Threshold Zones

  • Normal: PSI < 0.2
  • Warning: 0.2 ≤ PSI ≤ 0.4
  • Critical (retrain): PSI > 0.4

The critical threshold triggers automatic retraining via the drift-check.yml GitHub Actions workflow.

  • PSI (Population Stability Index) — overall distribution shift
  • KS (Kolmogorov-Smirnov) — maximum CDF divergence
  • JS (Jensen-Shannon) — symmetric divergence
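PSI itself is a short computation: bin both distributions the same way, then sum (actual − expected) × ln(actual / expected) over the bins. A minimal sketch, assuming predefined bin edges (detect_drift.py's actual binning strategy may differ):

```python
import math

def psi(expected, actual, breakpoints):
    """Population Stability Index over shared bin edges.
    expected/actual are raw samples; breakpoints are the bin edges."""
    def fractions(values):
        counts = [0] * (len(breakpoints) + 1)
        for v in values:
            counts[sum(v > b for b in breakpoints)] += 1
        # Small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-4) for c in counts]
    p = fractions(expected)
    q = fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(p, q))
```

Identical distributions score 0; a shift of 30 percentage points of mass between two bins already lands in the critical (> 0.4) zone.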

KQL Dashboard Queries

File                       | Purpose
avg-latency-by-prompt.kql  | Avg/p50/p95/p99 latency by prompt version
token-cost-by-day.kql      | Daily token consumption and estimated USD cost
groundedness-pass-rate.kql | Quality metric pass rates by day
error-rate-trend.kql       | Error rate by exception type
top-failure-modes.kql      | Top 10 failure categories by frequency

Failure Trending

monitoring/feedback/trend_failures.py reads evaluation results and categorizes failures across 12 categories: missing dimensions, invalid parts, structural issues, grounding failure, hallucinated references, incomplete specs, wrong color codes, orientation errors, scale mismatches, missing submodels, token limit overruns, and prompt ambiguity. Weekly aggregation feeds into prompt iteration priorities. The most frequent failure category in a given week becomes the top target for the next prompt revision cycle.

14. Data Flow Overview

Five distinct data paths move through the platform. Each path has different latency expectations and storage targets.

1. Training Data Flow
   Raw data (images, .ldr, stats) → Blob containers → versioned data assets → training jobs (compute cluster) → MLflow experiment tracking → model registry (versioned models)

2. Inference Data Flow
   Client request → managed endpoint → model prediction → response + log

3. Evaluation Data Flow
   Test dataset (eval/datasets/) → eval runner (run_evaluation.py) → quality metrics (6 scores) → results (eval/results/) → Actions gate (pass/fail)

4. Prompt Data Flow
   Prompt files (prompts/) → Git version control → eval runner → experiment results

5. Monitoring Data Flow
   Inference logs → drift job (PSI/KS/JS) → App Insights → alert → retrain

15. Cost Breakdown

Estimated monthly costs for a single dev environment with moderate usage. All prices are approximate and based on East US 2 region pricing as of early 2026.

Fixed Monthly (Dev)

  • AI Search (Basic): ~$75
  • Log Analytics (per-GB): ~$5
  • ACR (Basic): ~$5
  • Storage Account (LRS): ~$2
  • Key Vault (Standard): ~$0.50
  • Foundry Hub + Project, AML Workspace: $0 (included)
  Total fixed monthly: ~$87.50

Variable Costs

  • Compute Instance (DS2_v2), ~4 hr/day: ~$30-60
  • GPU Cluster (NC4as_T4_v3), scale-to-zero: ~$0-50
  • CPU Cluster (DS3_v2), scale-to-zero: ~$0-20
  • Azure OpenAI, moderate usage: ~$5-20
  Total variable: ~$35-150

Total Estimate

~$120-240/month for dev with moderate use

Note: A test environment roughly doubles the fixed costs. Deploy test only when validating promotion flows. Scale-to-zero compute keeps variable costs near zero when idle.

16. Teardown

Both IaC paths support a full teardown of all deployed resources.

Primary (Bicep)

./deploymentcode/bicep/scripts/teardown.sh dev

Alternate (Terraform)

cd deploymentcode/terraform && terraform destroy

What Happens on Teardown

  • Key Vault enters soft-deleted state (recoverable for 90 days by default). Purge manually if you need the name back immediately.
  • OpenAI deployments are deleted with the resource group. Model deployments do not persist independently.
  • AI Search indexes are not recoverable. Re-index from source data after redeployment.
  • Storage data is lost unless backed up externally. The teardown script does not create snapshots.
  • MLflow experiments and run history are stored in the workspace. Deleting the workspace deletes all experiment data.

17. How Models Work Together

The four ML scenarios connect to the LEGO design workflow through managed endpoints. Each endpoint serves a specific role in the feedback loop between design and validation.

Design Workflow

User prompt (natural language) → Claude Code (AI agents) → PowerShell generation script → .ldr file (LDraw output) → stud.io (3D render) → ML feedback (optional)

Integration Points

Endpoint                      | Scenario | What It Does
facade-classifier-endpoint    | 1        | Classifies building style from rendered image
structural-validator-endpoint | 2        | Flags anomalous structural patterns
pattern-extractor-endpoint    | 3        | Identifies building cluster from part usage
spec-generator-endpoint       | 4        | Generates building spec from natural language

Configuration

Endpoint URLs and keys are stored in .env for local development and Key Vault for deployed environments. For production, use managed identity to avoid storing keys entirely. The SDK authenticates via DefaultAzureCredential, which picks up the managed identity automatically.

Current Status

Capability                      | Status
Bicep infrastructure deployment | Operational
GitHub Actions CI/CD            | Operational
Scenario 1-3 training scripts   | Operational
Scenario 4 RAG pipeline         | Operational
Foundry-native GenAI path       | Operational
Managed online endpoints        | Operational
Evaluation quality gates        | Operational
Prompt versioning               | Operational
Drift detection                 | Operational
Observability dashboard         | Operational
LEGO design integration         | In Progress
Production data collection      | Planned