Performance Architecture — Azure AI Engineer (AI-300)
Latency and Scale Are Not the Same Constraint
Questions in this family present a performance requirement (sub-100ms response, global user distribution, or high read throughput) and then list services that each solve a different slice of it. Azure Front Door handles geographic routing and TLS offload at the edge. Azure Cache for Redis serves repeated requests from cache so static feature sets are not re-scored on every call. Azure Cosmos DB multi-region replication moves reads closer to users at the data layer. The deciding constraint is where in the request path the bottleneck lives, not which service sounds most powerful.
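To make the caching slice concrete, here is a minimal sketch, assuming redis-py against an Azure Cache for Redis instance; the host name, key scheme, and scoring callable are hypothetical placeholders, not a prescribed implementation.

```python
import hashlib
import json
from typing import Callable

import redis  # Azure Cache for Redis speaks the standard Redis protocol

# Hypothetical connection details; substitute your cache name and access key.
cache = redis.Redis(
    host="my-cache.redis.cache.windows.net",
    port=6380,
    ssl=True,
    password="<access-key>",
)

def cached_inference(features: dict, score_fn: Callable[[dict], str], ttl_seconds: int = 3600) -> str:
    """Return a cached prediction for a static feature set; call the model only on a cache miss."""
    key = "pred:" + hashlib.sha256(json.dumps(features, sort_keys=True).encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit.decode()          # repeated request served without re-running inference
    result = score_fn(features)      # the actual model call (REST endpoint, SDK, etc.)
    cache.setex(key, ttl_seconds, result)
    return result
```

The cache key hashes the full feature payload, so only genuinely identical requests are short-circuited; the TTL bounds how stale a served prediction can be.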
What This Pattern Tests
Azure performance architecture tests whether you match the scaling mechanism to the bottleneck. For AI-300, Azure ML managed online endpoints auto-scale on metrics such as request latency or CPU utilization, while batch endpoints handle offline scoring at lower cost. Azure OpenAI provisioned throughput units (PTUs) reserve tokens-per-minute capacity for production workloads, while standard deployment uses shared capacity with rate limits. For AZ-400, Azure Pipelines parallel jobs determine build throughput: the free tier gives 1 parallel job with 1,800 minutes per month, while paid Microsoft-hosted jobs remove the monthly minute cap. Azure Load Testing identifies performance ceilings before deployment. The trap is adding more compute when the bottleneck is API rate limiting, or provisioning PTUs for a dev/test workload that should use standard deployment.
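For the endpoint side of that distinction, the sketch below creates a managed online endpoint and a deployment with an explicit instance count using the azure-ai-ml SDK v2. It assumes an MLflow-format registered model (so no scoring script or environment is required); the workspace coordinates, names, and VM SKU are placeholders.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineDeployment, ManagedOnlineEndpoint
from azure.identity import DefaultAzureCredential

# Hypothetical workspace coordinates.
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Real-time scoring endpoint; a batch endpoint is the lower-cost choice for offline scoring.
endpoint = ManagedOnlineEndpoint(name="churn-endpoint", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="churn-endpoint",
    model="azureml:churn-model:3",    # assumed registered MLflow model
    instance_type="Standard_DS3_v2",  # assumed SKU
    instance_count=2,                 # baseline capacity; Azure Monitor autoscale rules can adjust it
)
ml_client.online_deployments.begin_create_or_update(deployment).result()
```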
Decision Axis
The right performance lever depends on the service and on where the bottleneck sits: compute scaling vs. throughput provisioning vs. rate-limit management vs. data partitioning.
Decision Rules
Whether to satisfy the training completion SLA by scaling vertically to a single high-memory GPU VM or by scaling horizontally across a multi-node Azure Machine Learning Compute cluster using data-parallel distributed training, where the horizontal option delivers higher aggregate GPU throughput at lower cost-per-epoch for this dataset and model size.
Whether to trigger retraining through an Azure Machine Learning dataset monitor that fires only when drift coefficient exceeds a threshold, or through a fixed-schedule Azure Machine Learning Pipeline that retrains on a calendar interval regardless of actual drift magnitude.
Whether to provision a single oversized GPU VM (vertical scale-up) or a multi-node Azure Machine Learning Compute cluster using data-parallel distributed training (horizontal scale-out) when model size and dataset volume exceed single-node throughput capacity under a cost-per-epoch constraint.
Whether to trigger retraining via an Azure Machine Learning dataset monitor drift-score threshold alert (reactive, cost-efficient) versus a fixed-interval Azure Machine Learning Pipeline schedule (superficially simpler but wasteful when drift is absent).
Select multi-node distributed training on a right-sized Azure Machine Learning Compute cluster over a single oversized GPU VM when the combined dataset volume and time SLA require aggregate data-ingestion and compute throughput that a single node cannot deliver within budget (a distributed-training job sketch follows this list).
Select drift-threshold-triggered retraining — Azure Machine Learning dataset monitor emitting alert → Azure Machine Learning Pipelines trigger — over a fixed-schedule retraining pipeline, because the constraint is cost-efficient automated remediation proportional to actual drift magnitude, not calendar cadence (a minimal drift-trigger sketch follows this list).
Whether the retry reliability, step-level caching, and lineage requirements of the nightly multi-step training workflow are better satisfied by Azure Machine Learning Pipelines' native capabilities or by a custom Python orchestration script that must implement those behaviors manually.
Choose Serverless API Endpoints over Provisioned Throughput Units when the workload is low-volume or traffic-unpredictable — PTUs are only cost-justified when sustained throughput can saturate the fixed reserved allocation, and their latency advantage disappears at volumes serverless can serve within the SLA.
Whether to satisfy the reproducibility and auditability constraint by storing prompts in Git and tagging MLflow experiment runs with the corresponding commit SHAs — using zero new infrastructure — versus provisioning dedicated Azure Blob Storage versioned containers plus a separate metadata store to track lineage, which over-provisions for text assets that Git handles natively (see the MLflow commit-tagging sketch after this list).
For a time-bounded batch workload with a high peak-throughput requirement but a dominant daily idle fraction, select serverless deployment over Provisioned Throughput Units because serverless charges only for tokens consumed during the active window, while PTU charges for reserved capacity across all 24 hours regardless of utilization (see the break-even arithmetic sketch after this list).
Whether to build a purpose-built custom prompt registry backed by Azure Blob Storage with a metadata-tagging API, or to commit prompt files directly to Git with GitHub Actions CI validation hooks—Git wins because it delivers native diff history, branch isolation, and rollback with zero new infrastructure surface, satisfying the reproducibility and no-added-management-burden constraints that the custom registry violates.
When a sustained-traffic throughput and p99 latency SLA are both fixed, choose the deployment mode whose built-in traffic-shifting capability satisfies the zero-downtime promotion requirement without adding a custom routing layer that introduces new coordination and failure-mode surfaces (see the traffic-splitting sketch after this list).
Whether to adopt Git-based source control with MLflow experiment linking—which satisfies reproducibility and audit-lineage at minimal infrastructure footprint—versus enabling Azure Blob Storage blob-level versioning, which over-provisions storage capacity while failing the diff-visibility, branch-isolation, and CI-validation requirements of the reproducibility constraint.
Whether to deploy the Azure OpenAI Service model using pay-as-you-go token-based consumption or Provisioned Throughput Units, given a low-volume, irregular traffic pattern and a cost-efficiency constraint that disqualifies pre-reserved capacity whose baseline cost accrues regardless of utilization.
Whether to store and version prompt templates in a Git repository with CI-enforced, reviewer-gated promotion (satisfying both reproducibility via commit SHA and auditability via PR approval history) or to adopt a near-right alternative such as Azure Blob Storage with versioning tags or an Azure Machine Learning prompt catalog that provides artifact retention and version identifiers but lacks native diff history, branch isolation, and reviewer-attribution workflows.
Whether to deploy the updated model version as a named deployment within the existing Azure OpenAI Service resource and use built-in traffic splitting with managed rollback, or to provision a separate Azure Machine Learning managed online endpoint with custom canary routing scripts and alert-triggered revert logic.
Whether Git-native tagging plus GitHub Actions PR-review gates constitute a sufficient and lower-complexity prompt versioning system, or whether layering MLflow as an additional prompt artifact registry on top of Git meaningfully improves reproducibility at acceptable operational cost.
Whether to address intermittent generative AI agent latency spikes through observability tooling (distributed tracing and metric dashboards) or through capacity-layer provisioning (scaling PTUs or reserved throughput), given an explicit minimal-overhead constraint requiring root cause identification rather than headroom addition.
Whether to diagnose intermittent latency spikes by enabling Azure AI Foundry's built-in tracing integrated with Azure Monitor versus scaling PTUs or authoring a custom OpenTelemetry pipeline, where the decisive constraint is least operational overhead for root-cause identification, not raw throughput capacity.
Enable Azure OpenAI Service diagnostic logging forwarded to an Azure Monitor Log Analytics workspace to surface per-call token metrics; do not scale provisioned throughput units (PTUs), which adds reserved capacity without producing per-invocation cost-attribution observability (see the Log Analytics query sketch after this list).
Choose retrieval-parameter tuning (chunk size, chunk overlap, similarity threshold) over infrastructure tier scaling to close the retrieval-relevance gap within the stated cost constraint.
Whether to apply full fine-tuning on an oversized high-memory GPU cluster SKU or a parameter-efficient fine-tuning technique (LoRA/QLoRA) on a cost-appropriate Azure Machine Learning Compute tier when a specific accuracy threshold — not maximum accuracy — is the stated requirement and budget is a hard constraint.
Whether to scale Azure AI Search capacity (add replicas or upgrade SKU tier) or tune retrieval configuration parameters — specifically chunk size, chunk overlap, and similarity threshold — to close the relevance gap at current cost.
Whether the stated accuracy threshold and fixed compute budget together justify full model fine-tuning on high-capacity GPU compute, or whether a parameter-efficient fine-tuning technique such as LoRA on a lower-tier Azure Machine Learning Compute cluster satisfies both constraints with substantially less cost and operational overhead.
Whether to improve RAG retrieval quality by tuning retrieval configuration parameters — chunk size, overlap window, or semantic similarity threshold — within the existing Azure AI Search tier, versus upgrading to a higher SKU or adding replicas, when the constraint is cost-neutral and the stated problem is relevance rather than throughput.
Whether the stated accuracy threshold and dataset size justify full-parameter fine-tuning on an oversized GPU cluster, or whether a parameter-efficient technique (e.g., LoRA) on a cost-appropriate Azure Machine Learning Compute tier meets the same evaluation metric at a fraction of the GPU-hour spend (see the LoRA sketch after this list).
Whether to improve retrieval relevance by tuning retrieval parameters (chunk size, chunk overlap, similarity score threshold) within the existing Azure AI Search index versus scaling infrastructure (adding replicas or upgrading to a higher search SKU) that adds cost and operational overhead without changing which chunks are ranked and returned (see the chunk-tuning sketch after this list).
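The horizontal scale-out rule, as a minimal sketch: a data-parallel PyTorch job submitted to a multi-node Azure Machine Learning Compute cluster with the azure-ai-ml SDK v2. The workspace coordinates, cluster name, environment, and training script are assumed placeholders, not a prescribed setup.

```python
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Data-parallel training across 4 GPU nodes instead of one oversized VM.
job = command(
    code="./src",                                    # assumed folder containing train.py
    command="python train.py --epochs 10",
    environment="azureml:my-pytorch-env@latest",     # assumed registered/curated environment
    compute="gpu-cluster",                           # assumed AmlCompute cluster of GPU nodes
    instance_count=4,                                # horizontal scale-out
    distribution={"type": "pytorch", "process_count_per_instance": 1},
)
ml_client.jobs.create_or_update(job)
```

Raising instance_count on a right-sized SKU, rather than the SKU itself, is what buys aggregate GPU throughput at a lower cost-per-epoch.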
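The drift-trigger rule, sketched as plain trigger logic rather than a specific SDK call: get_drift_coefficient and trigger_retraining_pipeline are hypothetical callables standing in for the dataset monitor's drift score and the Azure Machine Learning pipeline submission.

```python
from typing import Callable, Optional

DRIFT_THRESHOLD = 0.3  # assumed threshold; tune to the drift magnitude the model can tolerate

def maybe_retrain(
    get_drift_coefficient: Callable[[], float],
    trigger_retraining_pipeline: Callable[[], str],
) -> Optional[str]:
    """Fire retraining only when measured drift exceeds the threshold,
    instead of retraining on a fixed calendar schedule."""
    drift = get_drift_coefficient()              # e.g. latest dataset-monitor drift score
    if drift > DRIFT_THRESHOLD:
        return trigger_retraining_pipeline()     # e.g. submit the registered AML pipeline, return run id
    return None                                  # no drift, no compute spend
```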
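The serverless-versus-PTU rules reduce to a utilization break-even check. Every figure below is a hypothetical placeholder, not an Azure price; substitute current pricing and the observed traffic profile.

```python
# Break-even check between pay-per-token (serverless/standard) and reserved PTU capacity.
ptu_monthly_cost = 10_000.00     # assumed reserved-capacity cost per month
price_per_1k_tokens = 0.002      # assumed pay-per-token rate
tokens_per_day = 40_000_000      # observed/forecast daily volume
active_hours_per_day = 6         # batch window; PTU is billed for all 24 hours regardless

serverless_monthly_cost = tokens_per_day * 30 / 1000 * price_per_1k_tokens
print(f"Serverless: ${serverless_monthly_cost:,.0f}/month vs PTU: ${ptu_monthly_cost:,.0f}/month")

# Reserve PTUs only if sustained volume keeps the fixed allocation busy enough
# that the pay-per-token bill would exceed the reservation.
if serverless_monthly_cost > ptu_monthly_cost:
    print("Sustained volume saturates reserved capacity: PTU is cost-justified.")
else:
    print("Idle fraction dominates: stay on pay-per-token.")
```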
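The Git-plus-MLflow prompt-versioning rule, as a minimal sketch: tag each experiment run with the commit SHA that produced the prompt files. The prompt path and the metric name are assumptions for illustration.

```python
import subprocess

import mlflow

# Resolve the commit that produced the prompt files currently checked out.
commit_sha = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

with mlflow.start_run(run_name="prompt-eval"):
    # Reproducibility via the commit SHA; auditability via the PR history behind that SHA.
    mlflow.set_tag("prompt_git_commit", commit_sha)
    mlflow.log_artifact("prompts/system_prompt.txt")  # assumed prompt file path
    # ... run the evaluation and log its metrics ...
    mlflow.log_metric("groundedness", 4.2)            # placeholder metric value
```

No new infrastructure is introduced: Git keeps the diff history and branch isolation, and the tag ties each run to the exact prompt version it used.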
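The built-in traffic-shifting rule, shown here for an Azure Machine Learning managed online endpoint (the rule above makes the same point for Azure OpenAI named deployments); the endpoint and deployment names are assumptions.

```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Shift 10% of live traffic to the new deployment; no custom routing layer required.
endpoint = ml_client.online_endpoints.get("churn-endpoint")   # assumed endpoint name
endpoint.traffic = {"blue": 90, "green": 10}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Rollback is the same operation with the weights reversed, e.g. {"blue": 100, "green": 0}.
```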
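The diagnostic-logging rule, as a sketch of querying the receiving Log Analytics workspace with the azure-monitor-query client. The table name, resource-provider filter, and columns are assumptions; the actual schema depends on which diagnostic categories are enabled on the Azure OpenAI resource, and token fields, where emitted, can be summarized the same way.

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# KQL over the workspace that receives the Azure OpenAI diagnostic logs.
query = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| summarize calls = count() by OperationName, bin(TimeGenerated, 1h)
| order by TimeGenerated desc
"""

response = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",  # assumed workspace GUID
    query=query,
    timespan=timedelta(days=1),
)
for table in response.tables:
    for row in table.rows:
        print(row)
```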
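The parameter-efficient fine-tuning rule, as a minimal PEFT/LoRA sketch; the base model and the target module names are assumptions to adapt to the model actually being tuned.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Train small low-rank adapters instead of all weights, so a cost-appropriate GPU tier
# can reach the stated accuracy threshold without an oversized cluster.
base = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")  # assumed base model

lora = LoraConfig(
    r=8,                                   # adapter rank, small relative to the hidden size
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # assumed attention projection names for this model
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()          # typically well under 1% of total parameters
```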
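The chunk-tuning rule, as a sketch of the two places the retrieval knobs live: chunk size and overlap on the Text Split skill at indexing time, and a similarity cutoff at query time. All values are assumptions to tune against a relevance evaluation set, and parameter availability depends on the azure-search-documents SDK/API version in use.

```python
from azure.search.documents.indexes.models import (
    InputFieldMappingEntry,
    OutputFieldMappingEntry,
    SplitSkill,
)

# Indexing-time knobs: smaller chunks sharpen matches, overlap preserves context
# across chunk boundaries. No SKU change or replica is involved.
split_skill = SplitSkill(
    name="chunking",
    text_split_mode="pages",
    maximum_page_length=1000,   # assumed chunk size in characters
    page_overlap_length=200,    # assumed overlap between adjacent chunks
    inputs=[InputFieldMappingEntry(name="text", source="/document/content")],
    outputs=[OutputFieldMappingEntry(name="textItems", target_name="chunks")],
)

# Query-time knob: drop weak matches instead of padding the prompt with noise.
MIN_SCORE = 0.8  # assumed similarity cutoff

def filter_results(results):
    """Keep only hits above the relevance cutoff from an Azure AI Search result set."""
    return [doc for doc in results if doc["@search.score"] >= MIN_SCORE]
```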