A company is using Amazon Bedrock and Anthropic Claude 3 Haiku to develop an AI assistant. The AI
assistant normally processes 10,000 requests each hour but experiences surges of up to 30,000
requests each hour during peak usage periods. The AI assistant must respond within 2 seconds while
operating across multiple AWS Regions.
The company observes that during peak usage periods, the AI assistant experiences throughput
bottlenecks that cause increased latency and occasional request timeouts. The company must resolve
the performance issues.
Which solution will meet this requirement?
Correct Answer: B
Explanation:
Option B is the correct solution because it directly addresses both throughput bottlenecks and latency
requirements using native Amazon Bedrock performance optimization features that are designed for
real-time, high-volume generative AI workloads.
Amazon Bedrock supports cross-Region inference profiles, which allow applications to transparently
route inference requests across multiple AWS Regions. During peak usage periods, traffic is
automatically distributed to Regions with available capacity, reducing throttling, request queuing, and
timeout risks. This approach aligns with AWS guidance for building highly available, low-latency GenAI
applications that must scale elastically across geographic boundaries.
Token batching further improves efficiency by combining multiple inference requests into a single model
invocation where applicable. AWS Generative AI documentation highlights batching as a key optimization
technique to reduce per-request overhead, improve throughput, and better utilize model capacity. This is
especially effective for lightweight, low-latency models such as Claude 3 Haiku, which are designed for
fast responses and high request volumes.
Option A does not meet the requirement because purchasing provisioned throughput in a single Region
creates a regional bottleneck and does not address multi-Region availability or traffic spikes beyond
reserved capacity. Retries increase load and latency rather than resolving the root cause Option C improves application-layer scaling but does not solve model-side throughput limits. Client-side
round-robin routing lacks awareness of real-time model capacity and can still send traffic to saturated
Regions.
Option D is unsuitable because batch inference with asynchronous retrieval is designed for offline or
non-interactive workloads. It cannot meet a strict 2-second response time requirement for an interactive
AI assistant.
Therefore, Option B provides the most effective and AWS-aligned solution to achieve low latency, global
scalability, and high throughput during peak usage periods.
Question 2
A company is using Amazon Bedrock and Anthropic Claude 3 Haiku to develop an AI assistant. The AI
assistant normally processes 10,000 requests each hour but experiences surges of up to 30,000
requests each hour during peak usage periods. The AI assistant must respond within 2 seconds while
operating across multiple AWS Regions.
The company observes that during peak usage periods, the AI assistant experiences throughput
bottlenecks that cause increased latency and occasional request timeouts. The company must resolve
the performance issues.
Which solution will meet this requirement?
Correct Answer: B
Explanation:
Option B is the correct solution because it directly addresses both throughput bottlenecks and latency
requirements using native Amazon Bedrock performance optimization features that are designed for
real-time, high-volume generative AI workloads.
Amazon Bedrock supports cross-Region inference profiles, which allow applications to transparently
route inference requests across multiple AWS Regions. During peak usage periods, traffic is
automatically distributed to Regions with available capacity, reducing throttling, request queuing, and
timeout risks. This approach aligns with AWS guidance for building highly available, low-latency GenAI
applications that must scale elastically across geographic boundaries.
Token batching further improves efficiency by combining multiple inference requests into a single model
invocation where applicable. AWS Generative AI documentation highlights batching as a key optimization
technique to reduce per-request overhead, improve throughput, and better utilize model capacity. This is
especially effective for lightweight, low-latency models such as Claude 3 Haiku, which are designed for
fast responses and high request volumes.
Option A does not meet the requirement because purchasing provisioned throughput in a single Region
creates a regional bottleneck and does not address multi-Region availability or traffic spikes beyond
reserved capacity. Retries increase load and latency rather than resolving the root cause Option C improves application-layer scaling but does not solve model-side throughput limits. Client-side
round-robin routing lacks awareness of real-time model capacity and can still send traffic to saturated
Regions.
Option D is unsuitable because batch inference with asynchronous retrieval is designed for offline or
non-interactive workloads. It cannot meet a strict 2-second response time requirement for an interactive
AI assistant.
Therefore, Option B provides the most effective and AWS-aligned solution to achieve low latency, global
scalability, and high throughput during peak usage periods.
Question 3
A company provides a service that helps users from around the world discover new restaurants. The
service has 50 million monthly active users. The company wants to implement a semantic search
solution across a database that contains 20 million restaurants and 200 million reviews. The company
currently stores the data in PostgreSQL.
The solution must support complex natural language queries and return results for at least 95% of
queries within 500 ms. The solution must maintain data freshness for restaurant details that update
hourly. The solution must also scale cost-effectively during peak usage periods.
Which solution will meet these requirements with the LEAST development effort?
Correct Answer: B
Explanation:
Option B best satisfies the requirements while minimizing development effort by combining managed
semantic search capabilities with fully managed foundation models. AWS Generative AI guidance
describes semantic search as a vector-based retrieval pattern where both documents and user queries
are embedded into a shared vector space. Similarity search (such as k-nearest neighbors) then retrieves
results based on meaning rather than exact keywords.
Amazon OpenSearch Service natively supports vector indexing and k-NN search at scale. This makes it
well suited for large datasets such as 20 million restaurants and 200 million reviews while still achieving
sub-second latency for the majority of queries. Because OpenSearch is a distributed, managed service,
it automatically scales during peak traffic periods and provides cost-effective performance compared with
building and tuning custom vector search pipelines on relational databases.
Using Amazon Bedrock to generate embeddings significantly reduces development complexity. AWS
manages the foundation models, eliminates the need for custom model hosting, and ensures
consistency by using the same FM for both document embeddings and query embeddings. This aligns
directly with AWS-recommended semantic search architectures and removes the need for model
lifecycle management.
Hourly updates to restaurant data can be handled efficiently through incremental re-indexing in
OpenSearch without disrupting query performance. This approach cleanly separates transactional data storage from search workloads, which is a best practice in AWS architectures.
Option A does not meet the semantic search requirement because keyword-based search cannot reliably
interpret complex natural language intent.
Option C introduces scalability and performance risks by running large-scale vector similarity searches
inside PostgreSQL, which increases operational complexity.
Option D adds unnecessary ingestion and abstraction layers intended for retrieval-augmented
generation, not high-throughput semantic search.
Therefore, Option B provides the optimal balance of performance, scalability, data freshness, and
minimal development effort using AWS Generative AI services
Demo Practice Mode
You are viewing only the questions marked as Demo.