Agents and Building Blocks
Updated: Dec 16, 2025
- Understand how Agents coordinate runtime data flow and events in AI
Building Blocks.
- Learn how Providers handle model inference across Cloud, Local, and
On-Device backends.
- Connect multiple Agents to build modular, event-driven, and
multimodal AI pipelines.
- Combine vision, speech, and language Agents to create interactive XR
experiences.
- Extend or customize existing Agents using inheritance and UnityEvents without
modifying Providers.
Agents are Unity components that bring core AI capabilities — such as object
detection, natural language processing, and speech synthesis — into XR
applications. Each Agent coordinates runtime data flow between your scene
and an inference Provider (Cloud, Local, or On-Device).
Providers define how inference runs and how input/output is formatted, while
Agents handle when and where data is captured, processed, and dispatched in
Unity.
Overview of Available Building Blocks
| Building Block | Agent(s) | Purpose |
|---|---|---|
| Object Detection | ObjectDetectionAgent, ObjectDetectionVisualizer | Detects and tracks objects in passthrough or camera textures. |
| Large Language Model (LLM) | LlmAgent, LlmAgentHelper | Manages text and multimodal chat using GPT, Llama, or similar models. |
| Speech-to-Text (STT) | SpeechToTextAgent | Converts user speech or audio clips into text. |
| Text-to-Speech (TTS) | TextToSpeechAgent | Generates natural-sounding voice from text. |
All four Building Blocks, each made up of one or more Agents and their helper classes, share a consistent interface and can be combined to form complex, multimodal pipelines.
Possible Bounding Box Misalignment
We have a fix ready for the upcoming release that will take into account different resolutions and aspect ratios of the Passthrough Camera. If you are currently experiencing an offset on your bounding boxes, set the resolution on your PassthroughCameraAccess component to x: 1280, y: 960, which was previously the default resolution.
Object Detection
- ObjectDetectionAgent: Runs inference and returns structured detection data.
- ObjectDetectionVisualizer: Renders bounding boxes or 3D meshes in the scene.
- The Agent captures a frame from PassthroughCamera.
- It sends the texture to the assigned Provider (for example, UnityInferenceEngineProvider or HuggingFaceProvider).
- The Provider returns detection results (bounding boxes, class labels, confidence scores).
- The Visualizer draws the results in world space.
How to place object detections in 3D space
When you import the Object Detection Building Block with its visualizer, the DepthTextureAccess class is imported as well. It provides the visualizer with depth data to turn 2D detections into real-world 3D positions.
- EnvironmentDepthManager: Retrieves per-eye environment depth textures from the Meta Quest system and makes them available globally for other components.
- DepthTextureAccess: Reads these depth textures from the GPU (via EnvironmentDepthManager), copies them into a CPU-accessible array, and exposes per-frame depth data.
- ObjectDetectionAgent: Uses the 2D bounding boxes from its model output and maps them to world space using the latest depth information from DepthTextureAccess.
- ObjectDetectionVisualizer: Renders 3D bounding boxes or markers in the scene at the world-space positions provided by ObjectDetectionAgent.
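The mapping step can be sketched as follows. Treat this as an illustration only: the per-pixel depth lookup (GetDepthAt) and the camera used for unprojection are hypothetical placeholders, so check the actual DepthTextureAccess API for the equivalent calls.

// Hypothetical sketch: place the center of a 2D bounding box in world space.
// "depthAccess.GetDepthAt" stands in for whatever per-pixel depth lookup
// DepthTextureAccess exposes; "passthroughCamera" is a Camera whose
// intrinsics match the detection texture.
Vector2 pixelCenter = boundingBox.center;
float depthMeters = depthAccess.GetDepthAt(pixelCenter);          // hypothetical
Ray ray = passthroughCamera.ScreenPointToRay(pixelCenter);
Vector3 worldPosition = ray.origin + ray.direction * depthMeters;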
An alternative, still valid solution is to use the EnvironmentRaycastManager component to cast a ray from the camera pixel into world space and find the distance to the object. However, the AI Building Blocks do not use this approach because it can become imprecise and expensive when done every frame for each point of the bounding box. Furthermore, if the user's head moves between inference and output retrieval, the raycast may hit the wrong point, since the object may no longer be in the user's frustum.
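For completeness, the raycast-based alternative looks roughly like this. The exact EnvironmentRaycastManager signature may differ between MRUK versions, so treat this as a sketch rather than a drop-in implementation.

// Sketch of the raycast alternative (not used by the AI Building Blocks).
// Assumes MRUK's EnvironmentRaycastManager with a Raycast(Ray, out hit) overload.
Ray ray = passthroughCamera.ScreenPointToRay(boundingBox.center);
if (environmentRaycastManager.Raycast(ray, out EnvironmentRaycastHit hit))
{
    // hit.point is the world-space position where the ray met the environment.
    Vector3 worldPosition = hit.point;
}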
You can subscribe to detection results in code:
agent.OnDetectionsUpdated += (detections) =>
{
foreach (var d in detections)
Debug.Log($"Detected {d.label} at {d.box}");
};
Large Language Model (LLM)
- LlmAgent: Manages text or multimodal (image and text) conversation flow.
- LlmAgentHelper: Connects LLMs/VLMs with speech agents (STT/TTS) or other
custom logic.
- SendPrompt() forwards user input to the Provider (Llama API, OpenAI, Ollama, and so on).
- The Provider streams text tokens asynchronously.
- The Agent emits OnResponseReceived and OnStreamUpdate events.
- Other Agents (like Text-to-Speech) can subscribe to stream updates for immediate feedback.
speechToTextAgent.OnTranscript += llmAgentHelper.SendPrompt;
llmAgent.OnResponseReceived += textToSpeechAgent.SpeakText;
Providers such as Llama 4 Maverick, OpenAI GPT-4o, or Claude 4 Sonnet on Replicate support image input. When a Provider implements IChatTask, textures are automatically encoded and sent alongside text prompts.
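A multimodal prompt could then be dispatched like this. The SendPrompt overload that accepts a texture and the frame-capture helper are assumptions here, so check the LlmAgent API for the real signatures.

// Hypothetical sketch: send a passthrough frame together with a text prompt.
// The Provider must implement IChatTask for the image to be encoded and sent.
Texture2D frame = CaptureCurrentPassthroughFrame();   // hypothetical helper
llmAgent.SendPrompt("What objects do you see in this image?", frame);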
Speech-to-Text (STT)
- SpeechToTextAgent: Captures microphone or audio clip input.
- Works with Providers implementing
ISpeechToTextTask.
- The Agent records audio or receives an AudioClip.
- Audio is encoded and sent to the Provider (OpenAI, ElevenLabs).
- The Provider transcribes it to text and triggers OnTranscript.
speechToTextAgent.OnTranscript += (string transcript) =>
{
Debug.Log($"User said: {transcript}");
};
- Supports real-time or clip-based transcription.
- Some Providers offer streaming recognition for live captions.
- Optional language code field for multilingual setups.
Text-to-Speech (TTS)
- TextToSpeechAgent: Converts text into audio clips using ITextToSpeechTask Providers.
- The Agent receives text via SpeakText().
- It sends the request to the Provider (for example, ElevenLabs, OpenAI TTS).
- The Provider returns or streams an AudioClip.
- The Agent plays it automatically or exposes it for custom playback.
llmAgent.OnResponseReceived += textToSpeechAgent.SpeakText;
| Event | Raised By | Description |
|---|---|---|
| onDetectionResponseReceived | ObjectDetectionAgent | Invoked after a detection pass with the processed BoxData list. |
| OnBoxesUpdated | ObjectDetectionAgent | C# event fired when boxes are updated; used by ObjectDetectionVisualizer. |
| onPromptSent | LlmAgent | Raised when a user prompt is dispatched to the Provider. |
| onResponseReceived | LlmAgent | Raised when a full assistant response is received. |
| onImageCaptured | LlmAgent | Raised when a passthrough or debug image is captured for a multimodal prompt. |
| onTranscript | SpeechToTextAgent | Emits the recognized transcript after STT completes. |
| onClipReady | TextToSpeechAgent | Fired when a synthesized AudioClip is ready for playback. |
| onSpeakStarting | TextToSpeechAgent | Fired just before playback, passing the text that will be spoken. |
| onSpeakFinished | TextToSpeechAgent | Fired when the AudioSource finishes or playback is stopped. |
- MRUK-only events (like OnBoxesUpdated) are wrapped in compile guards and are only available if MRUK is installed and the Passthrough Camera is available.
- LlmAgentHelper connects existing events (onPromptSent, onResponseReceived) for simplified wiring but doesn't introduce new ones.
All Agents use a unified public event-driven architecture. You can connect them
directly through the Unity Inspector or your custom logic.
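The inspector-exposed UnityEvents can also be wired from code with AddListener. A sketch, assuming onSpeakStarting carries the spoken text and onSpeakFinished carries no arguments (verify the actual field types on your Agent):

// Wire TTS lifecycle events from code instead of the Inspector.
textToSpeechAgent.onSpeakStarting.AddListener(text =>
    Debug.Log($"About to speak: {text}"));
textToSpeechAgent.onSpeakFinished.AddListener(() =>
    Debug.Log("Playback finished."));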
Example 1: Conversational Chain
[Microphone Input] → SpeechToTextAgent → LlmAgent → TextToSpeechAgent → [Audio Output]
Example 2: Vision + LLM Hybrid
[Camera Frame] → ObjectDetectionAgent → LlmAgent (context injection) → TextToSpeechAgent
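Example 1 can be assembled entirely from the event connections shown earlier. A minimal wiring component might look like this (the serialized field names are illustrative; assign the Agents in the Inspector):

using UnityEngine;

public class ConversationalChain : MonoBehaviour
{
    [SerializeField] private SpeechToTextAgent speechToTextAgent;
    [SerializeField] private LlmAgentHelper llmAgentHelper;
    [SerializeField] private LlmAgent llmAgent;
    [SerializeField] private TextToSpeechAgent textToSpeechAgent;

    private void OnEnable()
    {
        // Microphone → STT → LLM → TTS
        speechToTextAgent.OnTranscript += llmAgentHelper.SendPrompt;
        llmAgent.OnResponseReceived += textToSpeechAgent.SpeakText;
    }

    private void OnDisable()
    {
        speechToTextAgent.OnTranscript -= llmAgentHelper.SendPrompt;
        llmAgent.OnResponseReceived -= textToSpeechAgent.SpeakText;
    }
}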
Developers can subclass any Agent to add new behavior such as gesture input,
haptic feedback, or custom UI updates.
using System.Collections.Generic;
using UnityEngine;

public class MyCustomAgent : ObjectDetectionAgent
{
// Override HandleResults to inject custom logic after detection processing
// Use this pattern to add haptic feedback, audio cues, or custom UI updates
protected override void HandleResults(List<Detection> results)
{
// Call base implementation first to ensure proper detection handling
base.HandleResults(results);
// Add custom behavior that responds to detection results
foreach (var detection in results)
{
if (detection.confidence > 0.8f)
{
// Trigger haptic feedback, play sound, or update custom UI
Debug.Log($"High confidence detection: {detection.label}");
}
}
}
}
Because Agents are Provider-agnostic, switching from a
cloud model to an on-device model only requires assigning a new Provider
asset — no code changes.