Agents and Building Blocks
Updated: Dec 16, 2025
- Understand how Agents coordinate runtime data flow and events in AI
Building Blocks.
- Learn how Providers handle model inference across Cloud, Local, and
On-Device backends.
- Connect multiple Agents to build modular, event-driven, and
multimodal AI pipelines.
- Combine vision, speech, and language Agents to create interactive XR
experiences.
- Extend or customize existing Agents using inheritance and UnityEvents without
modifying Providers.
Agents are Unity components that bring core AI capabilities — such as object
detection, natural language processing, and speech synthesis — into XR
applications. Each Agent coordinates runtime data flow between your scene
and an inference Provider (Cloud, Local, or On-Device).
Providers define how inference runs and how input/output is formatted, while
Agents handle when and where data is captured, processed, and dispatched in
Unity.
Overview of Available Building Blocks
| Building Block | Agent(s) | Purpose |
|---|---|---|
| Object Detection | ObjectDetectionAgent, ObjectDetectionVisualizer | Detects and tracks objects in passthrough or camera textures. |
| Large Language Model (LLM) | LlmAgent, LlmAgentHelper | Manages text and multimodal chat using GPT, Llama, or similar models. |
| Speech-to-Text (STT) | SpeechToTextAgent | Converts user speech or audio clips into text. |
| Text-to-Speech (TTS) | TextToSpeechAgent | Generates natural-sounding voice from text. |
All four Building Blocks, each made up of one or more Agents and their helper classes, share a consistent interface and can be combined to form complex, multimodal pipelines.
Possible Bounding Box Misalignment
We have a fix ready for the upcoming release that will take into account different resolutions and aspect ratios of the Passthrough Camera. If you are currently experiencing an offset on your bounding boxes, set the resolution on your PassthroughCameraAccess component to x: 1280, y: 960, which was previously the default resolution.
Object Detection
- ObjectDetectionAgent: Runs inference and returns structured detection data.
- ObjectDetectionVisualizer: Renders bounding boxes or 3D meshes in the scene.
- The Agent captures a frame from PassthroughCamera.
- It sends the texture to the assigned Provider (for example, UnityInferenceEngineProvider or HuggingFaceProvider).
- The Provider returns detection results (bounding boxes, class labels, confidence scores).
- The Visualizer draws the results in world space.
How to place object detections in 3D space
When you import the Object Detection Building Block with its visualizer, the DepthTextureAccess class is imported as well. It provides the visualizer with depth data to turn 2D detections into real-world 3D positions.
- EnvironmentDepthManager: Retrieves per-eye environment depth textures from the Meta Quest system and makes them available globally for other components.
- DepthTextureAccess: Reads these depth textures from the GPU (via EnvironmentDepthManager), copies them into a CPU-accessible array, and exposes per-frame depth data.
- ObjectDetectionAgent: Uses the 2D bounding boxes from its model output and maps them to world space using the latest depth information from DepthTextureAccess.
- ObjectDetectionVisualizer: Renders 3D bounding boxes or markers in the scene at the world-space positions provided by ObjectDetectionAgent.
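The mapping step can be sketched as follows. Treat this as an illustration only: the per-pixel depth lookup (GetDepthAt) and the camera used for unprojection are hypothetical placeholders, so check the actual DepthTextureAccess API for the equivalent calls.

// Hypothetical sketch: place the center of a 2D bounding box in world space.
// "depthAccess.GetDepthAt" stands in for whatever per-pixel depth lookup
// DepthTextureAccess exposes; "passthroughCamera" is a Camera whose
// intrinsics match the detection texture.
Vector2 pixelCenter = boundingBox.center;
float depthMeters = depthAccess.GetDepthAt(pixelCenter);          // hypothetical
Ray ray = passthroughCamera.ScreenPointToRay(pixelCenter);
Vector3 worldPosition = ray.origin + ray.direction * depthMeters;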
An alternative, still valid solution is to use the EnvironmentRaycastManager component to cast a ray from the camera pixel into world space and find the distance to the object. However, the AI Building Blocks do not use this approach because it can become imprecise and expensive when done every frame for each point of the bounding box. Furthermore, if the user's head moves between inference and output retrieval, the raycast may hit the wrong point, since the object may no longer be in the user's frustum.
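For completeness, the raycast-based alternative looks roughly like this. The exact EnvironmentRaycastManager signature may differ between MRUK versions, so treat this as a sketch rather than a drop-in implementation.

// Sketch of the raycast alternative (not used by the AI Building Blocks).
// Assumes MRUK's EnvironmentRaycastManager with a Raycast(Ray, out hit) overload.
Ray ray = passthroughCamera.ScreenPointToRay(boundingBox.center);
if (environmentRaycastManager.Raycast(ray, out EnvironmentRaycastHit hit))
{
    // hit.point is the world-space position where the ray met the environment.
    Vector3 worldPosition = hit.point;
}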
You can subscribe to detection results in code:
agent.OnDetectionsUpdated += (detections) =>
{
foreach (var d in detections)
Debug.Log($"Detected {d.label} at {d.box}");
};
Large Language Model (LLM)
- LlmAgent: Manages text or multimodal (image and text) conversation flow.
- LlmAgentHelper: Connects LLMs/VLMs with speech agents (STT/TTS) or other
custom logic.
- SendPrompt() forwards user input to the Provider (Llama API, OpenAI, Ollama, and so on).
- The Provider streams text tokens asynchronously.
- The Agent emits OnResponseReceived and OnStreamUpdate events.
- Other Agents (like Text-to-Speech) can subscribe to stream updates for immediate feedback.
speechToTextAgent.OnTranscript += llmAgentHelper.SendPrompt;
llmAgent.OnResponseReceived += textToSpeechAgent.SpeakText;
Providers such as Llama 4 Maverick, OpenAI GPT-4o, or Claude 4 Sonnet on Replicate support image input. When a Provider implements IChatTask, textures are automatically encoded and sent alongside text prompts.
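A multimodal prompt could then be dispatched like this. The SendPrompt overload that accepts a texture and the frame-capture helper are assumptions here, so check the LlmAgent API for the real signatures.

// Hypothetical sketch: send a passthrough frame together with a text prompt.
// The Provider must implement IChatTask for the image to be encoded and sent.
Texture2D frame = CaptureCurrentPassthroughFrame();   // hypothetical helper
llmAgent.SendPrompt("What objects do you see in this image?", frame);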
Speech-to-Text (STT)
- SpeechToTextAgent: Captures microphone or audio clip input.
- Works with Providers implementing
ISpeechToTextTask.
- The Agent records audio or receives an AudioClip.
- Audio is encoded and sent to the Provider (OpenAI, ElevenLabs).
- The Provider transcribes it to text and triggers OnTranscript.
speechToTextAgent.OnTranscript += (string transcript) =>
{
Debug.Log($"User said: {transcript}");
};
- Supports real-time or clip-based transcription.
- Some Providers offer streaming recognition for live captions.
- Optional language code field for multilingual setups.
Text-to-Speech (TTS)
- TextToSpeechAgent: Converts text into audio clips using ITextToSpeechTask Providers.
- The Agent receives text via SpeakText().
- It sends the request to the Provider (for example, ElevenLabs, OpenAI TTS).
- The Provider returns or streams an AudioClip.
- The Agent plays it automatically or exposes it for custom playback.
llmAgent.OnResponseReceived += textToSpeechAgent.SpeakText;
| Event | Raised By | Description |
|---|---|---|
| onDetectionResponseReceived | ObjectDetectionAgent | Invoked after a detection pass with the processed BoxData list. |
| OnBoxesUpdated | ObjectDetectionAgent | C# event fired when boxes are updated; used by ObjectDetectionVisualizer. |
| onPromptSent | LlmAgent | Raised when a user prompt is dispatched to the Provider. |
| onResponseReceived | LlmAgent | Raised when a full assistant response is received. |
| onImageCaptured | LlmAgent | Raised when a passthrough or debug image is captured for a multimodal prompt. |
| onTranscript | SpeechToTextAgent | Emits the recognized transcript after STT completes. |
| onClipReady | TextToSpeechAgent | Fired when a synthesized AudioClip is ready for playback. |
| onSpeakStarting | TextToSpeechAgent | Fired just before playback, passing the text that will be spoken. |
| onSpeakFinished | TextToSpeechAgent | Fired when the AudioSource finishes or playback is stopped. |
- MRUK-only events (like OnBoxesUpdated) are wrapped in compile guards and are only available if MRUK is installed and the Passthrough Camera is available.
- LlmAgentHelper connects existing events (onPromptSent, onResponseReceived) for simplified wiring but doesn't introduce new ones.
All Agents use a unified public event-driven architecture. You can connect them
directly through the Unity Inspector or your custom logic.
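The inspector-exposed UnityEvents can also be wired from code with AddListener. A sketch, assuming onSpeakStarting carries the spoken text and onSpeakFinished carries no arguments (verify the actual field types on your Agent):

// Wire TTS lifecycle events from code instead of the Inspector.
textToSpeechAgent.onSpeakStarting.AddListener(text =>
    Debug.Log($"About to speak: {text}"));
textToSpeechAgent.onSpeakFinished.AddListener(() =>
    Debug.Log("Playback finished."));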
Example 1: Conversational Chain
[Microphone Input] → SpeechToTextAgent → LlmAgent → TextToSpeechAgent → [Audio Output]
Example 2: Vision + LLM Hybrid
[Camera Frame] → ObjectDetectionAgent → LlmAgent (context injection) → TextToSpeechAgent
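Example 1 can be assembled entirely from the event connections shown earlier. A minimal wiring component might look like this (the serialized field names are illustrative; assign the Agents in the Inspector):

using UnityEngine;

public class ConversationalChain : MonoBehaviour
{
    [SerializeField] private SpeechToTextAgent speechToTextAgent;
    [SerializeField] private LlmAgentHelper llmAgentHelper;
    [SerializeField] private LlmAgent llmAgent;
    [SerializeField] private TextToSpeechAgent textToSpeechAgent;

    private void OnEnable()
    {
        // Microphone → STT → LLM → TTS
        speechToTextAgent.OnTranscript += llmAgentHelper.SendPrompt;
        llmAgent.OnResponseReceived += textToSpeechAgent.SpeakText;
    }

    private void OnDisable()
    {
        speechToTextAgent.OnTranscript -= llmAgentHelper.SendPrompt;
        llmAgent.OnResponseReceived -= textToSpeechAgent.SpeakText;
    }
}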
Developers can subclass any Agent to add new behavior such as gesture input,
haptic feedback, or custom UI updates.
using System.Collections.Generic;
using UnityEngine;

public class MyCustomAgent : ObjectDetectionAgent
{
// Override HandleResults to inject custom logic after detection processing
// Use this pattern to add haptic feedback, audio cues, or custom UI updates
protected override void HandleResults(List<Detection> results)
{
// Call base implementation first to ensure proper detection handling
base.HandleResults(results);
// Add custom behavior that responds to detection results
foreach (var detection in results)
{
if (detection.confidence > 0.8f)
{
// Trigger haptic feedback, play sound, or update custom UI
Debug.Log($"High confidence detection: {detection.label}");
}
}
}
}
Because Agents are Provider-agnostic, switching from a
cloud model to an on-device model only requires assigning a new Provider
asset — no code changes.