Neural Graphics for Real-Time Rendering

Metal 4 Neural Graphics

Integrating Machine Learning into Your Rendering Pipeline

Machine learning is transforming graphics programming. Techniques like neural upscaling, asset compression, ambient occlusion, and procedural material generation push visual fidelity while reducing computational cost. Metal 4 makes these techniques accessible through native tensor support, a dedicated ML command encoder, and Shader ML for embedding inference directly in your shaders.

This article explores Metal 4’s machine learning architecture in depth, providing patterns for integrating neural networks into real-time rendering pipelines.

The convergence of graphics and machine learning

Traditional rendering pipelines rely on hand-crafted algorithms: SSAO for ambient occlusion, mipmapped textures for LOD, temporal anti-aliasing for smoothing. Each technique represents decades of graphics research distilled into efficient GPU implementations.

Machine learning offers an alternative: learn the mapping from inputs to outputs directly from data. A neural network trained on ground-truth ambient occlusion can predict occlusion values per pixel. A network trained on high-resolution textures can decompress compact latent representations. The results often match or exceed traditional techniques while enabling novel capabilities.

Metal 4 integrates ML at two levels. The MTL4MachineLearningCommandEncoder runs complete neural networks on the GPU timeline, synchronized with render and compute work. Shader ML embeds smaller networks directly in fragment, vertex, or compute shaders, eliminating memory round-trips between operations.

MTLTensor: the foundation of Metal ML

Metal 4 introduces MTLTensor as a first-class resource type alongside buffers and textures. Tensors are multi-dimensional data containers designed specifically for machine learning workloads.

Why tensors matter

Textures limit channels to four and impose format-dependent dimension constraints. Buffers provide raw memory but require manual indexing for multi-dimensional data. Tensors eliminate these limitations:

Arbitrary dimensionality: Rank-2 matrices, rank-3 feature maps, rank-4 batch tensors—whatever your network requires
Baked-in strides: Dimension and stride information embedded in the tensor simplifies indexing
Optimized layouts: Device-allocated tensors use opaque layouts optimized for ML operations

Creating tensors from devices

For best performance, create tensors directly from the Metal device:

let tensorDescriptor = MTLTensorDescriptor()
tensorDescriptor.dataType = .float16
// MTLTensorExtents wraps the extent list; its initializer is failable
tensorDescriptor.dimensions = MTLTensorExtents([1, 512, 512, 64])!  // Batch, Height, Width, Channels
tensorDescriptor.usage = .machineLearning

let featureTensor = try device.makeTensor(descriptor: tensorDescriptor)

Device-allocated tensors use an opaque, optimized layout—similar to how textures may be swizzled for cache efficiency. This layout provides the best performance for ML operations but requires copying data through staging buffers for CPU access.
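As a rough sizing check for the descriptor above, the following sketch computes the logical footprint of a tensor from its dimensions and element size. The helper name logicalByteCount is hypothetical, and because device-allocated tensors use an opaque layout, the real allocation may be larger than this figure.

```swift
// Hypothetical helper: logical (unpadded) size of a tensor in bytes.
// The opaque device layout may add padding, so treat this as a lower bound.
func logicalByteCount(dimensions: [Int], bytesPerElement: Int) -> Int {
    dimensions.reduce(bytesPerElement, *)
}

// The 1 x 512 x 512 x 64 half-precision feature tensor from the descriptor above.
let featureBytes = logicalByteCount(dimensions: [1, 512, 512, 64], bytesPerElement: 2)
print(featureBytes)                  // 33554432
print(featureBytes / (1024 * 1024))  // 32 (MiB)
```

At 32 MiB for a single feature tensor, a handful of full-resolution intermediates adds up quickly, which is one reason to keep channel counts modest.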

Creating tensors from buffers

When tensor data originates from buffers (network weights, CPU-generated inputs), create tensors with explicit strides:

// Buffer contains a 256x128 matrix in row-major order
let matrixBuffer = device.makeBuffer(length: 256 * 128 * MemoryLayout<Float16>.size,
                                      options: .storageModeShared)!

let tensorDescriptor = MTLTensorDescriptor()
tensorDescriptor.dataType = .float16
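// Metal lists extents and strides innermost dimension first
tensorDescriptor.dimensions = MTLTensorExtents([128, 256])!  // 128 columns (innermost), 256 rows
tensorDescriptor.strides = MTLTensorExtents([1, 128])!       // Column stride (always 1), row stride
tensorDescriptor.usage = .machineLearning

let weightsTensor = try matrixBuffer.makeTensor(descriptor: tensorDescriptor, offset: 0)

Strides account for padding in source data. They are given in elements, innermost dimension first, so the first stride is always 1. If your buffer contains 256 rows with 128 elements each plus 16 padding elements per row, set strides to [1, 144].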
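The padding example can be checked with a little arithmetic. This is a minimal sketch in plain Swift, with no Metal calls; the PaddedMatrixLayout type and its members are hypothetical names used only for illustration.

```swift
// Hypothetical description of a padded row-major matrix living in a buffer.
struct PaddedMatrixLayout {
    let rows: Int
    let columns: Int
    let paddingPerRow: Int   // unused elements after each row

    // Distance, in elements, between the starts of consecutive rows.
    var rowPitch: Int { columns + paddingPerRow }

    // Linear element index of (row, column) inside the buffer.
    func elementOffset(row: Int, column: Int) -> Int {
        precondition(row >= 0 && row < rows && column >= 0 && column < columns)
        return row * rowPitch + column
    }

    // Buffer length in bytes when every row, including the last, carries its padding.
    func byteLength(bytesPerElement: Int) -> Int {
        rows * rowPitch * bytesPerElement
    }
}

let layout = PaddedMatrixLayout(rows: 256, columns: 128, paddingPerRow: 16)
print(layout.rowPitch)                          // 144
print(layout.elementOffset(row: 2, column: 5))  // 293
print(layout.byteLength(bytesPerElement: 2))    // 73728
```

The row pitch is the value that goes into the row stride, and the byte length is the minimum size to request from makeBuffer for half-precision data laid out this way.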

Tensor usage flags

Specify how tensors will be used:

.machineLearning - Required for MTL4MachineLearningCommandEncoder
.compute - Required for Shader ML in compute shaders
.render - Required for Shader ML in vertex/fragment shaders
Access direction (read or write) is not a usage flag; MTLTensorUsage defines only the three values above

Combine flags when tensors serve multiple purposes.

MTL4MachineLearningCommandEncoder: networks on the GPU timeline

The ML command encoder integrates complete neural networks into Metal’s command buffer model. Networks execute alongside render and compute work, synchronized through standard Metal primitives.

The MTLPackage format

Networks run from .mtlpackage files, Metal’s optimized format for ML models. Create packages from Core ML models:

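# Python: Export PyTorch model to Core ML
import coremltools as ct
import torch

# ct.convert expects a traced (TorchScript) or exported model, not a raw nn.Module
example_input = torch.rand(1, 3, 512, 512)
traced_model = torch.jit.trace(pytorch_model.eval(), example_input)

coreml_model = ct.convert(
    traced_model,
    inputs=[ct.TensorType(shape=(1, 3, 512, 512), name="input")],
    outputs=[ct.TensorType(name="output")],
    convert_to='mlprogram',
    minimum_deployment_target=ct.target.macOS26
)

coreml_model.save('ambient_occlusion.mlpackage')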

Then convert to MTLPackage using the command-line tool:

metal-package-builder ambient_occlusion.mlpackage -o ambient_occlusion.mtlpackage

The conversion optimizes the network for Metal execution, fusing operations and selecting optimal implementations for Apple Silicon.

Loading and compiling networks

Load the package as a Metal library and compile for the current device:

// Load the package
let packageURL = Bundle.main.url(forResource: "ambient_occlusion", withExtension: "mtlpackage")!
let library = try device.makeLibrary(URL: packageURL)

// Create function descriptor for the main network
let functionDescriptor = MTL4LibraryFunctionDescriptor()
functionDescriptor.name = "main"
functionDescriptor.library = library

// Create pipeline descriptor
let pipelineDescriptor = MTL4MachineLearningPipelineDescriptor()
pipelineDescriptor.machineLearningFunctionDescriptor = functionDescriptor

// Set dynamic input dimensions if needed
pipelineDescriptor.setInputDimensions(MTLTensorExtents([1, 512, 512, 4])!, atBufferIndex: 0)

// Compile for this device
let compiler = try device.makeCompiler(descriptor: MTL4CompilerDescriptor())
let pipeline = try compiler.makeMachineLearningPipelineState(descriptor: pipelineDescriptor)

Compilation may take significant time for large networks. Compile during loading screens or startup, not during gameplay.

Dispatching network inference

Encode network execution into command buffers:

// Create intermediate storage heap
let heapDescriptor = MTLHeapDescriptor()
heapDescriptor.type = .placement
heapDescriptor.size = pipeline.intermediatesHeapSize

let intermediatesHeap = device.makeHeap(descriptor: heapDescriptor)!

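// Configure argument table with inputs and outputs
let argTableDescriptor = MTL4ArgumentTableDescriptor()
argTableDescriptor.maxBufferBindCount = 2
let argumentTable = try device.makeArgumentTable(descriptor: argTableDescriptor)
// Tensors are bound by resource ID in the buffer binding space
argumentTable.setResource(inputTensor.gpuResourceID, bufferIndex: 0)
argumentTable.setResource(outputTensor.gpuResourceID, bufferIndex: 1)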

// Create and configure encoder
let encoder = commandBuffer.makeMachineLearningCommandEncoder()
encoder.setPipelineState(pipeline)
encoder.setArgumentTable(argumentTable)
encoder.dispatchNetwork(intermediatesHeap: intermediatesHeap)
encoder.endEncoding()

The intermediates heap stores activations between network layers. Size it according to pipeline.intermediatesHeapSize. Reuse heaps across frames to avoid allocation overhead.

Synchronization with render and compute

Network execution integrates with Metal 4’s barrier system. The .machineLearning stage identifier enables precise synchronization:

// Wait for depth buffer before running ambient occlusion network
let computeEncoder = commandBuffer.makeComputeCommandEncoder()
computeEncoder.barrier(
    afterQueueStages: .fragment,
    beforeStages: .machineLearning,
    visibilityOptions: .device
)
computeEncoder.endEncoding()

// Dispatch neural ambient occlusion
let mlEncoder = commandBuffer.makeMachineLearningCommandEncoder()
// ... configure and dispatch ...
mlEncoder.endEncoding()

// Wait for network output before compositing
let renderEncoder = commandBuffer.makeRenderCommandEncoder(descriptor: compositePassDescriptor)
renderEncoder.barrier(
    afterQueueStages: .machineLearning,
    beforeStages: .fragment,
    visibilityOptions: .device
)
// ... composite final frame ...

This pattern enables parallel execution of non-dependent work while ensuring correct ordering for data dependencies.

Neural ambient occlusion example

A complete neural AO implementation:

class NeuralAmbientOcclusion {
    let pipeline: MTL4MachineLearningPipelineState
    let intermediatesHeap: MTLHeap
    let argumentTable: MTL4ArgumentTable  // inputTensor bound at index 0, outputTensor at index 1
    let inputTensor: MTLTensor   // View-space normals + depth
    let outputTensor: MTLTensor  // Per-pixel occlusion values

    // Initializer omitted: it creates the resources above and binds both
    // tensors with argumentTable.setResource(_:bufferIndex:)

    func encode(commandBuffer: MTL4CommandBuffer,
                depthTexture: MTLTexture,
                normalsTexture: MTLTexture) {

        // Copy textures to input tensor (could be done in prior compute pass).
        // Metal 4 has no separate blit encoder; copies are encoded on the
        // compute encoder. The copy calls below are schematic.
        let copyEncoder = commandBuffer.makeComputeCommandEncoder()!
        copyEncoder.copy(from: depthTexture, to: inputTensor, /* ... */)
        copyEncoder.copy(from: normalsTexture, to: inputTensor, /* ... */)
        copyEncoder.endEncoding()

        // Barrier: blit -> ML
        let syncEncoder = commandBuffer.makeComputeCommandEncoder()!
        syncEncoder.barrier(afterQueueStages: .blit, beforeStages: .machineLearning,
                            visibilityOptions: .device)
        syncEncoder.endEncoding()

        // Run inference; tensors reach the network through the argument table
        let mlEncoder = commandBuffer.makeMachineLearningCommandEncoder()!
        mlEncoder.setPipelineState(pipeline)
        mlEncoder.setArgumentTable(argumentTable)
        mlEncoder.dispatchNetwork(intermediatesHeap: intermediatesHeap)
        mlEncoder.endEncoding()
    }
}

Shader ML: neural networks inside shaders

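For smaller networks, Shader ML embeds inference directly in shader code. Inputs, activations, and outputs stay in registers or thread-local memory instead of making round-trips through device memory between operations. The saving matters most for networks with only a few layers, where memory traffic rather than arithmetic dominates the cost.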

When to use Shader ML

Shader ML excels for:

Neural material decompression: Small networks decoding latent textures per-pixel
Per-vertex deformation: Learned blend shapes or animation
Procedural generation: Neural noise, learned pattern synthesis
Custom activation functions: Complex learned nonlinearities

Networks should be small enough to execute within fragment/vertex/compute shader time budgets—typically a few matrix multiplications.
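To put a number on "a few matrix multiplications", the sketch below counts multiply-accumulate operations for a small fully connected network. The function name and the layer widths are illustrative assumptions, and the count ignores biases, activations, and overdraw.

```swift
// Multiply-accumulate operations for one forward pass of a fully connected
// network, given the width of each layer from input to output.
func macsPerInvocation(layerWidths: [Int]) -> Int {
    zip(layerWidths, layerWidths.dropFirst())
        .map { $0 * $1 }
        .reduce(0, +)
}

// Illustrative widths: 12 inputs, one hidden layer of 32, 6 outputs.
let perPixel = macsPerInvocation(layerWidths: [12, 32, 6])
let perFrame = perPixel * 1920 * 1080   // one invocation per pixel at 1080p
print(perPixel)   // 576
print(perFrame)   // 1194393600
```

A count like this is only a first-order estimate, but it makes it easy to see how widening a hidden layer scales the per-frame cost.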

Metal Performance Primitives

Metal 4 introduces Metal Performance Primitives (MPP) for efficient tensor operations in shaders:

#include <MetalPerformancePrimitives/MetalPerformancePrimitives.h>

using namespace mpp;

// Matrix multiplication configuration
constexpr tensor_ops::matmul2d_descriptor matmulDesc(
    /* M */ 1,
    /* N */ 64,      // Hidden layer width
    /* K */ 16,      // Input width
    /* left transpose */ false,
    /* right transpose */ true,
    /* reduced precision */ true
);

// Create operation for single-thread execution
tensor_ops::matmul2d<matmulDesc, execution_thread> matmulOp;

The execution_thread parameter specifies single-thread execution—appropriate for fragment shaders where each invocation processes one pixel. For uniform operations across SIMD groups or threadgroups, use execution_simdgroup or execution_threadgroup for better hardware utilization.

Neural material compression implementation

Neural material compression can shrink a material set to roughly half the footprint of block-compressed formats. The fragment shader samples latent textures, runs inference, and shades in a single pass:

#include <metal_tensor>
#include <MetalPerformancePrimitives/MetalPerformancePrimitives.h>

using namespace metal;
using namespace mpp;

// Network configuration
constant int INPUT_WIDTH = 12;   // 4 latent samples * 3 channels
constant int HIDDEN_WIDTH = 32;
constant int OUTPUT_WIDTH = 6;   // baseColor RGB + normal XYZ

// Matrix multiplication descriptors
constexpr tensor_ops::matmul2d_descriptor layer0Desc(1, HIDDEN_WIDTH, INPUT_WIDTH,
                                                      false, true, true);
constexpr tensor_ops::matmul2d_descriptor layer1Desc(1, OUTPUT_WIDTH, HIDDEN_WIDTH,
                                                      false, true, true);

[[fragment]]
float4 neuralMaterial(
    VertexOut in [[stage_in]],
    texture2d<half> latent0 [[texture(0)]],
    texture2d<half> latent1 [[texture(1)]],
    texture2d<half> latent2 [[texture(2)]],
    texture2d<half> latent3 [[texture(3)]],
    tensor<device half, dextents<int, 2>> layer0Weights [[buffer(0)]],
    tensor<device half, dextents<int, 2>> layer1Weights [[buffer(1)]],
    constant float3& lightDir [[buffer(2)]]
)
{
    constexpr sampler s(filter::linear);

    // Sample latent textures
    half3 l0 = latent0.sample(s, in.uv).rgb;
    half3 l1 = latent1.sample(s, in.uv).rgb;
    half3 l2 = latent2.sample(s, in.uv).rgb;
    half3 l3 = latent3.sample(s, in.uv).rgb;

    // Build input tensor
    half inputs[INPUT_WIDTH] = {
        l0.r, l0.g, l0.b,
        l1.r, l1.g, l1.b,
        l2.r, l2.g, l2.b,
        l3.r, l3.g, l3.b
    };
    auto inputTensor = tensor(inputs, extents<int, INPUT_WIDTH, 1>());

    // Layer 0: input -> hidden
    half hidden[HIDDEN_WIDTH];
    auto hiddenTensor = tensor(hidden, extents<int, HIDDEN_WIDTH, 1>());

    tensor_ops::matmul2d<layer0Desc, execution_thread> layer0Op;
    layer0Op.run(inputTensor, layer0Weights, hiddenTensor);

    // ReLU activation
    for (int i = 0; i < HIDDEN_WIDTH; ++i) {
        hidden[i] = max(half(0), hidden[i]);
    }

    // Layer 1: hidden -> output
    half outputs[OUTPUT_WIDTH];
    auto outputTensor = tensor(outputs, extents<int, OUTPUT_WIDTH, 1>());

    tensor_ops::matmul2d<layer1Desc, execution_thread> layer1Op;
    layer1Op.run(hiddenTensor, layer1Weights, outputTensor);

    // Extract decompressed material
    float3 baseColor = float3(outputs[0], outputs[1], outputs[2]);
    float3 tangentNormal = float3(outputs[3], outputs[4], outputs[5]);

    // Transform normal to world space
    float3 worldNormal = in.TBN * normalize(tangentNormal);

    // Simple diffuse shading
    float NdotL = saturate(dot(worldNormal, lightDir));
    return float4(baseColor * NdotL, 1.0);
}
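When bringing up a shader like this, a CPU reference is a convenient way to validate the decoded values. The sketch below mirrors the shader's structure (matmul, ReLU, matmul, no biases) in plain Swift. The function names are hypothetical, and it assumes each weight matrix is stored row-major as [outputs][inputs]; adjust the indexing if your exported weights use a different layout.

```swift
// CPU reference for a two-layer MLP (matmul, ReLU, matmul) with no biases.
// Assumption: each weight matrix is stored row-major as [outputs][inputs].
func denseLayer(_ input: [Float], weights: [[Float]]) -> [Float] {
    weights.map { row -> Float in
        precondition(row.count == input.count, "weight row must match input width")
        return zip(row, input).reduce(Float(0)) { $0 + $1.0 * $1.1 }
    }
}

func relu(_ values: [Float]) -> [Float] {
    values.map { max(0, $0) }
}

func referenceMLP(input: [Float], layer0: [[Float]], layer1: [[Float]]) -> [Float] {
    denseLayer(relu(denseLayer(input, weights: layer0)), weights: layer1)
}

// Tiny worked example: 2 inputs, 2 hidden units, 1 output.
// hidden = relu([1*1 - 1*2, 0.5*1 + 0.5*2]) = [0, 1.5]; output = 2*0 + 4*1.5
let out = referenceMLP(input: [1, 2],
                       layer0: [[1, -1], [0.5, 0.5]],
                       layer1: [[2, 4]])
print(out)   // [6.0]
```

Because the shader runs in half precision with reduced-precision matmuls enabled, compare GPU results against this reference with a tolerance rather than for exact equality.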

Execution group selection

Choose the appropriate execution group based on control flow:

// Single thread - for divergent operations or non-uniform data
tensor_ops::matmul2d<desc, execution_thread> threadOp;

// SIMD group - when all threads in simdgroup execute same operation on same data
tensor_ops::matmul2d<desc, execution_simdgroup> simdOp;

// Threadgroup - when entire threadgroup cooperates on single operation
tensor_ops::matmul2d<desc, execution_threadgroup> groupOp;

Fragment shaders typically use execution_thread since each fragment may sample different data. Compute shaders processing uniform data can leverage larger execution groups for better hardware utilization.

Inline tensor creation

Create tensors directly in shader code without buffer backing:

// Create tensor from local array
float localData[16] = { /* ... */ };
auto localTensor = tensor(localData, extents<int, 16, 1>());

// Create tensor from sampled values
half4 sample = texture.sample(sampler, uv);
half sampleArray[4] = { sample.r, sample.g, sample.b, sample.a };
auto sampleTensor = tensor(sampleArray, extents<int, 4, 1>());

Inline tensors assume tight packing (stride = 1 for innermost dimension). Use them for intermediate values during shader-local inference.

Debugging ML workloads

Xcode 26 introduces comprehensive debugging tools for Metal 4 ML:

ML Network Debugger

Capture GPU traces and inspect network execution:

Capture trace: Click Metal icon in Xcode, then Capture
Navigate to ML encoder: Find MTL4MachineLearningCommandEncoder in command list
Open network debugger: Double-click Network in bound resources

The debugger visualizes network structure as a graph, with operations as nodes and data flow as edges. Click any operation to inspect:

Operation type and attributes
Input/output tensor shapes
Intermediate tensor values (click preview to open tensor viewer)

Tensor viewer

Inspect tensor contents visually:

2D tensors display as grayscale/heatmap images
Higher-rank tensors show slice selection UI
Value inspection at specific coordinates

Dependency viewer

Verify synchronization between ML and other work:

Open Dependencies view (icon at top-left of trace)
Locate ML command buffer
Verify barriers connect dependent passes

Look for:

Barriers from preceding stages to .machineLearning
Barriers from .machineLearning to consuming stages
No unexpected parallelism with dependent work

Performance optimization

Network architecture considerations

Design networks for real-time constraints:

Minimize layer count: Each layer adds latency
Use half precision: Float16 halves memory traffic relative to Float32 and typically raises arithmetic throughput on Apple Silicon
Reduce channel counts: Smaller hidden layers = faster inference
Avoid expensive operations: Softmax, layer norm add overhead

Memory optimization

Reduce memory traffic:

Reuse intermediate heaps: Allocate once, reuse across frames
Batch inference when possible: Process multiple inputs per dispatch
Use Shader ML for small networks: Eliminates round-trips to device memory

Profiling

Use Instruments Metal System Trace:

ML encoder duration: Time from dispatch to completion
Barrier wait times: Identify synchronization bottlenecks
Memory bandwidth: Check for unexpected memory traffic

Target network execution times well under frame budget—leave headroom for render and compute work.
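One way to make "well under frame budget" concrete is to assign the ML pass a fixed share of the frame. The helper below is a sketch; its name and the 25% share are assumptions chosen for illustration, not a recommendation from Apple.

```swift
// Hypothetical budgeting helper: how many milliseconds of GPU time an ML pass
// may use, given a target frame rate and the share of the frame it is allowed.
func mlBudgetMilliseconds(targetFPS: Double, share: Double) -> Double {
    precondition(targetFPS > 0 && (0...1).contains(share))
    return 1000.0 / targetFPS * share
}

let budget60 = mlBudgetMilliseconds(targetFPS: 60, share: 0.25)
let budget120 = mlBudgetMilliseconds(targetFPS: 120, share: 0.25)
print(budget60)   // about 4.17 ms
print(budget120)  // about 2.08 ms
```

Comparing the ML encoder duration from Metal System Trace against a figure like this gives a simple pass or fail signal during profiling.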

Integration patterns

Hybrid rendering with neural enhancement

Combine traditional rendering with neural post-processing:

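func renderFrame() {
    // Traditional G-buffer pass
    encodeGBufferPass(commandBuffer)

    // Neural ambient occlusion
    neuralAO.encode(commandBuffer: commandBuffer, depthTexture: depthBuffer, normalsTexture: normalBuffer)

    // Traditional lighting with neural AO (the pass reads the AO tensor directly)
    encodeLightingPass(commandBuffer, aoTensor: neuralAO.outputTensor)

    // Neural upscaling
    neuralUpscaler.encode(commandBuffer, inputTexture: lightingResult)

    // Present: MTL4CommandBuffer has no present(_:), so order the work against
    // the drawable on the MTL4CommandQueue and present the drawable itself
    commandQueue.waitForDrawable(drawable)
    commandQueue.commit([commandBuffer])
    commandQueue.signalDrawable(drawable)
    drawable.present()
}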

Dynamic network selection

Switch networks based on performance requirements:

class AdaptiveNeuralRenderer {
    let highQualityPipeline: MTL4MachineLearningPipelineState
    let performancePipeline: MTL4MachineLearningPipelineState
    var currentPipeline: MTL4MachineLearningPipelineState

    func update(frameTiming: FrameTiming) {
        if frameTiming.gpuTime > targetFrameTime * 0.8 {
            currentPipeline = performancePipeline
        } else if frameTiming.gpuTime < targetFrameTime * 0.6 {
            currentPipeline = highQualityPipeline
        }
    }
}
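The gap between the 0.8 and 0.6 thresholds acts as a dead band that keeps the selection from flipping every frame. The sketch below isolates that logic in plain Swift so it can be exercised without a GPU; QualityTier and TierSelector are hypothetical names standing in for the two pipeline states.

```swift
enum QualityTier { case highQuality, performance }

// Hysteresis selector: the two thresholds leave a dead band in which the
// current tier is kept, so the choice does not flip every frame.
struct TierSelector {
    var current: QualityTier = .highQuality
    let targetFrameTime: Double          // seconds
    let downgradeFraction = 0.8
    let upgradeFraction = 0.6

    mutating func update(gpuTime: Double) {
        if gpuTime > targetFrameTime * downgradeFraction {
            current = .performance
        } else if gpuTime < targetFrameTime * upgradeFraction {
            current = .highQuality
        }
        // Between the two thresholds: leave `current` unchanged.
    }
}

var selector = TierSelector(targetFrameTime: 1.0 / 60.0)
selector.update(gpuTime: 0.015)   // above 80% of 16.7 ms -> performance
selector.update(gpuTime: 0.012)   // inside the dead band -> still performance
selector.update(gpuTime: 0.009)   // below 60% of 16.7 ms -> highQuality
```

In practice you would likely feed this a smoothed GPU time (for example, an average over several frames) so that a single slow frame does not trigger a downgrade.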

Fallback for unsupported devices

Provide traditional implementations for devices without ML support:

protocol AmbientOcclusionProvider {
    func encode(commandBuffer: MTL4CommandBuffer, depth: MTLTexture, normals: MTLTexture)
    var outputTexture: MTLTexture { get }
}

class NeuralAO: AmbientOcclusionProvider { /* ML implementation */ }
class SSAO: AmbientOcclusionProvider { /* Traditional SSAO */ }

// Select implementation based on device capabilities
// (.metal4 is the GPU family for Metal 4 features such as tensors and the ML encoder)
let aoProvider: AmbientOcclusionProvider = device.supportsFamily(.metal4)
    ? NeuralAO(device: device)
    : SSAO(device: device)

Conclusion

Metal 4’s machine learning integration represents a fundamental shift in graphics programming. Tensors provide natural containers for ML data, the ML command encoder synchronizes network execution with rendering, and Shader ML embeds inference directly in shaders for maximum efficiency.

These capabilities enable techniques previously impractical for real-time rendering: neural material compression that roughly halves the footprint of block-compressed material sets, ambient occlusion learned from ground-truth reference data, and adaptive upscaling that preserves visual fidelity while reducing render resolution.

The patterns presented here—network preparation workflows, synchronization strategies, Shader ML implementations—provide a foundation for integrating ML into production rendering pipelines. Combined with Xcode’s new debugging tools, developers have everything needed to bring neural graphics to their applications.

The final article in this series will explore Apple Silicon-specific optimizations, covering TBDR architecture exploitation, unified memory strategies, and M4-specific performance tuning.