Metal 4 Neural Graphics
Machine learning is transforming graphics programming. Techniques like neural upscaling, asset compression, ambient occlusion, and procedural material generation push visual fidelity while reducing computational cost. Metal 4 makes these techniques accessible through native tensor support, a dedicated ML command encoder, and Shader ML for embedding inference directly in your shaders.
This article explores Metal 4’s machine learning architecture in depth, providing patterns for integrating neural networks into real-time rendering pipelines.
The convergence of graphics and machine learning
Traditional rendering pipelines rely on hand-crafted algorithms: SSAO for ambient occlusion, mipmapped textures for LOD, temporal anti-aliasing for smoothing. Each technique represents decades of graphics research distilled into efficient GPU implementations.
Machine learning offers an alternative: learn the mapping from inputs to outputs directly from data. A neural network trained on ground-truth ambient occlusion can predict occlusion values per pixel. A network trained on high-resolution textures can decompress compact latent representations. The results often match or exceed traditional techniques while enabling novel capabilities.
Metal 4 integrates ML at two levels. The MTL4MachineLearningCommandEncoder runs complete neural networks on the GPU timeline, synchronized with render and compute work. Shader ML embeds smaller networks directly in fragment, vertex, or compute shaders, eliminating memory round-trips between operations.
MTLTensor: the foundation of Metal ML
Metal 4 introduces MTLTensor as a first-class resource type alongside buffers and textures. Tensors are multi-dimensional data containers designed specifically for machine learning workloads.
Why tensors matter
Textures limit channels to four and impose format-dependent dimension constraints. Buffers provide raw memory but require manual indexing for multi-dimensional data. Tensors eliminate these limitations:
- Arbitrary dimensionality: Rank-2 matrices, rank-3 feature maps, rank-4 batch tensors—whatever your network requires
- Baked-in strides: Dimension and stride information embedded in the tensor simplifies indexing
- Optimized layouts: Device-allocated tensors use opaque layouts optimized for ML operations
Creating tensors from devices
For best performance, create tensors directly from the Metal device:
let tensorDescriptor = MTLTensorDescriptor()
tensorDescriptor.dataType = .float16
tensorDescriptor.dimensions = [1, 512, 512, 64] // Batch, Height, Width, Channels
tensorDescriptor.usage = [.machineLearning, .read, .write]
let featureTensor = try device.makeTensor(descriptor: tensorDescriptor)
Device-allocated tensors use an opaque, optimized layout—similar to how textures may be swizzled for cache efficiency. This layout provides the best performance for ML operations but requires copying data through staging buffers for CPU access.
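To get a feel for what a tensor of this shape costs, the logical payload can be computed from its extents. The sketch below is plain Swift with no Metal calls. The helper name `logicalByteCount` is hypothetical, and because the device layout is opaque, the real allocation may be larger than this figure.

```swift
// Logical payload size of a tensor: product of its extents times bytes per element.
// The device's opaque layout may add padding, so treat this as a lower bound.
func logicalByteCount(dimensions: [Int], bytesPerElement: Int) -> Int {
    return dimensions.reduce(1, *) * bytesPerElement
}

// The feature tensor above: 1 x 512 x 512 x 64 elements of Float16 (2 bytes each).
let featureBytes = logicalByteCount(dimensions: [1, 512, 512, 64], bytesPerElement: 2)
print(featureBytes)                  // 33554432
print(featureBytes / (1024 * 1024))  // 32 (MiB)
```

At 32 MiB for a single feature map, a handful of full-resolution tensors adds up quickly, which is one reason to keep channel counts modest.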
Creating tensors from buffers
When tensor data originates from buffers (network weights, CPU-generated inputs), create tensors with explicit strides:
// Buffer contains a 256x128 matrix in row-major order
let matrixBuffer = device.makeBuffer(length: 256 * 128 * MemoryLayout<Float16>.size,
options: .storageModeShared)!
let tensorDescriptor = MTLTensorDescriptor()
tensorDescriptor.dataType = .float16
tensorDescriptor.dimensions = [256, 128]
tensorDescriptor.strides = [128, 1] // Row stride, column stride (innermost = 1)
tensorDescriptor.usage = [.machineLearning, .read]
let weightsTensor = try matrixBuffer.makeTensor(descriptor: tensorDescriptor, offset: 0)
Strides account for padding in source data. If your buffer contains 256 rows with 128 elements each plus 16 padding elements per row, set strides to [144, 1].
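The stride arithmetic can be checked on the CPU before any tensor is created. The following is a minimal sketch in plain Swift, assuming the outermost-first ordering used in the example above. `rowMajorStrides` and `elementOffset` are hypothetical helpers, not Metal API, and the ordering that MTLTensorDescriptor expects should be confirmed against its documentation.

```swift
// Row-major strides in elements, outermost dimension first, with optional
// padding elements appended to each innermost row.
func rowMajorStrides(dimensions: [Int], rowPadding: Int = 0) -> [Int] {
    precondition(!dimensions.isEmpty, "a tensor needs at least one dimension")
    var strides = [Int](repeating: 1, count: dimensions.count)
    guard dimensions.count >= 2 else { return strides }
    // The second-innermost stride is the padded row length.
    strides[dimensions.count - 2] = dimensions[dimensions.count - 1] + rowPadding
    // Every outer stride is the next-inner stride times the next-inner extent.
    for i in stride(from: dimensions.count - 3, through: 0, by: -1) {
        strides[i] = strides[i + 1] * dimensions[i + 1]
    }
    return strides
}

// Linear element offset of a multi-dimensional index.
func elementOffset(index: [Int], strides: [Int]) -> Int {
    precondition(index.count == strides.count, "index and strides must have equal rank")
    return zip(index, strides).reduce(0) { $0 + $1.0 * $1.1 }
}

let tight = rowMajorStrides(dimensions: [256, 128])                   // [128, 1]
let padded = rowMajorStrides(dimensions: [256, 128], rowPadding: 16)  // [144, 1]
let offset = elementOffset(index: [2, 5], strides: padded)            // 2 * 144 + 5 = 293
```

With padded rows, the buffer must hold at least the offset of the last element plus one, so the 16 padding elements per row count toward the allocation even though the tensor never reads them.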
Tensor usage flags
Specify how tensors will be used:
- .machineLearning: Required for MTL4MachineLearningCommandEncoder
- .compute: Required for Shader ML in compute shaders
- .render: Required for Shader ML in vertex/fragment shaders
- .read / .write: Standard access patterns
Combine flags when tensors serve multiple purposes.
MTL4MachineLearningCommandEncoder: networks on the GPU timeline
The ML command encoder integrates complete neural networks into Metal’s command buffer model. Networks execute alongside render and compute work, synchronized through standard Metal primitives.
The MTLPackage format
Networks run from .mtlpackage files, Metal’s optimized format for ML models. Create packages from Core ML models:
# Python: Export PyTorch model to Core ML
# pytorch_model must already be traced (torch.jit.trace) or exported;
# coremltools does not accept a plain nn.Module.
import coremltools as ct
coreml_model = ct.convert(
    pytorch_model,
    inputs=[ct.TensorType(shape=(1, 3, 512, 512), name="input")],
    outputs=[ct.TensorType(name="output")],
    convert_to='mlprogram',
    minimum_deployment_target=ct.target.macOS26
)
coreml_model.save('ambient_occlusion.mlpackage')
Then convert to MTLPackage using the command-line tool:
metal-package-builder ambient_occlusion.mlpackage -o ambient_occlusion.mtlpackage
The conversion optimizes the network for Metal execution, fusing operations and selecting optimal implementations for Apple Silicon.
Loading and compiling networks
Load the package as a Metal library and compile for the current device:
// Load the package
let packageURL = Bundle.main.url(forResource: "ambient_occlusion", withExtension: "mtlpackage")!
let library = try device.makeLibrary(URL: packageURL)
// Create function descriptor for the main network
let functionDescriptor = MTL4LibraryFunctionDescriptor()
functionDescriptor.name = "main"
functionDescriptor.library = library
// Create pipeline descriptor
let pipelineDescriptor = MTL4MachineLearningPipelineDescriptor()
pipelineDescriptor.machineLearningFunctionDescriptor = functionDescriptor
// Set dynamic input dimensions if needed
pipelineDescriptor.setInputDimensions([1, 512, 512, 4], atBufferIndex: 0)
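// Compile for this device (create the MTL4Compiler once and reuse it)
let compiler = try device.makeCompiler(descriptor: MTL4CompilerDescriptor())
let pipeline = try compiler.makeMachineLearningPipelineState(descriptor: pipelineDescriptor)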
Compilation may take significant time for large networks. Compile during loading screens or startup, not during gameplay.
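One way to enforce "compile at load time, never during gameplay" is to separate the warm-up step from the lookup step. The sketch below is plain Swift with no Metal calls. `PrecompiledCache` is a hypothetical helper, and the build closure stands in for the call to makeMachineLearningPipelineState.

```swift
// Hypothetical helper: builds every value up front and only serves lookups afterwards.
final class PrecompiledCache<Key: Hashable, Value> {
    private var storage: [Key: Value] = [:]

    // Call during loading: runs the (possibly slow) build closure for each missing key.
    func warmUp(keys: [Key], build: (Key) throws -> Value) rethrows {
        for key in keys where storage[key] == nil {
            storage[key] = try build(key)
        }
    }

    // Call during gameplay: never builds, returns nil if the key was not warmed up.
    func value(for key: Key) -> Value? {
        return storage[key]
    }
}

var compileCount = 0
let cache = PrecompiledCache<String, Int>()
cache.warmUp(keys: ["ambient_occlusion", "upscaler"]) { name in
    compileCount += 1   // stands in for an expensive pipeline compilation
    return name.count   // placeholder for the compiled pipeline state
}
print(compileCount)     // 2
```

Because the lookup path returns nil instead of compiling on demand, a missing pipeline shows up as an explicit failure during testing rather than as a frame hitch in production.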
Dispatching network inference
Encode network execution into command buffers:
// Create intermediate storage heap
let heapDescriptor = MTLHeapDescriptor()
heapDescriptor.type = .placement
heapDescriptor.size = pipeline.intermediatesHeapSize
let intermediatesHeap = device.makeHeap(descriptor: heapDescriptor)!
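// Configure argument table with inputs and outputs (tensors bind at buffer indices)
let argTableDescriptor = MTL4ArgumentTableDescriptor()
argTableDescriptor.maxBufferBindingCount = 2
let argumentTable = try device.makeArgumentTable(descriptor: argTableDescriptor)
argumentTable.setResource(inputTensor.gpuResourceID, bufferIndex: 0)
argumentTable.setResource(outputTensor.gpuResourceID, bufferIndex: 1)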
// Create and configure encoder
let encoder = commandBuffer.makeMachineLearningCommandEncoder()
encoder.setPipelineState(pipeline)
encoder.setArgumentTable(argumentTable)
encoder.dispatchNetwork(intermediatesHeap: intermediatesHeap)
encoder.endEncoding()
The intermediates heap stores activations between network layers. Size it according to pipeline.intermediatesHeapSize. Reuse heaps across frames to avoid allocation overhead.
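If several frames can be in flight at once, one conservative way to reuse heaps is a small ring with one slot per in-flight frame. The sketch below is plain Swift. `FrameRing` is a hypothetical helper, and strings stand in for MTLHeap objects. Whether a single shared heap would be enough depends on how consecutive frames are synchronized, which this article does not specify.

```swift
// Hypothetical helper: allocates `count` resources once and hands them out round-robin.
struct FrameRing<Resource> {
    private let slots: [Resource]

    init(count: Int, make: (Int) -> Resource) {
        precondition(count > 0, "need at least one slot")
        slots = (0..<count).map(make)
    }

    // The same slot comes back every `count` frames, so nothing is reallocated.
    // Assumes frameIndex is non-negative.
    func resource(forFrame frameIndex: Int) -> Resource {
        return slots[frameIndex % slots.count]
    }
}

var allocations = 0
let heaps = FrameRing(count: 3) { slot -> String in
    allocations += 1          // stands in for device.makeHeap(descriptor:)
    return "heap-\(slot)"
}
print(heaps.resource(forFrame: 4))  // heap-1
print(allocations)                  // 3
```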
Synchronization with render and compute
Network execution integrates with Metal 4’s barrier system. The .machineLearning stage identifier enables precise synchronization:
// Wait for depth buffer before running ambient occlusion network
let computeEncoder = commandBuffer.makeComputeCommandEncoder()
computeEncoder.barrier(
afterQueueStages: .fragment,
beforeStages: .machineLearning,
visibilityOptions: .device
)
computeEncoder.endEncoding()
// Dispatch neural ambient occlusion
let mlEncoder = commandBuffer.makeMachineLearningCommandEncoder()
// ... configure and dispatch ...
mlEncoder.endEncoding()
// Wait for network output before compositing
let renderEncoder = commandBuffer.makeRenderCommandEncoder(descriptor: compositePassDescriptor)
renderEncoder.barrier(
afterQueueStages: .machineLearning,
beforeStages: .fragment,
visibilityOptions: .device
)
// ... composite final frame ...
This pattern enables parallel execution of non-dependent work while ensuring correct ordering for data dependencies.
Neural ambient occlusion example
A sketch of a neural AO pass (initializer and texture-to-tensor copy arguments elided):
class NeuralAmbientOcclusion {
    let pipeline: MTL4MachineLearningPipelineState
    let intermediatesHeap: MTLHeap
    let argumentTable: MTL4ArgumentTable // Binds inputTensor at index 0, outputTensor at index 1
    let inputTensor: MTLTensor // View-space normals + depth
    let outputTensor: MTLTensor // Per-pixel occlusion values

    func encode(commandBuffer: MTL4CommandBuffer,
                depthTexture: MTLTexture,
                normalsTexture: MTLTexture) {
        // Copy textures to input tensor (could be done in prior compute pass)
        let blitEncoder = commandBuffer.makeBlitCommandEncoder()
        blitEncoder.copy(from: depthTexture, to: inputTensor, /* ... */)
        blitEncoder.copy(from: normalsTexture, to: inputTensor, /* ... */)
        blitEncoder.endEncoding()

        // Barrier: blit -> ML
        let syncEncoder = commandBuffer.makeComputeCommandEncoder()
        syncEncoder.barrier(afterQueueStages: .blit, beforeStages: .machineLearning,
                            visibilityOptions: .device)
        syncEncoder.endEncoding()

        // Run inference. The ML encoder takes its tensors from the argument table,
        // matching the dispatch example earlier in this article.
        let mlEncoder = commandBuffer.makeMachineLearningCommandEncoder()
        mlEncoder.setPipelineState(pipeline)
        mlEncoder.setArgumentTable(argumentTable)
        mlEncoder.dispatchNetwork(intermediatesHeap: intermediatesHeap)
        mlEncoder.endEncoding()
    }
}
Shader ML: neural networks inside shaders
For smaller networks, Shader ML embeds inference directly in shader code. This eliminates memory round-trips between operations, dramatically improving efficiency for networks with few layers.
When to use Shader ML
Shader ML excels for:
- Neural material decompression: Small networks decoding latent textures per-pixel
- Per-vertex deformation: Learned blend shapes or animation
- Procedural generation: Neural noise, learned pattern synthesis
- Custom activation functions: Complex learned nonlinearities
Networks should be small enough to execute within fragment/vertex/compute shader time budgets—typically a few matrix multiplications.
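To make "a few matrix multiplications" concrete, it helps to count multiply-accumulate operations (MACs) per shader invocation. The sketch below is plain Swift. `macsPerInvocation` is a hypothetical helper, and the layer widths 12, 32, and 6 are the ones used by the material network later in this article. The 1080p resolution is an illustrative assumption.

```swift
// Multiply-accumulate count for one forward pass of a fully connected network.
// `widths` lists layer sizes from input to output, for example [12, 32, 6].
func macsPerInvocation(widths: [Int]) -> Int {
    return zip(widths, widths.dropFirst()).reduce(0) { $0 + $1.0 * $1.1 }
}

let perPixel = macsPerInvocation(widths: [12, 32, 6])  // 12*32 + 32*6 = 576
let perFrame = perPixel * 1920 * 1080                  // MACs if every 1080p pixel runs the network
let weightBytes = perPixel * 2                         // Float16 weights, no biases

print(perPixel)     // 576
print(perFrame)     // 1194393600
print(weightBytes)  // 1152
```

Roughly 1.2 billion MACs per frame is a useful order of magnitude to compare against the rest of the fragment workload, and the full weight set at about 1 KB is small enough to stay cached.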
Metal Performance Primitives
Metal 4 introduces Metal Performance Primitives (MPP) for efficient tensor operations in shaders:
#include <MetalPerformancePrimitives/MetalPerformancePrimitives.h>
using namespace mpp;
// Matrix multiplication configuration
constexpr tensor_ops::matmul2d_descriptor matmulDesc(
/* M */ 1,
/* N */ 64, // Hidden layer width
/* K */ 16, // Input width
/* left transpose */ false,
/* right transpose */ true,
/* reduced precision */ true
);
// Create operation for single-thread execution
tensor_ops::matmul2d<matmulDesc, execution_thread> matmulOp;
The execution_thread parameter specifies single-thread execution—appropriate for fragment shaders where each invocation processes one pixel. For uniform operations across SIMD groups or threadgroups, use execution_simdgroup or execution_threadgroup for better hardware utilization.
Neural material compression implementation
Neural materials can shrink a material set to roughly half the size of its block-compressed equivalent. The fragment shader samples latent textures, runs inference, and shades in a single draw:
#include <metal_tensor>
#include <MetalPerformancePrimitives/MetalPerformancePrimitives.h>
using namespace metal;
using namespace mpp;
// Network configuration
constant int INPUT_WIDTH = 12; // 4 latent samples * 3 channels
constant int HIDDEN_WIDTH = 32;
constant int OUTPUT_WIDTH = 6; // baseColor RGB + normal XYZ
// Matrix multiplication descriptors
constexpr tensor_ops::matmul2d_descriptor layer0Desc(1, HIDDEN_WIDTH, INPUT_WIDTH,
false, true, true);
constexpr tensor_ops::matmul2d_descriptor layer1Desc(1, OUTPUT_WIDTH, HIDDEN_WIDTH,
false, true, true);
[[fragment]]
float4 neuralMaterial(
VertexOut in [[stage_in]],
texture2d<half> latent0 [[texture(0)]],
texture2d<half> latent1 [[texture(1)]],
texture2d<half> latent2 [[texture(2)]],
texture2d<half> latent3 [[texture(3)]],
tensor<device half, dextents<int, 2>> layer0Weights [[buffer(0)]],
tensor<device half, dextents<int, 2>> layer1Weights [[buffer(1)]],
constant float3& lightDir [[buffer(2)]]
)
{
constexpr sampler s(filter::linear);
// Sample latent textures
half3 l0 = latent0.sample(s, in.uv).rgb;
half3 l1 = latent1.sample(s, in.uv).rgb;
half3 l2 = latent2.sample(s, in.uv).rgb;
half3 l3 = latent3.sample(s, in.uv).rgb;
// Build input tensor
half inputs[INPUT_WIDTH] = {
l0.r, l0.g, l0.b,
l1.r, l1.g, l1.b,
l2.r, l2.g, l2.b,
l3.r, l3.g, l3.b
};
auto inputTensor = tensor(inputs, extents<int, INPUT_WIDTH, 1>());
// Layer 0: input -> hidden
half hidden[HIDDEN_WIDTH];
auto hiddenTensor = tensor(hidden, extents<int, HIDDEN_WIDTH, 1>());
tensor_ops::matmul2d<layer0Desc, execution_thread> layer0Op;
layer0Op.run(inputTensor, layer0Weights, hiddenTensor);
// ReLU activation
for (int i = 0; i < HIDDEN_WIDTH; ++i) {
hidden[i] = max(half(0), hidden[i]);
}
// Layer 1: hidden -> output
half outputs[OUTPUT_WIDTH];
auto outputTensor = tensor(outputs, extents<int, OUTPUT_WIDTH, 1>());
tensor_ops::matmul2d<layer1Desc, execution_thread> layer1Op;
layer1Op.run(hiddenTensor, layer1Weights, outputTensor);
// Extract decompressed material
float3 baseColor = float3(outputs[0], outputs[1], outputs[2]);
float3 tangentNormal = float3(outputs[3], outputs[4], outputs[5]);
// Transform normal to world space
float3 worldNormal = in.TBN * normalize(tangentNormal);
// Simple diffuse shading
float NdotL = saturate(dot(worldNormal, lightDir));
return float4(baseColor * NdotL, 1.0);
}
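A CPU reference of the same two-layer network is useful for validating shader output against known weights. The sketch below is plain Swift using Float for portability. `denseLayer` and `referenceMaterial` are hypothetical helpers. It assumes that each weight matrix is stored row-major as [outputWidth][inputWidth], which is one reading of the right-transpose flag in the descriptors above and should be checked against how the weights are actually exported.

```swift
// CPU reference for one dense layer: y[n] = sum over k of x[k] * W[n][k],
// with W stored row-major as [outputWidth][inputWidth] (an assumption).
func denseLayer(_ input: [Float], weights: [Float], outputWidth: Int, relu: Bool) -> [Float] {
    precondition(weights.count == outputWidth * input.count, "weight count must be N * K")
    return (0..<outputWidth).map { (n: Int) -> Float in
        var sum: Float = 0
        for k in 0..<input.count {
            sum += input[k] * weights[n * input.count + k]
        }
        return relu ? max(0, sum) : sum
    }
}

let hiddenWidth = 32  // matches HIDDEN_WIDTH in the shader
let outputWidth = 6   // matches OUTPUT_WIDTH in the shader

// 12 latent values -> 32 hidden (ReLU) -> 6 outputs (base color RGB + normal XYZ).
func referenceMaterial(latents: [Float], layer0: [Float], layer1: [Float]) -> [Float] {
    let hidden = denseLayer(latents, weights: layer0, outputWidth: hiddenWidth, relu: true)
    return denseLayer(hidden, weights: layer1, outputWidth: outputWidth, relu: false)
}

let small = denseLayer([1, 2], weights: [1, 0, 0, 1, -1, -1], outputWidth: 3, relu: true)
print(small)  // [1.0, 2.0, 0.0]
```

Comparing a few pixels of shader output against this reference, within half-precision tolerance, quickly reveals a transposed or mis-strided weight tensor.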
Execution group selection
Choose the appropriate execution group based on control flow:
// Single thread - for divergent operations or non-uniform data
tensor_ops::matmul2d<desc, execution_thread> threadOp;
// SIMD group - when all threads in simdgroup execute same operation on same data
tensor_ops::matmul2d<desc, execution_simdgroup> simdOp;
// Threadgroup - when entire threadgroup cooperates on single operation
tensor_ops::matmul2d<desc, execution_threadgroup> groupOp;
Fragment shaders typically use execution_thread since each fragment may sample different data. Compute shaders processing uniform data can leverage larger execution groups for better hardware utilization.
Inline tensor creation
Create tensors directly in shader code without buffer backing:
// Create tensor from local array
float localData[16] = { /* ... */ };
auto localTensor = tensor(localData, extents<int, 16, 1>());
// Create tensor from sampled values
half4 sample = texture.sample(sampler, uv);
half sampleArray[4] = { sample.r, sample.g, sample.b, sample.a };
auto sampleTensor = tensor(sampleArray, extents<int, 4, 1>());
Inline tensors assume tight packing (stride = 1 for innermost dimension). Use them for intermediate values during shader-local inference.
Debugging ML workloads
Xcode 26 introduces comprehensive debugging tools for Metal 4 ML:
ML Network Debugger
Capture GPU traces and inspect network execution:
- Capture trace: Click the Metal icon in Xcode, then Capture
- Navigate to ML encoder: Find MTL4MachineLearningCommandEncoder in the command list
- Open network debugger: Double-click Network in bound resources
The debugger visualizes network structure as a graph, with operations as nodes and data flow as edges. Click any operation to inspect:
- Operation type and attributes
- Input/output tensor shapes
- Intermediate tensor values (click preview to open tensor viewer)
Tensor viewer
Inspect tensor contents visually:
- 2D tensors display as grayscale/heatmap images
- Higher-rank tensors show slice selection UI
- Value inspection at specific coordinates
Dependency viewer
Verify synchronization between ML and other work:
- Open Dependencies view (icon at top-left of trace)
- Locate ML command buffer
- Verify barriers connect dependent passes
Look for:
- Barriers from preceding stages to .machineLearning
- Barriers from .machineLearning to consuming stages
- No unexpected parallelism with dependent work
Performance optimization
Network architecture considerations
Design networks for real-time constraints:
- Minimize layer count: Each layer adds latency
- Use half precision: Float16 halves weight and activation storage versus Float32, which cuts memory traffic and typically improves throughput on Apple Silicon
- Reduce channel counts: Smaller hidden layers = faster inference
- Avoid expensive operations: Softmax, layer norm add overhead
Memory optimization
Reduce memory traffic:
- Reuse intermediate heaps: Allocate once, reuse across frames
- Batch inference when possible: Process multiple inputs per dispatch
- Use Shader ML for small networks: Eliminates round-trips to device memory
Profiling
Use Instruments Metal System Trace:
- ML encoder duration: Time from dispatch to completion
- Barrier wait times: Identify synchronization bottlenecks
- Memory bandwidth: Check for unexpected memory traffic
Target network execution times well under frame budget—leave headroom for render and compute work.
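A simple budget check makes "well under frame budget" measurable. The sketch below is plain Swift. `frameBudgetMs` and `fitsBudget` are hypothetical helpers, and the 25% share given to ML work is an illustrative assumption, not a recommendation from Apple.

```swift
// Milliseconds available per frame at a given refresh rate.
func frameBudgetMs(fps: Double) -> Double {
    return 1000.0 / fps
}

// True if the ML pass stays within the chosen share of the frame budget.
func fitsBudget(mlMs: Double, fps: Double, maxShare: Double) -> Bool {
    return mlMs <= frameBudgetMs(fps: fps) * maxShare
}

// A 3 ms network fits a 25% share at 60 fps (about 4.17 ms) but not at 120 fps (about 2.08 ms).
print(fitsBudget(mlMs: 3.0, fps: 60, maxShare: 0.25))   // true
print(fitsBudget(mlMs: 3.0, fps: 120, maxShare: 0.25))  // false
```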
Integration patterns
Hybrid rendering with neural enhancement
Combine traditional rendering with neural post-processing:
func renderFrame() {
    // Traditional G-buffer pass
    encodeGBufferPass(commandBuffer)
    // Neural ambient occlusion
    neuralAO.encode(commandBuffer: commandBuffer, depthTexture: depthBuffer, normalsTexture: normalBuffer)
    // Traditional lighting, reading the AO values the network wrote to its output tensor
    encodeLightingPass(commandBuffer, aoTensor: neuralAO.outputTensor)
    // Neural upscaling
    neuralUpscaler.encode(commandBuffer, inputTexture: lightingResult)
    // Present
    commandBuffer.present(drawable)
}
Dynamic network selection
Switch networks based on performance requirements:
class AdaptiveNeuralRenderer {
let highQualityPipeline: MTL4MachineLearningPipelineState
let performancePipeline: MTL4MachineLearningPipelineState
var currentPipeline: MTL4MachineLearningPipelineState
func update(frameTiming: FrameTiming) {
if frameTiming.gpuTime > targetFrameTime * 0.8 {
currentPipeline = performancePipeline
} else if frameTiming.gpuTime < targetFrameTime * 0.6 {
currentPipeline = highQualityPipeline
}
}
}
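The gap between the 0.8 and 0.6 thresholds acts as hysteresis: GPU times that fall between them keep the previous choice, which prevents the renderer from flipping networks every frame. The sketch below replays that logic in plain Swift so it can be tested without a GPU. `QualityTier` and `TierSelector` are hypothetical stand-ins for the pipeline states, and the frame times are made-up sample data.

```swift
enum QualityTier { case high, performance }

struct TierSelector {
    let targetFrameTime: Double
    private(set) var tier: QualityTier = .high

    init(targetFrameTime: Double) {
        self.targetFrameTime = targetFrameTime
    }

    mutating func update(gpuTime: Double) {
        if gpuTime > targetFrameTime * 0.8 {
            tier = .performance
        } else if gpuTime < targetFrameTime * 0.6 {
            tier = .high
        }
        // Between 60% and 80% of the target, the previous tier is kept.
    }
}

var selector = TierSelector(targetFrameTime: 16.6)  // thresholds: 9.96 ms and 13.28 ms
var history: [QualityTier] = []
for gpuTime in [9.0, 14.0, 12.0, 11.0, 9.5] {
    selector.update(gpuTime: gpuTime)
    history.append(selector.tier)
}
// history: high, performance, performance, performance, high
```

The 12.0 ms and 11.0 ms samples sit inside the dead band, so the selector stays on the performance tier until the load drops below 9.96 ms.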
Fallback for unsupported devices
Provide traditional implementations for devices without ML support:
protocol AmbientOcclusionProvider {
func encode(commandBuffer: MTL4CommandBuffer, depth: MTLTexture, normals: MTLTexture)
var outputTexture: MTLTexture { get }
}
class NeuralAO: AmbientOcclusionProvider { /* ML implementation */ }
class SSAO: AmbientOcclusionProvider { /* Traditional SSAO */ }
// Select implementation based on device capabilities.
// Tensors and the ML encoder are Metal 4 features, so check the metal4 GPU family.
let aoProvider: AmbientOcclusionProvider = device.supportsFamily(.metal4)
    ? NeuralAO(device: device)
    : SSAO(device: device)
Conclusion
Metal 4’s machine learning integration represents a fundamental shift in graphics programming. Tensors provide natural containers for ML data, the ML command encoder synchronizes network execution with rendering, and Shader ML embeds inference directly in shaders for maximum efficiency.
These capabilities enable techniques previously impractical for real-time rendering: neural material compression that roughly halves material size, learned ambient occlusion that approaches the quality of its ground-truth training data, and adaptive upscaling that maintains visual fidelity while reducing render resolution.
The patterns presented here—network preparation workflows, synchronization strategies, Shader ML implementations—provide a foundation for integrating ML into production rendering pipelines. Combined with Xcode’s new debugging tools, developers have everything needed to bring neural graphics to their applications.
The final article in this series will explore Apple Silicon-specific optimizations, covering TBDR architecture exploitation, unified memory strategies, and M4-specific performance tuning.