GPT-class systems operate at hundreds of billions of parameters, and the newest reasoning models consume hundreds of watts per query during inference. Training runs for frontier models cost hundreds of millions of dollars and occupy entire data centers for months, and each generation is larger, more capable, and more expensive to operate than the last. However, none of this has much bearing on the AI workloads that are being deployed in the physical world right now.
A security camera operating on a Power over Ethernet (PoE) connection has a total power budget of 15 to 30 watts, and only a fraction of that is available for AI processing after the sensor, image signal processor, video encoder, and network stack have taken their share. A drone running a visual inspection of a wind turbine operates on battery power at altitudes and in environments where consistent cloud connectivity is unavailable, while an industrial vision system on a manufacturing line must process frames continuously at rates that match the speed of the production equipment. In each of these cases, the AI workload must fit within hard limits on power, memory, and latency that have no equivalent in a data center.
The first piece in this blog series, “Everything Is Going to Be Driven by Algorithms,” described the migration of intelligence from the cloud to the edge and the emergence of agentic workflows that pursue goals, use tools, and maintain state over time. That piece argued that this transition is driven by temporal, economic, and regulatory forces that make on-device processing necessary at scale. What remains unaddressed is what kind of intelligence actually fits on these devices.
The shift from scale to composition
Model capability in AI has historically scaled with parameter count. This relationship has held up across multiple model families and training regimes, and it has shaped how the industry thinks about what constitutes a capable system. But parameter count is only one axis of system design; the other is composition.

Both cloud and edge environments use composed, multi-model systems and agentic workflows. The difference is in what those workflows are designed to achieve. In the cloud, composition pushes the ceiling on capability: models are chained together, given tools, and orchestrated to handle increasingly complex reasoning tasks, and the trajectory is toward general-purpose intelligence that can operate across open-ended domains. At the edge, composition serves a more constrained purpose: it makes complex tasks achievable within fixed physical limits. Power budgets are measured in single-digit watts. Memory is scarce. Latency requirements are hard. The processor must run multiple workloads simultaneously.

A security camera does not need hundreds of billions of parameters. It needs a compact object detector running at the frame rate of the video stream, a tracker maintaining identity across frames, and a classifier routing events to the appropriate response. Each of these components can be a small, purpose-trained model that executes efficiently within those limits. The performance of the system depends on how these models are composed and coordinated, not on the parameter count of any individual model.
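To make that concrete, here is a minimal sketch of such a pipeline in Python. The detect, update_tracks, and classify_event functions are hypothetical stand-ins for compact, purpose-trained models; in a real deployment each would be a quantized network compiled for the device's neural engine.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Track:
    track_id: int
    box: Tuple[int, int, int, int]   # (x, y, w, h) in pixels
    label: str = "unknown"

# Hypothetical stand-ins for compact, purpose-trained models; each would be
# a quantized network running on the device's neural engine in practice.
def detect(frame) -> List[Tuple[int, int, int, int]]:
    return []                        # bounding boxes for the current frame

def update_tracks(tracks: List[Track], boxes) -> List[Track]:
    return tracks                    # maintain identity across frames

def classify_event(track: Track) -> str:
    return "none"                    # route the track to a response category

def process_stream(frames):
    """Per-frame composition: detector -> tracker -> event classifier."""
    tracks: List[Track] = []
    for frame in frames:
        boxes = detect(frame)                   # runs at the stream frame rate
        tracks = update_tracks(tracks, boxes)
        for track in tracks:
            event = classify_event(track)
            if event != "none":
                yield {"track_id": track.track_id, "event": event}
```

The point of the sketch is the shape of the system, not the internals of any one model: each stage is small enough to meet its own latency and power budget, and the pipeline as a whole delivers the outcome.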
Composing specialized models at the edge is not new. Object detection, semantic segmentation, pose estimation, depth estimation, and tracking are mature model families with well-understood architectures, training recipes, and quantization paths. Edge AI developers have been assembling these components into multi-model pipelines for over a decade. What has changed in 2026 is that these perception models can now be combined with vision-language models and small language models that run on the same device, which means the system can reason over its own perceptual outputs and coordinate action across models without a round trip to the cloud.
Computer vision as the proving ground
Computer vision remains the dominant edge AI workload. It powers quality inspection on factory floors, safety monitoring in construction and mining, pedestrian detection in automotive ADAS, perimeter security in enterprise and public installations, and patient monitoring in clinical settings. Computer vision leads for a structural reason: cameras are the most widely deployed sensor type, visual data is among the richest and highest-bandwidth modalities, and the decisions that derive from visual input are often the most time-critical.
The computer vision stack at the edge has had a decade to mature, and the practical effect is visible in what these models can do within severe resource constraints. Purpose-trained CNNs for detection and classification can run at hundreds of frames per second within sub-watt power envelopes. Transformer-based vision models have been compressed and quantized to run on edge processors that were designed a generation ago for CNN workloads. Segmentation models like SAM and its derivatives have been adapted for promptable, interactive use on edge hardware. Depth estimation models provide spatial understanding from monocular cameras, eliminating the need for dedicated depth sensors in many applications.
What is new in 2026 is that these perception models can now be paired with language-capable models that run on the same device. A vision-language model with 500 million to 7 billion parameters, quantized and optimized for edge inference, can interpret open-ended queries, generate structured descriptions of visual scenes, and coordinate with purpose-trained specialist models. The practical result is a system that combines the accuracy of domain-specific perception with the flexibility of language-driven reasoning, all within the power and memory constraints of an embedded processor.
Why small models are a design decision
The framing that “small models are less capable” is a cloud-centric perspective, and it misses how small models contribute to capable systems. In edge deployments, model size is a design parameter that interacts with latency, power, memory bandwidth, and thermal headroom. A smaller model that meets the accuracy requirement for a specific task while running within the device’s power budget is a better engineering outcome than a larger model that exceeds the thermal envelope and requires active cooling, a larger battery, or a more expensive processor.
The techniques that make small models viable at the edge have matured significantly. Quantization, which reduces the numerical precision of model weights and activations, can compress a model by 4x or more with minimal accuracy loss on well-defined tasks. Knowledge distillation transfers the learned behavior of a large teacher model into a smaller student model that retains much of the teacher’s accuracy in a fraction of the parameter count. Pruning removes redundant connections from trained networks. Architecture search identifies network topologies that achieve favorable tradeoffs between accuracy and computational cost for specific hardware targets.
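As a rough illustration of the first of these techniques, the sketch below applies PyTorch's post-training dynamic quantization to a stand-in model. The toy architecture is hypothetical; the roughly 4x size reduction comes from storing weights as int8 rather than float32.

```python
import torch
import torch.nn as nn

# A stand-in classifier head; any trained float32 model is handled the same way.
model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
).eval()

# Post-training dynamic quantization: Linear weights are stored as int8
# (roughly 4x smaller than float32); activations are quantized at runtime.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)   # torch.Size([1, 10])
```

In practice these techniques are combined: a distilled or pruned model is typically quantized as a final step before it is compiled for the target processor.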
These techniques are standard engineering practice for deploying neural networks in the physical world. The result is a growing library of models, spanning detection, classification, segmentation, pose estimation, depth estimation, tracking, and vision-language understanding, that have been validated for edge deployment on specific hardware platforms. Ambarella’s Cooper Model Garden, for example, provides optimized models across these task families, each built and benchmarked for deployment on Ambarella’s edge AI SoCs, spanning architectures from lightweight CNNs to vision-language models supporting up to 34 billion parameters on the company’s higher-performance N1 family.
Orchestration as the architectural pattern
“Everything Is Going to Be Driven by Algorithms” introduced the concept of language models serving as orchestrators in agentic edge systems. The compositional model described here is the operational expression of that concept.
In a monolithic architecture, a single large model handles perception, reasoning, and action selection. This works in cloud environments where the model has enough capacity to cover the full task space. At the edge, the tradeoff is more direct. A model large enough to cover the full task space will not fit on the device. A model that fits on the device will not cover the full task space.
The compositional approach resolves this tension. The VLM or small language model serves as the orchestration layer: it interprets instructions, reasons over context, and routes subtasks to specialist models. The specialist models handle high-frequency, well-defined perception tasks where purpose-trained accuracy matters. A person detector identifies targets. A pose estimator extracts body keypoints from cropped regions. A license plate recognizer reads characters. A fire and smoke classifier flags hazards. Each model is optimized for its task and runs at the speed the application demands, while the orchestrator coordinates them all.
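A minimal sketch of that routing pattern follows, with hypothetical specialist stubs and a placeholder decision function standing in for the on-device VLM or small language model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Detection:
    label: str
    box: tuple          # (x, y, w, h) from the high-frequency detector
    confidence: float

# Hypothetical specialist stubs; each represents a purpose-trained model
# optimized for its task and running at the rate the application demands.
def estimate_pose(crop) -> dict:     return {"keypoints": []}
def read_plate(crop) -> dict:        return {"text": ""}
def classify_hazard(crop) -> dict:   return {"fire_or_smoke": False}

SPECIALISTS: Dict[str, Callable] = {
    "person": estimate_pose,
    "vehicle": read_plate,
    "smoke": classify_hazard,
}

def orchestrate(frame, detections: List[Detection], vlm_decide: Callable) -> dict:
    """Route each detection to its specialist, then hand the structured
    outputs to the language-capable orchestrator for a decision."""
    outputs = []
    for det in detections:
        handler = SPECIALISTS.get(det.label)
        if handler:
            # A real pipeline would pass a crop of the frame around det.box.
            outputs.append({"label": det.label, "result": handler(frame)})
    return vlm_decide(outputs)   # e.g. raise an alert, log the event, or ignore
```

One practical consequence of this split is that the detector stays in the per-frame hot path while the heavier language-capable model is invoked only when an event needs interpretation.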
This pattern maps directly to the three-tier compute architecture described in the first piece. At the far edge, on the device, lightweight specialist models handle real-time perception. At the near edge, on a local gateway, a more capable model orchestrates across multiple devices and maintains state. At the cloud tier, heavier models handle rare or computationally expensive tasks like forensic analysis and model retraining. The composition scales across tiers.
What this means for edge AI platforms
The shift from monolithic models to composed model systems changes what developers need from an edge AI platform. The platform needs to support multiple model architectures running concurrently on the same processor. It needs to provide validated, optimized models across the task families that production applications require. Developers also need tooling for benchmarking models against specific hardware targets, and a common software stack that works across the platform’s SoC portfolio so that a pipeline validated on one device can be adapted for another without a full re-engineering cycle.
Ambarella’s Developer Zone is structured around these requirements. The Cooper Model Garden provides pre-validated models spanning detection, classification, segmentation, pose estimation, depth estimation, and vision-language understanding, each optimized for deployment on Ambarella’s CV7 and N1 SoC families. Cloud-based benchmarking tools allow developers to test model performance against specific hardware configurations before committing to a design. Agentic blueprints provide templates for composing multi-model workflows. The Cooper development platform provides the common software stack that spans the company’s edge AI portfolio, from far-edge endpoints to near-edge infrastructure.
The broader industry trajectory points in the same direction. Edge AI is converging on an architecture where intelligence is distributed across composed model systems, each component right-sized for its task and its hardware target. The platforms that provide the broadest and most thoroughly validated model libraries, the most capable composition tooling, and the most consistent cross-device software experience will attract the largest developer ecosystems.
The year the edge finds its own identity
For the past several years, edge AI has been discussed primarily in relation to the cloud: as a deployment target for models that were designed in and for data centers, with edge optimization treated as a post-hoc step. The defining shift in 2026 is that edge AI is developing its own architectural identity, where the models are designed for edge constraints from the start. The composition patterns are native to multi-model, multi-tier environments. The tooling is built around hardware-aware optimization and deployment.
The AI that will operate the cameras, vehicles, robots, drones, medical devices, and industrial equipment of the next decade will not be a scaled-down version of what runs in the cloud. It will be a distinct class of system, composed from task-specific components, engineered for physical constraints, and measured by whether the full composition delivers the outcome the application demands. The models that matter at the edge are the ones that fit the device and work together. The measure of success is the outcome of the composition, not the capability of any single component.