AI Systems Stack Architecture, Cloud to Edge: A Bird's-Eye View

February 14, 2026  ·  Architecture, AI, Systems

Across deployment contexts, the layered stack remains recognizable:

application → frameworks → runtimes → orchestration → accelerator APIs → OS → drivers/firmware → hardware

What changes is where complexity concentrates and which constraints dominate: throughput vs latency, multi-tenancy vs local/physical trust, power/thermal vs scale, and connectivity assumptions (always-connected vs intermittently connected vs fully offline).
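Because every layer depends only on the layers beneath it, the traversal above can be sketched as an ordered list. A minimal sketch (layer names come from the diagram above; the helper function is my own illustration, not part of any framework):

```python
# Ordered top-to-bottom layers of the AI systems stack, as listed above.
STACK_LAYERS = [
    "application",
    "frameworks",
    "runtimes",
    "orchestration",
    "accelerator APIs",
    "OS",
    "drivers/firmware",
    "hardware",
]

def layers_below(layer: str) -> list[str]:
    """Return every layer beneath `layer` — everything it transitively depends on."""
    idx = STACK_LAYERS.index(layer)
    return STACK_LAYERS[idx + 1:]

# A runtime sits on orchestration, accelerator APIs, the OS, drivers, and silicon:
print(layers_below("runtimes"))
```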


The Three Worlds of Compute Infrastructure

Cloud / Data Centers

Cloud's main advantage is massive, elastic compute at scale—orders of magnitude beyond any edge device. This enables large-model training, high-throughput inference, and multi-tenancy. The tradeoffs: blast-radius risk, hard requirements for isolation (GPU partitioning, quotas), high-bandwidth networking, and cluster-grade observability. Cloud-native patterns—containerization, Kubernetes, serverless—abstract infrastructure further, trading some control for scalability.
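To make the isolation point concrete: in a Kubernetes cluster, GPU quotas surface to tenants as extended resources on the pod spec. A minimal sketch, written as a plain Python dict; the `nvidia.com/gpu` resource key is the NVIDIA device-plugin convention (other vendors use their own keys), and the image name is hypothetical:

```python
# Sketch: a Kubernetes-style pod spec requesting one GPU.
# The scheduler enforces the quota; the vendor device plugin handles the
# actual device assignment (and, with MIG, hardware partitioning).
pod_spec = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "inference-worker"},
    "spec": {
        "containers": [
            {
                "name": "model-server",
                "image": "example.com/model-server:latest",  # hypothetical image
                "resources": {
                    # Whole-GPU grant; MIG slices use keys like nvidia.com/mig-1g.5gb
                    "limits": {"nvidia.com/gpu": 1},
                },
            }
        ]
    },
}
```

The key design point: the tenant never names a physical device; it asks for a quantity of an abstract resource, and the orchestration layer maps that onto partitioned hardware.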

Desktop / Edge

Desktop-class devices offer strong latency, privacy, and local control. Compared to mobile devices, they provide more compute headroom and better thermals. They occupy a "best of both" position for on-device AI: sufficient memory and sustained performance for inference and fine-tuning, fewer sandboxing restrictions than mobile, yet still user-controlled or enterprise-managed. Well-suited for local inference and light training, though constrained by power and compute limits relative to data centers.
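On desktop-class devices, "sufficient compute headroom" usually means the runtime picks the best accelerator it finds and falls back to CPU. A sketch of that fallback, using ONNX Runtime-style execution-provider names for flavor; the availability list is passed in rather than probed, so this runs anywhere:

```python
# Sketch: pick the best available execution provider from a preference list.
# Provider names follow ONNX Runtime conventions; availability is supplied
# by the caller here for illustration.
PREFERENCE = ["CUDAExecutionProvider", "CoreMLExecutionProvider", "CPUExecutionProvider"]

def pick_provider(available: list[str]) -> str:
    """Return the first preferred provider the local machine supports."""
    for provider in PREFERENCE:
        if provider in available:
            return provider
    raise RuntimeError("no usable execution provider")

# A desktop without a discrete GPU falls through to CPU:
print(pick_provider(["CPUExecutionProvider"]))
```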

Laptop / Mobile / Embedded / IoT

These environments offer ultimate mobility and ubiquity—always with the user. They emphasize heterogeneous acceleration (CPU/GPU/NPU/DSP), operate under strict power/thermal budgets, and are shaped by OS sandboxing. They increasingly rely on secure-execution primitives (TEE) for sensitive workloads. Embedded/IoT tightens things further: smaller compute, real-time guarantees (RTOS), and longer deployment lifecycles.
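The strict power/thermal budgets mentioned above often show up as a governor that scales work down as the SoC heats up. A toy sketch of the idea; the temperature thresholds are illustrative only, not taken from any real OS policy:

```python
# Toy sketch: scale inference batch size down as the SoC nears its thermal limit.
# Threshold values are illustrative, not from any real platform.
def batch_size_for(temp_c: float, max_batch: int = 8) -> int:
    if temp_c >= 45.0:                   # near the throttle point: minimum work
        return 1
    if temp_c >= 40.0:                   # warm: halve the batch
        return max(1, max_batch // 2)
    return max_batch                     # cool: full batch

print(batch_size_for(38.0), batch_size_for(42.0), batch_size_for(47.0))  # 8 4 1
```

Real platforms expose this through OS thermal APIs rather than raw temperatures, but the shape is the same: the application layer adapts its workload to a budget it does not control.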


SW Stack (Unified View)

In the stack matrix below, I split the Edge world further into Desktop/Laptop vs Mobile vs IoT/Embedded because the OS and accelerator API boundaries diverge sharply.

| Layer | Cloud (Data Center) | Edge: Desktop/Laptop | Edge: Mobile | Edge: IoT/Embedded |
|---|---|---|---|---|
| 1) Application | AI services, agents, RAG, batch jobs | Local agents, interactive apps | On-device features, assistants | Sensor/vision/control workloads |
| 2) Frameworks/Libraries | PyTorch/TF/JAX | PyTorch/TF/ONNX | TFLite/ONNX (mobile) | TFLite (incl. micro), vendor SDKs |
| 3) Runtimes/Compilers | TensorRT/OpenVINO/XLA/ORT EPs | ORT EPs, OpenVINO, TFLite delegates, TensorRT | Core ML, NNAPI, TFLite delegates | OpenVINO/TensorRT (where used), TFLite delegates |
| 4) Compute Orchestration + Distributed Execution | Kubernetes / Slurm / Ray (+ MPI/NCCL) | (rare) process supervision + device mgmt; (optional) K3s/MicroK8s/Nomad/Ray (sites/fleets) | (rare) | (optional) K3s/MicroK8s/Nomad/Ray (sites/fleets) |
| 5) Accelerator API boundary | CUDA/ROCm/oneAPI/OpenCL | CUDA/ROCm/oneAPI/OpenCL/Metal/Vulkan | NNAPI / Core ML (+ Metal/Vulkan) | Vendor accel APIs (often behind SDKs) |
| 6) OS (User-space) | Linux (+ vSphere sometimes underneath) | Windows/macOS/Linux/ChromeOS | Android/iOS | Embedded Linux / RTOS |
| 7) Kernel + Drivers | ioctl + device drivers | WDDM + MCDM/IOKit/DRM (ioctl) | HAL/Binder (Android) / IOKit (iOS) | ioctl/DRM + BSP drivers |
| 8) Firmware | UEFI/BMC + device fw | Secure boot/TPM/TEE + device fw | Secure boot/TEE + modem/sensor hubs | Secure boot/TEE/MCU firmware |
| 9) Hardware | CPU + GPU/NPU/TPU + fabric | CPU + GPU/NPU | SoC CPU+GPU+NPU/DSP | MCU/SoC + accel (varies) |

Note: The technologies listed (CUDA, OpenVINO, TensorRT, Core ML, etc.) are representative examples of today's popular stack components—not an exhaustive list.

— Sandeep