AI Systems Stack Architecture, Cloud to Edge: A Bird's-Eye View
Across deployment contexts, the layered stack remains recognizable. What changes is where complexity concentrates and which constraints dominate: throughput vs. latency, multi-tenancy vs. local/physical trust, power/thermal budgets vs. scale, and connectivity assumptions (always-connected, intermittently connected, or fully offline).
The Three Worlds of Compute Infrastructure
Cloud / Data Centers
Cloud's main advantage is massive, elastic compute at scale—orders of magnitude beyond any edge device. This enables large-model training, high-throughput inference, and multi-tenancy. The tradeoffs: blast-radius risk, hard requirements for isolation (GPU partitioning, quotas), high-bandwidth networking, and cluster-grade observability. Cloud-native patterns—containerization, Kubernetes, serverless—abstract infrastructure further, trading some control for scalability.
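To make the multi-tenancy point concrete, here is a minimal sketch of per-tenant accelerator-memory quota enforcement, the kind of isolation a cloud scheduler provides. All class and parameter names here are illustrative, not a real scheduler API:

```python
from dataclasses import dataclass, field

@dataclass
class GpuQuota:
    """Hypothetical sketch: track per-tenant memory allocations on one device."""
    total_mib: int
    used: dict = field(default_factory=dict)  # tenant -> MiB allocated

    def allocate(self, tenant: str, mib: int, limit_mib: int) -> bool:
        """Grant the request only if both the tenant's own quota and the
        device's total capacity allow it; otherwise reject it."""
        tenant_used = self.used.get(tenant, 0)
        if tenant_used + mib > limit_mib:
            return False  # tenant would exceed its quota
        if sum(self.used.values()) + mib > self.total_mib:
            return False  # device is out of memory
        self.used[tenant] = tenant_used + mib
        return True

quota = GpuQuota(total_mib=80_000)  # one 80 GiB accelerator
assert quota.allocate("team-a", 40_000, limit_mib=40_000)
assert not quota.allocate("team-a", 1, limit_mib=40_000)  # quota exhausted
assert quota.allocate("team-b", 40_000, limit_mib=60_000)
assert not quota.allocate("team-b", 1, limit_mib=60_000)  # device full
```

Real systems enforce this at the driver or orchestration layer (e.g. GPU partitioning plus scheduler quotas) rather than in application code, but the two-level check (tenant limit, then device capacity) is the same idea.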
Desktop / Edge
Desktop-class devices offer low latency, strong privacy, and local control. Compared to mobile devices, they provide more compute headroom and better thermals. They occupy a "best of both" position for on-device AI: sufficient memory and sustained performance for inference and fine-tuning, fewer sandboxing restrictions than mobile, yet still user-controlled or enterprise-managed. Well-suited for local inference and light training, though constrained by power and compute limits relative to data centers.
Laptop / Mobile / Embedded / IoT
These environments offer ultimate mobility and ubiquity—always with the user. They emphasize heterogeneous acceleration (CPU/GPU/NPU/DSP), operate under strict power/thermal budgets, and are shaped by OS sandboxing. They increasingly rely on secure-execution primitives (TEE) for sensitive workloads. Embedded/IoT tightens things further: smaller compute, real-time guarantees (RTOS), and longer deployment lifecycles.
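The power/thermal budget shapes runtime behavior directly: mobile and embedded runtimes duty-cycle or down-clock accelerators as the die heats up. A hypothetical sketch of such a thermal governor, with illustrative thresholds:

```python
def pick_clock(temp_c: float) -> float:
    """Hypothetical thermal-aware governor: return a clock multiplier
    (1.0 = full speed) for the accelerator given die temperature.
    Thresholds are illustrative, not taken from any real SoC."""
    if temp_c < 70:
        return 1.0   # comfortable: run the NPU at full clock
    if temp_c < 85:
        return 0.5   # warm: halve throughput to shed heat
    return 0.0       # critical: pause inference entirely

assert pick_clock(45) == 1.0
assert pick_clock(80) == 0.5
assert pick_clock(90) == 0.0
```

In practice this logic lives in firmware or the OS power manager rather than in the app, which is one reason sustained-performance benchmarks diverge so sharply from peak numbers on phones.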
SW Stack (Unified View)
In the stack matrix below, I split the Edge world further into Desktop/Laptop vs Mobile vs IoT/Embedded because the OS and accelerator API boundaries diverge sharply.
| Layer | Cloud (Data Center) | Edge: Desktop/Laptop | Edge: Mobile | Edge: IoT/Embedded |
|---|---|---|---|---|
| 1) Application | AI services, agents, RAG, batch jobs | Local agents, interactive apps | On-device features, assistants | Sensor/vision/control workloads |
| 2) Frameworks/Libraries | PyTorch/TF/JAX | PyTorch/TF/ONNX | TFLite/ONNX (mobile) | TFLite (incl. micro), vendor SDKs |
| 3) Runtimes/Compilers | TensorRT/OpenVINO/XLA/ORT EPs | ORT EPs, OpenVINO, TFLite delegates, TensorRT | Core ML, NNAPI, TFLite delegates | OpenVINO/TensorRT (where used), TFLite delegates |
| 4) Compute Orchestration + Distributed Execution | Kubernetes / Slurm / Ray (+ MPI/NCCL) | (rare) process supervision + device mgmt; (optional) K3s/MicroK8s/Nomad/Ray (sites/fleets) | (rare) | (optional) K3s/MicroK8s/Nomad/Ray (sites/fleets) |
| 5) Accelerator API boundary | CUDA/ROCm/oneAPI/OpenCL | CUDA/ROCm/oneAPI/OpenCL/Metal/Vulkan | NNAPI / Core ML (+ Metal/Vulkan) | Vendor accel APIs (often behind SDKs) |
| 6) OS (User-space) | Linux (+ vSphere sometimes underneath) | Windows/macOS/Linux/ChromeOS | Android/iOS | Embedded Linux / RTOS |
| 7) Kernel + Drivers | ioctl + device drivers | WDDM+MCDM (Windows) / IOKit (macOS) / DRM (ioctl, Linux) | HAL/Binder (Android) / IOKit (iOS) | ioctl/DRM + BSP drivers |
| 8) Firmware | UEFI/BMC + device fw | Secure boot/TPM/TEE + device fw | Secure boot/TEE + modem/sensor hubs | Secure boot/TEE/MCU firmware |
| 9) Hardware | CPU + GPU/NPU/TPU + fabric | CPU + GPU/NPU | SoC CPU+GPU+NPU/DSP | MCU/SoC + accel (varies) |
Note: The technologies listed (CUDA, OpenVINO, TensorRT, Core ML, etc.) are representative examples of today's popular stack components—not an exhaustive list.
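One pattern that recurs at layer 3 across all four columns is best-available-backend selection: ONNX Runtime expresses it as an ordered execution-provider list, TFLite as delegates. A minimal sketch of the idea, with an illustrative (not real-API) registry:

```python
# Preference order, best-first; "cpu" is assumed always available.
PREFERENCE = ["tensorrt", "cuda", "openvino", "cpu"]

def select_backend(available: set) -> str:
    """Walk the preference order and return the first backend that the
    current device actually exposes, falling back to plain CPU."""
    for backend in PREFERENCE:
        if backend in available:
            return backend
    return "cpu"

assert select_backend({"cuda", "cpu"}) == "cuda"        # data-center GPU box
assert select_backend({"openvino", "cpu"}) == "openvino"  # Intel edge device
assert select_backend({"cpu"}) == "cpu"                 # bare laptop
```

The same model artifact can then ship to every tier, with the runtime degrading gracefully from the fastest accelerator down to CPU — which is exactly why the layer-3/layer-5 boundary matters so much in the table above.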
— Sandeep