AI Systems Stack Architecture, Cloud to Edge: A Bird's-Eye View
Across deployment contexts, the layered stack remains recognizable. What changes is where complexity concentrates and which constraints dominate: throughput vs. latency, multi-tenancy vs. local/physical trust, power/thermal budgets vs. scale, and connectivity assumptions (always-connected, intermittently connected, or fully offline).
The Three Worlds of Compute Infrastructure
Cloud / Data Centers
Cloud's main advantage is massive, elastic compute at scale—orders of magnitude beyond any edge device. This enables large-model training, high-throughput inference, and multi-tenancy. The tradeoffs: blast-radius risk, hard requirements for isolation (GPU partitioning, quotas), high-bandwidth networking, and cluster-grade observability. Cloud-native patterns—containerization, Kubernetes, serverless—abstract infrastructure further, trading some control for scalability.
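To make the multi-tenancy point concrete, here is a minimal sketch of per-tenant accelerator-memory quota enforcement, the kind of isolation a cloud scheduler provides. All class and parameter names here are illustrative, not a real scheduler API:

```python
from dataclasses import dataclass, field

@dataclass
class GpuQuota:
    """Hypothetical sketch: track per-tenant memory allocations on one device."""
    total_mib: int
    used: dict = field(default_factory=dict)  # tenant -> MiB allocated

    def allocate(self, tenant: str, mib: int, limit_mib: int) -> bool:
        """Grant the request only if both the tenant's own quota and the
        device's total capacity allow it; otherwise reject it."""
        tenant_used = self.used.get(tenant, 0)
        if tenant_used + mib > limit_mib:
            return False  # tenant would exceed its quota
        if sum(self.used.values()) + mib > self.total_mib:
            return False  # device is out of memory
        self.used[tenant] = tenant_used + mib
        return True

quota = GpuQuota(total_mib=80_000)  # one 80 GiB accelerator
assert quota.allocate("team-a", 40_000, limit_mib=40_000)
assert not quota.allocate("team-a", 1, limit_mib=40_000)  # quota exhausted
assert quota.allocate("team-b", 40_000, limit_mib=60_000)
assert not quota.allocate("team-b", 1, limit_mib=60_000)  # device full
```

Real systems enforce this at the driver or orchestration layer (e.g. GPU partitioning plus scheduler quotas) rather than in application code, but the two-level check (tenant limit, then device capacity) is the same idea.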
Desktop / Edge
Desktop-class devices offer low latency, strong privacy, and local control. Compared to mobile devices, they provide more compute headroom and better thermals. They occupy a "best of both" position for on-device AI: sufficient memory and sustained performance for inference and fine-tuning, fewer sandboxing restrictions than mobile, yet still user-controlled or enterprise-managed. Well-suited for local inference and light training, though constrained by power and compute limits relative to data centers.
Laptop / Mobile / Embedded / IoT
These environments offer ultimate mobility and ubiquity—always with the user. They emphasize heterogeneous acceleration (CPU/GPU/NPU/DSP), operate under strict power/thermal budgets, and are shaped by OS sandboxing. They increasingly rely on secure-execution primitives (TEE) for sensitive workloads. Embedded/IoT tightens things further: smaller compute, real-time guarantees (RTOS), and longer deployment lifecycles.
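The power/thermal budget shapes runtime behavior directly: mobile and embedded runtimes duty-cycle or down-clock accelerators as the die heats up. A hypothetical sketch of such a thermal governor, with illustrative thresholds:

```python
def pick_clock(temp_c: float) -> float:
    """Hypothetical thermal-aware governor: return a clock multiplier
    (1.0 = full speed) for the accelerator given die temperature.
    Thresholds are illustrative, not taken from any real SoC."""
    if temp_c < 70:
        return 1.0   # comfortable: run the NPU at full clock
    if temp_c < 85:
        return 0.5   # warm: halve throughput to shed heat
    return 0.0       # critical: pause inference entirely

assert pick_clock(45) == 1.0
assert pick_clock(80) == 0.5
assert pick_clock(90) == 0.0
```

In practice this logic lives in firmware or the OS power manager rather than in the app, which is one reason sustained-performance benchmarks diverge so sharply from peak numbers on phones.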
SW Stack (Unified View)
In the stack matrix below, I split the Edge world further into Desktop/Laptop vs Mobile vs IoT/Embedded because the OS and accelerator API boundaries diverge sharply.
| Layer | Cloud (Data Center) | Edge: Desktop/Laptop | Edge: Mobile | Edge: IoT/Embedded |
|---|---|---|---|---|
| 1) Application | AI services, agents, RAG, batch jobs | Local agents, interactive apps | On-device features, assistants | Sensor/vision/control workloads |
| 2) Frameworks/Libraries | PyTorch/TF/JAX | PyTorch/TF/ONNX | TFLite/ONNX (mobile) | TFLite (incl. micro), vendor SDKs |
| 3) Runtimes/Compilers | TensorRT/OpenVINO/XLA/ORT EPs | ORT EPs, OpenVINO, TFLite delegates, TensorRT | Core ML, NNAPI, TFLite delegates | OpenVINO/TensorRT (where used), TFLite delegates |
| 4) Compute Orchestration + Distributed Execution | Kubernetes / Slurm / Ray (+ MPI/NCCL) | (rare) process supervision + device mgmt; (optional) K3s/MicroK8s/Nomad/Ray (sites/fleets) | (rare) | (optional) K3s/MicroK8s/Nomad/Ray (sites/fleets) |
| 5) Accelerator API boundary | CUDA/ROCm/oneAPI/OpenCL | CUDA/ROCm/oneAPI/OpenCL/Metal/Vulkan | NNAPI / Core ML (+ Metal/Vulkan) | Vendor accel APIs (often behind SDKs) |
| 6) OS (User-space) | Linux (+ vSphere sometimes underneath) | Windows/macOS/Linux/ChromeOS | Android/iOS | Embedded Linux / RTOS |
| 7) Kernel + Drivers | ioctl + device drivers | WDDM+MCDM (Windows) / IOKit (macOS) / DRM (ioctl, Linux) | HAL/Binder (Android) / IOKit (iOS) | ioctl/DRM + BSP drivers |
| 8) Firmware | UEFI/BMC + device fw | Secure boot/TPM/TEE + device fw | Secure boot/TEE + modem/sensor hubs | Secure boot/TEE/MCU firmware |
| 9) Hardware | CPU + GPU/NPU/TPU + fabric | CPU + GPU/NPU | SoC CPU+GPU+NPU/DSP | MCU/SoC + accel (varies) |
Note: The technologies listed (CUDA, OpenVINO, TensorRT, Core ML, etc.) are representative examples of today's popular stack components—not an exhaustive list.
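One pattern that recurs at layer 3 across all four columns is best-available-backend selection: ONNX Runtime expresses it as an ordered execution-provider list, TFLite as delegates. A minimal sketch of the idea, with an illustrative (not real-API) registry:

```python
# Preference order, best-first; "cpu" is assumed always available.
PREFERENCE = ["tensorrt", "cuda", "openvino", "cpu"]

def select_backend(available: set) -> str:
    """Walk the preference order and return the first backend that the
    current device actually exposes, falling back to plain CPU."""
    for backend in PREFERENCE:
        if backend in available:
            return backend
    return "cpu"

assert select_backend({"cuda", "cpu"}) == "cuda"        # data-center GPU box
assert select_backend({"openvino", "cpu"}) == "openvino"  # Intel edge device
assert select_backend({"cpu"}) == "cpu"                 # bare laptop
```

The same model artifact can then ship to every tier, with the runtime degrading gracefully from the fastest accelerator down to CPU — which is exactly why the layer-3/layer-5 boundary matters so much in the table above.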
— Sandeep