"Life is like riding a bicycle. To keep your balance you must keep moving"
Albert Einstein

Bio

I am an AI Software Architect at Intel, currently focused on full-stack AI software and performance R&D. I received my Master's in Electrical Engineering from Rochester Institute of Technology in 2016, where I was advised by Amlan Ganguly and Ray Ptucha on multi-core systems with NoC architectures and Deep Learning. During my Master's, I interned at Intel, where I worked on creating high-performance Deep Learning models for Intel Atom-based SoCs. I received my Bachelor's degree in Electronics and Communication Engineering from Visvesvaraya Technological University, India, in 2012.

I enjoy interdisciplinary research and attending hackathons. I am a Neuroscience, Physics, and Hardware Architecture enthusiast.

Experiences

Mar 2016 - Present

Intel Corporation

Engineer

Technologies

• Deep Learning Software and Computer Architecture

• Machine Learning algorithms and Deep Learning data science for Computer Vision

• Hybrid computing (Distributed + Heterogeneous)

Products Timeline
Mar 2019 – Present
Deep Learning Software Architecture for Next-Gen AI Products

Intel NPU (Formerly Intel Movidius VPU)

Products Released/Public:
• 2024: NPU4 in Lunar Lake
• 2023: NPU2.7 in Meteor Lake
• 2022: Discrete Keembay in Raptor Lake Surface laptops
• 2019: Keembay
Sep 2017 – Mar 2019
Deep Learning Graph Compiler nGraph

Intel Nervana NNP-Training

Mar 2016 – Sep 2017
Computer Vision and Deep Learning

Intel Atom+FPGA+iGPU

Oct 2015 - Dec 2015

Intel Corporation (Hillsboro, OR)

Software Engineer Intern

Performance analysis and optimization of Machine Learning (Deep Learning) algorithms for Computer Vision and mobile applications, using Torch, OpenCV, and TensorFlow.

Aug 2014 - Mar 2016

Rochester Institute of Technology (Rochester, NY)

Research Assistant

Aug 2014 - Mar 2016
@Multi-Core System Lab

Improved the thermal performance of multi-core, network-on-chip (NoC) based architectures through a distributed, intelligent, and proactive thermal-aware task reallocation algorithm, with the underlying neural network optimized for faster training time.
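A toy sketch of the proactive idea follows (Python; the policy, threshold, and data are illustrative assumptions, not the lab's actual algorithm): move load away from a core predicted to overheat before a violation occurs.

```python
# Toy sketch: proactively migrate the heaviest task from the core
# predicted to run hottest onto the coolest core, before a thermal
# violation occurs. Threshold and policy are illustrative assumptions.
def reallocate(predicted_temps, task_loads, threshold=85.0):
    hottest = max(predicted_temps, key=predicted_temps.get)
    if predicted_temps[hottest] < threshold or not task_loads[hottest]:
        return None  # no migration needed this epoch
    coolest = min(predicted_temps, key=predicted_temps.get)
    task = max(task_loads[hottest])  # heaviest task on the hot core
    task_loads[hottest].remove(task)
    task_loads[coolest].append(task)
    return task, hottest, coolest

temps = {0: 92.0, 1: 70.0, 2: 65.0}       # predicted core temperatures (C)
loads = {0: [3.0, 1.5], 1: [2.0], 2: []}  # task loads per core
print(reallocate(temps, loads))           # (3.0, 0, 2)
```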

May 2015 – Oct 2015
@Machine Intelligence Lab

Developed an improved video classification scheme using deeper convolutional neural networks, for better accuracy and computation time.
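A minimal sketch of the frame-level idea, in PyTorch with random tensors standing in for video (the architecture is illustrative, not the lab's actual network): extract per-frame CNN features, pool them over time, and classify the clip.

```python
# Minimal sketch: per-frame CNN features averaged over time, then a
# linear classifier. Shapes and layers are illustrative assumptions.
import torch
import torch.nn as nn

class TinyVideoNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.frame_cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(16, num_classes)

    def forward(self, clip):  # clip: (batch, time, channels, H, W)
        b, t = clip.shape[:2]
        feats = self.frame_cnn(clip.flatten(0, 1))  # per-frame features
        feats = feats.view(b, t, -1).mean(dim=1)    # average over time
        return self.head(feats)

clip = torch.randn(2, 8, 3, 64, 64)  # 2 clips of 8 RGB 64x64 frames
print(TinyVideoNet()(clip).shape)    # torch.Size([2, 10])
```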

Aug 2012 - May 2013

Hindustan Aeronautics Limited (Bangalore, India)

Apprentice Engineer

Worked on data analysis of Solid State Digital Video Recording system and Electronic Flight instrument system for fault detection and integration of the system.

Research Papers / Hackathons / Academic Projects

  • Intel nGraph: An Intermediate Representation, Compiler, and Executor for Deep Learning:
  • SYSML 2018 link
    The Deep Learning (DL) community sees many novel topologies published each year. Achieving high performance on each new topology remains challenging, as each requires some level of manual effort. This issue is compounded by the proliferation of frameworks and hardware platforms. The current approach, which we call "direct optimization", requires deep changes within each framework to improve the training performance for each hardware backend (CPUs, GPUs, FPGAs, ASICs) and requires O(fp) effort, where f is the number of frameworks and p is the number of platforms. While optimized kernels for deep-learning primitives are provided via libraries like the Intel Math Kernel Library for Deep Neural Networks (MKL-DNN), there are several compiler-inspired ways in which performance can be further optimized. Building on our experience creating neon (a fast deep learning library on GPUs), we developed Intel nGraph, a soon-to-be open-sourced C++ library to simplify the realization of optimized deep learning performance across frameworks and hardware platforms. Initially supported frameworks include TensorFlow, MXNet, and the Intel neon framework. Initial backends are Intel Architecture CPUs (CPU), the Intel(R) Nervana Neural Network Processor(R) (NNP), and NVIDIA GPUs. Currently supported compiler optimizations include efficient memory management and data layout abstraction. In this paper, we describe our overall architecture and its core components. In the future, we envision extending nGraph API support to a wider range of frameworks, hardware (including FPGAs and ASICs), and compiler optimizations (training versus inference optimizations, multi-node and multi-device scaling via efficient sub-graph partitioning, and HW-specific compounding of operations).
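The O(fp)-versus-O(f+p) argument can be made concrete with a toy enumeration (illustrative only; this is not the nGraph API): with a shared IR, each framework needs one exporter and each backend needs one compiler, instead of bespoke work per (framework, platform) pair.

```python
# Toy illustration (not the nGraph API) of why a shared intermediate
# representation (IR) cuts integration effort from O(f*p) to O(f+p).
frameworks = ["tensorflow", "mxnet", "neon"]  # f = 3 frontends
backends = ["cpu", "nnp", "gpu"]              # p = 3 backends

# Direct optimization: each (framework, backend) pair is bespoke work.
direct = [(f, b) for f in frameworks for b in backends]

# IR-based: one exporter per framework plus one compiler per backend.
via_ir = [(f, "ir") for f in frameworks] + [("ir", b) for b in backends]

print(len(direct))  # 9 integrations -> O(f*p)
print(len(via_ir))  # 6 integrations -> O(f+p)
```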


  • An Artificial Neural Networks based Temperature Prediction Framework for Network-on-Chip based Multicore Platform:
  • Thesis link
    Continuous improvement in silicon process technologies has made possible the integration of hundreds of cores on a single chip. However, power and heat have become dominant constraints in designing these massive multicore chips, causing issues with reliability, timing variations, and reduced chip lifetime. Dynamic Thermal Management (DTM) is a solution to avoid high temperatures on the die. Typical DTM schemes only address core-level thermal issues. However, the Network-on-Chip (NoC) paradigm, which has emerged as an enabling methodology for integrating hundreds to thousands of cores on the same die, can contribute significantly to these thermal issues. Moreover, typical DTM is triggered reactively based on temperature measurements from on-chip thermal sensors, requiring long reaction times, whereas a predictive DTM method estimates future temperature in advance, eliminating the chance of temperature overshoot. Artificial Neural Networks (ANNs) have been used in various domains for modeling and prediction with high accuracy due to their ability to learn and adapt. This thesis concentrates on designing an ANN prediction engine to predict the thermal profile of the cores and Network-on-Chip elements of the chip. This thermal profile is then used by a predictive DTM that combines both core-level and network-level DTM techniques. An on-chip wireless interconnect, recently envisioned to enable energy-efficient data exchange between cores in a multicore environment, is used to provide a broadcast-capable medium to efficiently distribute thermal control messages that trigger and manage the DTM schemes.
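A minimal sketch of the prediction engine's core idea (scikit-learn on a synthetic temperature trace; not the thesis model or its features): train an ANN to forecast a core's next temperature from a short history, so DTM can act before an overshoot.

```python
# Minimal sketch (synthetic data, not the thesis model): predict a
# core's next temperature from a short history of readings.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
T = 70 + 10 * np.sin(np.linspace(0, 20, 2000)) + rng.normal(0, 0.5, 2000)

H = 8  # history window of past temperature samples
X = np.stack([T[i:i + H] for i in range(len(T) - H - 1)])
y = T[H + 1:]  # temperature one step ahead

model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
model.fit(X[:1500], y[:1500])
print("predicted next temp:", model.predict(X[1500:1501])[0])
```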


  • OhSnap @HackPrinceton-2015 | Princeton University:
  • Identification of human finger snaps: feature extraction using the cepstrum, with PCA and a Random Forest trained on the recorded data. Coded in Python.
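A hedged sketch of that pipeline (synthetic frames standing in for recorded audio; not the hackathon code): cepstral features, reduced with PCA, feeding a Random Forest.

```python
# Hedged sketch: cepstral features -> PCA -> Random Forest, on
# synthetic stand-in frames rather than real snap recordings.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

def cepstrum(frame):
    # Real cepstrum: inverse FFT of the log magnitude spectrum.
    spectrum = np.abs(np.fft.rfft(frame)) + 1e-10
    return np.fft.irfft(np.log(spectrum))

rng = np.random.default_rng(0)
snaps = rng.normal(0, 1.0, (50, 1024))  # stand-ins for snap frames
noise = rng.normal(0, 0.2, (50, 1024))  # stand-ins for background noise
X = np.array([cepstrum(f)[:64] for f in np.vstack([snaps, noise])])
y = np.array([1] * 50 + [0] * 50)       # 1 = snap, 0 = no snap

clf = make_pipeline(PCA(n_components=10), RandomForestClassifier(random_state=0))
clf.fit(X, y)
print("snap?", clf.predict(X[:1])[0])
```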



  • SketchitUp @BrickHack-2015 | Rochester Institute of Technology:
  • A web app that helps users draw uploaded images by joining points (tracing the image). Built with Machine Learning, Python, OpenCV, HTML5, JS, and CSS.
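A rough sketch of the tracing step (OpenCV; the file name is hypothetical): detect edges, pull contours, and simplify each into the point sequence a user would join.

```python
# Rough sketch of the tracing step; "upload.png" is a hypothetical file.
import cv2

img = cv2.imread("upload.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(img, 100, 200)  # edge map of the uploaded image
contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

# Simplify each contour into a short list of (x, y) points to join.
for contour in contours:
    points = cv2.approxPolyDP(contour, 2.0, False).reshape(-1, 2)
    print(points.tolist())
```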


  • Literade @HackMIT-2015 | Massachusetts Institute of Technology:
  • A web app that converts plain articles into colorful images for children to read. Built with Machine Learning, Python, HTML, CSS, and JS.

  • Other hackathons attended:
    • Tinder4Food (Android app) @HackNY-2015 | New York University
    • HackBU-2016 | Binghamton University
    • BrickHack2-2016 | Rochester Institute of Technology


  • Moving Target Detection and Aiming - 2013 :
  • Designed a robotic arm that aims a cat-toy laser at moving targets in real time. A regression technique, implemented in MATLAB, was used for the arm movement.
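The regression step might look like this minimal sketch (Python rather than the original MATLAB; the data and mapping are synthetic stand-ins): fit a map from the target's pixel coordinates to the arm's joint angles.

```python
# Minimal sketch: regress from the target's camera coordinates to the
# two arm joint angles. Data and kinematics are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
targets = rng.uniform(0, 640, (200, 2))            # (x, y) pixel positions
true_map = np.array([[0.1, 0.02], [-0.03, 0.12]])  # stand-in kinematics
angles = targets @ true_map.T + rng.normal(0, 0.5, (200, 2))

model = LinearRegression().fit(targets, angles)
print("joint angles:", model.predict([[320.0, 240.0]])[0])
```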


  • e-Data Analysis - 2014 :
  • Performed descriptive statistics on quantitative stock data to understand a company's stock performance.
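A small sketch of the descriptive-statistics step (pandas on synthetic prices; the column names are hypothetical, not the original dataset's schema):

```python
# Small sketch: summary statistics over a synthetic price series.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
prices = pd.DataFrame({
    "close": 100 * np.exp(np.cumsum(rng.normal(0, 0.01, 250))),
    "volume": rng.integers(1_000, 10_000, 250),
})
prices["daily_return"] = prices["close"].pct_change()
print(prices.describe())  # count, mean, std, min, quartiles, max
```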


  • Weather Prediction - 2014 :
  • Historical temperature analysis using statistics, and temperature prediction using techniques such as PCA, SVD, and Bayesian Networks.
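One SVD-based variant might look like this sketch (NumPy on synthetic temperatures; the exact method used in the project is not stated above): keep the dominant seasonal modes of a years-by-days matrix and use the low-rank reconstruction as a smoothed predictor.

```python
# Hedged sketch: factor a years-by-days temperature matrix with SVD,
# keep the leading seasonal modes, and predict from the low-rank fit.
import numpy as np

rng = np.random.default_rng(0)
days = np.arange(365)
seasonal = 15 * np.sin(2 * np.pi * days / 365)
history = np.stack([10 + seasonal + rng.normal(0, 2, 365) for _ in range(20)])

U, s, Vt = np.linalg.svd(history, full_matrices=False)
k = 2  # keep the dominant seasonal modes
smoothed = (U[:, :k] * s[:k]) @ Vt[:k]

# Predict next year's profile as the mean of the low-rank reconstructions.
forecast = smoothed.mean(axis=0)
print("predicted temp on day 180:", round(forecast[180], 1))
```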


  • Multi-Channel ADPCM CODEC (MCAC) - 2013 :
  • Designed the RTL and the verification model for the Adaptive Quantizer and Tone & Transition blocks, along with some of their sub-models, for the pipelined MCAC design.
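A behavioral sketch of what an adaptive quantizer does (Python, not the Verilog RTL; the step-adaptation constants are illustrative assumptions): quantize the prediction error with a step size that expands after large codes and contracts after small ones.

```python
# Behavioral sketch of an ADPCM-style adaptive quantizer; the code
# width and adaptation factors are illustrative assumptions.
def adaptive_quantize(errors, step=1.0):
    codes = []
    for e in errors:
        code = max(-3, min(3, round(e / step)))  # 3-bit signed code
        codes.append(code)
        # Adapt: expand the step after big codes, contract after small.
        step *= 1.5 if abs(code) >= 2 else 0.9
        step = max(step, 0.1)
    return codes

print(adaptive_quantize([0.2, 0.1, 5.0, 6.0, 0.3, -4.0]))
```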


  • Memory Access Bus Arbiter (ARB) of DTMF receiver - 2013 :
  • Designed the RTL model and performed verification, logic synthesis, test insertion and detailed timing analysis
    using Verilog HDL.
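A behavioral sketch of bus arbitration (Python, not the Verilog RTL; the actual arbitration policy is not stated above, so round-robin is assumed as a generic example):

```python
# Behavioral sketch: grant the memory bus to one requester per cycle,
# rotating priority fairly. Round-robin policy is an assumption.
def round_robin_arbiter(request_trace, n=4):
    last = n - 1
    grants = []
    for requests in request_trace:  # each entry: set of requesting masters
        grant = None
        for i in range(1, n + 1):
            candidate = (last + i) % n
            if candidate in requests:
                grant = candidate
                last = candidate
                break
        grants.append(grant)
    return grants

print(round_robin_arbiter([{0, 2}, {0, 2}, {1}, set()]))  # [0, 2, 1, None]
```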


  • Boundary Scan Sum - 2013 :
  • Hierarchically designed Boundary Scan Sum with optimal sizing, clean DRC and LVS in Cadence Virtuoso, 0.6 micron
    technology.


  • Deep Learning for Image Classification - 2013 :
  • Deep Neural Networks with feature extraction were implemented for image classification on the Caltech-101 dataset.
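A hedged sketch of such a pipeline (scikit-learn on random stand-in data, not Caltech-101 or the original network): extract compact features, then train a neural-network classifier on them.

```python
# Hedged sketch: PCA feature extraction feeding a neural-network
# classifier, on random stand-in data rather than Caltech-101.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
images = rng.random((300, 32 * 32))  # flattened grayscale stand-ins
labels = rng.integers(0, 5, 300)     # 5 stand-in categories

clf = make_pipeline(PCA(n_components=50),
                    MLPClassifier(hidden_layer_sizes=(64,), max_iter=300,
                                  random_state=0))
clf.fit(images, labels)
print("predicted class:", clf.predict(images[:1])[0])
```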

