
Dixant Mittal

Dixant works at the intersection of research and industry — currently as a Lead Data Scientist at Paytm, where he focuses on large language models and generative AI, and previously as a Research Scientist at Moovita, building decision-making systems for autonomous vehicles.

He received his Ph.D. in Computer Science from the National University of Singapore, advised by Professor Wee Sun Lee. He also holds a Master’s degree in Computer Science from the same university, where he was advised by Professor David Hsu, and a Bachelor’s degree in Information Technology from the National Institute of Technology Kurukshetra.

His research interests span reinforcement learning, planning & search, and large language models — broadly, how intelligent systems can reason and act effectively in complex, uncertain environments.

Recent Publications

Dixant Mittal, Liwei Kang, Wee Sun Lee (2025). Learning to Search from Demonstration Sequences. In ICLR.

Siddharth Aravindan, Dixant Mittal, Wee Sun Lee (2024). EVaDE: Event-Based Variational Thompson Sampling for Model-Based Reinforcement Learning. In ACML.

Dixant Mittal, Siddharth Aravindan, Wee Sun Lee (2023). ExPoSe: Combining State-Based Exploration with Gradient-Based Online Search. In AAMAS.

Mohit Shridhar, Dixant Mittal, David Hsu (2020). INGRESS: Interactive visual grounding of referring expressions. In IJRR.

Experience

Lead Data Scientist

Paytm

May 2025 – Present Gurugram, India

Fine-tuned Small Language Models (SLMs) for Paytm Chatbot

Owned end-to-end development of a 4B-parameter SLM that replaced the production Llama-70B chatbot stack, beating GPT-OSS 120B by +2.5 pts on Paytm’s customer-care eval while cutting annual inference cost ~97% ($2M → $60K).
Designed and trained a 3-phase pipeline: (1) teacher-aligned supervised fine-tuning; (2) generalized knowledge distillation (GKD) and data aggregation (DAgger) to counter distribution shift; and (3) RL alignment using a modified PPO against human (CSAT) and LLM-judge reward signals.
Built the LLM-as-Judge evaluation framework — a composite score of 6 instruction-following metrics — to benchmark model quality and drive data-driven iteration between training phases.
Architected multi-GPU training and unified-LoRA serving via vLLM across 40 business verticals; production inference sustains p95 TTFT of 120ms on 10K-token prompts, 25ms ITL, and 3,000 tok/s peak throughput at 64 concurrency on 2×H200.
Engineered a continual-learning loop that fine-tunes the live model on a blended CSAT and eval-score reward, with a staged rollout policy (5% → 25% → 50% → 100%) gated on eval-score gain.
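
The staged rollout in the last bullet can be sketched as a small gating function. This is a minimal illustration rather than the production logic: only the 5% → 25% → 50% → 100% stages come from the text, while the function name `next_traffic_share`, the `min_gain` threshold, and the roll-back-to-zero behaviour are assumptions made for the example.

```python
# Traffic shares taken from the text; everything else below is hypothetical.
STAGES = [0.05, 0.25, 0.50, 1.00]


def next_traffic_share(current: float, candidate_score: float,
                       baseline_score: float, min_gain: float = 0.0) -> float:
    """Return the traffic share the candidate model should receive next.

    The candidate advances one stage only when its eval score exceeds the
    baseline by more than `min_gain` (an assumed threshold). Otherwise it is
    rolled back to 0% (an assumed policy, not stated in the source).
    """
    gain = candidate_score - baseline_score
    if gain <= min_gain:
        return 0.0
    for stage in STAGES:
        if stage > current:
            return stage
    return STAGES[-1]  # already at full rollout


if __name__ == "__main__":
    share = 0.0
    for _ in range(4):
        share = next_traffic_share(share, candidate_score=0.82, baseline_score=0.80)
        print(f"serving {share:.0%} of traffic")
```

Keeping the gate as a pure function of the current share and the two scores makes the promotion rule easy to unit-test independently of the serving stack.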

Paytm Playback — Personalized Rap Song Generation

Owned model development of Paytm Playback, a personalized rap-song generator that turns a user’s recent transactions into custom Hindi lyrics and music; ~4M songs generated since launch.
Fine-tuned the open-source ACE-Step diffusion transformer on a custom synthetic audio–lyric dataset to fix poor Hindi pronunciation in the base model, yielding production-quality Hindi rap generation.
Engineered a scalable inference pipeline using Ray Serve with dynamic batching, sustaining real-time low-latency generation through launch-day traffic spikes.
Reduced per-song generation cost from ₹1.92 → ₹0.25 (~87%) through model optimization, mixed-precision inference, and batch-level parallelization.
Delivered measurable product lift: +15% engagement, +12% retention, and a 40% song-share rate.

Research Scientist

Moovita

October 2023 – April 2025 Singapore

Scene Foundation Model for Autonomous Vehicles

Designed a foundation model for driving-scene understanding, representing the road network, vehicle positions, curbs, and other items-of-interest as a heterogeneous graph with typed nodes and relational edges (e.g. vehicle-on-lane, lane-leads-to-lane).
Built a graph neural network backbone over the scene graph to learn unified embeddings capturing both geometric and topological relationships between scene entities.
Pre-trained via self-supervised learning using a dual masking objective — masked node-attribute reconstruction and masked edge-existence prediction — to learn rich, transferable scene representations without labelled data.

Probabilistic Intention-aware Decision Maker

Developed a neural decision-making module that explicitly models uncertainty over surrounding actors’ intentions (turn left/right, go straight, yield) and selects ego high-level actions conditioned on this uncertainty.
Designed a two-headed transformer: one head maintains a Bayesian posterior over neighbouring agents’ intentions, updated via Bayes’ rule as new observations arrive; the second head outputs the ego decision conditioned on the current scene and the posterior.
Achieved dynamic conservatism (cautious when actor intentions were uncertain, assertive when confident), yielding faster mission-completion time and smoother negotiation at collision rates comparable to the production planner’s.
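
The Bayes'-rule update over actor intentions mentioned above can be illustrated with a discrete posterior. This is a hedged sketch of the general technique only: in the system described, the posterior is maintained by a transformer head, whereas here the observation likelihoods are plain numbers, and the function name and the example values are assumptions.

```python
def update_intention_posterior(prior: dict, likelihood: dict) -> dict:
    """One Bayes'-rule step: posterior(i) is proportional to likelihood(i) * prior(i).

    `prior` maps each intention to its current probability.
    `likelihood` maps each intention to P(observation | intention).
    """
    unnormalised = {i: prior[i] * likelihood[i] for i in prior}
    total = sum(unnormalised.values())
    if total == 0.0:
        # Observation is impossible under every intention; keep the prior.
        # (Fallback chosen for this sketch, not taken from the source.)
        return dict(prior)
    return {i: p / total for i, p in unnormalised.items()}


if __name__ == "__main__":
    intentions = ["turn_left", "turn_right", "go_straight", "yield"]
    belief = {i: 0.25 for i in intentions}  # uniform prior
    # Hypothetical likelihoods, e.g. after seeing a left indicator.
    obs = {"turn_left": 0.7, "turn_right": 0.1, "go_straight": 0.1, "yield": 0.1}
    belief = update_intention_posterior(belief, obs)
    print(belief)
```

Repeating the update as observations arrive sharpens the belief, which is what lets a downstream decision head behave cautiously while the posterior is still flat.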

Overtake Confirmation Network

Built a neural decision network based on a scene transformer architecture that ingests the full surrounding environment — lanes, vehicles, pedestrians, and traffic-light signals — and outputs a discrete action: overtake-left, overtake-right, or maintain lane.
Pre-trained via imitation learning on expert driving trajectories, then fine-tuned with RL using a modified asynchronous PPO designed to hide simulator latency and keep GPU utilisation high.
Outperformed the production deterministic planner on all key metrics: lower collision rate, smoother motion profile, greater distance covered, and faster mission-completion time.

Lecturer

NUS Fintech Lab, School of Computing, National University of Singapore

July 2020 – September 2024 Singapore

Developer Toolkit #2 – Backend Programming

Co-lectured a 6-session course on backend development for the NUS-FintechSG programme, covering the full stack of components needed to build and expose a backend system to client applications.

Research Intern

Sea AI Lab

October 2021 – May 2022 Singapore

Differentiable Online Search in Latent State Space

Designed a neural network architecture that bakes the inductive bias of Monte Carlo Tree Search directly into its computation graph, performing online search over a learnt latent world model.
The full search procedure — node selection, expansion, rollout, and back-up — was implemented as differentiable operations, yielding an end-to-end differentiable architecture whose search behaviour could be optimised jointly with the policy and Q-value heads.
Evaluated on grid-world navigation and Sokoban planning tasks, where the model consistently outperformed a vanilla baseline with identical capacity but no built-in search inductive bias.

Teaching Assistant

School of Computing, National University of Singapore

January 2019 – May 2022 Singapore

CS3243 – Introduction to Artificial Intelligence: Teaching and grading assignments.
CS5339 – Theory and Algorithms for Machine Learning: Grading assignments and course consultation.
IS5006 – Intelligent System Deployment: Grading assignments, creating course material and demo code.

Research Intern

Moovita

December 2017 – September 2021 Singapore

Intuitive Motion Prediction Network

Implemented a pedestrian motion-prediction module operating on a 2D bird’s-eye-view grid around the ego vehicle, producing goal-conditioned future trajectory distributions for downstream planning.
Coupled a custom Value Iteration Network for grid-based path computation with a Bayesian filter that maintains and updates a posterior over each pedestrian’s hidden goal location as new observations arrive.

Object Detection Network

Developed a custom YOLO-based object-detection network for autonomous vehicles, pruning the class taxonomy to AV-relevant categories to free up model capacity for the classes that matter.
Re-architected the backbone for embedded deployment: pruned layers and channel widths, removed heavy ConvNeXt blocks, and added a Feature Pyramid Network (FPN) for multi-scale detection, trading depth for latency without sacrificing recall on small objects.
Optimised and deployed on Google’s Coral Edge TPU, achieving real-time on-vehicle detection at 50 FPS and 0.72 mAP.

High Level Path Generation

Developed a high-level online path planner for autonomous vehicles using the A* algorithm, leveraging the road-network graph and real-time vehicle location to produce route plans on demand.
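
For readers unfamiliar with A* on a road-network graph, a minimal sketch follows. It is not the planner described above: the graph layout (`graph`, `coords`), the node names, and the straight-line-distance heuristic are assumptions for illustration, and the heuristic is admissible only if every edge is at least as long as the straight line between its endpoints.

```python
import heapq
import math


def a_star(graph, coords, start, goal):
    """Shortest route on a weighted directed graph.

    graph:  node -> list of (neighbour, edge_length)   (hypothetical format)
    coords: node -> (x, y) position used by the heuristic
    Returns (path, cost), or (None, inf) if the goal is unreachable.
    """
    def heuristic(node):
        return math.dist(coords[node], coords[goal])

    open_heap = [(heuristic(start), 0.0, start)]  # entries are (f, g, node)
    best_cost = {start: 0.0}
    parent = {start: None}

    while open_heap:
        _, g, node = heapq.heappop(open_heap)
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1], g
        if g > best_cost[node]:
            continue  # stale heap entry; a cheaper route was found already
        for neighbour, length in graph.get(node, []):
            new_g = g + length
            if new_g < best_cost.get(neighbour, math.inf):
                best_cost[neighbour] = new_g
                parent[neighbour] = node
                heapq.heappush(open_heap, (new_g + heuristic(neighbour), new_g, neighbour))
    return None, math.inf


if __name__ == "__main__":
    coords = {"A": (0, 0), "B": (1, 0), "C": (1, 1), "D": (2, 1)}
    graph = {"A": [("B", 1.0), ("C", 2.5)], "B": [("C", 1.0)], "C": [("D", 1.0)]}
    print(a_star(graph, coords, "A", "D"))
```

In an online setting, `start` would be the node nearest the vehicle's current location, so the plan can be recomputed on demand as the vehicle moves.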

Senior Software Engineer

Ixigo.com

March 2017 – July 2017 Gurgaon, India

Trending Train Searches

Built a service in Java Spring Boot that consumed the live train-search event stream from Kafka and mined the most frequently searched trains and stations, surfacing them as “Trending Searches” on the Ixigo UI.

Scalable Task Scheduler

Designed and built a horizontally scalable task scheduler in which executor workers pull the next available task from a Kafka queue and run it, allowing consumers to be scaled up or down on demand.
Designed the scheduler as a generic, task-agnostic primitive that any team could plug into for deferred execution, scheduled retries, or long-running async workloads — removing the need for service-specific scheduling logic.

Software Engineer

Snapdeal.com

July 2015 – March 2017 Gurgaon, India

Identity Management System (IMS)

Contributed to Snapdeal’s in-house identity platform (IMS) serving a user base of 25 million.
Shipped OAuth integration for third-party login and built a reusable data validation framework with structured error mapping for the IMS APIs.

Education

Doctor of Philosophy (Ph.D.) in Computer Science

National University of Singapore

January 2019 – June 2024 Singapore

Thesis: Combining Planning and Learning to Improve Decision Making

Advisor: Prof. Wee Sun Lee

Master of Computing (M.Comp.) in Computer Science

National University of Singapore

July 2017 – December 2018 Singapore

Thesis: Active Information Gathering to Disambiguate Referring Expressions

Advisor: Prof. David Hsu

Bachelor of Technology (B.Tech.) in Information Technology

National Institute of Technology Kurukshetra

July 2011 – May 2015 India

Projects

Using Posterior Variance Estimates to Improve Exploration in Monte Carlo Tree Search

Reformulated the MCTS value estimate as a Gaussian posterior over each node’s true value, propagating both mean and variance up the tree during back-up to capture epistemic uncertainty in unexplored sub-trees. Replaced the standard UCB1 exploration bonus with a posterior-variance-based bonus, using Thompson Sampling to select the action branch to visit next rather than relying on visit-count heuristics alone. Demonstrated improved sample efficiency and stronger final policies on benchmark planning tasks compared to vanilla UCT at matched simulation budgets.
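
The selection rule described above can be sketched in a few lines. This is an illustrative sketch, not the project's code: it assumes each child stores a Gaussian posterior as a `(mean, variance)` pair, and the conjugate update with a known observation-noise variance is an assumption, since the source does not specify how mean and variance are backed up.

```python
import random


def gaussian_update(mean, var, observation, noise_var):
    """Conjugate update of a N(mean, var) belief after one noisy observation.

    Assumes Gaussian observation noise with known variance `noise_var`
    (an assumption of this sketch).
    """
    gain = var / (var + noise_var)
    return mean + gain * (observation - mean), (1.0 - gain) * var


def thompson_select(children, rng=random):
    """Pick the child action whose sampled value is highest.

    children: action -> (posterior_mean, posterior_variance)
    A value is sampled from each child's posterior, so high-variance
    (poorly explored) children are chosen more often than their mean alone
    would suggest.
    """
    best_action, best_sample = None, float("-inf")
    for action, (mean, var) in children.items():
        sample = rng.gauss(mean, var ** 0.5)
        if sample > best_sample:
            best_action, best_sample = action, sample
    return best_action


if __name__ == "__main__":
    rng = random.Random(0)
    children = {"left": (0.4, 0.30), "right": (0.5, 0.01)}
    picks = [thompson_select(children, rng) for _ in range(1000)]
    print("left chosen", picks.count("left"), "times out of 1000")
```

Unlike a UCB1 bonus, no visit count appears in the rule; exploration comes entirely from the spread of the posterior.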

Learning to Search in Partially-Observable Environments

Partial observability is fundamental to our uncertain world, and we need to reason about it while making decisions. The Partially Observable Markov Decision Process (POMDP) offers a principled way to model sequential decision-making problems in partially observable domains. Online planning algorithms, such as POMCP and DESPOT, are the preferred approach for solving large POMDPs. They approximate the optimal action for the current belief by searching the most promising part of the search tree, guided by a heuristic function. Designing good heuristics for a task requires considerable human effort and expert domain knowledge, which may be difficult to acquire. Further, a suboptimal heuristic function degrades the performance of these approximate online planners.

Control Autonomous Vehicle from Pixels

Built a convolutional neural network steering controller that maps raw RGB front-camera images directly to steering commands, bypassing intermediate perception and planning modules in an end-to-end fashion. Trained the policy via imitation learning on expert demonstrations, then iteratively improved it with DAgger by querying the expert on states visited by the learner to correct compounding errors from distribution shift. Evaluated in the CARLA simulator, achieving stable lane-following and turn-taking on routes unseen during training.

Asynchronous Deep Q-Network

Implemented an asynchronous DQN in which multiple actor workers interact with parallel environment instances and push transitions to a shared replay buffer, while a separate learner thread performs Q-learning updates concurrently. Decoupled environment stepping from gradient computation to hide simulator latency and keep GPU utilisation high, achieving substantially higher throughput than synchronous DQN at matched hardware. Validated on a 4-way intersection navigation scenario in the CARLA simulator: trained a DQN-based acceleration controller that consumes a bird’s-eye-view of the intersection and selects longitudinal actions to safely negotiate cross-traffic, converging in fewer training hours than the synchronous baseline.

Fast RRT*

Enhanced the RRT* sampling-based motion planner by replacing its linear-scan nearest-neighbour and near-neighbour queries with a k-d tree index, reducing per-query cost from O(n) to O(log n) on average. Achieved substantially faster convergence to high-quality paths on cluttered 2D planning benchmarks, enabling practical use in larger maps where vanilla RRT* became prohibitively slow.
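
A minimal sketch of the nearest-neighbour index idea follows. It is an illustration, not the project's implementation: the class names are hypothetical, only the nearest-neighbour query is shown (the radius query used for rewiring is omitted), and the tree is not rebalanced, so logarithmic query time is only expected when points arrive in random order, as RRT* samples typically do.

```python
import math


class KdNode:
    def __init__(self, point):
        self.point = point
        self.left = None
        self.right = None


class KdTree:
    """Incremental k-d tree supporting insert and nearest-neighbour lookup."""

    def __init__(self, dim=2):
        self.dim = dim
        self.root = None

    def insert(self, point):
        node = KdNode(tuple(point))
        if self.root is None:
            self.root = node
            return
        cur, depth = self.root, 0
        while True:
            axis = depth % self.dim
            go_left = node.point[axis] < cur.point[axis]
            child = cur.left if go_left else cur.right
            if child is None:
                if go_left:
                    cur.left = node
                else:
                    cur.right = node
                return
            cur, depth = child, depth + 1

    def nearest(self, query):
        """Return (point, distance) of the stored point closest to `query`."""
        best = [None, math.inf]

        def visit(node, depth):
            if node is None:
                return
            dist = math.dist(node.point, query)
            if dist < best[1]:
                best[0], best[1] = node.point, dist
            axis = depth % self.dim
            diff = query[axis] - node.point[axis]
            near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
            visit(near, depth + 1)
            # Search the far side only if the splitting plane is closer
            # than the best match found so far.
            if abs(diff) < best[1]:
                visit(far, depth + 1)

        visit(self.root, 0)
        return best[0], best[1]


if __name__ == "__main__":
    tree = KdTree()
    for p in [(0.2, 0.3), (0.8, 0.1), (0.5, 0.9), (0.4, 0.4)]:
        tree.insert(p)
    print(tree.nearest((0.45, 0.5)))
```

In an RRT*-style loop, each newly added tree vertex would be inserted into the index, and each random sample would call `nearest` instead of scanning every vertex.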
