
Yuang Xu

Incoming CS PhD Student · UC Merced

Education

University of California, Merced Starting Sep. 2026
Incoming PhD in Computer Science
University of Southern California (USC) Jan. 2024 – Dec. 2025
Master of Science in Computer Science
Shandong University, China Sep. 2018 – Jun. 2022
Bachelor of Engineering in Computer Science and Technology

Research Interests

Distributed ML training & inference systems on heterogeneous memory architectures (e.g., CXL-based tiered memory); AI compiler and operator optimization for LLM inference serving (e.g., vLLM).

Publications & Patents

1. Underwater Image Enhancement Method Based on Feature Fusion Neural Network

Yuan Tian*, Yuang Xu*, Jun Zhou (* co-first authors)

IEEE Access, Sep. 2022

2. A Signal Acquisition Optimization Method Based on RAPID Tomography (Patent)

Bin Zhang, Yuang Xu, Wenrui Luo

CN109946384A, Jun. 2019

Services

Artifact Evaluation Committee Member

MLSys 2026 · EuroSys 2026

Research & Systems Projects

TinyLlama Acceleration using OpenAI Triton May 2025 – Dec. 2025
  • Engineered a custom fused SwiGLU kernel in OpenAI Triton to integrate Linear, gating, and SiLU operations, significantly reducing global memory traffic and kernel launch overhead in MLP layers.
  • Profiled and optimized memory access patterns and block-level parallelism using NVIDIA Nsight Systems and Nsight Compute to eliminate architectural bottlenecks.
  • Achieved up to a 20% reduction in MLP latency on an RTX 3060 and a significantly lower GPU memory footprint compared to the PyTorch eager mode implementation.
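The SwiGLU computation that the fused kernel collapses into a single pass can be stated as a plain-Python reference (an illustrative sketch; names and shapes are not from the project, and the real kernel operates on GPU tiles rather than Python lists):

```python
import math

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def swiglu_mlp(x, w_gate, w_up, w_down):
    """Reference SwiGLU MLP: down(silu(x @ W_gate) * (x @ W_up)).
    A fused kernel computes the gate, up-projection, SiLU, and
    elementwise product together, so the intermediates never round-trip
    through global memory as separate tensors."""
    def matvec(w, v):  # w: (out x in) matrix as nested lists, v: vector
        return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]

    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)
```

Fusing these steps also replaces three or four separate kernel launches with one, which is where the launch-overhead savings in the bullet above come from.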
Vision Transformer (ViT) Operator Optimization via TVM TensorIR Aug. 2025 – Oct. 2025
  • Executed end-to-end model conversion from PyTorch to TVM via ONNX, identifying and targeting performance-critical subgraphs in Attention and Patch Embedding layers.
  • Optimized memory locality and execution efficiency by combining MetaSchedule auto-tuning with manual TensorIR scheduling, including tiling, loop reordering, vectorization, and cache read/write reuse.
  • Benchmarked the optimized kernels on an NVIDIA RTX 3070 Ti, achieving lower inference latency and memory overhead than the PyTorch eager mode baseline.
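The tiling and loop-reordering transforms mentioned above can be illustrated in plain Python (a sketch of the locality idea only, not the actual TensorIR schedule; tile size and layout are illustrative):

```python
def matmul_tiled(a, b, n, tile=4):
    """Naive n x n matmul with loop tiling (blocking), the same locality
    transform applied via TensorIR scheduling: iterate over tile x tile
    blocks so the working set of each inner loop stays cache-resident.
    a, b, and the result are row-major flat lists."""
    c = [0.0] * (n * n)
    for i0 in range(0, n, tile):
        for k0 in range(0, n, tile):
            for j0 in range(0, n, tile):
                # Inner loops touch only one block of a, b, and c at a time.
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, n)):
                        aik = a[i * n + k]
                        for j in range(j0, min(j0 + tile, n)):
                            c[i * n + j] += aik * b[k * n + j]
    return c
```

In TVM the same effect comes from `split`/`reorder` scheduling primitives plus cache read/write stages; the Python version only shows why the blocked iteration order improves reuse.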
Mini-PyTorch: C++ Deep Learning Framework Jan. 2025 – May 2025
  • Implemented a lightweight deep learning framework in C++ to understand the internal mechanisms of systems like PyTorch.
  • Designed a Tensor class with proper memory management (using smart pointers) and stride handling to support n-dimensional array operations.
  • Developed a basic Autograd engine that constructs a dynamic computational graph (DAG) and performs reverse-mode automatic differentiation.
  • Implemented an operator dispatcher to separate the frontend interface from the execution backend.
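The Autograd mechanism described above (forward pass builds a DAG, backward pass applies the chain rule over it in reverse topological order) can be sketched with a minimal scalar example; this is an illustrative Python analogue, not the project's C++ code:

```python
class Value:
    """Minimal scalar autograd node: each operation records its parents
    and a local backward rule, forming a DAG during the forward pass."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():  # d(a+b)/da = d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():  # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the DAG, then apply the chain rule in reverse.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()
```

For example, with `z = x * y + x`, calling `z.backward()` accumulates `x.grad = y + 1` and `y.grad = x`, matching the hand-derived partial derivatives.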

Work Experience

4Paradigm (AI Unicorn) – Algorithm Engineer Intern Apr. 2025 – Sep. 2025
Shanghai, China
  • Built a multimodal content generation pipeline integrating ASR (Whisper), LLM (Qwen), and TTS models.
  • Optimized the inference backend service, implementing request batching and asynchronous processing, which reduced average response latency by 20% under high concurrency.
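The request-batching pattern in the bullet above can be sketched with `asyncio` (an illustrative micro-batching loop under assumed names like `batch_worker` and `submit`, not the production service):

```python
import asyncio

async def batch_worker(queue, handle_batch, max_batch=8, max_wait=0.01):
    """Micro-batching loop: block for the first request, then drain
    whatever else arrives within max_wait (up to max_batch), run one
    batched call, and resolve each caller's future."""
    while True:
        batch = [await queue.get()]
        loop = asyncio.get_running_loop()
        deadline = loop.time() + max_wait
        while len(batch) < max_batch:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        results = handle_batch([req for req, _ in batch])
        for (_, fut), res in zip(batch, results):
            fut.set_result(res)

async def submit(queue, request):
    # Client side: enqueue the request with a future and await its result.
    fut = asyncio.get_running_loop().create_future()
    await queue.put((request, fut))
    return await fut
```

Batching amortizes per-request model overhead across concurrent callers, while the futures keep each caller's response path asynchronous, which is the mechanism behind the latency reduction claimed above.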
TP-LINK – Software Engineer Jul. 2022 – Jul. 2023
Hangzhou, China
  • Designed a high-performance database connection pool for cloud services, utilizing lock-free data structures to handle high concurrency; contributed bug fixes to the open-source Kona JDK.

Skills

Languages

C/C++ · CUDA · Python · PTX Assembly

AI Systems

OpenAI Triton · Apache TVM · PyTorch Internals · CUDA Optimization · TPU

Tools

Docker · Git · Linux · GDB · Nsight Systems

PDF Version

Download PDF