
Yuang Xu

Incoming CS PhD Student · UC Merced

Education

University of California, Merced Starting Sep. 2026
Incoming PhD in Computer Science
University of Southern California (USC) Jan. 2024 – Dec. 2025
Master of Science in Computer Science
Shandong University, China Sep. 2018 – Jun. 2022
Bachelor of Engineering in Computer Science and Technology

Research Interests

Distributed ML training & inference systems on heterogeneous memory architectures (e.g., CXL-based tiered memory); AI compiler and operator optimization for LLM inference serving (e.g., vLLM).

Publications & Patents

1. Underwater Image Enhancement Method Based on Feature Fusion Neural Network

Yuan Tian*, Yuang Xu*, Jun Zhou (* co-first authors)

IEEE Access, Sep. 2022

2. A Signal Acquisition Optimization Method Based on RAPID Tomography (Patent)

Bin Zhang, Yuang Xu, Wenrui Luo

CN109946384A, Jun. 2019

Services

Artifact Evaluation Committee Member

MLSys 2026 · EuroSys 2026

Research & Systems Projects

TinyLlama Acceleration using OpenAI Triton May 2025 – Dec. 2025
  • Engineered a custom fused SwiGLU kernel in OpenAI Triton to integrate Linear, gating, and SiLU operations, significantly reducing global memory traffic and kernel launch overhead in MLP layers.
  • Profiled and optimized memory access patterns and block-level parallelism using NVIDIA Nsight Systems and Nsight Compute to eliminate architectural bottlenecks.
  • Achieved up to a 20% reduction in MLP latency on an RTX 3060 and a significantly lower GPU memory footprint compared to the PyTorch eager mode implementation.
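The SwiGLU computation that the fused kernel collapses into a single pass can be stated as a plain-Python reference (an illustrative sketch; names and shapes are not from the project, and the real kernel operates on GPU tiles rather than Python lists):

```python
import math

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def swiglu_mlp(x, w_gate, w_up, w_down):
    """Reference SwiGLU MLP: down(silu(x @ W_gate) * (x @ W_up)).
    A fused kernel computes the gate, up-projection, SiLU, and
    elementwise product together, so the intermediates never round-trip
    through global memory as separate tensors."""
    def matvec(w, v):  # w: (out x in) matrix as nested lists, v: vector
        return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]

    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)
```

Fusing these steps also replaces three or four separate kernel launches with one, which is where the launch-overhead savings in the bullet above come from.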
Vision Transformer (ViT) Operator Optimization via TVM TensorIR Aug. 2025 – Oct. 2025
  • Executed end-to-end model conversion from PyTorch to TVM via ONNX, identifying and targeting performance-critical subgraphs in Attention and Patch Embedding layers.
  • Optimized memory locality and execution efficiency by combining MetaSchedule auto-tuning with manual TensorIR scheduling, including tiling, loop reordering, vectorization, and cache read/write reuse.
  • Benchmarked the optimized kernels on an NVIDIA RTX 3070 Ti, achieving lower inference latency and memory overhead than the PyTorch eager mode baseline.
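The tiling and loop-reordering transforms mentioned above can be illustrated in plain Python (a sketch of the locality idea only, not the actual TensorIR schedule; tile size and layout are illustrative):

```python
def matmul_tiled(a, b, n, tile=4):
    """Naive n x n matmul with loop tiling (blocking), the same locality
    transform applied via TensorIR scheduling: iterate over tile x tile
    blocks so the working set of each inner loop stays cache-resident.
    a, b, and the result are row-major flat lists."""
    c = [0.0] * (n * n)
    for i0 in range(0, n, tile):
        for k0 in range(0, n, tile):
            for j0 in range(0, n, tile):
                # Inner loops touch only one block of a, b, and c at a time.
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, n)):
                        aik = a[i * n + k]
                        for j in range(j0, min(j0 + tile, n)):
                            c[i * n + j] += aik * b[k * n + j]
    return c
```

In TVM the same effect comes from `split`/`reorder` scheduling primitives plus cache read/write stages; the Python version only shows why the blocked iteration order improves reuse.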
Mini-PyTorch: C++ Deep Learning Framework Jan. 2025 – May 2025
  • Implemented a lightweight deep learning framework in C++ to understand the internal mechanisms of systems like PyTorch.
  • Designed a Tensor class with proper memory management (using smart pointers) and stride handling to support n-dimensional array operations.
  • Developed a basic Autograd engine that constructs a dynamic computational graph (DAG) and performs reverse-mode automatic differentiation.
  • Implemented an operator dispatcher to separate the frontend interface from the execution backend.
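The Autograd mechanism described above (forward pass builds a DAG, backward pass applies the chain rule over it in reverse topological order) can be sketched with a minimal scalar example; this is an illustrative Python analogue, not the project's C++ code:

```python
class Value:
    """Minimal scalar autograd node: each operation records its parents
    and a local backward rule, forming a DAG during the forward pass."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():  # d(a+b)/da = d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():  # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the DAG, then apply the chain rule in reverse.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()
```

For example, with `z = x * y + x`, calling `z.backward()` accumulates `x.grad = y + 1` and `y.grad = x`, matching the hand-derived partial derivatives.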

Work Experience

4Paradigm (AI Unicorn) – Algorithm Engineer Intern Apr. 2025 – Sep. 2025
Shanghai, China
  • Built a multimodal content generation pipeline integrating ASR (Whisper), LLM (Qwen), and TTS models.
  • Optimized the inference backend service, implementing request batching and asynchronous processing, which reduced average response latency by 20% under high concurrency.
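The request-batching pattern in the bullet above can be sketched with `asyncio` (an illustrative micro-batching loop under assumed names like `batch_worker` and `submit`, not the production service):

```python
import asyncio

async def batch_worker(queue, handle_batch, max_batch=8, max_wait=0.01):
    """Micro-batching loop: block for the first request, then drain
    whatever else arrives within max_wait (up to max_batch), run one
    batched call, and resolve each caller's future."""
    while True:
        batch = [await queue.get()]
        loop = asyncio.get_running_loop()
        deadline = loop.time() + max_wait
        while len(batch) < max_batch:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        results = handle_batch([req for req, _ in batch])
        for (_, fut), res in zip(batch, results):
            fut.set_result(res)

async def submit(queue, request):
    # Client side: enqueue the request with a future and await its result.
    fut = asyncio.get_running_loop().create_future()
    await queue.put((request, fut))
    return await fut
```

Batching amortizes per-request model overhead across concurrent callers, while the futures keep each caller's response path asynchronous, which is the mechanism behind the latency reduction claimed above.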
TP-LINK – Software Engineer Jul. 2022 – Jul. 2023
Hangzhou, China
  • Designed a high-performance database connection pool for cloud services, utilizing lock-free data structures to handle high concurrency; contributed bug fixes to the open-source Kona JDK.

Skills

Languages

C/C++ · CUDA · Python · PTX Assembly

AI Systems

OpenAI Triton · Apache TVM · PyTorch Internals · CUDA Optimization · TPU

Tools

Docker · Git · Linux · GDB · Nsight Systems

PDF Version

Download PDF