Senior GPU Optimization Engineer | San Francisco

January 16, 2026

We’re hiring a GPU Optimization Engineer who understands GPUs at a deep, architectural level: someone who knows how to squeeze every last millisecond out of a model, which GPU constraints matter, and how to restructure models for real-world inference performance. You’ll work across CUDA kernels, model graph optimizations, hardware-specific tuning, and porting models across GPU architectures. Your work directly impacts the latency, throughput, and reliability of smallest’s real-time speech models.

What You’ll Do

  • Optimize model architectures (ASR, TTS, SLMs) for maximum performance on specific GPU hardware
  • Profile models end-to-end to identify GPU bottlenecks — memory bandwidth, kernel launch overhead, fusion opportunities, quantization constraints
  • Design and implement custom kernels (CUDA/Triton/tinygrad) for performance-critical model sections
  • Perform operator fusion, graph optimization, and kernel-level scheduling improvements
  • Tune models to fit GPU memory limits while maintaining quality
  • Benchmark and calibrate inference across NVIDIA, AMD, and potentially emerging accelerators
  • Port models across GPU chipsets (NVIDIA → AMD / edge GPUs / new compute backends)
  • Work with TensorRT, ONNX Runtime, and custom runtimes for deployment
  • Partner with the research and infra teams to ensure the entire stack is optimized for real-time workloads

Requirements

  • Strong understanding of GPU architecture — SMs, warps, memory hierarchy, occupancy tuning
  • Hands-on experience with CUDA, kernel writing, and kernel-level debugging
  • Experience with kernel fusion and model graph optimizations
  • Familiarity with TensorRT, ONNX, Triton, tinygrad, or similar inference and kernel tooling
  • Strong proficiency in PyTorch and Python
  • Deep understanding of model architectures (transformers, convs, RNNs, attention, diffusion blocks)
  • Experience profiling GPU workloads using Nsight, nvprof, or similar tools
  • Strong problem-solving abilities with a performance-first mindset

Great to Have

  • Experience with quantization (INT8, FP8, hybrid formats)
  • Experience with audio/speech models (ASR, TTS, SSL, vocoders)
  • Contributions to open-source GPU stacks or inference runtimes
  • Published work related to systems-level model optimization

Years of Experience

  • 3–5 years of specialized experience in GPU optimization, in academia or industry

Education

  • Master’s or PhD in GPU programming or a related field

Work Level: Mid Senior
Employment Type: On-Site
Salary: $200K – $300K • $100K – $200K Equity
Valid Until: August 7, 2026
States: CA
Region: San Francisco
Country: United States of America
Salary Currency: United States Dollar ($ USD)
smallest.ai
Industry: Technology
Company size: 11-20 employees
Founded in: 2023