Site Reliability Engineer (SRE) - AI Infrastructure Job at Hamilton Barnes Associates Limited, San Francisco, CA

  • Hamilton Barnes Associates Limited
  • San Francisco, CA

Job Description

Are you looking for an exciting new opportunity?

Join a stealth-mode hyperscale data center startup building a next-generation AI and cloud platform designed for startups and advanced research, powered by thousands of H100, H200, and B200 GPUs available on demand. Their platform supports everything from rapid experimentation to full-scale model training and inference, with flexible orchestration via Slurm, Kubernetes, or direct SSH access.

This is a rare opportunity to work at the intersection of hyperscale infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment. If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.

Responsibilities

  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimize resource scheduling, GPU utilization, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have

  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
  • Strong hands‑on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high‑performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Benefits

  • Equity

Salary

  • $300,000 gross per year

Job Tags

Flexible hours
