Senior Platform Engineer
| Verified pay (provided by the employer) | $175,000-$275,000 per year |
|---|---|
| Hours | Full-time |
| Location | Boston, MA |
About this job
Job Description
Title: Senior or Staff Platform Engineer
Location: FULLY remote!
Salary: $175k-$275k base + RSUs + Full Benefits
Requirements: 3+ years in Systems Engineering or HPC Infrastructure, strong Linux and bare-metal GPU experience, NVIDIA DGX/HGX, InfiniBand/RoCE, and automation with Python or Go
We build the high-performance, bare-metal GPU infrastructure that powers modern AI. Our team designs and operates large-scale NVIDIA DGX/HGX clusters, high-speed networking, and the automation that turns complex hardware into a reliable, production-ready platform. We work directly with the metal: provisioning nodes, tuning Linux, integrating InfiniBand/RoCE, and building the tooling that enables fast, secure, and scalable AI workloads.
If you want to help shape the systems that make large-scale AI possible, this is where you will do it. We are looking for a Senior or Staff-level Platform Engineer to architect and operate the high-performance GPU infrastructure that powers next-generation AI systems. This is not a traditional cloud role - you will own the full lifecycle of bare-metal GPU clusters, from "empty rack" to production-grade Kubernetes, and build the automation that makes large-scale AI infrastructure reliable, observable, and secure.
If you thrive at the intersection of hardware, distributed systems, and automation - and you love solving the problems that live between teams - you will feel right at home here.
What You'll Be Doing
- Design and operate container orchestration platforms optimized for NVIDIA DGX/HGX-class hardware.
- Build bare-metal provisioning systems (PXE, Ironic, MAAS) to bring GPU clusters online at scale.
- Manage GPU lifecycle: driver stacks, CUDA/kernel compatibility, MIG slicing, and performance tuning.
- Partner with Network Engineering and DCOps to align physical infrastructure with software orchestration.
- Build automation and internal tooling in Go or Python to streamline cluster operations.
- Implement Terraform/Ansible-based IaC for fully auditable, repeatable infrastructure.
- Design high-resolution observability stacks (Prometheus/Grafana, DCGM, VictoriaMetrics).
- Participate in a specialized on-call rotation supporting GPU workloads and core platform services.
What You Need for this Position
- 7+ years in systems, platform, or distributed systems engineering (10+ for Staff).
- Expert-level Linux knowledge: kernel modules, sysctl tuning, hugepages, container runtimes.
- Hands-on experience bootstrapping Kubernetes or SLURM on physical hardware.
- Strong proficiency in Go (preferred) or Python for systems-level automation.
- Deep familiarity with NVIDIA GPU ecosystems (drivers, CUDA, MIG).
- Working knowledge of InfiniBand or RoCEv2 networking and NCCL performance tuning.
- Experience building observability pipelines for hardware-accelerated environments.
- Ability to troubleshoot complex, multi-layered issues across hardware, networking, and orchestration.
- Strong cross-team communication - you're the "glue" between Network, DCOps, and Software.
Bonus Points
- Experience with SLURM, Kubeflow, or distributed PyTorch.
- Integrating vendor APIs (NetBox, Vault, GitLab CI, etc.) into unified workflows.
- Infrastructure testing, chaos engineering, or cluster-level integration test suites.
- Designing telemetry aggregation across hardware, networking, and environmental systems.
What's In It for You
- $175k - $275k/year DOE
- RSUs
- 5 weeks PTO
- 401(k) with match
- Comprehensive Benefit Plan
All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, age, sexual orientation, gender identity or expression, national origin, ancestry, citizenship, genetic information, registered domestic partner status, marital status, status as a crime victim, disability, protected veteran status, or any other characteristic protected by law.
Our hiring process includes AI screening for keywords and minimum qualifications, and a virtual recruiter as part of the application process. A human recruiter reviews all results.
Everforth CyberCoders will consider qualified applicants with criminal histories in a manner consistent with the requirements of applicable state and local law, including but not limited to the Los Angeles County Fair Chance Ordinance, the San Francisco Fair Chance Ordinance, and the California Fair Chance Act.
Everforth CyberCoders is committed to working with and providing reasonable accommodation to individuals with physical and mental disabilities. Individuals needing special assistance or an accommodation while seeking employment can contact a member of our Human Resources team at Benefits@CyberCoders.com to make arrangements.