Senior Site Reliability Engineer Job at San Francisco Compute Co., San Francisco, CA

  • San Francisco Compute Co.
  • San Francisco, CA

Job Description

About

We’re the San Francisco Compute Company. We’re building the first real-time trading platform for compute. Everyone from startups to enterprises to research labs and individuals can buy and sell compute, from 1 to 1000+ nodes, for an hour to over a year. With our liquid market to resell unused compute hours, buyers no longer need to worry about contract lock-in, and providers have fewer idle nodes. Over the next decade, we anticipate thousands of enterprises, governments, startups, and labs will be training and serving large models, and we’re building a team to scale our market.

About the Role

ML training clusters are some of the highest-performance computers on the planet. Even relatively small clusters would have been in the TOP500 five years ago. Our supercomputing team is responsible for keeping our compute clusters running smoothly, monitoring hardware health, and fixing things when they go wrong. We believe strongly in automation: code is the only reliable way to manage hardware at scale. As we scale, this will become a more data-driven role, predicting failures before they happen. We’re a small team, so you’ll be spending time talking to customers as well.

About You

  • You’ve managed at least one GPU training cluster in the past (ideally a cluster with more than 1,000 GPUs, but this is not required)
  • You appreciate and value good documentation
  • You have experience provisioning and managing Kubernetes clusters
  • You deeply understand Linux, networking fundamentals, CUDA, NCCL, and InfiniBand
  • You enjoy creating large self-correcting systems that keep hardware humming
  • You meet at least two of the nice-to-haves below

Some Nice to Haves

  • Experience with Go or Rust (>2 years)
  • Experience with distributed storage systems (Weka, VAST, Ceph, etc.)
  • Experience with HPC network architectures (eBGP, fat-tree, VXLAN, MCLAG, etc.)
  • Experience with Linux virtualization (KVM, QEMU, libvirt, etc.)
  • Experience with performance optimization of machine learning kernels

Benefits

  • Unlimited office book budget: You can buy as many books for the office as you want. You’re encouraged to spend time during the workday reading!
  • Generous equity grant: Team members are offered a competitive salary along with equity in the company
  • Retirement matching: We match 401(k) plans up to 4%
  • Medical, dental & vision: We offer competitive medical, dental, and vision insurance for employees and dependents and cover 100% of premiums
  • Time off: We offer unlimited paid time off as well as 10+ observed holidays
  • Parental leave: We offer biological, adoptive, and foster parents paid time off to spend quality time with family
  • Daily lunch: We cover lunch daily for employees
  • Visa sponsorships: Yes, we sponsor visas and work permits

The San Francisco Compute Company is committed to maintaining a workplace free from discrimination and harassment. We make employment decisions based on business needs, job requirements, and individual qualifications, without regard to race, color, religion, belief, national origin, social or ethical origin, age, physical, mental, or sensory disability, sexual orientation, gender identity or expression, marital status, civil union or domestic partnership status, past or present military service, HIV status, family medical history or genetic information, family or parental status including pregnancy, or any other status protected by law.

We welcome the opportunity to consider qualified applicants with prior arrest or conviction records. Our commitment to diversity includes hiring talented individuals regardless of their criminal history, in accordance with local, state, and federal laws, including San Francisco’s Fair Chance Ordinance and California’s ban-the-box laws.

If you require reasonable accommodation for any reason, please reach out to us at team@sfcompute.com.

Job Tags

Holiday work, Contract work, Local area, Visa sponsorship, Work visa
