About the role
Who you are
- Strong knowledge of Kubernetes (especially EKS), including deployment patterns, rollout safety, and core debugging workflows
- 4+ years of experience with programming languages (Python or Golang preferred)
- Strong experience managing projects and initiatives end-to-end
- Hands-on experience with AI-assisted development tools such as Cursor, GitHub Copilot or Claude for code generation, debugging, and documentation
- Demonstrated ability to write effective prompts to get high-quality, reliable outputs from LLMs
- Demonstrated ability to use AI to improve the speed and quality of your day-to-day work
- Strong track record of critical evaluation and verification of AI-assisted work (e.g., testing, source-checking, data validation, peer review)
- High integrity and ownership: you protect sensitive data, avoid over-reliance on AI, and remain accountable for final decisions and deliverables
- Experience with technologies such as Terraform, Buildkite, and/or ArgoCD is required
- Bachelor’s or Master’s degree in a relevant field such as Computer Science, or equivalent experience
What the job involves
- The Site Reliability Engineering organization at Pinterest is accountable for ensuring overall Pinterest availability as well as enhancing Engineering teams’ capability to design, build and operate robust systems at scale
- We are hiring a Sr. SRE to join our Compute SRE team
- This team is responsible for ensuring that all compute workloads run smoothly on Pinterest
- We're building the future on Kubernetes and our job is to connect it with what Pinterest needs
- Pinterest’s applications and infrastructure handle billions of monthly page views and petabytes of data as Pinterest continues to grow and scale
- As a Pinterest SRE, you will design and build systems, platforms, tools, frameworks and methodologies to assure the reliability of our large-scale distributed systems
- Tackle project challenges on EKS, such as implementing Karpenter. This work affects how every developer codes, tests, and improves their work
- Collaborate across various teams to drive projects forward using open-source tools
- Build a deep understanding of how Pinterest’s systems behave, scale, interact and fail, and use that insight to identify risks and opportunities for remediation
- Build tools and automation to eliminate toil and reduce operational overhead. Create frameworks, processes and best practices to be used across Pinterest Engineering
- Build meaningful, insightful and actionable SLIs
- Automate critical portions of Pinterest’s engineering processes to minimize risk and maximize the speed of innovation
- Manage capacity and performance to help scale our infrastructure on both public and private clouds around the world
- Use AI to analyze incidents, operational signals, and system behaviors to help identify patterns, generate plans, and propose remediation approaches
- Leverage AI to speed development of runbooks, automation workflows, and reliability tooling by drafting, iterating, and refining approaches