Who you are

Strong knowledge of Kubernetes (especially EKS), including deployment patterns, rollout safety, and core debugging workflows
4+ years of experience with programming languages (Python or Golang preferred)
Strong experience managing projects and initiatives end-to-end
Hands-on experience with AI-assisted development tools such as Cursor, GitHub Copilot or Claude for code generation, debugging, and documentation
Demonstrated ability to write effective prompts to get high-quality, reliable outputs from LLMs
Demonstrated ability to use AI to improve the speed and quality of your day-to-day work
Strong track record of critical evaluation and verification of AI-assisted work (e.g., testing, source-checking, data validation, peer review)
High integrity and ownership: you protect sensitive data, avoid over-reliance on AI, and remain accountable for final decisions and deliverables
Experience with technologies such as Terraform, Buildkite, and/or ArgoCD is required
Bachelor’s or Master’s degree in a relevant field such as Computer Science, or equivalent experience

What the job involves

The Site Reliability Engineering organization at Pinterest is accountable for ensuring overall Pinterest availability as well as enhancing Engineering teams’ capability to design, build and operate robust systems at scale
We are hiring a Sr. SRE to join our Compute SRE team
This team is responsible for ensuring that all compute workloads run smoothly on Pinterest
We're building the future on Kubernetes and our job is to connect it with what Pinterest needs
Pinterest’s applications and infrastructure handle billions of monthly page views and petabytes of data as Pinterest continues to grow and scale
As a Pinterest SRE, you will design and build systems, platforms, tools, frameworks and methodologies to assure the reliability of our large-scale distributed systems
Tackle project challenges on EKS, such as implementing Karpenter. This work affects how every developer codes, tests, and improves their work
Collaborate across various teams to drive projects forward using open-source tools
Build a deep understanding of how Pinterest’s systems behave, scale, interact and fail, and use that insight to identify risks and opportunities for remediation
Build tools and automation to eliminate toil and reduce operational overhead. Create frameworks, processes and best practices to be used across Pinterest Engineering
Build meaningful, insightful and actionable SLIs
Automate critical portions of Pinterest’s engineering processes to minimize risk and maximize the speed of innovation
Manage capacity and performance to help scale our infrastructure on both public and private clouds around the world
Use AI to analyze incidents, operational signals, and system behaviors to help identify patterns, generate plans, and propose remediation approaches
Leverage AI to speed development of runbooks, automation workflows, and reliability tooling by drafting, iterating, and refining approaches

Senior Site Reliability Engineer
