Site Reliability Engineer

XP Venture Labs, about 21 hours ago
Remote
CA$150,000 - CA$180,000 annually
Senior Level
CONTRACTOR

Top Benefits

Competitive salary ($150k–$180k CAD annually)
Fully remote work across Canada

About the role

Senior Site Reliability Engineer (SRE) with Team Lead Experience

This role may also be a strong fit if you've held titles such as Lead Production Engineer, Reliability Engineering Lead / Manager, Platform Reliability Engineer, Infrastructure Reliability Engineer, Systems Reliability Engineer, or Production Operations Lead. The discipline matters more than the title.

About XP Venture Labs

At XP Venture Labs, we partner with ambitious companies to solve complex technology challenges and accelerate growth. Our teams are composed of highly skilled engineers, architects, and technology leaders who bring deep technical expertise and real-world delivery experience. We don't operate as traditional consultants; we embed as strategic partners to design scalable systems, modernize platforms, improve reliability, and help our clients navigate high-impact technical decisions with confidence.

From cloud architecture and distributed systems to platform engineering and large-scale modernization, we specialize in the kinds of problems that demand precision, experience, and a relentless focus on outcomes.

About the Role

As our Senior SRE Team Lead, you own the reliability, availability, and performance of our production systems, and you lead the team responsible for keeping them healthy around the clock. This is a hands-on leadership role for someone who has already run an SRE function: someone who has defined SLOs, is comfortable being on call, has led incident bridges at 3 a.m., and has built the tooling, standards, and on-call culture that prevent the next outage rather than just reacting to it.

A few things that make this role what it is:

We move at startup speed. You'll be building core reliability infrastructure from the ground up rather than inheriting a mature platform, juggling several high-impact initiatives at once, and making consequential calls with incomplete information. If you're energized by fast-moving, high-ownership, sometimes-ambiguous environments, you'll thrive here. If you need a slow, tightly-scoped, ticket-by-ticket setup, this likely isn't the right fit, and that's okay.
Our production systems span both Windows and Linux. We need someone genuinely fluent in both, not strong in one and passable in the other. You should be equally at home debugging an IIS / .NET issue and a Linux, systemd, or kernel-level one.
This is a reliability role, not a pipeline role. We're looking for an SRE, someone who applies software engineering to operations, owns production outcomes, and obsesses over SLOs, observability, and resilience. (More on what we mean by that below.)

You'll have the autonomy to evaluate and introduce new tools, establish best practices, and define the standards that guide the SRE function as we scale.

What You'll Own

The reliability, availability, and performance of production systems running on both Linux and Windows.
Defining and managing SLAs, SLOs, SLIs, and error budgets to drive measurable, accountable reliability improvements.
24/7 monitoring, observability, and on-call, including building the dashboards, structured logging pipelines, and actionable alerting that keep us ahead of problems.
Leading incident response, postmortems, and root-cause analysis to cut recurrence and mean time to recovery (MTTR).
Architecting and maintaining scalable, highly available AWS infrastructure.
Mentoring and developing the SRE team, including setting on-call rotations, operational-excellence standards, and a culture of reliability.
Proactive capacity planning and performance optimization before scale becomes a problem.
Partnering with Engineering, Security, and Product to embed reliability across the SDLC.
Evaluating and adopting tools, technologies, and operational frameworks that improve resilience and efficiency.

What We're Looking For

Dual Windows + Linux production expertise. Deep, hands-on experience operating, tuning, and troubleshooting production systems on both operating systems, not just one.
Deep AWS experience. You've architected and run secure, scalable, highly available environments on AWS, not just deployed into them.
Direct experience leading a team of SREs. You've owned the function, including on-call rotations, standards, hiring, and growth, not just mentored a junior or two as a senior IC.
A track record of production SLAs and 24/7 operations. You've maintained production-grade SLAs and built and run the monitoring and observability that back them.
Comfort in fast-paced, high-intensity environments. You've worked somewhere startup-like, building and maintaining systems from scratch while juggling many concurrent priorities, and you do your best work there.

Technical Requirements

Operating systems (both required): Linux (administration, performance tuning, networking, troubleshooting, and Bash; e.g., Ubuntu / RHEL, systemd) and Windows Server (administration, IIS, and PowerShell for Windows/legacy automation).
AWS (deep): EC2, ECS/EKS, Lambda, RDS, DynamoDB, S3, IAM, VPC, and networking, with a proven ability to architect secure, scalable, highly available environments.
Containers & orchestration: strong hands-on experience with Docker and Kubernetes.
Observability & monitoring (core to this role): designing performance dashboards, structured logging pipelines, and actionable alerting. Bonus points for Grafana, BetterStack, and Kibana.
Reliability engineering practices: SLOs/SLIs/error budgets, incident response, postmortems, RCA, and MTTR reduction.
Infrastructure-as-Code: advanced Terraform experience required. (AWS CloudFormation / SAM a plus.)
Databases: MS SQL Server performance tuning and optimization.
Application context: experience operating and monitoring .NET (C#) backend services with Angular front-ends.
Distributed systems: strong understanding of microservice/API architecture, networking, and high-availability design principles.
Message brokers: experience operating and tuning brokers such as RabbitMQ and Kafka.
Automation & tooling: scripting (Python, Bash, Shell) to build reliability tooling and reduce operational toil.
Excellent written and verbal communication in English.
Must be physically located in Canada and legally authorized to work in Canada.

Nice to Have

Networking depth: DNS management, VPN configuration, and packet analysis.
Chaos/resilience testing and game-day experience.
Experience standing up an SRE practice or reliability tooling from zero.

What We Offer

Competitive compensation: $150,000 – $180,000 CAD annually, based on experience and expertise.
Meaningful growth: diverse, high-impact projects that expand your technical depth and leadership.
A high-performance culture: a team that values innovation, technical excellence, and shared success.
Fully remote: 100% work-from-home across Canada.
Modern stack: cutting-edge tools across observability, data visualization, and cloud infrastructure.
Human-centered hiring: no AI tools are used at any stage of the application or evaluation process.

How to Apply

Interested candidates are encouraged to submit their resume at their earliest convenience.

About XP Venture Labs

IT Services and IT Consulting
