Senior DevOps & Site Reliability Engineer (Americas)

Appspace, 14 days ago
Toronto, Ontario, Canada
Senior Level
Full-Time

Top Benefits

Generous PTO
Flexible work schedules
Casual dress work environment

About the role

  • Our Cloud Operations team is seeking a Senior DevOps & Site Reliability Engineer who will play a critical role in ensuring the reliability, performance, and scalability of our diverse SaaS applications
  • This role is a specialized hybrid, bridging the gap between legacy VM-based architectures and modern cloud-native standards through aggressive automation and development-focused operations
  • Unlike a traditional SRE, this role is deeply integrated with the software development lifecycle, focusing on the consolidation and optimization of platform operations
  • You will be responsible for building the CI/CD frameworks, self-service tools, and AI-driven automation that allow our engineering teams to move faster while maintaining rock-solid stability
  • Your mission is to maximize the ROI of our existing infrastructure by “automating away” manual toil
  • On-call coverage is required on a weekly rotation
  • In this role, you will be the technical anchor for a global platform footprint that includes a mix of Azure IaaS/PaaS, Google Cloud Platform (GCP), Kubernetes, and various data platforms. Your day will consist of:
  • Intelligent Automation & DevOps: Identifying manual “toil” and replacing it with automated workflows for monitoring, change management, and routine administration of large-scale VM environments to ensure a positive ROI
  • AI-Enhanced Operations: Leading the integration of AI tools for automated code reviews, development frameworks, and predictive log analysis to drive departmental velocity and efficiency
  • Scalable CI/CD & Provisioning: Designing and maintaining “self-service” deployment frameworks and CI/CD pipelines (GitHub Actions, Bamboo) using Infrastructure as Code (Bicep, Terraform)
  • Strategic ROI Projects: Evaluating platform components to determine the most cost-effective path: automating the current state or migrating features to modern, shared architectures
  • Unified Observability: Designing and maintaining a comprehensive observability stack across Azure and GCP (metrics, logs, traces) to identify performance bottlenecks and proactively address system defects
  • Cross-Functional Collaboration: Partnering with engineering, security, and operations teams to ensure new features are “born” with reliability, security, and automated delivery in mind; ensuring adherence to security best practices and compliance standards (SOC2, HIPAA, ISO 27001) and maintaining operational excellence and cost efficiency
  • Root Cause Analysis & Forensics: Investigating complex performance defects by following log trails across web, application, and database tiers (SQL Server, MongoDB, MySQL)
  • Governance & Security: Ensuring all platforms meet security standards (SOC2, HIPAA, ISO 27001) through automated policy enforcement across Azure and GCP

Benefits

  • Generous PTO
  • Flexible work schedules
  • A casual dress work environment
  • Paid company holidays
  • Remote work opportunities
  • Appspace Quiet Fridays (no non-essential internal meetings scheduled)

Qualifications

  • Experience with AI-driven log analysis or automated incident remediation
  • Knowledge of database tuning (SQL Server, MySQL, MongoDB)
  • 6+ years in DevOps or SRE roles, with a proven track record of bridging development and operations in complex cloud environments
  • Expert-level PowerShell and Python skills. Hands-on experience with Bicep or Terraform is required
  • You are a problem-solver and an automator at heart
  • Experience with Atlassian suite (Jira, Confluence, Bitbucket)
  • Familiarity with various middleware and PaaS technologies (e.g. Event Hub, Service Bus, CosmosDB, RabbitMQ, MongoDB, etc.)
  • Must have a passion for life-long learning
  • Familiarity with compliance standards (SOC2, HIPAA, GDPR)
  • Extensive experience with Microsoft Azure (IaaS, PaaS, App Services, Networking) and/or Google Cloud Platform (GCP)
  • Strong background in Windows/Linux Server OS, Kubernetes (AKS/GKE), Helm, and container orchestration
  • Expert-level troubleshooting and the ability to reason through complex process workflows to identify faults in large-scale platform environments

About Appspace

Software Development