Senior MLOps Engineer
Toronto, Ontario
$175,000 - $200,000/year
Senior Level
Full-Time
About the role
- You will own and evolve the infrastructure that powers our ML pipelines – from cloud environments and CI/CD systems to workflow orchestration and model deployment
- You will work closely with ML scientists, bioinformaticians, and software engineers to keep our platform reliable, reproducible, and scalable
- Maintain and improve cloud infrastructure (GCP) using Infrastructure-as-Code tools (Terraform)
- Manage IAM, RBAC, and permission policies across cloud environments
- Own and evolve CI/CD pipelines (CircleCI, GitHub Actions) and ensure best practices are followed across the engineering and ML teams
- Administer and support workflow orchestration platforms (e.g., Seqera/Nextflow, Argo, Kubeflow)
- Operate and configure ML experiment tracking and registry tooling (e.g., W&B, MLflow)
- Build and maintain containerized environments (Docker) and manage Kubernetes clusters
- Manage GPU resources – provisioning, scheduling, and debugging hardware and driver issues
- Write and maintain Python tooling, scripts, and integrations that support ML infrastructure
- Help deploy ML models to production environments and monitor their performance
- If this sounds like you, we would love to hear from you
- You have 4+ years of experience in production infrastructure or MLOps, you write solid Python, and you are curious about the ML and scientific workflows your work supports
- You enjoy keeping the infrastructure running smoothly so that scientists can focus on their research
- Above all, you are a collaborative, kind team member who communicates clearly, adapts to evolving needs, and is happy to help colleagues grow their own infrastructure skills along the way
- You are comfortable working across cloud platforms, CI/CD systems, containers, and GPUs – and you take pride in making these systems reliable and easy for others to use
- Extensive hands-on experience with Kubernetes and containerization (Docker)
- Familiarity with Python package and environment management (e.g., pip, conda, pixi)
- Strong Python programming skills
- Experience managing GPU compute (provisioning, debugging, driver management)
- 4+ years of experience operating production infrastructure
- Proficiency with cloud platforms (GCP preferred; AWS/Azure acceptable) and Infrastructure-as-Code (Terraform)
- Self-motivated problem solver with excellent communication skills
- Solid background in CI/CD systems (CircleCI, GitHub Actions, or similar)
- Understanding of ML frameworks (e.g., PyTorch, PyTorch Lightning), ML workflows (training, inference, evaluation), and the model lifecycle
- Familiarity with Kubernetes CRDs and batch/gang schedulers (e.g., Volcano, Kueue)
- Experience working with large-scale datasets (storage, versioning, efficient access patterns)
- Experience working directly with scientists and researchers in an interdisciplinary setting
- Knowledge of biology and/or machine learning science
- Familiarity with data compliance and governance frameworks (e.g., HIPAA, SOC 2)
- Previous startup experience
- Familiarity with MLOps tooling (e.g., W&B, Ray, Vertex AI) and distributed compute patterns (e.g., DDP, real-time/batch inference, multi-node training)