ELK DevOps Automation Engineer
About the role
Role Overview:
We are looking for an ELK DevOps Automation Engineer to take ownership of an enterprise Elasticsearch platform supporting critical workloads for a major US financial institution. This role combines deep Elasticsearch engineering expertise with DevOps and Site Reliability Engineering (SRE) responsibilities. The person in this position will be accountable for cluster architecture, performance optimization, platform stability, automation, and long-term scalability within a highly regulated banking environment.

This is not a dashboard-focused or entry-level ELK role. We are looking for someone who has designed, scaled, and stabilized large production clusters, has led migrations, and can operate confidently within a structured DevOps and SRE model in a regulated financial services environment.

Responsibilities:

Elasticsearch Architecture & Engineering:
- Design, build, and manage distributed, multi-node Elasticsearch clusters in production on Azure AKS and Azure VMs.
- Define cluster sizing strategy, node roles, shard allocation, and scaling models for high-volume banking workloads.
- Design and manage data streams, index lifecycle management (ILM), and data retention policies aligned to compliance requirements.
- Optimize indexing and search performance, including shard strategy, mapping design, query tuning, and Grok-based log parsing pipelines.
- Lead Elasticsearch upgrades, migrations, and re-architecture initiatives with minimal downtime and documented rollback plans.
- Ensure high availability and fault-tolerant configurations across all environments.

Production Reliability & SRE Practices:
- Take ownership of Elasticsearch platform stability in a production banking environment with strict SLA requirements.
- Lead troubleshooting across complex, high-availability clusters, correlating logs, metrics, and traces to isolate failures.
- Perform detailed root cause analysis and implement permanent corrective actions.
- Define and track SLIs, SLOs, and SLAs for the Elasticsearch platform and build Kibana dashboards for real-time SLA compliance.
- Forecast capacity requirements and proactively plan scaling thresholds.
- Develop and maintain operational runbooks and incident response processes.
- Act as the escalation point during critical production incidents.

DevOps, Cloud & Automation:
- Deploy and manage Elasticsearch in Microsoft Azure, covering Elastic Cloud on Azure, self-managed on AKS, and Azure VM deployments.
- Manage centralized log collection using Elastic Agent and Fleet, designing Agent Policies for large-scale data ingestion.
- Build and maintain Logstash and ingest pipelines for parsing complex, custom log formats using Grok patterns and Painless scripts.
- Implement automation using Python and the official Elasticsearch Python client (elasticsearch-py) for index management, reporting, and platform integrations.
- Integrate the Elastic Stack with the OpenTelemetry (OTEL) framework, configuring the OTEL Collector to receive traces, metrics, and logs and export them to Elasticsearch.
- Contribute to CI/CD pipelines for Elasticsearch deployments and configuration management using Terraform and Ansible.
- Integrate Elasticsearch with Azure Monitor, Azure Log Analytics, Dynatrace, LogicMonitor, and PagerDuty.

Security & Compliance:
- Configure and maintain Elasticsearch security, including TLS encryption, RBAC, audit logging, and SAML/SSO integration with Azure Active Directory.
- Ensure platform compliance with SOC 2 and HIPAA requirements, covering audit log retention, PII handling, access controls, and evidence collection for compliance cycles.
- Design and enforce data classification and PII masking policies for log ingestion pipelines.

Technical Leadership:
- Provide architectural guidance and best practices for Elasticsearch cluster design in regulated banking environments.
- Mentor engineers on performance tuning, scaling strategies, compliance, and troubleshooting.
- Drive continuous improvement initiatives across the ELK platform and contribute to long-term reliability and resilience planning.

Requirements:

Core Elasticsearch (Must Have):
- 5+ years of hands-on Elasticsearch experience in enterprise production
- Cluster sizing, shard allocation, node roles & scaling
- Index Lifecycle Management (ILM) & data streams
- Query performance tuning & search profiling
- Elasticsearch migrations & version upgrades
- Kibana: alerting, dashboards, ML anomaly detection
- Logstash and ingest pipelines: Grok, Painless, ingest enrichment
- Elastic Agent & Fleet for centralized agent management

Cloud & Infrastructure (Must Have):
- Microsoft Azure: AKS, Azure VMs, Azure Monitor
- Docker & Kubernetes (AKS specifically)
- Elastic Cloud on Azure deployment & management
- Azure Active Directory (AAD): SAML/SSO integration
- Terraform & Ansible for infrastructure as code
- CI/CD pipelines for Elasticsearch deployments

Automation & Integration (Must Have):
- Python scripting using the elasticsearch-py client
- OpenTelemetry (OTEL): SDK instrumentation & Collector
- REST API integration for Elasticsearch administration
- Elasticsearch Watcher for automated alerting
- Dynatrace, LogicMonitor, or PagerDuty familiarity

Security & Compliance (Must Have):
- Elasticsearch RBAC & audit logging configuration
- TLS encryption for data in transit & encryption of data at rest
- PII/PHI masking & data classification in pipelines
- SOC 2 or HIPAA compliance awareness
- Elasticsearch security in regulated environments

SRE Practices (Must Have):
- SLI / SLO / SLA definition & tracking
- P1 incident handling & root cause analysis
- MTTR reduction using correlated logs/metrics/traces
- Capacity planning & proactive scaling
- Operational runbook development

Education:
- Bachelor’s or Master’s degree in Computer Science, Information Technology, Engineering, or a related field (or equivalent practical experience)

Preferred Certifications:
- Elastic Certified Engineer
- Elastic Certified Observability Engineer
- Elastic Certified Analyst
- Microsoft Certified: Azure Administrator Associate or Azure DevOps Engineer Expert

Preferred Experience:
- Experience in financial services, banking, or other regulated enterprise environments.
- Exposure to large-scale data ingestion pipelines using Kafka, Filebeat, or Fluentd.
- Experience with Apache Airflow or similar workflow orchestration tools.
- Familiarity with Microsoft Sentinel or other SIEM platforms for security monitoring.
- Experience with Prometheus and Grafana for supplementary metrics observability.