We are seeking an experienced Observability SME with deep expertise in observability architectures and leading monitoring platforms. This role will be responsible for designing, implementing, and optimizing end-to-end observability solutions for applications, infrastructure, and networks. The ideal candidate will have extensive hands-on experience with platforms such as ELK (Elasticsearch, Logstash, Kibana), Dynatrace, BMC TrueSight, and SolarWinds, ensuring seamless monitoring, alerting, and analytics to enhance IT operations and service reliability.
Key Responsibilities:
- Observability Strategy & Architecture: Design and implement comprehensive observability solutions to monitor applications, infrastructure, and network performance.
- Monitoring Tool Implementation & Optimization: Deploy and fine-tune monitoring solutions using ELK, Dynatrace, BMC TrueSight, and SolarWinds.
- Log Management & Analysis: Establish centralized logging, log parsing, and correlation for improved event detection and troubleshooting.
- Metrics & Performance Monitoring: Define KPIs, dashboards, and alerts for proactive IT service monitoring.
- Incident Management & Root Cause Analysis: Collaborate with IT operations, DevOps, and SRE teams to diagnose and resolve performance issues.
- Automation & Integration: Integrate monitoring tools with ITSM platforms, AIOps solutions, and automation frameworks for enhanced efficiency.
- Capacity Planning & Optimization: Analyze historical trends and real-time data to optimize resource allocation and performance.
- Stakeholder Collaboration: Work closely with developers, network engineers, system administrators, and business units to ensure observability best practices are followed.
- Continuous Improvement: Stay updated on emerging observability technologies and recommend improvements to existing processes and tools.
Qualifications:
- Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent experience).
- Expertise in Observability & Monitoring Platforms: 8+ years of hands-on experience with ELK Stack, Dynatrace, BMC TrueSight, SolarWinds, and similar platforms.
- Strong Knowledge of Infrastructure & Application Monitoring: Experience monitoring cloud, on-premises, and hybrid environments.
- Experience with Log & Event Correlation: Ability to configure and analyze logs for anomaly detection and security insights.
- Automation & Scripting: Proficiency in scripting languages such as Python, PowerShell, or Bash for automation.
- Cloud & DevOps Understanding: Experience with cloud platforms (AWS, Azure, GCP) and CI/CD pipelines.
- ITIL & Incident Management Exposure: Understanding of ITIL processes and IT service management (ITSM) practices.
- Networking & Security Awareness: Knowledge of network monitoring, SNMP, and security monitoring practices.
- Excellent Communication & Documentation Skills: Ability to present findings, create technical documentation, and train teams on observability best practices.
Preferred Qualifications:
- Certifications in Dynatrace, ELK, BMC TrueSight, or SolarWinds.
- Experience with AIOps, machine learning for anomaly detection, or AI-driven observability.
- Background in Site Reliability Engineering (SRE) or DevOps.
- Familiarity with Infrastructure as Code (IaC) tools such as Terraform and Ansible.