01 Scope of responsibilities
- Develop a standardized observability ecosystem and implement a deliberate telemetry model built on structured events, distributed tracing, and intelligent sampling strategies.
- Act as a strategic partner to product engineering teams, providing the platform, standards, and data they need to own the reliability of their services; use error budgets and alerting to balance feature velocity with stability.
- Enhance detection capabilities with early-warning systems and AI/ML for automated anomaly detection and intelligent data analysis to strengthen system resilience.
- Build internal automation and tooling that streamlines SRE workflows, automates routine tasks, and enhances efficiency across the technology stack.
- Participate in an on-call rotation for incident management, ensuring rapid resolution, effective communication, and post-incident analysis for continuous improvement.