Senior Database Reliability Engineer (DBRE) & Architect
Alex Staff Agency
- Kazakhstan
- Permanent position
- Full-time
- Designing and implementing a self-service platform (Terraform + Ansible) for deploying HA clusters (PostgreSQL, ClickHouse, MongoDB, Redis) in a heterogeneous environment (Bare Metal, OpenNebula, K8s, Public Clouds).
- Managing rapidly growing ClickHouse analytics clusters (12+ clusters, tens of terabytes), with a focus on sharding, ReplicatedMergeTree, and reliable S3 backup pipelines under high load.
- Maintaining and scaling infrastructure for Apache Airflow and Redash, ensuring the reliability of ETL pipelines and visualization tools.
- Implementing SRE practices in data management: replacing manual incident response with automated self-healing mechanisms and defining SLOs and SLIs.
- Migrating legacy solutions to modern cloud patterns and implementing Kubernetes operators for stateful workloads.
- Serving as a technical authority for product teams to optimize data schemas and SQL queries for high-load systems.
- DB: PostgreSQL 15+ (Patroni, PgBouncer), ClickHouse (Sharded/Replicated), MongoDB, Redis, Kafka.
- Data & Analytics: Apache Airflow, Redash.
- Infrastructure: Hybrid Cloud (3+ private DCs, OpenNebula, K8s, Bare Metal, AWS, GCP, Azure, DO).
- IaC & CI/CD: Terraform, Ansible, Python/Go, GitLab, Jenkins, Gerrit.
- Observability: VictoriaMetrics, Grafana, Loki.
- 5+ years of PostgreSQL expertise: deep knowledge of MVCC, locking mechanics, expert-level Patroni/PgBouncer configuration, and experience with seamless major version upgrades under load.
- ClickHouse mastery: experience operating large clusters, understanding ZooKeeper/ClickHouse Keeper, sharding, replication internals, and performance diagnostics at the data-part level.
- Engineering mindset (SRE/DevOps): experience writing complex Terraform modules and Ansible roles; proficiency in Python or Go for automation is a major asset.
- Hybrid environment experience: understanding the nuances of running DBs on Bare Metal vs. Kubernetes vs. Public Cloud, with the ability to optimize TCO and disk subsystem performance (NVMe, Network Storage).
- Systems approach: understanding the full stack from network packets to business logic, including security standards (FIPS, audit logs) and disaster recovery.
- Experience building an Internal Developer Platform (IDP).
- Experience operating databases in Kubernetes via operators (CloudNativePG, Altinity Operator).
- Background working with Cloud or Hosting providers on similar services.