Site Reliability Engineer
Job Overview
Date Posted: -
Location: Remote Work
Company: Stratologon Software Solutions Pvt Ltd
Job Type: Site Reliability Engineer
Job Categories: Site Reliability Engineer
Qualification: UG or PG Degree
Experience: 3-6 Yrs
Full description
Role & Responsibilities:
Reliability and Stability:
- Own and operate our application stack and AWS and Azure infrastructure to orchestrate and manage our hosted customer instances of Metabase.
- Debug runtime issues across the different levels of our application stack and hosting stack.
- Continuously improve our automated deployments and testing.
- Support the application and the cloud infrastructure our platform runs on, including but not limited to: monitoring the application; investigating and resolving alerts and outages; configuring the monitoring and alerting tooling; investigating issues reported by external and internal clients; and carrying out BAU maintenance activities.
- Deploy application and infrastructure upgrades and enhancements to UAT and Production environments.
- Provision new and manage existing UAT and Production environments.
- Coordinate and carry out Security Incident Management related to our application and infrastructure in accordance with our Security Incident Management processes.
- Maintain our SOC2 compliance and security posture.
- Where necessary, be prepared to work in shifts (early/late, weekends) to provide 24x7 Support for our platform.
Service-Level Objectives (SLOs):
Develop and build our internal tooling and automation to manage the lifecycle of a hosted Metabase installation, from purchase to deployment, zero-downtime upgrades, and general operational health.
Automation and Tooling:
- Automate EKS and AKS cluster provisioning.
- Extend our CRDs and Operators.
- Improve the RDS sharding strategy for our multi-tenant platform.
- Unify and improve our CI/CD platforms.
Required Knowledge, Skills, and Abilities
- 2-5 years’ experience building and operating production infrastructure, ideally on a public cloud.
- Experience supporting business-critical systems (Incident, Change and Problem management processes) in a large-scale operations team.
- Broad knowledge of IT Operations concepts, architecture and information security (ITIL/Security).
- Hands-on commercial experience supporting cloud-based SaaS systems on both AWS and Microsoft Azure.
- Experience setting up EC2, SNS and database instances, securing VPCs, implementing Security Groups, Identity and Access Management, backups, restores and Disaster Recovery, and the equivalent technologies on Azure.
- Hands-on commercial experience in both Linux and Windows systems administration and automation scripting.
- Hands-on commercial experience managing Kubernetes clusters.
- Good understanding of DevOps principles (CI/CD, release automation).
- Knowledge of Clusters, Storage, Backups, Data Export/Import, Monitoring tools and Disaster Recovery.
- Hands-on commercial experience using a wiki (ideally Confluence) to document processes that comprise our Knowledge Base.
- Experience with TCP/IP networking and fundamental network services such as DNS, DHCP, SMTP, NTP, telnet and SSH.
- Ability to read/understand & debug Python and Java.
- Working experience with MongoDB, MariaDB and PostgreSQL.
- Working experience with application monitoring tools.
- Practical application of scripting and scheduling (e.g. Python, cron) to automate repetitive tasks.
- ITIL Foundation qualified.
Capacity Planning:
Continually seek and implement improvements in the environment – cost control, automation, rationalizing the estate, and processes.
Collaboration and Performance Optimization:
Collaborate with core application developers on changes to improve our application metrics, deployment speeds, and CI integration.