If you like food and you like reliability, this is the job for you!
Olo is experiencing tremendous growth, and as we enhance our platform to support increased demand, it must be positioned for continued stability, reliability and resiliency. Reporting to the Engineering Manager of Site Reliability, the Site Reliability Engineer will partner with Engineering and Product Managers to learn, improve system availability and sharpen our execution skills to provide an amazing experience for our customers.
Olo is a remote-first company, offering all full-time employees the option to work from anywhere in the U.S. Additionally, our NYC office will remain available for those who prefer to go in.
What You’ll Do
Guide practices from observability and SLIs/SLOs through Incident Response to postmortems and follow-up actions.
Implement and tailor our incident response tools to minimize outage durations.
Build collaborative monitoring solutions with members across multiple product teams.
Contribute insights across teams to help us improve or re-architect existing systems to support scale, performance and extensibility.
Rethink our observability tooling to improve architecture, knowledge models, user experience, performance and stability.
Analyze and mature our processes around Incident Response, Observability, Postmortems and Predictive Monitoring.
Influence an engineering culture of reliability, observability, and availability.
Mentor engineering teams through game days, SRE boot camps and other training and feedback channels.
What We’ll Expect From You
3+ years of professional experience building scalable, efficient, and resilient systems.
Experience with monitoring tools like Datadog, Sumo Logic, Raygun, New Relic or similar.
Fluency in Incident Management using tools such as FireHydrant, OpsGenie, PagerDuty, VictorOps or similar.
Experience with build and deploy tools (e.g., Jenkins, TeamCity, Octopus, or CircleCI).
Prior hands-on software development experience.
To apply for this job please visit jobs.lever.co.