The e-commerce department of Alaska Airlines is seeking an experienced Senior Site Reliability Engineer (Sr. SRE) to be responsible for the reliability, resiliency, and performance of the technology systems supporting our multibillion-dollar, multi-channel e-commerce business. This role is part of a functional team that owns Tier 2 and Tier 3 support for all e-commerce systems, including alaskaair.com, customer mobile apps, loyalty systems, and our back-end tier of large-scale, distributed, and highly available services. The position is highly technical and balances engineering operations with software development to enable rapid product development.
The ideal candidate will have hands-on coding and scripting experience in infrastructure automation and in instrumenting health monitors. They can build creative engineering solutions to operational problems and understand the big picture of how systems relate to each other. They will eliminate manual work through automation and partner with our Development teams to ensure that services are designed and delivered to be mission critical, with a focus on security, resiliency, scale, and performance. They are familiar with DevOps culture and work to spread it within their own team and to others. They understand agile development values and practices, including small, iterative, frequent, and continuous delivery of value.
Key Duties
Guides and trains agile engineering teams to optimize service quality and ensure adoption of reliability best practices.
Introduces and evaluates cutting-edge software tools that push our core tech stack forward and improve the reliability and stability of our site.
Leads the team to collect metrics, crunch data, build dashboards, and improve service monitoring to detect problems before customers are impacted.
Drives a continuous improvement mindset with the team, embracing a DevOps culture by automating everything possible and constantly finding ways to make our systems more reliable.
Understands, experiments with, and adopts emerging industry practices in the systems operations space.
Practices, coaches, and evangelizes reliability best practices.
Works with product teams to establish SLAs around performance that can then be integrated into our monitoring/alerting solutions.
Automates existing manual processes and provides more self-service functionality to the Tier 2 team.
Develops engineering solutions to repetitive failures and other problems that adversely affect production systems.
Practices agile principles to organize and deliver work.
Brings modern delivery practices to legacy systems.
Enables software development teams to continuously push their code to production.
Helps build container-based software delivery to production.
Job-Specific Experience, Education & Skills Required
A minimum of 5 years of hands-on software development experience.
A minimum of 3 years of Reliability Engineering experience.
Experience with Git.
Proficiency in infrastructure scripting and configuration automation tools (Chef/Bamboo/Jenkins).
Experience in Windows Azure/AWS.
Expertise in monitoring tools (AppDynamics/App Insights/Sumo Logic/etc.).
Expertise in incident and problem management including timely problem identification, successful resolution, and root-cause analysis.
Strong verbal and written communication skills to communicate technology concepts and practices.
Experience working in a high-scale, high-traffic, 24/7 environment.
Preferred
A Bachelor of Arts or a Bachelor of Science degree, with a focus in computer science or a similar technical field, is strongly preferred.
Experience with test-driven development (TDD), unit testing, pair/mob programming and other Extreme Programming (XP) techniques.
Expertise with modern design principles, such as the development and utilization of cloud APIs, single-page web apps, hybrid mobile development, and SOLID principles.
Experience in Agile/Lean development methodologies.
via developer jobs - Stack Overflow