Job Description

Staff Engineer, SRE - 14329

Our direct client is a software company that processes almost a quarter of U.S. mortgage applications. This contract role is located in Pleasanton, CA.



This is a fantastic opportunity to work and collaborate closely with our software engineering, architecture and operations teams at Client. Our Site Reliability Engineering and DevOps teams are responsible for ensuring Client’s services are highly available, reliable, secure and scalable.


This is a very senior level position. The ideal candidates are fluent in systems programming and/or automation and can leverage their experience to solve complex problems associated with running production environments at massive scale in multi-tenant environments.



  • Employ deep troubleshooting and scripting skills to improve the availability, performance, and security of Client Services.
  • Coding and Automation of Applications on Cloud Platform
  • Implement automated tests, automated deployments, and operational tools
  • Collaborate with Product and Support teams to plan and deploy product releases
  • Set strategic and operational goals for the team, and work with the team to deliver on them.
  • Work with Cloud Platform and Operations leaders to develop narratives, backlog grooming, epic planning and overall sprint planning processes
  • Work with Engineering leadership to build shared services that meet the requirements and needs of the platform and application teams
  • Ensure services are designed with 24/7 availability and operational readiness and rigor
  • Implementation of proactive monitoring, alerting, trend analysis and self-healing systems
  • Participate in on-call rotations, driving restoration and repair of service-impacting issues
  • Define non-functional requirements as part of the product lifecycle to influence the new designs, standards, and methods for scalable, highly available distributed systems
  • Contribute to product development / engineering as needed to ensure Quality of Service of Highly Available services
  • Takes a command and control role as Incident Manager during critical incidents focusing on minimizing MTTR & MTTD
  • Participates in After Action Reviews and facilitates discovery of Root Cause.
  • Identifies, evaluates and executes preventive measures to minimize/avoid impact to the customer's experience. Proactive vs. customer-escalated
  • Conduct Root Cause Analysis and drive repair of Problem Records in order to prevent recurrence through to closure including, but not limited to, resolution of product/service defects or design changes, infrastructure changes, or operational changes
  • Partner with SREs and lead by example - contributor more than a delegator



  • BS in Computer Science, Computer Engineering, Math, or equivalent professional experience



  • 10+ years of Systems/Applications automation in 24x7 Production Services environments
  • Fluency with at least one current-generation scripting language used by DevOps professionals (Python, Perl, PHP, Ruby), plus Java and/or .NET development
  • Excellent troubleshooter, utilizing a systematic problem-solving approach spanning code, systems, and network theory & protocols (TCP/IP, UDP, ICMP), including the ability to read a packet capture (tcpdump, etc.)
  • Demonstrated experience in designing, analyzing, and diagnosing large-scale distributed systems + Windows Server and/or Linux systems internals (system libraries, file systems, client-server protocols)
  • Experience with elastic scalability, fault tolerance and other cloud architecture patterns
  • Experience operating on AWS (both PaaS and IaaS offerings)
  • Experience in both Windows (2k8R2+) and Linux (CentOS), plus security triage & forensic analysis
  • Experience with Continuous Integration and Continuous Delivery concepts, including Infrastructure as Code utilizing tools like Terraform, CloudFormation and Chef/SaltStack
  • Familiarity with containerization tools like Docker, and PaaS services on AWS.
  • NoSQL/Docker/Micro-services/Forensic-Analysis experience is a big plus
  • Demonstrated strength in SaaS services, experience in massive scale web operations.


Interested in Applying?

We can’t wait to see your resume! Please apply below with your most current resume and anything else you’d like us to know about you (US work authorization, current location, etc.). Feel free to email Himanshu Jha or call 408-715-1210 x 106.