Ellie Mae (NYSE:ELLI) is a leading provider of enterprise on-demand solutions, including an online network, software and services for the residential mortgage industry. Ellie Mae is leading the mortgage industry into the future, developing and marketing software solutions that are transforming how mortgage lenders, investors, and settlement service providers work—and work together.

The Company offers an end-to-end solution, delivered using a Software-as-a-Service model, that serves as the core operating system for mortgage originators and spans customer relationship management, loan origination, and business management. Its solutions help thousands of lender users streamline and automate the mortgage origination process by increasing efficiency, facilitating regulatory compliance, and reducing documentation errors.

Summary of Responsibilities

This is a fantastic opportunity to work and collaborate closely with our software engineering, architecture and operations teams at Ellie Mae. Our Site Reliability Engineers are responsible for ensuring Ellie Mae services are highly available, reliable, secure and scalable.

The ideal candidates are fluent in systems programming and/or automation, and can leverage their experience to solve complex problems associated with running multi-tenant production environments at massive scale.

Primary Responsibilities & Objectives

  • Employ deep troubleshooting and scripting skills to improve the availability, performance, and security of Ellie Mae Services
  • Implement proactive monitoring, alerting, trend analysis, and self-healing systems
  • Participate in on-call rotations, driving restoration and repair of service-impacting issues
  • Conduct Root Cause Analysis and drive Problem Records through to closure to prevent recurrence, including, but not limited to, resolution of product/service defects and design, infrastructure, or operational changes
  • Define non-functional requirements as part of the product lifecycle to influence new designs, standards, and methods for scalable, highly available distributed systems
  • Author system support documents and update production application service run books where needed

Qualifications, Skills and Education

  • 7+ years of Systems Engineering in 24x7 Production Services environments
  • BS in Computer Science, Computer Engineering, Math, or equivalent professional experience
  • Seasoned professional in critical incident triage and response
  • Effective working under pressure
  • Excellent troubleshooter, utilizing a systematic problem-solving approach spanning code, systems, and network
  • Demonstrated experience in designing, analyzing, and diagnosing large-scale distributed systems
  • Fluency with at least one current-generation scripting language used by DevOps professionals (Python, Bash, Perl, etc.)
  • Deep knowledge of Windows Server or Linux systems internals (system libraries, file systems, client-server protocols)
  • Experience with network theory and protocols (TCP/IP, UDP, ICMP, DNS, load balancing, etc.) and the ability to read a packet capture/tcpdump
  • Self-starter who can take ownership of technical issues and follow through to repair
  • Experience in both Windows Server (2008 R2+) and Linux (CentOS preferred) systems administration a plus
  • Experience developing in Java or C# a plus
  • Security triage and forensic analysis a plus
  • Experience in supporting microservices a big plus
  • MongoDB experience a plus
  • Docker experience a plus

Ellie Mae is an Equal Opportunity/Affirmative Action Employer. Minorities, Females, Disabled and Veterans are encouraged to apply.

We do not accept resumes from headhunters, placement agencies, or other suppliers that have not signed a formal agreement with us.
