HPC System Engineer / Administrator
With Stanford University in Stanford, CA, US
Posted on January 07, 2020
About this job
Job type: Full-time
Role: System Administrator
HPC System Engineer / Administrator
The Stanford Research Computing Center (SRCC) is seeking outstanding applicants for the position of HPC System Engineer/Administrator. Based in Polya Hall on the Stanford campus, you will join a dynamic and growing team of technology specialists supporting the computational and data needs of Stanford's research community. This role will focus primarily on the management and support of an HPC cluster and multipetabyte storage platform that provides essential infrastructure for Stanford's bioinformatics and genomics communities.
The hiring range for this position is $135,000 - $150,000.
The successful candidate will be someone who:
- Has built, managed, secured and supported scalable HPC clusters before and is comfortable handling all aspects of that work, from racking servers, to configuring networking, to installing software for end-users, to providing one-on-one instruction and support
- Has managed large-scale research storage platforms
- Thrives when working in an academic environment
- Is passionate about technology and is driven by challenge and intellectual curiosity
- Is self-motivated to learn, sometimes on their own time
- Has user support experience and actually likes working with end-users on a daily basis
- Is thorough, detail-oriented, documents well, and understands the importance of documentation
- Isn't afraid of hardware
- Loves problem-solving
- Understands the need to ensure the usability of systems from the end-users' perspectives
- Works independently but also collaborates within and across teams
- Has a sense of humor
The SRCC is jointly sponsored by University IT (UIT) and the Office of the Dean of Research. The SRCC team of 18 cyberinfrastructure professionals offers research computing platforms, hosting, consultation, tool and software development, system engineering, and system administration in support of computational and data-intensive research across the Stanford campus.
This position will provide system administration, engineering and specialized technical consultation for existing and future systems and services for research computing workloads. The position will also specifically have responsibilities for department- or faculty-owned research computing HPC cluster and server environments, including filesystems and storage platforms, Linux server environments, job schedulers, scientific tools, and application software.
- Support and administration of research computing clusters, servers and storage systems, including installation, network and security configuration, monitoring, maintenance, application software build/configuration, upgrading, patching, and complex user problem solving. Those systems may be in Stanford data centers or in Stanford research labs and units.
- Provision computing platforms and associated storage and networking for research environments, incorporating novel technical solutions as needed to meet research requirements. Install, test and configure software tools, libraries and compilers to meet researchers' needs.
- Customize environments as requested by research teams, with specific focus on the optimization of end-users' experiences
- Provide advanced cyberinfrastructure training and consultation for faculty, postdocs and graduate students across a wide array of university research units and departments.
- Ensure systems are configured and managed in accordance with Stanford policies and any regulatory requirements specific to data sources and classifications.
- Conceive, design, develop, optimize, integrate, and maintain information technology at a complex level.
- Troubleshoot highly complex problems for which the analysis and resolution require extensive knowledge of many diverse system components
- Develop long range technology plans.
- Provide leadership and IT solutions for complex problems
*Other duties may be assigned.
The job duties listed are typical examples of work performed by positions in this job classification and are not designed to contain or be interpreted as a comprehensive inventory of all duties, tasks, and responsibilities. Specific duties and responsibilities may vary depending on department or program needs without changing the general nature and scope of the job or level of responsibility. Employees may also perform other duties as assigned.
Bachelor's degree and eight years of related increasingly technical work experience or a combination of education and relevant experience. Strong, demonstrated knowledge of Linux and demonstrated experience managing complex multiuser HPC clusters and large-scale research storage environments are required as well.
Knowledge, Skills and Abilities
Advanced knowledge of Linux is required; experience managing, using, supporting and consulting on research computing cyberinfrastructure in an academic or research environment is strongly preferred. Proven ability to deliver outstanding system and service administration and end-user support in a thorough and timely manner is needed. This position requires that you be able to juggle multiple competing priorities, work quickly and accurately, and demonstrate initiative in conceptualizing and moving technical projects successfully to completion. You must be able to do independent analysis, troubleshooting and problem resolution, but also must work collaboratively with other team members and across organizational group boundaries.
This position requires hands-on experience building and supporting multi-tenant Linux servers/clusters and their associated networks, file systems and storage devices in production research environments. Specifically, the technical knowledge needed to be successful in this position includes:
- Expert demonstrated knowledge of clustered Linux systems, including securing systems, and day-to-day troubleshooting, monitoring, support, software packaging, and working within industry-wide best practices
- Experience administering, configuring, and supporting HPC clusters, including systems with accelerators, and high performance file systems and storage. This includes hardware installation, configuration, upgrades and repairs
- Knowledge of and experience utilizing data and system security techniques, practices and standards as they relate to HPC systems, storage and networks
- Experience installing and supporting parallel computing environments (e.g., OpenMPI, MVAPICH)
- Hands-on experience installing, configuring and supporting job schedulers and resource managers (e.g., SLURM, OGE, LSF, Torque, Maui)
- Familiarity with deploying virtualization technologies and basic knowledge of container technologies
- Exceptional written and verbal communication skills
- Experience using shell scripts, programming languages (Python), and programming automated system management tools, both at a general level (e.g. Puppet) and at a cluster-level (e.g. Rocks)
- Experience installing, configuring, managing and supporting GPFS parallel file systems is desired but not required
- Familiarity with TCP/IP, Internet Routing Protocols, private and public networks, VLANs, Firewalls, Load Balancers, addressing schemes, subnet creation and subnet masking. Proven ability to troubleshoot basic network issues and communicate and work with a team of network engineers to solve possible network design issues in HPC
- Familiarity with the intersection of storage and networking disciplines: i.e. transport media, speeds of media, storage networks, IP based storage delivery, other storage delivery technologies
- Experience with some of the following applications: Git, Apache, Kerberos, LDAP
- Software installation and maintenance experience supporting research codes and clients
- Exceptional client service and communication, focusing on proactive system administrator actions and interactions to reduce or remove barriers to clients' efficient use of resources to advance research
Why Stanford is for You
Imagine a world without search engines or social platforms. Consider lives saved through first-ever organ transplants and research to cure illnesses. Stanford University has revolutionized the way we live and enrich the world. Supporting this mission is our diverse and dedicated 17,000 staff. We seek talent driven to impact the future of our legacy. Our culture and unique perks empower you with:
- Freedom to grow. We offer career development programs, tuition reimbursement, and the chance to audit a course. Join a TED Talk or film screening, or listen to a renowned author or global leader speak.
- A caring culture. We provide superb retirement plans, generous time-off, and family care resources.
- A healthier you. Climb our rock wall or choose from hundreds of health or fitness classes at our world-class exercise facilities. We also provide excellent health care benefits.
- Discovery and fun. Stroll through historic sculptures, trails, and museums.
- Enviable resources. Enjoy free commuter programs, ridesharing incentives, discounts and more!
How to Apply
We invite you to apply for this position by visiting https://careersearch.stanford.edu/jobs/hpc-system-engineer-administrator-8961. To be considered, please submit your resume and a cover letter along with your online application.
Consistent with its obligations under the law, the University will provide reasonable accommodation to any employee with a disability who requires accommodation to perform the essential functions of his or her job.
Stanford is an equal employment opportunity and affirmative action employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic protected by law.
- Schedule: Full-time
- Job Code: 4770
- Employee Status: Regular
- Grade: M
- Requisition ID: 85728