Senior IT Systems Engineer at Columbia University-Institute for Genomic Medicine (New York, NY)
The Senior IT Systems Engineer will work in a team environment to deploy and support mission-critical, high-availability servers, storage, and backup at a high performance computing and storage facility for Columbia’s Institute for Genomic Medicine (IGM). Serve as the primary system administrator for a large SGE cluster, including maintenance and upkeep of cluster-class servers and enterprise-class storage area networking devices, and ensure data integrity and availability. Responsible for production software deployments, upgrades, capacity planning, and backup strategies, as well as software/hardware integration planning. Coordinate regularly with departmental users to ensure that infrastructure and technology are current, meet all requirements, and are supported accordingly. Interface with vendors to ensure security updates, patches, and configurations are installed properly. Analyze the costs and benefits of technology used by the center, and forecast the direction of future technology and of refreshes to current technology.
RESPONSIBILITIES:
- Maintain mission-critical IGM computing operations, including a cluster computing environment of over 2000 CPU cores and ~2 PB of storage.
- Administer, validate, and review user and system accounts, access controls, audit logs and system integrity to maximize system security and data confidentiality.
- Design, implement, install, configure, maintain, and support the IGM High Availability/Clustering server environment.
- Provide technical support and monitoring of an in-house software application, third-party software, open source software, and MySQL databases.
- Author, implement, execute, and periodically update System Security, Business Continuity, and Disaster Recovery Plans to be consistent with Columbia Medicine policies and standards.
- Interface between IGM and technology vendors on all hardware/software updates, installs, and purchasing.
- Ensure all measures are in place to support relevant CUMC and IGM Security guidelines.
- Serve as primary contact in an on-call rotation that provides 24x7x365 coverage of mission-critical functions.
Desired Skills and Experience
- Solid computer systems know-how in building and maintaining systems in a high performance computing environment
- Expert Linux systems administration skills with a minimum of 2 years of experience, preferably within a heterogeneous environment consisting of several subsystems
- Thorough understanding of network setup and configuration, with the ability to troubleshoot and solve network bottlenecks
- Experience working in a diverse data center environment consisting of an HPC cluster and high-volume storage systems
- Experience with shell/Python/Perl scripting; experience integrating with a source code versioning system (Git or Subversion) is a plus
- Relational database management experience (e.g. MySQL, PostgreSQL, Oracle)
- Proven ability to read, understand, and apply technical documentation, and to learn new technologies quickly
- Ability to multitask, prioritize, and work within a team in a quickly evolving environment
- Ability to communicate effectively with team members and customers, both verbally and through documentation