HPC Engineer
Avalore, LLC
- Annapolis Junction, MD
- Permanent
- Full-time
- Responsible for the normal day-to-day HPC operations and maintenance of the HPC systems
- Provide day-to-day systems administration for NVIDIA GPUs, commodity cluster systems, and Cray HPC environments
- Perform system monitoring, software installations, debugging, upgrades, health checks, and identification/implementation of automated business processes
- Provide assessments, ongoing performance analysis, and recommendations for future architectures
- Responsible for operating all the host systems for the analysis
- Works in a liaison role, linking analysts and their specialty codes and applications to the computing systems that are focused on yielding in-depth, technically sound results
- Oversees analytic applications running on a clustered HPC fabric including CPU and GPU systems
- Manage job submission for clients' applications and codes using MPI/OpenMPI
- Provide in-depth analytic results to achieve a best-tool-for-the-job approach
- Partners with data scientists, engineers, and analysts conducting specialized scientific and engineering analysis
- Escalate issues and problems to hardware support and/or engineering management as necessary
- Responsible for continuous performance analysis and tuning of the HPC environment
- Assist with the identification, troubleshooting, and repair of software problems impacting performance of implemented HPC solutions
- Perform installation of software patches including upgrades to operating systems and firmware
- Assist with the resolution of trouble tickets and software problems identified by system users
- Identify and expand services and functionalities offered in the HPC environment
- Be a primary point of contact to resolve any hardware or software malfunctions, including working with service personnel as necessary
- Review system logs to identify and resolve software- and systems-related issues
- Prepare reports related to the operational efficiency of the hardware and execution of users' jobs
- Experience with MPI/OpenMPI, SLURM, and Linux Operating Systems essential
- Prior experience as a Systems Administrator essential, with a preference for experience working with clustered systems including GPUs in the hardware stack
- Experience with high-speed networking and CUDA preferred
- Software integration experience a plus
- Other duties could be required to support the customer’s mission
- Minimum of 6 years demonstrated on-the-job experience
- Demonstrated on-the-job experience with integrating functionality from disparate systems via scripting/tooling/automation
- Demonstrated on-the-job experience with the Sponsor's system security environment and requirements
- Demonstrated experience leading systems architecture, operations, maintenance, and administration
- Employer-Paid Health Care Plan (Medical, Dental & Vision)
- Retirement Plan (401k, IRA) with a generous matching program
- Life Insurance (Basic, Voluntary & AD&D)
- Paid Time Off (Vacation, Sick & Public Holidays)
- Short Term & Long Term Disability
- Training & Development
- Employee Assistance Program