Senior Software Engineer - High Performance Computing
Redmond, WA 
Posted 19 days ago
Job Description
Overview

As a Senior Software Engineer - High Performance Computing on the HPC (High Performance Computing)/AI (Artificial Intelligence) team, you'll have the opportunity to work on cutting-edge technology that powers our cloud AI supercomputers (Azure HPC documentation | Microsoft Learn). You will work directly with GPUs (Graphics Processing Units) and other accelerators, delivering world-class performance and enabling breakthroughs in industry, research, and AI. Join us and help shape the future of high-performance computing!

You will play a critical role in delivering and maintaining the infrastructure for our cloud supercomputers and enabling the AI revolution. You will independently own the delivery and burn-in of clusters into Azure, ensuring that the hardware is stable for customers to run their applications. This will involve working closely with hardware vendors and other teams to ensure that the clusters are properly configured and optimized for performance across CPU (Central Processing Unit), accelerators, and network infrastructure, as well as tracking progress through every stage of the process.

In addition, you will be responsible for automating the quality process and debugging issues as they arise, ensuring successful resolution. This will involve developing and maintaining tools and processes to automate testing and ensure that quality is built into every step of the development process. You will also work closely with other teams to diagnose and resolve issues and to ensure that our customers have a seamless experience using our cloud supercomputers, and you will act as the voice of the customer in representing their issues.

Your attention to detail will be critical in this role: you will be responsible for keeping quality front and center, and you should have the desire to identify and isolate potential issues in the early phases of the project. This will involve reviewing system-level specifications, code, and configurations, and working with other teams to identify and address any issues that arise. You will also be responsible for documenting processes and procedures, and for ensuring that our team follows the industry's best practices and standards for software development and deployment.

This opportunity will allow you to work in the highly agile, fast-paced environment of HPC, accelerate your career growth, and be part of the AI revolution.

Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Responsibilities

Collaborates with appropriate stakeholders to determine user requirements for a scenario.
Drives identification of dependencies and the development of design documents for a product, application, service, or platform.
Creates, implements, optimizes, debugs, refactors, and reuses code to establish and improve performance, maintainability, effectiveness, and return on investment (ROI).
Leverages subject-matter expertise of product features and partners with appropriate stakeholders (e.g., project managers) to drive a workgroup's project plans, release plans, and work items.
Acts as a Designated Responsible Individual (DRI) and guides other engineers by developing and following the playbook, working on call to monitor the system/product/service for degradation, downtime, or interruptions, alerting stakeholders about status, and initiating actions to restore the system/product/service for simple and complex problems when appropriate.
Proactively seeks new knowledge and adapts to new trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of products while also driving consistency in monitoring and operations at scale.

 

Job Summary
Company: Microsoft
Start Date: As soon as possible
Employment Term and Type: Regular, Full Time
Required Experience: Open