My name is Pete Shima and I have a passion for modern approaches to keeping systems up. I have done operational and development work across companies large and small, on systems ranging from ten nodes to millions of nodes. I am an expert in technical and sociotechnical strategies and implementations to improve reliability and resilience in an organization or a product.

Epic Games, Inc

Founded in 1991, Epic Games is the creator of Fortnite, Unreal, Gears of War, Shadow Complex, and the Infinity Blade series of games. Epic's Unreal Engine technology brings high-fidelity, interactive experiences to PC, console, mobile, AR, VR and the Web. Unreal Engine is freely available. I joined Epic just after the launch of Paragon and worked through the launches of Fortnite, Epic Games Store, and Epic Online Services, and the beginning of the development of the Metaverse!

Engineering Director, Reliability

January 2020-Present

After building the infrastructure group, I took on a new challenge: creating the Reliability Engineering group at Epic. The Reliability team focuses on Post Incident Analysis and Learnings, Event and Service Readiness, and Development Projects to keep our systems up. The team focuses on sociotechnical and technical elements of our engineering organization.

  • Created the Reliability organization from the ground up as a new team in Epic comprising Engineering, Program and Product Management, and Risk Analysis.
  • Work with the team to create robust, reliable, and secure services for the company through our programs and interactions with development teams.
  • Used modern approaches to refresh our incident analysis program used with the entire engineering organization.
  • Experimented with modern approaches to risk management such as Risk Radar.
  • Completed deep technical and risk analysis for major launches and communicated and worked with senior leadership.
  • Developed and implemented multiple scoring systems to identify and categorize important incidents based on customer pain or other factors.
  • Organized and ran large scale events for Fortnite with millions of concurrent users such as our Travis Scott Astronomical event or Ariana Grande events as well as narrative story events.
  • Worked with our Epic Games Store team(s) as well as our Epic Online Services teams for major launches or changes.
  • Worked with service and product teams on service readiness and developed the standard program used by Epic for readiness.
  • Led initiatives to transparently share our learnings on major public outages such as our April 2021 Certificate issues.
  • Direct involvement in large scale, high impact critical incidents for Epic.
  • Built multiple engineering communities and ran an internal organization wide engineering podcast with over 250 subscribers.
  • Staff hiring, performance management, and development.
  • Pushed for higher standards on engineering excellence and provided expert advice to many engineering teams.

Lead and Engineering Director, Infrastructure

August 2016-January 2020

This team transitioned from a more traditional centralized operations team to an engineering organization that provided services to development teams.

  • Ran a team of 40+ across 5 different infrastructure verticals with staff consisting of producers, developers, and infrastructure engineers.
  • Developed and mentored a global contracting team providing 24x7 follow-the-sun first line support. Augmented full time needs through contracting.
  • Was well known for having great success in hiring. Grew headcount faster than any other team across the organization.
  • Built and ran a DBA team running some of the largest MongoDB clusters in the world on top of AWS.
  • Scaled services and our platform from 0 to 10+MM concurrent users with the launch of Fortnite.
  • Worked with development teams on implementing operational excellence best practices.
  • Managed contract negotiations and vendors from proof of concept to implementation.
  • Launched from design to implementation.
  • Improved our post mortem process including publishing several large scale public post mortems.
  • Piloted and ran a team that built our next generation platform on top of Kubernetes.
  • Created and ran our cross team embedded operations (SRE) program.
  • Completed short and long term planning/roadmapping for an extremely fast-paced organization.
  • Managed spend, budgets, and cost management for contracts and services in excess of $100M.
  • Direct involvement in high impact critical player issues.
  • Improved or replaced many legacy systems while reaching unmatched scaling peaks.
  • Made large improvements to core infrastructure systems such as metrics and provisioning.
  • Launched a secrets management platform based on HashiCorp Vault along with migrating all configuration secrets to this platform.
  • Launched the organization's first service discovery and configuration management features backed by HashiCorp Consul.
  • Delivered a PCI compliant environment from scratch in 3 months working with compliance, security, and development leaders.
  • Migrated from single account/single VPC architecture to multi-account and cross connected VPCs in AWS.


HashiCorp

Site Reliability Engineer (SRE)

October 2015-August 2016

HashiCorp builds tools to power the modern datacenter. HashiCorp's most popular tools are Vagrant (run local virtual machines easily), Packer (build images for distribution) and Terraform (infrastructure as code) which are developer tools used to create and manage infrastructure. HashiCorp also has several runtime tools such as Consul (service discovery and key value store), Vault (secrets management), and Nomad (cluster/container scheduler). I am currently the team lead on the site reliability team at HashiCorp which is responsible for the reliability of the Atlas (software as a service product) and Private Atlas installations.

  • Created and built the on-call program spanning multiple development teams with escalation workflows and critical alarming with runbooks.
  • Designed and implemented a post mortem process to reduce repeat issues and help with company operational growth.
  • Manage and own the infrastructure as code through Terraform across multiple environments.
  • Migrated Atlas, the SaaS product, from instances to Nomad, a cluster scheduler for running containerized processes.
  • Built, designed, and implemented a process to run the SaaS product on customer premises.
  • Work directly on-site with large customers to set up Private Installations.
  • Push for operational excellence across all the tools and platform.
  • Built staging stacks used for pre-production testing that can be created from scratch in minutes.
  • Implemented centralized logging for all production services.
  • Built canary systems to help detect and measure faults or unexpected issues.
  • Create and manage a roadmap for the reliability team.

Amazon Elastic Load Balancing (ELB)

Manager, Operations

August 2014-October 2015

"Elastic Load Balancing automatically distributes incoming application traffic across multiple Amazon EC2 instances in the cloud. It enables you to achieve greater levels of fault tolerance in your applications, seamlessly providing the required amount of load balancing capacity needed to distribute application traffic."

In August 2014 I transferred from the Amazon S3 team to the Amazon ELB team as a Systems Engineering manager. I developed and grew a team of System and Support engineers to solve the problems of a massive scale service used in the majority of AWS architectures. I met directly with customers and wrote and delivered multiple externally facing post mortems for large scale events. I managed the capacity for the service at scale and created a roadmap and charter to define the team as it grew.

  • Managed a team of 4-12 engineers including managing performance.
  • Managed capacity for systems measured in hundreds of thousands across 400+ dimensions.
  • Piloted and created a customer experience team to engage with customers directly.
  • Worked virtually and on-site with multi-million dollar and Fortune 500 companies.
  • Hiring and recruiting for multiple positions including roles requiring security clearance.
  • Piloted an AWS-wide Systems Engineering community program.
  • Mentored and developed staff inside and outside of direct organization.
  • Developed a charter and roadmap for the team creating an identity for Systems Engineering.

Amazon S3

Manager, Operations

October 2013-August 2014

"Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, secure, fast, inexpensive infrastructure that Amazon uses to run its own global network of web sites. The service aims to maximize benefits of scale and to pass those benefits on to developers."

I was a Systems Engineering manager for the team running the S3 indexing and metadata service. The indexing and metadata service is responsible for handling hundreds of thousands of requests per second across a large distributed system. I built and grew a new team of Systems Engineers/Ops/SRE to handle fleet management, administration, scaling, and capacity of the S3 indexing services as well as develop and adopt new programs to improve reliability and performance.

  • Managed a team of 3-6 including managing performance.
  • Led and piloted a change management board approving 800+ production changes.
  • Managed a project of 10+ engineers to deploy a mission critical time sensitive update across every running production host and service with no outages.
  • Developing a team charter and goals.
  • Hiring and recruiting.
  • Fleet management.
  • Capacity management.

Amazon S3

Systems Engineer, Operations

May 2012-October 2013

"Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, secure, fast, inexpensive infrastructure that Amazon uses to run its own global network of web sites. The service aims to maximize benefits of scale and to pass those benefits on to developers."

As a member of the S3 Ops team, we keep S3 running from the front end to the back end.

  • Automation.
  • Working at large scale.
  • Solving deep technical problems.
  • Measurement and alarming.

King of the Web

Operations Engineer

July 2011-May 2012

My role at King of the Web is to ensure our architecture is online, secure, and scaled to our needs. Being a viral video site and showing multi-million real-time vote counts presents interesting challenges. We use Ruby on Rails and we don't mess around.

Major Goals:

  • Migration off of Engine Yard PaaS onto a private hybrid cloud. Migration completed 5 months after start date.
  • Use Chef as a configuration management tool to set up end to end infrastructure, ongoing documentation and management.
  • Auto scale production web site to meet viral video traffic spikes which typically increase site traffic by 500%+ in minutes.
  • Provide a stable environment that can be handled by a single staff member and limited on-call. High focus on visibility and KISS (keep it simple stupid).
  • Develop and see from start to finish projects that span multiple departments.
  • Be an enabler, not a barrier, for development and business ops. Build tools and write maintainable code to automate whatever possible and keep dev happy.
  • Release management and zero downtime deploys to production.
  • Integration with multiple APIs and development of tools that create solutions to business challenges.

Rockstar Games


April 2011-July 2011

As a remote employee in Seattle, Washington my focus is to provide expert advice and resource across the Rockstar Games development process. Using my diverse skillset and newly available technology I help to keep Rockstar on the bleeding edge of modern day development tools.

Major Goals:

  • Work with technology directors and executive staff to keep development needs at the forefront of IT focus.
  • Review existing workflows for data movement and provide recommendations along with end to end completion of agreed solutions.
  • Investigate business development ideas and provide advice on 3rd party software including business case scenarios and potential return on investments.
  • Provide documented architecture design complete with proof of concept or alpha implementations of desired feature sets or tools.

Take-Two Interactive Software Europe

Infrastructure Manager (Label Technology)


In our centrally located London office I was responsible for designing and implementing solutions with a goal to improve speed of the game development process through technology. Being the sole member of this department I utilized my in-depth experience within the company to identify and implement solutions to help developers.

Major Accomplishments:

  • Designed and implemented a cross studio global file transfer platform with Aspera technology across 20 different studios and many game titles.
  • Designed, developed and implemented a custom web portal to securely manage deployment of game builds and common development and publishing functions including SDK upgrades.
  • Developed back end systems for a global video sharing solution integrated into an existing large scale development toolset.
  • Worked with development staff including producers, programmers, artists, studio heads and more along with internal/external IT teams to define needed tools for the future.

Take-Two Interactive Software Europe

Infrastructure Manager (Corporate)


At the European headquarters for Take-Two Interactive (NASDAQ:TTWO) I was responsible for managing end to end infrastructure across London, Windsor, Germany, Spain, France, Netherlands, Italy, Geneva, Singapore, and Australia. I also worked closely with other sites in the European time zone and interfaced with global IT leaders to develop strategy and provide leadership for IT staff.

Major Accomplishments:

  • Over 12 months integrated all European and Pacific Rim sites into global Active Directory and Exchange Forest.
  • Implemented hardware, software and configuration standards across a disparate infrastructure.
  • Hired and developed new European IT team consisting of support staff and engineers.
  • Transitioned European headquarters from Geneva, Switzerland to Windsor, UK including staff and datacenter.
  • Consolidated core services such as email, BlackBerry and backups into European headquarters.
  • Implemented the largest internal company SharePoint system and provided maintenance and support.
  • Renegotiated and consolidated vendor contracts providing cost savings along with realigning budgets.
  • Developed and implemented a global level 2/3 service desk for "follow the sun" support.

Take-Two Interactive Software Inc.

Senior Systems Engineer, NYC


The focus of Senior Systems Engineer was to take a global approach to architecture and engineering across the company. I was responsible for level 2/3 support, architecture and implementation across all sites in the global Active Directory forest. In addition to that I played a big part in the continual improvement of our public facing datacenter.

Major Accomplishments:

  • Continued integration of international sites into global Active Directory and Exchange Forest from a wide range of independently run studios.
  • Implemented hardware, software and configuration standards across disparate infrastructure.
  • Developed a global centralized event logging system monitoring logs across servers, network gear, and more.
  • Provided level 2/3 support and training to local IT staff in various locations.
  • Worked with developers to update and improve web servers and databases for public facing websites.
  • Resolved difficult technical issues with sensitive time frames.

Rockstar Games

Systems Engineer, NYC


In my position as systems engineer I was responsible not only for the administration and maintenance of the server infrastructure but to provide leadership for global IT architecture. I was also providing level 1-3 support for local staff.

Major Accomplishments:

  • Discovery, design and launch of a global Active Directory and Exchange Forest.
  • Architected and set up a new centralized spam filtering solution which blocks over 5 million spam messages a month.
  • Designed and implemented a global DFS structure allowing all staff globally to map a single network drive.
  • Developed global technology standards used across the organization.

Hello, how are you?

Please don't spam me

But I'd love to hear from you.


Twitter: petey5k

LinkedIn: Pete Shima

GitHub: pshima

Pete Shima

Fair and abiding citizen

Seattle, WA