Matt Pursley's CV



Matt Pursley, RHCE PSM
Technical Team and Project Leader
Site Reliability Engineer, DevOps, CloudOps






Skills

  Cloud Infrastructure & Observability Leadership: Lead the design, planning, and deployment of product solutions using major cloud providers (AWS, GCP, Azure) and microservices running on Kubernetes, acting as a Team Lead for infrastructure initiatives. Utilize industry-standard tools like Terraform, Helm, Vault, Grafana, Prometheus, ELK, and other CNCF and open-source technologies. Focus on implementing robust observability for monitoring, troubleshooting, and cost optimization.
  Resource Planning, Cost Optimization, & Team Oversight: Develop and implement resource planning strategies to ensure efficient utilization of vendor and cloud resources, minimizing waste and maximizing cost savings. Manage budgets and resource allocation across multiple projects. Analyze vendor and cloud billing data to identify opportunities for optimization. Lead migration and consolidation efforts to improve infrastructure efficiency and reduce operational costs.
  Cross-Functional Team Leadership: Lead and facilitate cross-functional collaboration efforts with various teams (Engineering, QA, Product) to ensure that deployed applications consistently meet SLOs and SLAs while adhering to budget constraints. Serve as a primary point of contact and decision-maker for infrastructure-related issues.
  Backend Development & Collaboration: Directly manage and contribute to the development and updates of segmented parts of backend infrastructure and applications. Collaborate with SME Engineers, QA, and Project Managers to ensure timely delivery.
  Data-Driven Decision Making & Reporting: Take ownership of establishing and tracking SLI, SLA, and SLO metrics to ensure products deliver value to internal and external customers. Use data analysis to inform resource planning and cost optimization strategies, and provide regular progress reports to stakeholders, including senior management.
  Incident Management & Response Leadership: Own, develop, and maintain the Incident Management process, operating as First Responder, Tech Engineer, and Incident Manager, to ensure rapid issue resolution and minimize service disruption. Delegate tasks and coordinate response efforts during critical incidents.
  Post-mortem Analysis & Process Improvement: Drive blameless post-mortem processes and implement the resulting action items, so that unexpected issues and incidents are not repeated and the findings inform improvements to infrastructure and processes. Foster a culture of continuous learning and improvement within the team.
  Project Management & Executive Reporting: Manage project deliverables, timelines, and budgets. Generate and present project deliverable reports and timelines to Senior ICs, Directors and C-Level Teams, including budget and resource utilization reports. Effectively communicate project status, risks, and dependencies.
  Mentorship & Team Development: Lead and mentor a team of engineers, helping them to be more productive and achieve company and personal goals. Focus on developing individual skills and fostering a collaborative team environment.
  Talent Acquisition & Onboarding Leadership: Lead the effort to recruit, interview, validate, hire, and ramp up top-tier staff members so they can join the team and start delivering fixes and features quickly. Develop and improve the onboarding process for new team members.

  Systems Platforms
  • Amazon EKS, Google GKE
  • Ubuntu, CentOS, Fedora
  • macOS
  • Windows, WSL

  Scripting & Coding
  • Shell Script
  • Python
  • Golang
  • JavaScript, TypeScript

  Monitoring & Alerting
  • Grafana, Kibana, Elasticsearch
  • Prometheus, Alerts, Exporters
  • CI/CD (GitHub, Kubernetes)
  • Atlassian (Jira, Confluence, etc.)





Work Experience

                                              
2021 - Present Open World Tech, https://openworldtech.io
  B2B technology consulting focused on automation, scalability, and serverless solutions, using major cloud providers and GitOps-based CI/CD pipelines with open-source technologies and tools.
  Co-Founder
   
  Cloud and Serverless Tech Development and Consulting
  • Developing and deploying GitOps-based, automated and scalable solutions for web, mobile, API, and dedicated game servers, focusing on Kubernetes and other open-source software, tool sets and integrations.
  • Incorporating "cloud streaming/pixel streaming" options for games, using remote/cloud graphics cards.
  • Integrating Nvidia Omniverse digital-twin objects, actors and scenes with interactive games and environments, using Unreal Engine 5.
  • Incorporating Web3 protocols for in-game transactions (NFTs).
  • Incorporating LLM APIs to create more life-like and human game character interactions.
   
Mar 2023 - Jan 2025 Zepz Inc, https://zepzpay.com
  British-based global digital cross-border payments platform that enables international money transfers.
  Sr. Site Reliability Engineer
   
  Sr. Site Reliability Engineering
  • Plan, develop, and deploy improvements for observability, synthetic monitoring, and incident management.
  • Migrate and aggregate all available observability data (metrics, logs, dashboards, alerts, on-call rotations and schedules) into Grafana Cloud and Grafana OnCall from various internal and third-party vendors (e.g. PagerDuty, Datadog, New Relic, Heroku, pganalyze, Cloudflare, AWS CloudWatch, GCP Monitoring, Azure Monitor, etc.).
  • Migrate remote synthetic checks and alerts into Grafana from various vendors (e.g. Site24x7.com, Pingdom, Amazon CloudWatch Synthetics, etc.).
  • Work closely with DBAs and DevOps/Infra Teams to migrate and improve monitoring and alerting for all company databases and related cloud infrastructure.
  • Leverage company-wide knowledge of services and infrastructure to provide fast and effective support to Dev, Operations, and Customer Support teams, covering best practices and requests for all observability, monitoring, and alerting.
  • Participate in on-call rotations, managing incidents ranging from company-wide outages to software/infrastructure deployment failures, bugs, customer-service-related problems, and other unexpected issues. Respond quickly and efficiently to maximize uptime ("nines") across all business services.
  • Participate in cross-functional teams to complete large-scale and complex projects, migrations, deployments, and other improvements and deliverables.
   
2021 - 2023 Improbable Worlds, https://www.improbable.io
  British multinational company focusing on technology to support large-scale games, metaverse, and virtual worlds/events.
  Sr. LiveOps, DevOps and Site Reliability Engineer
   
  Infrastructure and Application Deployment and Management
  • Work closely with the LiveOps, DevOps and Dev teams to build, test and deploy scalable application and infrastructure tech stacks based on Terraform, Dockerized applications, Helm charts, Kubernetes clusters, etc.
  • Plan, develop, test and deploy large-scale system upgrades, with additional HA redundancy and monitoring to reduce the risk of customer or user impact.
  • Develop and upgrade dashboards and alerts using SLOs, SLAs, KPI metrics and logs.
   
2019 - 2021 Sage Intacct, https://www.sageintacct.com
  British-based software company focusing on financial services and management
  Sr. SRE, Site Reliability Engineer
   
  Infrastructure and Application Monitoring and Alerting
  • Work closely with the SRE, Cloud Ops and DBA teams to build, test and deploy scalable application and infrastructure tech stacks based on custom Python, Bash and config management tools.
  • Complete a deep-dive review of the existing metrics collection, storage and visualization infrastructure (ELK Stack, Ansible/Chef, Nagios/Zabbix, PagerDuty, etc.).
  • Complete a full evaluation and scoring of several modern industry-standard alternatives against existing requirements and desires for a complete revamp/replacement.
  • Collect feedback from various stakeholder teams and individuals about scoring values for viable alternatives.
  • Architect a full project plan to deploy and migrate to a newly developed metrics collection and storage solution, while carefully scaling back and retiring the legacy system.
   
2012 - 2019 Sony Interactive Entertainment, PlayStation, https://www.playstation.com
  PlayStation Now, a global video game streaming platform
  SRE, Site Reliability Engineer
   
  Infrastructure and Application Monitoring and Alerting
  • Worked directly with Onsite DC and "Remote Hands" Engineers to deploy thousands of new servers and network hardware to dozens of datacenters and POPs in countries around the world
  • Define and update KPIs, SLOs, SLIs, SLAs, metrics and alerting
  • Design and develop solutions to collect, search and visualize logs and events, and to fire alerts and notifications to the appropriate teams based on application errors, logs, and KPI and SLA breaches, using internal and open-source tools and technologies such as Elasticsearch, Kibana, Prometheus, Grafana, Ansible, Ceph, Opsgenie, Kubernetes, GitLab CI/CD, Fluentd, Rsyslog, SNMP, etc.
   
  Automation and Hands-on Operations
  • Configure and maintain Amazon Web Services (AWS) and Google Cloud Platform (GCP) cloud computing environments
  • Perform operational tasks to mitigate major (business or customer impacting) incidents, or unblock Team members, where automation is not yet in place.
  • Develop operational tooling, for "one off" updates and playbook automation
  • Improve automation for systems inventory updates and configuration management
  • Optimize and improve SDLC and CI/CD pipelines, processes and infrastructure
   
  Solutions Architecture
  • Perform requirements gathering and resource planning for new projects
  • Research and evaluate industry standard solutions
  • Evaluate and compare onsite, private, public cloud service options and offerings, including feasibility, compatibility, security, compliance and TCO evaluations
  • Maintain up-to-date understanding of all mission critical infrastructure, service architecture and updates
  • Document, communicate and advocate for SRE best practices throughout the company
   
  Technical Lead and Project Management
  • Manage project timelines, deliverables and resource planning
  • Lead architecture and design sessions for cross team projects
  • Provide cross-team architectural consulting, production readiness review and validation
   
  SRE Team Building
  • Proactively help to build and scale out an effective global SRE Team
  • Review, interview and screen potential SRE candidates
  • Train and mentor team members
  • Develop, maintain and update candidate screening and interview procedures and processes
  • Update and maintain "New SRE" startup and training materials
   
  Incident Management Process and Reporting
  • Participate in on-call rotation
  • Develop and Maintain Incident Management and Review Processes
  • Develop and communicate RCA and issue mitigation plans
  • Refine and improve KPIs, SLOs, SLIs, SLAs, metrics and alerting, based on incidents and discovered observability gaps
  • Perform and report RCA and Postmortem findings
  • Troubleshoot and fix (break/fix) discovered issues, or escalate them to the relevant teams or engineers
   
2010 - 2012 Digital Domain, https://digitaldomain.com - Vancouver, BC and Port St Lucie, FL
  Sr. Systems Admin and Sr. Systems Engineer
  On-screen Credits: The Legend of Tembo, Jack the Giant Killer, Transformers 3, Tron Legacy, Thor
  https://www.imdb.com/name/nm1250137/
   
  Systems and Infrastructure:
  • Worked to duplicate, set up and integrate Linux environments for a new 200-seat and then a new 500-seat VFX studio, which included 200+ HP workstations, 1000+ HP high-density blade servers, 100+ TB of Isilon or NetApp enterprise-class storage, and a high-performance Brocade switching environment.
  • Set up, configured, and maintained OS and software installation and configuration management systems (Red Hat Kickstart, oneSIS, Puppet, CFEngine, etc.).
  • Worked with sister companies in the US and Canada to integrate VFX pipeline and software synchronization, including CentOS Linux operating system updates and changes, site-specific software package installations and deployments, etc.
  • Worked with Linux Kickstart, oneSIS and Puppet to set up fully automated bare-metal installs of CentOS Linux operating systems, custom packages, connections to shared storage, the custom CG pipeline and toolset, etc.
  • Worked to develop scripts and procedures to bind CentOS Linux and Mac OS X workstations and servers to Windows Server 2008 via LDAP with Kerberos encryption.
  • Acted as Lead Support for all Render Queueing and Job Management, including automation and scripting.
  • Handled large scale file system sorting, cleanup, transfers, and digital delivery packaging.
  • Configured Symantec NetBackup to run daily, weekly and monthly backups, as well as final show archiving, removals and restorations.
  • Worked with VMware ESXi Server to deploy, maintain and balance several key server VMs.
  • Set up and maintained monitoring and alerting systems for all storage, networking, servers and workstations for the studios.
  • Acted as Level 2 and 3 technical support for all Linux and Unix based issues with all Workstations and Servers.
  • Provided detailed documentation and training for Level 1 and Level 2 Technical Support to handle commonly occurring issues.
   
2007 - 2009 Keystone Pictures
  Visual Effects, Lead Technical Director, Technical Supervisor
  On-screen Credits: The "Buddies" Series (Space Buddies, Santa Buddies, Adventure Buddies, etc.)
  https://www.imdb.com/name/nm1250137
   
  Systems and Infrastructure:
  • Worked with several hardware and software vendors to install and configure 100 SGI Linux 1U render nodes, 25 Mac Pro workstations, and a 40TB SGI RAID storage server, connected through a new HP ProCurve Gigabit network.
  • Developed a cloneable dual-boot Mac OS X and Fedora Core Linux system install for the studio's 25 Mac Pro workstations.
  • Developed a cloneable Fedora Core Linux based system install for the studio's 100 render nodes, using render management through PipelineFX's Qube.
  • Managed and supported the studio's render farm of 100 Linux 1U render nodes and 25 Mac Pro workstations.
   
  Render and Color Pipeline:
  • Worked with the CG Supervisor to help develop an AOV-based render workflow for workstations and the render farm using Mental Ray 3.6.
  • Developed a LUT to translate from the 10-bit log Panavision Genesis camera format to linear and back, within Shake 4.1.
   
  Character Lighting and Fluid FX:
  • "Finaled" the Lighting and Rendering of 65 animated face replacement shots.
  • "Finaled" all in-house Fluid FX using Maya 2009 and Houdini Master 10, including dust, smoke, clouds, rocket thrusters, etc.
   
2000 - 2007 American Museum of Natural History (https://www.amnh.org/)
  Rose Center Engineering (RCE), Rose Center Productions (RCP) and Science Bulletins (SciBul) Departments
  Technical Director/Unix Systems Administrator
   
  Systems and Infrastructure:
  • Began working with Engineering and Productions, a group of about 15 visual FX artists, system administrators, video engineers, and production staff responsible for developing, maintaining and upgrading all computers, video systems and video content for the Digital Dome and Space Shows. This includes two SGI Onyx2 Reality Monster supercomputers, several SGI Octane, O2, Linux and Windows graphics workstations, and 7xHDTV and 4xHDTV projector theaters.
  • Worked with Systems Admins and Video Engineers to Design, Create and Test a 7 Node Linux Graphics Cluster for Interactive 3D and Digital Dailies playback in the Hayden Planetarium in preparation for the upcoming show. This system was based on non-proprietary, commodity-based hardware (Dual AMD64, Nvidia Quadro FX 4400, etc) and software (Linux, PiranhaHD).
   
  Full Dome Visual FX and Animation:
  • Worked with the Art Director to design, model and animate "Feather Dream", which comprises 2 minutes of the 36-minute planetarium music show entitled "Sonic Vision". Additionally, worked to create several background elements and transitions between other shots within the show. "Feather Dream" was created using Maya 6 and Shake 3.5.
  • Created two 2.5-minute quarterly news animation sequences for the Science Bulletins Department at AMNH using Partiview, Uniview, Maya 7, Shake 3.5 and PiranhaHD, which were recorded to HDCAM and then encoded to HDTV MPEG-2 for playback to visitors within AMNH and for distribution to a network of museums and educational institutions around the world via the Internet.
   
1999 - 2000 New York Institute of Technology (https://www.nyit.edu)
  Advanced Computer Graphics Department
  SGI/Unix Systems Admin
   
  Systems and Infrastructure:
  • Maintained and supported graphics software and hardware for the Computer Graphics Labs in Manhattan, NY, including Silicon Graphics (Unix) workstations, Avid Video Editor, Softimage3D, Alias Wavefront, etc.





Personal Projects and Research

                                              
Oct 2022 - Present OpenWorldGame.io, https://github.com/OpenWorldGame-Io
  Co-Founder, Lead Game Dev/Contributor
  "Open World" is an open "sandbox" project that leverages Epic Games' free and open-source Unreal Engine 5.x and Lyra starter project/game, along with a customized code and content delivery backend, to provide a free-to-play open space environment that players/users can use to chat, communicate, and show and share ideas. The project is also working to incorporate infrastructure, metrics and other types of data visualization (e.g. https://github.com/mpursley/UnrealEngine-Example_BluePrints_Using_Rest_APIs), and includes modern updates and customizations to scale the active/concurrent user/player count beyond what is available from a single backend dedicated game server. More information on this project will be released as Prototype, Alpha and Beta releases are developed and deployed.






Education

                                              
2018 • PSM (Professional Scrum Master), Scrum.org
2005 • RHCE (Red Hat Certified Engineer), Red Hat, Inc.
1996 - 1998 • Digital Arts/3D Animation, The Art Institute of Vancouver