Container Management Platform (CMP) and PyPE Transition Project: Status Updates

Status Updates

In May 2023, the following activities were completed:

  • User Acceptance Testing (UAT) was completed
  • Lead test plan was written and executed
  • CMP-Batch accommodating Task Manager was put into production
  • Beta go-live was declared

In April 2023, the following activities were completed:

  • Design and code review for CMP Batch – Task Manager was completed
  • CMP Batch – Task Manager was implemented in the CMP production cluster
  • User acceptance testing of the Task Manager functionality was started
  • Issues found in testing were fixed to unblock further user acceptance testing
  • User documentation was published on the CMP wiki

In March 2023, the following activities were completed:

  • UT CMP Batch capability for Stonebranch-based projects achieved live production as of March 1, 2023, one week earlier than the original baselined plan!
  • UT CMP Batch for Task Manager-based projects is in development
  • Projects were identified for Batch Task Manager testing
  • ISO team started analysis and testing that can proceed in parallel with the development, test and deployment of CMP Batch Task Manager

In February 2023, the following activities were completed:

  • ISO pentesting was completed with a few low and moderate issues identified.  Moderate issues were addressed with design updates.
  • Two migration walkthroughs of actual AIS job groups were conducted
  • Analysis of high touch patterns for AIS workflows was completed
  • Based on the high touch pattern analysis and the learning from the migration walkthroughs, a rough order of magnitude estimate was made for AIS project/job-group migration. Based on the resulting high estimate, it was decided that the CMP-Batch team needed to design an alternative to accommodate Task Manager projects. This work was started.
  • CMP-Batch for Stonebranch-based PyPE Batch jobs is in production, ready for user migration, ahead of plan!

In January 2023, the following activities were completed:

  • CMP-Batch design implementation completed
  • AIS job group walkthrough conducted to characterize Task Manager functionality
  • Migration POC of a chosen AIS job group was undertaken
  • Inputs provided to ISO to enable start of security testing
  • ISO security testing started

In December 2022, the following activities were completed:

  • “Welcome to CMP” FYI session held on December 7
  • Batch design review with ISO was completed
  • AIS developers are engaged with CMP-Batch team to provide user experience inputs and testing

In November 2022, the following activities were completed:

  • Two of the three major design modules for Batch (Stonebranch-Kubernetes integration and YAML build design) were completed
  • The Stonebranch-Kubernetes design module is in ISO review
  • CMP on-boarding sessions were held; these will be ongoing
  • CMP Lessons Learned session was conducted

In October 2022, the following activities were completed:

  • Load testing of the system was performed and completed. 
  • Documentation needed to guide migrations was completed.
  • User Acceptance testing was run with 11 developers engaged.   Two key findings were identified with actions logged.
  • Action items stemming from user stories were resolved
  • On-boarding sessions were made available via UT Learning & Development sign-up for the first groups in the transition plan
  • Batch development was planned, defining tasks and milestones
  • The first of four major design deliverables for Batch was completed and the second put into progress
  • UT CMP Service went live!

In September 2022, the following activities were completed:

  • A revision to the go-live was proposed and accepted by project sponsors.  The re-plan was needed to accommodate work on security remediations, load/stress testing and user acceptance testing work
  • Significant efforts were focused on infrastructure changes needed to address security test findings
    • 19 of 20 findings were remediated
    • The one outstanding area, pod security, could not be addressed without breaking system functionality; written exceptions are therefore needed to allow time for a resolution that requires more development work. The ISO will track the exceptions
  • The team revisited the user stories written in the beginning of the project to review and demonstrate each.  At this time 65% of these are complete
  • User documentation (Wiki) was outlined with sections in progress.  At this time 70% of the user documentation for PyPE migrations and support is done
  • Batch design work was resumed with a review of the design with ISO

In August 2022, the following activities were completed:

  • All Security review and test results were reviewed with project leads, service owner, and ISO with agreement on the mitigation plans for 20 findings
  • Implementation, review and rework of security mitigations was completed for 15 of 20 findings; the remaining 5 are in implementation or rework
  • Early user engagement continues in a dedicated Teams channel and weekly office hours
  • Three different applications from early “friendly” users were migrated to UT CMP, with feedback and insights from the users captured in documentation updates; these are in addition to 5 others migrated within the project team.
  • Outreach to departments slated to migrate their apps in the first months after go-live highlighted the need for more hands-on, personalized training, which the project team will accommodate
  • Devised plan for load and stress testing as a joint collaboration of developers and service team, to be started after all changes from security findings are completed 
  • Team agreed on a freeze on any changes or updates to CMP prod prior to go live
  • Scheme for labeling application-specific log data was developed and is in implementation.   This will enable developers to access their logs in the CMP.
  • UT CMP SLA was approved by BAITLC
  • ServiceNow updates for the UT CMP service are programmed

In July 2022, the following activities were completed:

  • Execution of all the security reviews of the four major CMP subsystems, including Rancher/Kubernetes, Harbor, GitHub CI/CD and Backstage
  • Penetration testing and reports completed for Rancher/Kubernetes, Harbor and GitHub CI/CD
  • Backstage on-boarding procedure documented and tested
  • User documentation improvements implemented
  • New Batch design proposed and initial security review conducted
  • Work underway to resolve security findings for Rancher, Harbor and GitHub CI/CD
  • Backstage augmented with links to Splunk dashboards, logs, deployments, requests, uptime and basic performance stats (e.g., storage, I/O, CPU, memory)
  • CMP alpha user on-boarding and support provided, and kicked off a series of weekly user office hours to assist the early users
  • First alpha tester completed CMP on-boarding and migration procedure
  • Project team technical experts committed to supporting CMP service team post-go live
  • Initial meet up with Service Desk leads to share overview of CMP Service and share short term support post-go live
  • CMP Service Go-Live criteria developed and reviewed with key stakeholders
  • Work progressing to resolve security findings for Rancher, Harbor and GitHub Action Runners
  • GitHub Action Runners FYI conducted, showcasing a GitHub action runners tutorial, along with a diversity of application implementations using runners in CMP, IAM, and developer test

In June 2022, the following activities were completed:

  • Completed development of Continuous Delivery (CD) API
  • Implemented logging for developer CI/CD troubleshooting
  • Confirmed Rancher/Kubernetes and GitHub logs, including build and deployment actions, are being sent to Splunk
  • Kicked off modular approach to UT CMP security reviews of four major CMP subsystems (Rancher/Kubernetes, Harbor, GitHub CI/CD, Backstage)
    • Rancher/Kubernetes and Harbor (image repository) security reviews, scanning, and penetration testing have been completed
    • ISO reports have been completed and delivered to the team
    • No major rework needed, mitigations in progress
  • Developed initial drafts of onboarding user documentation, including:
    • Getting started landing page with pre-requisite actions and set up instructions
    • Migration procedures documented
  • Alpha version of CMP is considered complete and ready for external users to test
  • Ten developers, external to the project team, signed on to test the CMP alpha platform and documentation with their applications; kick off meetings held
  • Presented Dependabot FYI, giving the development community insights into this capability that can be leveraged beyond CMP (86 attendees!)
  • SLA review completed by the Critical Production Infrastructure Committee (CPIC)

 

In May 2022, the following activities were completed:

  • All dev experience user stories were reviewed and tasked in the project plan or documentation backlog. Many of these tasks are complete or in progress as of May
  • Monitoring and logging work tasks were elaborated, assigned and started
  • A Kubernetes plug-in was successfully integrated in Backstage to provide user application status 
  • API for secure deployment was implemented, reviewed and completed
  • Six application migrations completed as part of the migration test effort
  • User documentation progressed, with the landing page created and on-boarding info completed
  • Friendly users identified to test migration and documentation
  • ITS jobs were posted and candidates interviewed for CMP service team positions
  • The project agreed to ISO proposal for modular approach to security reviews to facilitate more tractable reviews of all major CMP elements, including: 
  1. Rancher Kubernetes platform
  2. Harbor image repo
  3. GitHub actions and CI/CD API
  4. Backstage

 

In April 2022, the following activities were completed:

  • Platform Infrastructure including cluster integration w/ GitHub action runners, Splunk, storage, EntAuthN, Harbor, CI and basic CD was readied for migration testing
  • Enterprise Authentication on Rancher Cluster implemented 
  • Baseline image strategy finalized and documented
  • Revised baseline plan in place
  • Container storage plugin integrated in prod cluster
  • Migration test team formed, test strategy agreed upon, apps to migrate identified
  • Review with sponsors to align Dev-Ops-Sec on security requirements review for new CD approach
  • Focus put on defining and developing Dev Experience features to meet development community needs and satisfaction.  Sources of the features include PyPE functionality and user stories defined early in the project
  • New, streamlined SLA draft reviewed by BAITLC Critical Production Infrastructure Committee on April 22
  • Job descriptions for ITS service team openings were written and submitted to HR, and HR was given a priority request by the ITS AVP for the top three needs
  • Knowledge sharing session with University of Michigan on service structure and strategy

 

In March 2022, the following activities were completed:

  • Production and Dev Rancher Cluster environments brought up
  • GitHub action runners with AD were made operational on enterprise managed user GitHub
  • Backstage 1.0 functionality completed for MVP, serving as a developer portal to view one’s application and provide links to resources such as Splunk logs and the GitHub repo, accessible in “a single pane of glass”
  • Project team, led by ISO, reviewed various tools for incorporating security scanning in the deployment stream: Anchore, Tenable.cs, Harness, Argo
  • Proof of concept and demo of Argo by ISO  
  • BAITLC Critical Production Infrastructure Committee (CPIC) review of first SLA draft to provide valuable inputs to the SLA creation
  • Funding approval granted by Budget Council, awaiting finalization
  • Job descriptions for ITS service team openings are written and submitted to HR
  • Adoption of new naming convention of Container Management Platform (CMP) 
  • Docker for Local Development course completed and announced to PyPE developers

 

In February 2022, the following activities were completed:

  • Capability to route PyPE application URLs from PyPE to NGP made available in PyPE
  • Migration walkthrough demonstrated for project team
  • GitHub Actions workflow and environment created and verified
  • Identification of first application organizations for Spring 2022 migrations
  • Identified criteria for MVP
  • BAITLC Critical Production Infrastructure Committee kick off to start work on SLA
  • User Documentation outline drafted
  • “Getting Started with Docker” training course created to help the development community acclimate to design with containers
  • Local Development on Docker workshop conducted with over 80 in attendance
  • Discovery of potential developer base image proliferation raising security vulnerability concern
  • Sponsor review/approval of image security mitigation strategy 
  • Meeting with Anchore vendor on security scanning and auditing for images

 

In January 2022, the following activities were completed:

  • Thirteen critical technical design decisions were reviewed and agreed upon
  • Batch design reviewed and agreed upon
  • Decision to use Backstage as the “one stop shop” developer portal, production instance built
  • Azure AD integration to github.com proven for authentication and team synchronization, to be used in the implementation of tenancy groups, Backstage, and the image repo
  • Build & deploy capabilities implemented via GitHub Actions
  • Utility to route PyPE URLs to NGP coded, reviewed, tested and delivered
  • RHEL 8 base image built and verified
  • Two PyPE apps, owned by team members, successfully migrated to NGP dev
  • Docker for local development tested and documented
  • Additional staff joined the project team, paving the way for ITS taking on the NGP central services role
  • Transition and operational service budget proposal completed and reviewed with CFO

 

In December 2021, the following activities were completed:

  • VMware integration with Kubernetes found to be feasible, de-risking the storage concern
  • Critical technical solution decisions identified, 7 of 11 resolved  
  • Back-up implemented in Rancher  
  • Build & Deploy capabilities implemented via GitHub Actions
  • Project website created and live on BAITLC website  
  • NGP Service Proposal delivered to Sponsors, BAITLC Platform subcommittee, and BAITLC – all endorsed 
  • December 15 FYI session conducted, attendance of 161  

 

In November 2021, the following activities were completed:

  • Charter version 1.0 approved  
  • First communique to IT community  
  • Stakeholder Engagement Plan approved and in implementation 
  • Storage challenges identified with mitigation planning underway  
  • Rancher integration with Ansible  
  • Logging functionality implemented in Rancher  
  • PyPE user survey launched to gather info on current applications by organization

 

In October 2021, the following activities were completed:

  • PyPE functionality analyzed for scoping in NGP  
  • Technical working team identified to stand up Rancher, daily stand-up in progress  
  • Second technical working team formed to focus on development experience 
  • Rancher initial instance built 
  • GitHub Actions POC built

 

In September 2021, the following activities were completed:

  • Project Kick-Off 
  • BAITLC decision to deliver NGP in March 2022 to enable PyPE retirement in March 2024