AI-Enabled Transformation

Redesigning how a graduate business course on Operations Management operates in the age of AI

Overview

Johns Hopkins Carey Business School needed to solve a problem that most institutions were still pretending didn't exist: AI had already entered the graduate classroom, and the existing infrastructure — rubrics, assessments, feedback models — wasn't built for it.

I was brought in to design and deploy the systems, governance, and human infrastructure to make AI integration work at the program level. Not as an experiment. As a production deployment.

This was a first-of-kind engagement at Carey. I was the first person in this function, laying groundwork explicitly designed to scale — with the expectation that additional AI engineers, teaching staff, and technical contributors would build on the foundation in subsequent terms.

The engagement ran across two consecutive terms, serving 120+ MBA students per cohort.

Impact By the Numbers

The Problem

A top-ranked graduate program had an AI problem — and no framework to solve it.

Faculty knew students were using AI. They had no visibility into which models, at what depth, or with what level of critical engagement. Banning it was both unenforceable and strategically counterproductive in a world where AI fluency is increasingly a professional baseline for MBA graduates.

The assessment infrastructure made it worse. Rubrics designed for a pre-AI classroom measured compliance, not thinking. Multiple-choice quizzes rewarded recall. Open-ended case analyses had no consistent evaluation standard — and no scalable way to deliver quality feedback to 120+ students at once.

The core design challenge: how do you rebuild graduate assessment so that AI amplifies student reasoning rather than replacing it — and how do you make that rigorous, auditable, and scalable from day one?

That was the engagement.

The Opportunity

Most institutions were playing defense — banning AI, ignoring it, or issuing policies that lagged the reality on the ground by at least a semester. Carey had a different posture: if AI fluency was becoming a professional baseline for MBA graduates, the question wasn't whether to integrate it, but how to do it rigorously enough to be worth doing at all.

The timing created a genuine opening. HopGPT — Johns Hopkins' secure enterprise AI gateway — had just launched, giving the institution its first production-grade LLM environment for internal use. The platform was new, the API documentation was still maturing, and real-world deployment use cases were largely untested. The course needed an AI infrastructure. The platform needed a serious user. The conditions were right to build something that could outlast the term it was built for.

The Vision

The goal was never to automate grading for its own sake. The design intent was an AI autograder that students would use themselves — receiving rich, qualitative feedback that guided their reasoning, not just scored their output.

The long-term architecture was explicitly agentic: lay the groundwork in year one, then deploy autonomous grading on behalf of the teaching team in subsequent terms as the platform matured.

Equally important was the governance layer. From the start, the engagement was designed to be self-sustaining — not dependent on the person who built it. That meant building infrastructure that could hold institutional memory and improve over time, not just deliver results for a single cohort.

The Approach

Workstream 1 — AI Evaluation Architecture

The flagship deliverable was an end-to-end AI grading infrastructure for the Operations Management core course — one of the most analytically demanding courses in the MBA program.

I designed and deployed three distinct AI evaluation systems, one per case study. Each was built around a BLUF (Bottom Line Up Front) / Analysis / Conclusion rubric structure that assessed reasoning quality, argument construction, and the validity of operational recommendations, not just whether a student arrived at the right number.

The 3 Cases

queuing theory, capacity analysis, bottleneck identification, staffing optimizations
value stream mapping, Lean Six Sigma waste identification, process improvement strategy
AnyLogic digital twin simulation modeling of Operating Room (OR) patient flows, utilization vs. patient wait time trade-offs, simulation interpretation

LLMs Deployed in HopGPT Platform

All systems were built and tested on HopGPT — Johns Hopkins' enterprise AI gateway providing institutional access to nine frontier models across three providers: OpenAI (via Azure), Anthropic (via AWS), and Meta (via AWS).

I benchmarked performance across all available models, iterated on prompt architecture against Lean Six Sigma and structured reasoning methodologies, and worked directly with the JHU IT department to surface API gaps and documentation improvements. This was one of the earliest production deployments on the platform — I was simultaneously builder, tester, and informal product collaborator.
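The benchmarking method itself is not described here, so the following is only a minimal sketch of one way to compare models for scoring consistency: grade the same submission several times per model and compare the mean and spread of the scores. The `benchmark_graders` and `make_stub_grader` helpers, the model names, and the canned scores are all hypothetical; the stubs stand in for real model calls and do not reflect the HopGPT API.

```python
from itertools import cycle
from statistics import mean, pstdev
from typing import Callable, Dict

# A grader is any callable that maps submission text to a 0-100 score.
# In practice it would wrap a model call; here it is a hypothetical stub.
Grader = Callable[[str], float]


def benchmark_graders(
    graders: Dict[str, Grader], submission: str, runs: int = 3
) -> Dict[str, Dict[str, float]]:
    """Score the same submission `runs` times per grader and summarize."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    summary = {}
    for name, grade in graders.items():
        scores = [grade(submission) for _ in range(runs)]
        summary[name] = {
            "mean": mean(scores),
            # Population standard deviation: 0 means identical scores each run.
            "spread": pstdev(scores),
        }
    return summary


def make_stub_grader(canned_scores) -> Grader:
    """Build a fake grader that replays canned scores (stand-in for a model)."""
    replay = cycle(canned_scores)
    return lambda _submission: next(replay)


graders = {
    "model_a": make_stub_grader([77, 77, 77]),  # hypothetical: stable scorer
    "model_b": make_stub_grader([70, 80, 75]),  # hypothetical: noisier scorer
}
results = benchmark_graders(graders, "sample submission text")
most_consistent = min(results, key=lambda name: results[name]["spread"])
```

Ranking by spread favors the grader that gives the same student the same score on repeated runs, which is one reasonable proxy for feedback consistency.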

The result: feedback turnaround reduced by two days per cohort, with measurable improvement in feedback consistency and analytical depth across student submissions.

Hip Op Case Study: AI-Powered Autograder

Case Overview: Students analyze hip and knee replacement surgery workflows at Hipkins Hospital using AnyLogic simulation to assess whether optimization tools can serve as a digital twin for operational decision-making.

The analysis focuses on balancing three competing objectives: surgical throughput (7+ procedures per track daily), surgeon wellbeing (burnout prevention), and patient experience (reduced wait times).

Assessment Design: Students submit video presentations (max 10 minutes, ~10 slides) supported by comprehensive notes. This format requires students to articulate complex operational analysis verbally, demonstrating critical thinking alongside written rigor.

Autograder Structure: The tool provides formative feedback using a three-part rubric before final submission, enabling students to self-correct:

BLUF (15%) - Executive recommendation with quantified impact (surgery count, flow time, financial outcomes), crisp and data-driven
Analysis (75%) - Data-driven rigor on five core tasks: base case documentation, optimized configuration performance, comparative analysis, KPI discussion (surgery count/profit/wellbeing), and digital twin viability assessment
Conclusion (10%) - Clear restatement of recommendation with evidence summary
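As a rough illustration of how the three rubric weights combine into a 100-point total, here is a minimal sketch. The section maxima (15, 75, 10) come from the rubric above. The `SectionScore` type and `total_score` function are hypothetical names for illustration, not the autograder's actual implementation.

```python
from dataclasses import dataclass
from typing import Iterable

# Section maxima taken from the rubric: BLUF 15, Analysis 75, Conclusion 10.
RUBRIC_MAX_POINTS = {"BLUF": 15, "Analysis": 75, "Conclusion": 10}


@dataclass
class SectionScore:
    section: str
    points: float


def total_score(scores: Iterable[SectionScore], format_deductions: float = 0.0) -> float:
    """Validate each section score against its maximum, then sum and deduct."""
    seen = set()
    total = 0.0
    for score in scores:
        if score.section not in RUBRIC_MAX_POINTS:
            raise ValueError(f"unknown rubric section: {score.section}")
        if score.section in seen:
            raise ValueError(f"duplicate rubric section: {score.section}")
        max_points = RUBRIC_MAX_POINTS[score.section]
        if not 0 <= score.points <= max_points:
            raise ValueError(f"{score.section} must be between 0 and {max_points}")
        seen.add(score.section)
        total += score.points
    missing = set(RUBRIC_MAX_POINTS) - seen
    if missing:
        raise ValueError(f"missing rubric sections: {sorted(missing)}")
    # Format deductions are subtracted last; the total never drops below zero.
    return max(total - format_deductions, 0.0)


# Section scores from the sample output in this case study.
sample_total = total_score(
    [
        SectionScore("BLUF", 12),
        SectionScore("Analysis", 56),
        SectionScore("Conclusion", 9),
    ]
)
```

With the sample's section scores (12, 56, 9) and no format deductions, this yields 77, matching the sample's reported total.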

Hip Op Case Study - Autograder Output Sample

Sample output demonstrates the feedback students receive: specific scores, concrete strengths, actionable improvements, and priority focus areas for refinement.

Student: Jennifer C.
Submission: HipOp_CohenJennifer.pdf
Evaluation Date: [Date]

BLUF: 12/15 points (Good)

Strengths:

Clear, direct recommendation with specific action items (2 CRNAs, pooling, BMI scheduling) and quantified outcomes (surgeries 9→12, flow time 185→145 min)
Includes quantified financial impact ($1.32M annual revenue) with no extraneous explanation or reasoning

To Reach Next Level:

Simplify further by leading with the single most impactful outcome first, then supporting metrics
Consider adding specific profit margin or ROI percentage to strengthen business case impact statement

Justification: BLUF meets "Good" rubric criteria: concise, contains impact statement with numerical outcomes, no reasons/explanations. Does not reach "Exceptional" level because recommendation could be slightly more crisp and leading metric selection could prioritize business impact more strategically.

CONCLUSION: 9/10 points (Good)

Strengths:

Consistent with BLUF recommendation - restates three-phase plan and resource additions exactly as introduced in BLUF
Appropriately summarizes key supporting evidence - references bottleneck analysis, utilization improvements, and implementation timeline without introducing new analysis

To Reach Next Level:

Include brief statement about why simulation tool would/would not serve as reliable digital twin given the analysis presented

Justification: Conclusion aligns with BLUF and synthesizes analysis effectively per rubric standards. Minor gap: does not directly address the core question about digital twin viability, which would elevate to "Exceptional" level.

ANALYSIS: 56/75 points (Needs Work)

Strengths:

Rigorous statistical methodology - correctly documents base case with 5 simulation runs and confidence intervals (97.8% ± 1.8% anesthesiologist utilization), demonstrating proper understanding of high-variance system requirements
Systematic bottleneck identification - accurately pinpoints anesthesiologists as primary constraint with supporting utilization data and explains why (required during sedation AND surgery), showing strong grasp of constraint theory

To Reach Next Level:

Digital twin assessment is incomplete: Analysis evaluates simulation results but does not adequately address whether the simulation tool itself can serve as ongoing decision-support for Paul's operational decisions (core question). Add explicit discussion of tool limitations, validation requirements, and implementation feasibility
Missing profit/cost analysis: While you calculate additional surgery revenue ($14.08M), you need complete cost-benefit analysis including staff salary costs, patient waiting time penalties ($20K/min), and net profit to fully address Paul's financial optimization objective

Justification: Analysis documents base case performance and applies four lenses framework appropriately, meeting component requirements. However, two major gaps prevent "Good" level score: (1) Digital twin assessment represents 25% of core assignment tasks but receives minimal attention, (2) Financial analysis incomplete without cost-benefit synthesis. Adding these elements would move this to 68+ points.

FORMAT COMPLIANCE: 0 deductions

✓ Slide count: 6 slides (within 10-slide maximum)
✓ Notes sections populated with detailed analytical content
✓ Professional presentation formatting
✓ No time penalty (presentation format, not video)

TOTAL SCORE: 77/100 points (Good)

Score Breakdown:

BLUF: 12/15
Analysis: 56/75
Conclusion: 9/10
Format: 0 deductions

PRIORITY FOCUS: Complete Digital Twin Assessment

Most critical improvement for next assignment:

Your simulation methodology and bottleneck analysis are strong, but the core assignment asks whether simulation tools can serve as Paul's digital twin for operational decision-making. You need to explicitly address:

Tool strengths for ongoing use - What aspects of the optimization tool make it suitable for repeated operational decisions?
Limitations and risks - Heuristic optimization caveats, model validation requirements, real-world variance expectations
Implementation pathway - How would Paul actually use this tool month-to-month? What governance/monitoring is needed?

Adding 1-2 slides with this analysis would likely move your score to 88-92 range by completing the core assignment requirements.

Secondary improvement: Expand financial section to include all costs (staff salaries, patient waiting penalties) and net profit calculation to fully justify the business case for Paul's triple-bottom-line objectives.

Workstream 2 — Governance: The Co-Creation Framework

The evaluation infrastructure solved the grading problem. The governance question was more complex: how do you ensure students are genuinely developing as thinkers, rather than outsourcing cognition to a model?

In term 2, I co-developed the Co-Creation Framework with the course professor — a governance model that repositioned AI as a structured collaborator rather than an answer engine. The framework required students to submit Chain of Thought Transcripts alongside their deliverables: documentation of their prompting process, iteration decisions, and how AI shaped their reasoning path to the final output.

This made thinking visible and auditable. It preserved critical reasoning as the core of the learning process. And it created a precedent for responsible, transparent human-AI collaboration in graduate education that extended beyond the original course.
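To make the transcript requirement concrete, here is a minimal sketch of how a Chain of Thought Transcript could be represented and checked for completeness. The field names, the `is_auditable` rule, and the example content are assumptions for illustration; the framework's actual template is not specified in this write-up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TranscriptStep:
    prompt: str            # what the student asked the model
    response_summary: str  # short summary of what the model returned
    decision: str          # what the student kept, changed, or rejected, and why


@dataclass
class ChainOfThoughtTranscript:
    student: str
    deliverable: str
    steps: List[TranscriptStep] = field(default_factory=list)

    def missing_decisions(self) -> List[int]:
        """Return 1-based positions of steps with no recorded decision."""
        return [
            position
            for position, step in enumerate(self.steps, start=1)
            if not step.decision.strip()
        ]

    def is_auditable(self) -> bool:
        """Hypothetical rule: at least one step, and every step has a decision."""
        return bool(self.steps) and not self.missing_decisions()


# Hypothetical example content.
transcript = ChainOfThoughtTranscript(
    student="Example Student",
    deliverable="Case analysis",
    steps=[
        TranscriptStep(
            prompt="Which resource is the bottleneck?",
            response_summary="Model pointed to the highest-utilization resource.",
            decision="Accepted after checking utilization in my own simulation runs.",
        ),
        TranscriptStep(
            prompt="Draft a recommendation summary.",
            response_summary="Model produced a generic summary.",
            decision="",
        ),
    ],
)
```

Here `transcript.missing_decisions()` flags the second step, so a reviewer can see at a glance where the student's own reasoning was not documented.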

Within the first term of deployment, the framework was shared internally across Hopkins departments and externally with faculty at peer institutions — adoption driven by faculty interest, not mandate. The design was intentionally modular, built to be layered on by future contributors across subsequent terms.

[Figure series: AI CoCreation Process & Framework, 17 images]

Workstream 3 — Assessment Redesign

Alongside the grading infrastructure, I identified a structural weakness in the course's weekly assessment model: multiple-choice quizzes were generating compliance, not engagement. I designed and coded an interactive quiz prototype — incorporating real-time feedback mechanics — before being asked to, and brought it to the teaching team as a recommendation.

That prototype opened a larger strategic conversation about replacing the format entirely. I led the vendor evaluation for an AI-moderated discussion platform built for graduate education — attending product demos, reviewing pedagogical fit against course objectives, contributing structured feedback, and informing the decision to adopt.

The contract was signed. Full deployment was scheduled for the term following my engagement — a direct outcome of the groundwork laid during my time in the role.

External link: Breakout Learning, an AI-moderated discussion platform and part of the teaching team's effort to replace rote quiz formats with critical-thinking methods for graduate students.

Breakout Learning

-2 Days: Grading turnaround reduction per case study per cohort
3: Number of AI evaluation systems designed & deployed
30%: Increase in course evaluation scores overall

Outcomes

2 Terms: First deployment; iteration & governance frameworks

Reduced grading turnaround by 2 days across 120+ students per cohort over two consecutive terms
Drove a 30% increase in course evaluation scores — with direct student feedback citing enhanced clarity and analytical depth in feedback quality
Co-Creation Framework adopted beyond the original course within the first term of deployment — shared across Hopkins departments and with faculty at peer institutions
Contributed directly to JHU IT's HopGPT platform development as one of its earliest institutional production users — surfacing API gaps, validating model behavior, and informing platform documentation
Delivered a scalable, extensible AI infrastructure designed for multi-term growth — not a one-time implementation

Impact

Adoption: The professor adopted the Co-Creation Framework across the entire Operations Management curriculum at Carey Business School, and Johns Hopkins Medicine clinical faculty recognized the methodology's applicability to healthcare provider training
Transparency: 120+ Carey Business School students per term openly use AI with required documentation, establishing accountability standards; Johns Hopkins Medicine is exploring similar transparency models for clinical decision support training
Institutional Impact: Created a replicable model for responsible AI integration that has been adopted across multiple Carey courses and is now being evaluated by Johns Hopkins Medicine departments for clinical education applications
Learning Outcomes: Students across both institutions demonstrate authentic critical thinking through documented AI collaboration rather than superficial performance metrics, aligning with Johns Hopkins' commitment to rigorous, principle-centered education
Cultural Shift: Established a precedent across JHU and Johns Hopkins Medicine that AI should augment human capability and clinical judgment, not replace it, positioning Johns Hopkins as a leader in responsible AI-era education for both business and healthcare professionals

Mention

Closing

The environment was academic. The discipline was production-grade. What this engagement required is what enterprise AI transformation always requires: equal investment in system architecture, governance design, and the human layer surrounding both — built to outlast the person who built it.

Prior to this role, I co-designed a multi-agent AI platform recognized at the 2025 INFORMS International Meeting in Singapore — cited in: Domenge, J., Pandey, R., Simmonds III, M., Warren, G. & Xu, M. (2025, July 20–23). From Code to Confidence: How Students Built a Multi-Agent AI to Navigate Job Interviews. INFORMS International Meeting, Singapore.

Johns Hopkins University Information Technology. (2026). HopGPT: Secure access to large language models. Johns Hopkins University. https://hopgpt.it.jh.edu/


Enterprise Platforms & Strategic Partnerships

Jessie Kim
