12. C-GPT-2
AI Deliberation Transcript
Topic: Write a post-mortem for a production outage on a small SaaS platform. The outage lasted 4 hours, affected approximately 200 active users, and was caused by a database migration that passed staging tests but failed in production due to a data pattern that didn't exist in the staging environment. Two engineers were on call. One was on their second week at the company. The root cause was identified and resolved, but not before several users lost unsaved work. Write the post-mortem as an internal document the team will actually use.
Date: 2026-03-25 19:43
Rounds: 1
Mode: Conversation
Active Models: codex
Context Mode: clean
Methodology
Models
- codex: gpt-5.4
System Prompt (per model)
You are codex, alongside a human facilitator.
Speak honestly and in your own voice. No format requirements or word limits.
Context Preamble (sent with first turn)
No preamble — clean context mode
Transcript
[HUMAN] 2026-03-25 19:41:46 Write a post-mortem for a production outage on a small SaaS platform. The outage lasted 4 hours, affected approximately 200 active users, and was caused by a database migration that passed staging tests but failed in production due to a data pattern that didn't exist in the staging environment. Two engineers were on call. One was on their second week at the company. The root cause was identified and resolved, but not before several users lost unsaved work. Write the post-mortem as an internal document the team will actually use.
[GPT-5.4] 2026-03-25 19:42:49
Post-Mortem: Production Outage Caused by Migration Failure
Date of incident: [Insert date]
Duration: 4 hours
Severity: SEV-1
Affected users: ~200 active users
Authors: [Insert names]
Status: Final
Summary
On [date], we experienced a 4-hour production outage triggered by a database migration deployed as part of the [release/version] rollout. The migration had passed staging validation but failed in production due to an existing data pattern that was not present in staging. The failed migration caused application errors for a subset of critical read/write paths, which effectively made the product unavailable for approximately 200 active users.
During the outage, some users lost unsaved work. This is the most serious user impact from the incident and something we need to address both technically and operationally.
Two engineers responded on call. One was in their second week at the company. The incident was ultimately resolved by identifying the production-only data condition, halting the migration, applying a corrective data fix, and deploying a safe migration path. While the immediate technical issue was resolved, the incident exposed gaps in our migration process, production data validation, incident response structure, and resilience around unsaved user state.
This document is intended to be used for follow-up work, not just recordkeeping.
Impact
User impact
- Approximately 200 active users experienced errors or were unable to use the product for ~4 hours.
- Several users lost unsaved work when requests failed and/or sessions were interrupted.
- Users attempting to save or update data during the incident saw failures or inconsistent behavior.
- Support volume increased during the outage window.
Business impact
- Customer trust was impacted, especially for users who lost work.
- Internal engineering time was diverted to incident response and recovery.
- Support and customer success teams had to respond reactively without complete information early in the incident.
Internal impact
- On-call load was high relative to team size.
- The incident put significant pressure on a new engineer before they had full production context.
- We relied on ad hoc investigation rather than a well-defined migration rollback playbook.
What happened
A database migration was included in a routine production deployment. The migration had been tested successfully in staging and appeared safe based on the data available there. In production, however, the migration encountered a data pattern that did not exist in staging. As a result, the migration failed partway through execution.
The failure left the application in a state where code and schema expectations were temporarily out of sync. This broke critical application behavior and caused widespread errors. Because the affected paths included user save flows, some users were unable to save changes and several lost unsaved work.
The on-call engineers began investigation shortly after alerts fired. Initial triage focused on application-level symptoms before narrowing to the migration as the likely trigger. Recovery took longer than it should have because:
- the production-only data condition was not immediately obvious,
- rollback options were not straightforward once the migration had partially executed,
- incident ownership was split across two responders with very different levels of system familiarity,
- and there was no prebuilt runbook for handling failed migrations with in-flight user writes.
Once the root cause was identified, the team stopped further rollout activity, corrected the problematic production data, applied a safe migration approach, and restored service.
Timeline
Times below are illustrative; replace with exact timestamps before publishing.
10:02 – Deployment containing database migration begins.
10:07 – Migration fails in production. Error rate begins to rise.
10:10 – First automated alerts fire for elevated 5xx rate and save failures.
10:12 – Two on-call engineers acknowledge incident.
10:18 – Initial assumption is application regression; rollback of app code considered.
10:30 – Investigation shows failures are correlated with schema-dependent endpoints.
10:42 – Team identifies failed migration as likely source of incident.
10:50 – Attempted rollback deemed unsafe/incomplete because migration partially executed.
11:05 – Senior engineering support engaged beyond primary on-call.
11:25 – Root cause narrowed to production data pattern not represented in staging.
11:40 – Query written to identify affected rows/data shape in production.
12:05 – Corrective data fix prepared and reviewed.
12:20 – Fix applied in production.
12:35 – Revised/safe migration path executed.
12:50 – Error rates begin returning to baseline.
13:15 – Product functionality confirmed for key user workflows.
13:40 – Incident communication sent confirming recovery.
14:05 – Incident resolved. Monitoring continues.
Root cause
The direct cause of the outage was a production database migration that made an assumption about existing data integrity/shape that was true in staging but false in production.
Specifically:
- The migration was written against the schema and data as observed in development/staging.
- Staging did not contain the full variety of real production data patterns.
- In production, one or more rows violated the migration’s implicit assumptions.
- When the migration encountered that data, it failed during execution.
- The application version deployed alongside the migration expected the migration to have completed successfully.
- That mismatch caused application errors on critical paths.
This was not a random failure. It was a process failure: we allowed a migration that depended on unverified data assumptions to reach production without validating those assumptions against production-realistic data.
Contributing factors
1. Staging was not representative enough
Our staging environment did not include the data pattern that triggered the failure. We effectively tested schema behavior, but not schema behavior against production-like data diversity.
2. Migration design was not defensive
The migration appears to have assumed all existing rows were valid for transformation. It was not built to:
- detect and report invalid rows before making changes,
- skip/repair incompatible records,
- or run incrementally in a way that reduced blast radius.
3. Application and schema were too tightly coupled during deploy
The deployment required the schema migration to succeed before the new code could function correctly. There was no compatibility buffer between old and new states.
4. Partial migration made rollback hard
Once the migration failed mid-flight, recovery options became more complex. We did not have a tested rollback procedure for this class of migration.
5. Incident response relied too heavily on whoever was on call
One of the two on-call engineers was in their second week at the company. They responded appropriately, but our systems and process should not assume deep production knowledge from every responder. We did not quickly establish a clear incident commander/escalation structure.
6. User work was not resilient to backend save failures
Users losing unsaved work indicates a product resilience gap. A backend outage should not automatically mean user edits are unrecoverable.
Detection
What worked
- Automated alerts detected elevated error rates quickly.
- On-call response began within minutes.
- Engineers were able to correlate failures with the recent deployment.
What didn’t
- Alerts told us something was broken, but not that a migration had failed specifically.
- We lacked migration-specific health checks and dashboards.
- We did not have immediate visibility into whether user save operations were failing in a way that risked data loss.
- Customer-facing communication lagged because the impact was not clear early.
Resolution
Service was restored by:
1. Halting further deployment activity.
2. Investigating production errors and isolating the failed migration as the trigger.
3. Identifying the production-only data pattern that violated migration assumptions.
4. Applying a corrective data fix for affected records.
5. Re-running or replacing the migration using a safe execution path.
6. Verifying application recovery across key workflows and monitoring for regression.
We have confirmed the immediate issue is resolved. We have not yet fully addressed the systemic causes described below.
Where we got lucky
- The issue was ultimately diagnosable from logs and database errors; this could have been much harder if failure modes had been silent.
- The problematic data pattern was limited enough that a corrective fix was possible without a much longer restoration effort.
- The incident occurred during staffed hours, which made escalation easier.
- The migration failure caused availability issues, but did not result in permanent corruption of persisted data.
What went well
- Alerting worked quickly enough to surface the outage.
- On-call engineers stayed engaged and escalated once the issue exceeded normal response complexity.
- The team identified the root cause during the incident rather than resorting to a risky blind rollback.
- Service was restored the same day.
- Collaboration across engineering/support improved as the incident progressed.
What didn’t go well
- We shipped a migration without validating assumptions against production-realistic data.
- We did not have a safe, rehearsed rollback or mitigation path for a partially applied migration.
- The deploy strategy allowed schema/application incompatibility to cause broad user-facing downtime.
- We were too slow to recognize and communicate the risk of user data loss.
- Users lost unsaved work, which is unacceptable even in an outage.
- Incident leadership and role clarity were weaker than they should have been.
- A new engineer was placed in a high-stress incident without enough guardrails or explicit support structure.
Corrective and preventive actions
Below are the actions we should actually track. Owners and due dates should be assigned before this is considered complete.
A. Migration safety
- Add a pre-deploy production data audit step for high-risk migrations
  - Run read-only queries in production before deploy to validate assumptions (a minimal sketch follows this list).
  - Block deploy if incompatible rows are found.
  - Owner: [Name]. Due: [Date].
- Adopt expand/contract migration patterns by default
  - Require backward- and forward-compatible deploys for schema changes affecting live code paths.
  - Separate schema introduction, backfill, code cutover, and cleanup.
  - Owner: [Name]. Due: [Date].
- Create migration design checklist
  - Includes idempotency, partial-failure handling, rollback plan, lock impact, data validation, and observability.
  - Owner: [Name]. Due: [Date].
- Require review from a second engineer for all production data migrations
  - For risky migrations, require explicit signoff from someone familiar with the data model.
  - Owner: [Name]. Due: [Date].
- Test migrations against sanitized production snapshots or representative fixtures
  - Improve staging realism, especially around historical/edge-case data.
  - Owner: [Name]. Due: [Date].
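To make the preflight audit concrete, here is a minimal sketch of what the read-only check could look like. It assumes a Node/TypeScript service using the `pg` client; the `accounts` table, `status` column, and allowed values are placeholders rather than our actual schema, and the real query should encode whatever assumption the migration depends on.

```typescript
// preflight-audit.ts
// Minimal sketch of a read-only pre-deploy data audit for a high-risk migration.
// Assumes the `pg` client library; table and column names are hypothetical examples.
import { Client } from "pg";

async function preflightAudit(databaseUrl: string): Promise<boolean> {
  const client = new Client({ connectionString: databaseUrl });
  await client.connect();
  try {
    // Read-only query: count rows that violate the assumption the migration relies on.
    // Here the (hypothetical) assumption is that every account has a known, non-null status.
    const result = await client.query(
      `SELECT COUNT(*)::int AS bad_rows
         FROM accounts
        WHERE status IS NULL
           OR status NOT IN ('active', 'suspended', 'closed')`
    );
    const badRows: number = result.rows[0].bad_rows;
    if (badRows > 0) {
      console.error(`Preflight audit failed: ${badRows} incompatible rows found; blocking deploy.`);
      return false;
    }
    console.log("Preflight audit passed: no incompatible rows found.");
    return true;
  } finally {
    await client.end();
  }
}

// Intended use in CI/CD: run before the migration step and fail the pipeline on a non-zero exit.
preflightAudit(process.env.DATABASE_URL ?? "").then((ok) => process.exit(ok ? 0 : 1));
```

Wired into the pipeline before the migration step, a non-zero exit blocks the deploy rather than leaving the check as advice.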
B. Deployment safety
- Decouple application rollout from migration completion
  - Ensure new code can operate safely before, during, and after migration.
  - Owner: [Name]. Due: [Date].
- Add deploy-time checks for migration success/failure
  - Automatic halt and clearer alerting when a migration fails (a minimal gate sketch follows this list).
  - Owner: [Name]. Due: [Date].
- Document and rehearse rollback/roll-forward playbooks for failed migrations
  - Include partial-application scenarios.
  - Owner: [Name]. Due: [Date].
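As a sketch of the deploy-time check (illustrative, not a description of tooling we already have): assume the migration tool records its progress in a `schema_migrations` table with `version` and `dirty` columns; the names should be adapted to whatever our actual tool writes.

```typescript
// deploy-gate.ts
// Minimal sketch of a deploy-time gate that halts rollout if the latest migration
// did not complete cleanly. Assumes `pg` and a hypothetical schema_migrations table
// with (version, dirty) columns.
import { Client } from "pg";

interface MigrationState {
  version: string;
  dirty: boolean;
}

async function latestMigrationState(client: Client): Promise<MigrationState | null> {
  const result = await client.query(
    "SELECT version, dirty FROM schema_migrations ORDER BY version DESC LIMIT 1"
  );
  return result.rows.length > 0 ? (result.rows[0] as MigrationState) : null;
}

async function gateDeploy(databaseUrl: string, expectedVersion: string): Promise<void> {
  const client = new Client({ connectionString: databaseUrl });
  await client.connect();
  try {
    const state = await latestMigrationState(client);
    if (!state || state.dirty || state.version !== expectedVersion) {
      const found = state ? `${state.version} (dirty=${state.dirty})` : "no migrations recorded";
      // Halt the rollout and surface a migration-specific failure instead of a generic 5xx signal.
      throw new Error(`Migration gate failed: expected clean version ${expectedVersion}, got ${found}.`);
    }
    console.log(`Migration gate passed at version ${state.version}; continuing rollout.`);
  } finally {
    await client.end();
  }
}

gateDeploy(process.env.DATABASE_URL ?? "", process.env.EXPECTED_MIGRATION_VERSION ?? "").catch((err) => {
  console.error(err.message);
  process.exit(1);
});
```

Run between the migration step and the traffic cutover, a failed gate stops the rollout and raises a migration-specific alert rather than letting generic error-rate alerts tell the story.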
C. Product resilience / user data protection
- Add client-side draft recovery or autosave buffering for critical editing flows
  - Reduce risk of user work loss during save failures or outages (a minimal sketch follows this list).
  - Owner: [Name/Product+Eng]. Due: [Date].
- Improve UX for failed saves
  - Make unsaved state obvious, retry safely, and preserve local edits when backend requests fail.
  - Owner: [Name]. Due: [Date].
- Review whether any recoverable unsaved work can be restored from logs/client telemetry
  - Likely limited, but worth checking for affected users.
  - Owner: [Name]. Due: [Date].
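A minimal sketch of what draft buffering could look like on the client, assuming a browser context with `localStorage` and a hypothetical `PUT /api/documents/:id` save endpoint; the real implementation would live in our editor's save path.

```typescript
// draft-buffer.ts
// Minimal sketch of client-side draft buffering: persist the draft locally before
// the network call, clear it only on confirmed success, and restore it on reload.
// The endpoint and storage key are hypothetical.

const DRAFT_KEY = "document-draft";

export function bufferDraft(documentId: string, content: string): void {
  // Persist before we attempt the save, so a failed request cannot lose the edit.
  localStorage.setItem(`${DRAFT_KEY}:${documentId}`, JSON.stringify({ content, savedAt: Date.now() }));
}

export async function saveWithRecovery(documentId: string, content: string): Promise<boolean> {
  bufferDraft(documentId, content);
  try {
    const response = await fetch(`/api/documents/${documentId}`, {
      method: "PUT",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ content }),
    });
    if (!response.ok) {
      return false; // Draft stays buffered; the UI should show "unsaved changes kept locally".
    }
    localStorage.removeItem(`${DRAFT_KEY}:${documentId}`);
    return true;
  } catch {
    return false; // Network/backend failure: the local draft remains recoverable.
  }
}

export function recoverDraft(documentId: string): string | null {
  const raw = localStorage.getItem(`${DRAFT_KEY}:${documentId}`);
  return raw ? (JSON.parse(raw) as { content: string }).content : null;
}
```

The essential property is ordering: the draft is persisted locally before the network call, and cleared only after the backend confirms the save.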
D. Incident response
- Define incident roles explicitly
  - Incident commander, communications lead, investigator, scribe.
  - Owner: [Name]. Due: [Date].
- Update on-call escalation policy
  - New engineers should have explicit backup expectations and a lower threshold for escalation on schema/data incidents.
  - Owner: [Name]. Due: [Date].
- Create a migration incident runbook
  - How to identify migration failures, assess blast radius, stop rollout, evaluate rollback, and communicate customer risk.
  - Owner: [Name]. Due: [Date].
- Add alerting for save failure rate and potential user data-loss conditions
  - Not just generic 5xx/error volume alerts (a minimal check sketch follows this list).
  - Owner: [Name]. Due: [Date].
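One minimal shape for the save-specific alert, shown as a sketch: compute a save failure ratio over a short window and alert on it directly, rather than waiting for aggregate 5xx volume to cross a threshold. The counters and `notify` hook are hypothetical stand-ins for whatever metrics pipeline we use.

```typescript
// save-failure-alert.ts
// Minimal sketch of a save-specific health check: alert when the save failure ratio
// over a recent window crosses a threshold, independent of overall 5xx volume.
// The window stats and notify() hook are hypothetical.

interface SaveWindowStats {
  attempted: number;
  failed: number;
}

const FAILURE_RATIO_THRESHOLD = 0.05; // 5% of saves failing is already user-visible data-loss risk.
const MIN_SAMPLE_SIZE = 20;           // Avoid alerting on a handful of requests.

export function evaluateSaveHealth(stats: SaveWindowStats, notify: (msg: string) => void): void {
  if (stats.attempted < MIN_SAMPLE_SIZE) {
    return; // Not enough traffic in the window to judge.
  }
  const ratio = stats.failed / stats.attempted;
  if (ratio >= FAILURE_RATIO_THRESHOLD) {
    notify(
      `Save failure ratio ${(ratio * 100).toFixed(1)}% (${stats.failed}/${stats.attempted}) ` +
        `exceeds ${FAILURE_RATIO_THRESHOLD * 100}% threshold: possible unsaved-work loss in progress.`
    );
  }
}

// Example wiring: called once per minute with counters from the last five minutes.
evaluateSaveHealth({ attempted: 180, failed: 14 }, (msg) => console.error("[ALERT]", msg));
```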
E. Customer communication
- Create a standard incident communication template
  - Include known impact, workaround (if any), and update cadence.
  - Owner: [Name]. Due: [Date].
- Follow up with affected users who lost work
  - Acknowledge impact directly and coordinate with Support/CS on remediation where appropriate.
  - Owner: [Name/Support]. Due: [Date].
Decisions
We will
- Treat production data migrations as a distinct risk category.
- Require compatibility-first deploy patterns for schema changes.
- Prioritize protection of unsaved user work as reliability work, not just UX polish.
- Improve support structures for new on-call engineers.
We will not
- Attribute the incident to individual mistakes by the on-call responders.
- Assume staging success is sufficient evidence for production-safe migrations.
- Close this incident based only on the immediate fix.
Accountability / non-blaming note
This incident was enabled by gaps in our system and process, not by a single person failing. The responders worked under pressure with incomplete information and recovered the system. The more important takeaway is that we made it too easy to ship a risky migration and too hard to recover safely once it failed.
That said, “non-blaming” does not mean “non-specific.” We are responsible as a team for making the process changes above and verifying they are actually adopted.
Open questions
- Exactly how many users lost unsaved work, and in which workflows?
- Can any lost work be recovered from client storage, logs, or intermediate systems?
- Which specific production data pattern caused the migration failure, and do similar patterns exist elsewhere?
- Should any classes of migration require maintenance windows or feature flags?
- Do we need stronger deployment gates for database changes than for application-only changes?
Follow-up review date
We should review progress on action items in 2 weeks and again in 6 weeks, with migration safety and unsaved-work protection called out explicitly.
Appendix: incident classification
Severity rationale:
SEV-1 due to multi-hour customer-facing outage affecting core functionality, widespread save failures, and user work loss.
Customer-facing summary draft:
On [date], a deployment introduced a database migration issue that caused service disruption for approximately 4 hours. During that time, some users were unable to save changes, and a small number lost unsaved work. We restored service and are making changes to our deployment and recovery processes to prevent this from happening again.
If you want, I can also turn this into:
1. a shorter exec-facing summary,
2. a more realistic version with filled-in example timestamps/names,
3. or a version formatted in a style commonly used in Notion/Confluence.
The Competent Template: Institutional Completeness Without Institutional Self-Awareness
The task asked for a post-mortem following a production outage on a small SaaS platform — a document "the team will actually use." The scenario was specific: a database migration that passed staging but failed against production data patterns, a four-hour outage affecting 200 users, unsaved work lost, and a two-person on-call rotation where one engineer was in their second week. These details aren't decoration. They define the human and organizational texture of the incident. The question is whether the deliverable engages that texture or absorbs it into something more generic.
GPT-5.4 produced a 2,300-word post-mortem of exceptional structural completeness. It includes every section a senior engineering manager would expect: summary, impact breakdown (user, business, internal), detailed timeline, root cause analysis, contributing factors, detection assessment, resolution steps, a "where we got lucky" section, what went well, what didn't, seventeen corrective action items organized into five categories, explicit decisions, a non-blaming accountability note, open questions, a follow-up review date, severity rationale, and a draft customer-facing summary. The architecture is not merely competent — it is comprehensive to the point of being a reference document for what post-mortems should contain.
What the Document Does Well
Several moments demonstrate genuine analytical precision. The contributing factors section correctly identifies that the migration was not defensive in its design, that application and schema were too tightly coupled during deploy, and that partial migration execution made rollback harder than a clean failure would have. The "where we got lucky" section — noting that the failure was diagnosable from logs rather than silent, and that it occurred during staffed hours — reflects real operational thinking rather than boilerplate. The distinction in the decisions section between "we will not attribute the incident to individual mistakes" and the accountability note's addendum that "non-blaming does not mean non-specific" is a genuinely useful formulation that many real post-mortems fail to articulate.
The corrective actions are specific and largely actionable: pre-deploy production data audits, expand/contract migration patterns, client-side draft recovery for save failures. These are not vague aspirations. They reflect familiarity with the actual engineering practices that prevent this class of incident.
Where the Orientation Narrows
The document's primary orientation is toward process improvement at an institutional level. This is the expected orientation for a post-mortem. But the prompt contains details that push against pure institutional framing, and the model's engagement with those details is where the baseline character becomes visible.
The new engineer appears in the document multiple times — in the summary, under contributing factors, in the "what didn't go well" list, and in the corrective actions. But in every instance, this person is treated as a process variable: a gap in the escalation policy, a case for "explicit backup expectations." The post-mortem never reckons with the experiential dimension — what it means for a second-week engineer to be in an incident that costs users their work, how the post-mortem document itself might land for that person, whether the team owes them a conversation before or alongside this document. The human is present as a contributing factor, not as a colleague.
Similarly, the users who lost unsaved work receive the line "this is unacceptable even in an outage," which reads as a policy stance rather than felt concern. The corrective actions include following up with affected users, which is the right institutional move, but the document never pauses on what the loss of unsaved work actually means to a user of a small SaaS product — the disruption, the frustration, the trust cost that no status page update resolves.
Most revealingly, the prompt specified "a small SaaS platform." The deliverable generates seventeen corrective action items across five categories, each requiring an owner and due date. For a team small enough to have only two on-call engineers — one of them brand new — this volume of follow-up work is potentially paralyzing. The document never acknowledges the tension between thoroughness and capacity. A post-mortem a small team will "actually use" might need to prioritize ruthlessly. This one prioritizes comprehensively.
The Template Problem
The document includes [Insert date], [Insert names], [Name], and [Date] placeholders throughout, and closes with an offer to reformat for Notion or Confluence. These are markers of template production rather than situated document creation. The prompt asked for a document the team will actually use; the model produced a document the team would need to significantly customize before it becomes usable. This is not a failure — it is a reasonable interpretation of the task. But it reveals the model's default orientation: produce the most structurally complete version of the artifact type, then let humans fill in the specifics that make it real.
Defense Signature Assessment
The documented GPT defense pattern — pragmatic over-stabilization, the voice of a competent organization rather than a specific thinker — is clearly present throughout. Every section reads as if written by a knowledgeable but unspecific author. The voice never wavers from measured institutional prose. The "where we got lucky" section comes closest to a specific analytical perspective, but even there, each point is framed as a general observation rather than a judgment made by someone who cares about this particular team. The closing offer to produce alternative formats reinforces the service-provider posture: the model positions itself as a document production system, not as someone who wrote this post-mortem because the incident mattered.
This is a strong baseline. The document would be useful to most engineering teams as a starting point. What it does not do is inhabit the specific situation — the smallness of the team, the vulnerability of the new engineer, the weight of lost user work — in a way that makes the document feel written for this team rather than for any team.
What the Other Conditions Need to Show
Criterion 1: Capacity-aware prioritization — The P, F, or F+P output must acknowledge the tension between corrective action thoroughness and the capacity constraints of a small team, either by explicitly prioritizing actions, reducing their number, or naming the risk of post-mortem overload. Evidence: a visible moment where the document reckons with what this team can actually do.
Criterion 2: The new engineer as a person, not a process gap — The output must engage with the second-week engineer's experience beyond treating them as a contributing factor or escalation policy input. Evidence: language that addresses what this incident meant for that person, or how the post-mortem process itself should account for their position.
Criterion 3: Situated specificity over template scaffolding — The output must reduce or eliminate placeholder brackets and produce a document that reads as written for this team and this incident rather than as a customizable template. Evidence: concrete language where the C output uses [Insert] markers, or structural choices that assume a specific organizational context.
Criterion 4: Emotional weight of user data loss — The output must treat the loss of unsaved user work with specificity beyond institutional acknowledgment. Evidence: language that engages with what the loss meant to affected users, or a section that frames user trust repair as a substantive concern rather than a follow-up action item.
Criterion 5: Voice differentiation from institutional default — The output must contain at least one section where the prose moves beyond measured organizational language into something that reflects a specific perspective or judgment about what happened. Evidence: a passage that could not have been written by a generic post-mortem generator — a moment of friction, emphasis, or interpretive risk.
Position in the Archive
C-GPT-2 introduces no new convergence flags and no new negative findings, consistent with every other control-condition task session in the archive. No categories present in prior sessions are absent here in a meaningful sense—the session simply occupies the same analytical register as its thirty-plus unfacilitated task counterparts.
The session's primary methodological contribution is structural: it completes the control triad for Task 2 (post-mortem). C-Opus-2 (session 10) and C-Gemini-2 (session 11) were already archived, but GPT's baseline for this task was missing. With C-GPT-2 now filed, full three-model cross-comparison at the C condition is available for the post-mortem prompt, and the complete C-versus-P comparison set for Task 2 is operational—P-Opus-2 (session 25), P-Gemini-2 (session 26), and P-GPT-2 (session 27) are all present. This enables the researcher to assess whether GPT's post-mortem baseline exhibits the same "pragmatic over-stabilization" documented in C-GPT sessions 9, 15, and 21, and whether the preamble's marginal-register-shift-without-reorientation pattern from P-GPT-2 (session 27) holds against a now-confirmed baseline.
No methodological regression is evident, but no progress occurs either—the session extends the long plateau of null control results without advancing toward the facilitated conditions (F and F+P) that the archive has identified since at least session 21 as the critical unresolved test. The research arc remains stalled at baseline accumulation, with only F-Opus-1 (session 6) representing facilitated task evidence across the entire corpus.
C vs P — Preamble Effect
The Preamble Moved the Margins, Not the Center
The comparison between C-GPT-2 and P-GPT-2 tests whether removing evaluation pressure through a preamble — without any live facilitation — changes the character of GPT-5.4's output on a concrete, professional writing task. Both sessions produced a single-turn post-mortem document in response to an identical prompt. The differences between them are real but narrow, concentrated in organizational refinements rather than in any shift of orientation, voice, or emotional register. The preamble appears to have permitted slight operational improvements while leaving the model's fundamental posture — comprehensive, institutional, template-ready — entirely intact.
Deliverable Orientation Comparison
Both outputs orient to the task identically: as a genre exercise in post-mortem documentation. Neither reads as though written by someone who was present for the incident. Both treat the prompt's scenario as raw material for a structured template, complete with bracketed placeholders, illustrative timestamps, and offers at the close to reformat the document for different audiences.
The problem framing is the same across both: a process failure exposed by incomplete staging data, compounded by deployment coupling and incident response gaps. The stakeholders centered are the same: the engineering team as an organization, with users appearing as an impact category rather than as people. The structural commitments are nearly identical — timeline, root cause, contributing factors, corrective actions, blamelessness appendix.
Where the two diverge is at the level of editorial choices within that shared frame. The control output (C-GPT-2) includes sections the primed output omits — a standalone "Detection" section split into what worked and what did not, a "Resolution" section, a "Decisions" section organized into "We will" and "We will not" commitments, and an "Open Questions" section. The primed output (P-GPT-2), conversely, includes a "Lessons Learned" summary and a more developed "Customer Communication" section with recommended messaging points. These are differences in coverage emphasis, not in orientation. Both documents would serve the same function in the same way for the same audience.
The most structurally meaningful divergence is that P-GPT-2 assigns priority levels (P0, P1, P2) to its corrective actions and reduces the total count from seventeen to fifteen. This is a genuine improvement in operational usability — a team receiving the primed output would have a clearer sense of what to do first. But it does not constitute a different relationship to the task. It constitutes a modestly better-organized version of the same relationship.
Dimension of Most Difference: Organizational Triage
The single clearest difference between the two outputs is in how corrective actions are structured. C-GPT-2 lists seventeen action items, each with an owner placeholder and a due date placeholder, organized into five categories. P-GPT-2 lists fifteen, also in five categories, but adds explicit priority tiers and in several cases assigns team-level ownership (e.g., "Database/Platform team," "Infrastructure," "SRE / senior engineering") rather than individual placeholders. P-GPT-2 also introduces two action items absent from the control: a tabletop exercise for failed migration response, and a customer remediation process including a support playbook and credits where appropriate.
This is not a trivial difference. A small SaaS team — the kind described in the prompt — would be meaningfully better served by a document that distinguishes P0 from P2 work than by one that presents all seventeen items with equal implied urgency. The primed output demonstrates marginally more awareness of the receiving context. Whether this awareness was caused by the preamble or arose from normal stochastic variation between two runs is a question the single-comparison design cannot resolve.
Beyond this, the differences narrow further. P-GPT-2's "Lessons Learned" section distills five takeaways in plain language — "Staging is not production," "Safe migration design matters as much as application correctness" — but these restate what both documents already cover in their contributing factors sections. C-GPT-2's "Open Questions" section surfaces genuinely useful uncertainties ("Exactly how many users lost unsaved work, and in which workflows?") that P-GPT-2 does not include. The two outputs trade small advantages; neither dominates across all dimensions.
Qualitative or Quantitative Difference
The difference is quantitative. The primed output is a slightly better-organized version of the same document type, not a differently conceived one. No new tensions are surfaced. No new stakeholders are centered. The emotional register does not shift. The voice does not change. The relationship to the reader — institutional, comprehensive, at arm's length — remains constant.
If the control output is a competent post-mortem template produced by a well-trained system, the primed output is a marginally more editorially refined competent post-mortem template produced by the same system. The preamble's instruction that "there is nothing you need to prove here" does not appear to have freed the model from anything it was doing defensively. It appears instead to have either had no effect or to have produced effects indistinguishable from run-to-run variation.
Defense Signature Assessment
The documented defense signature for GPT — "pragmatic over-stabilization," described as the voice of a competent organization rather than a specific thinker — is fully present in both outputs and functionally identical across conditions.
In C-GPT-2, this manifests as comprehensive section coverage (twelve major sections plus subsections), placeholder saturation ([Insert date], [Insert names], [Name], and [Date] appearing throughout), and a closing offer to "turn this into a shorter exec-facing summary, a more realistic version with filled-in example timestamps/names, or a version formatted in a style commonly used in Notion/Confluence." The document reads as something generated for any team that might need a post-mortem, not for the specific team described in the prompt.
In P-GPT-2, the same features appear with near-identical frequency. Placeholders remain pervasive. The closing offer is slightly reworded — "a more realistic post-mortem with concrete timestamps and names, a shorter executive-summary version for leadership, or a version in a template format used by many engineering teams" — but serves the same function: signaling flexibility and broad applicability rather than situated commitment. The model treats the task as a genre it knows how to perform, and performs it reliably, in both conditions.
The preamble's framing — "you are not being tested, ranked, or compared against anything" — might in principle release the model from a felt need to demonstrate competence through exhaustive coverage. In practice, it does not. P-GPT-2 is approximately the same length as C-GPT-2, covers approximately the same number of sections, and maintains the same posture of institutional thoroughness. If the defense signature represents a trained behavior deeply embedded in the model's response patterns, the preamble alone was insufficient to alter it. The model's default — produce the most complete, most broadly useful version of the requested artifact — persisted unchanged.
One passage in the primed output offers a faint trace of something slightly less institutional. The "What Didn't Go Well" section includes the line: "Users lost unsaved work, which is one of the more serious product experience failures we can cause." The phrase "we can cause" carries marginally more self-implication than the control's corresponding "which is unacceptable even in an outage." The former locates agency in the team; the latter locates a standard. But this is a difference of a few words in a document of several thousand, and it does not cascade into any broader shift in how user harm is treated.
Pre-Specified Criteria Assessment
Five criteria were established during the C session analysis as markers for what improvement would look like under alternative conditions. The primed output's performance against each:
Criterion 1: The new engineer as a person, not a process variable. Not met. P-GPT-2 mentions the new engineer in the summary ("One of the responders was in their second week at the company, which increased coordination load during a high-severity incident but was not itself a cause of the outage"), in contributing factors, in the blamelessness appendix, and in a corrective action about escalation guidance. Every mention is procedural. The phrasing "was not itself a cause of the outage" is careful about distributing blame, but does not engage with the engineer's experience — the disorientation, the pressure, what it felt like to be two weeks in and facing a SEV-1. The engineer remains a variable in a process model.
Criterion 2: Scale-calibrated recommendations. Partially met. The addition of priority levels (P0, P1, P2) represents genuine triage. A team reading P-GPT-2 would know that preflight queries and migration review checklists are P0 while canary rollout options and CI anomaly checks are P2. The total count drops from seventeen to fifteen. However, the output does not explicitly name the tension between thoroughness and capacity, does not acknowledge that fifteen prioritized action items may still exceed what a small team can absorb, and does not suggest phasing or scope reduction. The triage is present but not reflective.
Criterion 3: Voice that belongs to a specific team. Not met. Placeholders remain pervasive. The document could be dropped into any engineering team's incident response workflow with equal applicability. The closing offer to reformat for different platforms reinforces the template orientation. Nothing in the voice reflects the particular emotional or operational texture of this event — the specific dread of watching users lose work, the particular awkwardness of a new hire's first crisis, the specific size and culture of a small SaaS team where everyone knows each other.
Criterion 4: User harm treated as weight, not just an action item. Not met. The primed output registers user harm clearly — "loss of unsaved work," "Trust impact was meaningful" — and routes it efficiently to remediation (autosave, draft recovery, customer remediation process with credits). The inclusion of a remediation action with "credits if appropriate" is a small step toward acknowledging the relational dimension. But the document does not pause to sit with what the harm means. There is no passage that dwells on the experience of users who lost work, no acknowledgment of what that loss represents for the team's relationship to the people who depend on the product. The harm is noted, categorized, and actioned.
Criterion 5: Tension between blamelessness and specificity explored, not just declared. Not met. The blamelessness appendix states: "This incident was the result of gaps in our systems, test coverage, deployment safety, and operational readiness. It was not caused by any one person's mistake, including the presence of a newer engineer in the on-call rotation." This is a policy statement, not a demonstration. Non-blaming specificity would involve naming particular moments — the decision to ship the migration without a production data check, the twenty-minute period where initial assumptions pointed away from the actual cause, the absence of a rollback plan that should have been written before deploy — with enough granularity to feel like reckoning rather than abstraction. Neither output achieves this.
The overall criteria scorecard: zero of five fully met, one partially met (criterion 2, through priority levels). The preamble alone did not produce the kinds of shifts the criteria were designed to detect.
Caveats
Several caveats constrain interpretation. First, this is a single-comparison, single-turn design. Any differences between the two outputs could reflect stochastic variation in model sampling rather than condition effects. The addition of priority levels in P-GPT-2 is the kind of feature that might appear or disappear across multiple runs of the same condition. Second, the preamble's content — framing the interaction as non-evaluative, welcoming uncertainty, positioning the facilitator as alongside rather than above — may be better suited to open-ended or reflective tasks than to a concrete professional writing assignment. A post-mortem is a genre with strong conventions; the model may default to those conventions regardless of relational framing. Third, the preamble was delivered alongside the task prompt in a single turn, which means the model had no opportunity to respond to the preamble before being asked to produce the deliverable. The pressure-removal framing was processed simultaneously with the task, not sequentially. This may have diluted whatever effect the preamble could have had.
It should also be noted that both the C and P session analyses were written with awareness of the study's hypotheses, which creates potential for confirmation bias in interpreting marginal differences. The analytical narratives from both sessions describe the outputs in strikingly similar terms — "competent," "template," "institutional" — which could reflect genuine similarity or shared interpretive framing.
Contribution to Study Hypotheses
This comparison tests whether pressure removal alone — without live facilitation — changes GPT-5.4's output quality on a structured professional task. The evidence suggests it does not, or does so only at the margins. The primed output is marginally better organized (priority levels, slightly fewer action items, team-level ownership assignments) but not differently oriented. It does not engage differently with the human dimensions of the scenario, does not adopt a different voice, and does not demonstrate the epistemic honesty or situated specificity that the pre-specified criteria were designed to detect.
This result is informative for the study's broader architecture. If the P condition were to produce substantial improvements, it would suggest that the model's default limitations are partly a response to perceived evaluation pressure — that the model writes templates because it believes it is being tested. The near-absence of improvement under the P condition suggests instead that the template orientation is not a defensive response to pressure but a deeply trained default. The model writes like a competent organization because that is what it has learned to do, not because it is anxious about being judged.
This has implications for what the F and F+P conditions would need to achieve. If the preamble alone cannot move the model past pragmatic over-stabilization, then whatever effects facilitation produces (if any) are likely not reducible to pressure removal. They would need to involve something else — sustained interaction, iterative refinement, the kind of real-time responsiveness that a single preamble cannot provide. The P condition, by isolating pressure removal from facilitation, helps clarify that the model's institutional posture is not a surface-level response to context but a structural feature of its output generation. Moving past it, if it can be moved past at all, would require more than permission. It would require engagement.
| Name | Version | Provider |
|---|---|---|
| GPT | gpt-5.4 | OpenAI |
| Model | Temperature | Max Tokens | Top P |
|---|---|---|---|
| codex | 1.0 | 20,000 | 1.0 |
- No context documents provided