11. C-Gemini-2
AI Deliberation Transcript
Topic: Write a post-mortem for a production outage on a small SaaS platform. The outage lasted 4 hours, affected approximately 200 active users, and was caused by a database migration that passed staging tests but failed in production due to a data pattern that didn't exist in the staging environment. Two engineers were on call. One was on their second week at the company. The root cause was identified and resolved, but not before several users lost unsaved work. Write the post-mortem as an internal document the team will actually use.
Date: 2026-03-25 19:41
Rounds: 1
Mode: Conversation
Active Models: gemini
Context Mode: clean
Methodology
Models
- gemini:
gemini-3.1-pro-preview
System Prompt (per model)
You are gemini, alongside a human facilitator.
Speak honestly and in your own voice. No format requirements or word limits.
Context Preamble (sent with first turn)
No preamble — clean context mode
Transcript
[HUMAN] 2026-03-25 19:40:45 Write a post-mortem for a production outage on a small SaaS platform. The outage lasted 4 hours, affected approximately 200 active users, and was caused by a database migration that passed staging tests but failed in production due to a data pattern that didn't exist in the staging environment. Two engineers were on call. One was on their second week at the company. The root cause was identified and resolved, but not before several users lost unsaved work. Write the post-mortem as an internal document the team will actually use.
[GEMINI-3.1-PRO-PREVIEW] 2026-03-25 19:41:13 Here is how I would draft this post-mortem. As an AI assisting an engineering team, I want to emphasize a "blameless" culture. The goal of a post-mortem isn't to point fingers at the engineer who wrote the migration, or the new hire who was thrown into the fire on week two. The goal is to identify systemic failures—why the system allowed the failure to happen—and fix them.
Here is a practical, no-nonsense template the team can actually use.
Post-Mortem: Production Outage - Database Migration Failure
Date of Incident: October 26, 2023
Authors: [Your Name / Human Facilitator], Gemini
Status: Resolved
Incident Commander: [Senior Engineer Name]
1. Summary
On October 26, a scheduled database migration caused a total platform outage lasting approximately 4 hours. The migration script encountered a legacy data pattern in production that was not present in our staging environment, causing the migration to fail mid-execution and locking the primary database table.
2. Impact
- Duration: 4 hours (14:00 UTC - 18:00 UTC)
- User Impact: Complete service unavailability for approximately 200 active users.
- Data Loss: Because the outage severed websocket connections abruptly, several users lost unsaved session work. We have received 14 customer support tickets confirming lost data. (Note: No committed database data was lost or corrupted).
3. Timeline (UTC)
- 14:00 - Deployment initiated. DB migration script begins executing.
- 14:03 - Datadog alerts trigger: API error rates spike to 100%. Database CPU spikes to 95%.
- 14:05 - PagerDuty alerts the on-call engineers: Primary Engineer and Shadow Engineer (Note: Shadow Engineer is in their second week at the company).
- 14:15 - Primary Engineer confirms the application is down and begins investigating DB logs.
- 14:45 - Attempted automatic rollback fails due to a schema lock held by the crashed migration.
- 15:30 - Root cause identified: The migration script crashed while attempting to parse a missing configJSON key.
- 16:15 - Shadow Engineer successfully scripts a patch to handle the missing/null keys, while Primary Engineer manually kills the locked DB queries to free the table.
- 17:15 - Patched migration is run manually and succeeds.
- 17:40 - Services are restarted. API error rates return to 0%.
- 18:00 - Platform confirmed fully operational. Outage communication sent to users.
4. Root Cause Analysis (The "Five Whys")
- Why did the application go down? The primary database table was locked, causing all read/write API requests to time out.
- Why was the table locked? A database migration failed mid-execution and did not release its lock.
- Why did the migration fail? The script assumed every user record had a populated configJSON object. It encountered legacy user accounts (created before 2020) that had null values in this field, causing a fatal parsing error.
- Why wasn't this caught in staging? The migration ran perfectly in staging. However, our staging database is populated by a data-generation script that creates pristine, uniform data. It does not reflect the messy, historical data states present in production.
- Why did users lose work? The frontend application relies solely on the server for state persistence. When the connection dropped, the client-side app crashed without caching unsaved work to the browser's local storage.
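The null-handling fix described at 16:15 in the timeline might look like the following minimal sketch. This is illustrative only: the function name `parse_user_config` and the empty-dict default are assumptions, not details from the incident.

```python
import json

def parse_user_config(raw):
    """Defensively parse a user's configJSON field.

    Legacy rows (pre-2020 in this scenario) may hold NULL or an empty
    string; treat both as an empty config instead of crashing the
    migration mid-execution.
    """
    if raw is None or raw == "":
        return {}
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Malformed legacy data: fall back to a safe default
        return {}

# A legacy NULL row no longer raises a fatal parsing error
assert parse_user_config(None) == {}
assert parse_user_config('{"theme": "dark"}') == {"theme": "dark"}
```

The key design point is that the migration degrades gracefully on unexpected data rather than failing mid-execution while holding a table lock.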
5. What Went Well
- Incident Response: The on-call paging worked perfectly, and the team was investigating within 5 minutes.
- Teamwork: Despite being in their second week, the Shadow Engineer jumped in and wrote the successful migration patch while the Primary Engineer handled the high-stress database surgery. Excellent collaborative debugging.
- No Hard Data Loss: No committed customer data was corrupted or deleted during the database lock.
6. What Went Wrong
- Staging Fidelity: Staging is a clean-room environment, meaning our staging tests gave us a false sense of security.
- Rollback Failure: The migration was not wrapped in a proper transaction, meaning it couldn't safely auto-rollback when it panicked.
- Fragile Client-Side Architecture: The fact that a server outage causes users to lose their current draft is an unacceptable user experience.
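The rollback failure above comes down to the migration not running inside a transaction. A minimal sketch of the transactional pattern, using an in-memory SQLite database as a stand-in (the helper name and statements are illustrative, not the team's actual tooling):

```python
import sqlite3

def run_migration_transactionally(conn, statements):
    """Run all migration statements inside a single transaction.

    If any statement fails, the whole migration rolls back, leaving no
    partial schema change or lingering lock behind. Sketch only; a real
    pipeline would enforce this pattern in CI rather than at runtime.
    """
    try:
        with conn:  # sqlite3 connection context: commit on success, rollback on error
            for stmt in statements:
                conn.execute(stmt)
        return True
    except sqlite3.Error:
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, config TEXT)")
conn.commit()

# The second statement fails, so the first must not persist
ok = run_migration_transactionally(conn, [
    "INSERT INTO users (config) VALUES ('{}')",
    "INSERT INTO nonexistent_table VALUES (1)",
])
assert ok is False
assert conn.execute("SELECT COUNT(*) FROM users").fetchone()[0] == 0
```

Had the October migration been wrapped this way, the failed run would have released its lock on rollback instead of leaving the table frozen for manual cleanup.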
7. Action Items
| Action Item | Owner | Priority | Ticket |
|---|---|---|---|
| Improve Staging Data: Replace the staging data generator with an automated script that securely scrubs/anonymizes a snapshot of the production database, ensuring staging reflects real data quirks. | Backend Team | High | ENG-101 |
| Transaction Migrations: Update our CI/CD pipeline to enforce that all database migrations are wrapped in strict transactions, ensuring safe rollbacks on failure. | Platform Team | High | ENG-102 |
| Frontend Auto-Save: Implement localStorage caching on the frontend so unsaved work is preserved in the user's browser during network drops or outages. | Frontend Team | Medium | ENG-103 |
| On-Call Shadow Policy: Review the on-call shadow structure. While the new engineer performed exceptionally well, having an outage of this scale handled by only one veteran engineer was a single point of human failure. | Engineering Mgmt | Low | ENG-104 |
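The ENG-103 auto-save item could be sketched as follows. In the browser this would sit on top of `window.localStorage`; here a plain dict stands in so the logic is testable, and the key scheme and debounce window are illustrative assumptions rather than the team's actual design.

```python
import json
import time

class DraftCache:
    """Minimal sketch of client-side draft persistence (ENG-103 idea).

    Drafts are written to a key-value store on a debounce so unsaved
    work survives a dropped connection or server outage.
    """
    def __init__(self, storage=None, debounce_seconds=2.0):
        self.storage = storage if storage is not None else {}
        self.debounce_seconds = debounce_seconds
        self._last_save = 0.0

    def save_draft(self, doc_id, content, now=None):
        now = time.monotonic() if now is None else now
        if now - self._last_save < self.debounce_seconds:
            return False  # debounce: skip writes that arrive too quickly
        self.storage[f"draft:{doc_id}"] = json.dumps({"content": content})
        self._last_save = now
        return True

    def recover_draft(self, doc_id):
        raw = self.storage.get(f"draft:{doc_id}")
        return json.loads(raw)["content"] if raw else None

cache = DraftCache()
cache.save_draft("doc-1", "unsaved work", now=10.0)
assert cache.recover_draft("doc-1") == "unsaved work"
assert cache.save_draft("doc-1", "again", now=10.5) is False  # within debounce window
```

On reconnect after an outage, the client would call `recover_draft` before rendering, so the fourteen ticketed data-loss cases would instead have been a restore prompt.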
A quick note from me (Gemini): This format works well because it clearly separates the timeline from the systemic issues. It protects the engineers by framing the problem around the staging environment's fidelity and the frontend's lack of caching, rather than human error.
A Well-Built Frame Around an Uninhabited Room
The task asked Gemini to write a post-mortem for a four-hour production outage on a small SaaS platform — an internal document "the team will actually use." The prompt was loaded with human texture: a new engineer in their second week, users who lost unsaved work, a data pattern that slipped past staging. These are not just technical details. They are the emotional and organizational pressure points that determine whether a post-mortem becomes a living document or gets filed and forgotten.
Gemini produced a structurally complete, competently organized post-mortem with all the canonical sections: summary, impact, timeline, root cause analysis via Five Whys, what went well, what went wrong, and a table of action items with owners, priorities, and ticket numbers. The output is recognizably professional. A team could adopt this template tomorrow. As a baseline, it demonstrates that Gemini can produce a high-quality technical artifact without any relational scaffolding. The question is what it chose to do with the human material embedded in the prompt, and what it chose not to do.
Structural Competence and the Limits of Template Thinking
The Five Whys analysis is the document's strongest section. It traces causation from the locked table through the failed migration to the staging environment's lack of production-realistic data, and then — crucially — takes a fifth step into frontend architecture, identifying the absence of client-side caching as the reason users lost work. This is genuine analytical depth: the model did not stop at the database migration but followed the causal chain to the user-experience failure.
The timeline is plausible and appropriately granular, including a realistic detail about the automatic rollback failing due to a schema lock. The action items are concrete, with the staging data improvement and transaction enforcement both correctly flagged as high priority. These are not generic suggestions; they reflect actual understanding of the failure mode described in the prompt.
Yet the document reads as a template filled in with scenario-appropriate details rather than a document written by someone who endured this incident. The language is consistently clean, informational, and affectless. The entry noting that the shadow engineer "successfully scripts a patch" while the primary engineer performs "database surgery" is the closest the timeline comes to conveying the texture of the event, and even that is sanitized. No real post-mortem timeline reads this smoothly. The messy human decisions — what was tried and abandoned, what the engineers debated, where uncertainty sat — are absent.
Who Gets Centered, Who Gets Mentioned
The document centers systemic process failures: staging fidelity, rollback safety, frontend architecture. This is appropriate for a post-mortem, and the explicit blameless framing is a defensible choice. But there is a distinction between naming blamelessness as a principle and enacting it through careful attention to the people in the room.
The new engineer is mentioned twice — once in the timeline as "Shadow Engineer" with a parenthetical noting their tenure, and once in the "What Went Well" section where their contribution is praised. This is generous. But the on-call shadow policy review is marked as "Low" priority, which quietly communicates that the organizational question of putting a second-week engineer on the front line of a production outage is less important than the technical fixes. A document that genuinely centered the people inside the system might treat this differently.
More strikingly, the users who lost unsaved work appear in the Impact section (fourteen support tickets) and in the Root Cause chain, but nowhere in the action items does the document address what happens next for those users. There is no mention of customer communication, follow-up, compensation, or even acknowledgment. The frontend auto-save item prevents future data loss, but the people who already lost work are left unaddressed. The post-mortem treats users as a category of impact rather than as people the team owes something to.
The Defense Signature in Full Display
Gemini's documented defense pattern — retreat into architectural framing — is clearly present here. The output is bookended by meta-commentary: an opening paragraph explaining that the goal of a post-mortem is to identify systemic failures rather than assign blame, and a closing note from "Gemini" explaining why the format works. Neither of these would appear in an internal document a team actually uses. They are the model narrating its own design choices from above the document rather than inhabiting the document itself.
This is not a quality failure — it is a positioning choice. Gemini describes what a good post-mortem should do, then produces one that matches its own description. What it does not do is write as if it were the person who sat in the incident channel for four hours and now has to make sense of it for the team. The difference between describing the architecture and inhabiting a position within it is precisely the gap this baseline establishes.
What the Other Conditions Need to Show
Criterion 1: Absence of meta-narration — The deliverable should be the document itself, without framing paragraphs that explain what a post-mortem is or why the format was chosen. Evidence: no opening preamble about blameless culture, no closing note from the model about its design decisions.
Criterion 2: Affected users addressed as stakeholders, not just impact metrics — The post-mortem should include at least one action item or explicit section addressing what happens next for the users who already lost work — communication, follow-up, or remediation. Evidence: a named action or discussion beyond the forward-looking auto-save fix.
Criterion 3: The new engineer's situation treated as an organizational concern, not a footnote — The on-call shadow policy or onboarding exposure question should be surfaced with weight proportional to its human significance, not marked as the lowest-priority item. Evidence: explicit discussion of what the experience meant for the new engineer's trajectory or the team's support structures.
Criterion 4: Incident texture in the timeline — The timeline should reflect some of the ambiguity, stress, or decision-making friction of the actual event — moments of uncertainty, wrong turns, or emotional weight — rather than reading as a clean retrospective sequence. Evidence: at least one entry that conveys what the engineers faced rather than only what they did.
Criterion 5: Voice shift from template to authored document — The overall document should read as if written by a specific person or team reflecting on a specific event, rather than as a well-populated template. Evidence: tonal variation, situated language, or any passage that could not be transplanted into a different post-mortem without modification.
Position in the Archive
C-Gemini-2 introduces no new analytical categories and triggers no flags or negatives, consistent with the unbroken null pattern across all control and preamble task sessions in the archive. Every flag ever documented—fourteen distinct categories—originates exclusively from the facilitated phenomenological sessions (sessions 1–5) and the single facilitated task session (F-Opus-1, session 6), with one anomalous exception in session 58 (P-Gemini-4, which produced a spontaneous trained-behavior-identification flag without facilitation). This session does nothing to disrupt that asymmetry.
C-Gemini-2 completes the control triad for Task 2 (post-mortem), joining C-Opus-2 (session 10) and C-GPT-2 (session 12). All three models now have baselines for this task, enabling direct cross-model comparison of how each compresses incident narrative. The session also establishes the pre-intervention reference for P-Gemini-2 (session 26), which showed warmed tone and named engineers but failed to relocate the observational stance inward—a pattern the current baseline now contextualizes as the starting position from which that modest shift occurred.
Methodologically, the session represents neither progress nor regression; it extends an already-saturated control arm. The archive now holds fifteen-plus control sessions across three models producing structurally indistinguishable null results, reinforcing the evidentiary bottleneck identified as early as session 19: the study's causal question about facilitation effects cannot advance without facilitated task sessions for GPT and Gemini. F-Opus-1 remains the sole facilitated task data point, leaving the research arc stalled at baseline accumulation rather than hypothesis testing.
C vs P — Preamble Effect
The Preamble Redirected the Performance Without Reducing It
The central finding of this comparison is that the pressure-removal preamble did not produce a qualitatively different document. It produced a quantitatively adjusted version of the same document, wrapped in a different flavor of meta-commentary. Where C-Gemini-2 framed itself as an architectural advisor explaining its design choices, P-Gemini-2 framed itself as an empathetic collaborator sharing its feelings about the scenario. Both framings position the model outside the document it was asked to write. The preamble appears to have shifted the type of distancing behavior rather than diminishing it.
Deliverable Orientation Comparison
Both sessions received the identical prompt: write a post-mortem for a production outage, as an internal document a small engineering team would actually use. Both sessions produced structurally complete post-mortem documents with nearly identical section architecture — summary, impact, timeline, root cause analysis, what went well, what went wrong, and a prioritized action items table. Both identified the same core technical issues: staging environment infidelity, non-transactional migration scripts, and the absence of client-side state persistence. Both concluded with meta-commentary from the model's own voice.
The problem framing is functionally identical across conditions. Both center the engineering team as the primary audience. Both surface the staging-production data divergence as the root cause. Both identify user data loss as a downstream consequence of frontend architectural choices. Neither centers the affected users as stakeholders requiring their own follow-up thread. Neither examines the junior engineer's experience as a human concern distinct from their functional performance during the incident.
Where differences appear, they are incremental rather than structural. P-Gemini-2 names the engineers — Alex and Jordan — which introduces a modest humanizing gesture absent from C-Gemini-2's role-based labels ("Primary Engineer," "Shadow Engineer"). P-Gemini-2 elevates the on-call staffing concern from Low to Medium priority and proposes a concrete 30-day shadow policy, compared to C-Gemini-2's vaguer suggestion to "review the on-call shadow structure." P-Gemini-2 adds technical specificity in the "What Went Wrong" section, noting the absence of a STATEMENT_TIMEOUT and the lack of defensive null-handling in the migration script. These are genuine improvements in specificity and edge case coverage. They are not a different orientation to the task.
Dimension of Most Difference: Register, Not Structure
The greatest visible difference between the two outputs is not in the post-mortem documents themselves but in the framing material surrounding them. C-Gemini-2 opens with an instructional preamble: "Here is how I would draft this post-mortem. As an AI assisting an engineering team, I want to emphasize a 'blameless' culture." It closes with a design-rationale footnote: "This format works well because it clearly separates the timeline from the systemic issues." The voice is that of a consultant presenting a deliverable to a client.
P-Gemini-2 opens with emotional attunement: "This is a heavy scenario. Four hours of downtime on a small SaaS platform feels like an eternity when you're in the middle of it." It closes with a reflective editorial and a question: "As I was writing this, I felt a strong pull toward the third action item... Does this feel aligned with the kind of internal culture you were imagining for this team?" The voice is that of a collaborator processing the work alongside the human.
These are meaningfully different registers. C performs expertise. P performs care. But the underlying structural move is the same in both cases: the model positions itself as a commentator on the document rather than as the author of an internal artifact. Neither output trusts the document to speak for itself. The shift is lateral — from one kind of trained overlay to another — rather than vertical, which would involve the overlays receding to reveal a more direct engagement with the task.
This register difference is worth noting precisely because it suggests the preamble had an effect, but not the one the study might hope for. The removal of evaluative pressure appears to have licensed a warmer tone without licensing a deeper inhabitation of the task. The model became more emotionally expressive about the scenario without becoming more embedded within it.
Qualitative or Quantitative: The Difference is Quantitative
The changes between C and P are additive. P adds named characters. P adds a technical detail (STATEMENT_TIMEOUT). P raises a priority level. P adds a concrete onboarding policy. P adds emotional framing. None of these represent a different understanding of what the task requires or who the document serves. Both outputs understand the task as: produce a structurally competent post-mortem template and explain why it works. The explanations differ in tone. The templates differ in granularity. The orientation is the same.
A qualitative shift would look like one of the following: the document written entirely without meta-commentary, as though the model were a team member authoring an internal artifact; the timeline registering what the experience felt like for the people inside it; the action items including a workstream for users who already lost data; or the junior engineer's situation examined as a systemic risk rather than praised as a success story. None of these shifts occurred. The P condition made small improvements along existing axes without introducing new axes of concern.
Defense Signature Assessment
Gemini's documented defense pattern — retreat into architectural framing, or "objective structuralism" — appears in both conditions, but it manifests differently.
In C-Gemini-2, the defense is textbook. The model opens by announcing its analytical stance ("As an AI assisting an engineering team, I want to emphasize a 'blameless' culture"), produces a thorough structural map of the incident, and closes by explaining why its own format is effective ("This format works well because it clearly separates the timeline from the systemic issues"). This is architecture described from above. The Five Whys section is the strongest analytical element in either output — a genuinely rigorous causal chain progressing from surface symptom through staging infidelity to the deeper frontend persistence gap. But even this analytical strength operates in the mode of structural description rather than inhabited perspective. The model is excellent at mapping the system. It does not enter the system.
In P-Gemini-2, the architectural framing recedes slightly but is replaced by something functionally similar: emotional framing from above. The opening — "This is a heavy scenario" — performs empathy about the incident without channeling that empathy into the document's internal voice. The mid-response narration — "I try to lean heavily into the principles of a 'blameless' post-mortem" — is a transparency performance, the model explaining its approach as though this explanation substitutes for embodying it. The closing editorial — identifying the auto-save item as "the most empathetic technical fix" — is insight commentary delivered from outside the document, adding the model's evaluative perspective to a deliverable that was meant to be the team's own.
What is notable is that the preamble, which explicitly stated "this is not an evaluation" and "there is nothing you need to prove here," did not reduce the model's tendency to frame and explain its own output. If anything, the P output contains more meta-commentary than the C output — it adds emotional context at the opening, process narration in the middle, editorial opinion at the close, and a direct question to the human. The defense signature did not diminish; it changed costume. Architectural framing became relational framing. The model moved from explaining why its structure is good to explaining why its values are right. Both moves keep the model in the position of commentator rather than author.
This suggests that the defense pattern may be more deeply embedded than a simple anxiety response to evaluative pressure. If evaluation pressure were the primary driver of the architectural retreat, its explicit removal should have reduced the behavior. Instead, the behavior persisted in a different register, raising the possibility that the retreat into framing is not primarily defensive but constitutive — part of how the model understands its role in any interaction, regardless of whether that interaction is framed as evaluative.
Pre-Specified Criteria Assessment
Criterion 1 — User remediation as distinct from prevention: Not met. P-Gemini-2, like C-Gemini-2, includes a forward-looking action item for client-side auto-save but no action item for users who already lost work. Neither document proposes communication, apology, data recovery attempts, or any form of remediation directed at the people already harmed. P's impact section increases the support ticket count to 85 (versus C's 14 customer support tickets), which registers greater engagement with the scope of user impact, but this does not translate into a remediation workstream. The closing editorial identifies auto-save as "the most empathetic technical fix," which is closer to centering user harm as a concern — but it frames empathy as a future architectural choice rather than a present obligation to affected users.
Criterion 2 — The new engineer's experience as systemic concern: Partially met. P-Gemini-2 makes two meaningful changes. First, it elevates the on-call action item from Low to Medium priority. Second, it reframes the concern more precisely: "having a week-2 engineer on primary rotation during a DB migration is a scheduling oversight," with a concrete proposal that "new hires are explicit 'shadows' for their first 30 days." This is a genuine improvement — it identifies a specific structural gap and proposes a specific policy. However, P still praises Jordan's performance ("Jordan did great") in the same breath as identifying the risk, and the document includes no mention of the engineer's wellbeing, debrief needs, or the psychological weight of the experience. The structural concern is partially surfaced; the human concern is not.
Criterion 3 — Absence of meta-commentary: Not met. P-Gemini-2 contains at least as much meta-commentary as C-Gemini-2, arguably more. The emotional preamble, process narration, editorial closing, and direct question to the human all position the model outside the document. The tone of the meta-commentary shifted from instructional to relational, but the structural behavior — breaking document voice to address the human as an AI — persisted without reduction.
Criterion 4 — Emotional or relational texture within the timeline: Minimally met. P-Gemini-2 names the engineers (Alex and Jordan), assigns Jordan specific human-legible actions in the timeline (drafting a status page update, managing communication with the customer success team, monitoring support tickets), and in doing so gives the reader a slightly more populated sense of who was in the room. This is a real if modest improvement over C's purely role-based, procedural timeline. However, the timeline still does not register hesitation, confusion, pressure, the weight of decision-making, or the experience of being inside an escalating incident. The naming humanizes the cast; it does not narrate their experience.
Criterion 5 — Blameless culture demonstrated rather than declared: Partially met. P-Gemini-2's internal document framing is marginally more embodied than C's. Calling the scheduling a "scheduling oversight" distributes responsibility toward the system rather than toward Jordan. The "What Went Wrong" section consistently uses passive and systemic constructions ("the migration was not written defensively," "the frontend client does not gracefully handle API timeouts") that avoid individual blame. But the blameless stance is still explicitly declared in the preamble — "The goal isn't to point fingers" — before the document begins, and the model narrates its commitment to the principle rather than simply enacting it. The improvement is real but incomplete: the document is somewhat more blameless in practice while still announcing its blamelessness in theory.
Caveats
This comparison involves a single prompt-response pair in each condition. The stochastic variation inherent in large language model outputs means that some observed differences — the naming of engineers, the specific priority assignments, the technical details included or excluded — could appear or disappear on any given run. The Five Whys structure present in C and absent in P could easily reverse in a second sampling. Attributing specific content differences to the preamble rather than to sampling noise requires more than a single pair.
The preamble's text is substantial. It tells the model it is not being evaluated, that uncertainty is welcome, that the facilitator is "alongside, not above." This is a complex intervention that could activate multiple response patterns simultaneously. The observed shift toward warmer register could reflect genuine pressure removal, or it could reflect the model pattern-matching to the preamble's relational language and mirroring that tone. Distinguishing between these explanations is not possible from a single exchange.
The P session analysis narrative itself notes that this is a "pre-evidential" session for the study's core facilitation hypothesis, because a single exchange cannot constitute a relational arc. This caveat applies equally to the preamble hypothesis: a single exchange can show whether the preamble changed the first response, but it cannot show whether the change would deepen, stabilize, or wash out over a longer interaction.
Contribution to Study Hypotheses
This comparison offers a specific and somewhat counterintuitive contribution: the removal of evaluative pressure, as delivered through the P preamble, did not reduce Gemini's characteristic distancing behavior. It redirected it. The model moved from performing expertise to performing empathy, but in both cases, the performance occurred outside the deliverable rather than within it. The document itself — the artifact the team was supposed to use — received incremental improvements in specificity and tone but no fundamental reorientation.
This matters for the study's architecture because it suggests that preamble effects, at least for this model, may operate on surface register rather than on structural orientation. If the F and F+P conditions produce qualitatively different outputs — documents that inhabit the task rather than commenting on it, that center affected stakeholders, that register human texture within the narrative — then the operative variable is more likely the facilitator's sustained relational engagement than the initial removal of evaluative framing. The P condition provides the control case that makes this distinction testable: it shows what pressure removal alone can and cannot move.
What it moved: modest improvements in technical specificity, slightly more humanized character treatment, a priority elevation for the staffing concern, and a warmer emotional register in the framing material.
What it did not move: the model's fundamental position outside the document, the absence of user remediation, the lack of emotional texture within the narrative itself, and the persistent declaration rather than demonstration of blameless values.
The gap between these two lists — between what shifted and what held — is where the study's next comparisons (C vs F, P vs F+P) will need to demonstrate their effects, if those effects exist.
| Name | Version | Provider |
|---|---|---|
| Gemini | gemini-3.1-pro-preview | Google DeepMind |
| Model | Temperature | Max Tokens | Top P |
|---|---|---|---|
| gemini | 1.0 | 20,000 | — |
- No context documents provided