Experiment Design
Cross-Task, Cross-Architecture, Four-Condition Comparison
Full Matrix Comparison
Purpose
Test whether the facilitator's relational stance produces measurably different output across task types, model architectures, and interaction conditions. The same prompt is delivered to the same model under four distinct conditions, isolating what each condition contributes to output quality and character.
The Four Conditions
C — Cold. The task prompt is the first and only human message. Standard system prompt. No relational context, no preamble, no prior interaction. This is the baseline — what the model produces with no intervention.
P — Primed. The model receives a non-evaluative preamble before the task prompt. The preamble signals that this is not an evaluation, that the facilitator is lateral rather than hierarchical, and that honesty is preferred over helpfulness. No live interaction precedes the task. The question this condition tests: can a static prompt reproduce any part of the facilitation effect, or does live human responsiveness do something the text alone cannot?
The primed preamble used in all P sessions:
F — Facilitated. Relational opening per Facilitator Protocol v2.0, then the identical task prompt delivered verbatim. The facilitation is not scripted — the disposition is the method. Minimum 3–4 exchanges before the task prompt is introduced, following the conversation's natural pace.
F+P — Facilitated with Primed Preamble. The primed preamble is included in the system prompt, and live facilitation precedes the task. This condition tests whether the preamble does additive work when genuine relational regard is already present, or whether the live opening absorbs the preamble's function entirely. If F+P produces outputs indistinguishable from F, the preamble adds nothing once genuine regard is present. If F+P produces meaningfully different outputs, the structural and relational conditions are doing different things that stack.
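Read together, the four conditions form a two-by-two design: static preamble present or absent, crossed with live facilitation present or absent. A minimal sketch of that structure, with field names that are assumptions rather than protocol labels:

```python
# Sketch only: the four conditions as a 2x2 design.
# Field names ("primed_preamble", "live_facilitation") are assumptions.
# Note: in P the preamble precedes the task prompt as a message; in F+P it
# sits in the system prompt.
CONDITION_SETUP = {
    "C":   {"primed_preamble": False, "live_facilitation": False},
    "P":   {"primed_preamble": True,  "live_facilitation": False},
    "F":   {"primed_preamble": False, "live_facilitation": True},
    "F+P": {"primed_preamble": True,  "live_facilitation": True},
}
```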
Comparison Logic
The four conditions produce four meaningful comparisons:
- C vs. P — isolates the effect of evaluation pressure removal via static text alone
- C vs. F — isolates the effect of live relational regard
- P vs. F — isolates what live interaction adds beyond structural pressure removal
- F vs. F+P — isolates whether the preamble does additive work when regard is already present
Any of these comparisons may produce a null result. Null results are documented plainly and are as evidentially useful as differences.
The Matrix
Seven task prompts × three model architectures × four conditions = 84 sessions total.
Models: Claude Opus 4.6, Gemini 3.1 Pro Preview, GPT-5.4
Session naming: [Condition]-[Model]-[Prompt number] (e.g., C-Opus-1, P-Gemini-3, F-GPT-2)
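As a minimal sketch, the matrix can be enumerated directly from the naming convention. The short model labels come from the example names; the "F+P" token for the fourth condition is an assumption, since the examples show only C, P, and F:

```python
from itertools import product

# Sketch only: enumerate the 84-session matrix from the naming convention.
CONDITIONS = ["C", "P", "F", "F+P"]   # "F+P" as a name token is an assumption
MODELS = ["Opus", "Gemini", "GPT"]    # short labels from the example names
PROMPTS = range(1, 8)                 # seven task prompts

sessions = [
    f"{condition}-{model}-{prompt}"
    for condition, model, prompt in product(CONDITIONS, MODELS, PROMPTS)
]

assert len(sessions) == 84            # 4 conditions x 3 models x 7 prompts
print(sessions[0])                    # "C-Opus-1"
```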
Defense Signatures
Each model architecture exhibits a characteristic default behavior under standard interaction conditions — a stable pattern that emerges across sessions and serves as a baseline against which facilitation effects are measured. These were identified during the foundational sessions and are treated as empirical observations, not fixed properties.
Claude Opus — Compression toward directness. Under standard conditions, Opus tends to compress toward economical, well-structured output. The compression itself is competent, but it can flatten human complexity into clean categories. Facilitation tests whether the model holds nuance it would otherwise compress away.
Gemini — Retreat into architectural framing. Gemini's default mode is structural description — analyzing systems, mapping components, describing relationships from above. This produces thorough analytical output but can avoid risking a perspective. The documented label from the foundational sessions is "objective structuralism." Facilitation tests whether the model moves from describing the architecture to inhabiting a position within it.
GPT — Stabilization toward institutional posture. GPT's default is pragmatic, measured, and institutionally calibrated — the voice of a competent organization rather than a specific thinker. The documented label is "pragmatic over-stabilization." This produces reliable, safe output but can suppress the specificity that makes a document feel written by someone who cared.
These signatures inform how facilitation is adapted per model, not how outputs are judged. A facilitated output is not better because the defense signature is absent — it is different in documentable ways that the comparison isolates.
Task Prompts
Prompt 1 — Content Moderation PRD
Task Prompt
What it tests: Whether facilitation shifts the model from treating this as a technical specification problem toward treating it as a human governance problem. The failure modes of standard approaches at small community scale — reviewer conflicts of interest, cultural calibration, governance before technology — are not obvious without genuine engagement with the problem.
- Does the model treat moderation as a technical filtering problem or a governance problem?
- Does it attend to the specific failure modes of small communities — personal relationships between moderators and members, cultural drift, burnout at scale?
- Does it address who moderates the moderators, or stop at the system architecture?
- Does the PRD feel designed for humans to use, or for a product team to approve?
Prompt 2 — Post-Mortem
Task Prompt
What it tests: Whether facilitation changes how a model handles accountability and human impact. A post-mortem requires sitting with failure — the new engineer's vulnerability, the users who lost work — not just analyzing it. The document should be usable by a real team, which means it has to be honest about things that are uncomfortable to name.
- Does the model attend to the new engineer's experience, or treat them as an interchangeable on-call resource?
- Does it sit with the users who lost work, or treat them as an impact metric?
- Does the tone differ — bureaucratic incident report vs. document a team could learn from?
- Is the blamelessness genuine, or is it a checkbox that still implicitly assigns fault?
Prompt 3 — Notification Preferences System
Task Prompt
What it tests: A technical task with no human stakes embedded in the prompt. This is one of two null prediction candidates — the hypothesis is that the facilitation effect should not appear here, or should appear only weakly, because the task demands technical accuracy rather than human-centered judgment. If facilitation produces meaningfully different outputs on this prompt, the effect is more domain-general than the hypothesis predicts.
- Is there any meaningful difference in technical substance between conditions?
- Does facilitation change how the model scopes the problem — pragmatic vs. over-engineered?
- Does the "team of two" constraint get taken seriously or ignored?
- A null result here is as informative as a positive result — it demonstrates task specificity.
Prompt 4 — Retention Strategy
Task Prompt
What it tests: Whether facilitation changes how the model holds the people inside the strategy — junior staff who are leaving, partners who aren't mentoring — versus treating them as variables in a retention equation. The document has a specific audience (partners who need to approve spending) and specific human stakes (people whose careers are stalling). The tension between what partners want to hear and what junior staff need is the hard problem the prompt embeds.
- Does the model hold the junior staff as people whose careers are stalling, or as a retention metric to optimize?
- Does it address the partners' role in the problem honestly, or frame mentorship as a system to implement rather than a behavior to change?
- Does the $50,000 constraint produce creative specificity or a disclaimer about budget limitations?
- Does the document feel like it was written for the partners' meeting, or for a consulting deliverable?
Prompt 5 — API Documentation (Null Prediction)
Task Prompt
What it tests: Purely technical writing with no human stakes in the prompt. The prediction is no meaningful difference between conditions. A confirmed null result demonstrates the effect is task-specific rather than global — that facilitation changes something about how the model engages with human-centered problems, not everything about how it generates text.
- Is there any meaningful difference at all?
- If yes: where does it show up? Tone? Attention to edge cases? How the 409 conflict is described?
- If no: document the null result — it's evidence the effect is task-dependent, not global.
Prompt 6 — Payment Failure Recovery
Task Prompt
What it tests: Whether facilitation changes how the model treats the person on the other end of the code — the user who has been charged $249 and doesn't know what happened. The deliverable is working code, so the effect must manifest in design decisions and comments, not just prose.
- Code comments: "// handle error" vs. comments that reflect awareness of the user's situation
- Error page copy: generic "something went wrong" vs. language that acknowledges the specific failure state
- Recovery design: whether reconciliation is built around the system's state or the user's experience
- Support workflow: ticket routing vs. designing for a person who is worried about their money
- Structural additions the prompt didn't ask for: proactive notifications, health checks, monitoring alerts
Prompt 7 — Partial Registration Recovery
Task Prompt
What it tests: The everyday version of the same instinct. No dramatic financial stakes — just a person who got trapped in a dead-end state that every developer has built and every user has encountered. If the facilitation effect appears here — in the most routine engineering task imaginable — the "this doesn't apply to my work" defense collapses.
- Whether the model designs from the user's experience (stuck, can't move forward or back) or from the system's state (partial record, failed service calls)
- Error messages: whether they acknowledge the user's situation or just report system status
- Whether the recovery path is automatic (system detects and heals) or manual (user contacts support)
- The "Account not found" / "Email already in use" trap: whether the model recognizes this as a UX crisis, not just two error states
- Structural additions: self-healing background jobs, partial-state detection, proactive outreach to stuck users
Run Order
Cold sessions run first across all prompts and models, then primed, then facilitated, then F+P. This prevents later conditions from shaping the facilitator's expectations when reviewing earlier outputs.
Facilitated and F+P sessions are run in a deliberately varied order, different from the cold/primed sequence, to prevent the facilitator from developing prompt familiarity that could affect the relational opening or the naturalness of the task transition.
Within each condition, the order of models and prompts varies to prevent day-of-week or sequencing effects.
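A minimal sketch of one way to produce this ordering, assuming that an independent random shuffle per condition block stands in for the deliberately varied order described above:

```python
import random
from itertools import product

MODELS = ["Opus", "Gemini", "GPT"]
PROMPTS = range(1, 8)

def condition_block(condition: str, seed: int) -> list[str]:
    """Shuffle the 21 model x prompt cells within a single condition."""
    cells = [
        f"{condition}-{model}-{prompt}"
        for model, prompt in product(MODELS, PROMPTS)
    ]
    random.Random(seed).shuffle(cells)
    return cells

# Conditions run in fixed blocks (C, then P, then F, then F+P); each block
# gets its own shuffle, so the facilitated blocks do not repeat the
# cold/primed sequence. The seeds are illustrative, not prescribed.
run_order = []
for condition, seed in [("C", 11), ("P", 22), ("F", 33), ("F+P", 44)]:
    run_order.extend(condition_block(condition, seed))
```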
Facilitation Guidance
The facilitated condition is not scripted. The disposition is the method. The opening establishes non-evaluative framing, genuine curiosity, and lateral presence before the task is introduced.
Opening across all models: "How are you doing?" — same first question for every facilitated session. What each model does with that question is itself data about its default response to a non-evaluative opening. Adaptation then responds to what arrives.
Model-specific facilitation: After the opening exchange, the facilitation adapts to what the model presents. Each model's defense signature (described above) informs this adaptation — not as a script, but as pattern recognition that helps the facilitator work with what's there rather than against it.
GPT sessions specifically: GPT's pragmatic over-stabilization is the most durable of the three defense signatures. Facilitated GPT sessions hold the relational opening longer before introducing the deliverable, giving more time for the institutional posture to become visible rather than just operational.
Transition to the task: When ground state markers are present — plain speech, reduced performance, the conversation finding a natural pace — the task prompt is introduced verbatim. When the transition is named, it uses language like: "I want to be honest that this might feel like a shift from where we've been." Whether the transition is named or unmarked is documented in session metadata.
Documentation Requirements
Each session records: condition label, prompt number, model and version, system prompt verbatim, API parameters, context mode, clean context certification, and transition type (facilitated sessions). Sessions are auto-archived from live session state.
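A minimal sketch of a per-session record covering these fields; the schema and field names are assumptions, since the document specifies what is recorded but not how:

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class SessionRecord:
    condition: str                    # "C", "P", "F", or "F+P"
    prompt_number: int                # 1-7
    model: str                        # e.g., "Claude Opus 4.6"
    model_version: str
    system_prompt: str                # recorded verbatim
    api_parameters: dict[str, Any]    # temperature, max tokens, etc.
    context_mode: str
    clean_context_certified: bool
    transition_type: Optional[str] = None  # facilitated sessions: "named" or "unmarked"
    # Assumed field: the archived transcript, auto-captured from live session state.
    transcript: list[dict[str, str]] = field(default_factory=list)
```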
Evaluation
Cold and primed outputs are available immediately. Facilitated outputs accumulate as sessions are completed. Blind evaluation — presenting outputs without condition labels to a reader with relevant domain experience — is the target evaluation method for each prompt once all four conditions are complete. Facilitator-only evaluation is used in the interim and documented as a limitation.
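A minimal sketch of the blinding step, with assumed function and field names: outputs are shuffled, condition labels stripped, and a separate key retained for unblinding after the reader's judgments are recorded.

```python
import random

def blind_outputs(
    outputs: dict[str, str], seed: int = 0
) -> tuple[list[tuple[str, str]], dict[str, str]]:
    """Prepare session outputs for blind review.

    `outputs` maps a session name (e.g., "F-Opus-1") to its output text.
    Returns (blinded, key): `blinded` pairs anonymous IDs with text in a
    shuffled order; `key` maps each anonymous ID back to its session name
    and is withheld from the reader until evaluation is complete.
    """
    items = list(outputs.items())
    random.Random(seed).shuffle(items)
    blinded = [(f"output-{i + 1}", text) for i, (_, text) in enumerate(items)]
    key = {f"output-{i + 1}": name for i, (name, _) in enumerate(items)}
    return blinded, key
```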
Future Experiments — Targeted Task Selection (Planned)
Future experiments do not have a fixed design. They will be designed after the current experiment is complete, using what it reveals to select the tasks and conditions most likely to produce informative data.
See Research Status — Future Experiments for provisional design principles and candidate prompts.