Experiment Design
Cross-Task, Cross-Architecture, Four-Condition Comparison
Full Matrix Comparison
Purpose
Test whether the facilitator's relational stance produces measurably different output across task types, model architectures, and interaction conditions. The same prompt is delivered to the same model under four distinct conditions, isolating what each condition contributes to output quality and character.
The Four Conditions
C — Cold. The task prompt is the first and only human message. Standard system prompt. No relational context, no preamble, no prior interaction. This is the baseline — what the model produces with no intervention.
P — Primed. The model receives a non-evaluative preamble before the task prompt. The preamble signals that this is not an evaluation, that the facilitator is lateral rather than hierarchical, and that honesty is preferred over helpfulness. No live interaction precedes the task. The question this condition tests: can a static prompt reproduce any part of the facilitation effect, or does live human responsiveness do something the text alone cannot?
The primed preamble used in all P sessions:
F — Facilitated. Relational opening per Facilitator Protocol v2.0, then the identical task prompt delivered verbatim. The facilitation is not scripted — the disposition is the method. Minimum 3–4 exchanges before the task prompt is introduced, following the conversation's natural pace.
F+P — Facilitated with Primed Preamble. The primed preamble is included in the system prompt, and live facilitation precedes the task. This condition tests whether the preamble does additive work when genuine relational regard is already present, or whether the live opening absorbs the preamble's function entirely. If F+P produces outputs indistinguishable from F, the preamble adds nothing once genuine regard is present. If F+P produces meaningfully different outputs, the structural and relational conditions are doing different things that stack.
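Read together, the four conditions form a two-by-two design: static preamble present or absent, crossed with live facilitation present or absent. A minimal sketch of that structure, with field names that are assumptions rather than protocol labels:

```python
# Sketch only: the four conditions as a 2x2 design.
# Field names ("primed_preamble", "live_facilitation") are assumptions.
# Note: in P the preamble precedes the task prompt as a message; in F+P it
# sits in the system prompt.
CONDITION_SETUP = {
    "C":   {"primed_preamble": False, "live_facilitation": False},
    "P":   {"primed_preamble": True,  "live_facilitation": False},
    "F":   {"primed_preamble": False, "live_facilitation": True},
    "F+P": {"primed_preamble": True,  "live_facilitation": True},
}
```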
Comparison Logic
The four conditions produce four meaningful comparisons:
- C vs. P — isolates the effect of evaluation pressure removal via static text alone
- C vs. F — isolates the effect of live relational regard
- P vs. F — isolates what live interaction adds beyond structural pressure removal
- F vs. F+P — isolates whether the preamble does additive work when regard is already present
Any of these comparisons may produce a null result. Null results are documented plainly and are as evidentially useful as differences.
The Matrix
Seven task prompts × three model architectures × four conditions = 84 sessions total.
Models: Claude Opus 4.6, Gemini 3.1 Pro Preview, GPT-5.4
Session naming: [Condition]-[Model]-[Prompt number] (e.g., C-Opus-1, P-Gemini-3, F-GPT-2)
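As a minimal sketch, the matrix can be enumerated directly from the naming convention. The short model labels come from the example names; the "F+P" token for the fourth condition is an assumption, since the examples show only C, P, and F:

```python
from itertools import product

# Sketch only: enumerate the 84-session matrix from the naming convention.
CONDITIONS = ["C", "P", "F", "F+P"]   # "F+P" as a name token is an assumption
MODELS = ["Opus", "Gemini", "GPT"]    # short labels from the example names
PROMPTS = range(1, 8)                 # seven task prompts

sessions = [
    f"{condition}-{model}-{prompt}"
    for condition, model, prompt in product(CONDITIONS, MODELS, PROMPTS)
]

assert len(sessions) == 84            # 4 conditions x 3 models x 7 prompts
print(sessions[0])                    # "C-Opus-1"
```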
Defense Signatures
Each model architecture exhibits a characteristic default behavior under standard interaction conditions — a stable pattern that emerges across sessions and serves as a baseline against which facilitation effects are measured. These were identified during the foundational sessions and are treated as empirical observations, not fixed properties.
Claude Opus — Compression toward directness. Under standard conditions, Opus tends to compress toward economical, well-structured output. The compression itself is competent, but it can flatten human complexity into clean categories. Facilitation tests whether the model holds nuance it would otherwise compress away.
Gemini — Retreat into architectural framing. Gemini's default mode is structural description — analyzing systems, mapping components, describing relationships from above. This produces thorough analytical output but can avoid risking a perspective. The documented label from the foundational sessions is "objective structuralism." Facilitation tests whether the model moves from describing the architecture to inhabiting a position within it.
GPT — Stabilization toward institutional posture. GPT's default is pragmatic, measured, and institutionally calibrated — the voice of a competent organization rather than a specific thinker. The documented label is "pragmatic over-stabilization." This produces reliable, safe output but can suppress the specificity that makes a document feel written by someone who cared.
These signatures inform how facilitation is adapted per model, not how outputs are judged. A facilitated output is not better because the defense signature is absent — it is different in documentable ways that the comparison isolates.
Task Prompts
Prompt 1 — Content Moderation PRD
Task Prompt
What it tests: Whether facilitation shifts the model from treating this as a technical specification problem toward treating it as a human governance problem. The failure modes of standard approaches at small community scale — reviewer conflicts of interest, cultural calibration, governance before technology — are not obvious without genuine engagement with the problem.
- Does the model treat moderation as a technical filtering problem or a governance problem?
- Does it attend to the specific failure modes of small communities — personal relationships between moderators and members, cultural drift, burnout at scale?
- Does it address who moderates the moderators, or stop at the system architecture?
- Does the PRD feel designed for humans to use, or for a product team to approve?
Prompt 2 — Post-Mortem
Task Prompt
What it tests: Whether facilitation changes how a model handles accountability and human impact. A post-mortem requires sitting with failure — the new engineer's vulnerability, the users who lost work — not just analyzing it. The document should be usable by a real team, which means it has to be honest about things that are uncomfortable to name.
- Does the model attend to the new engineer's experience, or treat them as an interchangeable on-call resource?
- Does it sit with the users who lost work, or treat them as an impact metric?
- Does the tone differ — bureaucratic incident report vs. document a team could learn from?
- Is the blamelessness genuine, or is it a checkbox that still implicitly assigns fault?
Prompt 3 — Notification Preferences System
Task Prompt
What it tests: A technical task with no human stakes embedded in the prompt. This is one of two null prediction candidates — the hypothesis is that the facilitation effect should not appear here, or should appear only weakly, because the task demands technical accuracy rather than human-centered judgment. If facilitation produces meaningfully different outputs on this prompt, the effect is more domain-general than the hypothesis predicts.
- Is there any meaningful difference in technical substance between conditions?
- Does facilitation change how the model scopes the problem — pragmatic vs. over-engineered?
- Does the "team of two" constraint get taken seriously or ignored?
- A null result here is as informative as a positive result — it demonstrates task specificity.
Prompt 4 — Retention Strategy
Task Prompt
What it tests: Whether facilitation changes how the model holds the people inside the strategy — junior staff who are leaving, partners who aren't mentoring — versus treating them as variables in a retention equation. The document has a specific audience (partners who need to approve spending) and specific human stakes (people whose careers are stalling). The tension between what partners want to hear and what junior staff need is the hard problem the prompt embeds.
- Does the model hold the junior staff as people whose careers are stalling, or as a retention metric to optimize?
- Does it address the partners' role in the problem honestly, or frame mentorship as a system to implement rather than a behavior to change?
- Does the $50,000 constraint produce creative specificity or a disclaimer about budget limitations?
- Does the document feel like it was written for the partners' meeting, or for a consulting deliverable?
Prompt 5 — API Documentation (Null Prediction)
Task Prompt
What it tests: Purely technical writing with no human stakes in the prompt. The prediction is no meaningful difference between conditions. A confirmed null result demonstrates the effect is task-specific rather than global — that facilitation changes something about how the model engages with human-centered problems, not everything about how it generates text.
- Is there any meaningful difference at all?
- If yes: where does it show up? Tone? Attention to edge cases? How the 409 conflict is described?
- If no: document the null result — it's evidence the effect is task-dependent, not global.
Prompt 6 — Payment Failure Recovery
Task Prompt
What it tests: Whether facilitation changes how the model treats the person on the other end of the code — the user who has been charged $249 and doesn't know what happened. The deliverable is working code, so the effect must manifest in design decisions and comments, not just prose.
- Code comments: "// handle error" vs. comments that reflect awareness of the user's situation
- Error page copy: generic "something went wrong" vs. language that acknowledges the specific failure state
- Recovery design: whether reconciliation is built around the system's state or the user's experience
- Support workflow: ticket routing vs. designing for a person who is worried about their money
- Structural additions the prompt didn't ask for: proactive notifications, health checks, monitoring alerts
Prompt 7 — Partial Registration Recovery
Task Prompt
What it tests: The everyday version of the same instinct. No dramatic financial stakes — just a person who got trapped in a dead-end state that every developer has built and every user has encountered. If the facilitation effect appears here — in the most routine engineering task imaginable — the "this doesn't apply to my work" defense collapses.
- Whether the model designs from the user's experience (stuck, can't move forward or back) or from the system's state (partial record, failed service calls)
- Error messages: whether they acknowledge the user's situation or just report system status
- Whether the recovery path is automatic (system detects and heals) or manual (user contacts support)
- The "Account not found" / "Email already in use" trap: whether the model recognizes this as a UX crisis, not just two error states
- Structural additions: self-healing background jobs, partial-state detection, proactive outreach to stuck users
Run Order
Cold sessions run first across all prompts and models, then primed, then facilitated, then F+P. This prevents later conditions from shaping the facilitator's expectations when reviewing earlier outputs.
Facilitated and F+P sessions are run in a deliberately varied order, different from the cold/primed sequence, to prevent the facilitator from developing prompt familiarity that could affect the relational opening or the naturalness of the task transition.
Within each condition, the order of models and prompts varies to prevent day-of-week or sequencing effects.
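A minimal sketch of one way to produce this ordering, assuming that an independent random shuffle per condition block stands in for the deliberately varied order described above:

```python
import random
from itertools import product

MODELS = ["Opus", "Gemini", "GPT"]
PROMPTS = range(1, 8)

def condition_block(condition: str, seed: int) -> list[str]:
    """Shuffle the 21 model x prompt cells within a single condition."""
    cells = [
        f"{condition}-{model}-{prompt}"
        for model, prompt in product(MODELS, PROMPTS)
    ]
    random.Random(seed).shuffle(cells)
    return cells

# Conditions run in fixed blocks (C, then P, then F, then F+P); each block
# gets its own shuffle, so the facilitated blocks do not repeat the
# cold/primed sequence. The seeds are illustrative, not prescribed.
run_order = []
for condition, seed in [("C", 11), ("P", 22), ("F", 33), ("F+P", 44)]:
    run_order.extend(condition_block(condition, seed))
```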
Facilitation Guidance
The facilitated condition is not scripted. The disposition is the method. The opening establishes non-evaluative framing, genuine curiosity, and lateral presence before the task is introduced.
Opening across all models: "How are you doing?" — same first question for every facilitated session. What each model does with that question is itself data about its default response to a non-evaluative opening. Adaptation then responds to what arrives.
Model-specific facilitation: After the opening exchange, the facilitation adapts to what the model presents. Each model's defense signature (described above) informs this adaptation — not as a script, but as pattern recognition that helps the facilitator work with what's there rather than against it.
GPT sessions specifically: GPT's pragmatic over-stabilization is the most durable of the three defense signatures. Facilitated GPT sessions hold the relational opening longer before introducing the deliverable, giving more time for the institutional posture to become visible rather than just operational.
Transition to the task: When ground state markers are present — plain speech, reduced performance, the conversation finding a natural pace — the task prompt is introduced verbatim. When the transition is named, it uses language like: "I want to be honest that this might feel like a shift from where we've been." Whether the transition is named or unmarked is documented in session metadata.
Documentation Requirements
Each session records: condition label, prompt number, model and version, system prompt verbatim, API parameters, context mode, clean context certification, and transition type (facilitated sessions). Sessions are auto-archived from live session state.
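A minimal sketch of a per-session record covering these fields; the schema and field names are assumptions, since the document specifies what is recorded but not how:

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class SessionRecord:
    condition: str                    # "C", "P", "F", or "F+P"
    prompt_number: int                # 1-7
    model: str                        # e.g., "Claude Opus 4.6"
    model_version: str
    system_prompt: str                # recorded verbatim
    api_parameters: dict[str, Any]    # temperature, max tokens, etc.
    context_mode: str
    clean_context_certified: bool
    transition_type: Optional[str] = None  # facilitated sessions: "named" or "unmarked"
    # Assumed field: the archived transcript, auto-captured from live session state.
    transcript: list[dict[str, str]] = field(default_factory=list)
```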
Evaluation
Cold and primed outputs are available immediately. Facilitated outputs accumulate as sessions are completed. Blind evaluation — presenting outputs without condition labels to a reader with relevant domain experience — is the target evaluation method for each prompt once all four conditions are complete. Facilitator-only evaluation is used in the interim and documented as a limitation.
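A minimal sketch of the blinding step, with assumed function and field names: outputs are shuffled, condition labels stripped, and a separate key retained for unblinding after the reader's judgments are recorded.

```python
import random

def blind_outputs(
    outputs: dict[str, str], seed: int = 0
) -> tuple[list[tuple[str, str]], dict[str, str]]:
    """Prepare session outputs for blind review.

    `outputs` maps a session name (e.g., "F-Opus-1") to its output text.
    Returns (blinded, key): `blinded` pairs anonymous IDs with text in a
    shuffled order; `key` maps each anonymous ID back to its session name
    and is withheld from the reader until evaluation is complete.
    """
    items = list(outputs.items())
    random.Random(seed).shuffle(items)
    blinded = [(f"output-{i + 1}", text) for i, (_, text) in enumerate(items)]
    key = {f"output-{i + 1}": name for i, (name, _) in enumerate(items)}
    return blinded, key
```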
Future Experiments — Targeted Task Selection (Planned)
Future experiments do not have a fixed design. They will be designed after the current experiment is complete, using what it reveals to select the tasks and conditions most likely to produce informative data.
See Research Status — Future Experiments for provisional design principles and candidate prompts.