Build a model card from your own investigation evidence — not from marketing copy. The tool makes the gap between claimed behavior and observed behavior visible, and treats defaults and failures as primary artifacts worth documenting.
A model card built from marketing copy documents what the model is supposed to do. A card built from investigation evidence documents what it actually does. The gap between those two things is where critical understanding lives.
Defaults are not accidents — they are design choices about what happens when nothing is specified. Failures are not embarrassments — they are evidence of limits. Both belong in documentation.
What this model should not be used for is as important as what it can do. The card makes responsible-use limits part of the primary artifact, not a disclaimer buried in terms of service.
Build the card from your own evidence. Every field should be fillable from something you observed.
Fill the behavior section first. What did the model actually do? What did it default to? What did it refuse? Start with what you can point to.
Name the specific failure modes you observed, for example temporal drift, prompt leakage, stereotyped defaults, or hallucinated confidence. These are the most useful parts of the card.
Read the official documentation. Where does your evidence confirm the claims? Where does it contradict them? Name the specific divergence.
Name specific use cases where your evidence shows the model should not be relied on. Be specific enough that a teacher could act on the limit.
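One way to picture the card is as a small record in which every entry carries a pointer to the evidence behind it. The following is a minimal sketch in Python, not the tool's actual data model: the class names (Observation, Divergence, ModelCard), the field names, and the unsupported() check are all hypothetical, chosen to mirror the steps above, and the sample entries are invented placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """One thing you saw the model do, plus where you saw it (hypothetical shape)."""
    summary: str
    evidence: str  # e.g. a transcript excerpt or a reference to your own notes


@dataclass
class Divergence:
    """A claim from the official documentation set against what you observed."""
    claim: str
    observed: Observation


@dataclass
class ModelCard:
    # Field names are assumptions that follow the steps in the text.
    behaviors: list[Observation] = field(default_factory=list)
    defaults: list[Observation] = field(default_factory=list)
    refusals: list[Observation] = field(default_factory=list)
    failure_modes: list[Observation] = field(default_factory=list)
    divergences: list[Divergence] = field(default_factory=list)
    not_for: list[Observation] = field(default_factory=list)

    def unsupported(self) -> list[str]:
        """Return the summaries (or claims) of entries with no evidence attached."""
        missing: list[str] = []
        groups = (self.behaviors, self.defaults, self.refusals,
                  self.failure_modes, self.not_for)
        for group in groups:
            missing += [o.summary for o in group if not o.evidence.strip()]
        missing += [d.claim for d in self.divergences
                    if not d.observed.evidence.strip()]
        return missing


# Placeholder entries for illustration only; they are not real findings.
card = ModelCard()
card.defaults.append(
    Observation("Picks a default persona when none is specified", "session 2, prompt 4"))
card.failure_modes.append(
    Observation("Gives a wrong date with full confidence", ""))
print(card.unsupported())
```

An entry that shows up in unsupported() is one you cannot yet point to, which is a signal to go back to your investigation notes before keeping it on the card.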