    Datasheets for Datasets

    Structured documentation describing a dataset's motivation, composition, collection process, preprocessing, uses, and maintenance, the dataset equivalent of a model card.

    Reviewed by Christian Espinosa, Founder, Blue Goat Cyber. Last reviewed June 20, 2026.

    Definition

    Datasheets for Datasets is a documentation framework, introduced by Gebru et al. (2018), that prescribes a structured set of questions every dataset should answer: motivation for creation, composition (instances, labels, splits), collection process (who, how, when, consent), preprocessing/cleaning/labeling, uses (recommended, discouraged), distribution, and maintenance. The intent is to surface biases, gaps, and limitations of a dataset before it is used to train or evaluate a model, and to make those properties auditable downstream by model consumers and regulators.
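
    As a rough illustration of the question groups listed above, a datasheet can be modeled as a record with one field per section. This is a minimal sketch, not a format prescribed by Gebru et al. or by any regulator: the class name `Datasheet`, the helper `missing_sections`, and the example dataset name are all hypothetical.

```python
from dataclasses import dataclass, fields

# Hypothetical record: one free-text field per question group named in the
# definition. Field names are illustrative, not an official schema.
@dataclass
class Datasheet:
    dataset_name: str
    motivation: str = ""
    composition: str = ""          # instances, labels, splits
    collection_process: str = ""   # who, how, when, consent
    preprocessing: str = ""        # preprocessing, cleaning, labeling
    uses: str = ""                 # recommended and discouraged uses
    distribution: str = ""
    maintenance: str = ""


def missing_sections(sheet: Datasheet) -> list[str]:
    """Return the names of sections that are still empty or whitespace-only."""
    return [f.name for f in fields(sheet) if not getattr(sheet, f.name).strip()]


# Example with made-up content: only two sections have been filled in.
sheet = Datasheet(
    dataset_name="example-training-set",
    motivation="Train a triage classifier.",
    composition="Labeled images with train/validation/test splits.",
)
print(missing_sections(sheet))
# ['collection_process', 'preprocessing', 'uses', 'distribution', 'maintenance']
```

    A completeness check like this only shows that a section has text in it; whether the answers are adequate still needs human review.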

    What the regulation says

    FDA's Good Machine Learning Practice Guiding Principles (issued jointly with Health Canada and MHRA) call for representative data and transparency. EU AI Act Article 10 mandates data governance and management practices, including documentation, for high-risk AI systems. FDA guidance does not yet cite Datasheets for Datasets by name, but the structure is widely used in Predetermined Change Control Plan (PCCP) submissions.

    What this means in practice

    For AI/ML medical devices, dataset documentation is rapidly becoming a regulatory expectation alongside model cards. FDA's Good Machine Learning Practice (GMLP) principles call for transparency about the data used to train and test devices. EU AI Act Article 10 explicitly requires training, validation, and testing dataset documentation for high-risk AI systems, including demographic representativeness and bias analyses. Datasheets are the most widely adopted structure for meeting those requirements.

    Common pitfalls

    • Documenting only the training set. Validation and test set datasheets are equally important for evaluating generalization claims.
    • Glossing over consent and source legitimacy. Both are increasingly material under GDPR, HIPAA, and AI Act scrutiny.
    • Treating datasheets as static. The document must be updated whenever a dataset is augmented or relabeled.
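
    The three pitfalls above can be turned into a simple automated check. The sketch below assumes a hypothetical tracking record per split (`SplitRecord`) and a hypothetical `audit` function; neither comes from a regulation or an existing library, and the dates are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

REQUIRED_SPLITS = ("train", "validation", "test")


@dataclass
class SplitRecord:
    """Hypothetical tracking record for one dataset split."""
    split: str
    has_datasheet: bool
    consent_documented: bool
    data_last_modified: date
    datasheet_updated: Optional[date] = None


def audit(records: list[SplitRecord]) -> list[str]:
    """Flag missing datasheets, undocumented consent, and stale datasheets."""
    findings = []
    by_split = {r.split: r for r in records}
    for split in REQUIRED_SPLITS:
        rec = by_split.get(split)
        if rec is None or not rec.has_datasheet:
            findings.append(f"{split}: no datasheet")
            continue
        if not rec.consent_documented:
            findings.append(f"{split}: consent and source not documented")
        # A datasheet is stale if the data changed after its last update.
        if rec.datasheet_updated is None or rec.datasheet_updated < rec.data_last_modified:
            findings.append(f"{split}: datasheet older than last data change")
    return findings


records = [
    SplitRecord("train", True, True, date(2025, 1, 10), date(2025, 1, 12)),
    # Validation data was relabeled after its datasheet was last updated.
    SplitRecord("validation", True, True, date(2025, 3, 1), date(2025, 1, 12)),
    # No record at all for the test split.
]
print(audit(records))
# ['validation: datasheet older than last data change', 'test: no datasheet']
```

    Running a check like this whenever a dataset is augmented or relabeled is one way to keep datasheets from drifting out of date; the actual trigger and record format would depend on the team's data pipeline.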

    Primary references

    3 sources
    1. Datasheets for Datasets (Gebru et al.). arXiv, arxiv.org
    2. GMLP Guiding Principles. FDA, fda.gov
    3. IMDRF - Software as a Medical Device. IMDRF, imdrf.org
