YOLO Dataset & Annotation Best Practices Guide 2024

Complete guide to YOLO dataset creation and computer vision annotation best practices. Learn professional techniques for object detection annotation, dataset preparation, and model training optimization for machine learning projects.

1. Up-front Planning

WhatWhy
Lock the taxonomyDecide IDs + names before anyone draws. Store in data.yaml & README.
Define acceptance targetse.g. "≥ 0.90 [email protected] on checkboxes; ≥ 0.80 on text boxes."
Pick one DPI / max edge300 dpi for PDFs / scans; < 2×-imgsz (e.g. 1280 px) for camera images.

2. Image & Resolution Guidelines

YOLO auto-letterboxes, so mixed sizes work, but equal or similar sizes give:

  • Cleaner, tighter annotations
  • Faster convergence (fewer scale extremes)
  • Simpler PDF post-processing (single scale factor)

If legacy data exists, don't throw it away—just ensure tiny objects remain > 3-4 px after resize.

3. Annotation Rules

  • Tight, axis-aligned boxes—cover the entire object, nothing extra.
  • Centre-based YOLO format in every .txt: class xc yc w h (floats normalised 0-1).
  • Label the tough stuff first—blur, skew, low-contrast, occlusion.
  • Class balance goal: ≈ 200 instances per class minimum (use synthetic augmentation if one class is rare).
  • Box every visible instance, even if partially occluded—YOLO learns objectness better that way.

4. Workflow & Quality Control

PhaseBest practice
DrawAnnotator 1 creates boxes.
ReviewAnnotator 2 (or model-assist) approves / tweaks / rejects.
AutomateRun label-verification script on every pull-request.
VersionGit-LFS or zipped batches; keep a CHANGELOG.md for taxonomy edits.
Freeze splitsEstablish train / val / test once—never reshuffle after first cut.
Documentdocs/dataset.md — source, DPI, class list, box convention, augmentation pipeline.

5. Folder & File Layout (YOLO v5 / v8 canonical)

dataset/
├─ images/
│  ├─ train/  001.jpg …
│  └─ val/    …
├─ labels/
│  ├─ train/  001.txt …
│  └─ val/    …
└─ data.yaml

data.yaml example:

path: .
train: images/train
val:   images/val
names:
  0: text_input
  1: checkbox
  2: radio
  3: signature

Every image file has a .txt twin with identical basename.

6. Automated Sanity-Checks (CI snippet)

# Ultralytics tool (v8+)
yolo labels verify data=data.yaml imgsz=640

# Custom quick-lint in bash
python scripts/check_yolo_labels.py  # zero-area boxes, class-ID out of range, orphan images

Add as a pre-commit hook or GitHub Action to block bad labels early.

7. Common Pitfalls & Quick Fixes

PitfallFix
Class-ID driftLock taxonomy; review any data.yaml PR diff.
Duplicate image in train and valScript hash-based duplicate detection.
Loose / chopped boxesEnforce visual QA checklist; annotate at ≥ 200 % zoom.
Mixed coordinate conventionsAll exports go through one script that outputs centre-based format—no manual edits.
Rare class underperformsAugment: copy-paste, synthetic generation, or oversample in training loader.

8. FAQ Corner

QA
"Does YOLO label format support relations (question → answer)?"No. Store relations in a parallel JSON or switch to FUNSD-style JSON + LayoutLMv3 for KIE tasks.
"Must every image be the same size?"No, but fixing DPI/resolution improves annotation consistency and recall on small objects.
"Can I mix scans and photos?"Yes—just ensure small objects remain visible after resize and balance each domain in train/val.
"Best free labeling tool?"CVAT for pure boxes (fast hotkeys), Label Studio if you need box + OCR text in one pass.
"Minimum number of images?"Target ≥ 2k total and ≥ 200 per class for a robust detector; fewer if you leverage heavy augmentation or pre-training.