YOLO Dataset & Annotation Best Practices Guide 2024

Complete guide to YOLO dataset creation and computer vision annotation best practices. Learn professional techniques for object detection annotation, dataset preparation, and model training optimization for machine learning projects.

1. Up-front Planning

What	Why
Lock the taxonomy	Decide IDs + names before anyone draws. Store in data.yaml & README.
Define acceptance targets	e.g. "≥ 0.90 [email protected] on checkboxes; ≥ 0.80 on text boxes."
Pick one DPI / max edge	300 dpi for PDFs / scans; < 2×-imgsz (e.g. 1280 px) for camera images.

2. Image & Resolution Guidelines

YOLO auto-letterboxes, so mixed sizes work, but equal or similar sizes give:

Cleaner, tighter annotations
Faster convergence (fewer scale extremes)
Simpler PDF post-processing (single scale factor)

If legacy data exists, don't throw it away—just ensure tiny objects remain > 3-4 px after resize.

3. Annotation Rules

Tight, axis-aligned boxes—cover the entire object, nothing extra.
Centre-based YOLO format in every .txt: class xc yc w h (floats normalised 0-1).
Label the tough stuff first—blur, skew, low-contrast, occlusion.
Class balance goal: ≈ 200 instances per class minimum (use synthetic augmentation if one class is rare).
Box every visible instance, even if partially occluded—YOLO learns objectness better that way.

4. Workflow & Quality Control

Phase	Best practice
Draw	Annotator 1 creates boxes.
Review	Annotator 2 (or model-assist) approves / tweaks / rejects.
Automate	Run label-verification script on every pull-request.
Version	Git-LFS or zipped batches; keep a CHANGELOG.md for taxonomy edits.
Freeze splits	Establish train / val / test once—never reshuffle after first cut.
Document	docs/dataset.md — source, DPI, class list, box convention, augmentation pipeline.

5. Folder & File Layout (YOLO v5 / v8 canonical)

dataset/
├─ images/
│  ├─ train/  001.jpg …
│  └─ val/    …
├─ labels/
│  ├─ train/  001.txt …
│  └─ val/    …
└─ data.yaml

data.yaml example:

path: .
train: images/train
val:   images/val
names:
  0: text_input
  1: checkbox
  2: radio
  3: signature

Every image file has a .txt twin with identical basename.

6. Automated Sanity-Checks (CI snippet)

# Ultralytics tool (v8+)
yolo labels verify data=data.yaml imgsz=640

# Custom quick-lint in bash
python scripts/check_yolo_labels.py  # zero-area boxes, class-ID out of range, orphan images

Add as a pre-commit hook or GitHub Action to block bad labels early.

7. Common Pitfalls & Quick Fixes

Pitfall	Fix
Class-ID drift	Lock taxonomy; review any data.yaml PR diff.
Duplicate image in train and val	Script hash-based duplicate detection.
Loose / chopped boxes	Enforce visual QA checklist; annotate at ≥ 200 % zoom.
Mixed coordinate conventions	All exports go through one script that outputs centre-based format—no manual edits.
Rare class underperforms	Augment: copy-paste, synthetic generation, or oversample in training loader.

8. FAQ Corner

Q	A
"Does YOLO label format support relations (question → answer)?"	No. Store relations in a parallel JSON or switch to FUNSD-style JSON + LayoutLMv3 for KIE tasks.
"Must every image be the same size?"	No, but fixing DPI/resolution improves annotation consistency and recall on small objects.
"Can I mix scans and photos?"	Yes—just ensure small objects remain visible after resize and balance each domain in train/val.
"Best free labeling tool?"	CVAT for pure boxes (fast hotkeys), Label Studio if you need box + OCR text in one pass.
"Minimum number of images?"	Target ≥ 2k total and ≥ 200 per class for a robust detector; fewer if you leverage heavy augmentation or pre-training.