YOLO Dataset & Annotation Best Practices Guide 2024
Complete guide to YOLO dataset creation and computer vision annotation best practices. Learn professional techniques for object detection annotation, dataset preparation, and model training optimization for machine learning projects.
1. Up-front Planning
What | Why |
---|---|
Lock the taxonomy | Decide IDs + names before anyone draws. Store in data.yaml & README. |
Define acceptance targets | e.g. "≥ 0.90 mAP@0.5 on checkboxes; ≥ 0.80 on text boxes." |
Pick one DPI / max edge | 300 dpi for PDFs / scans; longest edge ≤ 2× imgsz for camera images (e.g. 1280 px when imgsz=640). |
2. Image & Resolution Guidelines
YOLO auto-letterboxes, so mixed sizes work, but equal or similar sizes give:
- Cleaner, tighter annotations
- Faster convergence (fewer scale extremes)
- Simpler PDF post-processing (single scale factor)
If legacy data exists, don't throw it away; just ensure tiny objects remain larger than 3-4 px on their shorter side after resize.
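As a rough check of that rule, the sketch below estimates how large a box ends up after resizing. It assumes the longer image edge is scaled to `imgsz` with the aspect ratio preserved (the usual letterbox behaviour); the function names and the 3 px default threshold are illustrative and not part of any library.

```python
from typing import Tuple


def letterbox_scale(img_w: int, img_h: int, imgsz: int = 640) -> float:
    """Scale factor that maps the longer image edge onto imgsz."""
    return imgsz / max(img_w, img_h)


def resized_box_size(box_w_px: float, box_h_px: float,
                     img_w: int, img_h: int, imgsz: int = 640) -> Tuple[float, float]:
    """Box width and height in pixels after the image is resized."""
    scale = letterbox_scale(img_w, img_h, imgsz)
    return box_w_px * scale, box_h_px * scale


def survives_resize(box_w_px: float, box_h_px: float,
                    img_w: int, img_h: int,
                    imgsz: int = 640, min_px: float = 3.0) -> bool:
    """True if the shorter side of the resized box is still larger than min_px."""
    w, h = resized_box_size(box_w_px, box_h_px, img_w, img_h, imgsz)
    return min(w, h) > min_px
```

For example, a 12 × 12 px checkbox on a 2550 × 3300 px scan shrinks to roughly 2.3 px at `imgsz=640`, so `survives_resize(12, 12, 2550, 3300, imgsz=640)` is `False`, while the same call with `imgsz=1280` is `True`.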
3. Annotation Rules
- Tight, axis-aligned boxes—cover the entire object, nothing extra.
- Centre-based YOLO format in every .txt, one line per box: class xc yc w h (class is an integer ID; xc, yc, w, h are floats normalised to 0-1 by image width and height).
- Label the tough stuff first—blur, skew, low-contrast, occlusion.
- Class balance goal: ≈ 200 instances per class minimum (use synthetic augmentation if one class is rare).
- Box every visible instance, even if partially occluded—YOLO learns objectness better that way.
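Many annotation tools export pixel corner coordinates. The following is a minimal sketch of converting them to the centre-based label line, assuming the box arrives as `(x_min, y_min, x_max, y_max)` in pixels; `to_yolo_line` is a hypothetical helper name.

```python
def to_yolo_line(class_id: int,
                 x_min: float, y_min: float, x_max: float, y_max: float,
                 img_w: int, img_h: int) -> str:
    """Convert a pixel corner box to a 'class xc yc w h' line normalised to 0-1."""
    # Reject boxes that are empty, inverted, or outside the image.
    if not (0 <= x_min < x_max <= img_w and 0 <= y_min < y_max <= img_h):
        raise ValueError("box is empty, inverted, or outside the image")
    xc = (x_min + x_max) / 2 / img_w   # normalised centre x
    yc = (y_min + y_max) / 2 / img_h   # normalised centre y
    w = (x_max - x_min) / img_w        # normalised width
    h = (y_max - y_min) / img_h        # normalised height
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"


# A 200 x 200 px box on a 1000 x 800 px image:
print(to_yolo_line(1, 100, 200, 300, 400, 1000, 800))
# 1 0.200000 0.375000 0.200000 0.250000
```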
4. Workflow & Quality Control
Phase | Best practice |
---|---|
Draw | Annotator 1 creates boxes. |
Review | Annotator 2 (or model-assist) approves / tweaks / rejects. |
Automate | Run label-verification script on every pull-request. |
Version | Git-LFS or zipped batches; keep a CHANGELOG.md for taxonomy edits. |
Freeze splits | Establish train / val / test once—never reshuffle after first cut. |
Document | docs/dataset.md — source, DPI, class list, box convention, augmentation pipeline. |
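One way to keep splits frozen as new images arrive is to derive the split from a hash of the file's basename, so a given image always lands in the same place. This is an illustrative approach, not something the workflow above prescribes; `assign_split` and the 80/10/10 proportions are assumptions.

```python
import hashlib


def assign_split(basename: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map a basename to 'train', 'val', or 'test' deterministically."""
    digest = hashlib.sha1(basename.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100          # stable bucket in 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


# The same basename yields the same split on every run and every machine.
for name in ["001", "002", "003"]:
    print(name, assign_split(name))
```

The percentages need to be fixed at the first cut: changing them later moves existing files between splits, which is exactly the reshuffle the table warns against.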
5. Folder & File Layout (YOLO v5 / v8 canonical)
dataset/
├─ images/
│ ├─ train/ 001.jpg …
│ └─ val/ …
├─ labels/
│ ├─ train/ 001.txt …
│ └─ val/ …
└─ data.yaml
data.yaml example:
path: .
train: images/train
val: images/val
names:
  0: text_input
  1: checkbox
  2: radio
  3: signature
Every image file has a .txt twin with identical basename.
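The pairing rule can be checked mechanically. The sketch below compares basenames in one image folder against one label folder; `find_unpaired` and the list of image suffixes are assumptions chosen for illustration.

```python
from pathlib import Path
from typing import List, Tuple

# Suffixes treated as images here; adjust to the formats actually in use.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"}


def find_unpaired(images_dir: str, labels_dir: str) -> Tuple[List[str], List[str]]:
    """Return (images without a label, labels without an image) as sorted basenames."""
    image_stems = {
        p.stem for p in Path(images_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    }
    label_stems = {p.stem for p in Path(labels_dir).glob("*.txt")}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)
```

Calling `find_unpaired("dataset/images/train", "dataset/labels/train")` on the layout above should return two empty lists when every file has its twin.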
6. Automated Sanity-Checks (CI snippet)
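# Ultralytics (v8+) validates labels while scanning the dataset, so a one-epoch smoke run reports corrupt images and malformed label files
yolo detect train data=data.yaml model=yolov8n.pt imgsz=640 epochs=1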
# Custom quick-lint script (Python)
python scripts/check_yolo_labels.py # zero-area boxes, class-ID out of range, orphan images
Add as a pre-commit hook or GitHub Action to block bad labels early.
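The contents of `scripts/check_yolo_labels.py` are not shown here. A minimal sketch of the per-file checks it names (zero-area boxes, class IDs out of range) might look like the following; `lint_label_text` and its message wording are hypothetical.

```python
from typing import List


def lint_label_text(text: str, num_classes: int) -> List[str]:
    """Return a list of problems found in the contents of one YOLO .txt label file."""
    problems = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        parts = line.split()
        if len(parts) != 5:
            problems.append(f"line {line_no}: expected 5 fields, got {len(parts)}")
            continue
        try:
            class_id = int(parts[0])
            xc, yc, w, h = (float(v) for v in parts[1:])
        except ValueError:
            problems.append(f"line {line_no}: non-numeric field")
            continue
        if not 0 <= class_id < num_classes:
            problems.append(f"line {line_no}: class ID {class_id} out of range")
        if w <= 0 or h <= 0:
            problems.append(f"line {line_no}: zero-area or negative-size box")
        if not all(0.0 <= v <= 1.0 for v in (xc, yc, w, h)):
            problems.append(f"line {line_no}: coordinate outside 0-1")
    return problems
```

In CI, a wrapper would read each file under `labels/`, call this function with the number of entries in `names`, and exit non-zero if any list is non-empty.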
7. Common Pitfalls & Quick Fixes
Pitfall | Fix |
---|---|
Class-ID drift | Lock taxonomy; review any data.yaml PR diff. |
Duplicate image in train and val | Run a hash-based duplicate-detection script across the splits. |
Loose / chopped boxes | Enforce visual QA checklist; annotate at ≥ 200 % zoom. |
Mixed coordinate conventions | All exports go through one script that outputs centre-based format—no manual edits. |
Rare class underperforms | Augment: copy-paste, synthetic generation, or oversample in training loader. |
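For the train/val duplicate pitfall, a hash-based check can be sketched as below. It only catches byte-identical files; re-encoded or resized copies would need a perceptual hash, which is outside this sketch. The function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple


def file_digest(path: Path) -> str:
    """SHA-256 of the file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def cross_split_duplicates(train_dir: str, val_dir: str) -> List[Tuple[Path, Path]]:
    """Return (train_file, val_file) pairs whose contents are byte-identical."""
    train_by_hash: Dict[str, Path] = {}
    for p in sorted(Path(train_dir).iterdir()):
        if p.is_file():
            # Keep the first train file seen for each digest.
            train_by_hash.setdefault(file_digest(p), p)

    duplicates = []
    for p in sorted(Path(val_dir).iterdir()):
        if p.is_file():
            digest = file_digest(p)
            if digest in train_by_hash:
                duplicates.append((train_by_hash[digest], p))
    return duplicates
```

An empty result from `cross_split_duplicates("dataset/images/train", "dataset/images/val")` means no exact copies leak between the two splits.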
8. FAQ Corner
Q | A |
---|---|
"Does YOLO label format support relations (question → answer)?" | No. Store relations in a parallel JSON or switch to FUNSD-style JSON + LayoutLMv3 for KIE tasks. |
"Must every image be the same size?" | No, but fixing DPI/resolution improves annotation consistency and recall on small objects. |
"Can I mix scans and photos?" | Yes—just ensure small objects remain visible after resize and balance each domain in train/val. |
"Best free labeling tool?" | CVAT for pure boxes (fast hotkeys), Label Studio if you need box + OCR text in one pass. |
"Minimum number of images?" | Target ≥ 2k total and ≥ 200 per class for a robust detector; fewer if you leverage heavy augmentation or pre-training. |