
BGE Fine-tuning Directory Structure

This document clarifies the directory structure and cache usage in the BGE fine-tuning project.

📂 Project Directory Structure

bge_finetune/
├── 📁 cache/                    # Cache directories
│   ├── 📁 data/                # Preprocessed dataset cache
│   │   ├── 📄 *.jsonl          # Preprocessed JSONL files
│   │   ├── 📄 *.json           # Preprocessed JSON files
│   │   └── 📄 *.pkl            # Pre-tokenized dataset files
│   └── 📁 models/              # Downloaded model cache (HuggingFace/ModelScope)
│       ├── 📁 tokenizers/      # Cached tokenizer files
│       ├── 📁 config/          # Cached model config files
│       └── 📁 downloads/       # Temporary download files
│
├── 📁 models/                   # Local model storage
│   ├── 📁 bge-m3/              # Local BGE-M3 model files
│   └── 📁 bge-reranker-base/   # Local BGE-reranker model files
│
├── 📁 data/                     # Raw training data
│   ├── 📄 train.jsonl          # Input training data
│   ├── 📄 eval.jsonl           # Input evaluation data
│   └── 📁 processed/           # Preprocessed (but not cached) data
│
├── 📁 output/                   # Training outputs
│   ├── 📁 bge-m3/              # M3 training results
│   ├── 📁 bge-reranker/        # Reranker training results
│   └── 📁 checkpoints/         # Model checkpoints
│
└── 📁 logs/                     # Training logs
    ├── 📁 tensorboard/         # TensorBoard logs
    └── 📄 training.log         # Text logs

🎯 Directory Purposes

1. ./cache/models/ - Model Download Cache

  • Purpose: HuggingFace/ModelScope model download cache
  • Contents: Tokenizers, config files, temporary downloads
  • Usage: Automatic caching of remote model downloads
  • Config: model_paths.cache_dir = "./cache/models"

2. ./cache/data/ - Processed Dataset Cache

  • Purpose: Preprocessed and cleaned dataset files
  • Contents: Validated JSONL/JSON files and pre-tokenized .pkl datasets
  • Usage: Store cleaned and validated training data
  • Config: data.cache_dir = "./cache/data"

3. ./models/ - Local Model Storage

  • Purpose: Complete local model files (when available)
  • Contents: Full model directories with all files
  • Usage: Direct loading without downloads
  • Config: model_paths.bge_m3 = "./models/bge-m3"

4. ./data/ - Raw Input Data

  • Purpose: Original training/evaluation datasets
  • Contents: .jsonl files in various formats
  • Usage: Input to preprocessing and optimization
  • Config: User-specified paths in training scripts
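
Raw .jsonl records can vary in shape, but BGE-style training data commonly uses query/pos/neg fields (the FlagEmbedding convention; the exact field names are an assumption here, so check them against this project's data loaders). A minimal validation sketch:

```python
import json

# One training record -- the query/pos/neg field names follow the common
# FlagEmbedding convention and are an assumption, not this project's spec.
line = json.dumps({
    "query": "what is dense retrieval?",
    "pos": ["Dense retrieval encodes queries and passages as vectors."],
    "neg": ["Unrelated passage about the weather."],
})

def validate_record(raw: str) -> dict:
    """Parse one JSONL line and sanity-check the expected fields."""
    rec = json.loads(raw)
    if not isinstance(rec.get("query"), str):
        raise ValueError("'query' must be a string")
    if not (isinstance(rec.get("pos"), list) and rec["pos"]):
        raise ValueError("'pos' must be a non-empty list")
    if not isinstance(rec.get("neg"), list):
        raise ValueError("'neg' must be a list")
    return rec

record = validate_record(line)
```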

5. ./output/ - Training Results

  • Purpose: Fine-tuned models and training artifacts
  • Contents: Saved models, checkpoints, metrics
  • Usage: Results of training scripts
  • Config: --output_dir in training scripts

⚙️ Configuration Mapping

Config File Settings

[model_paths]
cache_dir = "./cache/models"      # Model download cache
bge_m3 = "./models/bge-m3"        # Local BGE-M3 path
bge_reranker = "./models/bge-reranker-base"  # Local reranker path

[data] 
cache_dir = "./cache/data"        # Dataset cache
optimization_output_dir = "./cache/data"  # Optimization default

Script Defaults

# Data optimization (processed datasets)
python -m data.optimization bge-m3 data/train.jsonl --output_dir ./cache/data

# Model training (output models)  
python scripts/train_m3.py --output_dir ./output/bge-m3 --cache_dir ./cache/models

# Model evaluation (results)
python scripts/evaluate.py --output_dir ./evaluation_results

🔄 Data Flow

Raw Data → Preprocessing → Optimization → Training → Output

data/*.jsonl   →   cache/data/*.pkl   →   output/
(input JSONL)      (pre-tokenized,        (fine-tuned models
                    loaded at training)    and checkpoints)

🛠️ Usage Examples

Model Cache (Download Cache)

# Automatic model caching (requires the transformers package)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "BAAI/bge-m3",
    cache_dir="./cache/models"  # Downloads are cached here
)

Data Cache (Optimization Cache)

# Create optimized datasets
python -m data.optimization bge-m3 data/train.jsonl \
  --output_dir ./cache/data  # .pkl files created here

# Use cached datasets in training
python scripts/train_m3.py \
  --train_data ./cache/data/train_m3_cached.pkl

Local Models (No Downloads)

# Load directly from the local model directory (no network access)
model = BGEM3Model.from_pretrained("./models/bge-m3")
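
A common pattern behind this is local-first resolution: use ./models/bge-m3 when it exists on disk, otherwise fall back to the hub id so the download cache takes over. A small sketch (resolve_model_source is a hypothetical helper, not part of this project's scripts):

```python
from pathlib import Path

def resolve_model_source(local_dir: str, hub_id: str) -> str:
    """Return the local model directory if it exists, else the hub id.

    The result can be passed straight to from_pretrained(...,
    cache_dir="./cache/models"), so local copies are preferred and any
    remote downloads land in the shared model cache.
    """
    return local_dir if Path(local_dir).is_dir() else hub_id

source = resolve_model_source("./models/bge-m3", "BAAI/bge-m3")
```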

📋 Best Practices

  1. Separate Concerns: Keep model cache, data cache, and outputs separate
  2. Clear Naming: Use descriptive suffixes (.pkl for pre-tokenized cache files, _cached for optimized datasets)
  3. Consistent Paths: Use config.toml for centralized path management
  4. Cache Management:
    • Model cache: Can be shared across projects
    • Data cache: Project-specific optimized datasets
    • Output: Training results and checkpoints
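
Before cleaning anything, it helps to see how much disk each area actually uses. A small sketch (dir_size_bytes is a hypothetical helper, not part of this project):

```python
from pathlib import Path

def dir_size_bytes(path: str) -> int:
    """Total size in bytes of all files under path (0 if it doesn't exist)."""
    root = Path(path)
    if not root.is_dir():
        return 0
    return sum(f.stat().st_size for f in root.rglob("*") if f.is_file())

# Example: compare the three areas before deciding what to clean.
for area in ("./cache/models", "./cache/data", "./output"):
    print(f"{area}: {dir_size_bytes(area) / 1e6:.1f} MB")
```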

🧹 Cleanup Commands

# Clean data cache (re-optimization needed)
rm -rf ./cache/data/

# Clean model cache (re-download needed)  
rm -rf ./cache/models/

# Clean training outputs
rm -rf ./output/

# Clean all caches
rm -rf ./cache/

This structure ensures clear separation of concerns and efficient caching for both model downloads and processed datasets.