# BGE Fine-tuning Directory Structure

This document clarifies the directory structure and cache usage in the BGE fine-tuning project.

## 📂 Project Directory Structure
```text
bge_finetune/
├── 📁 cache/                      # Cache directories
│   ├── 📁 data/                   # Preprocessed dataset cache
│   │   ├── 📄 *.jsonl             # Preprocessed JSONL files
│   │   └── 📄 *.json              # Preprocessed JSON files
│   └── 📁 models/                 # Downloaded model cache (HuggingFace/ModelScope)
│       ├── 📁 tokenizers/         # Cached tokenizer files
│       ├── 📁 config/             # Cached model config files
│       └── 📁 downloads/          # Temporary download files
│
├── 📁 models/                     # Local model storage
│   ├── 📁 bge-m3/                 # Local BGE-M3 model files
│   └── 📁 bge-reranker-base/      # Local BGE-reranker model files
│
├── 📁 data/                       # Raw training data
│   ├── 📄 train.jsonl             # Input training data
│   ├── 📄 eval.jsonl              # Input evaluation data
│   └── 📁 processed/              # Preprocessed (but not cached) data
│
├── 📁 output/                     # Training outputs
│   ├── 📁 bge-m3/                 # M3 training results
│   ├── 📁 bge-reranker/           # Reranker training results
│   └── 📁 checkpoints/            # Model checkpoints
│
└── 📁 logs/                       # Training logs
    ├── 📁 tensorboard/            # TensorBoard logs
    └── 📄 training.log            # Text logs
```
## 🎯 Directory Purposes
### 1. `./cache/models/` - Model Download Cache

- **Purpose**: HuggingFace/ModelScope model download cache
- **Contents**: Tokenizers, config files, temporary downloads
- **Usage**: Automatic caching of remote model downloads
- **Config**: `model_paths.cache_dir = "./cache/models"`
### 2. `./cache/data/` - Processed Dataset Cache

- **Purpose**: Preprocessed and cleaned dataset files
- **Contents**: Validated JSONL/JSON files
- **Usage**: Store cleaned and validated training data
- **Config**: `data.cache_dir = "./cache/data"`
### 3. `./models/` - Local Model Storage

- **Purpose**: Complete local model files (when available)
- **Contents**: Full model directories with all files
- **Usage**: Direct loading without downloads
- **Config**: `model_paths.bge_m3 = "./models/bge-m3"`
### 4. `./data/` - Raw Input Data

- **Purpose**: Original training/evaluation datasets
- **Contents**: `.jsonl` files in various formats
- **Usage**: Input to preprocessing and optimization
- **Config**: User-specified paths in training scripts
### 5. `./output/` - Training Results

- **Purpose**: Fine-tuned models and training artifacts
- **Contents**: Saved models, checkpoints, metrics
- **Usage**: Results of training scripts
- **Config**: `--output_dir` in training scripts
## ⚙️ Configuration Mapping

### Config File Settings
```toml
[model_paths]
cache_dir = "./cache/models"                 # Model download cache
bge_m3 = "./models/bge-m3"                   # Local BGE-M3 path
bge_reranker = "./models/bge-reranker-base"  # Local reranker path

[data]
cache_dir = "./cache/data"                   # Dataset cache
optimization_output_dir = "./cache/data"     # Optimization default
```
### Script Defaults

```bash
# Data optimization (processed datasets)
python -m data.optimization bge-m3 data/train.jsonl --output_dir ./cache/data

# Model training (output models)
python scripts/train_m3.py --output_dir ./output/bge-m3 --cache_dir ./cache/models

# Model evaluation (results)
python scripts/evaluate.py --output_dir ./evaluation_results
```
## 🔄 Data Flow
```text
Raw Data  →  Preprocessing  →  Optimization  →  Training  →  Output
 data/    →       -         →  cache/data/   →   models   →  output/
   ↓                               ↓               ↓            ↓
 Input                       Pre-tokenized       Model      Fine-tuned
 JSONL                        .pkl files        Loading       Models
```
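This flow implies a simple resolution rule: training should prefer the optimized `.pkl` cache and fall back to the raw JSONL. The `resolve_dataset` helper below is a hypothetical sketch of that rule; the `_m3_cached.pkl` suffix mirrors the cached-file names used in this project's script examples, so adjust it if yours differ.

```python
from pathlib import Path

def resolve_dataset(name: str, data_dir: str = "data",
                    cache_dir: str = "cache/data") -> Path:
    """Return the optimized cache file if present, else the raw JSONL."""
    cached = Path(cache_dir) / f"{name}_m3_cached.pkl"  # assumed suffix
    if cached.exists():
        return cached
    return Path(data_dir) / f"{name}.jsonl"
```

With this rule, re-running optimization simply refreshes the file that training picks up next, and deleting `cache/data/` transparently falls back to the raw inputs.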
## 🛠️ Usage Examples

### Model Cache (Download Cache)

```python
from transformers import AutoTokenizer

# Automatic model caching
tokenizer = AutoTokenizer.from_pretrained(
    "BAAI/bge-m3",
    cache_dir="./cache/models",  # Downloads cached here
)
```
### Data Cache (Optimization Cache)

```bash
# Create optimized datasets
python -m data.optimization bge-m3 data/train.jsonl \
    --output_dir ./cache/data  # .pkl files created here

# Use cached datasets in training
python scripts/train_m3.py \
    --train_data ./cache/data/train_m3_cached.pkl
```
### Local Models (No Downloads)

```python
# Use local model (no downloads)
model = BGEM3Model.from_pretrained("./models/bge-m3")
```
## 📋 Best Practices

- **Separate Concerns**: Keep the model cache, data cache, and training outputs in separate directories
- **Clear Naming**: Use descriptive suffixes (`.pkl` for cache files, `_cached` for optimized datasets)
- **Consistent Paths**: Use `config.toml` for centralized path management
- **Cache Management**:
  - Model cache: can be shared across projects
  - Data cache: project-specific optimized datasets
  - Output: training results and checkpoints
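Before deciding which cache to clean, it helps to see how much disk each directory actually uses. The `dir_size_mb` helper below is a small standard-library sketch for that check (not part of the project's tooling).

```python
from pathlib import Path

def dir_size_mb(path: str) -> float:
    """Total size of all files under `path`, in MiB (0.0 if missing)."""
    p = Path(path)
    if not p.exists():
        return 0.0
    return sum(f.stat().st_size for f in p.rglob("*") if f.is_file()) / 2**20

# Example: report each cache before cleanup
for d in ("cache/models", "cache/data", "output"):
    print(f"{d}: {dir_size_mb(d):.1f} MiB")
```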
## 🧹 Cleanup Commands

```bash
# Clean data cache (re-optimization needed)
rm -rf ./cache/data/

# Clean model cache (re-download needed)
rm -rf ./cache/models/

# Clean training outputs
rm -rf ./output/

# Clean all caches
rm -rf ./cache/
```
This structure ensures clear separation of concerns and efficient caching for both model downloads and processed datasets.