# BGE Fine-tuning Directory Structure

This document clarifies the directory structure and cache usage in the BGE fine-tuning project.

## 📂 Project Directory Structure

```
bge_finetune/
├── 📁 cache/                      # Cache directories
│   ├── 📁 data/                   # Preprocessed dataset cache
│   │   ├── 📄 *.jsonl             # Preprocessed JSONL files
│   │   └── 📄 *.json              # Preprocessed JSON files
│   └── 📁 models/                 # Downloaded model cache (HuggingFace/ModelScope)
│       ├── 📁 tokenizers/         # Cached tokenizer files
│       ├── 📁 config/             # Cached model config files
│       └── 📁 downloads/          # Temporary download files
├── 📁 models/                     # Local model storage
│   ├── 📁 bge-m3/                 # Local BGE-M3 model files
│   └── 📁 bge-reranker-base/      # Local BGE-reranker model files
├── 📁 data/                       # Raw training data
│   ├── 📄 train.jsonl             # Input training data
│   ├── 📄 eval.jsonl              # Input evaluation data
│   └── 📁 processed/              # Preprocessed (but not cached) data
├── 📁 output/                     # Training outputs
│   ├── 📁 bge-m3/                 # M3 training results
│   ├── 📁 bge-reranker/           # Reranker training results
│   └── 📁 checkpoints/            # Model checkpoints
└── 📁 logs/                       # Training logs
    ├── 📁 tensorboard/            # TensorBoard logs
    └── 📄 training.log            # Text logs
```

## 🎯 Directory Purposes

### 1. **`./cache/models/`** - Model Download Cache

- **Purpose**: HuggingFace/ModelScope model download cache
- **Contents**: Tokenizers, config files, temporary downloads
- **Usage**: Automatic caching of remote model downloads
- **Config**: `model_paths.cache_dir = "./cache/models"`

### 2. **`./cache/data/`** - Processed Dataset Cache

- **Purpose**: Preprocessed and cleaned dataset files
- **Contents**: Validated JSONL/JSON files
- **Usage**: Store cleaned and validated training data
- **Config**: `data.cache_dir = "./cache/data"`

### 3. **`./models/`** - Local Model Storage

- **Purpose**: Complete local model files (when available)
- **Contents**: Full model directories with all files
- **Usage**: Direct loading without downloads
- **Config**: `model_paths.bge_m3 = "./models/bge-m3"`

### 4. **`./data/`** - Raw Input Data

- **Purpose**: Original training/evaluation datasets
- **Contents**: `.jsonl` files in various formats
- **Usage**: Input to preprocessing and optimization
- **Config**: User-specified paths in training scripts

### 5. **`./output/`** - Training Results

- **Purpose**: Fine-tuned models and training artifacts
- **Contents**: Saved models, checkpoints, metrics
- **Usage**: Results of training scripts
- **Config**: `--output_dir` in training scripts

## ⚙️ Configuration Mapping

### Config File Settings

```toml
[model_paths]
cache_dir = "./cache/models"                 # Model download cache
bge_m3 = "./models/bge-m3"                   # Local BGE-M3 path
bge_reranker = "./models/bge-reranker-base"  # Local reranker path

[data]
cache_dir = "./cache/data"                   # Dataset cache
optimization_output_dir = "./cache/data"     # Optimization default
```

### Script Defaults

```bash
# Data optimization (processed datasets)
python -m data.optimization bge-m3 data/train.jsonl --output_dir ./cache/data

# Model training (output models)
python scripts/train_m3.py --output_dir ./output/bge-m3 --cache_dir ./cache/models

# Model evaluation (results)
python scripts/evaluate.py --output_dir ./evaluation_results
```

## 🔄 Data Flow

```
Raw Data  →  Preprocessing  →  Optimization  →  Training  →   Output
 data/    →       -         →  cache/data/   →   models   →   output/
   ↓                               ↓               ↓             ↓
 Input                       Pre-tokenized       Model       Fine-tuned
 JSONL    →                   .pkl files    →   loading   →    models
```

## 🛠️ Usage Examples

### Model Cache (Download Cache)

```python
from transformers import AutoTokenizer

# Automatic model caching
tokenizer = AutoTokenizer.from_pretrained(
    "BAAI/bge-m3",
    cache_dir="./cache/models"  # Downloads are cached here
)
```

### Data Cache (Optimization Cache)

```bash
# Create optimized datasets
python -m data.optimization bge-m3 data/train.jsonl \
    --output_dir ./cache/data  # .pkl files are created here

# Use cached datasets in training
python scripts/train_m3.py \
    --train_data ./cache/data/train_m3_cached.pkl
```

### Local Models (No Downloads)

```python
# Load a local model directly (no downloads)
model = BGEM3Model.from_pretrained("./models/bge-m3")
```

## 📋 Best Practices

1. **Separate concerns**: Keep the model cache, data cache, and training outputs in separate directories
2. **Clear naming**: Use descriptive suffixes (`.pkl` for cached files, `_cached` for optimized datasets)
3. **Consistent paths**: Manage all paths centrally through `config.toml`
4. **Cache management**:
   - Model cache: can be shared across projects
   - Data cache: project-specific optimized datasets
   - Output: training results and checkpoints

## 🧹 Cleanup Commands

```bash
# Clean the data cache (datasets must be re-optimized afterwards)
rm -rf ./cache/data/

# Clean the model cache (models must be re-downloaded afterwards)
rm -rf ./cache/models/

# Clean training outputs
rm -rf ./output/

# Clean all caches
rm -rf ./cache/
```

This structure ensures a clear separation of concerns and efficient caching for both model downloads and processed datasets.