# BGE Fine-tuning Directory Structure
This document clarifies the directory structure and cache usage in the BGE fine-tuning project.
## 📂 Project Directory Structure
```
bge_finetune/
├── 📁 cache/                       # Cache directories
│   ├── 📁 data/                    # Preprocessed dataset cache
│   │   ├── 📄 *.jsonl              # Preprocessed JSONL files
│   │   └── 📄 *.json               # Preprocessed JSON files
│   └── 📁 models/                  # Downloaded model cache (HuggingFace/ModelScope)
│       ├── 📁 tokenizers/          # Cached tokenizer files
│       ├── 📁 config/              # Cached model config files
│       └── 📁 downloads/           # Temporary download files
├── 📁 models/                      # Local model storage
│   ├── 📁 bge-m3/                  # Local BGE-M3 model files
│   └── 📁 bge-reranker-base/       # Local BGE-reranker model files
├── 📁 data/                        # Raw training data
│   ├── 📄 train.jsonl              # Input training data
│   ├── 📄 eval.jsonl               # Input evaluation data
│   └── 📁 processed/               # Preprocessed (but not cached) data
├── 📁 output/                      # Training outputs
│   ├── 📁 bge-m3/                  # M3 training results
│   ├── 📁 bge-reranker/            # Reranker training results
│   └── 📁 checkpoints/             # Model checkpoints
└── 📁 logs/                        # Training logs
    ├── 📁 tensorboard/             # TensorBoard logs
    └── 📄 training.log             # Text logs
```
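On a fresh checkout, most of these directories will not exist until the first run creates them. A minimal sketch for creating the skeleton up front (assuming it is run from the project root; the directory names are taken from the layout above, not from any setup script in the project):
```python
from pathlib import Path

# Directory skeleton as documented above (illustrative; the training and
# optimization scripts may create these paths on their own).
DIRS = [
    "cache/data", "cache/models/tokenizers", "cache/models/config",
    "cache/models/downloads", "models/bge-m3", "models/bge-reranker-base",
    "data/processed", "output/bge-m3", "output/bge-reranker",
    "output/checkpoints", "logs/tensorboard",
]

for d in DIRS:
    Path(d).mkdir(parents=True, exist_ok=True)  # no-op if the directory exists
```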
## 🎯 Directory Purposes
### 1. **`./cache/models/`** - Model Downloads Cache
- **Purpose**: HuggingFace/ModelScope model download cache
- **Contents**: Tokenizers, config files, temporary downloads
- **Usage**: Automatic caching of remote model downloads
- **Config**: `model_paths.cache_dir = "./cache/models"`
### 2. **`./cache/data/`** - Processed Dataset Cache
- **Purpose**: Preprocessed and cleaned dataset files
- **Contents**: Validated JSONL/JSON files
- **Usage**: Store cleaned and validated training data
- **Config**: `data.cache_dir = "./cache/data"`
### 3. **`./models/`** - Local Model Storage
- **Purpose**: Complete local model files (when available)
- **Contents**: Full model directories with all files
- **Usage**: Direct loading without downloads
- **Config**: `model_paths.bge_m3 = "./models/bge-m3"`
### 4. **`./data/`** - Raw Input Data
- **Purpose**: Original training/evaluation datasets
- **Contents**: `.jsonl` files in various formats
- **Usage**: Input to preprocessing and optimization
- **Config**: User-specified paths in training scripts
### 5. **`./output/`** - Training Results
- **Purpose**: Fine-tuned models and training artifacts
- **Contents**: Saved models, checkpoints, metrics
- **Usage**: Results of training scripts
- **Config**: `--output_dir` in training scripts
## ⚙️ Configuration Mapping
### Config File Settings
```toml
[model_paths]
cache_dir = "./cache/models"                   # Model download cache
bge_m3 = "./models/bge-m3"                     # Local BGE-M3 path
bge_reranker = "./models/bge-reranker-base"    # Local reranker path

[data]
cache_dir = "./cache/data"                     # Dataset cache
optimization_output_dir = "./cache/data"       # Optimization default
```
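How these settings are read depends on the project's own config loader; as a rough sketch, assuming a `config.toml` in the project root and Python 3.11+ (for the built-in `tomllib`):
```python
import tomllib
from pathlib import Path

# Illustrative loader only; the project's actual config handling may differ.
with open("config.toml", "rb") as f:
    config = tomllib.load(f)

model_cache_dir = Path(config["model_paths"]["cache_dir"])   # ./cache/models
data_cache_dir = Path(config["data"]["cache_dir"])           # ./cache/data
bge_m3_path = Path(config["model_paths"]["bge_m3"])          # ./models/bge-m3
```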
### Script Defaults
```bash
# Data optimization (processed datasets)
python -m data.optimization bge-m3 data/train.jsonl --output_dir ./cache/data
# Model training (output models)
python scripts/train_m3.py --output_dir ./output/bge-m3 --cache_dir ./cache/models
# Model evaluation (results)
python scripts/evaluate.py --output_dir ./evaluation_results
```
## 🔄 Data Flow
```
Raw Data        →  Preprocessing  →  Optimization            →  Training           →  Output
data/           →  -              →  cache/data/             →  models             →  output/
(input JSONL)                        (pre-tokenized .pkl)        (model loading)       (fine-tuned models)
```
## 🛠️ Usage Examples
### Model Cache (Download Cache)
```python
# Automatic model caching: downloads are stored in the cache directory
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "BAAI/bge-m3",
    cache_dir="./cache/models"  # Downloads cached here
)
```
### Data Cache (Optimization Cache)
```bash
# Create optimized datasets
python -m data.optimization bge-m3 data/train.jsonl \
    --output_dir ./cache/data  # .pkl files created here

# Use cached datasets in training
python scripts/train_m3.py \
    --train_data ./cache/data/train_m3_cached.pkl
```
### Local Models (No Downloads)
```python
# Use local model (no downloads)
model = BGEM3Model.from_pretrained("./models/bge-m3")
```
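`BGEM3Model` above is the project's own wrapper. A generic sketch of the same idea with plain `transformers` — prefer the local copy under `./models/` and fall back to downloading into the cache only when it is missing:
```python
from pathlib import Path
from transformers import AutoModel, AutoTokenizer

LOCAL_PATH = Path("./models/bge-m3")
# If a full local copy exists, load it directly; otherwise pull BAAI/bge-m3
# from the Hub and cache the download under ./cache/models.
source = str(LOCAL_PATH) if LOCAL_PATH.exists() else "BAAI/bge-m3"

tokenizer = AutoTokenizer.from_pretrained(source, cache_dir="./cache/models")
model = AutoModel.from_pretrained(source, cache_dir="./cache/models")
```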
## 📋 Best Practices
1. **Separate Concerns**: Keep model cache, data cache, and outputs separate
2. **Clear Naming**: Use descriptive suffixes (`.pkl` for cache, `_cached` for optimized)
3. **Consistent Paths**: Use config.toml for centralized path management
4. **Cache Management** (see the sketch after this list):
   - Model cache: Can be shared across projects
   - Data cache: Project-specific optimized datasets
   - Output: Training results and checkpoints
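For the shared model cache, one lightweight option (illustrative only, not part of the project's tooling) is to let an environment variable override the per-project default:
```python
import os

# Hypothetical override: export BGE_MODEL_CACHE to point several projects at
# one shared cache; otherwise fall back to this project's default location.
model_cache_dir = os.environ.get("BGE_MODEL_CACHE", "./cache/models")
```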
## 🧹 Cleanup Commands
```bash
# Clean data cache (re-optimization needed)
rm -rf ./cache/data/
# Clean model cache (re-download needed)
rm -rf ./cache/models/
# Clean training outputs
rm -rf ./output/
# Clean all caches
rm -rf ./cache/
```
This structure ensures clear separation of concerns and efficient caching for both model downloads and processed datasets.