# BGE Fine-tuning Directory Structure
This document clarifies the directory structure and cache usage in the BGE fine-tuning project.
## 📂 Project Directory Structure
```
bge_finetune/
├── 📁 cache/                       # Cache directories
│   ├── 📁 data/                    # Preprocessed dataset cache
│   │   ├── 📄 *.jsonl              # Preprocessed JSONL files
│   │   └── 📄 *.json               # Preprocessed JSON files
│   └── 📁 models/                  # Downloaded model cache (HuggingFace/ModelScope)
│       ├── 📁 tokenizers/          # Cached tokenizer files
│       ├── 📁 config/              # Cached model config files
│       └── 📁 downloads/           # Temporary download files
├── 📁 models/                      # Local model storage
│   ├── 📁 bge-m3/                  # Local BGE-M3 model files
│   └── 📁 bge-reranker-base/       # Local BGE-reranker model files
├── 📁 data/                        # Raw training data
│   ├── 📄 train.jsonl              # Input training data
│   ├── 📄 eval.jsonl               # Input evaluation data
│   └── 📁 processed/               # Preprocessed (but not cached) data
├── 📁 output/                      # Training outputs
│   ├── 📁 bge-m3/                  # M3 training results
│   ├── 📁 bge-reranker/            # Reranker training results
│   └── 📁 checkpoints/             # Model checkpoints
└── 📁 logs/                        # Training logs
    ├── 📁 tensorboard/             # TensorBoard logs
    └── 📄 training.log             # Text logs
```
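On a fresh checkout, most of these directories will not exist until the first run creates them. A minimal sketch for creating the skeleton up front (assuming it is run from the project root; the directory names are taken from the layout above, not from any setup script in the project):
```python
from pathlib import Path

# Directory skeleton as documented above (illustrative; the training and
# optimization scripts may create these paths on their own).
DIRS = [
    "cache/data", "cache/models/tokenizers", "cache/models/config",
    "cache/models/downloads", "models/bge-m3", "models/bge-reranker-base",
    "data/processed", "output/bge-m3", "output/bge-reranker",
    "output/checkpoints", "logs/tensorboard",
]

for d in DIRS:
    Path(d).mkdir(parents=True, exist_ok=True)  # no-op if the directory exists
```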
## 🎯 Directory Purposes
### 1. **`./cache/models/`** - Model Downloads Cache
- **Purpose**: HuggingFace/ModelScope model download cache
- **Contents**: Tokenizers, config files, temporary downloads
- **Usage**: Automatic caching of remote model downloads
- **Config**: `model_paths.cache_dir = "./cache/models"`
### 2. **`./cache/data/`** - Processed Dataset Cache
- **Purpose**: Preprocessed and cleaned dataset files
- **Contents**: Validated JSONL/JSON files
- **Usage**: Store cleaned and validated training data
- **Config**: `data.cache_dir = "./cache/data"`
### 3. **`./models/`** - Local Model Storage
- **Purpose**: Complete local model files (when available)
- **Contents**: Full model directories with all files
- **Usage**: Direct loading without downloads
- **Config**: `model_paths.bge_m3 = "./models/bge-m3"`
### 4. **`./data/`** - Raw Input Data
- **Purpose**: Original training/evaluation datasets
- **Contents**: `.jsonl` files in various formats
- **Usage**: Input to preprocessing and optimization
- **Config**: User-specified paths in training scripts
### 5. **`./output/`** - Training Results
- **Purpose**: Fine-tuned models and training artifacts
- **Contents**: Saved models, checkpoints, metrics
- **Usage**: Results of training scripts
- **Config**: `--output_dir` in training scripts
## ⚙️ Configuration Mapping
### Config File Settings
```toml
[model_paths]
cache_dir = "./cache/models"                   # Model download cache
bge_m3 = "./models/bge-m3"                     # Local BGE-M3 path
bge_reranker = "./models/bge-reranker-base"    # Local reranker path

[data]
cache_dir = "./cache/data"                     # Dataset cache
optimization_output_dir = "./cache/data"       # Optimization default
```
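How these settings are read depends on the project's own config loader; as a rough sketch, assuming a `config.toml` in the project root and Python 3.11+ (for the built-in `tomllib`):
```python
import tomllib
from pathlib import Path

# Illustrative loader only; the project's actual config handling may differ.
with open("config.toml", "rb") as f:
    config = tomllib.load(f)

model_cache_dir = Path(config["model_paths"]["cache_dir"])   # ./cache/models
data_cache_dir = Path(config["data"]["cache_dir"])           # ./cache/data
bge_m3_path = Path(config["model_paths"]["bge_m3"])          # ./models/bge-m3
```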
### Script Defaults
```bash
# Data optimization (processed datasets)
python -m data.optimization bge-m3 data/train.jsonl --output_dir ./cache/data
# Model training (output models)
python scripts/train_m3.py --output_dir ./output/bge-m3 --cache_dir ./cache/models
# Model evaluation (results)
python scripts/evaluate.py --output_dir ./evaluation_results
```
## 🔄 Data Flow
```
Raw Data        →  Preprocessing  →  Optimization            →  Training           →  Output
data/           →  -              →  cache/data/             →  models             →  output/
(input JSONL)                        (pre-tokenized .pkl)        (model loading)       (fine-tuned models)
```
## 🛠️ Usage Examples
### Model Cache (Download Cache)
```python
# Automatic model caching: downloads are stored in the cache directory
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "BAAI/bge-m3",
    cache_dir="./cache/models"  # Downloads cached here
)
```
### Data Cache (Optimization Cache)
```bash
# Create optimized datasets
python -m data.optimization bge-m3 data/train.jsonl \
    --output_dir ./cache/data  # .pkl files created here

# Use cached datasets in training
python scripts/train_m3.py \
    --train_data ./cache/data/train_m3_cached.pkl
```
### Local Models (No Downloads)
```python
# Use local model (no downloads)
model = BGEM3Model.from_pretrained("./models/bge-m3")
```
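`BGEM3Model` above is the project's own wrapper. A generic sketch of the same idea with plain `transformers` — prefer the local copy under `./models/` and fall back to downloading into the cache only when it is missing:
```python
from pathlib import Path
from transformers import AutoModel, AutoTokenizer

LOCAL_PATH = Path("./models/bge-m3")
# If a full local copy exists, load it directly; otherwise pull BAAI/bge-m3
# from the Hub and cache the download under ./cache/models.
source = str(LOCAL_PATH) if LOCAL_PATH.exists() else "BAAI/bge-m3"

tokenizer = AutoTokenizer.from_pretrained(source, cache_dir="./cache/models")
model = AutoModel.from_pretrained(source, cache_dir="./cache/models")
```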
## 📋 Best Practices
1. **Separate Concerns**: Keep model cache, data cache, and outputs separate
2. **Clear Naming**: Use descriptive suffixes (`.pkl` for cache, `_cached` for optimized)
3. **Consistent Paths**: Use config.toml for centralized path management
4. **Cache Management** (see the sketch after this list):
   - Model cache: Can be shared across projects
   - Data cache: Project-specific optimized datasets
   - Output: Training results and checkpoints
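For the shared model cache, one lightweight option (illustrative only, not part of the project's tooling) is to let an environment variable override the per-project default:
```python
import os

# Hypothetical override: export BGE_MODEL_CACHE to point several projects at
# one shared cache; otherwise fall back to this project's default location.
model_cache_dir = os.environ.get("BGE_MODEL_CACHE", "./cache/models")
```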
## 🧹 Cleanup Commands
```bash
# Clean data cache (re-optimization needed)
rm -rf ./cache/data/
# Clean model cache (re-download needed)
rm -rf ./cache/models/
# Clean training outputs
rm -rf ./output/
# Clean all caches
rm -rf ./cache/
```
This structure ensures clear separation of concerns and efficient caching for both model downloads and processed datasets.