NaviDiv Submission Files - Data Repository for Paper

This repository contains all data, configurations, and scripts used to generate the results presented in the paper. It serves as a reference for reproducing the experiments and figures.

Table of Contents

  1. Repository Overview
  2. Prerequisites
  3. Initial Setup
  4. Reproducing Paper Results
  5. Directory Structure
  6. Generating Figures
  7. Configuration Details
  8. Troubleshooting

Repository Overview

This repository contains:

  • Configuration files for all experiments (conf_folder/)
  • Complete experimental data used in the paper (Singlet_fission_run/)
  • Scripts for running REINVENT with NaviDiv (run_reinvent*.sh)
  • Figure generation scripts (Figures_script/)
  • Trained models and priors (SF_model/, reinvent_prior/)
  • Path management utilities for easy setup on different systems

Prerequisites

Required Software

  1. Anaconda or Miniconda - for environment management
  2. Python 3.8 or later - installed through conda
  3. CUDA-capable GPU (optional but recommended)
  4. Navi_diversity package - the main NaviDiv codebase

Installing Navi_diversity

Clone the Navi_diversity repository:

git clone https://github.com/your-org/Navi_diversity.git
cd Navi_diversity
# Follow installation instructions in the Navi_diversity repository
conda env create -f environment.yml
conda activate NaviDiv_test

Note: Replace your-org with the actual organization/user hosting the Navi_diversity repository.

Initial Setup

Step 1: Update Paths for Your System

All configuration files contain absolute paths that need to be updated to match your local environment. We provide an automated tool for this:

cd /path/to/NaviDiv_submission_files_2

# Interactive setup (recommended for first-time users)
bash update_all_paths.sh

What it does:

  • Prompts you for the path to your Navi_diversity installation
  • Automatically updates all configuration files (.yaml, .toml, .sh, .py)
  • Creates backups of modified files in .backups/ directory
  • Generates an .env.template file with your configuration

Non-interactive setup (alternative):

python3 update_paths.py --navidiv-path /path/to/Navi_diversity
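
The behaviour described in Step 1 (rewrite absolute paths in .yaml, .toml, .sh and .py files, keep backups in .backups/) can be sketched in a few lines of Python. This is only an illustration of the idea and not the code of update_paths.py; the function name rewrite_paths and its arguments are hypothetical.

```python
import shutil
from pathlib import Path

# File types that Step 1 says are updated.
SUFFIXES = {".yaml", ".toml", ".sh", ".py"}


def rewrite_paths(root, old_prefix, new_prefix, backup_dir=".backups"):
    """Replace old_prefix with new_prefix in config-like files under root.

    Returns the list of modified files. Each modified file is first copied
    to root/backup_dir, keeping its relative location.
    """
    root = Path(root)
    backups = root / backup_dir
    changed = []
    # sorted() materialises the listing before any backup file is created.
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in SUFFIXES:
            continue
        if backups in path.parents:
            continue  # never rewrite the backups themselves
        text = path.read_text(encoding="utf-8")
        if old_prefix not in text:
            continue
        target = backups / path.relative_to(root)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
        path.write_text(text.replace(old_prefix, new_prefix), encoding="utf-8")
        changed.append(path)
    return changed
```

A call such as rewrite_paths(".", "/old/prefix", "/path/to/Navi_diversity") would then return the files it touched, which makes it easy to review the change before running any experiment.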

Step 2: Verify Your Configuration

After running the path update script, check that paths are correct:

  1. Open .env.template and verify:

    • NAVIDIV_PATH points to your Navi_diversity installation
    • WORKSPACE_ROOT points to this repository
  2. Check a sample config file:

    cat conf_folder/test.yaml
    

    Ensure paths are correct for your system.
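
The manual check in Step 2 can also be scripted. The sketch below assumes that .env.template consists of simple KEY=value lines (optionally prefixed with export); that format is an assumption, and the helper names read_env_file and missing_paths are hypothetical.

```python
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=value lines, ignoring blank lines and # comments."""
    values = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        # Drop surrounding quotes, if any.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_paths(env, keys=("NAVIDIV_PATH", "WORKSPACE_ROOT")):
    """Return the keys that are unset or do not point to an existing directory."""
    return [k for k in keys if not env.get(k) or not Path(env[k]).is_dir()]
```

An empty list from missing_paths(read_env_file(".env.template")) would indicate that both variables point to existing directories.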

Step 3: Set Up Environment Variables

# Source the environment template
source .env.template

# Or create a permanent .env file
cp .env.template .env
# Edit .env with your preferred paths
source .env

Reproducing Paper Results

The paper presents results from multiple experimental runs with different diversity scoring configurations. All data is already included in the Singlet_fission_run/ directory.

Paper Experiments Location

The experiments used in the paper are located in:

Singlet_fission_run/
├── experiment_1206/      # First complete run (date: 12 June)
├── experiment_1306/      # Second complete run (date: 13 June)
├── experiment_1406_1/    # Third run, replicate 1 (date: 14 June)
├── experiment_1406_2/    # Third run, replicate 2
├── experiment_1406_3/    # Third run, replicate 3
├── experiment_1406_4/    # Third run, replicate 4
└── experiment_1406_5/    # Third run, replicate 5

Each experiment folder contains subdirectories for different diversity scoring configurations:

  • All_constraints/ - Combined high constraints (all diversity metrics enabled)
  • All_weak_constraints/ - Combined low constraints (relaxed thresholds)
  • fragement_only/ - Fragment-based diversity only
  • ngram_only/ - N-gram-based diversity only
  • scaffold_only/ - Scaffold-based diversity only
  • similarity_only/ - Similarity-based diversity only
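
Before plotting or re-analysing the data, it can help to confirm that every experiment folder contains all six configuration subdirectories listed above. The following is a minimal sketch; the function name missing_config_dirs is hypothetical, and the folder names (including the spelling fragement_only) are copied from the list.

```python
from pathlib import Path

# Sub-directory names as listed above (spelling matches the folders on disk).
CONFIG_DIRS = (
    "All_constraints",
    "All_weak_constraints",
    "fragement_only",
    "ngram_only",
    "scaffold_only",
    "similarity_only",
)


def missing_config_dirs(run_root):
    """Map each experiment_* folder to the config sub-directories it lacks."""
    report = {}
    for exp in sorted(Path(run_root).glob("experiment_*")):
        if exp.is_dir():
            report[exp.name] = [c for c in CONFIG_DIRS if not (exp / c).is_dir()]
    return report
```

Calling missing_config_dirs("Singlet_fission_run") on a complete copy of the data should give an empty list for every experiment.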

Running New Experiments

To reproduce the experiments or run new ones:

Quick Test Run (100 steps)

# Make sure environment is set up
source .env

# Run a test with one diversity scorer
./run_reinvent_updated.sh

This runs REINVENT for 100 steps as a quick test and uses only the first diversity scorer configuration.

Full Paper Reproduction

To run the full experiments as in the paper, modify run_reinvent_updated.sh:

  1. Set maximum steps to 1000 (paper value):

    # In run_reinvent_updated.sh, change:
    reinvent_common.max_steps=100
    # to:
    reinvent_common.max_steps=1000
    
  2. Remove the break statement to run all diversity scorers:

    # In run_reinvent_updated.sh, remove or comment out:
    # break
    
  3. Run multiple replicates:

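    # Use a different RUN_INDEX for each replicate. Passing it on the command
    # line works only if the script reads RUN_INDEX from the environment;
    # otherwise edit RUN_INDEX inside the script before each run.
    for i in {1..5}; do
        RUN_INDEX=$i ./run_reinvent_updated.sh
    done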
    

Key Parameters in Run Script

  • ENV_NAME: Conda environment name (default: NaviDiv_test)
  • CONFIG_NAME: Configuration file to use (default: test)
  • WD: Working directory for output
  • RUN_INDEX: Run number for organizing replicates
  • reinvent_common.max_steps: Number of RL steps (100 for test, 1000 for paper)

Configuration Files

All diversity scorer configurations are in:

conf_folder/diversity_scorer/
├── 1_default.yaml              # Baseline configuration
├── All_constraints.yaml        # All metrics with high thresholds
├── All_weak_constraints.yaml   # All metrics with low thresholds
├── fragement_only.yaml         # Fragment diversity only
├── ngram_only.yaml             # N-gram diversity only
├── scaffold_only.yaml          # Scaffold diversity only
└── similarity_only.yaml        # Tanimoto similarity only

To use a specific configuration:

# Edit run_reinvent_updated.sh
# Change the diversity_scorer parameter:
diversity_scorer="All_constraints"  # or any other config name
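
To see which names are valid for diversity_scorer, one option is to list the YAML files in conf_folder/diversity_scorer/. The sketch below assumes that the config name is the file name without the .yaml extension, as the All_constraints example suggests; the function name available_scorers is hypothetical.

```python
from pathlib import Path


def available_scorers(conf_dir="conf_folder/diversity_scorer"):
    """Return config names (file stems) found in the scorer directory."""
    return sorted(p.stem for p in Path(conf_dir).glob("*.yaml"))
```

Run from the repository root, available_scorers() would be expected to include names such as All_constraints and ngram_only.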

Directory Structure

NaviDiv_submission_files_2/
├── README.md                          # This file
├── update_paths.py                    # Path update utility
├── update_all_paths.sh                # Interactive path setup
├── run_reinvent_updated.sh            # Main run script (updated paths)
├── run_reinvent.sh                    # Original run script (legacy)
│
├── conf_folder/                       # Configuration files
│   ├── test.yaml                      # Main REINVENT config
│   ├── default_config.toml            # Transfer learning config
│   ├── diversity_scorer/              # Diversity scoring configs
│   │   ├── 1_default.yaml
│   │   ├── All_constraints.yaml
│   │   ├── All_weak_constraints.yaml
│   │   ├── fragement_only.yaml
│   │   ├── ngram_only.yaml
│   │   ├── scaffold_only.yaml
│   │   └── similarity_only.yaml
│   └── reinvent_common/               # Common REINVENT settings
│
├── Singlet_fission_run/               # ** PAPER DATA - All experimental results **
│   ├── experiment_1206/               # First complete run
│   │   ├── All_constraints/           # Results for each diversity config
│   │   ├── All_weak_constraints/
│   │   ├── fragement_only/
│   │   ├── ngram_only/
│   │   ├── scaffold_only/
│   │   └── similarity_only/
│   ├── experiment_1306/               # Second complete run
│   └── experiment_1406_[1-5]/         # Five replicates of third run
│
├── Figures_script/                    # ** Figure generation scripts **
│   ├── README.md                      # Detailed plotting guide
│   ├── plot_steps_multi_experiment.py # Generate plots from multiple runs
│   ├── plot_steps_single_experiment.py # Generate plots from single run
│   ├── customize_figure_multi_experiment.py # Customize and filter plots
│   ├── raw/                           # Generated raw plots
│   │   ├── steps_plot_multi_experiment.pkl
│   │   └── steps_plot_multi_experiment.png
│   └── modified/                      # Customized final plots
│       └── steps_plot_multi_experiment_second.png
│
├── SF_model/                          # Singlet fission model
│   ├── formed.prior                   # Pre-trained prior
│   ├── agents/                        # Agent checkpoints
│   └── formed_chemprop/               # ChemProp model files
│
├── reinvent_prior/                    # REINVENT prior model
│   └── formed.prior
│
└── outputs/                           # Test outputs (not used in paper)

Generating Figures

All figures in the paper were generated using the scripts in Figures_script/. See the detailed guide in Figures_script/README.md.

Quick Start - Regenerate Paper Figures

cd Figures_script

# Step 1: Generate raw plots from all paper experiments
python plot_steps_multi_experiment.py

This will:

  • Read data from Singlet_fission_run/experiment_*/
  • Generate plots for all diversity metrics
  • Save outputs to raw/steps_plot_multi_experiment.png and .pkl

# Step 2: Customize plots to show only specific panels
python customize_figure_multi_experiment.py

This will:

  • Load the saved pickle file
  • Select specific axes (subplots) to display
  • Apply custom styling (labels, colors, fonts)
  • Save customized figure to modified/steps_plot_multi_experiment_second.png

Customizing Which Plots to Show

To select different panels for your figure, edit customize_figure_multi_experiment.py:

# Change keep_indices to show different subplots
# Indices correspond to the order in the raw figure
fig_custom = customize_axes(
    axes,
    keep_indices=[12, 13, 4, 5],  # Show only these subplot indices
    # ... other parameters
)

How to find subplot indices:

  1. Open raw/steps_plot_multi_experiment.png
  2. Count subplots from top-left to bottom-right (starting at 0)
  3. Note the indices of the plots you want to keep
  4. Update keep_indices in the customize script
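
Counting panels by eye is error-prone for large figures. If the raw figure is a regular grid and the subplots are stored in row-major order (an assumption that matches the top-left to bottom-right counting rule above), the index can be computed from the row and column. The helper names below are hypothetical.

```python
def subplot_index(row, col, ncols):
    """Row-major index of the panel at (row, col); all values are zero-based."""
    if row < 0 or not 0 <= col < ncols:
        raise ValueError("row/col outside the grid")
    return row * ncols + col


def subplot_position(index, ncols):
    """Inverse mapping: (row, col) of a zero-based panel index."""
    if index < 0:
        raise ValueError("index must be non-negative")
    return divmod(index, ncols)
```

For example, in a hypothetical grid with 4 columns, subplot_position(12, 4) gives (3, 0), so index 12 would be the first panel of the fourth row.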

Available Metrics in Plots

The following diversity metrics are plotted:

  • Score - Overall reward score from REINVENT
  • Prior - Negative log-likelihood from prior model
  • Appeared more than 10 times - Structures appearing in >10% of molecules
  • mean_distance - Average Tanimoto distance (diversity)
  • mean_similarity - Average Tanimoto similarity
  • Percentage of Unique Fragments - Fragment diversity ratio
  • Unique Circles (Morgan Fingerprint) - Circular fingerprint diversity
  • 10-gram statistics - N-gram based diversity metrics
  • Scaffold statistics - Scaffold diversity metrics
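
As a worked illustration of mean_similarity, mean_distance and a unique-item ratio, the sketch below treats each fingerprint as a set of on-bit indices. The exact definitions used by NaviDiv are not given here, so the formulas are the standard Tanimoto ones and the unique-fraction helper is an assumption about how a "percentage of unique fragments" could be computed.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def mean_similarity(fingerprints):
    """Average pairwise Tanimoto similarity over all unordered pairs."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("need at least two fingerprints")
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


def unique_fraction(items):
    """Share of distinct entries, e.g. unique fragments / all fragments."""
    items = list(items)
    return len(set(items)) / len(items) if items else 0.0


# Toy fingerprints (hypothetical data): pairwise similarities are 2/4, 0/5, 0/5.
fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
sim = mean_similarity(fps)
dist = 1.0 - sim  # mean Tanimoto distance is one minus the mean similarity
```

Because distance is defined as one minus similarity for every pair, the two averages always sum to 1, which is why the two curves mirror each other in the plots.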

For detailed plotting instructions, see Figures_script/README.md.

Configuration Details

Main REINVENT Configuration (conf_folder/test.yaml)

Key parameters:

  • run_mode: Type of REINVENT run (e.g., "transfer_learning", "reinforcement_learning")
  • max_steps: Number of RL optimization steps (100 for testing, 1000 for paper)
  • prior_path: Path to pre-trained prior model
  • agent_path: Path to agent checkpoint
  • diversity_scorer: Which diversity configuration to use

Diversity Scorer Configurations

Each YAML file in conf_folder/diversity_scorer/ defines:

  • Enabled metrics: Which diversity metrics to calculate
  • Thresholds: Penalty thresholds for each metric
  • Weights: Relative importance of each metric
  • Scoring mode: How penalties are combined

Example structure:

diversity_metrics:
  fragment_diversity:
    enabled: true
    threshold: 0.7
    weight: 1.0
  ngram_diversity:
    enabled: true
    threshold: 0.8
    weight: 1.0
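
Once such a file has been loaded into a dictionary (for instance with a YAML parser), it is straightforward to list the active metrics and to catch incomplete entries. The sketch below mirrors the example structure as a plain Python dict; the scaffold_diversity entry and the helper names are hypothetical, and the real files may contain additional keys.

```python
# Plain-dict mirror of the example structure above.
config = {
    "diversity_metrics": {
        "fragment_diversity": {"enabled": True, "threshold": 0.7, "weight": 1.0},
        "ngram_diversity": {"enabled": True, "threshold": 0.8, "weight": 1.0},
        # Hypothetical disabled entry, added only to show the filtering.
        "scaffold_diversity": {"enabled": False, "threshold": 0.5, "weight": 1.0},
    }
}


def enabled_metrics(cfg):
    """Return {metric: (threshold, weight)} for metrics with enabled: true."""
    return {
        name: (opts["threshold"], opts["weight"])
        for name, opts in cfg.get("diversity_metrics", {}).items()
        if opts.get("enabled", False)
    }


def missing_keys(cfg, required=("enabled", "threshold", "weight")):
    """List 'metric: missing key' messages for incomplete metric entries."""
    problems = []
    for name, opts in cfg.get("diversity_metrics", {}).items():
        for key in required:
            if key not in opts:
                problems.append(f"{name}: missing '{key}'")
    return problems
```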

Model Files

  • Prior Model (SF_model/formed.prior, reinvent_prior/formed.prior):

    • Pre-trained generative model for sampling molecules
    • Used as baseline for RL optimization
  • Agent Checkpoints (SF_model/agents/agent_*.chkpt):

    • Saved agent states during training
    • Can be used to resume training or analyze learning progression
  • Property Predictor (SF_model/formed_chemprop/):

    • ChemProp model for predicting singlet fission properties
    • Used as reward function during RL

Troubleshooting

Common Issues

1. Path Errors

Problem: FileNotFoundError or paths not found

Solution:

# Re-run path update script
bash update_all_paths.sh

# Look for leftovers of the original absolute path prefix in the config files
grep -r "/media/mohammed" conf_folder/
# Should return empty if paths are updated correctly
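
The same check can be done from Python, which also reports the line number of each leftover path. This is a sketch comparable to the grep call above, not part of the repository; unlike grep -r it only looks at the listed file types, and the function name find_stale_paths is hypothetical.

```python
from pathlib import Path


def find_stale_paths(root, needle, suffixes=(".yaml", ".toml", ".sh", ".py")):
    """Return (file, line_number, line) for every line still containing needle."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
        for number, line in enumerate(lines, start=1):
            if needle in line:
                hits.append((path, number, line.strip()))
    return hits
```

An empty result from find_stale_paths("conf_folder", "/media/mohammed") corresponds to the empty grep output described above.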

2. Conda Environment Not Found

Problem: Environment 'NaviDiv_test' not found

Solution:

# Create environment from Navi_diversity repository
cd /path/to/Navi_diversity
conda env create -f environment.yml

# Or check existing environments
conda env list

3. CUDA/GPU Errors

Problem: CUDA out of memory or GPU not available

Solution:

# Edit conf_folder/test.yaml to use CPU
device: "cpu"  # instead of "cuda:0"

# Or reduce batch size
batch_size: 50  # instead of 100

4. Import Errors

Problem: ModuleNotFoundError: No module named 'navidiv'

Solution:

# Make sure PYTHONPATH is set
export PYTHONPATH="${PYTHONPATH}:${NAVIDIV_PATH}/src/navidiv/reinvent"

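# Or add to .bashrc for a permanent fix. The single quotes keep ${NAVIDIV_PATH}
# unexpanded, so NAVIDIV_PATH must also be exported before this line in .bashrc.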
echo 'export PYTHONPATH="${PYTHONPATH}:${NAVIDIV_PATH}/src/navidiv/reinvent"' >> ~/.bashrc
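
To confirm whether the PYTHONPATH change took effect, the module lookup can be tested without running a full experiment. The sketch below uses the standard library function importlib.util.find_spec; the helper name is_importable is hypothetical, and the module name navidiv is taken from the error message above.

```python
import importlib.util
import sys


def is_importable(module_name):
    """True if Python can locate the top-level module on the current sys.path."""
    return importlib.util.find_spec(module_name) is not None


if not is_importable("navidiv"):
    # Print the search path so a missing PYTHONPATH entry is easy to spot.
    print("navidiv not found; sys.path is:")
    for entry in sys.path:
        print("  ", entry)
```

Run this inside the activated conda environment, since sys.path differs between environments.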

5. Permission Errors

Problem: Cannot write to output directory

Solution:

# Ensure output directories exist and are writable
mkdir -p test_case_3
chmod 755 test_case_3

Getting Help

If you encounter issues:

  1. Check the Navi_diversity repository documentation
  2. Verify all paths are correctly updated
  3. Ensure conda environment is activated
  4. Check log files in the output directory

Citation

If you use this data or code in your research, please cite:

@article{your_paper_2025,
  title={Your Paper Title},
  author={Your Name and Others},
  journal={Journal Name},
  year={2025}
}

License

[Add your license information here]

Contact

For questions or issues:

  • Open an issue in the repository
  • Contact: [your-email@example.com]

Acknowledgments

This work was supported by [funding sources]. We thank the developers of REINVENT and the NaviDiv framework.

Last Updated: October 2025