ChemBFN: BFN for Chemistry Tasks

Madmol is a CLI tool that simplifies the workflow of training a generative model on a given dataset and/or sampling molecules.

1. Get Version

madmol --version

2. Get Help

madmol --help

3. Check The Settings Before Running A Job

madmol [YOUR_CONFIG.toml] [YOUR_MODEL_CONFIG.toml] --dryrun

This command reports any misconfigurations that would likely terminate your job, e.g., setting conflicts, missing files, etc.

4. Run Your Job

madmol [YOUR_CONFIG.toml] [YOUR_MODEL_CONFIG.toml]

4.1. Defining task

The first positional argument [YOUR_CONFIG.toml] should be an absolute path pointing to a TOML file defining the runtime configurations. The format should follow the example below.

device = "auto"   # <-- any device supportrd by PyTorch, e.g., "cpu", "cuda:0"
run_name = "qm9"  # <-- job name

[tokeniser]
name = "SMILES"    # <-- "SMILES", "SAFE", "FASTA" or "SELFIES"
vocab = "default"  # <-- it should be a vocabulary file name in absolute path iff name = "SELFIES"

[train]  # <-- remove this table if training is unnecessary
epoch = 100
batch_size = 512
semi_autoregressive = false
enable_lora = false
dynamic_padding = false                  # <-- only set to true when pretraining a model
restart = ""                             # <-- a checkpoint file in absolute path if necessary
dataset = "home/user/project/dataset/qm9.csv"
molecule_tag = "smiles"                  # <-- the header tag under which the molecules are stored
objective_tag = ["homo", "lumo", "gap"]  # <-- the header tag(s) under which the objective values are stored; set to empty array [] if the model is unconditional
enforce_validity = true                  # <-- has no effect unless SMILES or SAFE is used
logger_name = "wandb"                    # <-- "wandb", "csv" or "tensorboard"
logger_path = "home/user/project/logs"
checkpoint_save_path = "home/user/project/ckpt"
train_strategy = "auto"                  # <-- any strategy supported by Lightning, e.g., "ddp"
accumulate_grad_batches = 1
enable_progress_bar = false
plugin_script = ""                       # <-- define customised behaviours of dataset, datasetloader, etc in a python script

[inference]  # <-- Remove this table if inference is unnecessary
mini_batch_size = 50
sequence_length = "match dataset"           # <-- must be an integer in an inference-only job
sample_size = 1000                          # <-- the minimum number of samples you want
sample_step = 100
sample_method = "ODE:0.5"                   # <-- meaning ODE-solver with temperature of 0.5; another choice is "BFN"
semi_autoregressive = false
lora_scaling = 1.0                          # <-- adjusting the LoRA effectiveness if applied
guidance_objective = [-0.023, 0.09, 0.113]  # <-- for unconditional jobs set it to empty array []
guidance_objective_strength = 4.0           # <-- unnecessary if guidance_objective = []
guidance_scaffold = "c1ccccc1"              # <-- if no scaffold is used set it to empty string ""
sample_template = ""                        # <-- template for mol2mol task; leave it blank if scaffold is used
unwanted_token = []
exclude_invalid = true                      # <-- whether to only store valid samples
exclude_duplicate = true                    # <-- whether to only store unique samples
result_file = "home/user/project/result/result.csv"

Important notes:

lora_scaling and sample_template were added in version 2.1.0

plugin_script was added in version 2.2.0

4.2. Defining model architecture

The second positional argument [YOUR_MODEL_CONFIG.toml] should be an absolute path pointing to a TOML file defining the model hyperparameters. The following example shows the format.

[ChemBFN]
num_vocab = "match vocabulary size"  # <-- you can set to a specific integer
channel = 512
num_layer = 12
num_head = 8
dropout = 0.01
base_model = []                      # <-- specify a base model checkpoint file in absolute path when necessary; format ["basemodel.pt", "lora.pt" (optional)]

[MLP]  # <-- remove this table if MLP is not needed
size = [3, 256, 512]                 # <-- dimension of the vector goes as 3 --> 256 --> 512
class_input = false                  # <-- set to true if the inputs are class indices
base_model = ""                      # <-- specify a base model checkpoint in absolute path when necessary

4.3. Defining customised behaviours

Since version 2.2.0, it is possible to pass a Python 3 script to the program via plugin_script={PATH/TO/YOUR/SCRIPT.py} in [YOUR_CONFIG.toml] to control the behaviours of dataset loading and sequence padding. Currently, the accepted customised values are collate_fn, num_workers, shuffle, max_sequence_length, and CustomData.

For instance, to disable shuffling the batches

shuffle = False

to change the number of workers (default value is 4) in the torch.utils.data.DataLoader instance

num_workers = 0

to define a padding length (default is the maximum length in the dataset)

max_sequence_length = 125

to use a customised collating function

import random
from bayesianflow_for_chem.data import collate

def collate_fn(x):
    random.shuffle(x)
    return collate(x)

or to define your own dataset object (e.g., a chunked dataset class)

import torch
import pandas as pd
from bayesianflow_for_chem.data import CSVData

class CustomData(CSVData):
    def __init__(self, file, chunksize: int = 100000):
        super().__init__(file)
        ...  # your code

    def __len__(self):
        return ...  # your code
    
    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        ...  # your code

To tell the program which customised values should be used, list them in the __all__ variable, e.g., __all__ = ["collate_fn", "num_workers", "shuffle", "max_sequence_length", "CustomData"].
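
Putting these together, a complete plugin script could look like the sketch below; the values are just the illustrative ones from the snippets above.

import random
from bayesianflow_for_chem.data import collate

shuffle = False            # disable batch shuffling
num_workers = 0            # number of DataLoader workers (the default is 4)
max_sequence_length = 125  # fixed padding length

def collate_fn(x):
    # shuffle the entries within a mini-batch before collating them
    random.shuffle(x)
    return collate(x)

# expose the customised values to the program
__all__ = ["collate_fn", "num_workers", "shuffle", "max_sequence_length"]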

Note that if you define a dataset class that does not inherit from CSVData, make sure it includes a map(...) method. If the map(...) method is unnecessary for your CustomData, set it to lambda x: None. A detailed example can be found here.
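
For instance, a minimal sketch of such a class that does not inherit from CSVData: the no-op lambda follows the note above, and everything else is placeholder code for you to fill in.

import torch

class CustomData(torch.utils.data.Dataset):
    def __init__(self, file):
        super().__init__()
        self.map = lambda x: None  # no-op placeholder, since map(...) is unnecessary here
        ...  # your code

    def __len__(self):
        return ...  # your code

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        ...  # your code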