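A walkthrough of the AF3 ProteinDataset class's initialization method
The ProteinDataset class in the AlphaFold3 protein_dataset module builds a dataset that a model can use for training or inference from structured protein data. Its __init__ method initializes a protein dataset object.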
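To make a few of the documented parameters more concrete before reading the source, here is a minimal Python sketch of how `use_fraction`, `lower_limit`, `upper_limit` and `mask_frac` could behave according to their docstring descriptions. The helper names `select_clusters` and `num_residues_to_mask` are hypothetical and are not part of the library. Two details are assumptions rather than documented behavior: truncating when rounding, and sampling uniformly between the limits.

```python
import random


def select_clusters(cluster_ids, use_fraction=1.0):
    """Keep the first `use_fraction` of cluster ids in alphabetical order.

    Hypothetical helper: the docstring only says "first N in alphabetic
    order"; truncating with int() is an assumption.
    """
    ordered = sorted(cluster_ids)
    n_keep = int(len(ordered) * use_fraction)
    return ordered[:n_keep]


def num_residues_to_mask(chain_length, lower_limit=15, upper_limit=100,
                         mask_frac=None, rng=None):
    """Pick how many residues of a chain to mask.

    Hypothetical helper. If `mask_frac` is given, the count is
    `mask_frac` times the chain length (as documented). Otherwise a count
    is drawn between the two limits, capped at the chain length; uniform
    sampling is an assumption.
    """
    rng = rng or random.Random()
    if mask_frac is not None:
        return int(mask_frac * chain_length)
    low = min(lower_limit, chain_length)
    high = min(upper_limit, chain_length)
    return rng.randint(low, high)  # inclusive on both ends


if __name__ == "__main__":
    print(select_clusters(["c3", "c1", "c2", "c4"], use_fraction=0.5))
    print(num_residues_to_mask(200, mask_frac=0.25))
    print(num_residues_to_mask(300, rng=random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the sketch reproducible; the real class may handle randomness differently.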
Source code:
def __init__(
    self,
    dataset_folder,
    features_folder="./data/tmp/",
    clustering_dict_path=None,
    max_length=None,
    rewrite=False,
    use_fraction=1,
    load_to_ram=False,
    debug=False,
    interpolate="none",
    node_features_type="zeros",
    debug_file_path=None,
    entry_type="biounit",  # biounit, chain, pair
    classes_to_exclude=None,  # heteromers, homomers, single_chains
    shuffle_clusters=True,
    min_cdr_length=None,
    feature_functions=None,
    classes_dict_path=None,
    cut_edges=False,
    mask_residues=True,
    lower_limit=15,
    upper_limit=100,
    mask_frac=None,
    mask_whole_chains=False,
    mask_sequential=False,
    force_binding_sites_frac=0.15,
    mask_all_cdrs=False,
    load_ligands=False,
    pyg_graph=False,
    patch_around_mask=False,
    initial_patch_size=128,
    antigen_patch_size=128,
    require_antigen=False,
    require_light_chain=False,
    require_no_light_chain=False,
    require_heavy_chain=False,
):
"""Initialize the dataset.
Parameters
----------
dataset_folder : str
the path to the folder with proteinflow format input files (assumes that files are named {biounit_id}.pickle)
features_folder : str, default "./data/tmp/"
the path to the folder where the ProteinMPNN features will be saved
clustering_dict_path : str, optional
path to the pickled clustering dictionary (keys are cluster ids, values are (biounit id, chain id) tuples)
max_length : int, optional
entries with total length of chains larger than `max_length` will be disregarded
rewrite : bool, default False
if `False`, existing feature files are not overwritten
use_fraction : float, default 1
the fraction of the clusters to use (first N in alphabetic order)
load_to_ram : bool, default False
if `True`, the data will be stored in RAM (use with caution! if RAM isn't big enough the machine might crash)
debug : bool, default False
only process 1000 files
interpolate : {"none", "only_middle", "all"}
`"none"` for no interpolation, `"only_middle"` for only linear interpolation in the middle, `"all"` for linear interpolation + ends generation
node_features_type : {"zeros", "dihedral", "sidechain_orientation", "chemical", "secondary_structure" or combinations with "+"}
the type of node features, e.g. `"dihedral"` or `"sidechain_orientation+chemical"`
debug_file_path : str, optional
if not `None`, open this single file instead of loading the dataset
entry_type : {"biounit", "chain", "pair"}
the type of entries to generate (`"biounit"` for biounit-level complexes, `"chain"` for chain-level, `"pair"`
for chain-chain pairs (all pairs that are seen in the same biounit and have intersecting coordinate clouds))
classes_to_exclude : list of str, optional
a list of classes to exclude from the dataset (select from `"single_chain"`, `"heteromer"`, `"homomer"`)
shuffle_clusters : bool, default True
if `True`, a new representative is randomly selected for each cluster at each epoch (if `clustering_dict_path` is given)
min_cdr_length : int, optional
for SAbDab datasets, biounits with CDRs shorter than `min_cdr_length` will be excluded
feature_functions : dict, optional
a dictionary of functions to compute additional features (keys are the names of the features, values are the functions)
classes_dict_path : str, optional
a path to a pickled dictionary with biounit classes (single chain / heteromer / homomer)
cut_edges : bool, default False
if `True`, missing values at the edges of the sequence will be cut off
mask_residues : bool, default True
if `True`, the masked residues will be added to the output
lower_limit : int, default 15
the lower limit of the number of residues to mask
upper_limit : int, default 100
the upper limit of the number of residues to mask
mask_frac : float, optional
if given, the number of residues to mask is `mask_frac` times the length of the chain
mask_whole_chains : bool, default False
if `True`, the whole chain is masked
mask_sequential : bool, default False
if `True`, the masked residues will be neighbors in the sequence; otherwise a geometric
mask is applied based on the coordinates
force_binding_sites_frac : float, default 0.15
if `force_binding_sites_frac` > 0 and `mask_whole_chains` is `False`, in the fraction of cases where a chain
from a polymer is sampled, the center