Understanding the Initialization Method of the AF3 ProteinDataset Class

news 2025/10/8 17:10:03

The ProteinDataset class in the AlphaFold3 protein_dataset module builds a dataset for model training and inference from structured protein data. Its __init__ method initializes a protein dataset object.

Source code:

    def __init__(
        self,
        dataset_folder,
        features_folder="./data/tmp/",
        clustering_dict_path=None,
        max_length=None,
        rewrite=False,
        use_fraction=1,
        load_to_ram=False,
        debug=False,
        interpolate="none",
        node_features_type="zeros",
        debug_file_path=None,
        entry_type="biounit",  # biounit, chain, pair
        classes_to_exclude=None,  # single_chain, heteromer, homomer
        shuffle_clusters=True,
        min_cdr_length=None,
        feature_functions=None,
        classes_dict_path=None,
        cut_edges=False,
        mask_residues=True,
        lower_limit=15,
        upper_limit=100,
        mask_frac=None,
        mask_whole_chains=False,
        mask_sequential=False,
        force_binding_sites_frac=0.15,
        mask_all_cdrs=False,
        load_ligands=False,
        pyg_graph=False,
        patch_around_mask=False,
        initial_patch_size=128,
        antigen_patch_size=128,
        require_antigen=False,
        require_light_chain=False,
        require_no_light_chain=False,
        require_heavy_chain=False,
    ):
        """Initialize the dataset.

        Parameters
        ----------
        dataset_folder : str
            the path to the folder with proteinflow format input files (assumes that files are named {biounit_id}.pickle)
        features_folder : str, default "./data/tmp/"
            the path to the folder where the ProteinMPNN features will be saved
        clustering_dict_path : str, optional
            path to the pickled clustering dictionary (keys are cluster ids, values are (biounit id, chain id) tuples)
        max_length : int, optional
            entries with total length of chains larger than `max_length` will be disregarded
        rewrite : bool, default False
            if `False`, existing feature files are not overwritten
        use_fraction : float, default 1
            the fraction of the clusters to use (first N in alphabetic order)
        load_to_ram : bool, default False
            if `True`, the data will be stored in RAM (use with caution! if RAM isn't big enough the machine might crash)
        debug : bool, default False
            only process 1000 files
        interpolate : {"none", "only_middle", "all"}
            `"none"` for no interpolation, `"only_middle"` for only linear interpolation in the middle, `"all"` for linear interpolation + ends generation
        node_features_type : {"zeros", "dihedral", "sidechain_orientation", "chemical", "secondary_structure" or combinations with "+"}
            the type of node features, e.g. `"dihedral"` or `"sidechain_orientation+chemical"`
        debug_file_path : str, optional
            if not `None`, open this single file instead of loading the dataset
        entry_type : {"biounit", "chain", "pair"}
            the type of entries to generate (`"biounit"` for biounit-level complexes, `"chain"` for chain-level, `"pair"`
            for chain-chain pairs (all pairs that are seen in the same biounit and have intersecting coordinate clouds))
        classes_to_exclude : list of str, optional
            a list of classes to exclude from the dataset (select from `"single_chain"`, `"heteromer"`, `"homomer"`)
        shuffle_clusters : bool, default True
            if `True`, a new representative is randomly selected for each cluster at each epoch (if `clustering_dict_path` is given)
        min_cdr_length : int, optional
            for SAbDab datasets, biounits with CDRs shorter than `min_cdr_length` will be excluded
        feature_functions : dict, optional
            a dictionary of functions to compute additional features (keys are the names of the features, values are the functions)
        classes_dict_path : str, optional
            a path to a pickled dictionary with biounit classes (single chain / heteromer / homomer)
        cut_edges : bool, default False
            if `True`, missing values at the edges of the sequence will be cut off
        mask_residues : bool, default True
            if `True`, the masked residues will be added to the output
        lower_limit : int, default 15
            the lower limit of the number of residues to mask
        upper_limit : int, default 100
            the upper limit of the number of residues to mask
        mask_frac : float, optional
            if given, the number of residues to mask is `mask_frac` times the length of the chain
        mask_whole_chains : bool, default False
            if `True`, the whole chain is masked
        mask_sequential : bool, default False
            if `True`, the masked residues will be neighbors in the sequence; otherwise a geometric
            mask is applied based on the coordinates
        force_binding_sites_frac : float, default 0.15
            if `force_binding_sites_frac` > 0 and `mask_whole_chains` is `False`, in the fraction of cases where a chain
            from a polymer is sampled, the center
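
The masking parameters documented above (`lower_limit`, `upper_limit`, `mask_frac`) control how many residues are masked. The following is a minimal sketch of that rule, not the library's implementation: the function name `choose_mask_size`, the uniform sampling between the limits, the rounding, and the clamping to the chain length are all assumptions made for illustration.

```python
import random


def choose_mask_size(chain_length, lower_limit=15, upper_limit=100,
                     mask_frac=None, rng=None):
    """Return how many residues to mask in one chain (illustrative sketch)."""
    rng = rng or random.Random()
    if mask_frac is not None:
        # Documented rule: mask_frac times the length of the chain.
        # Rounding down and keeping at least one residue are assumptions.
        return min(chain_length, max(1, int(mask_frac * chain_length)))
    # Assumption: sample uniformly between the two limits,
    # never asking for more residues than the chain contains.
    high = min(upper_limit, chain_length)
    low = min(lower_limit, high)
    return rng.randint(low, high)
```

With these assumptions, `choose_mask_size(200, mask_frac=0.25)` returns 50, while a call without `mask_frac` returns a value between `lower_limit` and `upper_limit`, capped at the chain length.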

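The clustering-related parameters (`clustering_dict_path`, `use_fraction`, `shuffle_clusters`) can be pictured with a small sketch. It assumes the dictionary layout described in the docstring, with cluster ids mapped to lists of `(biounit id, chain id)` tuples. The helper names, the rounding of `use_fraction`, the fallback to the first member when shuffling is off, and the sample ids are hypothetical and not taken from the library.

```python
import random


def select_clusters(clusters, use_fraction=1.0):
    """Keep the first N cluster ids in alphabetical order."""
    cluster_ids = sorted(clusters)
    n_keep = int(len(cluster_ids) * use_fraction)  # rounding down is an assumption
    return cluster_ids[:n_keep]


def pick_representatives(clusters, cluster_ids, shuffle_clusters=True, rng=None):
    """Choose one (biounit id, chain id) entry per cluster for an epoch."""
    rng = rng or random.Random()
    representatives = []
    for cluster_id in cluster_ids:
        members = clusters[cluster_id]
        # With shuffling, a representative is drawn at random on every call;
        # without it, this sketch takes the first member (an assumption).
        chosen = rng.choice(members) if shuffle_clusters else members[0]
        representatives.append(chosen)
    return representatives


# Made-up ids, for illustration only.
clusters = {
    "c2": [("1abc", "A"), ("2xyz", "B")],
    "c1": [("3def", "A")],
    "c3": [("4ghi", "C"), ("5jkl", "A"), ("6mno", "B")],
    "c4": [("7pqr", "A")],
}
kept = select_clusters(clusters, use_fraction=0.5)
epoch_entries = pick_representatives(clusters, kept, rng=random.Random(0))
```

Calling `pick_representatives` once per epoch mirrors the documented behaviour of `shuffle_clusters=True`: each cluster contributes a single entry, and that entry can change from epoch to epoch.
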
Full text: http://www.dtcms.com/a/126961.html