
Understanding the __init__ Method of the AF3 ProteinDataset Class

The ProteinDataset class in the AlphaFold3 protein_dataset module builds a dataset for model training or inference from structured protein data. Its __init__ method initializes a protein dataset object.

Source code:

    def __init__(
        self,
        dataset_folder,
        features_folder="./data/tmp/",
        clustering_dict_path=None,
        max_length=None,
        rewrite=False,
        use_fraction=1,
        load_to_ram=False,
        debug=False,
        interpolate="none",
        node_features_type="zeros",
        debug_file_path=None,
        entry_type="biounit",  # biounit, chain, pair
        classes_to_exclude=None,  # heteromers, homomers, single_chains
        shuffle_clusters=True,
        min_cdr_length=None,
        feature_functions=None,
        classes_dict_path=None,
        cut_edges=False,
        mask_residues=True,
        lower_limit=15,
        upper_limit=100,
        mask_frac=None,
        mask_whole_chains=False,
        mask_sequential=False,
        force_binding_sites_frac=0.15,
        mask_all_cdrs=False,
        load_ligands=False,
        pyg_graph=False,
        patch_around_mask=False,
        initial_patch_size=128,
        antigen_patch_size=128,
        require_antigen=False,
        require_light_chain=False,
        require_no_light_chain=False,
        require_heavy_chain=False,
    ):
        """Initialize the dataset.

        Parameters
        ----------
        dataset_folder : str
            the path to the folder with proteinflow format input files (assumes that files are named {biounit_id}.pickle)
        features_folder : str, default "./data/tmp/"
            the path to the folder where the ProteinMPNN features will be saved
        clustering_dict_path : str, optional
            path to the pickled clustering dictionary (keys are cluster ids, values are (biounit id, chain id) tuples)
        max_length : int, optional
            entries with total length of chains larger than `max_length` will be disregarded
        rewrite : bool, default False
            if `False`, existing feature files are not overwritten
        use_fraction : float, default 1
            the fraction of the clusters to use (first N in alphabetic order)
        load_to_ram : bool, default False
            if `True`, the data will be stored in RAM (use with caution! if RAM isn't big enough the machine might crash)
        debug : bool, default False
            only process 1000 files
        interpolate : {"none", "only_middle", "all"}
            `"none"` for no interpolation, `"only_middle"` for only linear interpolation in the middle, `"all"` for linear interpolation + ends generation
        node_features_type : {"zeros", "dihedral", "sidechain_orientation", "chemical", "secondary_structure" or combinations with "+"}
            the type of node features, e.g. `"dihedral"` or `"sidechain_orientation+chemical"`
        debug_file_path : str, optional
            if not `None`, open this single file instead of loading the dataset
        entry_type : {"biounit", "chain", "pair"}
            the type of entries to generate (`"biounit"` for biounit-level complexes, `"chain"` for chain-level, `"pair"`
            for chain-chain pairs (all pairs that are seen in the same biounit and have intersecting coordinate clouds))
        classes_to_exclude : list of str, optional
            a list of classes to exclude from the dataset (select from `"single_chain"`, `"heteromer"`, `"homomer"`)
        shuffle_clusters : bool, default True
            if `True`, a new representative is randomly selected for each cluster at each epoch (if `clustering_dict_path` is given)
        min_cdr_length : int, optional
            for SAbDab datasets, biounits with CDRs shorter than `min_cdr_length` will be excluded
        feature_functions : dict, optional
            a dictionary of functions to compute additional features (keys are the names of the features, values are the functions)
        classes_dict_path : str, optional
            a path to a pickled dictionary with biounit classes (single chain / heteromer / homomer)
        cut_edges : bool, default False
            if `True`, missing values at the edges of the sequence will be cut off
        mask_residues : bool, default True
            if `True`, the masked residues will be added to the output
        lower_limit : int, default 15
            the lower limit of the number of residues to mask
        upper_limit : int, default 100
            the upper limit of the number of residues to mask
        mask_frac : float, optional
            if given, the number of residues to mask is `mask_frac` times the length of the chain
        mask_whole_chains : bool, default False
            if `True`, the whole chain is masked
        mask_sequential : bool, default False
            if `True`, the masked residues will be neighbors in the sequence; otherwise a geometric
            mask is applied based on the coordinates
        force_binding_sites_frac : float, default 0.15
            if `force_binding_sites_frac` > 0 and `mask_whole_chains` is `False`, in the fraction of cases where a chain
            from a polymer is sampled, the center
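
The docstring describes `use_fraction` as keeping the first N clusters in alphabetical order and `max_length` as discarding entries whose total chain length is too large. The sketch below illustrates those two rules on toy data. The helper names `select_clusters` and `filter_by_length`, the truncating rounding via `int()`, and the shape of the input dictionaries are assumptions made for illustration; they are not taken from the class itself.

```python
def select_clusters(clustering_dict, use_fraction=1):
    """Keep the first N cluster ids in alphabetical order (hypothetical helper).

    Rounding down with int() is an assumption; the real class may round differently.
    """
    cluster_ids = sorted(clustering_dict)
    n_keep = int(len(cluster_ids) * use_fraction)
    return {cid: clustering_dict[cid] for cid in cluster_ids[:n_keep]}


def filter_by_length(chain_lengths, max_length=None):
    """Drop entries whose summed chain length exceeds max_length (hypothetical helper).

    chain_lengths maps an entry id to a list of per-chain lengths.
    """
    if max_length is None:
        return dict(chain_lengths)
    return {
        entry_id: lengths
        for entry_id, lengths in chain_lengths.items()
        if sum(lengths) <= max_length
    }


if __name__ == "__main__":
    # Toy clustering dictionary: cluster id -> list of (biounit id, chain id) tuples.
    clusters = {
        "c2": [("1abc", "A")],
        "c1": [("2xyz", "A"), ("2xyz", "B")],
        "c4": [("3def", "A")],
        "c3": [("4ghi", "H")],
    }
    print(list(select_clusters(clusters, use_fraction=0.5)))

    # Toy per-entry chain lengths.
    lengths = {"1abc": [120, 80], "2xyz": [300, 250]}
    print(list(filter_by_length(lengths, max_length=500)))
```

Sorting before slicing keeps the selected subset deterministic across runs, which matches the "alphabetic order" wording in the docstring.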

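The masking parameters (`lower_limit`, `upper_limit`, `mask_frac`, `mask_sequential`) can be pictured with a small sketch. It assumes the mask size is drawn uniformly between the two limits when `mask_frac` is not given, and that a sequential mask is a single contiguous window; both are illustrative assumptions, and `sample_mask_size` and `sequential_mask` are hypothetical helpers rather than methods of the class.

```python
import random


def sample_mask_size(chain_length, lower_limit=15, upper_limit=100,
                     mask_frac=None, rng=None):
    """Choose how many residues to mask (hypothetical helper).

    If mask_frac is given, the size is mask_frac times the chain length.
    Otherwise it is drawn uniformly from [lower_limit, upper_limit]
    (uniform sampling is an assumption), clipped to the chain length.
    """
    rng = rng or random.Random()
    if mask_frac is not None:
        return min(int(mask_frac * chain_length), chain_length)
    high = min(upper_limit, chain_length)
    low = min(lower_limit, high)
    return rng.randint(low, high)


def sequential_mask(chain_length, size, rng=None):
    """Return a 0/1 list with one contiguous window of `size` ones (assumed layout)."""
    rng = rng or random.Random()
    start = rng.randint(0, chain_length - size)
    mask = [0] * chain_length
    for i in range(start, start + size):
        mask[i] = 1
    return mask


if __name__ == "__main__":
    rng = random.Random(0)
    size = sample_mask_size(200, rng=rng)
    print(size, sum(sequential_mask(200, size, rng=rng)))
    print(sample_mask_size(200, mask_frac=0.25))
```

The geometric (coordinate-based) mask mentioned for `mask_sequential=False` is not sketched here, since the source does not describe how it is computed.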