AF3: A Walkthrough of the from_pdb_string and from_mmcif_string Functions
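AlphaFold3's from_pdb_string and from_mmcif_string functions parse protein structure data in PDB and mmCIF format, respectively, and convert it into a Protein data class. They parse the PDB/mmCIF input with the PDBParser and MMCIFParser classes provided by Biopython, then call _from_bio_structure, which extracts atom coordinates, residue types, B-factors, and related information from the Structure object that Biopython produced, and finally returns a Protein object.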
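The per-residue extraction described above can be sketched in plain Python. This is a minimal illustration, not the AlphaFold implementation: encode_residue is a hypothetical helper, the lookup tables are abbreviated stand-ins for the ones in residue_constants, and lists are used in place of NumPy arrays so the block has no dependencies.

```python
# Illustrative stand-ins for the lookup tables in residue_constants.
# RESTYPES lists the 20 standard amino acids; the 3-letter table and the
# atom table are deliberately abbreviated for this sketch.
RESTYPES = ['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I',
            'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']
RESTYPE_ORDER = {name: i for i, name in enumerate(RESTYPES)}
RESTYPE_NUM = len(RESTYPES)  # 20, doubles as the index for unknown residues
RESTYPE_3TO1 = {'ALA': 'A', 'GLY': 'G', 'SER': 'S'}  # subset only
ATOM_TYPES = ['N', 'CA', 'C', 'CB', 'O']  # subset only
ATOM_ORDER = {name: i for i, name in enumerate(ATOM_TYPES)}


def encode_residue(resname, atoms):
    """Hypothetical helper: encode one residue into fixed-size slots.

    Args:
      resname: three-letter residue name, e.g. 'ALA'.
      atoms: iterable of (atom_name, (x, y, z), b_factor) tuples.

    Returns:
      (restype_idx, positions, mask, b_factors), where the last three
      have one entry per slot in ATOM_TYPES.
    """
    # Unknown three-letter codes fall back to 'X', which is not in
    # RESTYPE_ORDER and therefore maps to RESTYPE_NUM (the UNK index).
    shortname = RESTYPE_3TO1.get(resname, 'X')
    restype_idx = RESTYPE_ORDER.get(shortname, RESTYPE_NUM)

    positions = [[0.0, 0.0, 0.0] for _ in ATOM_TYPES]
    mask = [0.0] * len(ATOM_TYPES)
    b_factors = [0.0] * len(ATOM_TYPES)

    for name, coord, b_factor in atoms:
        if name not in ATOM_ORDER:
            continue  # non-standard atoms are ignored
        slot = ATOM_ORDER[name]
        positions[slot] = list(coord)
        mask[slot] = 1.0  # mark this atom slot as present
        b_factors[slot] = b_factor

    return restype_idx, positions, mask, b_factors


if __name__ == '__main__':
    idx, pos, mask, b = encode_residue(
        'ALA', [('N', (1.0, 2.0, 3.0), 10.0), ('XX', (0.0, 0.0, 0.0), 5.0)])
    print(idx, mask, b)
```

The fixed number of atom slots per residue is what lets every residue share the same array shape; the mask records which slots were actually observed in the file.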
Source code:
def _from_bio_structure(
    structure: Structure, chain_id: Optional[str] = None
) -> Protein:
    """Takes a Biopython structure and creates a `Protein` instance.

    WARNING: All non-standard residue types will be converted into UNK. All
      non-standard atoms will be ignored.

    Args:
      structure: Structure from the Biopython library.
      chain_id: If chain_id is specified (e.g. A), then only that chain is parsed.
        Otherwise all chains are parsed.

    Returns:
      A new `Protein` created from the structure contents.

    Raises:
      ValueError: If the number of models included in the structure is not 1.
      ValueError: If insertion code is detected at a residue.
    """
    models = list(structure.get_models())
    if len(models) != 1:
        raise ValueError(
            'Only single model PDBs/mmCIFs are supported. Found'
            f' {len(models)} models.'
        )
    model = models[0]

    atom_positions = []
    aatype = []
    atom_mask = []
    residue_index = []
    chain_ids = []
    b_factors = []

    for chain in model:
        if chain_id is not None and chain.id != chain_id:
            continue
        for res in chain:
            if res.id[2] != ' ':
                raise ValueError(
                    f'PDB/mmCIF contains an insertion code at chain {chain.id} and'
                    f' residue index {res.id[1]}. These are not supported.'
                )
            res_shortname = residue_constants.restype_3to1.get(res.resname, 'X')
            restype_idx = residue_constants.restype_order.get(
                res_shortname, residue_constants.restype_num)
            pos = np.zeros((residue_constants.atom_type_num, 3))
            mask = np.zeros((residue_constants.atom_type_num,))
            res_b_factors = np.zeros((residue_constants.atom_type_num,))
            for atom in res:
                if atom.name not in residue_constants.atom_types: