Usage
=====

Usage Example
-------------

Let's parse the PDB entry '4ZMJ', which is a trimeric ectodomain construct of the HIV-1 envelope glycoprotein:

>>> from pidibble.pdbparse import PDBParser
>>> p = PDBParser(source_db='rcsb', source_id='4zmj').parse()

The ``PDBParser()`` call creates a new ``PDBParser`` object. The member function ``parse()`` then downloads (if necessary) the PDB file for the code given by the ``source_id`` keyword argument to ``PDBParser()`` and parses it into a member dictionary ``parsed``. (The file is downloaded from the RCSB only if it is not found in the current working directory.)

Alternatively, a ``PDBParser()`` invocation can fetch from the AlphaFold model database by setting ``source_db='alphafold'`` and passing the accession code as ``source_id``:

>>> p = PDBParser(source_db='alphafold', source_id='O46077').parse()

(Note that this fetches a model for the odor receptor OR2a from *D. melanogaster*. For the rest of this example, we'll work with the HIV-1 Env trimer 4zmj above.)

Finally, one can also retrieve entries from OPM:

>>> p = PDBParser(source_db='opm', source_id='7f1r').parse()

(Note that this fetches the sweet receptor with dummy atoms that denote locations of lipid headgroups.)
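
The "download only if missing" behavior described above can be pictured with a small standalone sketch. This is not ``pidibble``'s implementation: the helper names ``rcsb_url``, ``fetch_text``, and ``local_or_download``, the lowercase ``<id>.pdb`` file name, and the use of the RCSB file-download URL pattern are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


def rcsb_url(pdb_id: str) -> str:
    # Assumed URL pattern for legacy-format downloads from the RCSB.
    return f"https://files.rcsb.org/download/{pdb_id.upper()}.pdb"


def fetch_text(url: str) -> str:
    # Plain HTTP GET; only reached when the local file is missing.
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def local_or_download(pdb_id: str, workdir: Path = Path("."),
                      fetch: Callable[[str], str] = fetch_text) -> Path:
    """Return the path to <pdb_id>.pdb, downloading it only if absent."""
    target = workdir / f"{pdb_id.lower()}.pdb"  # hypothetical naming scheme
    if not target.exists():
        target.write_text(fetch(rcsb_url(pdb_id)))
    return target
```

Passing the downloader in as the ``fetch`` argument keeps the caching logic separate from the network call, so a second call for the same code never touches the network.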

The ``parsed`` member is a plain dictionary:

>>> type(p.parsed)
<class 'dict'>

We can easily ask what record types were parsed:

>>> list(sorted(list(p.parsed.keys())))
['ANISOU', 'ATOM', 'AUTHOR', 'CISPEP', 'COMPND', 'CONECT', 'CRYST1', 'DBREF', 'END', 'EXPDTA', 'FORMUL', 'HEADER', 'HELIX', 'HET', 'HETATM', 'HETNAM', 'JRNL.AUTH', 'JRNL.DOI', 'JRNL.PMID', 'JRNL.REF', 'JRNL.REFN', 'JRNL.TITL', 'KEYWDS', 'LINK', 'MASTER', 'ORIGX1', 'ORIGX2', 'ORIGX3', 'REMARK.100', 'REMARK.2', 'REMARK.200', 'REMARK.280', 'REMARK.290', 'REMARK.290.CRYSTSYMMTRANS', 'REMARK.3', 'REMARK.300', 'REMARK.350', 'REMARK.350.BIOMOLECULE1.TRANSFORM1', 'REMARK.4', 'REMARK.465', 'REMARK.500', 'REVDAT', 'SCALE1', 'SCALE2', 'SCALE3', 'SEQADV', 'SEQRES', 'SHEET', 'SOURCE', 'SSBOND', 'TER', 'TITLE']

Every value in ``p.parsed`` is either a single instance of the class ``PDBRecord`` or a *list* of ``PDBRecord`` instances. Let's see which ones are lists:

>>> [x for x,v in p.parsed.items() if type(v)==list]
['REVDAT', 'DBREF', 'SEQADV', 'SEQRES', 'HET', 'HETNAM', 'FORMUL', 'HELIX', 'SHEET', 'SSBOND', 'LINK', 'CISPEP', 'ATOM', 'ANISOU', 'TER', 'HETATM', 'CONECT']

These are the so-called *multiple-entry* records; conceptually, they signify objects that appear more than once in a structure or its metadata.
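
Because a value may be either one record or a list of records, code that walks the whole dictionary often wants to treat both cases the same way. The sketch below shows one way to do that. The helper ``as_list`` is hypothetical (it is not part of ``pidibble``), and plain strings stand in for ``PDBRecord`` instances so that the example runs on its own.

```python
def as_list(value):
    """Wrap a single record in a list; return lists unchanged."""
    return value if isinstance(value, list) else [value]


# Stand-in for p.parsed: strings take the place of PDBRecord instances.
parsed = {
    "HEADER": "header-record",
    "ATOM": ["atom-1", "atom-2"],
    "TITLE": "title-record",
}

# Count the records stored under each key, whatever the value's shape.
counts = {key: len(as_list(value)) for key, value in parsed.items()}

# Keys whose values are lists, i.e. the multiple-entry records.
multi = [key for key, value in parsed.items() if isinstance(value, list)]
```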
Other keys each have only a single ``PDBRecord`` instance:

>>> [x for x,v in p.parsed.items() if type(v)!=list]
['HEADER', 'TITLE', 'COMPND', 'SOURCE', 'KEYWDS', 'EXPDTA', 'AUTHOR', 'JRNL.AUTH', 'JRNL.TITL', 'JRNL.REF', 'JRNL.REFN', 'JRNL.PMID', 'JRNL.DOI', 'REMARK.2', 'REMARK.3', 'REMARK.4', 'REMARK.100', 'REMARK.200', 'REMARK.280', 'REMARK.290', 'REMARK.300', 'REMARK.350', 'REMARK.465', 'REMARK.500', 'CRYST1', 'ORIGX1', 'ORIGX2', 'ORIGX3', 'SCALE1', 'SCALE2', 'SCALE3', 'MASTER', 'END', 'REMARK.290.CRYSTSYMMTRANS', 'REMARK.350.BIOMOLECULE1.TRANSFORM1']
>>> type(p.parsed['HEADER'])

To get a feeling for what is in each record, use the ``pstr()`` method on any ``PDBRecord`` instance:

>>> header=p.parsed['HEADER']
>>> print(header.pstr())
HEADER
    classification: VIRAL PROTEIN
    depDate: 04-MAY-15
    idCode: 4ZMJ

The format of this output tells you the instance attributes and their values:

>>> header.classification
'VIRAL PROTEIN'
>>> header.depDate
'04-MAY-15'
>>> atoms=p.parsed['ATOM']
>>> len(atoms)
4518

Have a look at the first atom:

>>> print(atoms[0].pstr())
ATOM
    serial: 1
    name: N
    altLoc:
    residue: resName: LEU; chainID: G; seqNum: 34; iCode:
    x: -0.092
    y: 99.33
    z: 57.967
    occupancy: 1.0
    tempFactor: 137.71
    element: N
    charge:

Pidibble also parses any transformations needed to generate biological assemblies:

>>> b=p.parsed['REMARK.350.BIOMOLECULE1.TRANSFORM1']
>>> print(b.pstr())
REMARK.350.BIOMOLECULE1.TRANSFORM1
    label: BIOMT, BIOMT, BIOMT
    coordinate: 1, 2, 3
    divnumber: 1, 1, 1
    row: [m1: 1.0; m2: 0.0; m3: 0.0; t: 0.0], [m1: 0.0; m2: 1.0; m3: 0.0; t: 0.0], [m1: 0.0; m2: 0.0; m3: 1.0; t: 0.0]
    header: G, B, A, C, D
    tokens:
        AUTHOR DETERMINED BIOLOGICAL UNIT: HEXAMERIC
        SOFTWARE DETERMINED QUATERNARY STRUCTURE: HEXAMERIC
        SOFTWARE USED: PISA
        TOTAL BURIED SURFACE AREA: 44090 ANGSTROM**2
        SURFACE AREA OF THE COMPLEX: 82270 ANGSTROM**2
        CHANGE IN SOLVENT FREE ENERGY: 81.0 KCAL/MOL

The ``header`` instance attribute for any transform subrecord in a type-350 REMARK is the list of
chains to which all transform(s) are applied to generate this biological assembly. Passing that record to the helper function ``get_symm_ops()`` returns ``numpy`` array versions of its rotation matrix and translation vector:

>>> from pidibble.pdbparse import get_symm_ops
>>> M,T=get_symm_ops(b)
>>> print(str(M))
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
>>> print(str(T))
[0. 0. 0.]
>>> b=p.parsed['REMARK.350.BIOMOLECULE1.TRANSFORM2']
>>> M,T=get_symm_ops(b)
>>> print(str(M))
[[-0.5      -0.866025  0.      ]
 [ 0.866025 -0.5       0.      ]
 [ 0.        0.        1.      ]]
>>> print(str(T))
[107.18    185.64121   0.     ]
>>> b=p.parsed['REMARK.350.BIOMOLECULE1.TRANSFORM3']
>>> M,T=get_symm_ops(b)
>>> print(str(M))
[[-0.5       0.866025  0.      ]
 [-0.866025 -0.5       0.      ]
 [ 0.        0.        1.      ]]
>>> print(str(T))
[-107.18     185.64121    0.     ]

You may recognize these matrices as rotations by 0, 120, and 240 degrees about the z-axis, which together generate an object with threefold (C3) rotational symmetry. Each rotation is also accompanied by a translation, returned here as ``T``.

Because many entries in the RCSB do not have "legacy" PDB files and instead only have the (now standard) mmCIF/PDBx format files, ``pidibble`` can also generate parsed objects from these files.
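
To make the role of each rotation and translation pair concrete, here is a minimal sketch that applies the three transforms quoted above to the first atom of 4zmj, computing ``M @ x + T`` for each. The function ``apply_transform`` is a hypothetical helper, not part of ``pidibble``. It uses plain indexing, so it accepts nested lists as well as the arrays returned by ``get_symm_ops()``.

```python
import math


def apply_transform(M, T, xyz):
    """Return M @ xyz + T for one 3D point (plain sequences or arrays)."""
    return [sum(M[i][j] * xyz[j] for j in range(3)) + T[i] for i in range(3)]


# Rotation matrices and translations copied from the BIOMT records above.
TRANSFORM1 = ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
              [0.0, 0.0, 0.0])
TRANSFORM2 = ([[-0.5, -0.866025, 0.0], [0.866025, -0.5, 0.0], [0.0, 0.0, 1.0]],
              [107.18, 185.64121, 0.0])
TRANSFORM3 = ([[-0.5, 0.866025, 0.0], [-0.866025, -0.5, 0.0], [0.0, 0.0, 1.0]],
              [-107.18, 185.64121, 0.0])

# Coordinates of the first ATOM record (chain G, LEU 34, atom N).
atom = [-0.092, 99.33, 57.967]

# One image of the atom per protomer of the trimer.
copies = [apply_transform(M, T, atom)
          for M, T in (TRANSFORM1, TRANSFORM2, TRANSFORM3)]

# Pairwise distances between the three images.
d12 = math.dist(copies[0], copies[1])
d23 = math.dist(copies[1], copies[2])
d13 = math.dist(copies[0], copies[2])
```

The three images keep the same z-coordinate and sit at the corners of an equilateral triangle (to within the precision of the BIOMT records), which is what a threefold axis parallel to z should produce.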
This is activated by passing the value ``mmCIF`` for the ``input_format`` keyword argument to the ``PDBParser`` constructor:

>>> from pidibble.pdbparse import PDBParser
>>> p=PDBParser(PDBcode='4tvp',input_format='mmCIF').parse()
>>> b=p.parsed['REMARK.350.BIOMOLECULE1.TRANSFORM2']
>>> print(b.pstr())
REMARK.350.BIOMOLECULE1.TRANSFORM2
    BIOMOLECULE: 1
    tmp_label: BIOMOLECULE1.TRANSFORM2
    header: A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T
    divnumber: 2
    TRANSFORM: 2
    row1: m1: -0.5; m2: -0.8660254038; m3: 0.0; t: -515.56
    row2: m1: 0.8660254038; m2: -0.5; m3: 0.0; t: 0.0
    row3: m1: 0.0; m2: 0.0; m3: 1.0; t: 0.0
    row: [m1: -0.5; m2: -0.8660254038; m3: 0.0; t: -515.56], [m1: 0.8660254038; m2: -0.5; m3: 0.0; t: 0.0], [m1: 0.0; m2: 0.0; m3: 1.0; t: 0.0]
    coordinate: 1, 2, 3

We can compare this to the ``REMARK.350.BIOMOLECULE1.TRANSFORM2`` record from the analogous PDB file:

>>> p=PDBParser(PDBcode='4tvp',input_format='PDB').parse()
>>> b=p.parsed['REMARK.350.BIOMOLECULE1.TRANSFORM2']
>>> print(b.pstr())
REMARK.350.BIOMOLECULE1.TRANSFORM2
    label: BIOMT, BIOMT, BIOMT
    coordinate: 1, 2, 3
    divnumber: 2, 2, 2
    row: [m1: -0.5; m2: -0.866025; m3: 0.0; t: -515.56], [m1: 0.866025; m2: -0.5; m3: 0.0; t: 0.0], [m1: 0.0; m2: 0.0; m3: 1.0; t: 0.0]
    header: G, B, L, H, D, E, A, C, F, I, J, K, M, N, O, P, Q, R, S, T

Note that the important attributes, ``row`` and ``header``, agree between the two records. The ``header`` lists are in different orders but contain the same elements, and the ``row`` values differ only in that the floating-point values read from the ``mmCIF`` file carry greater precision.
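
The comparison above can be spelled out in a few lines. This sketch copies the ``header`` and ``row`` values from the two listings into plain Python lists (it does not call ``pidibble``). The helper names and the tolerance of ``1e-5`` are assumptions chosen to match the six-decimal precision of the PDB-format record.

```python
import math

# Chain lists copied from the two pstr() listings above.
CIF_HEADER = "A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T".split(", ")
PDB_HEADER = "G, B, L, H, D, E, A, C, F, I, J, K, M, N, O, P, Q, R, S, T".split(", ")

# Each row is [m1, m2, m3, t], again copied from the listings.
CIF_ROWS = [[-0.5, -0.8660254038, 0.0, -515.56],
            [0.8660254038, -0.5, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0]]
PDB_ROWS = [[-0.5, -0.866025, 0.0, -515.56],
            [0.866025, -0.5, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0]]


def same_chains(a, b):
    """Chain lists match if they hold the same IDs, in any order."""
    return set(a) == set(b)


def same_rows(a, b, tol=1e-5):
    """Rows match if every element agrees to within tol."""
    return all(math.isclose(x, y, abs_tol=tol)
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))


chains_match = same_chains(CIF_HEADER, PDB_HEADER)
rows_match = same_rows(CIF_ROWS, PDB_ROWS)
rows_identical = CIF_ROWS == PDB_ROWS
```

Exact equality fails because of the extra digits in the mmCIF values, which is why the row comparison uses a tolerance.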

Currently, only ``ATOM``, ``HETATM``, ``LINK``, ``SSBOND``, ``SEQADV``, ``REMARK 350``, and ``REMARK 465`` records are translated from a ``mmCIF``-format file:

>>> ', '.join(list(p.parsed.keys()))
'ATOM, HETATM, LINK, SSBOND, SEQADV, REMARK.350.BIOMOLECULE1.TRANSFORM1, REMARK.350.BIOMOLECULE1.TRANSFORM2, REMARK.350.BIOMOLECULE1.TRANSFORM3, REMARK.465'

These records are the bare minimum needed to generate (say) input coordinate and topology files for an MD simulation. Future versions of ``pidibble`` may provide more complete PDB-like parsings of ``mmCIF`` files, although the remaining records are probably of limited use.

Importantly, ``pidibble`` parses mmCIF input into a structure equivalent to what the PDB-format file would give; that is, it uses the ``auth`` fields rather than the ``label`` fields.
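
To illustrate what choosing ``auth`` over ``label`` means in practice, the sketch below maps one ``_atom_site`` row onto the PDB-style attribute names seen in the ``ATOM`` listing earlier. The mmCIF item names are standard, but the row values are made up for the example, and ``pdb_like_fields`` is a hypothetical helper rather than ``pidibble``'s own code.

```python
# One illustrative _atom_site row. The values are invented for this example;
# note that the label and auth chain IDs and residue numbers differ.
row = {
    "label_atom_id": "N", "auth_atom_id": "N",
    "label_comp_id": "LEU", "auth_comp_id": "LEU",
    "label_asym_id": "A", "auth_asym_id": "G",
    "label_seq_id": "4", "auth_seq_id": "34",
}


def pdb_like_fields(row):
    """Build PDB-style atom fields from the author-assigned (auth) items."""
    return {
        "name": row["auth_atom_id"],
        "resName": row["auth_comp_id"],
        "chainID": row["auth_asym_id"],
        "seqNum": int(row["auth_seq_id"]),
    }


fields = pdb_like_fields(row)
```

Using the ``auth`` items keeps chain IDs and residue numbers consistent with the legacy PDB file and with the numbering used in the associated publication.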