Usage
=====

Usage Example
-------------

Let's parse the PDB entry '4ZMJ', which is a trimeric ectodomain construct of the HIV-1 envelope glycoprotein:

>>> from pidibble.pdbparse import PDBParser
>>> p = PDBParser(source_db='rcsb', source_id='4zmj').parse()

The ``PDBParser()`` call creates a new ``PDBParser`` object, and the member function ``parse()`` executes (optionally) downloading the PDB file of the code entered with the ``PDBcode`` keyword argument to ``PDBParser()``, followed by parsing into a member dictionary ``parsed``.  (The file is downloaded from the RSCB only if it is not found in the current working directory.)

Alternatively, a ``PDBParser()`` invocation can fetch from the AlphaFold model database by providing the accession code with the ``alphafold`` keyword:

>>> p = PDBParser(source_db='alphafold', sourced_id='O46077').parse()

(Note that this fetches a model for the odor receptor OR2a from *D. melanogaster*.  For the rest of this example, we'll work with the HIV-1 Env trimer 4zmj above.)

Finally, one can also retrieve entries from OPM:

>>> p = PDBParser(source_db='opm', source_id='7f1r').parse()

(Note that this fetches the sweet receptor with dummy atoms that denote locations of lipid headgroups.)

>>> type(p.parsed)
<class 'dict'>

We can easily ask what record types were parsed:

>>> list(sorted(list(p.parsed.keys())))
['ANISOU', 'ATOM', 'AUTHOR', 'CISPEP', 'COMPND', 'CONECT', 'CRYST1', 'DBREF', 'END', 'EXPDTA', 'FORMUL', 'HEADER', 'HELIX', 'HET', 'HETATM', 'HETNAM', 'JRNL.AUTH', 'JRNL.DOI', 'JRNL.PMID', 'JRNL.REF', 'JRNL.REFN', 'JRNL.TITL', 'KEYWDS', 'LINK', 'MASTER', 'ORIGX1', 'ORIGX2', 'ORIGX3', 'REMARK.100', 'REMARK.2', 'REMARK.200', 'REMARK.280', 'REMARK.290', 'REMARK.290.CRYSTSYMMTRANS', 'REMARK.3', 'REMARK.300', 'REMARK.350', 'REMARK.350.BIOMOLECULE1.TRANSFORM1', 'REMARK.4', 'REMARK.465', 'REMARK.500', 'REVDAT', 'SCALE1', 'SCALE2', 'SCALE3', 'SEQADV', 'SEQRES', 'SHEET', 'SOURCE', 'SSBOND', 'TER', 'TITLE']

Every value in ``p.parsed[]`` is either a single instance of the class ``PDBRecord`` or a *list* of ``PDBRecords``.  Let's see which ones are lists:

>>> [x for x,v in p.parsed.items() if type(v)==list]
['REVDAT', 'DBREF', 'SEQADV', 'SEQRES', 'HET', 'HETNAM', 'FORMUL', 'HELIX', 'SHEET', 'SSBOND', 'LINK', 'CISPEP', 'ATOM', 'ANISOU', 'TER', 'HETATM', 'CONECT']

These are the so-called *multiple-entry* records; conceptually, they signify objects that appear more than once in a structure or it metadata.  Other keys each have only a single ``PDBRecord`` instance:

>>> [x for x,v in p.parsed.items() if type(v)!=list] 
['HEADER', 'TITLE', 'COMPND', 'SOURCE', 'KEYWDS', 'EXPDTA', 'AUTHOR', 'JRNL.AUTH', 'JRNL.TITL', 'JRNL.REF', 'JRNL.REFN', 'JRNL.PMID', 'JRNL.DOI', 'REMARK.2', 'REMARK.3', 'REMARK.4', 'REMARK.100', 'REMARK.200', 'REMARK.280', 'REMARK.290', 'REMARK.300', 'REMARK.350', 'REMARK.465', 'REMARK.500', 'CRYST1', 'ORIGX1', 'ORIGX2', 'ORIGX3', 'SCALE1', 'SCALE2', 'SCALE3', 'MASTER', 'END', 'REMARK.290.CRYSTSYMMTRANS', 'REMARK.350.BIOMOLECULE1.TRANSFORM1']
>>> type(p.parsed['HEADER'])
<class 'pidibble.pdbrecord.PDBRecord'>
>>> 

To get a feeling for what is in each record, use the ``pstr()`` method on any ``PDBRecord`` instance: 

>>> header=p.parsed['HEADER']
>>> print(header.pstr())
HEADER
      classification: VIRAL PROTEIN
             depDate: 04-MAY-15
              idCode: 4ZMJ

The format of this output tells you the instance attributes and their values:

>>> header.classification
'VIRAL PROTEIN'
>>> header.depDate
'04-MAY-15'
>>> atoms=p.parsed['ATOM']
>>> len(atoms)
4518

Have a look at the first atom:

>>> print(atoms[0].pstr())
ATOM
              serial: 1
                name: N
              altLoc: 
             residue: resName: LEU; chainID: G; seqNum: 34; iCode: 
                   x: -0.092
                   y: 99.33
                   z: 57.967
           occupancy: 1.0
          tempFactor: 137.71
             element: N
              charge: 

Pidibble also parses any transformations needed to generate biological assemblies:

>>> b=p.parsed['REMARK.350.BIOMOLECULE1.TRANSFORM1']
>>> print(b.pstr())
REMARK.350.BIOMOLECULE1.TRANSFORM1
               label: BIOMT, BIOMT, BIOMT
          coordinate: 1, 2, 3
           divnumber: 1, 1, 1
                 row: [m1: 1.0; m2: 0.0; m3: 0.0; t: 0.0], [m1: 0.0; m2: 1.0; m3: 0.0; t: 0.0], [m1: 0.0; m2: 0.0; m3: 1.0; t: 0.0]
              header: G, B, A, C, D
              tokens:
AUTHOR DETERMINED BIOLOGICAL UNIT:  HEXAMERIC
SOFTWARE DETERMINED QUATERNARY STRUCTURE:  HEXAMERIC
            SOFTWARE USED:  PISA
TOTAL BURIED SURFACE AREA:  44090 ANGSTROM**2
SURFACE AREA OF THE COMPLEX:  82270 ANGSTROM**2
CHANGE IN SOLVENT FREE ENERGY:  81.0 KCAL/MOL

The ``header`` instance attribute for any transform subrecord in a type-350 REMARK is the list of chains to which all transform(s) are
applied to generate this biological assembly.  If we send that record to the accessory method ``get_symm_ops()``, we can get ``numpy.array()`` versions of any matrices:

>>> from pidibble.pdbparse import get_symm_ops
>>> M,T=get_symm_ops(b)
>>> print(str(M))
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
>>> print(str(T))
[0. 0. 0.]
>>> b=p.parsed['REMARK.350.BIOMOLECULE1.TRANSFORM2']
>>> M,T=get_symm_ops(b)
>>> print(str(M))
[[-0.5      -0.866025  0.      ]
 [ 0.866025 -0.5       0.      ]
 [ 0.        0.        1.      ]]
>>> print(str(T))
[107.18    185.64121   0.     ]
>>> b=p.parsed['REMARK.350.BIOMOLECULE1.TRANSFORM3']
>>> M,T=get_symm_ops(b)
>>> print(str(M))
[[-0.5       0.866025  0.      ]
 [-0.866025 -0.5       0.      ]
 [ 0.        0.        1.      ]]
>>> print(str(T))
[-107.18     185.64121    0.     ]

You may recognize these rotation matrices as those that generate an object with C3v symmetry.  Each rotation is also accompanied by a translation, here in the ``Tlist`` object.

Because many entries in the RCSB do not have "legacy" PDB files and instead only have the (now standard) mmCIF/PDBx format files, ``pidibble`` can also generate parsed objects from these files.  This is activated by specifying a value ``mmCIF`` for the ``input_format`` keyword argument to the ``PDBParser`` generator:

>>> from pidibble.pdbparse import PDBParser
>>> p=PDBParser(PDBcode='4tvp',input_format='mmCIF').parse()
>>> b=p.parsed['REMARK.350.BIOMOLECULE1.TRANSFORM2']
>>> print(b.pstr())
REMARK.350.BIOMOLECULE1.TRANSFORM2
         BIOMOLECULE: 1
           tmp_label: BIOMOLECULE1.TRANSFORM2
              header: A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T
           divnumber: 2
           TRANSFORM: 2
                row1: m1: -0.5; m2: -0.8660254038; m3: 0.0; t: -515.56
                row2: m1: 0.8660254038; m2: -0.5; m3: 0.0; t: 0.0
                row3: m1: 0.0; m2: 0.0; m3: 1.0; t: 0.0
                 row: [m1: -0.5; m2: -0.8660254038; m3: 0.0; t: -515.56], [m1: 0.8660254038; m2: -0.5; m3: 0.0; t: 0.0], [m1: 0.0; m2: 0.0; m3: 1.0; t: 0.0]
          coordinate: 1, 2, 3

We can compare this to the ``REMARK.350.BIOMOLECULE1.TRANSFORM2`` record from the analogous PDB file:

>>> p=PDBParser(PDBcode='4tvp',input_format='PDB').parse()
>>> b=p.parsed['REMARK.350.BIOMOLECULE1.TRANSFORM2']
>>> print(b.pstr())
REMARK.350.BIOMOLECULE1.TRANSFORM2
               label: BIOMT, BIOMT, BIOMT
          coordinate: 1, 2, 3
           divnumber: 2, 2, 2
                 row: [m1: -0.5; m2: -0.866025; m3: 0.0; t: -515.56], [m1: 0.866025; m2: -0.5; m3: 0.0; t: 0.0], [m1: 0.0; m2: 0.0; m3: 1.0; t: 0.0]
              header: G, B, L, H, D, E, A, C, F, I, J, K, M, N, O, P, Q, R, S, T

Note that the important attributes of ``row`` and ``header`` are the same (in ``header``'s case, the lists are in different orders but they have the same elements).  Note the greater precision in the floating-point values for the record read in from the ``mmCIF`` file.

Currently, only ``ATOM``, ``HETATM``, ``SEQADV``, ``REMARK 350``, and ``REMARK 465`` records are translated from a ``mmCIF``-format file: 

>>> ', '.join(list(p.parsed.keys()))
'ATOM, HETATM, LINK, SSBOND, SEQADV, REMARK.350.BIOMOLECULE1.TRANSFORM1, REMARK.350.BIOMOLECULE1.TRANSFORM2, REMARK.350.BIOMOLECULE1.TRANSFORM3, REMARK.465'

These records are the bare minimum needed to generate (say) input coordinate and topology files for an MD simulation.  Future versions of ``pidibble`` will provide complete PDB-like parsings of ``mmCIF`` files.  This is probably not useful.

Importantly:  ``pidibble`` parses mmCIF input to generate a structure that is the equivalent of the PDB format; that is, it uses ``auth`` fields instead of ``label`` fields.