Usage

Usage Example

Let’s parse the PDB entry ‘4ZMJ’, which is a trimeric ectodomain construct of the HIV-1 envelope glycoprotein:

>>> from pidibble.pdbparse import PDBParser
>>> p = PDBParser(source_db='rcsb', source_id='4zmj').parse()

The PDBParser() call creates a new PDBParser object. Its parse() method first downloads, if necessary, the file for the entry named by the source_id keyword argument from the database named by source_db, and then parses that file into the member dictionary parsed. (The file is downloaded from the RCSB only if it is not found in the current working directory.)

Alternatively, a PDBParser() invocation can fetch from the AlphaFold model database by setting source_db to 'alphafold' and passing the accession code as source_id:

>>> p = PDBParser(source_db='alphafold', source_id='O46077').parse()

(Note that this fetches a model for the odor receptor OR2a from D. melanogaster. For the rest of this example, we’ll work with the HIV-1 Env trimer 4zmj above.)

Finally, one can also retrieve entries from OPM:

>>> p = PDBParser(source_db='opm', source_id='7f1r').parse()

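(Note that this fetches the sweet receptor with dummy atoms that denote locations of lipid headgroups.)

With p again holding the 4zmj parse from the first example, the parsed member is an ordinary dictionary: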

>>> type(p.parsed)
<class 'dict'>

We can easily ask what record types were parsed:

>>> sorted(p.parsed.keys())
['ANISOU', 'ATOM', 'AUTHOR', 'CISPEP', 'COMPND', 'CONECT', 'CRYST1', 'DBREF', 'END', 'EXPDTA', 'FORMUL', 'HEADER', 'HELIX', 'HET', 'HETATM', 'HETNAM', 'JRNL.AUTH', 'JRNL.DOI', 'JRNL.PMID', 'JRNL.REF', 'JRNL.REFN', 'JRNL.TITL', 'KEYWDS', 'LINK', 'MASTER', 'ORIGX1', 'ORIGX2', 'ORIGX3', 'REMARK.100', 'REMARK.2', 'REMARK.200', 'REMARK.280', 'REMARK.290', 'REMARK.290.CRYSTSYMMTRANS', 'REMARK.3', 'REMARK.300', 'REMARK.350', 'REMARK.350.BIOMOLECULE1.TRANSFORM1', 'REMARK.4', 'REMARK.465', 'REMARK.500', 'REVDAT', 'SCALE1', 'SCALE2', 'SCALE3', 'SEQADV', 'SEQRES', 'SHEET', 'SOURCE', 'SSBOND', 'TER', 'TITLE']

Every value in p.parsed is either a single instance of the class PDBRecord or a list of PDBRecord instances. Let’s see which ones are lists:

>>> [x for x,v in p.parsed.items() if type(v)==list]
['REVDAT', 'DBREF', 'SEQADV', 'SEQRES', 'HET', 'HETNAM', 'FORMUL', 'HELIX', 'SHEET', 'SSBOND', 'LINK', 'CISPEP', 'ATOM', 'ANISOU', 'TER', 'HETATM', 'CONECT']

These are the so-called multiple-entry records; conceptually, they signify objects that appear more than once in a structure or its metadata. Each of the other keys has only a single PDBRecord instance:

>>> [x for x,v in p.parsed.items() if type(v)!=list]
['HEADER', 'TITLE', 'COMPND', 'SOURCE', 'KEYWDS', 'EXPDTA', 'AUTHOR', 'JRNL.AUTH', 'JRNL.TITL', 'JRNL.REF', 'JRNL.REFN', 'JRNL.PMID', 'JRNL.DOI', 'REMARK.2', 'REMARK.3', 'REMARK.4', 'REMARK.100', 'REMARK.200', 'REMARK.280', 'REMARK.290', 'REMARK.300', 'REMARK.350', 'REMARK.465', 'REMARK.500', 'CRYST1', 'ORIGX1', 'ORIGX2', 'ORIGX3', 'SCALE1', 'SCALE2', 'SCALE3', 'MASTER', 'END', 'REMARK.290.CRYSTSYMMTRANS', 'REMARK.350.BIOMOLECULE1.TRANSFORM1']
>>> type(p.parsed['HEADER'])
<class 'pidibble.pdbrecord.PDBRecord'>
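
Because a value may be either one record or a list of records, code that loops over records often wants to treat both cases the same way. The following is a minimal sketch of such a normalizer. The helper name records_of is hypothetical (it is not part of pidibble), and a plain dictionary of strings stands in for p.parsed so that the snippet runs on its own.

```python
def records_of(parsed, key):
    """Return the records stored under key as a list, whatever their multiplicity."""
    value = parsed.get(key)
    if value is None:
        return []  # record type absent from this entry
    return value if isinstance(value, list) else [value]


# Stand-in for p.parsed (assumption: real values are PDBRecord objects, not strings)
demo_parsed = {"HEADER": "header-record", "ATOM": ["atom-1", "atom-2"]}

for key in ("HEADER", "ATOM", "SSBOND"):
    print(key, len(records_of(demo_parsed, key)))
```

With a real parse, records_of(p.parsed, 'HEADER') would give a one-element list and records_of(p.parsed, 'ATOM') the full list of atoms, so the same loop can handle both.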

To get a feeling for what is in each record, use the pstr() method on any PDBRecord instance:

>>> header=p.parsed['HEADER']
>>> print(header.pstr())
HEADER
      classification: VIRAL PROTEIN
             depDate: 04-MAY-15
              idCode: 4ZMJ

The format of this output tells you the instance attributes and their values:

>>> header.classification
'VIRAL PROTEIN'
>>> header.depDate
'04-MAY-15'
>>> atoms=p.parsed['ATOM']
>>> len(atoms)
4518

Have a look at the first atom:

>>> print(atoms[0].pstr())
ATOM
              serial: 1
                name: N
              altLoc:
             residue: resName: LEU; chainID: G; seqNum: 34; iCode:
                   x: -0.092
                   y: 99.33
                   z: 57.967
           occupancy: 1.0
          tempFactor: 137.71
             element: N
              charge:
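
Since the pstr() output mirrors the attribute names, simple summaries over the atom list follow directly. Below is a small sketch that counts atoms per chain. It assumes that the nested residue field is reachable as atom.residue.chainID, which is inferred from the printed layout rather than confirmed here; the stand-in records built with SimpleNamespace are purely illustrative.

```python
from collections import Counter
from types import SimpleNamespace


def atoms_per_chain(atoms):
    """Count atom records by the chainID found in each record's residue field."""
    return Counter(atom.residue.chainID for atom in atoms)


def fake_atom(serial, chain):
    # Hypothetical stand-in mimicking the attribute layout printed by pstr()
    return SimpleNamespace(serial=serial, residue=SimpleNamespace(chainID=chain))


demo_atoms = [fake_atom(1, "G"), fake_atom(2, "G"), fake_atom(3, "B")]
counts = atoms_per_chain(demo_atoms)
print(dict(counts))
```

If the assumption holds, atoms_per_chain(p.parsed['ATOM']) would give the same kind of tally for the real entry.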

Pidibble also parses any transformations needed to generate biological assemblies:

>>> b=p.parsed['REMARK.350.BIOMOLECULE1.TRANSFORM1']
>>> print(b.pstr())
REMARK.350.BIOMOLECULE1.TRANSFORM1
               label: BIOMT, BIOMT, BIOMT
          coordinate: 1, 2, 3
           divnumber: 1, 1, 1
                 row: [m1: 1.0; m2: 0.0; m3: 0.0; t: 0.0], [m1: 0.0; m2: 1.0; m3: 0.0; t: 0.0], [m1: 0.0; m2: 0.0; m3: 1.0; t: 0.0]
              header: G, B, A, C, D
              tokens:
AUTHOR DETERMINED BIOLOGICAL UNIT:  HEXAMERIC
SOFTWARE DETERMINED QUATERNARY STRUCTURE:  HEXAMERIC
            SOFTWARE USED:  PISA
TOTAL BURIED SURFACE AREA:  44090 ANGSTROM**2
SURFACE AREA OF THE COMPLEX:  82270 ANGSTROM**2
CHANGE IN SOLVENT FREE ENERGY:  81.0 KCAL/MOL

The header attribute of any transform subrecord in a type-350 REMARK lists the chains to which the transform is applied to generate this biological assembly. Passing that record to the helper function get_symm_ops() returns the rotation matrix and the translation vector as numpy arrays:

>>> from pidibble.pdbparse import get_symm_ops
>>> M,T=get_symm_ops(b)
>>> print(str(M))
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
>>> print(str(T))
[0. 0. 0.]
>>> b=p.parsed['REMARK.350.BIOMOLECULE1.TRANSFORM2']
>>> M,T=get_symm_ops(b)
>>> print(str(M))
[[-0.5      -0.866025  0.      ]
 [ 0.866025 -0.5       0.      ]
 [ 0.        0.        1.      ]]
>>> print(str(T))
[107.18    185.64121   0.     ]
>>> b=p.parsed['REMARK.350.BIOMOLECULE1.TRANSFORM3']
>>> M,T=get_symm_ops(b)
>>> print(str(M))
[[-0.5       0.866025  0.      ]
 [-0.866025 -0.5       0.      ]
 [ 0.        0.        1.      ]]
>>> print(str(T))
[-107.18     185.64121    0.     ]

You may recognize these matrices as rotations of 0°, 120°, and 240° about the z axis, which together generate an object with C3 (threefold rotational) symmetry. Each rotation is accompanied by a translation, returned here as the vector T.
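
To make the role of each pair concrete: a transform maps a coordinate x to M·x + T. The sketch below applies the second transform to the first atom of 4zmj, using values copied from the output above and plain Python lists so that it runs without pidibble or numpy. The function name apply_transform is hypothetical.

```python
# Rotation and translation copied from the TRANSFORM2 output shown above
M = [[-0.5, -0.866025, 0.0],
     [0.866025, -0.5, 0.0],
     [0.0, 0.0, 1.0]]
T = [107.18, 185.64121, 0.0]


def apply_transform(M, T, xyz):
    """Return M applied to the 3-vector xyz, plus the translation T."""
    return [sum(M[i][j] * xyz[j] for j in range(3)) + T[i] for i in range(3)]


first_atom = [-0.092, 99.33, 57.967]  # x, y, z of atoms[0] shown earlier
image = apply_transform(M, T, first_atom)
print(image)

# Applying a threefold operation three times should return the starting point,
# up to the six-digit precision of the matrix entries.
point = first_atom
for _ in range(3):
    point = apply_transform(M, T, point)
print(point)
```

With the numpy arrays returned by get_symm_ops(), the same operation on an N-by-3 coordinate array would be coords @ M.T + T.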

Because many entries in the RCSB do not have “legacy” PDB files and instead only have the (now standard) mmCIF/PDBx format files, pidibble can also generate parsed objects from these files. This is activated by passing the value 'mmCIF' for the input_format keyword argument to the PDBParser constructor:

>>> from pidibble.pdbparse import PDBParser
>>> p=PDBParser(PDBcode='4tvp',input_format='mmCIF').parse()
>>> b=p.parsed['REMARK.350.BIOMOLECULE1.TRANSFORM2']
>>> print(b.pstr())
REMARK.350.BIOMOLECULE1.TRANSFORM2
         BIOMOLECULE: 1
           tmp_label: BIOMOLECULE1.TRANSFORM2
              header: A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T
           divnumber: 2
           TRANSFORM: 2
                row1: m1: -0.5; m2: -0.8660254038; m3: 0.0; t: -515.56
                row2: m1: 0.8660254038; m2: -0.5; m3: 0.0; t: 0.0
                row3: m1: 0.0; m2: 0.0; m3: 1.0; t: 0.0
                 row: [m1: -0.5; m2: -0.8660254038; m3: 0.0; t: -515.56], [m1: 0.8660254038; m2: -0.5; m3: 0.0; t: 0.0], [m1: 0.0; m2: 0.0; m3: 1.0; t: 0.0]
          coordinate: 1, 2, 3

We can compare this to the REMARK.350.BIOMOLECULE1.TRANSFORM2 record from the analogous PDB file:

>>> p=PDBParser(PDBcode='4tvp',input_format='PDB').parse()
>>> b=p.parsed['REMARK.350.BIOMOLECULE1.TRANSFORM2']
>>> print(b.pstr())
REMARK.350.BIOMOLECULE1.TRANSFORM2
               label: BIOMT, BIOMT, BIOMT
          coordinate: 1, 2, 3
           divnumber: 2, 2, 2
                 row: [m1: -0.5; m2: -0.866025; m3: 0.0; t: -515.56], [m1: 0.866025; m2: -0.5; m3: 0.0; t: 0.0], [m1: 0.0; m2: 0.0; m3: 1.0; t: 0.0]
              header: G, B, L, H, D, E, A, C, F, I, J, K, M, N, O, P, Q, R, S, T

The attributes that matter, row and header, agree between the two records: header lists the same chains in a different order, and the row values match to the precision available in the PDB file. The record read from the mmCIF file carries its floating-point values to greater precision.
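
A comparison like this can also be done programmatically. The following sketch uses the values printed above, copied into plain lists, and checks chain membership as a set and the matrix rows with a tolerance. The tolerance of 1e-6 is an assumption chosen to match the six decimal places found in the PDB file.

```python
import math

# Values copied from the two pstr() outputs above
pdb_header = "G, B, L, H, D, E, A, C, F, I, J, K, M, N, O, P, Q, R, S, T".split(", ")
cif_header = "A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T".split(", ")

pdb_rows = [[-0.5, -0.866025, 0.0, -515.56],
            [0.866025, -0.5, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0]]
cif_rows = [[-0.5, -0.8660254038, 0.0, -515.56],
            [0.8660254038, -0.5, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0]]

# Chain order differs between formats, so compare as sets
same_chains = set(pdb_header) == set(cif_header)

# Compare every matrix and translation element within the assumed tolerance
same_rows = all(
    math.isclose(a, b, abs_tol=1e-6)
    for row_a, row_b in zip(pdb_rows, cif_rows)
    for a, b in zip(row_a, row_b)
)

print(same_chains, same_rows)
```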

As of version 1.7.0, pidibble translates a broad set of record types from mmCIF/PDBx files. Re-parsing the mmCIF entry and listing its keys:

>>> p=PDBParser(PDBcode='4tvp',input_format='mmCIF').parse()
>>> ', '.join(list(p.parsed.keys()))
'ATOM, HETATM, LINK, SSBOND, SEQADV, REMARK.350.BIOMOLECULE1.TRANSFORM1, REMARK.350.BIOMOLECULE1.TRANSFORM2, REMARK.350.BIOMOLECULE1.TRANSFORM3, REMARK.465, HEADER, TITLE, EXPDTA, KEYWDS, CRYST1, SEQRES, HELIX, SHEET, COMPND, SOURCE'

This covers coordinates (ATOM/HETATM), connectivity (LINK — including metal coordination — and SSBOND), sequence (SEQRES), secondary structure (HELIX and SHEET), biological assemblies (REMARK 350), missing residues (REMARK 465), sequence-database differences (SEQADV), header/metadata (HEADER, TITLE, EXPDTA, KEYWDS, CRYST1), and entity/source information (COMPND/SOURCE). Each mmCIF-derived record mirrors the attribute names of its PDB counterpart and is validated field-by-field against the corresponding legacy-PDB parse.

At parse time pidibble also logs (at INFO) how many of the file’s mmCIF categories it read and how many are present but unmapped, so you can see what data a given entry carries beyond what is surfaced.
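
Those INFO messages are only displayed if logging is configured to show them. Here is a minimal sketch, assuming that pidibble emits them through Python's standard logging module. The logger name pidibble uses is not stated here, so the sketch configures the root logger; the "demo" logger exists only to show the effect.

```python
import logging

# force=True replaces any handlers installed earlier in the session
logging.basicConfig(
    level=logging.INFO,
    format="%(name)s %(levelname)s: %(message)s",
    force=True,
)

# Hypothetical logger, used only to show that INFO output is now visible
logging.getLogger("demo").info("INFO messages are now displayed")
```

Any parse() call made after this configuration would then print its INFO-level messages, provided the assumption about the logging module holds.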

There are two deliberate departures from exact PDB equivalence. COMPND and SOURCE are emitted as flat per-entity records (for example, p.parsed['COMPND'][0].molID, .chains, .molName) rather than the nested token-group structure the legacy-PDB parser builds, so consumers must branch on input format for those two records. A few purely representational fields also differ where the formats themselves differ — for instance HEADER.depDate is exposed in mmCIF’s native ISO form (2014-06-27) rather than the PDB DD-MON-YY form.
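
Code that must accept records from either input format can normalize such representational differences at the point of use. As an illustration, the sketch below converts either date form into a datetime.date. The function name normalize_dep_date is hypothetical, and the rule for expanding two-digit years (70 and above means 19xx, otherwise 20xx) is an assumption, not something pidibble defines.

```python
from datetime import date

_MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
           "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def normalize_dep_date(value):
    """Accept ISO 'YYYY-MM-DD' or PDB 'DD-MON-YY' and return a date object."""
    parts = value.strip().split("-")
    if len(parts) != 3:
        raise ValueError(f"unrecognized date: {value!r}")
    if len(parts[0]) == 4:
        # mmCIF-derived records: native ISO form
        return date.fromisoformat(value.strip())
    day, mon, yy = parts
    month = _MONTHS.index(mon.upper()) + 1
    year = int(yy)
    # Assumed pivot for two-digit years
    year += 1900 if year >= 70 else 2000
    return date(year, month, int(day))


print(normalize_dep_date("04-MAY-15"))
print(normalize_dep_date("2014-06-27"))
```

A similar small adapter could hide the COMPND and SOURCE difference, but its shape depends on the nested structure of the legacy records, which is not shown here.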

Importantly, pidibble parses mmCIF input into the equivalent of the PDB-format structure; that is, it uses the author-assigned auth fields (for example auth_asym_id and auth_seq_id) rather than the label fields.