[Fsatom] Re: CML and macromolecules

Peter Murray-Rust pm286@cam.ac.uk
Sun, 19 Oct 2003 21:43:14 +0100


At 21:16 19/10/2003 +0200, David wrote:
>Hi fsatoms,

Great to hear from you, David - I gather you were at the tutorial.

>I'm ready for the first discussion! How about residue information in
>CML? I ran babel on a pdb file and got only abbreviated atom names and
>coordinates.

In its original incarnation (ca 1994 (sic)) CML had support for 
macromolecules. The philosophy was based on PDB and SwissProt. This included:
<SEQUENCE> for the sequence
<FEATURE> for all sequence/structure-based annotations

and <atom> had the attributes "residue" and "atomType"

Vestiges of these could be found in CML V1.0 and I wrote PDB readers and 
Swiss readers.

However, at that stage (a) we needed to concentrate more on systematising 
small molecules, (b) we thought that mmCIF-like approaches would come to 
replace the PDB approach (BTW I have been involved in the CIF effort and 
much of the CML philosophy steals from CIF), and (c) I would have to 
maintain it.

mmCIF has gradually been introduced into macromolecular crystallography and 
provides much more extended support for things like:
- biological unit
- definitions of the chemical entities in the structure
- mapping between entities
- labelling of atoms, residues

mmCIF was the basis for the submission to the OMG Life Sciences program 
(CML was the basis for the small molecules). We therefore felt that we 
shouldn't muddy the waters and duplicate the mm effort.

However I think there is still a strong "PDB-like" approach and I am happy 
to extend CML to manage that aspect of macromolecules. I think it's *not* 
useful for CML to try to model protein hierarchy 
(primary/secondary/supersecondary/tertiary/quaternary, etc.). However it 
could be useful to have a "flat-file" approach where the atoms had 
PDB-like info on:
- their PDB type (CA, SG, etc.)
- their PDB number
- their residue type
- the chain number.

CML could carry this, and more, but would not support the explicit 
hierarchy.

The result might look like:

<atom elementType="C" cmlx:residue="GLY13" cmlx:pdbNumber="23" 
cmlx:chain="B" x3="1.23".../>

where cmlx: is an extension CML namespace.
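
As a rough illustration of how a reader might fill in such an element, here is a minimal Python sketch that slices one fixed-column PDB ATOM record and writes the flat attributes. The namespace URI, the function names and the use of cmlx:atomType for the PDB atom name are assumptions made for this example only; none of them is defined by CML.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI: the message names the cmlx: prefix but gives no URI.
CMLX_NS = "http://example.org/cmlx"
ET.register_namespace("cmlx", CMLX_NS)


def parse_atom_record(line):
    """Slice one fixed-column PDB ATOM/HETATM record into flat fields."""
    return {
        "pdbNumber": int(line[6:11]),        # atom serial number, columns 7-11
        "atomType": line[12:16].strip(),     # PDB atom name, e.g. CA or SG
        "residue": line[17:20].strip() + line[22:26].strip(),  # e.g. GLY13
        "chain": line[21],                   # chain identifier, column 22
        "xyz": tuple(float(line[i:i + 8]) for i in (30, 38, 46)),
        "elementType": line[76:78].strip(),  # element symbol, columns 77-78
    }


def atom_element(fields):
    """Build an <atom> whose PDB-specific data lives in cmlx: attributes."""
    atom = ET.Element("atom", {"elementType": fields["elementType"]})
    for name in ("atomType", "residue", "pdbNumber", "chain"):
        # cmlx:atomType is an assumed attribute name for the PDB atom name.
        atom.set(f"{{{CMLX_NS}}}{name}", str(fields[name]))
    for axis, value in zip(("x3", "y3", "z3"), fields["xyz"]):
        atom.set(axis, f"{value:.3f}")
    return atom


# Sample record assembled field by field so the column widths stay visible.
record = (
    "ATOM  " + "   23" + " " + " CA " + " " + "GLY" + " " + "B" + "  13"
    + " " * 4 + "   1.230" + "   4.560" + "   7.890"
    + "  1.00" + " 20.00" + " " * 10 + " C"
)
fields = parse_atom_record(record)
xml_text = ET.tostring(atom_element(fields), encoding="unicode")
print(xml_text)
```

The residue name and number are joined into one token (GLY13) to match the example above; keeping them as two attributes would work equally well.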

This can be compacted to an array format like:

<atomArray elementType="C O N C C O N C C..."
cmlx:residue="GLY13 GLY13 GLY13 GLY13 ALA14..."
cmlx:pdbNumber="23 24 25 26..."/>

The array format can actually be more cost-effective in space than PDB.
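
To make the compaction concrete, here is a small Python sketch that packs per-atom records into whitespace-separated column strings and splits them back again. The function names and the sample atoms are illustrative assumptions, and the sketch relies on no value containing whitespace.

```python
def pack_columns(atoms, keys):
    """Join each per-atom field into one whitespace-separated string."""
    return {key: " ".join(str(atom[key]) for atom in atoms) for key in keys}


def unpack_columns(columns):
    """Split the column strings back into one dict per atom (values stay strings)."""
    keys = list(columns)
    rows = zip(*(columns[key].split() for key in keys))
    return [dict(zip(keys, row)) for row in rows]


# Illustrative data only: four backbone atoms of one residue and the first of the next.
atoms = [
    {"elementType": "N", "residue": "GLY13", "pdbNumber": 23},
    {"elementType": "C", "residue": "GLY13", "pdbNumber": 24},
    {"elementType": "C", "residue": "GLY13", "pdbNumber": 25},
    {"elementType": "O", "residue": "GLY13", "pdbNumber": 26},
    {"elementType": "N", "residue": "ALA14", "pdbNumber": 27},
]
columns = pack_columns(atoms, ["elementType", "residue", "pdbNumber"])
print(columns["residue"])
```

Each key of `columns` would become one attribute of the atomArray element, so the attribute names are written once instead of once per atom.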

What CML will not support is:

<protein>
   <biologicalUnit>
     <crystalUnit>
        <chain id="A">
           <residue>
              <atom>
              <atom>
        <chain id="B">
etc.

CML *can* support nested molecules so it is possible to write:

<molecule id="protein">
   <molecule id="chainA">
   <molecule id="chainB">
   <molecule id="ligand1">
   <molecule id="ligand2">
... and so on

but if you detail what you want we can see if it fits.
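
For the nested-molecule form, a minimal Python sketch of grouping flat atom records by chain might look like the following. The function name, the "chain" + letter id scheme and the placement of atoms inside an atomArray child are assumptions for illustration, not a statement of what CML requires.

```python
import xml.etree.ElementTree as ET


def nest_by_chain(atoms, protein_id="protein"):
    """Wrap flat atom records in one child <molecule> per chain."""
    protein = ET.Element("molecule", {"id": protein_id})
    arrays = {}  # chain id -> that chain's <atomArray> element
    for atom in atoms:
        chain = atom["chain"]
        if chain not in arrays:
            child = ET.SubElement(protein, "molecule", {"id": "chain" + chain})
            arrays[chain] = ET.SubElement(child, "atomArray")
        ET.SubElement(arrays[chain], "atom", {"elementType": atom["elementType"]})
    return protein


# Illustrative records only.
atoms = [
    {"elementType": "N", "chain": "A"},
    {"elementType": "C", "chain": "A"},
    {"elementType": "N", "chain": "B"},
]
protein = nest_by_chain(atoms)
print(ET.tostring(protein, encoding="unicode"))
```

Ligands could be added in the same way as further child molecules of the outer one.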

Best

Peter






>On the positive side, babel finds the chemical bonds and outputs them in
>the CML file. Are these 100 % reliable for correct structures?
>--
>David.
>________________________________________________________________________
>David van der Spoel, PhD, Assist. Prof., Molecular Biophysics group,
>Dept. of Cell and Molecular Biology, Uppsala University.
>Husargatan 3, Box 596,          75124 Uppsala, Sweden
>phone:  46 18 471 4205          fax: 46 18 511 755
>spoel@xray.bmc.uu.se    spoel@gromacs.org   http://xray.bmc.uu.se/~spoel
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++