Today my PI gave me the following task:
- Parse the alpha carbons from a PDB file
- Scan the sequence using a window of a given size
- For each window, collect all residues within a certain distance (called the "context")
- Split the context into contiguous chains
If you want to follow along, use the full code sample in the appendix and download and extract the sample PDB file: 1YU0.pdb.
Parsing
The first thing I had to do was to take a protein structure and parse it into its alpha carbons. The input format is a Protein Data Bank (i.e. PDB) file, which has a well documented file format found here. I'm interested in the ATOM record, which specifies the coordinates of a single atom in the PDB file:
COLUMNS DATA TYPE FIELD DEFINITION ---------------------------------------------------------------- 1 - 6 Record name "ATOM " 7 - 11 Integer serial Atom serial number. 13 - 16 Atom name Atom name. 17 Character altLoc Alternate location ind 18 - 20 Residue name resName Residue name. 22 Character chainID Chain identifier. 23 - 26 Integer resSeq Residue sequence numbe 27 AChar iCode Code for insertion of 31 - 38 Real(8.3) x Coordinates for X in A 39 - 46 Real(8.3) y Coordinates for Y in A 47 - 54 Real(8.3) z Coordinates for Z in A 55 - 60 Real(6.2) occupancy Occupancy. 61 - 66 Real(6.2) tempFactor Temperature factor. 77 - 78 LString(2) element Element symbol, right- 79 - 80 LString(2) charge Charge on the atom.There are seven fields I have to parse out from an Atom record:
- resName: Alpha carbons have this set to " CA ". Each residue has exactly one alpha carbon which is often used as a coarse-grained proxy for the entire residue.
- chainID: Chains are contiguous segments of proteins, labeled by a single letter.
- resSeq: Residues are the individual building blocks of proteins. Each residue in a chain is sequentially numbered for unique identification.
- altLoc: Residues can have multiple coordinates, and I arbitrarily pick the first set of coordinates.
- x, y, and z: The atom's coordinates.
import Numeric.LinearAlgebra -- from the "hmatrix" package type Point = Vector Double data Atom = Atom { chainID :: Char , resSeq :: Int , coord :: Point } deriving (Show)Here I use a type synonym to document what I want to use coord for. This has the added advantage that I can use a different Point type later on (such as a length-indexed type), and I need only update the type synonym to update my type signatures.
Now I need a way to parse an atom record, and I'll use attoparsec:
{-# LANGUAGE OverloadedStrings #-} import Control.Monad import Data.Attoparsec.Char8 as P hiding (skipWhile) import Data.Attoparsec (skipWhile) skip n = void $ P.take n decimal' = skipSpace >> decimal double' = skipSpace >> double pAtom :: Parser Atom pAtom = do string "ATOM " -- Must be an "ATOM " record skip 6 string " CA " -- Must be an alpha carbon satisfy (inClass " A") -- First conformation (' ' or 'A') skip 4 _chainID <- anyChar _resSeq <- decimal' skip 1 x <- double' y <- double' z <- double' skip 26 endOfLine return $ Atom { chainID = _chainID, resSeq = _resSeq, coord = fromList [x, y, z] }That was pretty easy! Notice how I rename P.take to skip as a sort of in-line comment to document my intention to discard the output. I also use void to force it to discard the result and match the intention.
The next step is to create a parser for the entire file that throws out all lines that do not match:
import Control.Applicative import Data.Maybe pLine = skipWhile (not . isEndOfLine) *> endOfLine pPDB :: Parser [Atom] pPDB = catMaybes <$> (many $ (Just <$> pAtom) <|> (Nothing <$ pLine) )Now, the awesome thing about Haskell is that you can very quickly test things out in ghci, so let's fire up the module and see if it works:
*Main> import qualified Data.ByteString.Char8 as B *Main B> str <- B.readFile "/path/to/1YU0.pdb" *Main B> parseOnly pPDB str ... sSeq = 376, coord = fromList [5.274,108.854,3.352]},Atom {chainI D = 'A', resSeq = 377, coord = fromList [5.643,111.164,6.349]},A tom {chainID = 'A', resSeq = 378, coord = fromList [4.35,109.87, 9.685]},Atom {chainID = 'A', resSeq = 379, coord = fromList [3.2 33,112.127,12.52]},Atom {chainID = 'A', resSeq = 380, coord = fr omList [1.9729999999999999,110.268,15.618]}]Great! It works! (Note: It actually is not correct because it doesn't remove multiple protein models for NMR structures, but I'm glossing over that for now).
Distances
I can't really calculate distance cutoffs without some sort of a distance function:
import Data.Function (on) dist :: Point -> Point -> Double dist v1 v2 = norm2 (v1 - v2) distA :: Atom -> Atom -> Double distA = dist `on` coordon is a very handy utility to extend binary functions. More importantly, it makes the code read more like English. The definition for distA reads like saying "Take the distance on the coord field".
Context
Now I have to associate each atom with its neighbors within some distance. Before I write a potentially confusing function, I write out the type to focus my mind:
type Context = [Atom] type Cutoff = Double addContext :: Cutoff -> [Atom] -> [(Atom, Context)]This function takes a list of atoms, and associates each atom with a list of its neighbors within some distance cutoff. I will be joining these lists of neighbors later on, so really a more efficient data structure would be a Set, but this is good enough for now, and I can optimize it later if necessary.
Also, note the use of a type synonyms as a quick and dirty documentation. A well-chosen type synonym can go a long way towards making your code easy to understand. Also, if I go back and decide to use Set, I only need to change the type synonym definition to:
type Context = Set Atom.. and all my type signatures will be fixed.
Anyway, let's compute some neighbors:
for = flip fmap addContext cutoff as = for as $ \a1 -> (a1, filter (\a2 -> distA a1 a2 < cutoff) as)This is definitely not an efficient algorithm (a HashMap-based binning algorithm would work faster), but I will let it slide for now since this is just a simple script. Also, notice that it does not eliminate an atom from its own neighbor list. This is because the atom will be reintroduced later when joining contexts, so I postpone doing this.
Windows
My PI wants a sliding window of 7 residues and a context generated for each window. Rather than manually slide the window in an imperative style, I instead generate the set of all windows:
import Data.List windows :: Int -> [a] -> [[a]] windows n = filter (\xs -> length xs == n) . map (Prelude.take n) . tailsFor each given window, I have to join the contexts together and eliminate duplicates. This requires defining what a duplicate is:
duplicate a1 a2 = chainID a1 == chainID a2 && resSeq a1 == resSeq a2I could have also used on:
duplicate a1 a2 = ((==) `on` chainID) a1 a2 && ((==) `on` resSeq ) a1 a2Now I need to specify what it means to join contexts. I write out a type signature to be precise about what I want:
joinContexts :: [(Atom, Context)] -> ([Atom], Context)joinContexts takes a window of atoms and their individual contexts, and joins all the contexts into one, leaving behind just a list of atoms:
joinContexts ps = (self, nonSelf) where self = map fst ps context = concat $ map snd ps totalContext = nubBy duplicate context nonSelf = deleteFirstsBy duplicate totalContext selfUsing let or where clauses is another great way to annotate intermediate steps in a function, although that will be pretty obvious to most readers.
Chains
Finally, I have to split each context into chains, defined as consecutive residues with the same chainID. I have a utility function for this purpose that I keep lying around, defined as:
import Data.Monoid import Data.Ord sequential a1 a2 = chainID a1 == chainID a2 && (resSeq a1 + 1 == resSeq a2) type Chain = [Atom] chains' :: ([Atom] -> [Atom]) -> [Atom] -> [Chain] chains' c [] = [c [] ] chains' c [a] = [c [a]] chains' c (a1:bs@(a2:as)) | sequential a1 a2 = chains' c' bs | otherwise = (c' []):chains' id bs where c' = c . (a1:) -- The actual function I use chains :: [Atom] -> [Chain] chains = chains' id . sortBy (comparing chainID `thenBy` comparing resSeq) where thenBy = mappendI want to draw attention to the sortBy call in the chains function. This takes advantage of two very useful tricks. The first is the comparing function from Data.Ord. The second is the Monoid trick to compose two orderings, giving priority to the first one. Combining the two tricks gives code that reads like English: "Sort by comparing chainID first, then by comparing resSeq".
The last step is that I have to take each window and the chains in its context and pair the window with each chain. If that didn't make sense, I'm pretty sure the type signature would make it clear:
type Window = [Atom] distribute :: (Window, [Chain]) -> [(Window, Chain)]This shows how Haskell type signatures can often be "self-documenting" and can say a lot of things that we might have difficulty articulating clearly in words:
-- most polymorphic type :: (a, [b]) -> [(a, b)] distribute (x, ys) = map ((,) x) ys
Combining Everything
Now I combine all the small functions I wrote into one pipeline:
import Control.Arrow (second) type WindowSize = Int pairs :: WindowSize -> Cutoff -> [Atom] -> [(Window, Chain)] pairs windowSize cutoff = concat . map (distribute . second chains . joinContexts) . windows windowSize . addContext cutoffThe above function pipeline clearly illustrates the flow of information and makes it easy to reason about what my code actually does. I could even use it as a high-level specification of my program. Reading from bottom to top it says that you:
- Attach contexts
- Partition the sequence into windows
- For each window:
- Join the contexts
- Group the joined context into chains
- Distribute the window over each chain
- Concatenate all the results
import Control.Category import Prelude hiding ((.), id) pairs windowSize cutoff = addContext cutoff >>> windows windowSize >>> map (joinContexts >>> second chains >>> distribute) >>> concatA compositional style where you chain lots of small functions makes the final overarching program much easier to follow.
All that remains is to quickly test out the code in ghci:
*Main> import qualified Data.ByteString.Char8 as B *Main B> str <- B.readFile "/path/to/1YU0.pdb" *Main B> let atoms = parseOnly pPDB str *Main B> -- I use fmap since parseOnly returns an Either *Main B> let ps = fmap (pairs 7 12.0) atoms *Main B> -- How many pairs? *Main B> fmap length ps Right 3559 *Main B> -- What is the largest contiguous chain in a context? *Main B> let size = length . snd *Main B> let largest = fmap (maximumBy (comparing size)) ps *Main B> fmap size largest Right 38 *Main B> -- Which residues did it span? *Main B> fmap (map resSeq . snd) largest Right [337,338,339,340,341,342,343,344,345,346,347,348,349,350, 351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366 ,367,368,369,370,371,372,373,374] *Main B> -- What was the window it matched? *Main B> fmap (map resSeq . fst) largest Right [304,305,306,307,308,309,310]I then load up the protein in PyMOL to inspect the matched ranges and, sure enough, it works correctly: In the image I highlighted the matched window in yellow and the matched context in purple.
Conclusion
So I hope this shows some of the tricks I use to improve code clarity. I left out a lot of type signatures for brevity because this was a one-off script and I only wanted to draw attention to certain functions, but if I were really serious about fully documenting the code I would include all type signatures.
Appendix: Full code
{-# LANGUAGE OverloadedStrings #-} import Control.Applicative import Control.Arrow (second) import Control.Monad import Data.Attoparsec.Char8 as P hiding (skipWhile) import Data.Attoparsec (skipWhile) import Data.Function (on) import Data.List import Data.Maybe import Data.Monoid import Data.Ord import Numeric.LinearAlgebra -- from the "hmatrix" package type Point = Vector Double data Atom = Atom { chainID :: Char, resSeq :: Int, coord :: Point } deriving (Show) skip n = void $ P.take n decimal' = skipSpace >> decimal double' = skipSpace >> P.double pAtom :: Parser Atom pAtom = do string "ATOM " -- Must be an "ATOM " record skip 6 string " CA " -- Must be an alpha carbon satisfy (inClass " A") -- First confirmation (' ' or 'A') skip 4 _chainID <- anyChar _resSeq <- decimal' skip 1 x <- double' y <- double' z <- double' skip 26 endOfLine return $ Atom { chainID = _chainID, resSeq = _resSeq, coord = fromList [x, y, z] } pLine = skipWhile (not . isEndOfLine) *> endOfLine pPDB :: Parser [Atom] pPDB = catMaybes <$> ( many $ (Just <$> pAtom) <|> (Nothing <$ pLine)) dist :: Point -> Point -> Double dist v1 v2 = norm2 (v1 - v2) distA :: Atom -> Atom -> Double distA = dist `on` coord type Context = [Atom] type Cutoff = Double for = flip fmap addContext :: Cutoff -> [Atom] -> [(Atom, Context)] addContext cutoff as = for as $ \a1 -> (a1, filter (\a2 -> distA a1 a2 < cutoff) as) windows :: Int -> [a] -> [[a]] windows n = filter (\xs -> length xs == n) . map (Prelude.take n) . tails duplicate a1 a2 = chainID a1 == chainID a2 && resSeq a1 == resSeq a2 joinContexts :: [(Atom, Context)] -> ([Atom], Context) joinContexts ps = (self, nonSelf) where self = map fst ps context = concat $ map snd ps totalContext = nubBy duplicate context nonSelf = deleteFirstsBy duplicate totalContext self sequential a1 a2 = chainID a1 == chainID a2 && (resSeq a1 + 1 == resSeq a2) type Chain = [Atom] chains' :: ([Atom] -> [Atom]) -> [Atom] -> [Chain] chains' c [] = [c [] ] chains' c [a] = [c [a]] chains' c (a1:bs@(a2:as)) | sequential a1 a2 = chains' c' bs | otherwise = (c' []):chains' id bs where c' = c . (a1:) -- The actual function I use chains :: [Atom] -> [Chain] chains = chains' id . sortBy (comparing chainID `thenBy` comparing resSeq) where thenBy = mappend type Window = [Atom] distribute :: (Window, [Chain]) -> [(Window, Chain)] distribute (x, ys) = map ((,) x) ys type WindowSize = Int pairs :: WindowSize -> Cutoff -> [Atom] -> [(Window, Chain)] pairs windowSize cutoff = concat . map (distribute . second chains . joinContexts) . windows windowSize . addContext cutoff
No comments:
Post a Comment