You can download a copy of the LabelHash tools and run them on your own computer or cluster. Below we will describe how you use these tools once you have them installed.
The LabelHash algorithm consists of two stages: a preprocessing stage and a matching stage. During the preprocessing stage a hash table is built for a set of target PDB files. This has to be done only once for a given set of targets. During the matching stage partial matches can be looked up quickly in this table for almost any motif. The matching algorithm computes complete matches and their statistical significance.
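The workflow for using the LabelHash command line programs consists of the following steps: obtaining PDB files, creating LabelHash tables for the targets, creating a motif file and a match options file, matching the motif against the tables, and analyzing the matches. We will describe these steps in detail below.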
LabelHash works on (gzipped) PDB files, so you need to download some PDB files for motifs, a background data set, and homologs. You need to set the environment variable PDBPATH to the directories where you store the PDB files. The PDB files can be stored all in one directory or in the kind of ‘divided’ directory structure that the PDB uses. If bash is your shell, add something like this to your ${HOME}/.bash_profile:
export PDBPATH=${HOME}/pdb
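To check that PDBPATH is set up as intended, a small lookup helper can be useful. The following is a minimal sketch, not part of LabelHash: it assumes that PDBPATH may list several directories separated by the platform's path separator, that files are named like pdb1abc.ent.gz in lowercase, and that the ‘divided’ layout uses the middle two characters of the PDB ID as the subdirectory name (as the PDB archive does). The function name find_pdb_file is hypothetical, and LabelHash's own search order may differ.

```python
import os

def find_pdb_file(pdb_id, pdbpath=None):
    """Return the path of pdb<id>.ent.gz under PDBPATH, or None if not found.

    Assumption: PDBPATH is a list of directories joined by os.pathsep.
    """
    if pdbpath is None:
        pdbpath = os.environ.get("PDBPATH", "")
    pdb_id = pdb_id.lower()
    filename = "pdb%s.ent.gz" % pdb_id
    for root in pdbpath.split(os.pathsep):
        if not root:
            continue
        candidates = [
            os.path.join(root, filename),               # all files in one directory
            os.path.join(root, pdb_id[1:3], filename),  # 'divided' layout, e.g. ad/pdb1ady.ent.gz
        ]
        for path in candidates:
            if os.path.isfile(path):
                return path
    return None
```

For example, find_pdb_file("1ady") returns the path of the file if it can be found in either layout, and None otherwise.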
The script labelhash_input.py is used to create (1) input XML files for table creation, (2) match option files, and (3) motif files. Below we will describe basic usage. Run “labelhash_input.py --help” to see all options.
To match a motif, you first need to create LabelHash tables of the PDB files that are to be used as matching targets (unless you have downloaded a LabelHash table from the LabelHash web site). First, create a text file with all the PDB files you want to build tables for:
echo 1ady:A 1adj 1qe0:A > pdblist.txt
Note that you can optionally specify a chain. By default, all chains are used. If you have a directory full of files in the format pdb1abc.ent.gz, then you could use something like this:
cd ~/pdb; \ls -1 *.ent.gz | sed 's/^pdb//;s/\.ent\.gz$//' > pdblist.txt
(The sed command strips the prefix ‘pdb’ and the suffix ‘.ent.gz’ to obtain a list of pdb id’s.) The next step is to create XML input files with the labelhash_input.py script:
labelhash_input.py -t table -o pdb.xml pdblist.txt
The script will create pdb.xml, which will be used in the next step. There are many other parameters to this script with which you can change the default settings. The last step is creating the LabelHash tables:
create pdb.xml
After “create” is finished you should now have the file pdb.lhash3. You can run create on multiple cores like so:
mpirun -np 8 create pdb.xml
Here, ‘8’ is the number of parallel processes to be used. For best performance the number of processes should be more or less equal to the number of available cores.
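The table creation steps above can also be scripted. The sketch below is an illustration under stated assumptions, not part of LabelHash: it derives PDB IDs from file names of the form pdb1abc.ent.gz and builds the argument list for “create”, using mpirun with one process per available core unless told otherwise. The function names pdb_ids_from_filenames and create_command are hypothetical; only the “create” and “mpirun -np N create” invocations are taken from the text above.

```python
import os
import re

def pdb_ids_from_filenames(filenames):
    """Extract PDB IDs from names like pdb1abc.ent.gz, skipping anything else."""
    ids = []
    for name in filenames:
        m = re.match(r"^pdb(\w{4})\.ent\.gz$", os.path.basename(name))
        if m:
            ids.append(m.group(1))
    return sorted(ids)

def create_command(xml_file, nprocs=None):
    """Build the argument list for 'create', via mpirun when nprocs > 1."""
    if nprocs is None:
        # Default to one process per available core.
        nprocs = os.cpu_count() or 1
    if nprocs > 1:
        return ["mpirun", "-np", str(nprocs), "create", xml_file]
    return ["create", xml_file]
```

The resulting IDs can be written to pdblist.txt (whitespace-separated), and the list returned by create_command("pdb.xml") can be passed to subprocess.call once pdb.xml has been generated.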
For large LabelHash tables the datasets stored in them can become very fragmented, which will negatively affect the performance of the matching program. To ‘defragment’ your LabelHash file you can use h5repack, a tool that is part of the HDF5 package, like so:
h5repack oldpdb.lhash3 newpdb.lhash3
This can sometimes take a very long time (days!), but if you plan to use the table often, it can be worth it. After h5repack is finished, you can delete oldpdb.lhash3.
To match a motif, you first need to create a motif XML file and a matching options XML file. A motif can be created like so:
labelhash_input.py -t motif -o 1ady.xml 1ady 81:ED 83:T 112:RS 130:ED 264:YL 311:RNKQ
The script expects the name of a PDB file followed by a number of residues, specified by the residue sequence ID. Each residue sequence ID can optionally be followed by a colon and a number of one-letter residue names, which are taken to be all allowed residue labels for that motif point. If the residue names are omitted, the residue label in the PDB file is used.
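To make the residue syntax concrete, here is a minimal sketch of how one such argument could be interpreted. It is not the parser used by labelhash_input.py; the function name parse_motif_point is hypothetical, and the sketch assumes that residue sequence IDs are plain integers (no insertion codes) and that labels are uppercase one-letter codes.

```python
def parse_motif_point(spec):
    """Split 'SEQID[:LABELS]' into (residue sequence ID, allowed labels or None).

    None means: use the residue label found in the PDB file.
    """
    seq_id, sep, labels = spec.partition(":")
    if not seq_id.lstrip("-").isdigit():
        raise ValueError("invalid residue sequence ID in %r" % spec)
    if sep and not (labels.isalpha() and labels.isupper()):
        raise ValueError("invalid residue labels in %r" % spec)
    return int(seq_id), (set(labels) if sep else None)
```

With this sketch, "311:RNKQ" is read as residue 311 with the allowed labels R, N, K, and Q, while a bare "83" is read as residue 83 with whatever label the PDB file gives it.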
The options file can be created like so:
labelhash_input.py -t matchoptions -o options.xml
The values for all the options in options.xml can be changed with optional parameters to the labelhash_input.py script. The motif can be matched against the targets in the LabelHash table computed before:
match options.xml 1ady.xml pdb.lhash3
The matches will by default be saved in 1ady-matches.xml. You can specify a different output file like so:
match -o myfile.xml options.xml 1ady.xml pdb.lhash3
Analogous to “create”, you can run “match” in parallel using mpirun.
If you only want to match against specific targets, you can specify them as additional arguments:
match -o myfile.xml options.xml 1ady.xml pdb.lhash3 1adj 1kmm:A pdbids.txt
The “1adj” argument is interpreted as “all chains in 1adj,” “1kmm:A” means only chain A of 1kmm will be matched, and, finally, any argument ending in “.txt” is interpreted to be a plain text file with a whitespace-separated list of target names.
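The interpretation of these target arguments can be sketched as follows. This is an illustration of the rules just described, not the code used by “match”; the function name expand_targets and its read_text parameter (a hook for supplying file contents) are hypothetical. A chain of None stands for “all chains”.

```python
def expand_targets(args, read_text=None):
    """Expand match target arguments into (pdb_id, chain_or_None) pairs."""
    if read_text is None:
        def read_text(path):
            with open(path) as handle:
                return handle.read()
    targets = []
    for arg in args:
        if arg.endswith(".txt"):
            # Plain text file with a whitespace-separated list of target names.
            names = read_text(arg).split()
        else:
            names = [arg]
        for name in names:
            pdb_id, sep, chain = name.partition(":")
            targets.append((pdb_id, chain if sep else None))
    return targets
```

Note that this sketch does not expand file names that appear inside a target list file; whether “match” does so is not specified here.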
The labelhash_input.py script can be used to create input files for table creation and the matching program. The default settings should work well for a variety of motifs, but many aspects can be controlled through the following options:
Global options: | |
-h, --help | show this help message and exit |
-t TYPE, --type=TYPE | Type of XML file to create (table, motif, or matchoptions) |
-o OUTFILE, --outfile=OUTFILE | Name for the XML output [default: standard output] |
Reference set parameters: These options are effective only for the types table and matchoptions. | |
--maxmindistance=MAXMINDISTANCE | Max min distance between point in ref. set [default: 16] |
--maxdiameter=MAXDIAMETER | Max diameter of points in ref. set [default: 25] |
--maxmindepth=MAXMINDEPTH | Max min distance between point in ref. set and molecular surface [default: 1.6] |
--maxdepth=MAXDEPTH | Max distance between point in ref. set and molecular surface [default: 3.1] |
LabelHash table creation options: | |
--nosplitchains | Treat multiple chains in a PDB file as one point sequence |
--allmodels | Include all models in a PDB file |
--nnradius=NNRADIUS | Radius for precomputing nearest neighbors [default: 14.0] |
--writefraction=WRITEFRACTION | Fraction of processes involved in writing data [default: 0.01] |
Match options: | |
--radius=RADIUS | The max. distance to look for match augmentation points [default: 7] |
--maxdistance=MAXDISTANCE | The maximum allowed LRMSD for a complete match [default: 7] |
--numneighbors=NUMNEIGHBORS | The maximum number of nearest neighbors to consider for one step of match augmentation [default: 50] |
--numtries=NUMTRIES | The number of different reference sets to try [default: 15] |
--maxmatches=MAXMATCHES | The max. number of matches to return [default: 10000000] |
--minscore=MINSCORE | The minimum score required for a valid match [default: -1] |
--maxscore=MAXSCORE | The maximum score allowed for a valid match [default: 10000000] |
--maxpvalue=MAXPVALUE | Keep only matches whose pvalue does not exceed maxpvalue [default: 1] |
--keepmultiplematches | Keep multiple matches per target |
--keepbadpartial | Keep matches that can be augmented to larger matches |
--nopointweight | Do not use statistical pointweight correction |
--geometricsieving | Instead of regular matching perform Geometric Sieving |
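When many option files are generated from a script, it can help to assemble the labelhash_input.py command line programmatically. The sketch below is one possible way to do this and is not part of LabelHash: the function name input_command is hypothetical, option names are passed through without being checked against the table above, and positional arguments (such as pdblist.txt or the motif residues) would still have to be appended by the caller.

```python
def input_command(file_type, outfile, **options):
    """Build an argument list for labelhash_input.py.

    True values become bare flags (--name); other values become --name=value.
    Options set to False or None are left out.
    """
    if file_type not in ("table", "motif", "matchoptions"):
        raise ValueError("unknown type: %r" % file_type)
    cmd = ["labelhash_input.py", "-t", file_type, "-o", outfile]
    for name, value in sorted(options.items()):
        if value is True:
            cmd.append("--" + name)
        elif value is not False and value is not None:
            cmd.append("--%s=%s" % (name, value))
    return cmd
```

For instance, input_command("matchoptions", "options.xml", maxpvalue=0.01, keepmultiplematches=True) yields an argument list that keeps multiple matches per target and discards matches with a p-value above 0.01.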
The “match” program produces a compressed XML file with all the matches. This file can be opened with the Chimera ViewMatch plugin. If you want to look at any of the XML files produced by LabelHash from the command line, use the command “xmllint --format matches.xml”. The “xmllint” program is part of libxml2. It is already installed on OS X, while on Linux it is part of a package called libxml2-utils, which you can install through your package manager. The XML file is structured as follows:
To analyze the XML output you can use the LabelHash Python module. It allows you to read in a match file like so:
from LabelHash import *

matchlist = MatchList('matches.xml')
total_rmsd = 0.0
for match in matchlist:
    # Report a match to chain A of 1did, if there is one
    if str(match.id()) == '1did:0:A':
        print('found a match to 1did:A!')
    # Accumulate the RMSD of every match to compute the average
    total_rmsd = total_rmsd + match.rmsd()
print('average rmsd = %f' % (total_rmsd / len(matchlist)))
More examples can be found in the labelhash_test.py script.
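Beyond the average, the median RMSD is often a useful summary. The following sketch assumes only that each match object has an rmsd() method, as in the example above; the function name summarize_rmsds is hypothetical and not part of the LabelHash module. It works on any iterable of such objects, for example a MatchList.

```python
from statistics import median

def summarize_rmsds(matches):
    """Return count, mean, and median of match.rmsd() values, or None if empty."""
    values = [match.rmsd() for match in matches]
    if not values:
        return None
    return {
        "count": len(values),
        "mean": sum(values) / len(values),
        "median": median(values),
    }
```

Returning None for an empty match list avoids the division by zero that the plain average would run into.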
The matching program can also be used for Geometric Sieving, an algorithm that finds the best submotif of a given size for a certain motif. The best submotif is defined as the one that maximizes the median of the LRMSDs of its matches, since a high median LRMSD is highly correlated with high specificity (without losing sensitivity). To enable Geometric Sieving, set the geometricsieving option to 1 and set minscore to the desired submotif size in the match options (see labelhash_input.py). Instead of the usual match results file, the matching program will then produce match files for the submotif(s) whose median confidence intervals lie strictly above those of the other submotifs. These confidence intervals are determined through bootstrap sampling. It is possible that only one submotif remains after matching against only a small fraction of your background dataset; in that case the match file will not contain all possible matches. Geometric Sieving is CPU- and memory-intensive and will need to be run in parallel on all but the simplest test cases. It has only been lightly tested and should be considered a beta feature.
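To illustrate what a bootstrap confidence interval for a median looks like, here is a minimal sketch using the standard percentile bootstrap. It is not LabelHash's implementation, whose resampling scheme and confidence level are not described here; the function names, the number of resamples, and the 95% level are all assumptions made for the example.

```python
import random
from statistics import median

def bootstrap_median_ci(values, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the median of values.

    values must be a non-empty sequence (e.g. a list of LRMSDs).
    """
    if not values:
        raise ValueError("need at least one value")
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        median([rng.choice(values) for _ in range(n)])
        for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = medians[int(alpha * n_resamples)]
    upper = medians[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lower, upper

def strictly_above(ci_a, ci_b):
    """True if interval ci_a lies entirely above interval ci_b."""
    return ci_a[0] > ci_b[1]
```

In this picture, a submotif would be preferred over another when strictly_above returns True for their two intervals, which mirrors the criterion described above.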