The LabelHash Server

Getting Started

You can download a copy of the LabelHash tools and run them on your own computer or cluster. Below we describe how to use these tools once you have them installed.

Attention:
  • Before you install LabelHash, you should install Python 2.7. The LabelHash installer will then make sure Python can find the LabelHash Python module.
  • The LabelHash tools expect Michel Sanner’s msms tool to be installed somewhere in your $PATH. You can download this tool from http://mgltools.scripps.edu/downloads#msms. After you uncompress the msms tar ball or zip file, the main binary is called “msms.<platform.version>.” Rename it to “msms” and put it somewhere in your $PATH, e.g., /usr/local/bin.
  • You may also want to install Chimera. The LabelHash matching program produces files that can be opened in Chimera using our ViewMatch plugin.
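
Before running the LabelHash tools it can be useful to confirm that msms is actually reachable through $PATH. The following is a minimal sketch that uses only the Python standard library; the helper name find_on_path is hypothetical and is not part of the LabelHash distribution.

```python
import os

def find_on_path(program, search_path=None):
    """Return the full path of an executable named `program`, or None.

    `search_path` defaults to the PATH environment variable.
    """
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, program)
        # The file must exist and be executable by the current user.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None

location = find_on_path("msms")
print("msms: %s" % (location if location else "NOT FOUND in $PATH"))
```

If the script reports that msms was not found, check that the binary was renamed to "msms" and that its directory is listed in $PATH.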

The LabelHash algorithm consists of two stages: a preprocessing stage and a matching stage. During the preprocessing stage a hash table is built for a set of target PDB files. This has to be done only once for a given set of targets. During the matching stage partial matches can be looked up quickly in this table for almost any motif. The matching algorithm computes complete matches and their statistical significance.

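The workflow for using the LabelHash command line programs consists of the following steps: preparing your data, creating LabelHash tables, matching a motif, and interpreting the match output. We will describe these steps in detail below.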

Preparing your data

LabelHash works on (gzipped) PDB files, so you need to download some PDB files for motifs, a background data set, and homologs. You need to set the environment variable PDBPATH to the directories where you store the PDB files. The PDB files can be stored all in one directory or in the kind of ‘divided’ directory structure that the PDB uses. If bash is your shell, add something like this to your ${HOME}/.bash_profile:

 export PDBPATH=${HOME}/pdb
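
To check that your files are laid out in a way that can be found, a lookup along the following lines may help. This is only a sketch under stated assumptions, not a description of how LabelHash itself searches: it assumes that PDBPATH may hold several directories separated by the platform's path separator (":" on Unix), that files are named pdb<id>.ent.gz in lower case, and that the 'divided' layout places each file in a subdirectory named after the two middle characters of its ID (for example, ab/pdb1abc.ent.gz). The function name find_pdb_file is hypothetical.

```python
import os

def find_pdb_file(pdb_id, pdbpath=None):
    """Return the first pdb<id>.ent.gz found under the PDBPATH directories, or None."""
    if pdbpath is None:
        pdbpath = os.environ.get("PDBPATH", "")
    pdb_id = pdb_id.lower()
    filename = "pdb%s.ent.gz" % pdb_id
    for root in pdbpath.split(os.pathsep):
        if not root:
            continue
        candidates = [
            os.path.join(root, filename),               # flat layout (assumed)
            os.path.join(root, pdb_id[1:3], filename),  # 'divided' layout (assumed)
        ]
        for candidate in candidates:
            if os.path.isfile(candidate):
                return candidate
    return None
```

Calling find_pdb_file("1ady") returns the path of the matching file, or None when no file for that ID exists in any of the listed directories.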

Creating LabelHash tables

The script labelhash_input.py is used to create (1) input XML files for table creation, (2) match option files, and (3) motif files. Below we will describe basic usage. Run “labelhash_input.py --help” to see all options.

To match a motif, you first need to create LabelHash tables of the PDB files that are to be used as matching targets (unless you have downloaded a LabelHash table from the LabelHash web site). First, create a text file with all the PDB files you want to build tables for:

echo 1ady:A 1adj 1qe0:A > pdblist.txt

Note that you can optionally specify a chain. By default, all chains are used. If you have a directory full of files in the format pdb1abc.ent.gz, then you could use something like this:

cd ~/pdb; \ls -1 *.ent.gz | sed 's/^pdb//;s/\.ent\.gz$//' > pdblist.txt

(The sed command strips the prefix ‘pdb’ and the suffix ‘.ent.gz’ to obtain a list of PDB IDs.) The next step is to create XML input files with the labelhash_input.py script:

labelhash_input.py -t table -o pdb.xml pdblist.txt

The script will create pdb.xml, which will be used in the next step. There are many other parameters to this script with which you can change the default settings. The last step is creating the LabelHash tables:

 create pdb.xml

After “create” has finished, you should have the file pdb.lhash3. You can run create on multiple cores like so:

 mpirun -np 8 create pdb.xml

Here, ‘8’ is the number of parallel processes to be used. For best performance the number of processes should be more or less equal to the number of available cores.
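
If you launch the tools from a script, the process count can be derived from the machine instead of being hard-coded. The sketch below only builds the command line; the helper name mpirun_command is hypothetical, and note that multiprocessing.cpu_count() reports logical cores, which may be more than the number of physical cores.

```python
import multiprocessing

def mpirun_command(program, arguments, processes=None):
    """Build an mpirun command line as a list suitable for subprocess.call()."""
    if processes is None:
        # cpu_count() reports logical cores, which may exceed physical cores.
        processes = multiprocessing.cpu_count()
    return ["mpirun", "-np", str(processes), program] + list(arguments)

print(" ".join(mpirun_command("create", ["pdb.xml"], processes=8)))
# prints: mpirun -np 8 create pdb.xml
```

Leaving out the processes argument uses one process per core reported by the operating system.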

For large LabelHash tables the datasets stored in them can become very fragmented, which will negatively affect the performance of the matching program. To ‘defragment’ your LabelHash file you can use h5repack, a tool that is part of the HDF5 package, like so:

 h5repack oldpdb.lhash3 newpdb.lhash3

This can sometimes take a very long time (days!), but if you plan to use the table often, it can be worth it. After h5repack is finished, you can delete oldpdb.lhash3.

Matching a motif

To match a motif, you first need to create a motif XML file and a matching options XML file. A motif can be created like so:

labelhash_input.py -t motif -o 1ady.xml 1ady 81:ED 83:T 112:RS 130:ED 264:YL 311:RNKQ

The script expects the name of a PDB file followed by a number of residues, specified by the residue sequence ID. Each residue sequence ID can optionally be followed by a colon and a number of one-letter residue names, which are taken to be all allowed residue labels for that motif point. If the residue names are omitted, the residue label in the PDB file is used.
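
The residue argument format described above can be illustrated with a small parser. This is a sketch for clarity only and is not taken from labelhash_input.py; the function name parse_motif_point is hypothetical, and the sketch assumes plain integer residue sequence IDs (it does not handle insertion codes).

```python
def parse_motif_point(spec):
    """Split '81:ED' into (81, ['E', 'D']); a bare '83' gives (83, None)."""
    if ":" in spec:
        number, labels = spec.split(":", 1)
        allowed = list(labels.upper())
        if not allowed:
            raise ValueError("no residue labels after ':' in %r" % spec)
    else:
        # None stands for "use the residue label found in the PDB file".
        number, allowed = spec, None
    return int(number), allowed

arguments = "81:ED 83:T 112:RS 130:ED 264:YL 311:RNKQ".split()
points = [parse_motif_point(argument) for argument in arguments]
print(points[0])
# prints: (81, ['E', 'D'])
```

Each motif point is thus a residue position together with the set of residue labels that are allowed to match at that position.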

The options file can be created like so:

labelhash_input.py -t matchoptions -o options.xml

The values for all the options in options.xml can be changed with optional parameters to the labelhash_input.py script. The motif can be matched against the targets in the LabelHash table computed before:

match options.xml 1ady.xml pdb.lhash3

The matches will by default be saved in 1ady-matches.xml. You can specify a different output file like so:

match -o myfile.xml options.xml 1ady.xml pdb.lhash3

Analogous to “create”, you can run “match” in parallel using mpirun.

If you only want to match against specific targets, you can specify them as additional arguments:

match -o myfile.xml options.xml 1ady.xml pdb.lhash3 1adj 1kmm:A pdbids.txt 

The “1adj” argument is interpreted as “all chains in 1adj,” “1kmm:A” means only chain A of 1kmm will be matched, and, finally, any argument ending in “.txt” is interpreted to be a plain text file with a whitespace-separated list of target names.
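
The interpretation of these target arguments can be sketched as follows. The code is an illustration of the rules just described, not the implementation used by match; the function name expand_targets is hypothetical, and a chain of None is used here to mean "all chains".

```python
def expand_targets(arguments):
    """Expand target arguments into a list of (pdb_id, chain) pairs."""
    targets = []
    for argument in arguments:
        if argument.endswith(".txt"):
            # A text file holds a whitespace-separated list of target names.
            with open(argument) as handle:
                names = handle.read().split()
        else:
            names = [argument]
        for name in names:
            pdb_id, _, chain = name.partition(":")
            targets.append((pdb_id, chain or None))  # None: all chains
    return targets

print(expand_targets(["1adj", "1kmm:A"]))
# prints: [('1adj', None), ('1kmm', 'A')]
```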

LabelHash options

The labelhash_input.py script can be used to create input files for table creation and the matching program. The default settings should work well for a variety of motifs, but many aspects can be controlled through the following options:

Global options:
-h, --help show this help message and exit
-t TYPE, --type=TYPE Type of XML file to create (table, motif, or matchoptions)
-o OUTFILE, --outfile=OUTFILE Name for the XML output [default: standard output]
Reference set parameters:
These options are effective only for the types table and matchoptions.
--maxmindistance=MAXMINDISTANCE Max min distance between point in ref. set [default: 16]
--maxdiameter=MAXDIAMETER Max diameter of points in ref. set [default: 25]
--maxmindepth=MAXMINDEPTH Max min distance between point in ref. set and molecular surface [default: 1.6]
--maxdepth=MAXDEPTH Max distance between point in ref. set and molecular surface [default: 3.1]
LabelHash table creation options:
--nosplitchains Treat multiple chains in a PDB file as one point sequence
--allmodels Include all models in a PDB file
--nnradius=NNRADIUS Radius for precomputing nearest neighbors [default: 14.0]
--writefraction=WRITEFRACTION Fraction of processes involved in writing data [default: 0.01]
Match options:
--radius=RADIUS The max. distance to look for match augmentation points [default: 7]
--maxdistance=MAXDISTANCE The maximum allowed LRMSD for a complete match [default: 7]
--numneighbors=NUMNEIGHBORS The maximum number of nearest neighbors to consider for one step of match augmentation [default: 50]
--numtries=NUMTRIES The number of different reference sets to try [default: 15]
--maxmatches=MAXMATCHES The max. number of matches to return [default: 10000000]
--minscore=MINSCORE The minimum score required for a valid match [default: -1]
--maxscore=MAXSCORE The maximum score allowed for a valid match [default: 10000000]
--maxpvalue=MAXPVALUE Keep only matches whose pvalue does not exceed maxpvalue [default: 1]
--keepmultiplematches Keep multiple matches per target
--keepbadpartial Keep matches that can be augmented to larger matches
--nopointweight Do not use statistical pointweight correction
--geometricsieving Instead of regular matching perform Geometric Sieving
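
When many option files have to be generated, it can be convenient to assemble the labelhash_input.py command line programmatically. The helper below is a hypothetical sketch (the name labelhash_input_args is not part of LabelHash); it only uses the -t and -o options and the long option names listed above, and it returns a list that could be passed to subprocess.call().

```python
def labelhash_input_args(file_type, outfile, positional=(), **options):
    """Build a labelhash_input.py command line from keyword options.

    Options set to True become bare flags (e.g. --keepmultiplematches);
    options set to False or None are skipped; all others become --name=value.
    """
    args = ["labelhash_input.py", "-t", file_type, "-o", outfile]
    for name in sorted(options):
        value = options[name]
        if value is True:
            args.append("--%s" % name)
        elif value is not False and value is not None:
            args.append("--%s=%s" % (name, value))
    return args + list(positional)

print(" ".join(labelhash_input_args(
    "matchoptions", "options.xml", maxpvalue=0.05, keepmultiplematches=True)))
# prints: labelhash_input.py -t matchoptions -o options.xml --keepmultiplematches --maxpvalue=0.05
```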

Understanding match output

The “match” program produces a compressed XML file with all the matches. This file can be opened with the Chimera ViewMatch plugin. If you want to look at any of the XML files produced by LabelHash from the command line, use the command “xmllint --format matches.xml”. The “xmllint” program is part of libxml2. It is already installed on OS X, while on Linux it is part of a package called libxml2-utils, which you can install through your package manager. The XML file is structured as follows:

  • First, it lists the number of possible matches and the time it took.
  • Next, it lists the motif, the matching parameters, and the LabelHash table that the motif was matched against.
  • Finally, the bulk of the file consists of a list of matches. At the start of the list, the count element denotes the number of matches found. This number can be smaller than the number of possible matches, due to the distance cut-off used during matching. For each match the following information is listed:
    • The id of the matching target.
    • The alignment between the motif and the match. This is specified by the elements quat0 through quat3, which represent the rotation as a quaternion, and the elements transx, transy, and transz, which represent the translation.
    • The score currently indicates the number of matched residues. The web server only returns complete matches, so this number is not very informative. With the command line version, however, one can compute partial matches.
    • The rmsd is the LRMSD between the motif and the match.
    • The depth is the average distance between the matched residues and the molecular surface. This is not currently used during matching, but it is just returned as an extra bit of information.
    • The p-value indicates the statistical significance of a match.
    • The correspondence shows the mapping between the motif residues and residues in the match. The residue numbers are based on a renumbering of the residues in the corresponding chains starting from 0.
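
To make the alignment fields more concrete, the sketch below applies a quaternion rotation followed by a translation to a single 3D point. The conventions are assumptions, since they are not specified here: (quat0, quat1, quat2, quat3) is taken to be (w, x, y, z) with quat0 the scalar part, the rotation is applied before the translation, and which structure is moved onto which is left open. Verify these conventions against your own match files before relying on them. The function names are hypothetical.

```python
import math

def quaternion_to_matrix(q0, q1, q2, q3):
    """Convert a quaternion (assumed order w, x, y, z) to a 3x3 rotation matrix."""
    norm = math.sqrt(q0 * q0 + q1 * q1 + q2 * q2 + q3 * q3)
    w, x, y, z = q0 / norm, q1 / norm, q2 / norm, q3 / norm
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def transform(point, quaternion, translation):
    """Rotate `point` by `quaternion`, then add `translation` (assumed order)."""
    rotation = quaternion_to_matrix(*quaternion)
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )

# The identity quaternion leaves the point unrotated, so only the
# translation (transx, transy, transz) is applied.
print(transform((1.0, 2.0, 3.0), (1.0, 0.0, 0.0, 0.0), (0.5, 0.0, -1.0)))
# prints: (1.5, 2.0, 2.0)
```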

To analyze the XML output you can use the Python module. It allows you to read in a match file like so:

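from LabelHash import *  # provides MatchList

# Read all matches from a match file produced by "match".
matchlist = MatchList('matches.xml')
total_rmsd = 0.0
for match in matchlist:
    # Report when a particular target id shows up among the matches.
    if str(match.id()) == '1did:0:A':
        print('found a match to 1did:A!')
    total_rmsd += match.rmsd()
# Average LRMSD over all matches (assumes the file holds at least one match).
print('average rmsd = %f' % (total_rmsd / len(matchlist)))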

More examples can be found in the labelhash_test.py script.

Motif refinement

The matching program can also be used for Geometric Sieving, an algorithm that finds the best submotif of a given size for a certain motif. The best submotif is defined as the one that maximizes the median of the LRMSDs of its matches, since a high median LRMSD is highly correlated with high specificity (without losing sensitivity). To enable Geometric Sieving, set the geometricsieving option to 1 and set minscore to the desired submotif size in the match options (see labelhash_input.py). Instead of the usual match results file, the matching program will then produce match files for the submotif(s) whose median confidence intervals lie strictly above those of the other submotifs. These confidence intervals are determined through bootstrap sampling. It is possible that only one submotif remains after matching against only a small fraction of your background dataset; in that case the match file does not contain all possible matches. Geometric Sieving is a CPU- and memory-intensive procedure and will need to be run in parallel on all but the simplest test cases. It has only been lightly tested and should be considered a beta feature.
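
The selection rule can be illustrated with a small standalone sketch. It is not the LabelHash implementation: it assumes a simple percentile bootstrap, 1000 resamples, and a 95% confidence level, none of which are specified here, and all function names are hypothetical. A submotif is dropped when the confidence interval of some other submotif lies strictly above its own.

```python
import random

def median(values):
    ordered = sorted(values)
    middle = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[middle]
    return 0.5 * (ordered[middle - 1] + ordered[middle])

def bootstrap_median_interval(values, resamples=1000, confidence=0.95, rng=None):
    """Percentile-bootstrap confidence interval for the median (assumed method)."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    n = len(values)
    medians = sorted(
        median([rng.choice(values) for _ in range(n)]) for _ in range(resamples)
    )
    tail = (1.0 - confidence) / 2.0
    low = medians[int(tail * (resamples - 1))]
    high = medians[int((1.0 - tail) * (resamples - 1))]
    return low, high

def surviving_submotifs(lrmsds_by_submotif, **kwargs):
    """Keep submotifs whose interval is not strictly below another one's."""
    intervals = dict(
        (name, bootstrap_median_interval(values, **kwargs))
        for name, values in lrmsds_by_submotif.items()
    )
    survivors = []
    for name, (low, high) in intervals.items():
        dominated = any(
            other_low > high
            for other, (other_low, _) in intervals.items()
            if other != name
        )
        if not dominated:
            survivors.append(name)
    return sorted(survivors)

# Hypothetical LRMSD values for two submotifs; "a" has the higher median.
example = {"a": [5.0, 5.1, 5.2, 4.9, 5.3], "b": [1.0, 1.1, 0.9, 1.2, 1.0]}
print(surviving_submotifs(example))
# prints: ['a']
```

With overlapping intervals, several submotifs survive, which corresponds to the case where more than one match file is produced.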