Automated sequencing by two-dimensional analysis of mass chromatograms
This code demonstrates an algorithm to automate sequence extraction from laddering patterns in liquid chromatography-mass spectrometry data. The algorithm has been implemented as a Python 3 package accompanying our manuscript, "Bidirectional direct sequencing of non-canonical RNA by two-dimensional analysis of mass chromatograms," published in October 2015 in the Journal of the American Chemical Society. The source code is available under the GNU AGPL v3.
Please see the publication for background on the algorithm and application. The readme file, found in the repo top directory, contains detailed instructions for installation on multiple platforms. The package has been lightly tested on Windows, Linux, and OS X using Python 3.4. Several dependencies, including the clustalo binary, must be installed prior to use. Please see the readme for details. Several example datasets are included in the /examples directory, and the readme file explains how to run them.
Anders Björkbom, Victor S. Lelyveld, Shenglong Zhang, Weicheng Zhang, Chun Pong Tam, J. Craig Blain, and Jack W. Szostak at HHMI, MGH, and Harvard
Please contact @lelyveld for help, or open an issue or pull request.
Note: An updated version of this documentation is available in the readme.
Change to the directory where you've cloned the repository. In Debian/Ubuntu Linux, do:
$ sudo pip3 install numpy matplotlib pandas scipy patsy statsmodels
...
$ sudo apt-get install clustalo
...
$ sudo python3 setup.py install
...
$ cd examples
...
$ python3 test.py
...
The final command will process and display an example dataset and resulting sequencing reads.
This package gives a proof-of-concept implementation of an algorithm for extracting sequence information from a two-dimensional dataset of degradative fragments of a sequenceable polymer.
You must have Python 3. In recent Linux distributions, Python 3 is typically installed by default. If not, use your distro's package manager to install the latest python 3 package.
A few generally available Python 3 packages (most from the Scipy stack) must also be installed:
numpy 1.9+
matplotlib 1.3+
statsmodels 0.6+
scipy 0.9+
pandas 0.7.1+
patsy 0.3+
The bottom three packages are only necessary to satisfy installation requirements for statsmodels. Information about the Scipy stack may be found here:
http://www.scipy.org/install.html
On Linux, it should generally be possible to use pip, or your preferred Python package manager, to satisfy all Python 3 module dependencies quickly. Alternatively, it may be possible to install a Python distribution that contains all or part of the above packages (e.g. Anaconda). Installation details for each environment are suggested below.
For this initial release, the package makes use of the clustalo binary for alignment operations. See specific instructions below. The binary for ClustalO must be named 'clustalo' and must be executable from the current path for the current user.
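As a quick sanity check after installing the dependencies, the following minimal sketch reports which of the listed modules Python cannot locate. It is not part of this package; the helper name find_missing is made up for illustration, and the check only tests whether each module can be found, not whether it meets the minimum version listed above.

```python
import importlib.util

# Module names taken from the dependency list above.
REQUIRED_MODULES = ["numpy", "matplotlib", "statsmodels", "scipy", "pandas", "patsy"]


def find_missing(modules):
    """Return the names in `modules` that Python cannot locate (hypothetical helper)."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

If any module is reported missing, install it with pip3 or your distribution's package manager before continuing.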
Finally, for this release, the input data must be generated by Agilent MassHunter Qualitative Analysis software. See below under Input Data for more discussion.
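To confirm that the clustalo requirement is met, a short check like the one below may help. It is only a sketch using the standard library's shutil.which, which looks up an executable on the current user's path; the helper name find_executable is an illustrative assumption and is not provided by this package.

```python
import shutil


def find_executable(name="clustalo"):
    """Return the full path of `name` if it is executable on the PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_executable()
    if path is None:
        print("clustalo was not found on the current path.")
    else:
        print("clustalo found at:", path)
```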
On Debian/Ubuntu systems with typical configuration, it should be possible to run the following command to install all required python packages:
$ sudo pip3 install numpy matplotlib scipy pandas patsy statsmodels
On Ubuntu-based systems that use the apt package manager, ClustalO can easily be obtained by running the following command to install it from default Ubuntu repositories.
$ sudo apt-get install clustalo
If you obtain clustalo from another source, the binary for ClustalO must be named 'clustalo' and must be executable from the current path for the current user.
After fulfilling these requirements, see the installation section below to install this package.
Users who already have the Scipy stack for Python 3 installed may be able to simply run the following:
sudo pip3 install patsy statsmodels
Alternatively, if this is your first installation of Python 3 and/or the Scipy stack, the easiest method to satisfy package dependencies in OS X is to install a free scientific Python 3 distribution, such as Anaconda. This suite can be downloaded from here:
http://continuum.io/downloads#py34
If this works for you, then skip to installation of clustalo below. Alternatively, download and install the latest stable Python 3.4+ binary release obtainable from https://www.python.org . The current version at the time of this writing is here:
https://www.python.org/ftp/python/3.4.3/python-3.4.3-macosx10.6.pkg
Installing the packages may be achieved through pip3, as above for Linux. However, as of this writing, direct compilation of numpy is problematic.
The OS X binary for clustalo can be found here:
http://www.clustal.org/omega/clustal-omega-1.2.0-macosx
The binary should be renamed, made executable, and moved to an accessible path, as follows:
$ mv clustal-omega-1.2.0-macosx clustalo
$ chmod a+x clustalo
$ sudo mv clustalo /usr/local/bin
After fulfilling these requirements, see the installation section below to install this package.
For Windows users, the struggle is real. It may be possible to satisfy all dependencies by installing a third-party Python distribution. At present, the easiest strategy in Windows is to install Anaconda's Python 3 distribution, available here:
http://continuum.io/downloads#py34
If Python 3 from this distribution works for you, then skip down to clustalo installation. If not, download and install the latest stable Python 3.4+ binary release obtainable from https://www.python.org. During installation in Windows, select the option to "add python.exe to path," which will make life much easier. The current Windows 64-bit installer can be found here:
https://www.python.org/ftp/python/3.4.3/python-3.4.3.amd64.msi
On Windows, installing these packages with pip generally does not work without errors. Thankfully, Christoph Gohlke at the Laboratory for Fluorescence Dynamics, University of California, Irvine, maintains an unofficial repository of Windows Python package binaries, such that direct compilation on Windows is unnecessary. After installing Python 3.4 in Windows, download the cp34 versions of the required packages. Be sure to install the build (32-bit or 64-bit) that matches your version of Python 3.4 (simply run python at the command line to check which one you have). The pip "wheels" below allow quick installation in Windows. Currently, those files can be found at the following URL:
http://www.lfd.uci.edu/~gohlke/pythonlibs/
For 64-bit Python 3.4 in 64-bit Windows, download the following files:
numpy‑1.9.2+mkl‑cp34‑none‑win_amd64.whl
scipy‑0.15.1‑cp34‑none‑win_amd64.whl
matplotlib‑1.4.3‑cp34‑none‑win_amd64.whl
pandas‑0.16.0‑cp34‑none‑win_amd64.whl
patsy‑0.3.0‑py2.py3‑none‑any.whl
statsmodels‑0.6.1‑cp34‑none‑win_amd64.whl
Install each of these packages by running the following command on each filename from a command prompt:
pip install filename.whl
The ClustalO binary must be installed and accessible in the current path. ClustalO for Windows can be obtained here:
http://www.clustal.org/omega/clustal-omega-1.2.0-win32.zip
Unzip the contents of this zipfile to a good location, such as C:\clustalo . The path must be changed to allow clustalo.exe to be accessible from elsewhere. The path can be changed from Control Panel > System > Advanced system settings > Advanced tab > Environment Variables > System Variables > Path. Append "C:\clustalo;" to the value of the "Path" variable and hit OK.
After fulfilling these requirements, see the installation section below to install this package.
A setup script is included with this package for installation. From the root directory of the package, simply invoke the setup.py script. In Linux or OS X, run:
$ sudo python3 setup.py install
In Windows, run:
python setup.py install
Several instructive examples are included in the examples subdirectory. Run these from the command line by typing:
$ cd examples
$ python3 test.py
$ python3 figures.py
From within the Python 3 interpreter, the following commands are all that is necessary to run the example dataset provided in the examples/ subdirectory.
>>> import lcmsseq
>>> lcmsseq.read_params('default.cfg')
>>> lcmsseq.process('compounds.csv')
For the time being, processing parameters are read from a comma-delimited parameter file. The default parameters can be found in the bundled file "default.cfg". An explanation of the function of each parameter is given as a comment in this configuration file.
Alternatively, the modules can be run from the command line as follows:
$ python3 lcmsseq.py dataset.csv parameters.cfg
Parameters:
<dataset.csv>: compound dataset filename
Filename of a comma-delimited list of compounds extracted from LCMS data by, e.g., Agilent MassHunter Qualitative
Analysis Find by Molecular Feature. The format of this file is described below under _Input Data_.
<parameters.cfg>: parameters filename
Filename of a comma-delimited list of parameter key,value pairs. The bundled file default.cfg shows the
expected content of this file.
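To illustrate what a comma-delimited list of key,value pairs looks like once loaded, here is a minimal sketch of a reader. It is not the package's own read_params implementation. The '#' comment prefix and the two keys shown are assumptions made up for illustration; consult default.cfg for the real keys and comment style.

```python
import csv
import io


def parse_params(text):
    """Parse key,value lines into a dict, skipping blank lines and '#' comments."""
    params = {}
    for row in csv.reader(io.StringIO(text)):
        # Assumption: comment lines start with '#'.
        if not row or row[0].lstrip().startswith("#"):
            continue
        key = row[0].strip()
        value = row[1].strip() if len(row) > 1 else ""
        params[key] = value
    return params


# Hypothetical keys and values, for illustration only.
EXAMPLE = """\
# tolerance for mass matching (hypothetical key)
mass_tolerance,0.01
adduct_db,adducts.csv
"""

params = parse_params(EXAMPLE)
print(params)
```

All values are kept as strings here; converting them to numbers is left to the caller.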
The parameter file specifies the location of two key comma-delimited databases, containing a list of adducts and bases. The adduct database is a csv file containing a list of adduct exact masses for clustering. The base database is a csv file containing chemical formulas that correspond to mass differences between degradative fragments differing by loss of a single RNA monomer. Further instructions are available in files examples/adducts.csv and examples/bases.csv . An Excel file is also included to aid in generating the base database starting from nucleoside chemical formulas.
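To make the idea of the base database concrete, the sketch below converts a chemical formula into a monoisotopic mass and matches an observed mass difference against a small table. It is an illustration only, not the package's code. It assumes canonical RNA monophosphate residues (nucleoside monophosphate minus water), and the names monoisotopic_mass, RESIDUE_FORMULAS, and match_base, as well as the 0.01 Da tolerance, are made up here. The layout and entries of the bundled examples/bases.csv may differ.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotopes.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.9949146,
    "P": 30.9737620,
}

# Assumed residue formulas: nucleoside monophosphate minus H2O.
RESIDUE_FORMULAS = {
    "A": "C10H12N5O6P",
    "C": "C9H12N3O7P",
    "G": "C10H12N5O7P",
    "U": "C9H11N2O8P",
}


def monoisotopic_mass(formula):
    """Sum element masses for a simple formula such as 'C10H12N5O6P'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


RESIDUE_MASS = {base: monoisotopic_mass(f) for base, f in RESIDUE_FORMULAS.items()}


def match_base(delta, table, tol=0.01):
    """Return the base whose mass is within `tol` Da of `delta`, else None."""
    for base, mass in table.items():
        if abs(delta - mass) <= tol:
            return base
    return None


for base, mass in sorted(RESIDUE_MASS.items()):
    print(base, round(mass, 4))
```

A mass difference between two ladder fragments that matches one of these values within the tolerance suggests loss of the corresponding monomer; non-canonical bases would need their own entries.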
At the time of this initial release, the package has been designed to import comma-delimited (csv) files generated by Agilent MassHunter Qualitative Analysis software. Specifically, the file must be a list of compounds identified using the "Find by Molecular Feature" algorithm. Guidelines on using this feature to generate useful datasets can be found in our initial publication. Other compound-finding software is likely to be suitable, but this package currently expects to find several Agilent-generated column headings in the imported csv file.
Use the CSV export file from Agilent MassHunter Qualitative Analysis software, or ensure your CSV file is a list of compounds with one compound per row and contains (at the absolute minimum) the following column headers:
Cpd,Mass,RT,Vol
These columns contain the following information:
Cpd: compound ID (generally an integer)
Mass: neutral mass
RT: retention time
Vol: integrated intensity
Processing will fail if any of these columns are missing. The package can also filter on a number of other optional column headers that are typically assigned by Agilent's software, although this filtering is not strictly required.
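Before running the package on a CSV file from another source, it may be worth checking for the four required headers. The sketch below is one way to do that with the standard library. It assumes the header is the first row of the file, and the helper name missing_columns and the sample values are invented for illustration.

```python
import csv
import io

# Required headers, as listed above.
REQUIRED_COLUMNS = ("Cpd", "Mass", "RT", "Vol")


def missing_columns(csv_text):
    """Return required column headers absent from the first row of `csv_text`."""
    reader = csv.reader(io.StringIO(csv_text))
    header = [h.strip() for h in next(reader, [])]
    return [col for col in REQUIRED_COLUMNS if col not in header]


# Made-up rows, for illustration only.
SAMPLE = """\
Cpd,Mass,RT,Vol
1,6150.85,7.21,154000
2,5821.80,6.94,98000
"""

print(missing_columns(SAMPLE))
```

To check a file on disk, read its contents first, for example with open(filename, newline="").read(), and pass the text to missing_columns.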
The module makes use of the python 3 multiprocessing package to spawn several parallel processes to generate walk trajectories across the dataset. For this release, multicore features are operational in Linux and OS X environments.