MSPepSearch is a command-line interface (CLI) tool developed by NIST. It allows users to perform batch library searches against mass spectral libraries in NIST format. However, its usage can be challenging, since the CLI command may be long and contain more than a dozen flags, as demonstrated below.
# Example command that performs library searches for all mass spectra in
# 'test.msp' against the 'mainlib' and 'replib' sub-libraries using the
# 'Identity EI Normal' algorithm. Results are written to
# 'library_search_results.tsv'. For each candidate, additional information such
# as molecular weight, CAS numbers, molecular formula, etc. is included.
<path>\MSPepSearch64.exe \
Id \
/MAIN <path>\NIST23\MSSEARCH\mainlib \
/REPL <path>\NIST23\MSSEARCH\replib \
/INP <path>\test.msp \
/OUTTAB <path>\library_search_results.tsv \
/HITS 100 \
/All \
/OutRevMF \
/OutMW \
/OutCAS \
/OutChemForm \
/OutIK
Search algorithms and options (e.g., presearch) are typically encoded
as single letters, either uppercase or lowercase, which makes preparing
commands manually error-prone. In the example above, I
corresponds to the ‘Identity EI Normal’ algorithm, and d
represents the ‘Default Presearch’.
MSPepSearch outputs results in a tab-separated values (TSV) format. While R provides convenient tools for reading such tables, additional processing is required to extract individual hit lists, since all results are merged into a single file and must be separated programmatically.
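The separation step can be sketched in base R. The snippet below is only an illustration on a toy table: the column names (Unknown, Name, MF) and all values are placeholders, not the actual header or output of MSPepSearch.

```r
# Toy stand-in for a merged results table. Column names and values are
# illustrative assumptions, not the real MSPepSearch header.
tsv_text <- paste(
  "Unknown\tName\tMF",
  "unknown_b\tcandidate_1\t900",
  "unknown_b\tcandidate_2\t850",
  "unknown_a\tcandidate_3\t780",
  sep = "\n"
)
merged <- read.delim(text = tsv_text, stringsAsFactors = FALSE)

# Split into one data frame per unknown, keeping the order of first
# appearance rather than alphabetical order.
unknown_ids <- factor(merged$Unknown, levels = unique(merged$Unknown))
hitlists_toy <- split(merged, unknown_ids)
```

This sketch assumes that every unknown has a distinct name; spectra sharing a name would end up in the same data frame.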
mspepsearchr package

The mspepsearchr package provides an R interface to the
MSPepSearch tool. It was developed to overcome the practical limitations
described above and to make MSPepSearch fully accessible from within R.
All major search options can be specified directly via R function
arguments. Less commonly used options (not yet available as dedicated
arguments) can be provided manually through the
addl_cli_args parameter, which is available in all relevant
functions.
The MSPepSearch executables are included with this package, in compliance with NIST’s distribution policy. While including executables in an R package is uncommon, this design provides two key benefits: improved user convenience (no manual installation required) and automated testing support, ensuring reliable and reproducible performance across platforms.
The package is available for Windows, Linux, and macOS. On Linux and
macOS, Wine must be installed and accessible via the system
$PATH to run the Windows-based MSPepSearch binaries.
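Whether Wine is reachable can be checked from R before running any search. The following is a minimal sketch; NeedsWine() is a hypothetical helper written for this illustration and is not part of mspepsearchr.

```r
# Hypothetical helper: TRUE when the operating system is not Windows,
# i.e. when Wine would be needed to run the Windows binaries.
NeedsWine <- function(sysname = Sys.info()[["sysname"]]) {
  !identical(sysname, "Windows")
}

# Sys.which() returns an empty string when the executable is not on the PATH.
wine_path <- unname(Sys.which("wine"))
if (NeedsWine() && !nzchar(wine_path)) {
  message("Wine was not found on the PATH; MSPepSearch cannot be run.")
}
```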
The package provides user-friendly access to all library search algorithms available in NIST MS Search software for EI and MS/MS spectra of small molecules. Supported algorithms and their corresponding functions include:
- IdentitySearchEiNormal()
- IdentitySearchHighRes()
- IdentitySearchMsMs()
- SimilaritySearchEiSimple()
- SimilaritySearchEiNeutralLoss()
- SimilaritySearchEiHybrid()
- SimilaritySearchMsMsInEi()
- SimilaritySearchMsmsHybrid()

It is assumed that readers are familiar with the library search algorithms provided by NIST. Detailed descriptions of these algorithms are available in the package documentation and were adapted from the official MS Search help manual. For the most complete and up-to-date information, users are encouraged to consult the official NIST documentation.
The general functionality is illustrated here using the ‘Identity EI Normal’ algorithm. The main principles are identical for other algorithms.
Assume that raw GC/MS data have been processed, and pure mass spectra of all components have been extracted either manually (e.g., via background subtraction) or automatically (e.g., with AMDIS, MZmine, or ChromaTOF). The resulting spectra are saved in an MSP file. The path to this file can be specified as either an absolute or relative path (the latter is convenient when the spectra are in the same directory as the R script).
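For readers unfamiliar with the MSP format, the sketch below writes a minimal record containing only the Name and Num Peaks fields followed by one peak per line. FormatMspRecord() is a hypothetical helper for illustration, and the peak values are arbitrary; files exported by real software usually contain additional fields.

```r
# Hypothetical helper that formats one spectrum as a minimal MSP record.
FormatMspRecord <- function(name, mz, intst) {
  stopifnot(length(mz) == length(intst))
  c(paste0("Name: ", name),
    paste0("Num Peaks: ", length(mz)),
    paste(mz, intst),   # one "m/z intensity" pair per line
    "")                 # a blank line separates records
}

# Write the record to a temporary file; the path can then be passed on.
msp_file <- file.path(tempdir(), "example.msp")
writeLines(FormatMspRecord("Example spectrum", c(41, 43, 57), c(300, 650, 999)),
           msp_file)
```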
To use MSPepSearch, the mass spectral databases must be in NIST
format. Commercial databases are typically supplied in the correct
format along with the NIST MS Search software. For example, the NIST EI
mass spectral library is divided into two sub-libraries
(mainlib and replib) located inside the
MSSEARCH/ directory. The default installation path is typically
C:/NISTxx/MSSEARCH/ (where xx is the version of
the NIST database). Therefore, to search against both mainlib
and replib sub-libraries of NIST23, the libraries
argument can be specified as a character vector
c("C:/NIST23/MSSEARCH/mainlib/", "C:/NIST23/MSSEARCH/replib/").
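The library paths can also be assembled programmatically and checked before searching. This is a small sketch that assumes the default installation directory mentioned above; the directory must be adjusted for other setups.

```r
# Default NIST23 location mentioned above (an assumption; adjust as needed).
nist_dir <- "C:/NIST23/MSSEARCH"
libraries <- file.path(nist_dir, c("mainlib", "replib"))

# Report sub-libraries that cannot be found instead of failing later.
is_found <- dir.exists(libraries)
if (!all(is_found)) {
  message("Library directories not found: ",
          paste(libraries[!is_found], collapse = ", "))
}
```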
Most open-source mass spectral libraries are distributed as MSP
files. In such cases, the library must first be converted to NIST format
using MS Search software or the Lib2NIST Converter tool. In the
following example, a subset of the MassBank database (already converted
to NIST format) is used. To perform a library search with default
settings, the spectra and libraries arguments
must be provided.
alkanes <- system.file("extdata", "spectra", "alkanes_ei_lr.msp",
package = "mspepsearchr")
eims_lib <- system.file("extdata", "libraries", "massbank_subset_ei_lr_with_ri",
package = "mspepsearchr")
hitlists <- IdentitySearchEiNormal(alkanes, eims_lib)

The resulting object (hitlists) is a list of data
frames. Each data frame represents the hit list obtained for a specific
mass spectrum. The order of spectra in the MSP file and in
hitlists is identical, so a specific hit list can be accessed
by index.
col_names <- c("name", "mf", "rmf", "prob", "formula", "mw", "inchikey", "cas")
head(hitlists[[4L]][, col_names], 3L)
#> name mf rmf prob formula mw inchikey cas
#> 1 TETRADECANE 904 921 54.6 C14H30 198 BGHCVCJVXZWKCC-UHFFFAOYSA-N 0
#> 2 TRIDECANE 877 904 24.1 C13H28 184 IIYFAKIEWZDVMP-UHFFFAOYSA-N 0
#> 3 DODECANE 859 905 14.0 C12H26 170 SNRUBQQJIBEYMU-UHFFFAOYSA-N 0

The name of each mass spectrum is stored as the
unknown_name attribute of the corresponding hit list. This
can be used to extract results for a specific compound.
unknown_names <- vapply(hitlists, attr, which = "unknown_name", character(1L))
idx <- which(unknown_names == "Dodecane")
head(hitlists[[idx]][, col_names], 3L)
#> name mf rmf prob formula mw inchikey cas
#> 1 TRIDECANE 893 917 35.2 C13H28 184 IIYFAKIEWZDVMP-UHFFFAOYSA-N 0
#> 2 DODECANE 891 928 33.3 C12H26 170 SNRUBQQJIBEYMU-UHFFFAOYSA-N 0
#> 3 DODECANE 888 900 33.3 C12H26 170 SNRUBQQJIBEYMU-UHFFFAOYSA-N 0

Mass spectra can be provided either as file paths (pointing to MSP or
MGF files) or directly as R objects. The latter is convenient when
mspepsearchr is integrated into existing workflows. When
supplied as R objects, each spectrum is represented as a list containing
at least:

- name: a string containing the compound name or short description;
- mz: a numeric or integer vector of m/z values;
- intst: a numeric or integer vector of corresponding peak intensities.

Additional optional fields should follow MSP format conventions, for
example, molecular formula (formula), molecular weight
(mw), exact mass (exactmass), GC retention
index (retention_index), and precursor ion m/z
(precursormz).
Using plain R lists instead of S4 objects simplifies integration with other R packages and workflows. Lists are lightweight, easy to construct, and can be converted from other data formats without extra dependencies.
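As an illustration of how little is needed, the sketch below builds one spectrum by hand and checks the three required fields. IsValidSpectrum() is a hypothetical helper written for this example (it is not part of mspepsearchr), and the peak values are arbitrary.

```r
spectrum <- list(
  name  = "Example spectrum",
  mz    = c(41L, 43L, 57L),
  intst = c(300, 650, 999),
  retention_index = 1200   # optional field
)

# Hypothetical validator for the three required fields.
IsValidSpectrum <- function(x) {
  is.list(x) &&
    is.character(x$name) && length(x$name) == 1L &&
    is.numeric(x$mz) && is.numeric(x$intst) &&
    length(x$mz) == length(x$intst)
}

# Spectra are passed as a list of such lists, even for a single spectrum.
spectra <- list(spectrum)
all_valid <- all(vapply(spectra, IsValidSpectrum, logical(1L)))
```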
The LC-MS/MS
data analysis with xcms vignette provides a detailed example of
extracting a tandem mass spectrum from LC-MS/MS data. In that vignette,
a spectrum corresponding to a component with retention time 418.926
seconds was obtained from PestMix1_DDA.mzML. The example below
demonstrates how to convert an S4-class Spectra object into
a format compatible with mspepsearchr.
ex_spectrum <- readRDS("data/ex_spectrum.rds")
data_origin <- Spectra::dataOrigin(ex_spectrum)
rtime <- Spectra::rtime(ex_spectrum)
ms_level <- Spectra::msLevel(ex_spectrum)
precursor_mz <- Spectra::precursorMz(ex_spectrum)
# Polarity is encoded as 0 (negative) or 1 (positive). Indexing is
# vectorized, so this also works when the object holds several spectra.
ion_mode <- c("NEGATIVE", "POSITIVE")[match(Spectra::polarity(ex_spectrum), 0:1)]
mz <- Spectra::mz(ex_spectrum)
intst <- Spectra::intensity(ex_spectrum)
# Build one plain list per spectrum in the format expected by mspepsearchr.
fenamiphos <- lapply(seq_along(ex_spectrum), function(i) {
  list(name = paste0(basename(data_origin[[i]]), ", rt=", rtime[[i]], "s"),
       spectrum_type = paste0("MS", ms_level[[i]]),
       precursormz = precursor_mz[[i]],
       ion_mode = ion_mode[[i]],
       mz = mz[[i]],
       intst = intst[[i]])
})

For convenience, a pre-converted spectrum is also included in the package.
msp_path <- system.file("extdata", "spectra", "fenamiphos_msms_hr.msp",
package = "mspepsearchr")
fenamiphos <- mssearchr::ReadMsp(msp_path)

Once converted, the mass spectrum can be searched against a tandem mass spectral library.
msms_lib <- system.file("extdata", "libraries", "massbank_subset_msms_hr",
package = "mspepsearchr")
hitlists <-
  IdentitySearchMsMs(fenamiphos, msms_lib,
                     precursor_ion_tol = list(value = 0.05, units = "mz"),
                     product_ions_tol = list(value = 0.05, units = "mz"))
col_names <- c("name", "dot", "rdot", "formula", "prec_mz", "delta_mz")
head(hitlists[[1L]][, col_names], 3L)
#> name dot rdot formula prec_mz delta_mz
#> 1 Fenamiphos 876 905 C13H22NO3PS 304.1131 -0.0004
#> 2 (-)-Scopolamine 1 62 C17H21NO4 304.1543 0.0408

This approach enables seamless integration into automated workflows, allowing direct library searches on spectra extracted from raw GC-MS or LC-MS/MS data without the need to write or read intermediate MSP files.
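In such workflows, the hit lists are often condensed into one row per unknown. The sketch below shows one way to do this with base R. It runs on hand-made toy hit lists that only mimic the structure described earlier (data frames with name and mf columns and an unknown_name attribute); all values are made up, and the best hit is assumed to come first.

```r
# Toy hit lists; MakeToyHitlist() is a hypothetical helper for illustration.
MakeToyHitlist <- function(unknown_name, name, mf) {
  structure(data.frame(name = name, mf = mf, stringsAsFactors = FALSE),
            unknown_name = unknown_name)
}
toy_hitlists <- list(
  MakeToyHitlist("unknown_a", c("candidate_1", "candidate_2"), c(900, 850)),
  MakeToyHitlist("unknown_b", c("candidate_3", "candidate_1"), c(780, 640))
)

# One row per unknown: the first hit of each hit list.
top_hits <- do.call(rbind, lapply(toy_hitlists, function(hl) {
  data.frame(unknown = attr(hl, "unknown_name"),
             top_hit = hl$name[1L],
             mf = hl$mf[1L],
             stringsAsFactors = FALSE)
}))
```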
MSPepSearch offers extensive flexibility for adjusting search
parameters and controlling the output. These options are implemented
through a large number of command-line flags. While the most common
parameters can be controlled via R function arguments, less common
options can be specified manually through the addl_cli_args
argument for fine-tuning.
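Because manually supplied flags are forwarded as text, it can be useful to inspect them before running a search. The following sketch is not the package's own check. ExtractFlags() is a hypothetical helper, and the set of flags managed by the wrapper is an assumption made for illustration.

```r
# Hypothetical helper: return the tokens of a CLI string that start with "/".
ExtractFlags <- function(cli_args) {
  tokens <- strsplit(trimws(cli_args), "[[:space:]]+")[[1L]]
  tokens[startsWith(tokens, "/")]
}

# Flags assumed to be set by the R wrapper (illustrative list only).
managed_flags <- c("/HITS", "/INP", "/OUTTAB")

user_flags <- ExtractFlags("/RI nt15rIN /HITS 10")
conflicts <- intersect(user_flags, managed_flags)
```

The comparison is exact and case-sensitive, and a flag value that itself starts with a slash (for example, an absolute Unix path) would be mistaken for a flag, so the result should be treated as a hint only.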
The official MSPepSearch manual is a plain-text file named
MSPepSearch64.exe.hlp.txt. It can be accessed directly from R
using the OpenHelpFile() function, which opens it in the
default text editor.
Passing additional flags via addl_cli_args requires
caution, as only a minimal check for duplicated flags is performed.
Several potential issues are not automatically detected:

- [...] v and /R.Match);
- [...] /M) and ppm (/MPPM) simultaneously;

Duplicated flags, however, are caught. For example, attempting to
manually specify the number of hits with /HITS will raise an error,
since this option is already managed automatically based on n_hits.
hitlists <- IdentitySearchEiNormal(spectra, ms_library,
addl_cli_args = "/HITS 10")
#> Error in .PrepareJobs(spectra, libraries, algorithm = "identity_normal", :
#> The following CLI flags are duplicated: /HITS

The use of the addl_cli_args argument is illustrated
below using retention indices (RI). The ri_column_type
argument of IdentitySearchEiNormal() allows specifying only
the stationary phase type. However, MSPepSearch supports additional
RI-related options. For example, RI mismatches can be used to penalize
match factors. In the example below, the mass spectrum of tridecane
lacks low-intensity peaks (including the molecular ion), so the correct
hit appears only in the third position.
tridecane <- list(
list(name = "Tridecane",
mz = c(53, 55, 56, 57, 69, 70, 71, 84, 85),
intst = c(51, 314, 220, 999, 110, 126, 526, 54, 274),
retention_index = 1300)
)
hitlists <- IdentitySearchEiNormal(tridecane, eims_lib)
col_names <- c("name", "mf", "rmf", "prob", "formula", "mw", "ri")
head(hitlists[[1L]][, col_names], 3L)
#> name mf rmf prob formula mw ri
#> 1 DODECANE 842 842 39.7 C12H26 170 1200
#> 2 UNDECANE 829 829 26.9 C11H24 156 1100
#> 3 TRIDECANE 819 819 20.1 C13H28 184 1300

By setting the RI tolerance to 15 i.u. (t15) and
applying an infinite penalty rate (rIN), all other normal
alkanes are removed from the hit list, placing tridecane at the top.
hitlists <- IdentitySearchEiNormal(tridecane, eims_lib,
addl_cli_args = "/RI nt15rIN")
head(hitlists[[1L]][, col_names], 3L)
#> name mf rmf prob formula mw ri
#> 1 TRIDECANE 819 819 98.7 C13H28 184 1300
#> 2 DELTA-TETRADECALACTONE 345 345 1.0 C14H26O2 226 NA
#> 3 D-CAMPHOLYLMETHANE 310 310 0.3 C11H20O 168 NA

MSPepSearch is a single-threaded application. To improve performance,
external parallelization can be achieved by running multiple independent
instances of MSPepSearch from within R using the parallel
package. The n_threads argument specifies how many parallel
threads to use for library searching.
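The general idea of external parallelization can be sketched with the parallel package alone. The code below is not the package's implementation: SearchChunk() is a hypothetical stand-in for a worker that would launch one MSPepSearch instance, and the inputs are placeholders.

```r
library(parallel)

# Hypothetical stand-in for a single-threaded search over a chunk of spectra.
SearchChunk <- function(chunk) {
  lapply(chunk, function(spectrum) paste("hits for", spectrum))
}

spectra_names <- paste0("spectrum_", 1:10)   # placeholder inputs
n_workers <- 2L

# Divide the spectra into contiguous chunks, one per worker.
chunks <- lapply(splitIndices(length(spectra_names), n_workers),
                 function(idx) spectra_names[idx])

# Process the chunks in independent R worker processes.
cl <- makeCluster(n_workers)
results_by_chunk <- parLapply(cl, chunks, SearchChunk)
stopCluster(cl)

# Concatenate per-chunk results; contiguous chunks keep the original order.
results <- unlist(results_by_chunk, recursive = FALSE)
```

Splitting into contiguous chunks means that the combined results stay aligned with the input spectra, which matches the ordering guarantee described for the hit lists.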
Performance improvement is measured as speedup, defined as the ratio of single-thread execution time to multi-thread execution time. As expected, speedup depends on task complexity. In library searching, complexity is mainly determined by the number of mass spectra to search and the size of the library. Here, single-thread execution time serves as a practical indicator of task complexity.
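The definition amounts to a single division. The timings below are made-up numbers chosen only to show the arithmetic; they are not measurements from this vignette.

```r
# Illustrative execution times in seconds (not measured values).
t_single <- 120                      # one thread
t_multi <- c("2" = 65, "4" = 40)     # named by the number of threads

# Speedup: single-thread time divided by multi-thread time.
speedup <- t_single / t_multi
```

With these numbers, four threads give a speedup of 3, i.e. the task finishes three times faster than on a single thread.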
As shown in Figure 1, speedup increases monotonically with workload size. Library searches were performed using the ‘Identity EI Normal’ algorithm on an Intel Core i7-4790K with four threads. For simple tasks, the overhead of parallelization dominates. Moderate tasks achieved speedups in the range of 2-3, while computationally demanding tasks, requiring several minutes on a single core, reached a maximum speedup of approximately 3.4.
Figure 1. Speedup obtained for various library search tasks using four threads as a function of task complexity (i.e., single-thread execution time). Representative tasks are encoded as three underscore-separated values: the number of unknown spectra, the mass spectral library and the presearch option. All searches used the ‘Identity EI Normal’ algorithm.
Figure 2 illustrates how speedup varies with the number of threads for several representative library search tasks. The color scheme matches that of Figure 1. For tasks of moderate to high complexity, speedup grows nearly linearly with the number of threads up to four, which corresponds to the number of physical cores. Beyond four threads, a noticeable slowdown occurs due to hyperthreading overhead, a common phenomenon in parallel computing.
Figure 2. Speedup as a function of the number of threads for several representative library search tasks. Task complexity is represented by single-thread execution time.
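Given the behaviour observed on the test machine, a reasonable starting point is to set the number of threads to the number of physical cores. The sketch below is a suggestion rather than a rule; detectCores() can distinguish physical from logical cores only on some platforms and may return NA.

```r
# Number of physical cores, where the platform can report it.
n_physical <- parallel::detectCores(logical = FALSE)

# Fall back to a single thread when the count is unavailable.
n_threads <- if (is.na(n_physical)) 1L else as.integer(n_physical)
```

The resulting value can then be supplied as the n_threads argument of the search functions.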
Built with mspepsearchr_0.2.0.