Introduction to mspepsearchr

Andrey Samokhin

2025-11-08

Overview

The MSPepSearch tool

MSPepSearch is a command-line interface (CLI) tool developed by NIST. It allows users to perform batch library searches against mass spectral libraries in NIST format. However, its usage can be challenging, since the CLI command may be long and contain more than a dozen flags, as demonstrated below.

# Example command that performs library searches for all mass spectra in
# 'test.msp' against the 'mainlib' and 'replib' sub-libraries using the
# 'Identity EI Normal' algorithm. Results are written to
# 'library_search_results.tsv'. For each candidate, additional information such
# as molecular weight, CAS numbers, molecular formula, etc. is included.

<path>\MSPepSearch64.exe \
Id \
/MAIN <path>\NIST23\MSSEARCH\mainlib \
/REPL <path>\NIST23\MSSEARCH\replib \
/INP <path>\test.msp \
/OUTTAB <path>\library_search_results.tsv \
/HITS 100 \
/All \
/OutRevMF \
/OutMW \
/OutCAS \
/OutChemForm \
/OutIK

Search algorithms and options (e.g., presearch) are typically encoded as single letters, either uppercase or lowercase, which makes preparing commands manually error-prone. In the example above, I corresponds to the ‘Identity EI Normal’ algorithm, and d represents the ‘Default Presearch’.
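One way to reduce such errors is to assemble the argument vector programmatically rather than typing it by hand. The sketch below is illustrative only: build_args() is a hypothetical helper (it is not part of MSPepSearch or mspepsearchr), the paths are placeholders, and only flags that appear in the example command above are used.

```r
# Hypothetical helper (not part of any package): builds the argument vector
# for the example command shown above.
build_args <- function(input, output, main_lib, repl_lib, n_hits = 100L) {
  c("Id",                 # 'I' = Identity EI Normal, 'd' = Default Presearch
    "/MAIN", main_lib,
    "/REPL", repl_lib,
    "/INP", input,
    "/OUTTAB", output,
    "/HITS", as.character(n_hits),
    "/All", "/OutRevMF", "/OutMW", "/OutCAS", "/OutChemForm", "/OutIK")
}

# Placeholder paths; adjust them to the local installation.
args <- build_args(
  input    = "test.msp",
  output   = "library_search_results.tsv",
  main_lib = "C:/NIST23/MSSEARCH/mainlib",
  repl_lib = "C:/NIST23/MSSEARCH/replib"
)
cat(paste(args, collapse = " "), "\n")
```

A vector like this could then be passed to system2() together with the path to the executable.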

MSPepSearch outputs results in a tab-separated values (TSV) format. While R provides convenient tools for reading such tables, additional processing is required to extract individual hit lists, since all results are merged into a single file and must be separated programmatically.
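The separation step can be sketched in base R as follows. This is a minimal illustration under stated assumptions: the table is a small mock, and the column names (Unknown, Rank, Name, MF) are hypothetical placeholders rather than the exact headers written by MSPepSearch.

```r
# Mock merged results table (tab-separated); column names are assumptions.
tsv_text <- paste(
  "Unknown\tRank\tName\tMF",
  "Spectrum A\t1\tDODECANE\t900",
  "Spectrum A\t2\tUNDECANE\t850",
  "Spectrum B\t1\tTRIDECANE\t910",
  sep = "\n"
)
results <- read.delim(text = tsv_text, stringsAsFactors = FALSE)

# Split the merged table into one data frame per unknown spectrum.
# Explicit factor levels keep the unknowns in their order of appearance.
unknown_levels <- unique(results$Unknown)
hit_lists <- split(results, factor(results$Unknown, levels = unknown_levels))

names(hit_lists)
```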

The mspepsearchr package

The mspepsearchr package provides an R interface to the MSPepSearch tool. It was developed to overcome the practical limitations described above and to make MSPepSearch fully accessible from within R. All major search options can be specified directly via R function arguments. Less commonly used options (not yet available as dedicated arguments) can be provided manually through the addl_cli_args parameter, which is available in all relevant functions.

The MSPepSearch executables are included with this package, in compliance with NIST’s distribution policy. While including executables in an R package is uncommon, this design provides two key benefits: improved user convenience (no manual installation required) and automated testing support, ensuring reliable and reproducible performance across platforms.

The package is available for Windows, Linux, and macOS. On Linux and macOS, Wine must be installed and accessible via the system $PATH to run the Windows-based MSPepSearch binaries.
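Whether Wine is reachable can be checked from R before running a search. The snippet below is a minimal sketch that assumes the executable is named wine; some installations may use a different name, and the check mspepsearchr itself performs (if any) may differ.

```r
# TRUE on Linux and macOS, where the Windows binaries need Wine.
needs_wine <- .Platform$OS.type != "windows"

# Sys.which() returns an empty string when the program is not on the PATH.
# The executable name "wine" is an assumption.
wine_found <- nzchar(unname(Sys.which("wine")))

if (needs_wine && !wine_found) {
  message("Wine was not found on the PATH; install it before running searches.")
}
```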

Library search algorithms

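The package provides user-friendly access to all library search algorithms available in NIST MS Search software for EI and MS/MS spectra of small molecules. Each supported algorithm has a dedicated function, for example IdentitySearchEiNormal() for the ‘Identity EI Normal’ algorithm and IdentitySearchMsMs() for identity searches of tandem mass spectra.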

It is assumed that readers are familiar with the library search algorithms provided by NIST. Detailed descriptions of these algorithms are available in the package documentation and were adapted from the official MS Search help manual. For the most complete and up-to-date information, users are encouraged to consult the official NIST documentation.

General functionality

The general functionality is illustrated here using the ‘Identity EI Normal’ algorithm. The main principles are identical for other algorithms.

Assume that raw GC/MS data have been processed, and pure mass spectra of all components have been extracted either manually (e.g., via background subtraction) or automatically (e.g., with AMDIS, MZmine, or ChromaTOF). The resulting spectra are saved in an MSP file. The path to this file can be specified as either an absolute or relative path (the latter is convenient when the spectra are in the same directory as the R script).

To use MSPepSearch, the mass spectral databases must be in NIST format. Commercial databases are typically supplied in the correct format along with the NIST MS Search software. For example, the NIST EI mass spectral library is divided into two sub-libraries (mainlib and replib) located inside the MSSEARCH/ directory. The default installation path is typically C:/NISTxx/MSSEARCH/ (where xx is the version of the NIST database). Therefore, to search against both mainlib and replib sub-libraries of NIST23, the libraries argument can be specified as a character vector c("C:/NIST23/MSSEARCH/mainlib/", "C:/NIST23/MSSEARCH/replib/").

Most open-source mass spectral libraries are distributed as MSP files. In such cases, the library must first be converted to NIST format using MS Search software or the Lib2NIST Converter tool. In the following example, a subset of the MassBank database (already converted to NIST format) is used. To perform a library search with default settings, only the spectra and libraries arguments must be provided.

alkanes  <- system.file("extdata", "spectra", "alkanes_ei_lr.msp",
                        package = "mspepsearchr")
eims_lib <- system.file("extdata", "libraries", "massbank_subset_ei_lr_with_ri",
                        package = "mspepsearchr")
hitlists <- IdentitySearchEiNormal(alkanes, eims_lib)

The resulting object (hitlists) is a list of data frames. Each data frame represents the hit list obtained for a specific mass spectrum. The order of hit lists matches the order of spectra in the MSP file, so a specific hit list can be accessed by index.

col_names <- c("name", "mf", "rmf", "prob", "formula", "mw", "inchikey", "cas")
head(hitlists[[4L]][, col_names], 3L)
#>          name  mf rmf prob formula  mw                    inchikey cas
#> 1 TETRADECANE 904 921 54.6  C14H30 198 BGHCVCJVXZWKCC-UHFFFAOYSA-N   0
#> 2   TRIDECANE 877 904 24.1  C13H28 184 IIYFAKIEWZDVMP-UHFFFAOYSA-N   0
#> 3    DODECANE 859 905 14.0  C12H26 170 SNRUBQQJIBEYMU-UHFFFAOYSA-N   0

The name of each mass spectrum is stored as the unknown_name attribute of the corresponding hit list. This can be used to extract results for a specific compound.

unknown_names <- vapply(hitlists, attr, which = "unknown_name", character(1L))
idx <- which(unknown_names == "Dodecane")
head(hitlists[[idx]][, col_names], 3L)
#>        name  mf rmf prob formula  mw                    inchikey cas
#> 1 TRIDECANE 893 917 35.2  C13H28 184 IIYFAKIEWZDVMP-UHFFFAOYSA-N   0
#> 2  DODECANE 891 928 33.3  C12H26 170 SNRUBQQJIBEYMU-UHFFFAOYSA-N   0
#> 3  DODECANE 888 900 33.3  C12H26 170 SNRUBQQJIBEYMU-UHFFFAOYSA-N   0
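The same attribute can be used to summarize the top hit for every unknown in one table. The sketch below uses mock hit lists so that it runs on its own; make_hitlist() is a hypothetical helper, the values are illustrative, and the code assumes each hit list is sorted with the best match first, as in the output shown above.

```r
# Hypothetical helper: creates a mock hit list with the 'unknown_name'
# attribute described above.
make_hitlist <- function(unknown, name, mf) {
  structure(data.frame(name = name, mf = mf, stringsAsFactors = FALSE),
            unknown_name = unknown)
}

mock_hitlists <- list(
  make_hitlist("Undecane", c("UNDECANE", "DODECANE"), c(910, 880)),
  make_hitlist("Dodecane", c("TRIDECANE", "DODECANE"), c(893, 891))
)

# Take the first row of each hit list and label it with the unknown's name.
top_hits <- do.call(rbind, lapply(mock_hitlists, function(hl) {
  data.frame(unknown = attr(hl, "unknown_name"),
             top_hit = hl$name[1L],
             mf      = hl$mf[1L],
             stringsAsFactors = FALSE)
}))
top_hits
```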

Integration into other workflows

Mass spectra can be provided either as file paths (pointing to MSP or MGF files) or directly as R objects. The latter is convenient when mspepsearchr is integrated into existing workflows. When supplied as R objects, each spectrum is represented as a list containing at least the spectrum name (name), the m/z values (mz), and the corresponding intensities (intst), as in the examples in this vignette.

Additional optional fields should follow MSP format conventions, for example, molecular formula (formula), molecular weight (mw), exact mass (exactmass), GC retention index (retention_index), and precursor ion m/z (precursormz).

Using plain R lists instead of S4 objects simplifies integration with other R packages and workflows. Lists are lightweight, easy to construct, and can be converted from other data formats without extra dependencies.
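A spectrum in this form can be built and sanity-checked with a few lines of base R. The sketch below is an illustration, not the package's own validation: is_valid_spectrum() is a hypothetical helper, and the set of required fields (name, mz, intst) is inferred from the examples in this vignette.

```r
# One spectrum as a plain list; optional fields follow the MSP-style names
# mentioned above.
spectrum <- list(
  name            = "Tridecane",
  mz              = c(57, 71, 85),
  intst           = c(999, 526, 274),
  formula         = "C13H28",
  mw              = 184,
  retention_index = 1300
)

# Hypothetical check: a name plus equally long numeric mz/intst vectors.
is_valid_spectrum <- function(x) {
  is.list(x) &&
    is.character(x[["name"]]) && length(x[["name"]]) == 1L &&
    is.numeric(x[["mz"]]) && is.numeric(x[["intst"]]) &&
    length(x[["mz"]]) > 0L &&
    length(x[["mz"]]) == length(x[["intst"]])
}

# A set of spectra is a list of such lists.
spectra <- list(spectrum)
all_valid <- all(vapply(spectra, is_valid_spectrum, logical(1L)))
```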

The LC-MS/MS data analysis with xcms vignette provides a detailed example of extracting a tandem mass spectrum from LC-MS/MS data. In that vignette, a spectrum corresponding to a component with retention time 418.926 seconds was obtained from PestMix1_DDA.mzML. The example below demonstrates how to convert an S4-class Spectra object into a format compatible with mspepsearchr.

ex_spectrum <- readRDS("data/ex_spectrum.rds")

data_origin <- Spectra::dataOrigin(ex_spectrum)
rtime <- Spectra::rtime(ex_spectrum)
ms_level <- Spectra::msLevel(ex_spectrum)
precursor_mz <- Spectra::precursorMz(ex_spectrum)
ion_mode <- switch(Spectra::polarity(ex_spectrum) + 1L, "NEGATIVE", "POSITIVE")
mz <- Spectra::mz(ex_spectrum)
intst <- Spectra::intensity(ex_spectrum)

fenamiphos <- lapply(seq_along(ex_spectrum), function(i) {
  list(name = paste0(basename(data_origin[[i]]), ", rt=", rtime[[i]], "s"),
       spectrum_type = paste0("MS", ms_level[[i]]),
       precursormz = precursor_mz[[i]],
       ion_mode = ion_mode[[i]],
       mz = mz[[i]],
       intst = intst[[i]])
})

For convenience, a pre-converted spectrum is also included in the package.

msp_path <- system.file("extdata", "spectra", "fenamiphos_msms_hr.msp",
                        package = "mspepsearchr")
fenamiphos <- mssearchr::ReadMsp(msp_path)

Once converted, the mass spectrum can be searched against a tandem mass spectral library.

msms_lib <- system.file("extdata", "libraries", "massbank_subset_msms_hr",
                        package = "mspepsearchr")
hitlists <-
  IdentitySearchMsMs(fenamiphos, msms_lib,
                     precursor_ion_tol = list(value = 0.05, units = "mz"),
                     product_ions_tol = list(value = 0.05, units = "mz"))
col_names <- c("name", "dot", "rdot", "formula", "prec_mz", "delta_mz")
head(hitlists[[1L]][, col_names], 3L)
#>              name dot rdot     formula  prec_mz delta_mz
#> 1      Fenamiphos 876  905 C13H22NO3PS 304.1131  -0.0004
#> 2 (-)-Scopolamine   1   62   C17H21NO4 304.1543   0.0408

This approach enables seamless integration into automated workflows, allowing direct library searches on spectra extracted from raw GC-MS or LC-MS/MS data without the need to write or read intermediate MSP files.

Setting search parameters with command-line flags

MSPepSearch offers extensive flexibility for adjusting search parameters and controlling the output. These options are implemented through a large number of command-line flags. While the most common parameters can be controlled via R function arguments, less common options can be specified manually through the addl_cli_args argument for fine-tuning.

The official MSPepSearch manual is a plain-text file named MSPepSearch64.exe.hlp.txt. It can be accessed directly from R using the OpenHelpFile() function, which opens it in the default text editor.

Passing additional flags via the addl_cli_args requires caution, as only a minimal check for duplicated flags is performed. Several potential issues are not automatically detected:

For example, attempting to manually specify the number of hits with /HITS will raise an error, since this option is already managed automatically based on n_hits.

hitlists <- IdentitySearchEiNormal(spectra, ms_library,
                                   addl_cli_args = "/HITS 10")
#> Error in .PrepareJobs(spectra, libraries, algorithm = "identity_normal",  : 
#>   The following CLI flags are duplicated: /HITS

The use of the addl_cli_args argument is illustrated below using retention indices (RI). The ri_column_type argument of IdentitySearchEiNormal() allows specifying only the stationary phase type. However, MSPepSearch supports additional RI-related options. For example, RI mismatches can be used to penalize match factors. In the example below, the mass spectrum of tridecane lacks low-intensity peaks (including the molecular ion), so the correct hit appears only in the third position.

tridecane <- list(
  list(name = "Tridecane",
       mz = c(53, 55, 56, 57, 69, 70, 71, 84, 85),
       intst = c(51, 314, 220, 999, 110, 126, 526, 54, 274),
       retention_index = 1300)
)
hitlists <- IdentitySearchEiNormal(tridecane, eims_lib)
col_names <- c("name", "mf", "rmf", "prob", "formula", "mw", "ri")
head(hitlists[[1L]][, col_names], 3L)
#>        name  mf rmf prob formula  mw   ri
#> 1  DODECANE 842 842 39.7  C12H26 170 1200
#> 2  UNDECANE 829 829 26.9  C11H24 156 1100
#> 3 TRIDECANE 819 819 20.1  C13H28 184 1300

By setting the RI tolerance to 15 i.u. (t15) and applying an infinite penalty rate (rIN), all other normal alkanes are removed from the hit list, placing tridecane at the top.

hitlists <- IdentitySearchEiNormal(tridecane, eims_lib,
                                   addl_cli_args = "/RI nt15rIN")
head(hitlists[[1L]][, col_names], 3L)
#>                     name  mf rmf prob  formula  mw   ri
#> 1              TRIDECANE 819 819 98.7   C13H28 184 1300
#> 2 DELTA-TETRADECALACTONE 345 345  1.0 C14H26O2 226   NA
#> 3     D-CAMPHOLYLMETHANE 310 310  0.3  C11H20O 168   NA

External parallelization

MSPepSearch is a single-threaded application. To improve performance, external parallelization can be achieved by running multiple independent instances of MSPepSearch from within R using the parallel package. The n_threads argument specifies how many parallel threads to use for library searching.
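The general idea of external parallelization can be sketched with the parallel package. This is a minimal illustration and not necessarily how mspepsearchr implements n_threads internally: search_chunk() is a hypothetical stand-in for a single MSPepSearch run, and the spectra are mock data.

```r
library(parallel)

# Mock unknown spectra stored as plain lists.
spectra <- lapply(1:10, function(i) {
  list(name = paste0("Spectrum ", i),
       mz = c(41, 43, 57), intst = c(300, 800, 999))
})

# Hypothetical stand-in for one independent search over a chunk of spectra.
search_chunk <- function(chunk) {
  lapply(chunk, function(s) {
    data.frame(unknown = s$name, n_peaks = length(s$mz),
               stringsAsFactors = FALSE)
  })
}

n_threads <- 2L

# splitIndices() divides 1..n into contiguous, nearly equal blocks.
chunks <- lapply(splitIndices(length(spectra), n_threads),
                 function(idx) spectra[idx])

# Each worker processes one chunk independently.
cl <- makeCluster(n_threads)
results <- parLapply(cl, chunks, search_chunk)
stopCluster(cl)

# Flatten one level so that the original order of spectra is preserved.
hitlists <- unlist(results, recursive = FALSE)
```

Because the chunks are contiguous and parLapply() returns results in input order, the flattened list lines up with the original spectra.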

Performance improvement is measured as speedup, defined as the ratio of single-thread execution time to multi-thread execution time. As expected, speedup depends on task complexity. In library searching, complexity is mainly determined by the number of mass spectra to search and the size of the library. Here, single-thread execution time serves as a practical indicator of task complexity.
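As a small worked example of this definition, the snippet below computes speedup from execution times. The timings are hypothetical placeholders, not measurements; in practice they could be obtained by wrapping a search call in system.time() for different values of n_threads.

```r
# Hypothetical execution times in seconds, named by the number of threads.
exec_time <- c("1" = 120, "2" = 65, "4" = 40)

# Speedup = single-thread time divided by multi-thread time.
speedup <- exec_time[["1"]] / exec_time
round(speedup, 2)
```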

As shown in Figure 1, speedup increases monotonically with workload size. Library searches were performed using the ‘Identity EI Normal’ algorithm on an Intel Core i7-4790K with four threads. For simple tasks, the overhead of parallelization dominates. Moderate tasks achieved speedups in the range of 2-3, while computationally demanding tasks, requiring several minutes on a single core, reached the maximum speedup of approximately 3.4.

Figure 1. Speedup obtained for various library search tasks using four threads as a function of task complexity (i.e., single-thread execution time). Representative tasks are encoded as three underscore-separated values: the number of unknown spectra, the mass spectral library and the presearch option. All searches used the 'Identity EI Normal' algorithm.


Figure 2 illustrates how speedup varies with the number of threads for several representative library search tasks. The color scheme matches that of Figure 1. For tasks of moderate to high complexity, speedup grows nearly linearly with the number of threads up to four, which corresponds to the number of physical cores. Beyond four threads, a noticeable slowdown occurs due to hyperthreading overhead, a common phenomenon in parallel computing.

Figure 2. Speedup as a function of the number of threads for several representative library search tasks. Task complexity is represented by single-thread execution time.



Built with mspepsearchr_0.2.0.