Abstract: Mass spectrometry has become one of the most important technologies in proteomic analysis. Tandem mass spectrometry (LC-MS/MS) is a major tool for the analysis of peptide mixtures from protein samples. The key step of MS data processing is the identification of peptides from sample fragmentation spectra by searching public sequence databases. Although a number of algorithms to identify peptides from MS/MS data have already been proposed, e.g. Sequest, OMSSA, X!Tandem, Mascot, MassWiz, etc., they are mainly based on statistical models that consider only peak matches between experimental and theoretical spectra, but not peak intensity information. Moreover, different algorithms give different results from the same MS data, implying their immaturity and low stability. We developed a novel peptide identification algorithm, ProVerB, based on a binomial probability distribution model of protein tandem mass spectrometry combined with a new scoring function, making full use of peak intensity information and thus enhancing the ability of identification. Compared with Mascot, Sequest and SQID, ProVerB identified significantly more peptides from LC-MS/MS datasets at 1% False Discovery Rate (FDR) and provided more confident peptide identifications. ProVerB is also compatible with various platforms and experimental datasets, showing its robustness and versatility. The open-source program ProVerB is available at http://bioinformatics.jnu.edu.cn/software/proverb/.
KEYWORDS: Protein Identification Algorithm, Tandem Mass Spectrometry, Statistical Model
Soft ionization techniques, e.g. Matrix-Assisted Laser Desorption Ionization (MALDI) 1 and Electrospray Ionization (ESI) 2, are able to maintain the integrity of peptides, thus empowering mass spectrometry (MS) methods to perform proteomic analysis 3-5. Protein identification is the most fundamental step in the data processing pipeline, since the sensitivity and accuracy of the identification algorithm are crucial for the downstream analyses. Generally, a peptide identification algorithm selects some peaks from the spectra, evaluates the similarity between the experimental and theoretical spectra, and then assigns the best match within the peptide error window as the result 8. The scoring models that evaluate the similarity between experimental and theoretical spectra should consider three aspects: the number of peak matches, the number of consecutive peak matches and the intensities of the matched peaks 9.
A number of peptide identification algorithms with various concepts for MS data are available, e.g. Mascot 10, Sequest 11, OMSSA 12, X!Tandem 13, MassWiz 14, Andromeda 15 and SQID 9. Mascot and Sequest are widely used commercial software and commonly adopted search tools in protein identification 15; however, only limited details of these algorithms have been released. Mascot is based on a probability model, whereas Sequest is based on an empirical scoring model that computes the cross-correlation between experimental and theoretical spectra. Mascot selects the highest peak in each 14 Da mass interval and keeps the peaks with intensities above the threshold. Sequest takes consecutive matches of ions and intensity information into account, and preprocesses the spectrum by keeping the top 200 peaks and separating the spectrum into 10 bins for normalization 15. X!Tandem uses a hypergeometric scoring model, while OMSSA is based on a Poisson scoring model to evaluate the significance of a peptide match. Both select the 50 most intense peaks by default. MassWiz divides the spectrum into 10 parts and selects the 20 highest peaks from each part. SQID 9 keeps the top 80 peaks after removing parent-related peaks.
However, none of these algorithms accurately uses the full information in MS experiments. They share similar methods to generate theoretical spectra. Considering six types of ions (b, y, b-H2O, b-NH3, y-H2O and y-NH3) in CID (Collision-Induced Dissociation) fragmentation mode, theoretical peak intensities are set to three artificial values: 50 (b and y ions), 25 (b and y ions with loss of H2O or NH3) and 10 (a ions), giving a theoretical spectrum that does not fully reflect the experimental characteristics of mass spectrometry 11. Therefore, once the peaks are selected, these algorithms do not use the peak intensity information obtained in the experiment when comparing the experimental and theoretical spectra. SQID introduces the intensity probability of pairwise amino acid fragments to consider the intensity match quality 9, but most identification algorithms based on statistical models rely only on peak matches between experimental and theoretical spectra and do not use peak intensity information. The incomplete use of MS information compromises the sensitivity, robustness and confidence of these methods.
To make full use of the MS information and to maximize universality, we present here a novel identification algorithm, Protein Verification algorithm based on Binomial probability distribution (ProVerB), to enhance the accuracy, completeness and robustness of peptide identification. We tested ProVerB against other algorithms using multiple MS datasets, showing that its ability and confidence to identify peptides from mass spectra at 1% FDR are significantly and consistently higher than those of the widely used Mascot and Sequest.
2. MATERIALS AND METHODS
2.1 Cell culture, protein extraction and trypsin digestion
Streptococcus pneumoniae D39 was cultivated in Todd-Hewitt broth with 0.5% yeast extract (THY) in a controlled incubator (37 °C, 5% CO2). Cells were harvested at OD600 ~ 0.6 by centrifugation at 5000 × g for 20 min at 4 °C. The harvested cells were washed three times with prechilled PBS (10 mM, pH 7.4) and then resuspended in lysis buffer (15 mM Tris-HCl, pH 8.0) 18. The mixture was freeze-thawed for three cycles and then sonicated 10 times for 30 s each. The lysate was centrifuged at 12000 × g for 10 min at 4 °C. Protein concentrations were determined using the Bradford assay, and the proteins were subjected to reduction with 10 mM DTT (37 °C, 3 h) and alkylation with 20 mM iodoacetamide (room temperature, 1 h in the dark). Proteins were precipitated with four volumes of ice-cold acetone, pelleted by centrifugation and washed twice with ethanol. The pellet was resuspended in 25 mM Tris-HCl buffer (pH 7.6) and digested with sequencing-grade modified trypsin (1:25 w/w; Promega, Madison, WI) at 37 °C for 20 h 19.
2.2 SCX-RPLC-MS/MS analysis
Dried peptides were reconstituted in 5% ACN/0.1% formic acid and analyzed with a Finnigan Surveyor HPLC system coupled online with an LTQ-Orbitrap XL (Thermo Fisher Scientific, Waltham, MA) equipped with a nanospray source. The peptide mixtures were loaded onto an SCX column and then eluted with 0, 0.05, 0.2 and 1 M NH4Cl. Each fraction flowed into a C18 column (100 μm ID, 10 cm length, 5 μm resin (Michrom Bioresources, Auburn, CA)) using an autosampler. Peptides were eluted with a 0~35% gradient (Buffer A: 0.1% formic acid and 5% ACN; Buffer B: 0.1% formic acid and 95% ACN) over 120 min and analyzed online with the LTQ-Orbitrap MS using a data-dependent TOP10 method 20. The parameters used for the mass spectrometric analysis were: spray voltage, 1.85 kV; no sheath and auxiliary gas flow; ion transfer tube temperature, 200 °C; 35% normalized collision energy for MS2; ion selection threshold, 1000 counts for MS2; activation q = 0.25 and activation time of 30 ms during MS2 acquisitions. The mass spectrometers were operated in positive ion mode with a data-dependent automatic switch between MS and MS/MS acquisition modes 19.
2.3 Mass spectrometry datasets
The datasets (Mix 3) of standard mixtures of 18 proteins obtained with five types of instruments (Agilent XCT, Thermofinnigan LTQ-FT, Thermofinnigan LCQ DECA, Thermofinnigan LTQ and Micromass/Waters QTOF Ultima, abbreviated below as Agilent, FT, LCQ, LTQ and QTOF, respectively) were downloaded (http://regis-web.systemsbiology.net//PublicDatasets/) to test the accuracy and dynamic range of the algorithms. The LTQ-Orbitrap data obtained from the S. pneumoniae D39 protein identification, containing more than 270,000 spectra, served as the training dataset for the parameters of the model. The dataset of the E. coli proteome 23 was downloaded from http://marcottelab.org/MSdata/Data_03/.
2.4 Data preprocessing
For the S. pneumoniae D39 and E. coli datasets, the raw format files were converted to dta file format by Bioworks 3.31 (Thermo Finnigan, San Jose, CA) and the dta format files were merged to Mascot generic format (mgf) using the merge.pl program (http://www.matrixscience.com/downloads/merge.zip). For the 18 proteins dataset, the downloaded dta format files were merged to Mascot generic format (mgf) by the merge.pl program. The dta format files were the input files of our method and of the Sequest software.
2.5 MS/MS database search
For target-decoy based FDR calculation, forward and reverse databases were built for the three datasets as in Table 1.
Table 1. The databases used for MS/MS database search
S. pneumoniae D39 database
18 proteins database
Forward and reverse database
The Mascot generic format (mgf) files were searched using Mascot 2.3 (Matrix Science, London, UK) against the forward and reverse databases. The dta files were searched using Sequest 28.13 (Thermo Fisher Scientific, Waltham, MA) and our algorithm ProVerB. The following search criteria were applied for all three algorithms: full tryptic specificity; two missed cleavages allowed; carbamidomethylation of cysteine (+57.021464 Da) set as fixed modification; oxidation of methionine (+15.994915 Da) considered as variable modification. The values of precursor ion mass tolerance and fragment ion mass tolerance were set as in Table 2 based on the instrument characteristics. The fragment ion tolerance of Sequest was set to 1.0 Da since it requires an integer value for m/z in the preprocessing of MS data 11.
Table 2. The parameters of precursor and fragment ion tolerance settings
ProVerB and Mascot
precursor ion tolerance
fragment ion tolerance
precursor ion tolerance
fragment ion tolerance
2.6 False discovery rate (FDR)
The peptide-spectrum matches (PSMs) were extracted from Mascot's data format file (.dat) with our in-house Matlab program, and PSMs with the highest rank were exported to calculate the FDR threshold. Sequest results were extracted from Sequest output files (.out), and PSMs with the highest rank and ΔCn ≥ 0.1 were exported to calculate the FDR threshold. ProVerB results and the extracted results of Mascot and Sequest were written to csv format files. All target and decoy scores of rank 1 PSMs were sorted in ascending order to calculate their FDR values by Käll's method. Different thresholds were chosen to obtain the FDR from the following formula:
The score threshold was tuned to reach FDR ≤ 1%. The scoring functions vary among search algorithms: for Mascot, the ion scores were sorted to calculate FDR when peptide length ≥ 6; for Sequest and SQID, the Xcorr scores were sorted to calculate FDR separately for each precursor ion charge when peptide length ≥ 6 and ΔCn ≥ 0.1 and 0.05, respectively; for ProVerB, the S scores (the final score of each peptide, see below) were sorted to calculate FDR when peptide length ≥ 6.
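The FDR formula itself did not survive in this text. As a rough illustration only, the sketch below uses the common target-decoy estimate, the number of decoy PSMs at or above a score threshold divided by the number of target PSMs at or above it. This estimator and the function names are assumptions and may differ from the exact formula used by the authors.

```python
def fdr_at_threshold(target_scores, decoy_scores, threshold):
    """Assumed estimate: (#decoy PSMs >= threshold) / (#target PSMs >= threshold)."""
    n_target = sum(s >= threshold for s in target_scores)
    n_decoy = sum(s >= threshold for s in decoy_scores)
    if n_target == 0:
        return 0.0
    return n_decoy / n_target


def lowest_threshold_for_fdr(target_scores, decoy_scores, max_fdr=0.01):
    """Scan target scores in ascending order and return the first (lowest)
    threshold whose estimated FDR is <= max_fdr, or None if there is none."""
    for t in sorted(set(target_scores)):
        if fdr_at_threshold(target_scores, decoy_scores, t) <= max_fdr:
            return t
    return None
```

In practice the rank 1 target and decoy scores would be read from the csv files described above. For Sequest and SQID the same scan would be repeated separately for each precursor charge.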
2.8 Comparison of algorithms
All algorithms were compared according to the number of identified MS/MS spectra and unique peptides at FDR ≤ 0.01. The overlap rates of unique peptides and MS/MS spectra were further analyzed according to the different identification results of the three algorithms.
3. RESULTS AND DISCUSSION
3.1 Peak selection in the spectra
Peaks closer than 1±0.25 Da are considered isotope peaks and were filtered 9. The number of peaks used for the spectrum search was kept small in the algorithms to minimize random matches and enhance accuracy. Sequest selected the highest 200 peaks from all fragment spectra 11. Mascot selected one peak from every 14 Da and kept the peaks above a certain threshold for subsequent analysis 10. A maximum of 50 peaks was used by X!Tandem 13. Many other algorithms select the 1~10 highest ion peaks from an average 100 Da window for subsequent analysis 26-28. Our algorithm ProVerB selected the top 6 ion peaks in each 100 Da window since we considered the matching of six types of fragment ions, namely b, y, b-H2O, y-H2O, b-NH3 and y-NH3. The fragment ions were selected only if their intensities were higher than 33% of the highest peak.
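The selection rule described above can be sketched as follows. This is a minimal illustration, not the ProVerB source code. It assumes that windows are aligned at multiples of 100 Da and that "the highest peak" means the base peak of the whole spectrum, neither of which is specified in the text. Isotope filtering is left out.

```python
from collections import defaultdict


def select_peaks(peaks, window=100.0, top_n=6, rel_threshold=0.33):
    """peaks: list of (mz, intensity) tuples. Returns selected peaks sorted by m/z.

    Assumptions (not stated in the paper): windows start at multiples of
    `window`, and the threshold is relative to the spectrum's base peak.
    """
    if not peaks:
        return []
    base_intensity = max(intensity for _, intensity in peaks)
    bins = defaultdict(list)
    for mz, intensity in peaks:
        # Keep only peaks above the relative intensity threshold.
        if intensity > rel_threshold * base_intensity:
            bins[int(mz // window)].append((mz, intensity))
    selected = []
    for window_peaks in bins.values():
        # Keep the top_n most intense peaks of each window.
        window_peaks.sort(key=lambda p: p[1], reverse=True)
        selected.extend(window_peaks[:top_n])
    return sorted(selected)
```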
3.2 Theoretical spectra
A theoretical spectrum was generated based on the chemistry of b/y ion fragmentation. If the b or y fragment ions contained S, T, E or D residues, a loss of H2O (b-H2O or y-H2O) was considered; if the b or y fragment ions contained R, K, Q or N residues, a loss of NH3 (b-NH3 or y-NH3) was considered 15. If the parent ion charge was +1 or +2, we considered +1/+2 fragment ion peaks. Only when the parent ion charge was not less than 2 and the fragment ions contained one of the R, K or H residues were +2 fragment ion peaks considered 9.
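The neutral-loss rules above can be illustrated with a short sketch for singly charged ions. It uses standard monoisotopic residue masses and is not taken from the ProVerB source. The function name is hypothetical, and modifications and +2 fragment ions are deliberately left out.

```python
# Standard monoisotopic residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565
AMMONIA = 17.026549


def theoretical_ions(peptide):
    """Return {label: m/z} for singly charged b/y ions and their neutral losses."""
    ions = {}
    n = len(peptide)
    for i in range(1, n):
        # b_i covers the first i residues; y_i covers the last i residues plus water.
        for series, frag, offset in (("b", peptide[:i], 0.0),
                                     ("y", peptide[n - i:], WATER)):
            mz = sum(MONO[aa] for aa in frag) + offset + PROTON
            ions[f"{series}{i}"] = mz
            if set(frag) & set("STED"):   # water loss rule from the text
                ions[f"{series}{i}-H2O"] = mz - WATER
            if set(frag) & set("RKQN"):   # ammonia loss rule from the text
                ions[f"{series}{i}-NH3"] = mz - AMMONIA
    return ions
```

For the toy peptide `SAK`, this yields b1 and b2 with their water-loss ions (both contain S) and y1 and y2 with their ammonia-loss ions (both contain K).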
3.3 Scoring function
The scoring function is the critical part of an MS peptide identification algorithm. In our algorithm we applied the binomial probability density function to consider three aspects: simple fragment ion matches, consecutive fragment ion matches and the intensity of the b/y ion peaks.
3.3.1 The scoring function for simple fragment matches
It is difficult to propose a universal scoring function to fit various types of instruments and strategies, the variability in the fragmentation patterns, as well as the extent of fragmentation and the intensities of the peaks. We solved this problem by establishing a binomial distribution statistical model based on the nature of matching itself, independent of all the experimental factors listed above. The match probability of experimental and theoretical fragment ions reflects the confidence of the match:
p = probability of a random match.
p0 = 0.06. From each 100 Da interval we selected the highest 6 peaks, thus the random match probability is 0.06.
f = ratio between the number of selected peaks among the remaining peaks of the spectrum and the m/z range of the experimental mass spectrum.
n = number of theoretical fragment peaks.
k = number of matched peaks in the experimental spectrum.
P = probability that k peaks match among the n theoretical peaks, calculated by the binomial distribution probability density function.
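As a minimal sketch of the last definition, the binomial probability of exactly k matches among n theoretical peaks can be computed as below. How p is derived from p0 and f is not recoverable from this text, so the sketch takes p directly as an input. The function name is illustrative.

```python
from math import comb


def binomial_match_probability(n, k, p):
    """Probability of exactly k matches among n theoretical peaks,
    each matching at random with probability p (binomial distribution)."""
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)
```

With a small p such as 0.06, the probability drops quickly as k grows, so a spectrum with many matched peaks is unlikely to be a random match.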
3.3.2 The scoring function for consecutive ion matches
Multiple consecutive ion matches were converted into a series of ion pair matches: x consecutive ion matches were converted into x-1 ion pairs, and the matching probability of each pair was calculated as above. For example, if the b1, b2 and b3 ions were consecutively matched, this consecutive ion match was converted into two consecutive pairs: b1-b2 and b2-b3. Additionally, the probability of consecutive fragment matches was calculated as follows:
p1 = probability of the consecutive fragment matches
P1 = probability that k1 peaks match consecutively among the n1 consecutive theoretical peaks, calculated by the binomial distribution probability density function
n1 = number of consecutive matches in the theoretical spectrum
k1 = number of consecutive matches in the experimental spectrum
R is the background constant. Trained on a large number of identification results in the S. pneumoniae D39 dataset, we derived R = 0.09083 using the following formula:
It reflects the probability of real consecutive matching. It is necessary to add a background value to correct for consecutive matches of more than two ions. However, the probability of consecutive matches of three ions was far lower than that of two ions, resulting in a small R value.
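The conversion of a run of consecutive matches into ion pairs, described at the start of this subsection, can be sketched as follows. The function name and the input format (a collection of matched ion indices for one ion series) are illustrative assumptions.

```python
def consecutive_pairs(matched_indices, series="b"):
    """Convert matched ion indices of one series into consecutive ion pairs.

    A run of x consecutively matched ions yields x - 1 pairs,
    e.g. b1, b2, b3 -> (b1, b2), (b2, b3).
    """
    idx = sorted(set(matched_indices))
    return [(f"{series}{i}", f"{series}{j}")
            for i, j in zip(idx, idx[1:]) if j == i + 1]
```

The length of the returned list corresponds to k1 in the definitions above.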
3.3.3 The scoring function for spectrum intensity of b/y ion peaks
Another novelty of our algorithm is that it considers peak intensity quantitatively for identification. The peak intensities of b/y ions generated from the same peptide are correlated based on their physical and chemical properties 9. This provides important additional information to filter the noise and increase the sensitivity of identification. We introduced matrices Bij and Yij based on the chemical properties of the bonds between each amino acid pair (AAP). The matrices Bij and Yij were calculated using the S. pneumoniae D39 dataset and are listed in Supplementary Table 1.
M_I = the number of AAP b-ion or y-ion matches among the highest two peaks in every 100 Da.
M_E = the number of AAP b-ion or y-ion matches among the top six peaks in every 100 Da.
i and j stand for amino acids, ranging from 1 to 20.
The peptide score function is defined as:
k2 = number of the peaks matching b/y ions
n2 = number of b/y ions in the theoretical spectrum
T = the sum of Bij and Yij of the AAP b/y ion peaks which are the highest two peaks in every 100 Da and matched to amino acids i and j.
c = number of the highest two peaks matching b/y ions in every 100 Da.
f = ratio between the number of selected peaks and the m/z range of the experimental mass spectrum. A constant 0.02 is added since the random match probability of two ions in a 100 Da interval is 0.02.
Here, p2 is the random match probability of b/y ion matches concerning the peak intensity. indirectly reflects the peak intensity match quality of b/y ions, and T should be greater than c.
A detailed example is included in the supplementary materials.
3.3.4 The overall scoring function and background value
The three scores above were then used to calculate the overall peptide score PEP_S:
PEP_S = -10·lg(P·P1·P2)
To investigate the influence of P1 and P2, we plotted the peptide number against the FDR, considering the three scores P, P1 and P2 progressively by applying three different scoring methods, -10·lg(P), -10·lg(P·P1) and -10·lg(P·P1·P2), to the S. pneumoniae D39 dataset (Supplementary Figure 1). The curves showed that both the consecutive ion matches (P1) and the intensity matches (P2) contribute to the improvement of identification.
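The three progressive scoring variants compared above can be written as one small function. This is a sketch that assumes lg denotes the base-10 logarithm. The function name and the default arguments are illustrative.

```python
from math import log10


def pep_score(p, p1=1.0, p2=1.0):
    """PEP_S = -10 * lg(P * P1 * P2), with lg taken as log base 10.

    Leaving p1 and/or p2 at 1.0 gives the reduced variants
    -10*lg(P) and -10*lg(P*P1).
    """
    product = p * p1 * p2
    if product <= 0.0:
        raise ValueError("the product of probabilities must be positive")
    return -10.0 * log10(product)
```

Because each probability is at most 1, adding P1 and P2 can only keep or raise the score of a peptide-spectrum match, and it raises it more for matches that are unlikely to be random.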
The peptide score can be affected by additional information, including peptide length, number of modifications, number of missed cleavages and charge of precursor ions, which necessitates a correction 15. A background value B was subtracted from PEP_S:
S = PEP_S - B
The correction values for different classes of peptides were derived from the S. pneumoniae D39 dataset with the Bayesian learning method. A statistical probability of 0.5 for PEP_S from the Bayesian network means that the forward and reverse peptides cannot be distinguished; in this case we defined S = 0, so the background value B equals PEP_S. The background values B for different classes of peptides are listed in Table 5. S is the final score of each peptide.
Table 5. Background values learnt from Bayesian networks
Background values type
Missed cleavage sites
precursor ion mass*0.018
Background values type
Parent ion charge
30 (charge > 2)
3.4 Comparison of ProVerB with Mascot, Sequest and SQID
3.4.1 Number of identified peptides and spectra
We compared the Matlab version of our algorithm ProVerB with two widely used MS identification algorithms, Mascot and Sequest, for their sensitivity. The test datasets include the in-house generated S. pneumoniae D39 dataset, the E. coli dataset and the dataset from the 18 standard protein mixture.
Under the criterion FDR ≤ 0.01, all three algorithms were able to identify more than 3000 peptides from the S. pneumoniae D39 dataset (Fig. 1). The Venn diagram shows that most of the peptides (2702) and spectra (81243) could be identified by all three algorithms. The overlap ratios of identified peptides and spectra between Mascot and ProVerB were as high as 91.0% and 97.9%, respectively, showing good consistency with other algorithms. Clearly, ProVerB identified more peptides and spectra than Mascot and Sequest. The advantage of ProVerB remained the same in the three E. coli datasets as well, showing its consistent power of identification (Figs. 2A and 2B). We also compared ProVerB with SQID, which also considers the peak intensity information. Compared with the SQID result (3441 peptides and 96542 spectra), the overlap ratios of identified peptides and spectra between SQID and ProVerB were as high as 84.6% and 87.3%, respectively. The comparison plot of peptide identification number versus FDR for the four algorithms showed that ProVerB identifies the most peptides within the FDR range of 0.5%~3% (Supplementary Figure 2).
Fig. 1. Comparison of Mascot, Sequest and ProVerB using the S. pneumoniae D39 dataset. (A) Number of identified peptides. (B) Number of identified spectra.
Fig. 2. (A) The number of identified peptides from the E. coli datasets using ProVerB, Mascot and Sequest. (B) The number of identified spectra from the E. coli datasets using the three algorithms.
Next, we tested the adaptability of ProVerB to various types of MS instruments, including Agilent, FT, LCQ, LTQ and QTOF, using the downloaded 18 standard protein MS spectra. Again, ProVerB identified significantly more peptides and spectra than Mascot (up to 45.7%) and Sequest (up to 41.7%) on all instruments except Agilent (Figs. 3A and 3B). These data indicate that ProVerB in most cases provided a significantly higher ability to identify peptides and spectra than the other two identification algorithms and that it is also applicable to a wide variety of MS instruments.
We used the background value R = 0.09083 in all analyses above. However, the precursor ion charge and peptide length may influence the background value R slightly (Supplementary Figure 3). To address how much the variation of the R value influences the identification performance, we tested ProVerB using the two-dimensional R value matrix (Supplementary Table 2). In this case ProVerB identified only one peptide more than when using the average R value, and 98.8% of the peptides overlapped under the two settings. Therefore, the precursor ion charge and peptide length have only a trivial influence, if any. The R values vary depending on the instrument type: Agilent, FT, LCQ, LTQ and QTOF give R values of 0.1261, 0.1475, 0.1328, 0.1236 and 0.09006, respectively. We tested ProVerB using R = 0.1475 on the dataset generated by FT, which deviates most from the average R value; this resulted in only one more identified peptide, and all the other identified peptides were the same. These results confirmed that the average value R = 0.09083 can be used universally in ProVerB, insensitive to the precursor ion charge, peptide length and instrument type.
Fig. 3. (A) The number of identified peptides from the 18 standard protein dataset obtained from five types of MS instruments using the three algorithms. (B) The number of identified spectra from the 18 standard protein dataset obtained from five types of MS instruments using the three algorithms.
3.4.2 The number of identified high-confidence peptides
Since different algorithms give different identification results, a cross-check of results from different algorithms may reveal the confidence of identified peptides. The high-confidence peptides and spectra characterize the quality of identification of an algorithm 14. To calculate the number of high-confidence peptides, we first calculated the overlaps of the identified peptides of each pair of algorithms (Supplementary Table 3). The high-confidence peptides can be calculated as, where A, B and C represent the identified peptides or spectra of ProVerB, Mascot and Sequest, respectively. The fractions of high-confidence peptides identified by these three algorithms are listed in Table 3.
Table 3: The fraction of high-confidence peptides of the three algorithms
18 Standard proteins mixture
In most cases, ProVerB clearly exceeded Mascot and Sequest in identifying high-confidence peptides, showing its robust and instrument-/dataset-independent identification power (Supplementary Fig. 4).
3.4.3 Correlation between ProVerB and Mascot scores
The scores in MS identification algorithms quantitatively reflect the significance of the identification. We then compared the score values of ProVerB and Mascot using the S. pneumoniae D39 dataset (more than 270,000 spectra) (Fig. 4). The Pearson correlation coefficient reached 0.8124 (P < 10^-16), showing a good correlation between the two algorithms. This validates that ProVerB provides a scoring scheme compatible with Mascot.
Fig. 4: The scatter plot of ProVerB and Mascot scores for the S. pneumoniae D39 dataset.
The boom of proteomics applications and the wide variety of mass spectrometry technologies for peptide identification necessitate a versatile and accurate peptide identification algorithm. In this paper, we present a new algorithm, ProVerB, based on a novel binomial distribution statistical model, and validated its accuracy, robustness and compatibility. Additionally, ProVerB is an open-source program, so no algorithmic detail is hidden as in the commercial software packages. Users may tune the parameters according to their specific experimental setup to optimize the results. In addition, it can be compiled on various operating systems and has a user-friendly graphical user interface. Although ProVerB does not support ECD/ETD mass spectrometry data, we believe that ProVerB will find broad application in proteomics studies and provide more robust and accurate results than the two commercial algorithms, generating a more solid base of data for downstream analyses.
Chuan-Le Xiao, Gong Zhang and Qing-Yu He conceived this project. The theoretical model of ProVerB was developed by Chuan-Le Xiao and Xiao-Zhou Chen. The algorithm was originally programmed by Yang-Li Du. The testing was carried out by Chuan-Le Xiao and Gong Zhang. The experimental part of the D39 dataset was accomplished by Xuesong Sun.
This work was jointly supported by the National "973" Projects of China (2011CB910700), the National Natural Science Foundation of China (20871057, 31000373 and 31200612), the Fundamental Research Funds for the Central Universities (11610101 and 21611201), the "211" Projects and the Pearl River Rising Star of Science and Technology of Guangzhou City (2011048b).
We are grateful to Shuai Liu and Chao Ma for their help with programming ProVerB and for technical hints on performance optimization.
Three supplementary tables and supplementary notes that support this article are available free of charge via the Internet at http://pubs.acs.org. The ProVerB program, source code and test dataset can be downloaded at http://bioinformatics.jnu.edu.cn/software/proverb/.