
CSPred: A machine-learning-based compound model to identify the functional activities of biologicall

Peptide drugs offer better target selectivity than conventional “small molecule” drugs, which typically have a molecular weight of <500 Da. However, peptide-based ‘biologics’ show reduced bioavailability because they are less stable in physiological environments than small-molecule drugs. There are, however, several cysteine-stabilized peptide toxins whose disulfide bridges give them high stability. Some of these peptides are structurally well characterized, such as sequential tri-disulfide peptides (STPs), peptides with inhibitor cystine knots (ICKs), and cyclic ICKs (cyclotides). Others are identified by their origin, such as conotoxins, scorpion toxins, snake toxins, and agatoxins, while still others are grouped by function, e.g., defensins, channel blockers, and protease inhibitors. Although these peptides are characteristically toxins, not all of them are harmful to humans. Hence, they have considerable potential as specifically targeted, stable peptide drugs, insecticides, and antimicrobial peptides (to preserve food). Several of these cysteine-stabilized peptides have already been licensed for clinical and agricultural use. Despite their importance, the functions of a wide array of cysteine-stabilized proteins remain undiscovered, because the low signal-to-noise ratio in their sequences makes them challenging to classify with conventional sequence alignment methods. In this study, we built a machine-learning-based compound model (CSPred) to predict the five most common functional profiles of cysteine-stabilized peptides from their primary sequences. The five functional properties are ion-channel blocker (ICB), antimicrobial peptide (AMP), acetylcholine receptor inhibitor (ACRI), serine protease inhibitor (SPI), and hemolytic peptide (HLP).
We constructed five supervised models, one for each of the functional properties mentioned above, using a feature generation method that exploits modified n-grams and skip-grams (m-NGSG) from natural language processing. This feature generation is coupled to a logistic regression classifier to form the complete supervised classifier. In five-fold cross-validation, each of the five models achieved a higher area under the curve (AUC) than the corresponding models constructed using PSI-BLAST and HMMER. Further, the m-NGSG-based models show better accuracy and Matthews correlation coefficient (MCC) values on the out-of-sample test sets than the corresponding PSI-BLAST- and HMMER-based models. These results demonstrate that CSPred is, at present, the best-performing model for predicting the functional profile of cysteine-stabilized proteins. CSPred is freely available as a web server at watson.ecs.baylor.edu/cspred.
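
To make the idea of pairing n-gram and skip-gram features with a logistic regression classifier concrete, the following is a minimal sketch in plain Python. It is not the m-NGSG implementation: the specific modifications behind m-NGSG are not described here, so the sketch uses ordinary contiguous n-grams and gapped residue pairs. The function names (`ngram_skipgram_counts`, `train_logistic`, `predict_proba`), the parameter defaults, and the four toy sequences are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def ngram_skipgram_counts(seq, n_max=2, skip_max=2):
    """Count contiguous n-grams (length 1..n_max) and skip-grams
    (residue pairs separated by 1..skip_max skipped positions)."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    for k in range(1, skip_max + 1):
        for i in range(len(seq) - k - 1):
            # "_" marks each skipped position, e.g. "C_C" or "C__C".
            counts[seq[i] + "_" * k + seq[i + k + 1]] += 1
    return counts


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    """Probability of the positive class for one sparse feature dict."""
    z = bias + sum(weights.get(f, 0.0) * v for f, v in x.items())
    return sigmoid(z)


def train_logistic(features, labels, epochs=200, lr=0.1):
    """Fit logistic regression by stochastic gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            err = predict_proba(weights, bias, x) - y
            bias -= lr * err
            for f, v in x.items():
                weights[f] = weights.get(f, 0.0) - lr * err * v
    return weights, bias


# Toy data (made up): 1 = cysteine-rich, 0 = cysteine-free.
toy_sequences = ["CKGCC", "CCRGC", "AKGLA", "GLRAG"]
toy_labels = [1, 1, 0, 0]

toy_features = [ngram_skipgram_counts(s) for s in toy_sequences]
weights, bias = train_logistic(toy_features, toy_labels)

for s in toy_sequences:
    p = predict_proba(weights, bias, ngram_skipgram_counts(s))
    print(s, round(p, 3))
```

In a setup like the one described above, one such binary classifier would be trained per functional property (ICB, AMP, ACRI, SPI, HLP), each on its own labeled sequences; a real pipeline would also need held-out evaluation, which this sketch omits.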
