- Genome mining reveals high topological diversity of ω-ester-containing peptides and divergent evolution of ATP-grasp macrocyclases. Hyunbin Lee, MinGyu Choi, Jung-Un Park, Heejin Roh, and 1 more author. Journal of the American Chemical Society, 2020.
ω-Ester-containing peptides (OEPs) are a family of ribosomally synthesized and post-translationally modified peptides (RiPPs) containing intramolecular ω-ester or ω-amide bonds. Although their distinct side-to-side connections may create considerable topological diversity of multicyclic peptides, it is largely unknown how diverse ring patterns have been developed in nature. Here, using genome mining of biosynthetic enzymes of OEPs, we identified genes encoding nine new groups of putative OEPs with novel core consensus sequences, disclosing a total of ∼1500 candidate OEPs in 12 groups. Connectivity analysis revealed that OEPs from three different groups contain novel tricyclic structures, one of which has a distinct biosynthetic pathway where a single ATP-grasp enzyme produces both ω-ester and ω-amide linkages. Analysis of the enzyme cross-reactivity showed that, while enzymes are promiscuous to nonconserved regions of the core peptide, they have high specificity to the cognate core consensus sequence, suggesting that the enzyme–core pair has coevolved to create a unique ring topology within the same group and has sufficiently diversified across different groups. Collectively, our results demonstrate that the diverse ring topologies, in addition to diverse sequences, have been developed in nature with multiple ω-ester or ω-amide linkages in the OEP family of RiPPs.
- Progressing: A Regression Framework for Contrastive Learning of Continuous Labels. MinGyu Choi, Wonseok Shin, Yijingxiu Lu, and Hong Soona. Under Review, 2022.
- Joint Similarity based Contrastive Learning of Sentence Embedding. Yijingxiu Lu, MinGyu Choi, and Kim Sun. Under Review, 2022.
Contrastive learning has recently emerged as a new frontier in language model pre-training for generating high-quality sentence representations. However, data augmentation almost unavoidably risks changing the meaning of the original sentence. Learning high-quality sentence embeddings through contrastive learning therefore remains an important unresolved issue. In this work, we propose an unsupervised contrastive learning framework that is aware of sentence relationships and captures semantic similarity across different data views. We point out a potential problem with applying existing contrastive learning objectives to multiple augmentations and introduce a joint similarity term to learn self-aware sentence representations through pre-training. We evaluate the framework on seven semantic textual similarity (STS) downstream tasks against the previous best baselines. Achieving state-of-the-art performance, our model proves effective at generating embeddings that are aware of true contextual semantics.
- Triangular Contrastive Learning on Molecular Graphs. MinGyu Choi, Wonseok Shin, Yijingxiu Lu, and Sun Kim. arXiv preprint arXiv:2205.13279, 2022.
Recent contrastive learning methods have been shown to be effective in various tasks, learning generalizable representations invariant to data augmentation and thereby leading to state-of-the-art performance. Given the multifaceted nature of the large unlabeled data used in self-supervised learning, while the majority of real-world downstream tasks use a single format of data, building a multimodal framework that can train a single modality to learn diverse perspectives from other modalities is an important challenge. In this paper, we propose TriCL (Triangular Contrastive Learning), a universal framework for trimodal contrastive learning. TriCL takes advantage of Triangular Area Loss, a novel intermodal contrastive loss that learns the angular geometry of the embedding space by simultaneously contrasting the areas of positive and negative triplets. Systematic observation of the embedding space in terms of alignment and uniformity showed that Triangular Area Loss can address the line-collapsing problem by discriminating modalities by angle. Our experimental results also demonstrate the superior performance of TriCL on the downstream task of molecular property prediction, which implies that the advantages of the embedding space indeed benefit performance on downstream tasks.
- On Modeling and Utilizing Chemical Compound Information with Deep Learning Technologies: A Task-oriented Approach. Sangsoo Lim, Sangseon Lee, Yinhua Piao, MinGyu Choi, and 3 more authors. Computational and Structural Biotechnology Journal, 2022.
A large number of chemical compounds are available in databases such as PubChem and ZINC. However, the currently known compounds, though numerous, represent only a fraction of all possible compounds, the set known as chemical space. Many of the compounds in these databases are annotated with properties and assay data that can be used for drug discovery efforts. For this goal, a number of machine learning algorithms have been developed, and recent deep learning technologies can be effectively used to navigate chemical space, especially for unknown chemical compounds, in terms of drug-related tasks. In this article, we survey how deep learning technologies can model and utilize chemical compound information in a task-oriented way by exploiting annotated properties and assay data in chemical compound databases. We first compile the kinds of tasks that machine learning methods aim to accomplish. Then, we survey deep learning technologies to show their modeling power and current applications for accomplishing drug-related tasks. Next, we survey deep learning techniques that address the insufficiency of annotated data for more effective navigation of chemical space. Chemical compound information alone may not be powerful enough for drug-related tasks, so we survey what kinds of information, such as assay and gene expression data, can be used to improve the prediction power of deep learning models. Finally, we conclude this survey with four important newly developed technologies that are yet to be fully incorporated into computational analysis of chemical information.