This is the final project for the graduate-level course Natural Language Processing with Representation Learning. The project builds on Chemformer (Irwin et al., 2022) and LlaSMol (Yu et al., 2024). Rather than exploring the capacity of large language models on these tasks, we focus on the fine-tuning process of small language models and on dataset construction for training, motivated by the limited compute and data resources available in many real-world scenarios. We first propose a data construction pipeline for settings where time and chemical resources are scarce. We then evaluate several fine-tuning methods on a BART-based pretrained small-scale language model and obtain performance competitive with full-parameter fine-tuning. We test our models both on the dataset we constructed and on the original dataset.
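As a minimal sketch of what one such fine-tuning setup could look like, the snippet below applies LoRA adapters (via Hugging Face PEFT) to a BART-style seq2seq model on a toy reactant-to-product SMILES pair. The specific method, checkpoint name, target modules, and hyperparameters are illustrative assumptions, not the project's actual configuration.

```python
# Illustrative sketch: parameter-efficient fine-tuning of a BART-style
# seq2seq model with LoRA. Checkpoint and hyperparameters are placeholders.
from transformers import BartForConditionalGeneration, BartTokenizer
from peft import LoraConfig, get_peft_model, TaskType

model_name = "facebook/bart-base"  # placeholder, not a chemistry-pretrained checkpoint
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# LoRA adapters on the attention projections; only these low-rank matrices are trained.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically a small fraction of the full parameter count

# One training step on a toy reactant -> product SMILES pair.
src = tokenizer("CCO.CC(=O)O", return_tensors="pt")
tgt = tokenizer("CCOC(C)=O", return_tensors="pt")
outputs = model(input_ids=src.input_ids,
                attention_mask=src.attention_mask,
                labels=tgt.input_ids)
outputs.loss.backward()
```

Adapter-style methods like this keep the pretrained weights frozen and train only a small number of added parameters, which is one way to approach full-parameter fine-tuning performance under tight resource budgets.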
References

Botao Yu, Frazier N. Baker, Ziqi Chen, et al. LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset. 2024.

Ross Irwin, Spyridon Dimitriadis, Jiazhen He, et al. Chemformer: a pre-trained transformer for computational chemistry. Machine Learning: Science and Technology, Jan 2022.