June 2020
Billy B Richards - Oxford University
Abstract
Natural Language Processing (NLP) sits at the forefront of deep learning, using computational models with multiple processing layers to represent data at multiple levels of abstraction. Recent breakthroughs in bidirectional neural networks and unsupervised learning have triggered a technological shift, allowing machines to understand complex relationships within language (written or spoken). Biochemists are now using these techniques directly: segmenting biochemical data (such as DNA or amino-acid sequences) into word-like structures, using near-identical pre-training routines, and then performing supervised learning tasks. The more ‘word-like’ the sequence segment, the better the results. Synergy between these fields is defining the state of the art in modelling complex dependencies within large datasets. The result: predictions and classifications that were previously impossible, as well as hope in the battle to understand how life works.
Introduction
All research is data-driven, and ever since phage ΦX174 was sequenced in 1977 [1] there has been an enormous increase in the dimensionality and acquisition rate of biological data that is challenging conventional analysis techniques. Machine Learning (ML) and, notably, Deep Learning (DL) algorithms appeal to the scientific community due to their ability to automatically detect patterns in data, owing to the introduction of ‘hidden layers’ [2]. Free from hard-coded hypotheses and assumptions, they allow machines to be more humanlike. Research into artificial neurons began in 1943 [3], beginning a journey that has culminated in Deep Neural Networks (DNNs) (Figure 1). These empower computers to perform the subconscious human process of learning by example. The introduction of these ‘hidden’ layers nonlinearly transforms inputs into a space where the classes become linearly separable. As a result, computers can classify text, images and sounds with state-of-the-art accuracy, sometimes beating humans [4].
At the heart of ML research is the quest to understand language. Natural Language Processing (NLP) [5] has been defined as any work that computationally represents, transforms or utilises text (or speech) and its derivatives [6•]. Language is sequential by nature, and so shares many similarities with biochemical data. NLP has seen a recent technological shift [7•] owing to unsupervised or weakly supervised pretraining of algorithms. This results in a ‘subconscious’ understanding of complex relationships within the sequence, such as the context dependency of words: ‘The children love to play in the leaves’; ‘They do not like when their father leaves for work’. In these sentences the meaning of ‘leaves’ differs, just as the meaning of a DNA motif can differ depending on whether it is found in a UTR or an intron. Further fine-tuning of these pre-trained algorithms has been found to yield ground-breaking results in highly complicated language tasks [8], and more recently in biochemical prediction [9].
With this review, I seek to answer the question: has NLP ‘transformed’ the study of sequential data in bioscience? I redefine NLP as the application of transferable NLP techniques to any sequence-based data, and take the bar for ‘transform’ from Grove [10], as has been done in other critical studies in this field [11]. Grove’s definition uses the term ‘Strategic Inflection Point’ to refer to ‘a change in technologies or environment that requires a business to be fundamentally reshaped’.
Learning Language through Representation
ML algorithms model complex dependencies via analysis of features. Their performance strongly depends on how the data are represented: how each variable (or feature) is computed. For example, in order to classify a tumour as malignant or benign from a microscopy image, a pre-processing algorithm could detect cells, identify each cell type and generate a list of cell counts per type. An ML algorithm would take these estimated cell counts (examples of handcrafted features) as inputs to classify the tumour. The performance of the algorithm would be dependent on the quality and relevance of these features: mistakes, incorrect labelling or non-physiological distributions owing to machine or human error would inhibit accurate classification. Deep learning addresses this issue by integrating the computation of features (representation learning) into the ML model itself, yielding end-to-end models [12] and accurate generation of low-dimensional vector representations from raw input data (Figure 2). Neural networks are not the focus of this review; however, Goodfellow et al. [12] cover them in detail and LeCun et al. [2] give a more general introduction.
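To make the contrast concrete, the following minimal sketch (in Python with PyTorch; the layer sizes, the number of handcrafted features and the toy input are my own illustrative assumptions, not taken from any cited study) compares a classifier fed pre-computed features with an end-to-end model whose first layers learn a representation directly from raw input.

```python
import torch
import torch.nn as nn

# Route 1: handcrafted features. The model only ever sees pre-computed
# summaries (e.g. estimated counts of five cell types), so upstream errors
# propagate straight into the classifier.
handcrafted_classifier = nn.Sequential(
    nn.Linear(5, 16),                 # 5 hypothetical cell-count features
    nn.ReLU(),
    nn.Linear(16, 2),                 # malignant vs benign logits
)

# Route 2: end-to-end. Representation learning is part of the model itself:
# an embedding layer plus non-linear layers map raw symbol ids to a
# low-dimensional vector before classification.
end_to_end_classifier = nn.Sequential(
    nn.Embedding(num_embeddings=100, embedding_dim=32),   # learned representation
    nn.Flatten(),
    nn.Linear(32 * 50, 64),           # assumes raw inputs of length 50
    nn.ReLU(),
    nn.Linear(64, 2),
)

raw_input = torch.randint(0, 100, (8, 50))     # batch of 8 raw sequences, length 50
print(end_to_end_classifier(raw_input).shape)  # torch.Size([8, 2])
```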
For language modelling, feature extraction involves generating representations of words [13], sentences [14] or indeed paragraphs and documents [15]. In biochemistry, this involves generating representations of important regions within the sequence data of interest: nucleotide [16–19], protein [20••], glycan [9] or combinations thereof [21].
Figure 1

Embeddings seek to capture contextual information, positional dependencies and other high-dimensional relationships; by definition, they are generated with minimal supervision. Language concerns semantic relationships; therefore, words with similar semantic and linguistic properties have embeddings that are close (in Euclidean distance) in the vector space, allowing semantic similarity to be assessed as mathematical similarity between vectors [14]. Word2Vec by Mikolov et al. [22] gave rise to the first large-scale implementation of word embeddings via both continuous bag-of-words (CBOW) and skip-gram models. CBOW predicts a word based on the context of the surrounding words, while skip-gram is the opposite, predicting the context from the target word itself.
The effectiveness of their embeddings was subsequently demonstrated: vector(“king”) – vector(“man”) + vector(“woman”) gave vector(“queen”). Feature extraction for biological sequences faces the same challenge of how best to retain contextual information. Successes in language modelling have been harnessed in genomics [16,18,23••] and protein studies [24], with techniques such as Seq-SetNet [25], dna2vec [26] and Sequence2Vec [27] all improving on the next best in class.
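To make the parallel with biological sequences concrete, the sketch below (a minimal illustration using the gensim library; the toy DNA sequences, k-mer size and hyperparameters are my own choices, not those of dna2vec or Sequence2Vec) treats overlapping k-mers as ‘words’ and each sequence as a ‘sentence’, then trains a skip-gram Word2Vec model so that k-mers occurring in similar contexts end up close together in the embedding space.

```python
from gensim.models import Word2Vec

def kmerize(seq, k=3):
    """Split a sequence into overlapping k-mers, the sequence's 'words'."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Toy corpus: each DNA sequence becomes a 'sentence' of 3-mer 'words'.
sequences = [
    "ATGCGTACGTTAGCATGCGT",
    "ATGCGTTCGTTAGCTTGCGA",
    "GGCATTACGTAGCATGCGTA",
]
corpus = [kmerize(s, k=3) for s in sequences]

# sg=1 selects skip-gram (predict surrounding k-mers from a target k-mer);
# sg=0 would give the CBOW variant described above.
model = Word2Vec(corpus, vector_size=32, window=5, min_count=1, sg=1, epochs=50)

# K-mers seen in similar contexts should receive similar vectors.
print(model.wv.most_similar("ATG", topn=3))
```

On a natural-language corpus, the same `most_similar` call with `positive=["king", "woman"]` and `negative=["man"]` recovers the ‘queen’ analogy described above.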
A turning point in language modelling came when Google released its Bidirectional Encoder Representations from Transformers (BERT) model, whose key technical innovation was the application of bidirectional learning [28]: each token’s representation is conditioned on context from both directions of the sequence, rather than on a single left-to-right pass. The low-level implications of this are beyond the scope of this review, but in principle it allows the production of both a classifier and a generator, operating in opposite directions, from the same dataset. The combination of deep neural pretraining, attention (Figure 3) and bidirectionality gave rise to BERT’s superior results, leading to the creation of BioBERT [29] and SciBERT [30] as well as numerous related models including ALBERT, RoBERTa and ELMo [8,31]. Word2Vec and GloVe [32] feature extraction techniques provide embeddings that perform well on bioscientific language tasks, but are outperformed by models pre-trained on biology-specific corpora [13,29,30,33].
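As a sketch of how such pre-trained bidirectional models are reused downstream, the example below (using the Hugging Face transformers library with the general-purpose `bert-base-uncased` checkpoint; a biomedical checkpoint such as a BioBERT release could be swapped in, and the sentences follow the ‘leaves’ example above) extracts contextual embeddings in which the same surface word receives a different vector in each sentence.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any BERT-style checkpoint works here; substituting a biomedical variant
# (e.g. a BioBERT or SciBERT release) only requires changing the name.
name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

sentences = [
    "The children love to play in the leaves.",
    "They do not like when their father leaves for work.",
]

with torch.no_grad():
    for text in sentences:
        inputs = tokenizer(text, return_tensors="pt")
        hidden = model(**inputs).last_hidden_state[0]          # (tokens, 768)
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        vector = hidden[tokens.index("leaves")]                 # contextual vector
        print(text, vector[:4])   # same word, different embedding per context
```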
Biomedical Text Mining (BTM)
There is a huge amount of information trapped in years of literature; combined with the growing availability of unstructured biomedical text (clinical trials, articles, electronic health records (EHRs) and patient-authored texts), this makes effective BTM of great importance.
Language modelling’s main tasks are question answering, textual entailment, semantic role labelling, coreference resolution, Named Entity Recognition (NER), sentiment analysis and, by extension, Relation Extraction (RE). Applications are stratified by domain (clinical and biomedical) and by task. This review does not focus on clinical NLP; however, it is an active area of research owing to increased data availability, in part due to devices such as the Apple Watch and Fitbit. Clinical NLP is well reviewed by Wu et al. [6•], and the convergence of clinical and biological NLP is pioneering cutting-edge personalised medicine [34••,35] as well as novel approaches to vaccine development [36].
Figure 2

In early bioNLP, NER, RE and classification used traditional pipelines that combined hand-crafted pre-processing (such as string matching) with statistical techniques to create simple representations of clinical text [37]. It should be noted that traditional pipelines such as these still produce excellent results, provided the data are of high quality and the feature extraction is representative [38]. The drawback is that high-quality results require hundreds of hours of work as well as domain expertise. Performance of NER has been improved by reformulating the task as a sequence labelling problem using conditional random fields [39]. RE has seen various innovative DL architectures, well reviewed in [11]. Iterative improvements in all areas have resulted from embeddings [40], attention [41] and pretraining [42,43].
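A minimal sketch of the sequence-labelling reformulation is given below (using the sklearn-crfsuite package; the toy sentences, BIO tag scheme and handcrafted token features are illustrative assumptions of mine, not the setup of [39]).

```python
import sklearn_crfsuite

def token_features(sent, i):
    """Simple handcrafted features for the i-th token of a sentence."""
    word = sent[i]
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "has_digit": any(c.isdigit() for c in word),
        "suffix3": word[-3:],
        "prev": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

# Toy training data in BIO format: B-/I- mark the beginning/inside of a
# gene mention, O marks everything else.
sentences = [["BRCA1", "mutations", "increase", "cancer", "risk"],
             ["TP53", "is", "a", "tumour", "suppressor", "gene"]]
labels = [["B-GENE", "O", "O", "O", "O"],
          ["B-GENE", "O", "O", "O", "O", "O"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
y = labels

# A linear-chain CRF scores the whole label sequence jointly, so the
# prediction for one token depends on its neighbours' labels too.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, y)

print(crf.predict([[token_features(["EGFR", "signalling"], i) for i in range(2)]]))
```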
These recent advancements in representation learning have allowed breakthrough progress in predicting novel drug indications [44], improved understanding of the relationships between genes, drugs and diseases [45–47], and expanded our horizons regarding protein studies [38,48,49]. Ching et al. [11] state that ‘technical advancements in current methods’ would be needed to realise NLP’s full potential in this domain. Bidirectional pretraining is that advancement.
Figure 3

NLP techniques on Biochemical Sequences
Studies on genetic or protein data traditionally use a position weight matrix (PWM) to generate input features (Figure 4). Protein structure prediction [50], molecular function identification [51,52], synthetic biology [53••] and substrate specificity identification [54] commonly use Multiple Sequence Alignments (MSAs), converting these into PWMs to feed neural networks. PWMs typically result in fixed-size vector representations of the input data, and research into protein-sequence embeddings has primarily focussed on unsupervised k-mer co-occurrence [55,56]. Seminal studies into Enhancer-Promoter Interaction (EPI) and Transcription Factor (TF) binding to DNA [57,58] use similar PWM techniques, representing DNA via ‘one-hot encoding’ [59] as 4×L images (where L is the length of the sequence) and using a CNN to model the DNA sequences. Other ML-based enhancer prediction techniques such as SPEID [60] are similar: SPEID transforms variable-length sequences into fixed-length k-mer features to classify the input sequence. These methods, based on similar approaches in NLP [15], are almost always restricted: by taking a fixed-size input, they fail to preserve contextual information during representation learning. In the case of position-specific scoring matrices (a type of PWM), the failure stems from the fact that an MSA is an unordered set of sequences rather than a matrix: interchanging two sequences has no effect on an MSA but would for rows of pixels in an image. Furthermore, featurising into two dimensions seems unnatural when a DNA sequence is one-dimensional [61•].
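The sketch below illustrates this 4×L representation (a minimal PyTorch example; the toy sequence, filter sizes and single-logit output are my own assumptions, not those of the cited enhancer or TF-binding models): each nucleotide becomes a one-hot column, and a 1-D convolution scans the matrix much like a learned motif detector before pooling collapses it to a fixed-size vector.

```python
import torch
import torch.nn as nn

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA sequence as a 4 x L matrix with one column per base."""
    x = torch.zeros(4, len(seq))
    for i, base in enumerate(seq):
        x[BASES.index(base), i] = 1.0
    return x

seq = "ACGTGCGTATTAGCCATGCA"           # toy sequence, L = 20
x = one_hot(seq).unsqueeze(0)           # shape (1, 4, L): batch, channels, length

# Each convolutional filter acts like a learned motif detector scanning the
# one-hot matrix; pooling then collapses the output to a fixed-size vector,
# which is where positional and contextual information begins to be lost.
model = nn.Sequential(
    nn.Conv1d(in_channels=4, out_channels=8, kernel_size=6),
    nn.ReLU(),
    nn.AdaptiveMaxPool1d(1),            # fixed-size summary regardless of L
    nn.Flatten(),
    nn.Linear(8, 1),                    # e.g. enhancer vs non-enhancer logit
)

print(model(x).shape)                   # torch.Size([1, 1])
```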
Language-based approaches such as that of [61] address this challenge, and other new approaches involve sequence-to-sequence (Seq-Seq) networks borrowed from NLP, using symmetric functions to integrate features calculated from preceding layers [62]. Further improvements are seen in pre-training protein embeddings using weak supervision from global structural data [20].
Identifying anomalous DNA sequences from high-throughput genetic techniques is another realm in which NLP techniques are being widely used. By treating nucleotides as characters and codons as words, robust identification of viral reads in metagenomic samples has been made possible without homology searching [23]. Other active improvements include the identification of low-allele-frequency variants [18] and the prediction of sugar-sensitive immunogenicity at 92% accuracy [9].
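A hedged sketch of this ‘nucleotides as characters, codons as words’ idea is shown below (using scikit-learn; the toy reads, labels and the choice of a TF-IDF plus logistic-regression pipeline are mine for illustration, not the actual pipeline of [23]).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def to_codon_words(read):
    """Treat nucleotides as characters and non-overlapping codons as words."""
    return " ".join(read[i:i + 3] for i in range(0, len(read) - 2, 3))

# Toy reads with toy labels (1 = viral, 0 = non-viral); a real pipeline
# would be trained on labelled metagenomic reads.
reads = ["ATGCGTACGTTAGCA", "ATGCGTTCGTTAGCT", "GGCATTACGTAGCAT", "GGCTTTACGTAGCAA"]
labels = [1, 1, 0, 0]
docs = [to_codon_words(r) for r in reads]

# Term-frequency features over the codon vocabulary, fed to a linear
# classifier; no alignment or homology search is involved.
clf = make_pipeline(TfidfVectorizer(lowercase=False), LogisticRegression())
clf.fit(docs, labels)

print(clf.predict([to_codon_words("ATGCGTACGTTAGCT")]))
```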
Figure 4

Discussion
The most effective NLP techniques for text-based tasks are bidirectional LSTMs (bi-LSTMs). Wei et al. [41] compared architectures for extracting medications and adverse drug events from EHRs, showing that for NER and Relation Classification (RC) a bi-LSTM deep learning model outperformed traditional ML methods on all tasks, and all other DL models on all tasks except one. While improvements are continuing, the percentage gains achieved with new architectures are slowing as different modularisations are being explored [46].
The idea of using LSTM language-model representations as inputs for supervised learning problems, as part of a larger neural network, has shown success in NLP and was first used for protein structure prediction by Bepler and Berger [20]. The most advanced studies have all followed suit. Rational glycoengineering [9], metagenomics [23], protein structure prediction [20,25] and DNA classification [19] all use variants of the highest-performing language models: creating word-like features of the sequence data, pretrained with weak or no supervision using techniques canonical in language modelling such as CBOW [9,19], next-word prediction [63•] or term-frequency methods [23]. The most successful techniques employ attention and bidirectionality, mechanisms that revolutionised language modelling two years prior.
Unsupervised training underpins the success of language models, as it results in a holistic understanding of the relationships between words. These methods capture the kinds of relationships we humans struggle to describe: do you even know what polysemy means? Capturing relationships understood only by our subconscious certainly seems applicable to biochemistry. The complex intra- and inter-dependent relationships of DNA, RNA and protein (and everything in between) are beyond human comprehension; thus, any technique that captures them without the need for human-crafted input will naturally take the field forward.
To answer the question of whether NLP has transformed the study of sequence data: not yet. There has been an enormous technological shift since BERT, and it is certainly transforming the study of language itself. But for sequence data in general, there needs to be better and continued implementation of language-oriented pre-training to capture the high-dimensional relations, and that is before considering the challenges of data acquisition and the high level of domain expertise required to implement DL architectures. Both would need to improve drastically in order to satisfy Grove’s definition.
References
1. Sanger F, Air GM, Barrell BG, Brown NL, Coulson AR, Fiddes JC, Hutchison CA III, Slocombe PM, Smith M: Nucleotide sequence of bacteriophage φX174 DNA. Nature 1977, 265:687–695.
2. Lecun Y, Bengio Y, Hinton G: Deep learning. Nature 2015, 521:436–444.
3. McCulloch WS, Pitts W: A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys 1943, 5:115–133.
4. Buetti-Dinh A, Galli V, Bellenberg S, Ilie O, Herold M, Christel S, Boretska M, Pivkin I V., Wilmes P, Sand W, et al.: Deep neural networks outperform human expert’s capacity in characterizing bioleaching bacterial biofilm composition. Biotechnol Reports 2019, 22:e00321.
5. Young T, Hazarika D, Poria S, Cambria E: Recent trends in deep learning based natural language processing [Review Article]. IEEE Comput Intell Mag 2018, 13:55–75.
6•. Wu S, Roberts K, Datta S, Du J, Ji Z, Si Y, Soni S, Wang Q, Wei Q, Xiang Y, et al.: Deep learning in clinical natural language processing: a methodical review. J Am Med Inform Assoc 2020, 27:457–470. An essential review for anyone wishing to stay up to date on NLP; accessible and interesting, covering a wide variety of uses for NLP within medicine in general.
7•. Wang A, Singh A, Michael J, Hill F, Levy O, Bowman SR: GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. 2018. GLUE scores monitor the development of the field against a series of human benchmarks; constantly updated, it is a great way to stay in touch with the field’s development.
8. Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L: Deep contextualized word representations. In NAACL HLT 2018 - 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference. Association for Computational Linguistics (ACL); 2018:2227–2237.
9. Bojar D, Camacho DM, Collins JJ: Using Natural Language Processing to Learn the Grammar of Glycans. bioRxiv 2020, doi:10.1101/2020.01.10.902114.
10. Grove AS: Academy of Management. 1998.
11. Ching T, Himmelstein DS, Beaulieu-Jones BK, Kalinin AA, Do BT, Way GP, Ferrero E, Agapow PM, Zietz M, Hoffman MM, et al.: Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface 2018, 15:20170387.
12. Goodfellow I, Bengio Y, Courville A: Deep Learning. MIT Press; 2016.
13. Wang Y, Liu S, Afzal N, Rastegar-Mojarad M, Wang L, Shen F, Kingsbury P, Liu H: A comparison of word embeddings for the biomedical natural language processing. J Biomed Inform 2018, 87:12–20.
14. Tawfik NS, Spruit MR: Evaluating sentence representations for biomedical text: Methods and experimental results. J Biomed Inform 2020, 104.
15. Le Q, Mikolov T: Distributed representations of sentences and documents. 2014.
16. Amgarten D, Braga LPP, da Silva AM, Setubal JC: MARVEL, a tool for prediction of bacteriophage sequences in metagenomic bins. Front Genet 2018, 9:304.
17. Wang Y, Fu L, Ren J, Yu Z, Chen T, Sun F: Identifying Group-Specific Sequences for Microbial Communities Using Long k-mer Sequence Signatures. Front Microbiol 2018, 9:872.
18. Luo R, Sedlazeck F, Lam T-W, Schatz M: Clairvoyante: a multi-task convolutional deep neural network for variant calling in Single Molecule Sequencing. bioRxiv 2018, doi:10.1101/310458.
19. Le NQK, Yapp EKY, Ho QT, Nagasundaram N, Ou YY, Yeh HY: iEnhancer-5Step: Identifying enhancers using hidden information of DNA sequences via Chou’s 5-step rule and word embedding. Anal Biochem 2019, 571:53–61.
20••. Bepler T, Berger B: Learning protein sequence embeddings using information from structure. In 7th International Conference on Learning Representations, ICLR 2019; 2019. A very good assessment of the similarities between protein sequence and language problems; uses weakly supervised pre-training and a bi-directional model.
21. Eraslan G, Avsec Ž, Gagneur J, Theis FJ: Deep learning: new computational modelling techniques for genomics. Nat Rev Genet 2019, 20:389–403.
22. Mikolov T, Yih W-T, Zweig G: Linguistic Regularities in Continuous Space Word Representations. Association for Computational Linguistics; 2013.
23••. Abdelkareem AO, Khalil MI, Elbehery AHA, Abbas HM: Viral Sequence Identification in Metagenomes using Natural Language Processing Techniques. bioRxiv 2020, doi:10.1101/2020.01.10.892158. Pioneers the use of attention in metagenomics.
24. Menegaux R, Vert J-P: Continuous embeddings of DNA sequencing reads, and application to metagenomics. 2018, doi:10.1101/335943.
25. Ju F, Zhu J, Wei G, Zhang Q, Sun S, Bu D: Seq-SetNet: Exploring Sequence Sets for Inferring Structures. 2019.
26. Ng P: dna2vec: Consistent vector representations of variable-length k-mers. 2017.
27. Dai H, Umarov R, Kuwahara H, Li Y, Song L, Gao X: Sequence2Vec: novel embedding approach for modeling transcription factor binding affinity landscape. 2017, doi:10.1093/bioinformatics/btx480.
28. Devlin J, Chang M-W, Lee K, Toutanova K: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2018.
29. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J: BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36:1234–1240.
30. Beltagy I, Lo K, Cohan A: SciBERT: A Pretrained Language Model for Scientific Text. 2019.
31. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V: RoBERTa: A Robustly Optimized BERT Pretraining Approach. 2019.
32. Pennington J, Socher R, Manning CD: GloVe: Global Vectors for Word Representation. 2014, doi:10.3115/v1/D14-1162.
33. Chen Z, He Z, Liu X, Bian J: Evaluating semantic relations in neural word embeddings with biomedical and general domain knowledge bases. BMC Med Inform Decis Mak 2018, 18:65.
34••. Xu J, Yang P, Xue S, Sharma B, Sanchez-Martin M, Wang F, Beaty KA, Dehan E, Parikh B: Translating cancer genomics into precision medicine with artificial intelligence: applications, challenges and future perspectives. Hum Genet 2019, 138:109–124. An excellent paper, detailing the multidiscipline improvements required to progress the field within the context of genomics and cancer therapy. It raises valid points regarding the improvements needed in sequencing and the technology capable of analysing such high dimensional data.
35. Grapov D, Fahrmann J, Wanichthanarak K, Khoomrung S: Rise of deep learning for genomic, proteomic, and metabolomic data integration in precision medicine. Omi A J Integr Biol 2018, 22:630–636.
36. Qiu X, Duvvuri VR, Bahl J: Computational Approaches and Challenges to Developing Universal Influenza Vaccines. Vaccines 2019, 7:45.
37. Sager N, Lyman M, Bugknall C, Nhan N, Tick LJ: Natural language processing and the representation of clinical data. J Am Med Informatics Assoc 1994, 1:142–160.
38. Badal VD, Kundrotas PJ, Vakser IA: Natural language processing in text mining for structural modeling of protein complexes. BMC Bioinformatics 2018, 19:84.
39. Koehorst JJ, van Dam JCJ, Saccenti E, Martins dos Santos VAP, Suarez-Diez M, Schaap PJ: tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics 2017, 34:1401–1403.
40. Hao Y, Liu X, Wu J, Lv P: Exploiting Sentence Embedding for Medical Question Answering. 2018.
41. Wei Q, Ji Z, Li Z, Du J, Wang J, Xu J, Xiang Y, Tiryaki F, Wu S, Zhang Y, et al.: A study of deep learning approaches for medication and adverse drug event extraction from clinical text. J Am Med Inform Assoc 2020, 27:13–21.
42. Kalyan KS, Sangeetha S: SECNLP: A survey of embeddings in clinical natural language processing. J Biomed Inform 2020, 101.
43. Sahu SK, Anand A: Unified neural architecture for drug, disease, and clinical entity recognition. In Deep Learning Techniques for Biomedical and Health Informatics. . Elsevier; 2020:1–19.
44. Jang G, Lee T, Lee BM, Yoon Y: Literature-based prediction of novel drug indications considering relationships between entities. Mol Biosyst 2017, 13:1399–1405.
45. Bouaziz J, Mashiach R, Cohen S, Kedem A, Baron A, Zajicek M, Feldman I, Seidman D, Soriano D: How artificial intelligence can improve our understanding of the genes associated with endometriosis: Natural language processing of the pubmed database. Biomed Res Int 2018, 2018.
46. Fabris F, Palmer D, Salama KM, de Magalhães JP, Freitas AA: Using deep learning to associate human genes with age-related diseases. Bioinformatics 2019, 36:2202–2208.
47. Wang P, Hao T, Yan J, Jin L: Large-scale extraction of drug–disease pairs from the medical literature. J Assoc Inf Sci Technol 2017, 68:2649–2661.
48. Badal VD, Kundrotas PJ, Vakser IA: Text Mining for Protein Docking. PLoS Comput Biol 2015, 11.
49. Peng Y, Lu Z: Deep learning for extracting protein-protein interactions from biomedical literature. Association for Computational Linguistics (ACL); 2017:29–38.
50. An JY, Zhou Y, Zhao YJ, Yan ZJ: An Efficient Feature Extraction Technique Based on Local Coding PSSM and Multifeatures Fusion for Predicting Protein-Protein Interactions. Evol Bioinforma 2019, 15.
51. Le NQK, Ho QT, Ou YY: Classifying the molecular functions of Rab GTPases in membrane trafficking using deep convolutional neural networks. Anal Biochem 2018, 555:33–41.
52. Le NQK, Yapp EKY, Ou YY, Yeh HY: iMotor-CNN: Identifying molecular functions of cytoskeleton motor proteins using 2D convolutional neural network via Chou’s 5-step rule. Anal Biochem 2019, 575:17–26.
53••. Alley EC, Khimulya G, Biswas S, AlQuraishi M, Church GM: Unified rational protein engineering with sequence-based deep representation learning. Nat Methods 2019, 16:1315–1322. Uses language-inspired techniques to distil the key features of a protein into a numerical representation that is grounded in structure, evolution and biophysical behaviour. Truly incredible work.
54. Nguyen TTD, Le NQK, Ho QT, Phan D Van, Ou YY: Using word embedding technique to efficiently represent protein sequences for identifying substrate specificities of transporters. Anal Biochem 2019, 577:73–81.
55. Asgari E, Mofrad MRK: Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS One 2015, 10.
56. Yang KK, Wu Z, Bedbrook CN, Arnold FH: Learned protein embeddings for machine learning. Bioinformatics 2018, 34:2642–2648.
57. Alipanahi B, Delong A, Weirauch MT, Frey BJ: Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol 2015, 33:831–838.
58. Zhou J, Troyanskaya OG: Predicting effects of noncoding variants with deep learning-based sequence model HHS Public Access. Nat Methods 2015, 12:931–934.
59. Potdar K, Pardawala TS, Pai CD: A Comparative Study of Categorical Variable Encoding Techniques for Neural Network Classifiers. Int J Comput Appl 2017, 175:7–9.
60. Ghandi M, Lee D, Mohammad-Noori M, Beer MA: Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features. PLoS Comput Biol 2014, 10:e1003711.
61•. Zeng W, Wu M, Jiang R: Prediction of enhancer-promoter interactions via natural language processing. BMC Genomics 2018, 19:84. This paper is innovative in its approach and rationally challenges vision-based attempts at handling genetic information.
62. Mirabello C, Wallner B: rawMSA: End-to-end Deep Learning Makes Protein Sequence Profiles and Feature Extraction obsolete. 2018, doi:10.1101/394437.