Human and Machine Intelligence: Building the Future of Text Analytics
Text analytics draws on multiple techniques, from simple string matching to deeper statistical analyses. Recently, Deep Learning (DL) has emerged as a compelling approach to many common use-cases, including Named Entity Recognition (NER), document classification and relationship extraction. Novel algorithms such as Word2Vec, ELMo, BERT, XLNet and ERNIE, released by acknowledged sources, represent a step-change in machine understanding of human-written text.
In the life sciences, text analytics plays a major role in unlocking the insight within internal repositories containing medical records, laboratory experiments, patents, scientific literature and more. Forward-thinking CTOs and CIOs are determining how best to deploy these latest approaches to address key industry challenges such as enterprise search, drug repurposing, pharmacovigilance monitoring and the creation of knowledge graphs. Given the immaturity and level of flux of these models, deploying them within an enterprise environment is far from straightforward. Indeed, the BERT, XLNet and ERNIE papers were all published within a month of one another, each claiming substantial improvements over the prior technique. Further iterations of these models are likely, which poses three major challenges:
♦ Change Management: How can new algorithms be adopted without breaking existing workflows and incurring expensive remediation?
♦ Model Training: Published models only work in specific environments; additional infrastructure is required to tailor DL models to specific internal business questions. Costs to train and deploy these models are estimated at $10,000+ per iteration.
♦ Annotation Consistency: As these models evolve over time, how can one ensure consistent behaviour (e.g. a company’s products are correctly identified each time)?
SciBite’s award-winning text-analytics platform is used by the world’s largest science-based businesses to address these issues. Our APIs offer a unified point of entry to these technologies, providing an abstraction layer on top of the DL models. This consistent layer insulates and future-proofs users from the flux in the machine learning world. The combination of our data-led expertise and technology can rapidly generate bespoke training data, tailored to individual needs.
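To illustrate the idea of such an abstraction layer (this is a conceptual sketch, not SciBite’s actual API; the names `AnnotationApi`, `NerBackend` and `DictionaryBackend` are hypothetical), callers depend on one stable interface while the model behind it can be swapped as the DL landscape moves:

```python
from abc import ABC, abstractmethod

class NerBackend(ABC):
    """An interchangeable NER engine (e.g. a BERT-based or dictionary-based model)."""
    @abstractmethod
    def annotate(self, text: str) -> list:
        ...

class DictionaryBackend(NerBackend):
    """Trivial backend: exact lookup against a term -> entity-ID vocabulary."""
    def __init__(self, vocab: dict):
        self.vocab = vocab

    def annotate(self, text: str) -> list:
        lowered = text.lower()
        hits = []
        for term, entity_id in self.vocab.items():
            pos = lowered.find(term)
            if pos != -1:
                hits.append({"term": term, "id": entity_id, "start": pos})
        return hits

class AnnotationApi:
    """Stable entry point: callers code against this class, never against
    whichever DL model currently sits behind it."""
    def __init__(self, backend: NerBackend):
        self._backend = backend

    def swap_backend(self, backend: NerBackend) -> None:
        # Upgrading the model does not change the caller-facing contract.
        self._backend = backend

    def annotate(self, text: str) -> list:
        return self._backend.annotate(text)
```

When a better model ships, `swap_backend` replaces the engine without touching any downstream workflow, which is precisely the insulation described above.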
While our approach addresses the first two points above, annotation consistency requires a deeper understanding of what customers are trying to achieve. Most use-cases stem from a need to identify the “things, not strings” within sources, and the relationships between them. To a computer, the word “aspirin” is a string of 7 ASCII codes, one for each character. The computer does not know this is the name of a drug, nor that a completely different string, “acetylsalicylic acid”, refers to the same thing in the real world. This is why keyword-based search engines are so poor at returning a true picture of the results of a query, particularly in STEM domains, where almost any “thing” goes by many different names.
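The “strings, not things” problem is easy to demonstrate: to a program, “aspirin” is nothing but a sequence of character codes, and no string comparison will ever connect it to its chemical synonym:

```python
# To the machine, "aspirin" is just seven ASCII code points:
codes = [ord(c) for c in "aspirin"]
print(codes)  # [97, 115, 112, 105, 114, 105, 110]

# No string operation reveals that these two names denote the same drug:
print("aspirin" == "acetylsalicylic acid")  # False
```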
Ontologies are the de facto standard for encoding semantics in a form understandable by both humans and machines. Connecting “aspirin” and “acetylsalicylic acid” to the same unique ID within an ontology provides huge benefits when building data for search and knowledge-graph applications. The computer now understands that these are representations of the same drug and can better identify valid, desired relationships (e.g. drug to side-effect, gene to disease) within the data. Without ontologies as an anchor, DL approaches only hint at what has been found, without mapping to any form of community standard. When the DL model inevitably changes, the results will also change, and without these mappings, DL-alone approaches won’t be a reliable source for master-data management applications. Ontologies are more than just lists of entities; they represent an agreed standard, across a community, of what “things” mean. This human consensus is not something DL alone can, nor should, create.
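A minimal sketch of this normalisation step, assuming a hand-rolled synonym table (real ontologies such as ChEBI hold far richer synonym sets; CHEBI:15365 is ChEBI’s identifier for aspirin, and the lookup function below is illustrative, not a real API):

```python
from typing import Optional

# Tiny, hand-rolled synonym table mapping surface forms to ontology IDs.
SYNONYMS = {
    "aspirin": "CHEBI:15365",
    "acetylsalicylic acid": "CHEBI:15365",
    "ibuprofen": "CHEBI:5855",
}

def normalise(mention: str) -> Optional[str]:
    """Map a surface form to its ontology ID, if known."""
    return SYNONYMS.get(mention.lower())

# Two different strings now resolve to the same "thing":
print(normalise("Aspirin") == normalise("Acetylsalicylic Acid"))  # True
```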
At SciBite we’re pioneering a best-of-both strategy, combining cutting-edge DL technologies with established semantic techniques to enable customers like Pfizer1 to make more of their content via a proven cycle. Namely:
♦ Build on established, community-developed standards, vastly enriched and optimised for text-analytics purposes over public-domain alternatives
♦ Where no standard exists, use DL-assisted curation to generate training data and novel ontologies quickly and accurately
♦ Deploy a collaborative master-data-management system, where employees who know about their data can access and enrich those standards
♦ Combine semantic standards and DL to provide the most powerful APIs for named entity recognition, relationship extraction, document clustering and trend analysis
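The final point above, grounding DL output in semantic standards, can be sketched as follows. This is a conceptual illustration only: the function name, the span format and the made-up drug name “frobinib” are all hypothetical, not part of any real system:

```python
def ground_entities(dl_spans: list, synonyms: dict) -> list:
    """Attach ontology IDs to entity spans proposed by a DL NER model.
    Spans with no known mapping are kept, but flagged for human curation,
    closing the loop between DL discovery and community standards."""
    grounded = []
    for span in dl_spans:
        entity_id = synonyms.get(span["text"].lower())
        grounded.append({**span,
                         "id": entity_id,
                         "needs_curation": entity_id is None})
    return grounded

# A DL model found two candidate drugs; only one maps to a known ID.
spans = [{"text": "Aspirin", "label": "DRUG"},
         {"text": "frobinib", "label": "DRUG"}]
result = ground_entities(spans, {"aspirin": "CHEBI:15365"})
```

Spans flagged `needs_curation` would feed the DL-assisted curation step, so the ontology grows as the models surface new entities.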
We believe that together, Deep Learning and semantics have a significant role in the future of text analytics. For transforming unstructured content into usable data, we recommend a synergistic approach through a consistent API that addresses concerns over change management and future-proofing existing workflows. By leveraging the best of machine learning and organisational standards, there’s a bright future ahead for text analytics in the STEM domain!
Headquartered in the UK, with offices in the USA and Japan. Call: +44 (0) 1223 786 129