Synaptica’s Dave Clarke looks at ontologies for enterprise auto categorization at KMWorld Connect 2021
At KMWorld Connect 2021, Dave Clarke, CEO and co-founder, Synaptica, presented a keynote titled “Ontologies for Enterprise Auto Categorization.” Clarke considered how taxonomies are evolving into more complex ontologies and the impact of that shift on enterprise auto-categorization. In addition, he explored alternative methodologies and tools for categorizing content against very large-scale ontologies as well as smaller enterprise taxonomies.
KMWorld Connect 2021 is going on this week, November 15-18, with workshops on Friday, November 19. On-demand replays of sessions will be available for a limited time to registered attendees, and many presenters are also making their slide decks available through the conference portal. For more information, go to www.kmworld.com/conference/2021. Access to session archives will be available on or about November 29, 2021, so be sure to check back for on-demand replays.
Synaptica is a company that helps people organize, categorize, and discover enterprise knowledge, and it does so with knowledge graphs. Clarke compared and contrasted small enterprise-specific taxonomies with big knowledge graphs and ontologies, and explicit rules with machine learning auto-categorization methodologies. He also reviewed the pros and cons of each and what to use when.
The end game of auto-categorization is to help people find the needle in a haystack, explained Clarke. It improves both recall and precision by harmonizing the semantics of search and content terminology, helps to identify what is most relevant or salient, and provides content recommendations.
Tagging is distinct from categorization, said Clarke. “Semantic tagging” identifies the many individual concepts and named entities that are mentioned within the full text of a document and uses concept labels, disambiguators, and contextual rules. “Categorization” identifies the few concepts and named entities that best describe the whole document and uses term frequency, document zone relevancy, semantic proximity, semantic similarity, inferencing, and TF-IDF.
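To make the distinction concrete, here is a minimal Python sketch, not Clarke's or Synaptica's implementation. It assumes a hypothetical three-concept taxonomy (`TAXONOMY`), a tiny made-up corpus, and a smoothed TF-IDF formula chosen for illustration. `tag` returns every concept mentioned in a document, while `categorize` keeps only the highest-scoring few.

```python
import math
import re
from collections import Counter

# Hypothetical mini-taxonomy: concept label -> surface terms (illustrative only).
TAXONOMY = {
    "Knowledge Graphs": ["knowledge graph", "ontology"],
    "Machine Learning": ["machine learning", "training data"],
    "Search": ["search", "recall", "precision"],
}

def concept_counts(text):
    """Count how often each concept's terms appear in the text."""
    lowered = text.lower()
    counts = Counter()
    for concept, terms in TAXONOMY.items():
        n = sum(len(re.findall(r"\b" + re.escape(t) + r"\b", lowered)) for t in terms)
        if n:
            counts[concept] = n
    return counts

def tag(text):
    """Semantic tagging: every concept mentioned anywhere in the text."""
    return sorted(concept_counts(text))

def categorize(text, corpus, top_k=1):
    """Categorization: the few concepts with the highest TF-IDF score."""
    doc_freq = Counter()
    for other in corpus:
        doc_freq.update(concept_counts(other).keys())
    scores = {}
    for concept, tf in concept_counts(text).items():
        # Smoothed inverse document frequency (an assumed variant).
        idf = math.log((1 + len(corpus)) / (1 + doc_freq[concept])) + 1
        scores[concept] = tf * idf
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Made-up example documents.
DOC = "Ontology design matters. A knowledge graph links each ontology to search."
CORPUS = [
    DOC,
    "Search precision and recall improved.",
    "Machine learning needs training data.",
]

print(tag(DOC))
print(categorize(DOC, CORPUS))
```

In this toy example the document is tagged with both concepts it mentions, but only the concept that is frequent in the document and rare across the corpus is returned as its category. The other signals Clarke listed, such as document zone relevancy and semantic proximity, are not modeled here.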
Clarke also summarized the pros and cons of rule-based versus machine learning approaches, what the approaches are best suited for, and noted that hybrid approaches are also possible.
The pros of a rules-based approach include rapid development of rules, transparent rules that yield explainable results, and the ability to adapt rapidly to changing taxonomy topics. However, the cons include the need to manually develop rules for each concept and less sophisticated tagging and categorization.
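As a rough illustration of why explicit rules are transparent but manual, the following Python sketch assumes a hypothetical `Rule` structure with trigger terms and veto ("unless") context words. The rule format and the example concepts are invented for this sketch and are not taken from the keynote or from any product.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Rule:
    concept: str                                 # taxonomy concept to assign
    any_of: list                                 # labels or synonyms that trigger the concept
    unless: list = field(default_factory=list)   # context words that veto the match

# Hypothetical hand-written rules, one per concept.
RULES = [
    Rule("Java (programming language)", ["java"], unless=["island", "coffee"]),
    Rule("Taxonomy management", ["taxonomy", "taxonomies"]),
]

def words(text):
    """Lowercased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def apply_rules(text, rules=RULES):
    """Return (concept, matched_term) pairs so each result is explainable."""
    present = words(text)
    hits = []
    for rule in rules:
        if present & set(rule.unless):
            continue  # a disambiguating context word vetoes this rule
        matched = sorted(present & set(rule.any_of))
        if matched:
            hits.append((rule.concept, matched[0]))
    return hits

print(apply_rules("Our taxonomy team writes Java services."))
print(apply_rules("Coffee from the island of Java"))
```

Each hit carries the term that fired it, which is what makes the result explainable, and adding a taxonomy topic only means adding a rule. The cost is also visible: every concept needs its own hand-written entry.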
The pros of machine learning approaches include statistical linguistic analysis that learns by example, no need to manually develop explicit rules, and more sophisticated tagging and categorization. The cons are that gold standards are costly to train, “explainability” is only possible for simple models, there is only indirect control over the results, and changing taxonomy topics require retraining of the ML model.
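For contrast, here is a minimal learn-by-example sketch in Python. Clarke did not name a specific algorithm in this summary, so a small multinomial naive Bayes classifier with add-one smoothing is used purely as a stand-in; the training examples and category names are made up.

```python
import math
import re
from collections import Counter, defaultdict

def tokens(text):
    """Lowercased alphabetic words in the text."""
    return re.findall(r"[a-z]+", text.lower())

def train(examples):
    """examples: list of (text, category) pairs, i.e. a tiny gold standard."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, category in examples:
        doc_counts[category] += 1
        for w in tokens(text):
            word_counts[category][w] += 1
            vocab.add(w)
    return word_counts, doc_counts, vocab

def predict(model, text):
    """Return the category with the highest log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)
        for w in tokens(text):
            if w in vocab:  # ignore words never seen in training
                score += math.log(
                    (word_counts[category][w] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best, best_score = category, score
    return best

# Made-up training examples for two hypothetical categories.
EXAMPLES = [
    ("quarterly revenue and profit grew", "Finance"),
    ("the budget forecast shows profit", "Finance"),
    ("the server crashed after the deploy", "IT"),
    ("reset the password on the server", "IT"),
]

MODEL = train(EXAMPLES)
print(predict(MODEL, "profit forecast for the quarter"))
print(predict(MODEL, "server password reset"))
```

No rules are written by hand here, but the trade-offs Clarke described still show up: the labeled examples have to be produced by someone, and adding or renaming a category means collecting new examples and calling `train` again.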
While rules-based approaches are most suited to smaller taxonomies, business-specific knowledge domains, and rapidly changing taxonomy topics, machine learning approaches are best for large graph datasets, knowledge domains with open data resources, and relatively static taxonomy topics, said Clarke.