Sub-topic and Semantic Sub-structure Extraction via SPLIT: Joint Nonnegative Matrix Factorization (NMF) with Automatic Model Selection

Abstract

Topic modeling is one of the key analytic techniques for organizing and analyzing large text corpora. One approach to topic modeling is the recently introduced SeNMFk, a method based on semantic non-negative matrix factorization (NMF) with automatic model determination (NMFk), in which the text-document matrix and the word-context (co-occurrence) matrix are jointly factorized. The text-document matrix is the term frequency-inverse document frequency (TF-IDF) matrix, and the word-context matrix records the number of times two words co-occur within a pre-determined window of text. Combining the semantic structure of the text with the ability to estimate the number of topics enables a coherent separation of the latent topics and accurate document clustering. This approach, however, identifies only the highest level of topics, that is, the main themes. Many text corpora have a complex structure of sub-topics beyond the main themes. For example, a set of documents can be separated into main topics such as sports, politics, and science, and each of these main topics can be separated further: the sports theme can be split into the sub-topics tennis, soccer, football, and so on. This process can be repeated, expanding the separation until all the sub-topics in the corpus are found. Here, we introduce a hierarchical SeNMFk approach that can extract fine-grained sub-topics and their semantic sub-structures. By applying SeNMFk hierarchically, we break down the main topics and extract previously unknown sub-topics together with the corresponding semantic sub-structures, which can serve as narrow vocabularies, i.e., scientific-jargon seeds for local Named Entity Recognition (NER). We demonstrate our hierarchical SeNMFk method by performing topic modeling on all papers posted on arXiv, which number more than 2 million.
To enhance the semantic clustering in each topic, we also jointly factorize the category-text matrix, whose values represent the TF-IDF of tokens per document category. Here the categories are the arXiv fields of research reported by the authors of each document. Our results show the ability and practicality of our hierarchical SeNMFk to extract meaningful topics and find their semantic sub-structures in large datasets.
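The joint factorization and the hierarchical splitting described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a fixed number of topics k (the automatic model selection of NMFk is omitted), uses plain multiplicative updates with a shared word factor W, reuses the same co-occurrence matrix at every level instead of recomputing it per sub-corpus, and leaves out the category-text matrix. The names `joint_nmf` and `split_topics` are hypothetical.

```python
import numpy as np

def joint_nmf(X, M, k, n_iter=200, eps=1e-9, seed=0):
    """Jointly factorize X (words x docs) ~ W @ H and M (words x words) ~ W @ G.

    W is shared between the two factorizations, so the word co-occurrence
    structure in M influences the topics found in X. Assumption: fixed k and
    Frobenius-norm multiplicative updates (no automatic model selection).
    """
    rng = np.random.default_rng(seed)
    n_words, n_docs = X.shape
    W = rng.random((n_words, k))
    H = rng.random((k, n_docs))
    G = rng.random((k, M.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        G *= (W.T @ M) / (W.T @ W @ G + eps)
        W *= (X @ H.T + M @ G.T) / (W @ (H @ H.T + G @ G.T) + eps)
    return W, H, G

def split_topics(X, M, k, depth, doc_ids=None, min_docs=4):
    """Recursively split documents into topics and sub-topics.

    Each document is assigned to the topic with the largest weight in H,
    and the factorization is repeated on each resulting group of documents.
    Simplification: M is reused unchanged at every level.
    """
    if doc_ids is None:
        doc_ids = np.arange(X.shape[1])
    node = {"docs": doc_ids, "children": []}
    if depth == 0 or len(doc_ids) < max(min_docs, k):
        return node
    W, H, _ = joint_nmf(X[:, doc_ids], M, k)
    labels = H.argmax(axis=0)  # dominant topic per document
    for t in range(k):
        child_docs = doc_ids[labels == t]
        # Skip empty groups and groups that did not actually split.
        if len(child_docs) == 0 or len(child_docs) == len(doc_ids):
            continue
        child = split_topics(X, M, k, depth - 1, child_docs, min_docs)
        # Indices of the highest-weighted words for this topic.
        child["top_words"] = np.argsort(W[:, t])[::-1][:3]
        node["children"].append(child)
    return node

# Toy data: 6 words, 8 documents, two clearly separated word/document blocks.
X_toy = np.zeros((6, 8))
X_toy[:3, :4] = 1.0
X_toy[3:, 4:] = 1.0
M_toy = X_toy @ X_toy.T  # crude stand-in for a word co-occurrence matrix
tree = split_topics(X_toy, M_toy, k=2, depth=2)
```

In this sketch each node of the returned tree holds the document indices of one topic, and each child holds a sub-topic together with the indices of its highest-weighted words, which loosely corresponds to the narrow vocabularies mentioned in the abstract.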

Publication
Presented at the Conference on Data Analysis 2023 (CoDA ’23), Santa Fe, New Mexico, March 7-9, 2023.

Keywords:

topic modeling, non-negative matrix factorization, large data, joint factorization

Citation:

Eren, M.E., Solovyev, N., Barron, R., Bhattarai, M., Boureima, I.D., Rasmussen, K.O., and Alexandrov, B.S. Sub-topic and Semantic Sub-structure Extraction via SPLIT: Joint Nonnegative Matrix Factorization (NMF) with Automatic Model Selection. CoDA ’23: Conference on Data Analysis, March 7-9, 2023, Santa Fe, New Mexico, USA.

BibTeX:

@INPROCEEDINGS{eren_coda_2023,
  author={M. E. {Eren} and N. {Solovyev} and R. {Barron} and M. {Bhattarai} and I.D. {Boureima} and K.O. {Rasmussen} and B. S. {Alexandrov}},
  booktitle={Conference on Data Analysis 2023 (CoDA '23)},
  title={Sub-topic and Semantic Sub-structure Extraction via SPLIT: Joint Nonnegative Matrix Factorization (NMF) with Automatic Model Selection}, 
  year={2023}}
Maksim E. Eren
Scientist

My research interests lie at the intersection of the machine learning and cybersecurity disciplines, with a concentration in tensor decomposition.