SeNMFk-SPLIT: Large Corpora Topic Modeling by Semantic Non-negative Matrix Factorization with Automatic Model Selection

Abstract

As the amount of text data continues to grow, topic modeling is serving an important role in understanding the content hidden by the overwhelming quantity of documents. One popular topic modeling approach is non-negative matrix factorization (NMF), an unsupervised machine learning (ML) method. Recently, Semantic NMF with automatic model selection (SeNMFk) has been proposed as a modification to NMF. In addition to heuristically estimating the number of topics, SeNMFk also incorporates the semantic structure of the text. This is performed by jointly factorizing the term frequency-inverse document frequency (TF-IDF) matrix with the co-occurrence/word-context matrix, the values of which represent the number of times two words co-occur in a predetermined window of the text. In this paper, we introduce a novel distributed method, SeNMFk-SPLIT, for semantic topic extraction suitable for large corpora. In contrast to SeNMFk, our method enables joint factorization over large corpora by decomposing the word-context and term-document matrices separately. We demonstrate the capability of SeNMFk-SPLIT by applying it to the entire artificial intelligence (AI) and ML scientific literature uploaded to arXiv.

Publication
In ACM Symposium on Document Engineering 2022 (DocEng ’22), 2022

Keywords:

non-negative matrix factorization, topic modeling, document organization, model selection, semantic

Citation:

Maksim E. Eren, Nick Solovyev, Manish Bhattarai, Kim Rasmussen, Charles Nicholas, and Boian S. Alexandrov. 2022. SeNMFk-SPLIT: Large Corpora Topic Modeling by Semantic Non-negative Matrix Factorization with Automatic Model Selection. In ACM Symposium on Document Engineering 2022 (DocEng ’22), September 20-23, 2022, San Jose, CA, USA. ACM, New York, NY, USA, 4 pages.

BibTeX:

@inproceedings{10.1145/3558100.3563844,
  author    = {Eren, Maksim E. and Solovyev, Nick and Bhattarai, Manish and Rasmussen, Kim \O{}. and Nicholas, Charles and Alexandrov, Boian S.},
  title     = {SeNMFk-SPLIT: Large Corpora Topic Modeling by Semantic Non-Negative Matrix Factorization with Automatic Model Selection},
  year      = {2022},
  isbn      = {9781450395441},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3558100.3563844},
  doi       = {10.1145/3558100.3563844},
  booktitle = {Proceedings of the 22nd ACM Symposium on Document Engineering},
  articleno = {10},
  numpages  = {4},
  keywords  = {topic modeling, model selection, semantic, document organization, non-negative matrix factorization},
  location  = {San Jose, California},
  series    = {DocEng '22}
}

Maksim E. Eren
Scientist

My research interests lie at the intersection of machine learning and cybersecurity, with a concentration in tensor decomposition.