Topic Modeling and Link-Prediction for Material Property Discovery

Ryan Barron, Maksim E. Eren, Valentin Stanev, Cynthia Matuszek, Boian S. Alexandrov

July 2025

Abstract

Link prediction is a key network analysis technique that infers missing or future relations between nodes in a graph, based on observed patterns of connectivity. Scientific literature networks and knowledge graphs are typically large, sparse, and noisy, and often contain missing links (potential but unobserved connections) between concepts, entities, or methods. Here, we present an AI-driven hierarchical link prediction framework that integrates matrix factorization and human-in-the-loop visualization to infer hidden associations and steer discovery in complex material domains. Our method combines Hierarchical Nonnegative Matrix Factorization (HNMFk) and Boolean matrix factorization (BNMFk) with automatic model selection, as well as Logistic matrix factorization (LMF), which we use to construct a three-level topic tree from a 46,862-document corpus focused on 73 transition-metal dichalcogenides (TMDs). This class of materials has been studied in a variety of physics fields and has a multitude of current and potential applications. An ensemble BNMFk + LMF approach fuses discrete interpretability with probabilistic scoring. The resulting HNMFk clusters map each material onto coherent research themes, such as superconductivity, energy storage, and tribology, and highlight missing or weakly connected links between topics and materials, suggesting novel hypotheses for cross-disciplinary exploration. We validate our method by removing publications about superconductivity in well-known superconductors, and demonstrate that the model correctly predicts their association with the superconducting TMD clusters. This highlights the ability of the method to find hidden connections in a graph of material-to-latent-topic associations built from scientific literature. This is especially useful when examining a diverse corpus of scientific documents covering the same class of phenomena or materials but originating from distinct communities and perspectives.
The links inferred by our method, which suggest new hypotheses, are exposed through an interactive Streamlit dashboard designed for human-in-the-loop scientific discovery.
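The validation idea in the abstract (hide a known material-topic link, then check that the model recovers it) can be illustrated with a minimal sketch of logistic matrix factorization. This is not the authors' implementation: the toy 6x4 matrix, the function name `logistic_mf`, and all hyperparameters are assumptions chosen for illustration, and plain gradient descent stands in for whatever optimizer the paper uses.

```python
import numpy as np

def logistic_mf(X, mask, rank=2, lr=0.05, reg=0.01, epochs=2000, seed=0):
    """Fit sigmoid(W @ H) to the observed entries of a binary matrix X.

    mask[i, j] == 1 marks an observed entry; hidden entries do not
    contribute to the loss. Hypothetical helper, for illustration only.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = 0.1 * rng.standard_normal((n, rank))
    H = 0.1 * rng.standard_normal((rank, m))
    for _ in range(epochs):
        P = 1.0 / (1.0 + np.exp(-(W @ H)))   # predicted link probabilities
        G = mask * (P - X)                   # log-loss gradient w.r.t. logits
        W_grad = G @ H.T + reg * W           # L2-regularized gradients
        H_grad = W.T @ G + reg * H
        W -= lr * W_grad
        H -= lr * H_grad
    return 1.0 / (1.0 + np.exp(-(W @ H)))

# Toy material-by-topic matrix (assumed data): rows 0-2 are linked to
# topics 0 and 1, rows 3-5 are linked to topics 2 and 3.
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# Hide the known link between material 0 and topic 0, mimicking the
# removal of publications that connect a material to a topic.
mask = np.ones_like(X)
mask[0, 0] = 0.0

scores = logistic_mf(X, mask)
print("score for hidden link (0, 0):", round(float(scores[0, 0]), 3))
print("score for observed non-link (0, 2):", round(float(scores[0, 2]), 3))
```

In this toy setting the hidden entry is recovered because material 0 shares topic 1 with materials 1 and 2, which are both linked to topic 0. The sketch covers only the probabilistic scoring step and none of the hierarchical topic modeling, model selection, or BNMFk ensemble.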

Type

Conference paper

Publication

In 25th ACM Symposium on Document Engineering (DocEng), 2025

Keywords:

NMF, NMFk, LMF, Link Prediction, Matrix completion

Citation:

Barron, R., Eren, M.E., Stanev, V., Matuszek, C., and Alexandrov, B. Topic Modeling and Link-Prediction for Material Property Discovery. In DocEng '25: 25th ACM Symposium on Document Engineering, Sep. 02-05, 2025, University of Nottingham, Nottingham, UK. 4 pages.

BibTeX:

@article{Barron2025DocEng,
  title={Topic Modeling and Link-Prediction for Material Property Discovery},
  author={Ryan C. Barron and Maksim E. Eren and Valentin Stanev and Cynthia Matuszek and Boian S. Alexandrov},
  journal={ArXiv},
  year={2025},
  volume={abs/2507.06139},
}

