Bridging Legal Knowledge and AI: Retrieval-Augmented Generation with Vector Stores, Knowledge Graphs, and Hierarchical Non-negative Matrix Factorization

Abstract

Agentic Generative AI, powered by Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG), Knowledge Graphs (KGs), and Vector Stores (VSs), represents a transformative technology applicable to specialized domains such as legal systems, research, recommender systems, cybersecurity, and global security, including proliferation research. This technology excels at inferring relationships within vast unstructured or semi-structured datasets. The legal domain considered here consists of extensive, interrelated, semi-structured knowledge systems with complex relations, spanning constitutions, statutes, regulations, and case law. Extracting insights and navigating the intricate networks of legal documents and their relations is crucial for effective legal research. Here, we introduce a generative AI system that integrates RAG, VS, and KG, constructed via Hierarchical Non-Negative Matrix Factorization (HNMFk), to enhance legal information retrieval and AI reasoning and minimize hallucinations. In the legal system, these technologies empower AI agents to identify and analyze complex connections among cases, statutes, and legal precedents, uncovering hidden relationships and predicting legal trends. These are challenging tasks that are essential for ensuring justice and improving operational efficiency. Our system employs web scraping techniques to systematically collect legal texts, such as statutes, constitutional provisions, and case law, from publicly accessible platforms like Justia. It bridges the gap between traditional keyword-based searches and contextual understanding by leveraging advanced semantic representations, hierarchical relationships, and latent topic discovery. This framework supports legal document clustering, summarization, and cross-referencing, providing scalable, interpretable, and accurate retrieval for semi-structured data while advancing computational law and AI.

Publication
In 20th International Conference on Artificial Intelligence and Law (ICAIL), 2025

Keywords:

law, legal knowledge, nmf, topic labeling, llm, chain of thought, prompt tuning, information retrieval

Citation:

Ryan Calvin Barron, Maksim Eren, Olga Serafimova, Cynthia Matuszek, and Boian Alexandrov. 2026. Bridging Legal Knowledge and AI: Retrieval-Augmented Generation with Vector Stores, Knowledge Graphs, and Hierarchical Non-negative Matrix Factorization. In Proceedings of the Twentieth International Conference on Artificial Intelligence and Law (ICAIL '25). Association for Computing Machinery, New York, NY, USA, 51–60. https://doi.org/10.1145/3769126.3769215

BibTeX:

@inproceedings{10.1145/3769126.3769215,
author = {Calvin Barron, Ryan and Eren, Maksim and Serafimova, Olga and Matuszek, Cynthia and Alexandrov, Boian},
title = {Bridging Legal Knowledge and AI: Retrieval-Augmented Generation with Vector Stores, Knowledge Graphs, and Hierarchical Non-negative Matrix Factorization},
year = {2026},
isbn = {9798400719394},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3769126.3769215},
doi = {10.1145/3769126.3769215},
abstract = {Agentic Generative AI, powered by Large Language Models (LLMs) and enhanced with Retrieval-Augmented Generation (RAG), Knowledge Graphs (KGs), and Vector Stores (VSs), represents a transformative technology applicable across specialized domains such as legal systems, research, recommender systems, cybersecurity, and global security, including proliferation research. This technology excels at inferring relationships within vast unstructured or semi-structured datasets. The legal domain we focus on here comprises inherently complex data characterized by extensive, interrelated, and semi-structured knowledge systems with complex relations. It comprises constitutions, statutes, regulations, and case law. Extracting insights and navigating the intricate networks of legal documents and their relations is crucial for effective legal research and decision-making. Here, we introduce a generative AI system, a jurisdiction-specific legal information retrieval that integrates RAG, VS, and KG, constructed via Hierarchical Non-Negative Matrix Factorization (HNMFk), to enhance information retrieval and AI reasoning and minimize hallucinations. In the legal system, these technologies empower AI agents to identify and analyze complex connections among cases, statutes, and legal precedents, uncovering hidden relationships and predicting legal trends—challenging tasks essential for ensuring justice and improving operational efficiency. Our system employs web scraping techniques to systematically collect legal texts, such as statutes, constitutional provisions, and case law, from publicly accessible platforms like Justia. It bridges the gap between traditional keyword-based searches and contextual understanding by leveraging advanced semantic representations, hierarchical relationships, and latent topic discovery. This approach is demonstrated in legal document clustering, summarization, and cross-referencing tasks. 
The framework marks a significant step toward augmenting legal research with scalable, interpretable, and accurate retrieval methods for semi-structured data, advancing the intersection of computational law and artificial intelligence.},
booktitle = {Proceedings of the Twentieth International Conference on Artificial Intelligence and Law},
pages = {51--60},
numpages = {10},
keywords = {law, legal knowledge, nmf, topic labeling, llm, chain of thought, prompt tuning, information retrieval},
series = {ICAIL '25}
}
Maksim E. Eren
Scientist

Maksim E. Eren is a Scientist at Los Alamos National Laboratory, specializing in machine learning and artificial intelligence for large-scale data science applications.