HEAL: Hierarchical Embedding Alignment Loss for Improved Retrieval and Representation Learning

Abstract

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external document retrieval to provide domain-specific or up-to-date knowledge. The effectiveness of RAG depends on the relevance of retrieved documents, which is influenced by the semantic alignment of embeddings with the domain’s specialized content. Although full fine-tuning can align language models to specific domains, it is computationally intensive and demands substantial data. This paper introduces Hierarchical Embedding Alignment Loss (HEAL), a novel method that leverages hierarchical fuzzy clustering with matrix factorization within contrastive learning to efficiently align LLM embeddings with domain-specific content. HEAL computes level/depth-wise contrastive losses and incorporates hierarchical penalties to align embeddings with the underlying relationships in label hierarchies. This approach enhances retrieval relevance and document classification, effectively reducing hallucinations in LLM outputs. In our experiments, we benchmark and evaluate HEAL across diverse domains, including Healthcare, Material Science, Cyber-security, and Applied Maths.
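
The abstract describes HEAL as a combination of level/depth-wise contrastive losses weighted by hierarchical penalties. As a rough illustration of that idea only (not the authors' released implementation), the Python sketch below computes a supervised contrastive loss at each hierarchy level and combines the levels with per-depth penalty weights; the function name, the per-level cluster labels, and the weight values are assumptions made for the example.

import torch
import torch.nn.functional as F

def hierarchical_contrastive_loss(embeddings, level_labels, level_weights, temperature=0.1):
    # Sketch only: level_labels[l] holds hypothetical cluster ids for each
    # document at hierarchy depth l (e.g., from hierarchical clustering of a
    # matrix factorization), and level_weights[l] is the penalty for that depth.
    z = F.normalize(embeddings, dim=1)                   # (N, d) unit-norm embeddings
    sim = z @ z.t() / temperature                        # pairwise scaled cosine similarity
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    logits = sim.masked_fill(self_mask, float('-inf'))   # exclude self-pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

    total = 0.0
    for labels, weight in zip(level_labels, level_weights):
        labels = labels.view(-1, 1)
        pos_mask = (labels == labels.t()) & ~self_mask   # same cluster at this depth
        pos_counts = pos_mask.sum(dim=1).clamp(min=1)
        # mean log-probability of positives per anchor (non-positives zeroed out)
        per_anchor = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_counts
        has_pos = pos_mask.any(dim=1)
        if has_pos.any():
            total = total + weight * per_anchor[has_pos].mean()
    return total

# Example usage with random data: 8 documents, 2 hierarchy levels.
emb = torch.randn(8, 32)
labels_coarse = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])   # parent clusters
labels_fine = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])      # child clusters
loss = hierarchical_contrastive_loss(emb, [labels_coarse, labels_fine], [0.5, 1.0])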

Publication
In the 13th International Conference on Learning Representations, Workshop on Scaling Self-Improving Foundation Models without Human Supervision (ICLR 2025 SSI-FM)

Keywords:

Contrastive Learning, Hierarchical Labels, Retrieval-Augmented Generation, Embedding Models, Document Clustering

Citation:

Bhattarai, M., Barron, R., Eren, M.E., Vu, M., Grantcharov, V., Boureima, I., Stanev, V., Matuszek, C., Valtchinov, V., Rasmussen, K. and Alexandrov, B. HEAL: Hierarchical Embedding Alignment Loss for Improved Retrieval and Representation Learning. Under review at the ICLR '25 SSI-FM Workshop: 13th International Conference on Learning Representations, Workshop on Scaling Self-Improving Foundation Models without Human Supervision, Apr. 21, 2025, Singapore. 10 pages.

BibTeX:

@inproceedings{bhattarai-etal-2025-heal,
    title = "{HEAL}: Hierarchical Embedding Alignment Loss for Improved Retrieval and Representation Learning",
    author = "Bhattarai, Manish  and
      Barron, Ryan  and
      Eren, Maksim E.  and
      Vu, Minh N.  and
      Grantcharov, Vesselin  and
      Boureima, Ismael  and
      Stanev, Valentin  and
      Matuszek, Cynthia  and
      Valtchinov, Vladimir I  and
      Rasmussen, Kim  and
      Alexandrov, Boian S.",
    editor = "Shi, Weijia  and
      Yu, Wenhao  and
      Asai, Akari  and
      Jiang, Meng  and
      Durrett, Greg  and
      Hajishirzi, Hannaneh  and
      Zettlemoyer, Luke",
    booktitle = "Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing",
    month = may,
    year = "2025",
    address = "Albuquerque, New Mexico, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.knowledgenlp-1.19/",
    doi = "10.18653/v1/2025.knowledgenlp-1.19",
    pages = "205--214",
    ISBN = "979-8-89176-229-9",
    abstract = "Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external document retrieval to provide domain-specific or up-to-date knowledge. The effectiveness of RAG depends on the relevance of retrieved documents, which is influenced by the semantic alignment of embeddings with the domain{'}s specialized content. Although full fine-tuning can align language models to specific domains, it is computationally intensive and demands substantial data. This paper introduces Hierarchical Embedding Alignment Loss (HEAL), a novel method that leverages hierarchical fuzzy clustering with matrix factorization within contrastive learning to efficiently align LLM embeddings with domain-specific content. HEAL computes level/depth-wise contrastive losses and incorporates hierarchical penalties to align embeddings with the underlying relationships in label hierarchies. This approach enhances retrieval relevance and document classification, effectively reducing hallucinations in LLM outputs. In our experiments, we benchmark and evaluate HEAL across diverse domains, including Healthcare, Material Science, Cyber-security, and Applied Maths."
}
Maksim E. Eren
Scientist

Maksim E. Eren is a Scientist at Los Alamos National Laboratory, specializing in machine learning and artificial intelligence for large-scale data science applications.