Book Recommendation with Retrieval Augmented Generation — Part I

Jeffery chiang
5 min read · May 30, 2024



In the ever-evolving landscape of book discovery, traditional recommendation systems often fall short. Large language models (LLMs) offer a promising new approach. By leveraging their ability to process vast amounts of text data, LLMs can delve into the intricacies of different genres, writing styles, and reader preferences. This newfound depth holds the potential to revolutionize book recommendations, leading readers not just to familiar tropes, but to truly personalized literary journeys.

Figure 1. Retrieval-Augmented Generation (image taken from the LangChain docs: https://python.langchain.com/v0.1/docs/use_cases/question_answering)

One of the exciting advancements in LLM-powered book recommendation systems is the integration of Retrieval-Augmented Generation (RAG). RAG functions as a sophisticated information retrieval tool for the LLM. By efficiently searching vast datasets of book information, RAG identifies titles with similar content and stylistic elements. This retrieved data empowers the LLM to move beyond simple similarity-based recommendations. RAG allows the LLM to grasp the underlying themes and narrative approaches that resonated with the user, enabling it to generate highly personalized suggestions that cater to the user’s specific literary preferences.

In this post, we will demonstrate how to build a simple vector store and retrieve documents that are semantically relevant to a query.

Data Set

We will use the Best Books (10k Multi-Genre) dataset from Kaggle.

Source: https://www.kaggle.com/datasets/ishikajohari/best-books-10k-multi-genre-data/

Setup

Vector Store: FAISS

Embedding model: bert-base-uncased from Hugging Face

GitHub Code

Link: https://github.com/chiang9/Medium_blog/tree/main/llm_bookreco

Let’s Get Started

You will first need to get an API key from Hugging Face, which you can find here.
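For instance, one way to make the token available to LangChain's Hugging Face integrations is through an environment variable (the token value below is a placeholder):

```python
import os

# Placeholder token -- substitute your own key, or better, load it
# from a secrets manager. This is the environment variable that
# LangChain's Hugging Face integrations read.
os.environ["HUGGINGFACEHUB_API_TOKEN"] = "hf_..."
```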

Next, we can start creating the vector embeddings from the CSV file.

The embedding process can take a while. In this example, we will only sample the top 1,000 documents.
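The original notebook cells are not reproduced on this page, so here is a minimal sketch of how the indexing step could look with LangChain and FAISS; the CSV file name and the Book/Description/Genres column names are assumptions based on the Kaggle dataset layout.

```python
import pandas as pd
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

# Load the Kaggle CSV (file and column names assumed) and sample
# the top 1000 rows to keep embedding time manageable.
df = pd.read_csv("goodreads_data.csv").head(1000)

# Wrap each description in a Document, keeping the title and
# genres as metadata so we can display them after retrieval.
docs = [
    Document(
        page_content=str(row["Description"]),
        metadata={"title": row["Book"], "genres": row["Genres"]},
    )
    for _, row in df.iterrows()
]

# Embed with bert-base-uncased and index the vectors in FAISS.
embeddings = HuggingFaceEmbeddings(model_name="bert-base-uncased")
docsearch = FAISS.from_documents(docs, embeddings)

# Persist the index so it can be reloaded without re-embedding.
docsearch.save_local("faiss_books_index")
```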

CPU times: total: 58min 47s
Wall time: 6min 3s

LangChain provides various abstractions for data loaders and vector stores; documentation can be found here. With LangChain, we can save, load, and merge the vector embeddings.
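To illustrate, a saved index can be reloaded and queried like this (the query string below is illustrative, not necessarily the one that produced the results that follow):

```python
# Reload the persisted index. Recent LangChain versions require
# explicitly opting in to pickle deserialization for local loads.
docsearch = FAISS.load_local(
    "faiss_books_index",
    embeddings,
    allow_dangerous_deserialization=True,
)

query = "a heartwarming coming-of-age story for young readers"
for doc in docsearch.similarity_search(query, k=4):
    print(f"Book: {doc.metadata['title']}")
    print(f"Genres: {doc.metadata['genres']}\n")
```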

Book: Eleanor & Park
Genres: ['Young Adult', 'Romance', 'Contemporary', 'Fiction', 'Realistic Fiction', 'Audiobook', 'Teen']

Book: Are You There God? It's Me, Margaret
Genres: ['Young Adult', 'Fiction', 'Childrens', 'Classics', 'Middle Grade', 'Coming Of Age', 'Realistic Fiction']

Book: The Paper Bag Princess
Genres: ['Picture Books', 'Childrens', 'Fantasy', 'Fiction', 'Dragons', 'Classics', 'Fairy Tales']

Book: The Velveteen Rabbit
Genres: ['Classics', 'Childrens', 'Fiction', 'Picture Books', 'Fantasy', 'Animals', 'Young Adult']

CPU times: total: 578 ms
Wall time: 69 ms

NLP Data Preprocessing

Performing NLP data cleaning for texts when creating a vector store is generally recommended, but the extent of cleaning depends on your specific use case.

  • Improved Vector Representation: Raw text with inconsistencies like punctuation, special characters, and typos can lead to noisy and inaccurate vector representations. Cleaning helps the model focus on the core meaning of the text.
  • Enhanced Similarity Search: When searching for similar vectors in your store, a cleaner representation allows for more accurate comparisons. Imagine searching for documents about “cooking” — you wouldn’t want results dominated by entries with typos like “coking”.
  • Reduced Storage Requirements: Removing unnecessary elements like stop words (common words like “the” or “a”) can help reduce the size of your vector representations, leading to more efficient storage.

In this section, we are going to perform some Natural Language Processing (NLP) preprocessing on the description field, and then we can compare the results.
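The exact cleaning steps are not shown on this page; a reasonable sketch, assuming lowercasing plus punctuation removal for the "cleaning" pass and NLTK stop-word removal for the "preprocessing" pass, is:

```python
import re

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

def clean_text(text: str) -> str:
    # Lowercase, replace punctuation and special characters with
    # spaces, then collapse repeated whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def remove_stopwords(text: str) -> str:
    # Drop common stop words such as "the" and "a".
    return " ".join(w for w in text.split() if w not in stop_words)

df["clean_desc"] = df["Description"].astype(str).apply(clean_text)
df["processed_desc"] = df["clean_desc"].apply(remove_stopwords)

print("Average length of description before and after data cleaning")
print(df["Description"].str.len().mean(), df["clean_desc"].str.len().mean())
print("Average length of description before and after preprocessing")
print(df["clean_desc"].str.len().mean(), df["processed_desc"].str.len().mean())
```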

Average length of description before and after data cleaning
956.1368537740602, 915.5401592260405
Average length of description before and after preprocessing
915.5401592260405, 900.0581477375794

Next, we perform the same steps on the preprocessed data as we previously did for the original data to generate the docsearch. Notice that the embedding processing time is shorter.
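A sketch of that step, reusing the same metadata and embedding model as before:

```python
# Rebuild the documents from the preprocessed descriptions and
# index them in a second FAISS store.
processed_docs = [
    Document(
        page_content=row["processed_desc"],
        metadata={"title": row["Book"], "genres": row["Genres"]},
    )
    for _, row in df.iterrows()
]
docsearch_processed = FAISS.from_documents(processed_docs, embeddings)

# Query the new index. For a strictly fair comparison, the query
# could be run through the same cleaning pipeline first.
for doc in docsearch_processed.similarity_search(query, k=4):
    print(f"Book: {doc.metadata['title']}")
    print(f"Genres: {doc.metadata['genres']}\n")
```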

CPU times: total: 44min 29s
Wall time: 4min 35s
Book: Inkheart (Inkworld, #1)
Genres: ['Fantasy', 'Young Adult', 'Fiction', 'Middle Grade', 'Childrens', 'Adventure', 'Magic']

Book: Artemis Fowl (Artemis Fowl, #1)
Genres: ['Fantasy', 'Young Adult', 'Fiction', 'Middle Grade', 'Childrens', 'Science Fiction', 'Adventure']

Book: Alexander and the Terrible, Horrible, No Good, Very Bad Day
Genres: ['Picture Books', 'Childrens', 'Fiction', 'Classics', 'Realistic Fiction', 'Humor', 'Kids']

Book: Are You There God? It's Me, Margaret
Genres: ['Young Adult', 'Fiction', 'Childrens', 'Classics', 'Middle Grade', 'Coming Of Age', 'Realistic Fiction']

CPU times: total: 469 ms
Wall time: 43 ms

Let’s compare the two embeddings created from the original descriptions and the preprocessed descriptions.
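One way to run the comparison is with similarity_search_with_score, which returns FAISS's raw L2 distance alongside each document, so lower scores mean closer matches:

```python
for name, store in (
    ("original", docsearch),
    ("preprocessed", docsearch_processed),
):
    print(f"{name} docsearch:")
    for doc, score in store.similarity_search_with_score(query, k=4):
        print(f"Book: {doc.metadata['title']}")
        print(f"Genres: {doc.metadata['genres']}")
        print(f"score = {score}\n")
```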

original docsearch: 
Book: Eleanor & Park
Genres: ['Young Adult', 'Romance', 'Contemporary', 'Fiction', 'Realistic Fiction', 'Audiobook', 'Teen']
score = 63.062381744384766


Book: Are You There God? It's Me, Margaret
Genres: ['Young Adult', 'Fiction', 'Childrens', 'Classics', 'Middle Grade', 'Coming Of Age', 'Realistic Fiction']
score = 66.79731750488281


Book: The Paper Bag Princess
Genres: ['Picture Books', 'Childrens', 'Fantasy', 'Fiction', 'Dragons', 'Classics', 'Fairy Tales']
score = 67.06227111816406


Book: Looking for Alaska
Genres: ['Young Adult', 'Fiction', 'Contemporary', 'Romance', 'Realistic Fiction', 'Coming Of Age', 'Teen']
score = 67.72935485839844


----------------------
preprocessed docsearch:
Book: Inkheart (Inkworld, #1)
Genres: ['Fantasy', 'Young Adult', 'Fiction', 'Middle Grade', 'Childrens', 'Adventure', 'Magic']
score = 65.75355529785156


Book: Charlie and the Chocolate Factory (Charlie Bucket, #1)
Genres: ['Childrens', 'Fiction', 'Fantasy', 'Classics', 'Young Adult', 'Middle Grade', 'Humor']
score = 68.92379760742188


Book: The Sea of Monsters (Percy Jackson and the Olympians, #2)
Genres: ['Fantasy', 'Young Adult', 'Mythology', 'Fiction', 'Middle Grade', 'Adventure', 'Greek Mythology']
score = 69.14126586914062


Book: The Magician's Nephew (Chronicles of Narnia, #6)
Genres: ['Fantasy', 'Classics', 'Fiction', 'Young Adult', 'Childrens', 'Middle Grade', 'Adventure']
score = 70.03553771972656

Conclusion

Document retrieval is an essential component of RAG, and a vector store provides the functionality to perform semantic search efficiently. LangChain is a powerful abstraction layer that helps us build the vector store and retrieve relevant documents, and it also provides the functionality to communicate with LLMs.

Different embedding models and similarity metrics will give different results. We need to examine the business needs and tune the embedding model so that it retrieves the documents most relevant to a given query.

In the next part, we will communicate with an LLM to complete the RAG book recommendation system.

Thank you for reading, and have a great day!
