Book Recommendation with Retrieval Augmented Generation — Part I
In the ever-evolving landscape of book discovery, traditional recommendation systems often fall short. Large language models (LLMs) offer a promising new approach. By leveraging their ability to process vast amounts of text data, LLMs can delve into the intricacies of different genres, writing styles, and reader preferences. This newfound depth holds the potential to revolutionize book recommendations, leading readers not just to familiar tropes, but to truly personalized literary journeys.
One of the exciting advancements in LLM-powered book recommendation systems is the integration of Retrieval-Augmented Generation (RAG). RAG functions as a sophisticated information retrieval tool for the LLM. By efficiently searching vast datasets of book information, RAG identifies titles with similar content and stylistic elements. This retrieved data empowers the LLM to move beyond simple similarity-based recommendations. RAG allows the LLM to grasp the underlying themes and narrative approaches that resonated with the user, enabling it to generate highly personalized suggestions that cater to the user’s specific literary preferences.
In this post, we will demonstrate how to build a simple vector store and retrieve semantically relevant documents.
Data Set
We will use a dataset from Kaggle.
Source: https://www.kaggle.com/datasets/ishikajohari/best-books-10k-multi-genre-data/
Setup
Vector Store: FAISS
Embedding model: bert-base-uncased from Hugging Face
Github Code
Link: https://github.com/chiang9/Medium_blog/tree/main/llm_bookreco
Let’s Get Started
First, you will need an API key from Hugging Face, which you can create in your account settings.
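Before embedding, we prepare a cleaned CSV from the Kaggle download. This is a minimal sketch; the file name `goodreads_data.csv`, the output name `books_clean.csv`, and the `Book`/`Description` column names are assumptions based on the dataset, so adjust them to match your copy.

```python
import os

import pandas as pd


def prepare_books(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows without a description, since they cannot be embedded."""
    return df.dropna(subset=["Description"]).reset_index(drop=True)


if __name__ == "__main__" and os.path.exists("goodreads_data.csv"):
    books = prepare_books(pd.read_csv("goodreads_data.csv"))
    books.to_csv("books_clean.csv", index=False)
```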
Next, we can create the vector embeddings from the CSV file we just prepared.
The embedding process might take a while. In this example, we will only embed the first 1,000 documents.
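A sketch of the embedding step, assuming the cleaned CSV from above and LangChain's community integrations for Hugging Face embeddings and FAISS (package and class names may differ across LangChain versions):

```python
import os

import pandas as pd


def to_texts_and_metadata(df: pd.DataFrame, n: int = 1000):
    """Take the first n rows: the description is the text to embed;
    the title and genres are kept as metadata."""
    rows = df.head(n)
    texts = rows["Description"].tolist()
    metas = [{"title": b, "genres": g}
             for b, g in zip(rows["Book"], rows["Genres"])]
    return texts, metas


if __name__ == "__main__" and os.path.exists("books_clean.csv"):
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import FAISS

    texts, metas = to_texts_and_metadata(pd.read_csv("books_clean.csv"))
    embeddings = HuggingFaceEmbeddings(model_name="bert-base-uncased")
    docsearch = FAISS.from_texts(texts, embeddings, metadatas=metas)
    docsearch.save_local("faiss_original")  # persist the index to disk
```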
CPU times: total: 58min 47s
Wall time: 6min 3s
LangChain provides a variety of abstractions for data loaders and vector stores; see the documentation for details. With LangChain, we can save, load, and merge vector embeddings.
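For example, the index saved earlier can be reloaded and queried. The query string here is purely illustrative, and the `allow_dangerous_deserialization` flag is required by newer LangChain versions when loading a pickled local index:

```python
import os


def format_hits(docs) -> str:
    """Render search hits in the style shown below."""
    lines = []
    for doc in docs:
        lines.append(f"Book: {doc.metadata['title']}")
        lines.append(f"Genres: {doc.metadata['genres']}")
    return "\n".join(lines)


if __name__ == "__main__" and os.path.exists("faiss_original"):
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import FAISS

    embeddings = HuggingFaceEmbeddings(model_name="bert-base-uncased")
    docsearch = FAISS.load_local(
        "faiss_original", embeddings, allow_dangerous_deserialization=True
    )
    # Indexes built with the same embedding model can also be combined:
    # docsearch.merge_from(other_docsearch)
    query = "a coming-of-age story about young love"  # illustrative query
    print(format_hits(docsearch.similarity_search(query, k=4)))
```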
Book: Eleanor & Park
Genres: ['Young Adult', 'Romance', 'Contemporary', 'Fiction', 'Realistic Fiction', 'Audiobook', 'Teen']
Book: Are You There God? It's Me, Margaret
Genres: ['Young Adult', 'Fiction', 'Childrens', 'Classics', 'Middle Grade', 'Coming Of Age', 'Realistic Fiction']
Book: The Paper Bag Princess
Genres: ['Picture Books', 'Childrens', 'Fantasy', 'Fiction', 'Dragons', 'Classics', 'Fairy Tales']
Book: The Velveteen Rabbit
Genres: ['Classics', 'Childrens', 'Fiction', 'Picture Books', 'Fantasy', 'Animals', 'Young Adult']
CPU times: total: 578 ms
Wall time: 69 ms
NLP Data Preprocessing
Performing NLP data cleaning for texts when creating a vector store is generally recommended, but the extent of cleaning depends on your specific use case.
- Improved Vector Representation: Raw text with inconsistencies like punctuation, special characters, and typos can lead to noisy and inaccurate vector representations. Cleaning helps the model focus on the core meaning of the text.
- Enhanced Similarity Search: When searching for similar vectors in your store, a cleaner representation allows for more accurate comparisons. Imagine searching for documents about “cooking” — you wouldn’t want results dominated by entries with typos like “coking”.
- Reduced Storage Requirements: Removing unnecessary elements like stop words (common words like “the” or “a”) can help reduce the size of your vector representations, leading to more efficient storage.
In this section, we are going to perform some Natural Language Processing (NLP) data preprocessing on the book descriptions, and then compare the results.
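A minimal cleaning pipeline might look like the following. The stop-word list is a toy illustration; a real pipeline might use NLTK's list instead.

```python
import re
import string

# Toy stop-word list for illustration; in practice you might use NLTK's.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}


def clean_text(text: str) -> str:
    """Data cleaning: lowercase, strip punctuation, collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def preprocess(text: str) -> str:
    """Cleaning plus stop-word removal."""
    return " ".join(w for w in clean_text(text).split()
                    if w not in STOP_WORDS)


print(preprocess("The Velveteen Rabbit is a story of love!"))
# → velveteen rabbit story love
```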
Average length of description before and after data cleaning
956.1368537740602, 915.5401592260405
Average length of description before and after preprocessing
915.5401592260405, 900.0581477375794
Next, we repeat the same embedding step on the preprocessed data to generate the docsearch. Notice that the embedding time is shorter, since the cleaned descriptions are shorter.
CPU times: total: 44min 29s
Wall time: 4min 35s
Book: Inkheart (Inkworld, #1)
Genres: ['Fantasy', 'Young Adult', 'Fiction', 'Middle Grade', 'Childrens', 'Adventure', 'Magic']
Book: Artemis Fowl (Artemis Fowl, #1)
Genres: ['Fantasy', 'Young Adult', 'Fiction', 'Middle Grade', 'Childrens', 'Science Fiction', 'Adventure']
Book: Alexander and the Terrible, Horrible, No Good, Very Bad Day
Genres: ['Picture Books', 'Childrens', 'Fiction', 'Classics', 'Realistic Fiction', 'Humor', 'Kids']
Book: Are You There God? It's Me, Margaret
Genres: ['Young Adult', 'Fiction', 'Childrens', 'Classics', 'Middle Grade', 'Coming Of Age', 'Realistic Fiction']
CPU times: total: 469 ms
Wall time: 43 ms
Let’s compare the two embeddings created from the original and the preprocessed descriptions. Note that the scores below are L2 distances, so a lower score means a closer match.
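One way to run this side-by-side comparison, assuming both indexes were saved locally under the names used earlier (the query string is again illustrative):

```python
import os


def compare(stores: dict, query: str, k: int = 4) -> list:
    """Collect top-k hits with scores for each named vector store.
    FAISS returns L2 distances, so lower scores mean more similar."""
    lines = []
    for name, store in stores.items():
        lines.append(f"{name}:")
        for doc, score in store.similarity_search_with_score(query, k=k):
            lines.append(f"Book: {doc.metadata['title']}")
            lines.append(f"Genres: {doc.metadata['genres']}")
            lines.append(f"score = {score}")
        lines.append("----------------------")
    return lines


if __name__ == "__main__" and os.path.exists("faiss_original"):
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import FAISS

    embeddings = HuggingFaceEmbeddings(model_name="bert-base-uncased")
    stores = {
        name: FAISS.load_local(path, embeddings,
                               allow_dangerous_deserialization=True)
        for name, path in [("original docsearch", "faiss_original"),
                           ("preprocessed docsearch", "faiss_preprocessed")]
    }
    query = "a coming-of-age story about young love"
    print("\n".join(compare(stores, query)))
```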
original docsearch:
Book: Eleanor & Park
Genres: ['Young Adult', 'Romance', 'Contemporary', 'Fiction', 'Realistic Fiction', 'Audiobook', 'Teen']
score = 63.062381744384766
Book: Are You There God? It's Me, Margaret
Genres: ['Young Adult', 'Fiction', 'Childrens', 'Classics', 'Middle Grade', 'Coming Of Age', 'Realistic Fiction']
score = 66.79731750488281
Book: The Paper Bag Princess
Genres: ['Picture Books', 'Childrens', 'Fantasy', 'Fiction', 'Dragons', 'Classics', 'Fairy Tales']
score = 67.06227111816406
Book: Looking for Alaska
Genres: ['Young Adult', 'Fiction', 'Contemporary', 'Romance', 'Realistic Fiction', 'Coming Of Age', 'Teen']
score = 67.72935485839844
----------------------
preprocessed docsearch:
Book: Inkheart (Inkworld, #1)
Genres: ['Fantasy', 'Young Adult', 'Fiction', 'Middle Grade', 'Childrens', 'Adventure', 'Magic']
score = 65.75355529785156
Book: Charlie and the Chocolate Factory (Charlie Bucket, #1)
Genres: ['Childrens', 'Fiction', 'Fantasy', 'Classics', 'Young Adult', 'Middle Grade', 'Humor']
score = 68.92379760742188
Book: The Sea of Monsters (Percy Jackson and the Olympians, #2)
Genres: ['Fantasy', 'Young Adult', 'Mythology', 'Fiction', 'Middle Grade', 'Adventure', 'Greek Mythology']
score = 69.14126586914062
Book: The Magician's Nephew (Chronicles of Narnia, #6)
Genres: ['Fantasy', 'Classics', 'Fiction', 'Young Adult', 'Childrens', 'Middle Grade', 'Adventure']
score = 70.03553771972656
Conclusion
Document retrieval is an essential component of RAG, and a vector store provides the functionality to perform semantic search efficiently. LangChain is a powerful abstraction layer that helps us build the vector store and retrieve relevant documents, and it also provides the functionality to communicate with LLMs.
Different embedding models and similarity metrics will give different results. We need to examine the business requirements and tune the embedding model so that it retrieves the documents most relevant to a given query.
In the next part, we will connect to an LLM to complete the RAG book recommendation pipeline.
Thank you for reading, and have a great day!