In this project, I will try to build a local retrieval augmented generation (RAG) app for academic literature. RAG was first introduced in Lewis et al. (2020) and can be summarized as a technique that leverages large language models (LLMs) to efficiently retrieve and synthesize factual information from a (local) database while minimizing hallucinations. In other words, it is a tool that lets the user search for or synthesize text across many text files (e.g., PDF files) using the chat feature of an LLM (think of it as a local ChatGPT whose knowledge base is limited to your text files). It involves vectorizing text documents (e.g., using Sentence-BERT), indexing them (e.g., using FAISS), and then querying them through an LLM with a chat feature (e.g., LLaMA 3).
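To make the pipeline concrete, here is a minimal sketch of the retrieval step, assuming the sentence-transformers and faiss-cpu packages; the toy chunks, the embedding model name, and the final prompt assembly are placeholders rather than the app's actual design.

```python
# Minimal RAG retrieval sketch: embed text chunks, index them with FAISS,
# and pull the most relevant chunks into an LLM prompt.
# Assumes: pip install sentence-transformers faiss-cpu
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy "documents" standing in for text extracted from PDFs.
chunks = [
    "Lewis et al. (2020) introduced retrieval augmented generation.",
    "FAISS builds an index over dense vectors for fast similarity search.",
    "Sentence-BERT maps sentences to fixed-size embeddings.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks, normalize_embeddings=True)

# Inner-product index on normalized vectors = cosine similarity.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(np.asarray(embeddings, dtype="float32"))

query = "What is retrieval augmented generation?"
query_vec = model.encode([query], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query_vec, dtype="float32"), k=2)

# The retrieved chunks would then be passed to a local chat LLM
# (e.g., LLaMA 3) as context for answering the question.
context = "\n".join(chunks[i] for i in ids[0])
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)
```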
At a recent NLP conference in Varna, Bulgaria, three Bulgarian computer scientists presented two BERT- and GPT-based text classification models that outperformed previous classifiers for Bulgarian.
Are the electorates of GRÜNE Schweiz and the Grünliberale Partei Schweiz similar? How do they differ? These are two of the questions Lukas Rudolph (University of Konstanz) and I will answer in this project.