Master's Thesis in Machine Learning: Chunking in Multilingual Technical Documentation
Background
Retrieval-Augmented Generation (RAG) combines external documents with large language models (LLMs) to improve answer accuracy by grounding outputs in factual sources. In RAG, a corpus of documents is first chunked into smaller segments, embedded into vectors, and stored in a vector database. At query time, the system retrieves the most relevant chunks and feeds them to the LLM. The chunking strategy, i.e., how documents are split, is crucial: it affects retrieval accuracy, context quality, and ultimately the correctness of generated answers.
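To make the pipeline concrete, here is a minimal sketch of the chunk-embed-retrieve loop in Python. `embed_texts` is a random-vector stand-in for a real embedding model, and the corpus and query strings are placeholders; only the control flow is meant to be illustrative.

```python
# Minimal RAG skeleton: chunk -> embed -> index -> retrieve -> prompt.
import numpy as np

def chunk(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Fixed-size character chunks; neighbouring chunks share `overlap` chars."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed_texts(texts: list[str]) -> np.ndarray:
    """Stand-in embedder (random vectors); swap in a real multilingual model."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384))

def retrieve(query_vec: np.ndarray, index: np.ndarray, chunks: list[str], k: int = 3) -> list[str]:
    """Cosine-similarity top-k over a plain in-memory index."""
    sims = (index @ query_vec) / (np.linalg.norm(index, axis=1) * np.linalg.norm(query_vec))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

corpus = "..."  # placeholder for a parsed manual
chunks = chunk(corpus)
index = embed_texts(chunks)
top = retrieve(embed_texts(["price of part 4711"])[0], index, chunks)
prompt = "Answer using only this context:\n" + "\n---\n".join(top)  # sent to the LLM
```

Swapping the chunker and the embedder inside this loop is exactly the experimental surface the thesis explores.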
Our specific focus is a multilingual repository of technical documentation (e.g.,
User/Operator/Service manuals, spare parts inventories, price lists) in languages such as English, Swedish, and German, which contains numerous diagrams (not directly vectorized). User queries will typically request specific components, identified by number or name, and expect detailed information (e.g., specifications, pricing, instructions), or seek documentation relevant to machine troubleshooting.
Objective
The central research question is: How can we chunk multilingual technical documents to maximize the accuracy of a RAG system when answering part-specific queries? We will examine the following aspects:
- Chunking Strategy: Evaluate multiple chunking methods (fixed-size, structural, semantic, hierarchical, etc.) and their variants (overlap ratios, sizes). Determine which yields the most relevant retrievals for typical queries (e.g. “Find part X – what is its price and specs?”). In particular, test document-aware splitting using manuals’ TOCs/headings vs. purely text-based splits (see the chunking sketch after this list).
- Multilingual Handling: Assess whether translating non-English content or using multilingual embeddings is more effective. For example, compare indexing German/Swedish docs in their original language vs. translated to English, measuring any change in retrieval accuracy (a comparison sketch follows the list).
- Retrieval Accuracy Metrics: Define metrics suitable for the domain, such as retrieval recall of the correct part descriptions and precision of retrieved chunks. We may adapt standard IR metrics (e.g. recall/precision over tokens or passages) to our evaluation tasks. For generation quality, we can check whether the LLM’s answers include the correct part numbers/prices (a metrics sketch also follows the list).
- Vector Store and Tools: We will use Weaviate as the vector database (as currently deployed), leveraging its multilingual vectorizers (e.g. Cohere, OpenAI). If Weaviate limits arise, alternatives like Pinecone or Milvus could be considered, but initially we target optimizing within the existing Weaviate setup. The thesis will also explore Weaviate-specific features (e.g. metadata search, filters) to aid retrieval (a filter-query sketch closes the examples below).
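As referenced in the Chunking Strategy item, here is a sketch of two chunkers one might compare: fixed-size with overlap vs. a document-aware splitter keyed to numbered headings. The heading regex is an assumption; real manuals may need a PDF-outline or TOC-based detector instead.

```python
# Two chunkers to compare: fixed-size with overlap vs. heading-aware splitting.
import re

def fixed_size_chunks(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Fixed-size character windows with overlap between neighbours."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def heading_chunks(text: str) -> list[str]:
    """Split before every numbered heading line such as '3.2 Hydraulic pump'.
    The pattern is a simplifying assumption about manual layout."""
    parts = re.split(r"(?m)^(?=\d+(?:\.\d+)*\s+\S)", text)
    return [p.strip() for p in parts if p.strip()]

manual = "1 Overview\nGeneral info.\n1.1 Hydraulic pump\nPart no. 4711, 230 V."
print(heading_chunks(manual))
# ['1 Overview\nGeneral info.', '1.1 Hydraulic pump\nPart no. 4711, 230 V.']
```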
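For the Multilingual Handling comparison, a sketch assuming the sentence-transformers package and one of its public multilingual checkpoints; the German chunk and its English translation are invented examples. Running this over a labelled query set would show whether original-language or translated indexing retrieves better.

```python
# Compare query similarity against an original-language chunk and its translation.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

query = "What is the price of the hydraulic pump?"
chunk_de = "Hydraulikpumpe, Teilenr. 4711, Preis: 250 EUR."   # original German
chunk_en = "Hydraulic pump, part no. 4711, price: 250 EUR."   # machine-translated

q, de, en = model.encode([query, chunk_de, chunk_en])
print("query vs. German original :", util.cos_sim(q, de).item())
print("query vs. English version :", util.cos_sim(q, en).item())
```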
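For the Retrieval Accuracy Metrics item, recall@k and precision@k over retrieved chunk IDs could look like the following; the relevance labels are assumed to come from a hand-annotated query set mapping each part query to the chunk(s) containing its specs or price.

```python
# recall@k / precision@k over retrieved chunk IDs.
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k

retrieved = ["c7", "c2", "c9", "c4"]   # chunk IDs returned by the system
relevant = {"c2", "c4"}                # gold chunks for "part 4711 price"
print(recall_at_k(retrieved, relevant, 3))     # 0.5
print(precision_at_k(retrieved, relevant, 3))  # 0.333...
```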
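Finally, a sketch of a metadata-filtered query, assuming the Weaviate v4 Python client, a collection named ManualChunk with a configured vectorizer, and a `language` property; these names are illustrative, not part of the current deployment.

```python
# Semantic search restricted by a metadata filter in Weaviate.
import weaviate
from weaviate.classes.query import Filter

client = weaviate.connect_to_local()  # or connect_to_weaviate_cloud(...)
manuals = client.collections.get("ManualChunk")  # hypothetical collection

result = manuals.query.near_text(
    query="price of hydraulic pump part 4711",
    limit=5,
    filters=Filter.by_property("language").equal("de"),  # hypothetical property
)
for obj in result.objects:
    print(obj.properties)
client.close()
```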
Who are we looking for?
- Educational Background: Currently pursuing a Master's degree in a relevant technical discipline, such as AI/ML or Computer Science.
- RAG Knowledge: Strong theoretical foundation in the Retrieval-Augmented Generation (RAG) architecture and its core principles.
- LLM Experience: Practical experience interacting with Large Language Models (LLMs) via API.
- Vector Database Understanding: Good theoretical grasp of vector databases and their operational mechanisms. Prior experience with specific vector stores (e.g., Weaviate) is considered a bonus.
- Programming Proficiency: Expertise in Python for essential tasks, including data manipulation, implementing custom chunking algorithms, and developing the RAG pipeline.
- NLP Familiarity: Working knowledge of Natural Language Processing (NLP) concepts, especially text embedding and multilingual models.
Provided Resources
- Dataset of real-world documents, including technical manuals, spare part specifications, and price lists, in the form of PDFs.
- AI tools such as Cursor to ease development and experiment building.
About Hypertype
We’re building the next generation of AI products and AI Agents in the Customer Support space. Focused on manufacturing, but with clients across other sectors such as fintech and health tech, we are giving hundreds of companies across the globe the capacity to transform their customer journey with the latest advancements in AI. Our customers proudly recognise us for delivering the "highest quality answers in the market... outsmarting Gemini, Fin, Microsoft Copilot etc.", a testament to our unwavering commitment to excellence and precision in every interaction.
We are a group of people who go the extra mile together, who fearlessly push boundaries to new heights, and who deeply own what they deliver. We look for those who are ambitious, humble, and fast, so that together we can build things humankind has not yet witnessed.
- Department: Hyperbrain AI
- Locations: Stockholm