Master's Thesis in Machine Learning: Chunking in Multilingual Technical Documentation
Background
Retrieval-Augmented Generation (RAG) combines external documents with large language models (LLMs) to improve answer accuracy by grounding outputs in factual sources. In RAG, a corpus of documents is first chunked into smaller segments, embedded into vectors, and stored in a vector database. At query time, the system retrieves the most relevant chunks and feeds them to the LLM. The chunking strategy, i.e., how documents are split, is crucial: it affects retrieval accuracy, context quality, and ultimately the correctness of generated answers.
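To make the pipeline concrete, here is a minimal sketch of the chunk-embed-retrieve loop in Python. `embed_texts` is a random-vector stand-in for a real embedding model, and the corpus and query strings are placeholders; only the control flow is meant to be illustrative.

```python
# Minimal RAG skeleton: chunk -> embed -> index -> retrieve -> prompt.
import numpy as np

def chunk(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Fixed-size character chunks; neighbouring chunks share `overlap` chars."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed_texts(texts: list[str]) -> np.ndarray:
    """Stand-in embedder (random vectors); swap in a real multilingual model."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384))

def retrieve(query_vec: np.ndarray, index: np.ndarray, chunks: list[str], k: int = 3) -> list[str]:
    """Cosine-similarity top-k over a plain in-memory index."""
    sims = (index @ query_vec) / (np.linalg.norm(index, axis=1) * np.linalg.norm(query_vec))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

corpus = "..."  # placeholder for a parsed manual
chunks = chunk(corpus)
index = embed_texts(chunks)
top = retrieve(embed_texts(["price of part 4711"])[0], index, chunks)
prompt = "Answer using only this context:\n" + "\n---\n".join(top)  # sent to the LLM
```

Swapping the chunker and the embedder inside this loop is exactly the experimental surface the thesis explores.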
Our specific focus is a multilingual repository of technical documentation (e.g.,
User/Operator/Service manuals, spare parts inventories, price lists) in languages such as English, Swedish, and German, which contains numerous diagrams (not directly vectorized). User queries will typically request specific components, identified by number or name, and expect detailed information (e.g., specifications, pricing, instructions), or seek documentation relevant to machine troubleshooting.
Objective
The central research question is: How can we chunk multilingual technical documents to maximize the accuracy of a RAG system when answering part-specific queries? We will examine the following aspects:
- Chunking Strategy: Evaluate multiple chunking methods (fixed-size, structural, semantic, hierarchical, etc.) and their variants (overlap ratios, sizes). Determine which yields the most relevant retrievals for typical queries (e.g. “Find part X – what is its price and specs?”). In particular, test document-aware splitting using manuals’ TOCs/headings vs. purely text-based splits (see the chunking sketch after this list).
- Multilingual Handling: Assess whether translating non-English content or using multilingual embeddings is more effective. For example, compare indexing German/Swedish docs in their original language vs. translated to English, measuring any change in retrieval accuracy (a comparison sketch follows the list).
- Retrieval Accuracy Metrics: Define metrics suitable for the domain, such as retrieval recall of the correct part descriptions and precision of retrieved chunks. We may adapt standard IR metrics (e.g. recall/precision over tokens or passages) to our evaluation tasks. For generation quality, we can check whether the LLM’s answers include the correct part numbers/prices (a metrics sketch also follows the list).
- Vector Store and Tools: We will use Weaviate as the vector database (as currently deployed), leveraging its multilingual vectorizers (e.g. Cohere, OpenAI). If Weaviate limits arise, alternatives like Pinecone or Milvus could be considered, but initially we target optimizing within the existing Weaviate setup. The thesis will also explore Weaviate-specific features (e.g. metadata search, filters) to aid retrieval (a filter-query sketch closes the examples below).
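As referenced in the Chunking Strategy item, here is a sketch of two chunkers one might compare: fixed-size with overlap vs. a document-aware splitter keyed to numbered headings. The heading regex is an assumption; real manuals may need a PDF-outline or TOC-based detector instead.

```python
# Two chunkers to compare: fixed-size with overlap vs. heading-aware splitting.
import re

def fixed_size_chunks(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Fixed-size character windows with overlap between neighbours."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def heading_chunks(text: str) -> list[str]:
    """Split before every numbered heading line such as '3.2 Hydraulic pump'.
    The pattern is a simplifying assumption about manual layout."""
    parts = re.split(r"(?m)^(?=\d+(?:\.\d+)*\s+\S)", text)
    return [p.strip() for p in parts if p.strip()]

manual = "1 Overview\nGeneral info.\n1.1 Hydraulic pump\nPart no. 4711, 230 V."
print(heading_chunks(manual))
# ['1 Overview\nGeneral info.', '1.1 Hydraulic pump\nPart no. 4711, 230 V.']
```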
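For the Multilingual Handling comparison, a sketch assuming the sentence-transformers package and one of its public multilingual checkpoints; the German chunk and its English translation are invented examples. Running this over a labelled query set would show whether original-language or translated indexing retrieves better.

```python
# Compare query similarity against an original-language chunk and its translation.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

query = "What is the price of the hydraulic pump?"
chunk_de = "Hydraulikpumpe, Teilenr. 4711, Preis: 250 EUR."   # original German
chunk_en = "Hydraulic pump, part no. 4711, price: 250 EUR."   # machine-translated

q, de, en = model.encode([query, chunk_de, chunk_en])
print("query vs. German original :", util.cos_sim(q, de).item())
print("query vs. English version :", util.cos_sim(q, en).item())
```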
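For the Retrieval Accuracy Metrics item, recall@k and precision@k over retrieved chunk IDs could look like the following; the relevance labels are assumed to come from a hand-annotated query set mapping each part query to the chunk(s) containing its specs or price.

```python
# recall@k / precision@k over retrieved chunk IDs.
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k

retrieved = ["c7", "c2", "c9", "c4"]   # chunk IDs returned by the system
relevant = {"c2", "c4"}                # gold chunks for "part 4711 price"
print(recall_at_k(retrieved, relevant, 3))     # 0.5
print(precision_at_k(retrieved, relevant, 3))  # 0.333...
```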
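Finally, a sketch of a metadata-filtered query, assuming the Weaviate v4 Python client, a collection named ManualChunk with a configured vectorizer, and a `language` property; these names are illustrative, not part of the current deployment.

```python
# Semantic search restricted by a metadata filter in Weaviate.
import weaviate
from weaviate.classes.query import Filter

client = weaviate.connect_to_local()  # or connect_to_weaviate_cloud(...)
manuals = client.collections.get("ManualChunk")  # hypothetical collection

result = manuals.query.near_text(
    query="price of hydraulic pump part 4711",
    limit=5,
    filters=Filter.by_property("language").equal("de"),  # hypothetical property
)
for obj in result.objects:
    print(obj.properties)
client.close()
```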
Who are we looking for?
- Educational Background: Currently pursuing a Master's degree in a relevant technical discipline, such as AI/ML or Computer Science.
- RAG Knowledge: Strong theoretical foundation in the Retrieval-Augmented Generation (RAG) architecture and its core principles.
- LLM Experience: Practical experience interacting with Large Language Models (LLMs) via API.
- Vector Database Understanding: Good theoretical grasp of vector databases and their operational mechanisms. Prior experience with specific vector stores (e.g., Weaviate) is considered a bonus.
- Programming Proficiency: Expertise in Python for essential tasks, including data manipulation, implementing custom chunking algorithms, and developing the RAG pipeline.
- NLP Familiarity: Working knowledge of Natural Language Processing (NLP) concepts, especially text embedding and multilingual models.
Provided Resources
- Dataset of real-world documents, including technical manuals, spare part specifications, and price lists, in the form of PDFs.
- AI tools such as Cursor to ease development and experiment building.
About Hypertype
We’re building the next generation of AI products and AI Agents in the Customer Support space. Focused on manufacturing, but with clients across other sectors such as fintech and health tech, we are giving hundreds of companies across the globe the capacity to transform their customer journey with the latest advancements in AI. Our customers proudly recognise us for delivering the "highest quality answers in the market... outsmarting Gemini, Fin, Microsoft Copilot etc.", a testament to our unwavering commitment to excellence and precision in every interaction.
We are a group of people who go the extra mile together, who fearlessly push boundaries to new heights, and who deeply own what they deliver. We look for those who are ambitious, humble, and fast, so that together we can build things humankind has not yet witnessed.
- Department: Hyperbrain AI
- Locations: Stockholm