Contexts & Machines: How Document Parsing Shapes RAG Results
- Published on
- Authors: alessio
Retrieval-Augmented Generation (RAG) pipelines have shown their effectiveness in exploring complex documents. However, their performance hinges on the quality of the retrieved context, which depends on well-structured document inputs. Real-world documents often contain unstructured elements – images, tables, multi-column text, etc. – making parsing and chunking a critical challenge. Poor document processing can degrade retrieval quality, increasing the risk of hallucinations in LLM responses.
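To make the chunking problem concrete, here is a minimal sketch of the naive baseline most pipelines start from: fixed-size character chunks with overlap. This is an illustrative example, not the method evaluated in the study; structure-aware strategies (splitting on headings, tables, or layout) are exactly what such a baseline ignores.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size character chunks.

    A deliberately naive strategy: it ignores document structure, so a
    table row or sentence can be cut in half mid-chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

Because each chunk repeats the last `overlap` characters of its predecessor, a sentence cut at a boundary still appears whole in at least one chunk, which is the usual motivation for overlap in retrieval settings.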
How do different document parsing and chunking strategies impact RAG pipeline performance?
In our talk at Berlin Buzzwords 2025, we present the results of a study evaluating different PDF parsing and document chunking strategies – spanning both open-source and commercial-grade solutions – to determine their impact on RAG performance. Using a dataset of complex documents and LLM-generated question/answer pairs, we apply several evaluation metrics to quantify how different parsing techniques affect the relevance of retrieved information and the accuracy of generated responses. Our findings reveal that parsing and chunking strategies significantly shape RAG output quality, and that the most effective approach may depend on the nature of your documents.
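As an illustration of how retrieval quality can be scored against LLM-generated question/answer pairs, here is a sketch of one simple metric, hit rate: the fraction of questions for which the chunk containing the answer appears in the retrieved set. This is a hypothetical example for exposition, not necessarily one of the metrics used in the study.

```python
def hit_rate(retrieved: list[list[str]], gold: list[str]) -> float:
    """Fraction of questions whose gold (answer-bearing) chunk ID
    appears among the chunk IDs retrieved for that question."""
    if not gold:
        raise ValueError("gold must be non-empty")
    hits = sum(1 for chunks, g in zip(retrieved, gold) if g in chunks)
    return hits / len(gold)
```

Running the same question set against pipelines that differ only in their parsing or chunking step makes the resulting hit rates directly comparable across strategies.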