How to prepare your PDF docs for Smabbler API RAG using LlamaParse

This guide shows how to process PDF documents with the LlamaParse library.

The example below uses Python, because LlamaParse provides native support for this language.


Getting started

  1. Collect documents: gather all PDF documents you wish to process and place them in a single directory.

  2. Create an account: sign up for a free account on LlamaIndex and obtain an API Key.

  3. Install LlamaParse: run the following commands to ensure you have the latest version of LlamaParse installed.

pip uninstall llama-index

pip install --upgrade --no-cache-dir --force-reinstall llama-index

pip install llama-parse
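
To confirm that the installation succeeded, the installed package versions can be printed from Python. This is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and not part of LlamaParse.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the versions of the two packages installed above
for name in ("llama-index", "llama-parse"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If either line reports "not installed", repeat the pip commands above in the same Python environment that will run the processing script.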

Data processing

Before running the following code, ensure you have correctly set the necessary variables:

  • api_key: your LlamaIndex API Key, as obtained in the "Getting Started" section.
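
Instead of pasting the key into the script, it can be read from an environment variable. The sketch below is one possible approach, not part of the original example: the helper read_api_key is hypothetical, and the variable name LLAMA_CLOUD_API_KEY is an assumption that you can replace with any name you prefer.

```python
import os


def read_api_key(env=os.environ, var_name="LLAMA_CLOUD_API_KEY"):
    """Return the API key stored in the given environment mapping.

    The variable name is an assumption; raises a clear error when the
    key is missing instead of passing an empty value to the parser.
    """
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key


# Hypothetical usage in the script below:
#   parser = LlamaParse(api_key=read_api_key(), result_type="text", ...)
```

This keeps the key out of the source file, which helps if the script is shared or committed to version control.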

To process the documents, run the following code in the directory containing your PDFs. All PDF documents in the directory will be processed, and the output will be saved to a CSV file.

Each document's text is truncated to 2,000,000 characters, which is Smabbler's maximum text length.

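import csv
import os

import nest_asyncio
from llama_parse import LlamaParse

# Allow nested event loops, e.g. when running inside a Jupyter notebook
nest_asyncio.apply()

# Smabbler's maximum text length per document, in characters
maxFileTextLength = 2000000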

parser = LlamaParse(
    api_key="***your-api-key-here***", 
    result_type="text", 
    num_workers=4, 
    verbose=True,
    language="en",  # Optionally you can define a language, default=en
)

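def getFileText(filename):
    # Parse one PDF and return its text, truncated to the Smabbler limit
    filepath = f"./{filename}"
    parsedDoc = parser.load_data(filepath)

    # load_data returns a list of documents; join their text into one string
    texts = [node.text for node in parsedDoc]
    mergedText = "\r\n".join(texts)

    return mergedText[:maxFileTextLength]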

def getPdfFiles():
    # List the PDF files in the current directory
    return [f for f in os.listdir('.') if os.path.isfile(f) and f.endswith(".pdf")]

def exportFileTextsToCsv(fileTexts):
    # Write (filename, text) pairs to a semicolon-delimited CSV file
    with open("file-export.csv", "w", newline='', encoding="utf-8") as outfile:
        csvwriter = csv.writer(outfile, delimiter=';',
                               quotechar='"', quoting=csv.QUOTE_MINIMAL)
        csvwriter.writerow(["filename", "text"])

        for fileText in fileTexts:
            csvwriter.writerow([fileText[0], fileText[1]])

# Parse every PDF in the current directory and export the results
fileTexts = []

files = getPdfFiles()

for file in files:
    fileText = getFileText(file)
    fileTexts.append((file, fileText))

exportFileTextsToCsv(fileTexts)

The script produces a file named file-export.csv, which can be used in the next steps of building the RAG model.
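
Before passing the export on, it can be useful to read it back and check its contents. The following is a minimal sketch, assuming the semicolon delimiter and the filename;text header used by the script above; the helper read_export and the 10,000,000-character limit are illustrative choices. The limit is raised because parsed documents can be longer than the csv module's default field size limit of 131,072 characters.

```python
import csv


def read_export(path="file-export.csv"):
    """Read the export back as a list of (filename, text) tuples."""
    # Raise the field size limit above the 2,000,000-character maximum
    csv.field_size_limit(10_000_000)
    with open(path, newline="", encoding="utf-8") as infile:
        reader = csv.reader(infile, delimiter=";", quotechar='"')
        next(reader, None)  # skip the header row
        return [(row[0], row[1]) for row in reader]


# Hypothetical usage, run in the directory containing the export:
#   for filename, text in read_export():
#       print(filename, len(text))
```

Opening the file with newline="" lets the csv module handle the line breaks embedded in the quoted text fields.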
