How to prepare your PDF docs for Smabbler API RAG using Open-Parse?

The PDF processing described here uses the Open-Parse library.

The following example is written in Python, because Open-Parse natively supports this language.

Getting started

  1. Collect documents: gather all PDF documents you wish to process and place them in a single directory.

  2. Install Open-Parse:

pip install openparse
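
Before processing, it can help to confirm which files in the directory will be picked up. The sketch below is one way to list them. `list_pdf_files` is a hypothetical helper written for this illustration and is not part of Open-Parse. It assumes that matching the extension case-insensitively is acceptable, so files such as `REPORT.PDF` are listed too.

```python
from pathlib import Path


def list_pdf_files(directory="."):
    """Return sorted names of regular files in `directory` with a .pdf extension (any case)."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


if __name__ == "__main__":
    # Print every PDF found in the current directory, one per line.
    for name in list_pdf_files():
        print(name)
```

Subdirectories are skipped even if their name ends in `.pdf`, because only regular files are returned.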

Data processing

To process the documents, run the following code in the directory containing your PDFs. All PDF documents in the directory will be processed, and the output will be saved to a CSV file.

The text of each document is truncated to 2,000,000 characters because of Smabbler's maximum text length limit.

    import csv
    import os

    import openparse

    # Smabbler's maximum text length, in characters.
    maxFileTextLength = 2000000


    def getFileText(filename):
        """Parse one PDF and return its text, truncated to the length limit."""
        filepath = f"./{filename}"
        parser = openparse.DocumentParser()
        parsedDoc = parser.parse(filepath)

        # Collect the text of every parsed node and join the pieces with line breaks.
        texts = [node.text for node in parsedDoc.nodes]
        mergedText = "\r\n".join(texts)

        return mergedText[:maxFileTextLength]


    def getPdfFiles():
        """Return the names of all PDF files in the current directory."""
        return [f for f in os.listdir(".") if os.path.isfile(f) and f.endswith(".pdf")]


    def exportFileTextsToCsv(fileTexts):
        """Write (filename, text) pairs to file-export.csv, separated by semicolons."""
        with open("file-export.csv", "w", newline="", encoding="utf-8") as outfile:
            csvwriter = csv.writer(outfile, delimiter=";",
                                   quotechar='"', quoting=csv.QUOTE_MINIMAL)
            csvwriter.writerow(["filename", "text"])

            for fileText in fileTexts:
                csvwriter.writerow([fileText[0], fileText[1]])


    fileTexts = []

    # Parse every PDF in the current directory, then export all texts at once.
    for file in getPdfFiles():
        fileTexts.append((file, getFileText(file)))

    exportFileTextsToCsv(fileTexts)

As a result, a file named file-export.csv is produced. It can be used in the next steps of building the RAG model.
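
To check the export before moving on, the file can be read back with the same delimiter and quoting settings. The following is a minimal sketch, not part of the Smabbler or Open-Parse tooling. `read_export` is a hypothetical helper, and the 10,000,000 value is an arbitrary ceiling chosen for this illustration. The limit has to be raised because Python's `csv` module rejects fields longer than 131,072 characters by default, while a document text here can reach 2,000,000 characters.

```python
import csv


def read_export(path="file-export.csv"):
    """Read the exported CSV and return a list of (filename, text) tuples."""
    # Raise the csv module's field size limit (default 131072) so that long
    # document texts can be read. 10_000_000 is an assumed, arbitrary ceiling.
    csv.field_size_limit(10_000_000)

    with open(path, newline="", encoding="utf-8") as infile:
        reader = csv.reader(infile, delimiter=";", quotechar='"')
        next(reader, None)  # skip the header row: filename;text
        return [(row[0], row[1]) for row in reader]
```

Calling `read_export()` in the directory that contains the export returns one tuple per processed document, so printing `len(text)` for each entry gives a quick check that no text is empty or longer than the limit.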
