How to prepare your PDF docs for Smabbler API RAG using Unstructured.io?

The PDF document processing described here uses the Unstructured API.

The following example is written in Python, since Unstructured publishes a library for batch document ingestion in this language.

Getting started

  1. Create either a Free Tier account or a regular account (on the landing page, click the "Get started for free" button). The API URL differs depending on the account type: for the Free Tier you can obtain the URL here; for a regular account you can get it from your personal dashboard. A short sketch of supplying the API key without hardcoding it follows this list.

  2. Collect documents: gather all PDF documents you wish to process and place them in a single directory.

  3. Install the Unstructured ingest library and its extension for processing PDF files:

pip install unstructured-ingest
pip install "unstructured-ingest[pdf]"

Data processing

Before running the following code, please make sure the variables in it are set properly:

  • apiKey should contain your API key obtained as described in the "Getting started" section,

  • apiUrl should contain the API URL as described in the "Getting started" section,

  • inputDirectory should point to the directory where you have collected your PDF documents,

  • outputDirectory should point to an existing directory where you wish to store the output files created by Unstructured during processing and the resulting CSV file.

All PDF documents in the directory will be processed, and the output will be saved to a CSV file.

import csv
import os
import json

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes import FiltererConfig
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig
from unstructured_ingest.v2.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig,
    LocalUploaderConfig
)
# Maximum number of characters exported per document
maxFileTextLength = 20000

apiKey = "YOUR API KEY HERE"
apiUrl = "https://api.unstructuredapp.io/general/v0/general"
inputDirectory = "D:\\unstructured-input"
outputDirectory = "D:\\unstructured-output"

def getOutputFiles():
    # List the JSON files produced by the Unstructured pipeline
    files = [f for f in os.listdir(outputDirectory)
             if os.path.isfile(f"{outputDirectory}/{f}") and f.endswith(".json")]
    return files


def getFileText(filename):
    # Merge the text of all elements extracted from a single document
    with open(f"{outputDirectory}/{filename}", "r", encoding="utf-8") as f:
        data = json.load(f)
        texts = []

        for node in data:
            texts.append(node["text"])

        mergedText = "\r\n".join(texts)

        # Truncate to keep the exported text within the configured limit
        return mergedText[:maxFileTextLength]

def exportFileTextsToCsv(fileTexts):
    # Write one row per document: the filename and its extracted text
    with open(f"{outputDirectory}/file-export.csv", "w", newline='', encoding="utf-8") as outfile:
        csvwriter = csv.writer(outfile, delimiter=';',
                               quotechar='"', quoting=csv.QUOTE_MINIMAL)
        csvwriter.writerow(["filename", "text"])

        for fileText in fileTexts:
            csvwriter.writerow([fileText[0], fileText[1]])

if __name__ == "__main__":
    # Partition all PDF documents in the input directory via the Unstructured API
    Pipeline.from_configs(
        filterer_config=FiltererConfig(file_glob=["*.pdf"]),
        context=ProcessorConfig(),
        indexer_config=LocalIndexerConfig(input_path=inputDirectory),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=apiKey,
            partition_endpoint=apiUrl,
            strategy="hi_res",
            metadata_include=["filename"],
            additional_partition_args={
                # Split large PDFs into pages and process them concurrently
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15
            }
        ),
        uploader_config=LocalUploaderConfig(output_dir=outputDirectory)
    ).run()

    # Collect the extracted text from each output JSON file
    fileTexts = []

    files = getOutputFiles()

    for file in files:
        fileText = getFileText(file)
        fileTexts.append((file, fileText))

    # Export filenames and texts to a single CSV file
    exportFileTextsToCsv(fileTexts)

As a result, a file-export.csv file is produced. This file can be used in the next steps of building your RAG model.
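
For example, a downstream step could load the exported texts back in like this (a minimal sketch using Python's standard csv module; how the documents are indexed afterwards depends on your RAG setup):

import csv

# Load the exported CSV; each row holds a filename and its extracted text
with open("D:\\unstructured-output\\file-export.csv", newline='', encoding="utf-8") as f:
    reader = csv.reader(f, delimiter=';', quotechar='"')
    next(reader)  # skip the header row
    documents = [(filename, text) for filename, text in reader]

print(f"Loaded {len(documents)} documents ready for indexing")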