How to prepare your PDF docs for Smabbler API RAG using Unstructured.io?
This guide shows how to process PDF documents with the Unstructured API.
The example uses Python, because Unstructured publishes a library for batch document ingest in this language.
Getting started
Create either a Free Tier account or a regular account (on the landing page, click the "Get started for free" button). The API URL depends on the account type: for the Free Tier you can obtain the URL here, and for a regular account you can get it from your personal dashboard.
Collect documents: gather all PDF documents you wish to process and place them in a single directory.
Install the Unstructured ingest library and its extension for processing PDF files.
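The installation step can also be scripted from Python. The sketch below assumes that the library is published on PyPI as unstructured-ingest and that PDF support is installed through an extra named pdf. Neither name is stated above, so verify both against the Unstructured documentation. The helper names build_install_command and install are hypothetical.

```python
import subprocess
import sys

# Assumed package name and extra; verify them against the Unstructured docs.
PACKAGE_SPEC = "unstructured-ingest[pdf]"


def build_install_command(package_spec: str = PACKAGE_SPEC) -> list[str]:
    """Return a pip command that installs into the current interpreter."""
    return [sys.executable, "-m", "pip", "install", package_spec]


def install(package_spec: str = PACKAGE_SPEC) -> None:
    """Run pip; raises subprocess.CalledProcessError if installation fails."""
    subprocess.run(build_install_command(package_spec), check=True)
```

Calling install() runs the command. Using sys.executable rather than a bare pip ensures the package lands in the same environment that will later run the processing code.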
Data processing
Before running the following code, make sure the variables in it are set correctly:
apiKey should contain your API Key obtained as described in the "Getting started" section,
apiUrl should contain the API URL as described in the "Getting started" section,
inputDirectory should point to the directory where you have collected your PDF documents,
outputDirectory should point to an existing directory where you wish to store the output files created by Unstructured during processing and the resulting CSV file.
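The four variables can be checked before any processing starts. The following is a minimal sketch, not part of the original code: the Settings class and the check_settings function are hypothetical names, and the specific checks (a non-empty key, a URL scheme, at least one .pdf file) are assumptions about what "properly set" means.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Settings:
    # Field names mirror the variables described above.
    apiKey: str
    apiUrl: str
    inputDirectory: str
    outputDirectory: str


def check_settings(settings: Settings) -> list[str]:
    """Return a list of problems; an empty list means the settings look usable."""
    problems = []
    if not settings.apiKey.strip():
        problems.append("apiKey is empty")
    if not settings.apiUrl.startswith(("http://", "https://")):
        problems.append("apiUrl does not look like a URL")
    input_dir = Path(settings.inputDirectory)
    if not input_dir.is_dir():
        problems.append("inputDirectory does not exist")
    elif not any(p.suffix.lower() == ".pdf" for p in input_dir.iterdir()):
        problems.append("inputDirectory contains no PDF files")
    # The output directory must already exist, as noted above.
    if not Path(settings.outputDirectory).is_dir():
        problems.append("outputDirectory does not exist")
    return problems
```

Running the check first turns a misconfigured path into a clear message instead of a failure partway through a batch.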
All PDF documents in the directory will be processed, and the output will be saved to a CSV file.
Only the first 20,000 characters of each document's text are kept, due to Smabbler's maximum text length limit.
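The length limit can be applied as sketched below. The sketch assumes that Unstructured returns each document as a list of elements, each carrying its content in a "text" field, and that the element texts are joined with newlines before truncation. The joining rule and the function name join_and_truncate are assumptions, not the original implementation.

```python
MAX_TEXT_LENGTH = 20_000  # Smabbler's maximum text length, as stated above


def join_and_truncate(elements: list[dict], limit: int = MAX_TEXT_LENGTH) -> str:
    """Join the non-empty "text" fields of the elements and cut at `limit`."""
    parts = []
    for element in elements:
        text = (element.get("text") or "").strip()
        if text:  # skip elements without text, such as images
            parts.append(text)
    # Joining with newlines is an assumption; slicing keeps at most `limit` characters.
    return "\n".join(parts)[:limit]
```

Note that the cut is made on characters, so it may fall in the middle of a word or sentence.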
As a result, a file-export.csv file is produced. This file can be used in the next steps of building the RAG model.
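One way the CSV file could be assembled from the Unstructured output is sketched here. It assumes the output directory holds one JSON file per document, each containing a list of elements with a "text" field. The column names filename and text are hypothetical, since the layout Smabbler expects is not specified above, and export_csv is an illustrative name.

```python
import csv
import json
from pathlib import Path

MAX_TEXT_LENGTH = 20_000  # Smabbler's maximum text length


def export_csv(output_directory: str, csv_name: str = "file-export.csv") -> Path:
    """Write one CSV row per JSON file found in the output directory."""
    out_dir = Path(output_directory)
    csv_path = out_dir / csv_name
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "text"])  # hypothetical column names
        for json_path in sorted(out_dir.glob("*.json")):
            elements = json.loads(json_path.read_text(encoding="utf-8"))
            text = "\n".join(e.get("text") or "" for e in elements)
            # "report.pdf.json" becomes "report.pdf" in the first column.
            writer.writerow([json_path.stem, text[:MAX_TEXT_LENGTH]])
    return csv_path
```

The csv module quotes fields that contain newlines or commas, so multi-line document text stays in a single row.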