1. The Problem

In many professional workflows, we receive lengthy documents such as insurance policies, research papers, contracts, and compliance reports, often running anywhere from 50 to 200 pages.

We faced a real-life challenge:

Given a 140-page PDF, how can we quickly identify the relevant pages that contain useful information?

Manually scanning such large PDFs is inefficient, time-consuming, and error-prone, especially when:

  • The information is scattered randomly
  • Some PDFs are scanned images (not searchable)
  • We only need 5-10 pages from the entire file

This triggered the need for an automated system that can analyze all pages and return only the relevant ones.

2. Creating the Dataset — Manual Labeling First

Before building any model or writing code, we started with the most important step: understanding the problem and manually labeling data.

We collected content from multiple long PDF documents — these included insurance policies, legal agreements, and general multi-page reports. From these documents, we manually selected small text sections (usually paragraphs) and assigned labels based on their content:

  • Relevant — paragraphs that contained useful or critical information
  • Irrelevant — background, redundant, or general-purpose content

These labeled sections were added to an Excel file, where each row had:

  • text — the paragraph content
  • label — either relevant or irrelevant

This process gave us a small but meaningful, high-quality supervised dataset. It formed the foundation for training a machine learning model to detect relevance automatically in future documents.
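
As a small illustration, loading this labeled file back into Python could look like the snippet below. The filename is a placeholder of our own; the text and label columns match the structure described above.

import pandas as pd

# Load the manually labeled dataset (filename is illustrative)
df = pd.read_excel('labeled_paragraphs.xlsx')

texts = df['text'].tolist()    # paragraph content
labels = df['label'].tolist()  # 'relevant' or 'irrelevant'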

3. The Core Idea

We designed a system that reads each page of a PDF, breaks it into smaller paragraphs, and uses a machine learning model to classify each paragraph as relevant or general.

If enough paragraphs in a page are relevant → the page is considered useful.
This approach works on both:

  • Text-based PDFs: directly extractable text
  • Scanned/image PDFs: where we need OCR to extract content

The result is a list of page numbers that can be directly reviewed or processed, saving time and improving accuracy.

4. Training the Relevance Classifier

We used Scikit-learn to train a text classification model. The model’s job was to take in a paragraph and predict whether it was relevant.

We chose a simple but powerful approach:

  • Vectorize the text using CountVectorizer
  • Use Multinomial Naive Bayes for classification

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

model_pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])

We split the data into training and test sets, trained the model, and evaluated it using accuracy and a classification report.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2)
model_pipeline.fit(X_train, y_train)

Once satisfied with the results, we saved the trained pipeline using joblib.
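
As a rough sketch, the evaluation and saving steps could look like this; the output filename is our own placeholder:

from sklearn.metrics import accuracy_score, classification_report
import joblib

# Evaluate on the held-out test set
predictions = model_pipeline.predict(X_test)
print(accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))

# Persist the trained pipeline for later use (filename is illustrative)
joblib.dump(model_pipeline, 'relevance_classifier.joblib')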

5. Extracting Text from the PDF

PDFs come in two types:

  • Text-based PDFs — easy to extract using PyPDF2
  • Scanned/image-based PDFs — required OCR using Azure Form Recognizer

We wrote a small utility to detect the type of PDF and route it to the right extractor (see the sketch after this list):

  • For text-based PDFs, we used PyPDF2 to extract text.
  • For scanned PDFs, we used Azure Form Recognizer’s prebuilt-read model to perform OCR and extract each line of text.
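
The snippet below is a minimal sketch of how such a utility could be structured, not the exact production code. The helper names, the endpoint and key variables, and the heuristic of falling back to OCR when PyPDF2 returns almost no text are our own assumptions; the Azure calls shown come from the azure-ai-formrecognizer SDK, whose exact API may vary by version.

from PyPDF2 import PdfReader
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

def extract_pages_with_pypdf2(pdf_path):
    # One text string per page, for text-based PDFs
    reader = PdfReader(pdf_path)
    return [page.extract_text() or '' for page in reader.pages]

def extract_pages_with_ocr(pdf_path, endpoint, key):
    # OCR fallback for scanned PDFs using the prebuilt-read model
    client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))
    with open(pdf_path, 'rb') as f:
        result = client.begin_analyze_document('prebuilt-read', document=f).result()
    return ['\n'.join(line.content for line in page.lines) for page in result.pages]

def extract_pages(pdf_path, endpoint=None, key=None):
    # Heuristic: if PyPDF2 finds almost no text, treat the PDF as scanned
    pages = extract_pages_with_pypdf2(pdf_path)
    if sum(len(p.strip()) for p in pages) < 50 and endpoint and key:
        pages = extract_pages_with_ocr(pdf_path, endpoint, key)
    return pages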

6. Predicting Relevance Page-wise

Once we had text content for each page, we split each page into paragraphs using line breaks:

paragraphs = text.split('\n')

Then, for each paragraph, we predicted its label using our model:

prediction = model.predict([paragraph])[0]

If more than 25% of the paragraphs in a page were marked relevant, we considered the entire page relevant.

# 'model' is the trained pipeline loaded back with joblib
paragraphs = text.split('\n')
relevant = 0

for para in paragraphs:
    if model.predict([para])[0] == 'relevant':
        relevant += 1

if relevant > len(paragraphs) / 4:
    relevant_pages.append(page_number)

This gave us two things:

  • A list of relevant page numbers
  • A breakdown of relevance per page
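
Putting the pieces together, a page-level loop could look roughly like the sketch below. The function name and the breakdown dictionary are illustrative, and we also skip empty lines when splitting into paragraphs, which the snippet above does not.

def find_relevant_pages(page_texts, model, threshold=0.25):
    # page_texts: one extracted text string per page (see the extraction step above)
    relevant_pages = []
    breakdown = {}

    for page_number, text in enumerate(page_texts, start=1):
        paragraphs = [p for p in text.split('\n') if p.strip()]
        if not paragraphs:
            continue
        relevant = sum(1 for para in paragraphs
                       if model.predict([para])[0] == 'relevant')
        breakdown[page_number] = relevant / len(paragraphs)
        if breakdown[page_number] > threshold:
            relevant_pages.append(page_number)

    return relevant_pages, breakdown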

7. Outcome & Impact

Once the model was trained and integrated, we used it inside a larger product workflow.

Here’s what changed:

  • Given a large document, we now automatically detect only the relevant pages
  • These selected pages are passed on to other AI models for data extraction
  • Since unnecessary pages are filtered out early, extraction becomes faster, more accurate, and less noisy.

This significantly reduces the load on downstream systems and improves both performance and precision.