Overview

What the Application Does

This app provides a complete pipeline for digitizing and processing PVT reports:

  • PDF/Image/Excel Import: Upload report files as PDFs, images, or Excel workbooks.

  • OCR Processing: Extract textual data from documents using Optical Character Recognition (OCR).

  • Table Reconstruction: Identify and reconstruct tables from the extracted data.

  • Data Mapping: Map the extracted data to standard formats for analysis.

  • Data Merging: Combine data from multiple sources into a single unified dataset.
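The last three steps can be illustrated with a minimal sketch. It assumes OCR output arrives as text fragments with page-relative x/y positions, and every name in it (HEADER_ALIASES, the column names, the three functions) is hypothetical and chosen for illustration, not taken from the application's code.

```python
from collections import defaultdict

# Hypothetical alias table: raw header text -> standard column name.
HEADER_ALIASES = {
    "press. psia": "pressure_psia",
    "pressure (psia)": "pressure_psia",
    "oil fvf": "oil_fvf",
}


def group_fragments_into_rows(fragments, y_tolerance=0.01):
    """Table reconstruction: group (text, x, y) fragments into rows, left to right."""
    rows = []  # each entry is [row_y, [(x, text), ...]]
    for text, x, y in sorted(fragments, key=lambda f: f[2]):
        if rows and abs(y - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])
    return [[text for _, text in sorted(cells)] for _, cells in rows]


def map_headers(table):
    """Data mapping: turn a header row plus data rows into records with standard keys."""
    header, *data = table
    keys = [HEADER_ALIASES.get(h.strip().lower(), h) for h in header]
    return [dict(zip(keys, row)) for row in data]


def merge_records(*sources, key="pressure_psia"):
    """Data merging: combine records from several sources that share a key value."""
    merged = defaultdict(dict)
    for records in sources:
        for record in records:
            merged[record[key]].update(record)
    return [merged[k] for k in sorted(merged, key=float)]


# Invented sample data: fragments are deliberately out of reading order.
fragments = [
    ("2000", 0.10, 0.205), ("1.25", 0.50, 0.20),
    ("Press. psia", 0.10, 0.10), ("Oil FVF", 0.50, 0.105),
    ("1000", 0.10, 0.30), ("1.18", 0.50, 0.30),
]
table = group_fragments_into_rows(fragments)
records = map_headers(table)
other_source = [{"pressure_psia": "1000", "viscosity_cp": "0.85"}]
merged = merge_records(records, other_source)
```

Grouping by a fixed vertical tolerance is the simplest possible row heuristic. A real pipeline would more likely rely on detected table structure than on raw coordinates alone.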

Key Features

  • Automated document processing pipeline

  • Advanced table detection and reconstruction

  • Custom categorization for PVT report data

  • Bulk processing capabilities

  • API integration with the Azure-based backend for data handling

High-Level Workflow

PDF Reports → Digitization Pipeline → Structured Data → Analysis Tools → Export
     ↓              ↓                      ↓                ↓              ↓
[Input]      [OCR + Structure]      [Database]      [Streamlit UI]   [Output]
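
One way to read the diagram is as a chain of stages, each taking the document state and returning an enriched copy. The sketch below is only an illustration of that idea: the stage functions are hypothetical placeholders, the "database" is an in-memory list, and the Streamlit analysis step is left out.

```python
import csv
import io
from functools import reduce


def ocr_and_structure(doc):
    # Placeholder: a real stage would run OCR and table detection here.
    return {**doc, "table": [["pressure", "value"], ["1000", "1.18"]]}


def store(doc):
    # Placeholder for the database step: keep structured records in memory.
    header, *rows = doc["table"]
    return {**doc, "records": [dict(zip(header, row)) for row in rows]}


def export(doc):
    # Placeholder export: serialize the records as CSV text.
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(doc["records"][0]), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(doc["records"])
    return {**doc, "csv": buffer.getvalue()}


def run_pipeline(doc, stages):
    """Apply each stage in order, passing the growing document state along."""
    return reduce(lambda state, stage: stage(state), stages, doc)


result = run_pipeline({"source": "report.pdf"}, [ocr_and_structure, store, export])
```

Keeping every stage as a plain function of the document state makes it easy to test stages in isolation or to swap one out.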

Technology Stack

  • Frontend: Streamlit (multi-page application)

  • OCR Engine: DocTR (Document Text Recognition)

  • Computer Vision: YOLO models for structure detection

  • Backend: Custom Azure-based API

  • Data Processing: Pandas, NumPy

  • Testing: Pytest
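
As a rough sketch of how the OCR engine might be called, the snippet below uses DocTR's ocr_predictor and DocumentFile.from_pdf, then flattens the exported result into word tuples. The function names run_doctr and flatten_words are hypothetical, and the flattening assumes the exported dictionary nests pages, blocks, lines, and words, with each word carrying a straight bounding box in page-relative coordinates. How the application actually configures DocTR is not stated here.

```python
def run_doctr(pdf_path):
    """Run DocTR's pretrained OCR on a PDF and return the exported dictionary."""
    # Imported lazily so flatten_words can be used without DocTR installed.
    from doctr.io import DocumentFile
    from doctr.models import ocr_predictor

    model = ocr_predictor(pretrained=True)
    pages = DocumentFile.from_pdf(pdf_path)
    return model(pages).export()


def flatten_words(exported):
    """Flatten an exported result into (page_idx, text, x_center, y_center) tuples."""
    words = []
    for page in exported["pages"]:
        for block in page["blocks"]:
            for line in block["lines"]:
                for word in line["words"]:
                    # Assumed geometry layout: ((xmin, ymin), (xmax, ymax)).
                    (x0, y0), (x1, y1) = word["geometry"]
                    words.append(
                        (page["page_idx"], word["value"], (x0 + x1) / 2, (y0 + y1) / 2)
                    )
    return words


# Hand-built sample in the assumed export layout (not real OCR output).
sample = {
    "pages": [
        {
            "page_idx": 0,
            "blocks": [
                {
                    "lines": [
                        {
                            "words": [
                                {"value": "Pressure", "geometry": ((0.25, 0.0), (0.75, 0.5))},
                            ]
                        }
                    ]
                }
            ],
        }
    ]
}
flat = flatten_words(sample)
```

Word centers in relative coordinates are a convenient input for a later table reconstruction step, since they do not depend on page resolution.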

Next Steps