App Structure

The Fluidsdata Digitization and OCR application follows a modular architecture designed for scalable document processing and OCR operations. This document provides a detailed breakdown of the application’s organization and component responsibilities.

Overview

The application is structured into several key components that work together to provide a complete pipeline for digitizing and processing PVT reports. The architecture follows separation of concerns principles, with clear boundaries between UI, processing logic, data access, and utility functions.

Root Structure

fluidsdata.ocr/
├── app/                    # Main application code
├── docs/                   # Documentation (Sphinx)
├── legacy/                 # Legacy code and migration scripts
├── media/                  # Media files (images, documents)
├── requirements.txt        # Python dependencies
└── README.md               # Project overview (TODO)

Core Application Components

The main application code is organized under the app/ directory with the following structure (see API Reference for complete API documentation):

Configuration & Entry Points

app/
├── config.py              # Application configuration
└── main.py                # Main entry point for Streamlit

config.py

Contains application-wide configuration settings including:

  • API endpoints and credentials

  • OCR engine parameters

  • File processing limits

  • Cache settings

  • Azure storage configuration
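
A minimal sketch of what config.py might contain; all names and values below are illustrative assumptions, not the actual settings:

import os

# API endpoints and credentials (typically sourced from the environment)
API_BASE_URL = os.environ.get("FD_API_BASE_URL", "https://api.example.com")
API_KEY = os.environ.get("FD_API_KEY", "")

# OCR engine parameters (standard DocTR architecture identifiers)
OCR_DETECTION_MODEL = "db_resnet50"
OCR_RECOGNITION_MODEL = "crnn_vgg16_bn"

# File processing limits
MAX_UPLOAD_SIZE_MB = 200
ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".xlsx"}

# Cache settings
CACHE_TTL_SECONDS = 3600

# Azure storage configuration
AZURE_CONNECTION_STRING = os.environ.get("AZURE_STORAGE_CONNECTION_STRING", "")
AZURE_CONTAINER = "pvt-reports"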

main.py

The primary Streamlit application entry point that:

  • Initializes the app session

  • Authenticates users

  • Configures page routing

  • Sets up the UI framework

  • Manages global state
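
A simplified sketch of this flow using standard Streamlit primitives (the password gate below is a stand-in for the application's actual authentication logic):

import streamlit as st

# Set up the UI framework; files under app/pages/ provide the page routing
st.set_page_config(page_title="Fluidsdata Digitization & OCR", layout="wide")

# Initialize global state once per user session
if "authenticated" not in st.session_state:
    st.session_state.authenticated = False

# Gate the rest of the app behind authentication
if not st.session_state.authenticated:
    password = st.text_input("Password", type="password")
    if password:
        st.session_state.authenticated = True  # real credential check omitted
        st.rerun()
    st.stop()

st.title("Fluidsdata Digitization & OCR")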

API Layer

app/api/
├── pvtTest.py             # API logic for PVT report test data
└── sampleTest.py          # API logic specifically for sample test data

The API layer provides helper functions for accessing the backend API endpoints and handling data retrieval and submission.
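
A hedged sketch of the kind of helpers these modules provide, assuming a requests-based client (endpoint paths and function names are hypothetical):

import requests

API_BASE_URL = "https://api.example.com"  # placeholder; the real value lives in config.py

def get_pvt_test(report_id: str) -> dict:
    """Fetch PVT test data for a report from the backend API."""
    resp = requests.get(f"{API_BASE_URL}/pvt-tests/{report_id}", timeout=30)
    resp.raise_for_status()
    return resp.json()

def submit_sample_test(payload: dict) -> dict:
    """Submit sample test data to the backend API."""
    resp = requests.post(f"{API_BASE_URL}/sample-tests", json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()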

Core Library

Note

The core library contains the bulk of the data analysis logic, along with some session management and backend integration code. It is less modular than other parts of the application because it was initially developed as a standalone monolithic application. Future refactoring should improve its modularity and separation of concerns.

app/library/
├── fdCommon.py                 # Common utilities and helpers
├── fdConfiguration.py          # Configuration management
├── fdDataAccess.py             # Data access/backend layer
├── fdDigitizationSession.py    # Session management
├── fdExcelProcessing.py        # Excel file processing
├── fdFeatures.py               # Feature-based categorization of data (not currently used)
├── fdMapping.py                # Data mapping operations
├── fdNormalization.py          # Data normalization
├── fdTableMatching.py          # Table matching logic
├── fdTestData.py               # Test data handling
├── fdTestSpecific.py           # Test-specific logic
├── fdValidation.py             # Data validation
├── fdReport_with_win32com.py   # Report generation with win32com
└── fdReport.py                 # Report generation

Key Library Components:

  • fdDigitizationSession: Manages the complete digitization workflow, tracking file processing state and maintaining session context

  • fdDataAccess: Provides data persistence and retrieval operations, interfacing with databases and file systems

  • fdExcelProcessing: Specialized Excel file handling for PVT report data extraction and validation

  • fdMapping: Contains data mapping logic for categorizing and structuring extracted data

  • fdNormalization: Implements data normalization techniques for consistent data representation

  • fdCommon: Contains common utilities and helper functions used across the application

  • fdValidation: Implements data validation rules and quality checks

  • fdReport: Houses the Report class, which represents a PVT report within the application

  • fdTableMatching: Implements table matching logic for PVT report data

Data Models

The data models section defines the core data structures and logic used for processing documents and managing OCR results.

app/models/
├── session.py              # Session state models
├── coord_normalization.py  # Coordinate normalization utilities
├── table_extractor.py      # Table extraction utilities
├── table_reconstructor.py  # Table reconstruction utilities
├── ocr_statistics.py       # OCR statistics and metrics
├── term_dict.py            # Terminology dictionary management (deprecated)
├── term_matcher.py         # Terminology matching utilities (deprecated)
├── term_processor.py       # Terminology processing logic (combines term_dict and term_matcher)
├── ocr.py                  # OCR entry point and results
└── yolo_models/            # YOLO vision model weights

Model Responsibilities:

  • session: Defines the AppSession class for managing application state across user interactions

  • ocr: Contains the DocTR OCR engine entry point, along with logic to reformat results into a more concise format

  • table_extractor: Implements table extraction logic using YOLO models for detection and structure recognition

  • table_reconstructor: Reconstructs detected tables into structured dataframes

  • coord_normalization: Provides utilities for normalizing coordinates between different document formats and resolutions
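
To give a flavor of the normalization, a minimal sketch (the function name and conventions are assumptions, not the module's actual API):

def normalize_bbox(bbox, page_width, page_height):
    """Convert an absolute (x0, y0, x1, y1) box to resolution-independent
    coordinates in [0, 1]."""
    x0, y0, x1, y1 = bbox
    return (x0 / page_width, y0 / page_height,
            x1 / page_width, y1 / page_height)

# The same region on a 1700x2200 px scan and an 850x1100 px render
# normalizes to identical relative coordinates:
assert normalize_bbox((170, 220, 340, 440), 1700, 2200) == \
       normalize_bbox((85, 110, 170, 220), 850, 1100)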

UI Pages

Warning

Due to a conflict between Sphinx (documentation tool) and Streamlit (UI framework), the UI pages are not documented in the API reference.

The UI pages are organized under the app/pages/ directory and serve as the top level of UI organization by leveraging Streamlit’s multi-page capabilities.

app/pages/
├── Dashboard.py             # Main dashboard page (currently just a placeholder populated with the memory dashboard)
├── File_Upload.py           # File upload page for PDF and image files
├── Manage_Configuration.py  # Configuration management page
├── Process_Reports.py       # Report processing page for managing reports
└── Test_Data.py             # Test data management page (uploads to the API)

Page Descriptions:

  • Dashboard: Should eventually contain useful metrics and statistics with visualizations; currently serves as a temporary home for the memory overview.

  • File Upload: Contains all the file upload and OCR UI components.

  • Manage Configuration: Provides an interface for managing application configuration settings.

  • Process Reports: Facilitates the processing and management of generated reports.

  • Test Data: Enables the uploading and management of test data for the application.

App Services

The services layer encapsulates the core logic of the OCR pipeline and the file upload process, managing the interaction between the UI and the underlying data processing components.

app/services/
├── file_service.py         # File management operations
├── ocr_service.py          # OCR and table extraction handler
└── digital_service.py      # Digital text extraction service (not in use)

Service Layer Functions:

  • file_service: Handles file upload, validation, format conversion, and metadata management

  • ocr_service: Coordinates OCR operations using DocTR engine, manages text/table extraction workflows and returns structured results (dataframes)
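
The sketch below shows the kind of work ocr_service coordinates: running DocTR on a document and flattening the word-level output into a dataframe. The DocTR calls are the library's public API; the surrounding structure is illustrative, not the actual service code.

from doctr.io import DocumentFile
from doctr.models import ocr_predictor
import pandas as pd

# Load the document and run the pretrained two-stage OCR pipeline
pages = DocumentFile.from_pdf("pvt_report.pdf")
model = ocr_predictor(pretrained=True)
result = model(pages)

# Flatten word-level results into rows of text, confidence, and geometry
rows = []
for page_idx, page in enumerate(result.pages):
    for block in page.blocks:
        for line in block.lines:
            for word in line.words:
                rows.append({
                    "page": page_idx,
                    "text": word.value,
                    "confidence": word.confidence,
                    "bbox": word.geometry,  # relative ((x0, y0), (x1, y1))
                })

df = pd.DataFrame(rows)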

User Interface Components

app/ui/
├── batch_processing.py            # Batch processing UI components (deprecated)
├── file_selection.py              # PDF and image file selection and upload UI
├── file_selection_excel.py        # Excel file selection and upload UI
├── process_file.py                # OCR processing and results (deprecated)
└── manual_selection.py            # Manual table extraction UI (deprecated)

UI Components:

  • file_selection: Provides file upload interface for PDF and image files, including validation and format checks

  • file_selection_excel: Specialized interface for Excel file uploads, handling specific validation and processing

Utility Functions

app/utils/
├── bbox_utils.py          # Bounding box utilities
├── file_utils.py          # File processing utilities
├── image_utils.py         # Image manipulation functions
├── ocr_utils.py           # OCR-related utilities
├── digital_utils.py       # Digital text extraction utilities (deprecated)
└── dictionary_utils.py    # Dictionary utilities for term matching (deprecated)

Utility Categories:

  • file_utils: File format detection, conversion, and metadata extraction

  • image_utils: Image preprocessing, enhancement, and format conversion for OCR optimization

  • ocr_utils: OCR-specific utilities

  • bbox_utils: Bounding box calculations and utilities for table detection and visualization
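
For example, intersection-over-union is a typical bounding-box calculation used when matching detected tables against OCR words (a generic sketch, not necessarily the exact bbox_utils implementation):

def iou(box_a, box_b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    x0 = max(box_a[0], box_b[0])
    y0 = max(box_a[1], box_b[1])
    x1 = min(box_a[2], box_b[2])
    y1 = min(box_a[3], box_b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0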

Testing Infrastructure

So far, unit tests have been limited to fdCommon, which contains utility and generic functions. Future work should add more comprehensive unit tests for the core library and services to ensure reliability and maintainability.

app/tests/
└── test_fdCommon.py      # Unit tests for common utilities

UX Library

The UX library provides user interface components and utilities for the data analysis portion of the application.

Note

Also a remnant of the original monolithic application, the UX library is not as modular as it could be; it should eventually be refactored and merged with the other UI components.

app/ux_library/
├── fdAuthorization.py        # User authentication and authorization UI components
├── fdEditTable.py            # Table editing UI components
├── fdExcelFileImport.py      # Excel file import UI components
├── fdManageSamples.py        # Sample management UI components
├── fdMapColumns.py           # Column mapping UI components
├── fdMapHeader.py            # Header mapping UI components
├── fdMappingProcess.py       # Mapping process UI components
├── fdNavigation.py           # Navigation UI components
├── fdReportFilters.py        # Report filtering UI components
├── fdReports.py              # Report generation UI components
├── fdReportTables.py         # Report table UI components
├── fdReviewData.py           # Review data UI components
├── fdSavedTable.py           # Saved table UI components
├── fdTestData.py             # Test data UI components (replaced with the Test Data page)
├── fdTestFiles.py            # Test files UI components
├── fdUIFunctions.py          # General UI helper functions
└── fdValidationUX.py         # Validation UX components

Upload/Processing Data Flow

The upload process (including OCR and table extraction) follows this high-level data flow:

┌─────── File Service ───────┐   ┌──── OCR Service ────┐
File Upload → Image Processing → OCR → Table Extraction

Detailed Flow:

  1. Input Processing: Files uploaded through Streamlit interface, validated by file_service

  2. Image Preparation: Images extracted and preprocessed using image_utils

  3. OCR Extraction: Text extraction performed by ocr_service using DocTR

  4. Structure Detection: YOLO models identify tables and document structure via table_extractor

  5. Data Reconstruction: table_reconstructor rebuilds tabular data from OCR results

Technology Integration

The application integrates several key technologies:

OCR Engine: DocTR (Document Text Recognition) [repo] [docs]
  • High-accuracy text extraction

  • Multi-language support

  • Confidence scoring for quality assessment

  • Two-stage processing (detection and recognition) with configurable parameters and models for each step
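
The two stages are configured independently when constructing the predictor; the identifiers below are DocTR's default architectures, and the application's actual selections may differ:

from doctr.models import ocr_predictor

model = ocr_predictor(
    det_arch="db_resnet50",     # text detection stage
    reco_arch="crnn_vgg16_bn",  # text recognition stage
    pretrained=True,
)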

Computer Vision: YOLO (You Only Look Once) [docs] [structure model] [detection model]
  • Real-time object detection for document structure

  • Table boundary identification

  • YOLOv10 for table detection and YOLOv8 for table structure recognition
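
With the Ultralytics API, loading a weights file and detecting table boundaries looks roughly like this (the weight filename is a placeholder for the files under app/models/yolo_models/):

from ultralytics import YOLO

detector = YOLO("app/models/yolo_models/table_detection.pt")  # placeholder path

# Detect table boundaries on a rendered page image
results = detector("page_1.png")
for box in results[0].boxes:
    print(box.xyxy, box.conf)  # pixel-space corners and confidence score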

Warning

These specifically trained models are open source and free to use internally and in other open-source projects. To use them in production code, either a license must be obtained (https://www.ultralytics.com/yolo) or a custom model must be trained from scratch.

Frontend Framework: Streamlit [docs]
  • Rapid prototyping capabilities

  • Interactive data visualization

  • Multi-page application support

Note

In the future, the dependency on Streamlit should be reduced or removed, either by wrapping the backend logic in an API or by reworking the entire UI as a stateful web application, as we are reaching the ceiling of what Streamlit can do.

Cloud Integration: Azure Blob Storage [docs]
  • Scalable file storage

  • Secure data handling

  • API-based access patterns
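
A minimal upload sketch using the azure-storage-blob SDK (container and blob names are illustrative):

import os
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)
container = service.get_container_client("pvt-reports")  # illustrative name

with open("pvt_report.pdf", "rb") as fh:
    container.upload_blob(name="reports/pvt_report.pdf", data=fh, overwrite=True)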

Data Processing: Pandas & NumPy [pandas docs] [numpy docs]
  • Efficient data manipulation

  • Statistical analysis capabilities

  • Export format flexibility

Performance Considerations

Image Caching:

Caching images in memory is one of the application's largest memory bottlenecks. In the session file, the LazyImageManager class handles image caching and ensures proper loading and cleanup. Similar logic could be extended to other memory-intensive components of the app to improve performance and reduce memory usage.
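
A stripped-down illustration of the lazy-loading idea (the real LazyImageManager is more involved; this sketch assumes Pillow):

from PIL import Image

class LazyImage:
    """Load an image from disk only on first access, and release it on demand."""

    def __init__(self, path: str):
        self.path = path
        self._image = None

    @property
    def image(self) -> Image.Image:
        if self._image is None:  # defer the expensive load until needed
            self._image = Image.open(self.path)
        return self._image

    def release(self) -> None:
        if self._image is not None:  # free memory when no longer needed
            self._image.close()
            self._image = None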

Batch Processing:

Batch processing was initially implemented but is currently deprecated. Future improvements could reintroduce this feature to improve throughput, particularly when accelerated hardware is available (see Installation for more details on hardware acceleration).

Security & Data Handling

Data Security
  • Secure file upload validation

  • Temporary file cleanup procedures

  • Access control for sensitive operations

Privacy Protection
  • Local processing options

  • Configurable data retention policies

  • Audit trails for data operations