App Structure

The Fluidsdata Digitization and OCR application follows a modular architecture designed for scalable document processing and OCR operations. This document provides a detailed breakdown of the application’s organization and component responsibilities.

Overview

The application is structured into several key components that work together to provide a complete pipeline for digitizing and processing PVT reports. The architecture follows separation of concerns principles, with clear boundaries between UI, processing logic, data access, and utility functions.

Root Structure

fluidsdata.ocr/
├── app/                    # Main application code
├── docs/                   # Documentation (Sphinx)
├── legacy/                 # Legacy code and migration scripts
├── media/                  # Media files (images, documents)
├── requirements.txt        # Python dependencies
└── README.md               # Project overview (TODO)

Core Application Components

The main application code is organized under the app/ directory with the following structure (see API Reference for complete API documentation):

Configuration & Entry Points

app/
├── config.py              # Application configuration
└── main.py                # Main entry point for Streamlit

config.py

Contains application-wide configuration settings including:

  • API endpoints and credentials

  • OCR engine parameters

  • File processing limits

  • Cache settings

  • Azure storage configuration
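
A minimal sketch of what config.py might contain; all names and values below are illustrative assumptions, not the actual settings:

import os

# API endpoints and credentials (typically sourced from the environment)
API_BASE_URL = os.environ.get("FD_API_BASE_URL", "https://api.example.com")
API_KEY = os.environ.get("FD_API_KEY", "")

# OCR engine parameters (standard DocTR architecture identifiers)
OCR_DETECTION_MODEL = "db_resnet50"
OCR_RECOGNITION_MODEL = "crnn_vgg16_bn"

# File processing limits
MAX_UPLOAD_SIZE_MB = 200
ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".xlsx"}

# Cache settings
CACHE_TTL_SECONDS = 3600

# Azure storage configuration
AZURE_CONNECTION_STRING = os.environ.get("AZURE_STORAGE_CONNECTION_STRING", "")
AZURE_CONTAINER = "pvt-reports"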

main.py

The primary Streamlit application entry point that:

  • Initializes the app session

  • Authenticates users

  • Configures page routing

  • Sets up the UI framework

  • Manages global state
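
A simplified sketch of this flow using standard Streamlit primitives (the password gate below is a stand-in for the application's actual authentication logic):

import streamlit as st

# Set up the UI framework; files under app/pages/ provide the page routing
st.set_page_config(page_title="Fluidsdata Digitization & OCR", layout="wide")

# Initialize global state once per user session
if "authenticated" not in st.session_state:
    st.session_state.authenticated = False

# Gate the rest of the app behind authentication
if not st.session_state.authenticated:
    password = st.text_input("Password", type="password")
    if password:
        st.session_state.authenticated = True  # real credential check omitted
        st.rerun()
    st.stop()

st.title("Fluidsdata Digitization & OCR")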

API Layer

app/api/
├── pvtTest.py             # API logic for PVT report test data
└── sampleTest.py          # API logic specifically for sample test data

The API layer provides helper functions for accessing the backend API endpoints and handling data retrieval and submission.
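
A hedged sketch of the kind of helpers these modules provide, assuming a requests-based client (endpoint paths and function names are hypothetical):

import requests

API_BASE_URL = "https://api.example.com"  # placeholder; the real value lives in config.py

def get_pvt_test(report_id: str) -> dict:
    """Fetch PVT test data for a report from the backend API."""
    resp = requests.get(f"{API_BASE_URL}/pvt-tests/{report_id}", timeout=30)
    resp.raise_for_status()
    return resp.json()

def submit_sample_test(payload: dict) -> dict:
    """Submit sample test data to the backend API."""
    resp = requests.post(f"{API_BASE_URL}/sample-tests", json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()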

Core Library

Note

The core library contains the bulk of the data analysis logic, along with some session management and backend integration code. It is less modular than other parts of the application because it was initially developed as a standalone monolithic application. Future refactoring should improve its modularity and separation of concerns.

app/library/
├── fdCommon.py                 # Common utilities and helpers
├── fdConfiguration.py          # Configuration management
├── fdDataAccess.py             # Data access/backend layer
├── fdDigitizationSession.py    # Session management
├── fdExcelProcessing.py        # Excel file processing
├── fdFeatures.py               # Feature-based categorization of data (not currently used)
├── fdMapping.py                # Data mapping operations
├── fdNormalization.py          # Data normalization
├── fdTableMatching.py          # Table matching logic
├── fdTestData.py               # Test data handling
├── fdTestSpecific.py           # Test-specific logic
├── fdValidation.py             # Data validation
├── fdReport_with_win32com.py   # Report generation with win32com
└── fdReport.py                 # Report generation

Key Library Components:

  • fdDigitizationSession: Manages the complete digitization workflow, tracking file processing state and maintaining session context

  • fdDataAccess: Provides data persistence and retrieval operations, interfacing with databases and file systems

  • fdExcelProcessing: Specialized Excel file handling for PVT report data extraction and validation

  • fdMapping: Contains data mapping logic for categorizing and structuring extracted data

  • fdNormalization: Implements data normalization techniques for consistent data representation

  • fdCommon: Contains common utilities and helper functions used across the application

  • fdValidation: Implements data validation rules and quality checks

  • fdReport: Houses the Report class, which represents a PVT report within the application

  • fdTableMatching: Implements table matching logic for PVT report data

Data Models

The data models section defines the core data structures and logic used for processing documents and managing OCR results.

app/models/
├── session.py              # Session state models
├── coord_normalization.py  # Coordinate normalization utilities
├── table_extractor.py      # Table extraction utilities
├── table_reconstructor.py  # Table reconstruction utilities
├── ocr_statistics.py       # OCR statistics and metrics
├── term_dict.py            # Terminology dictionary management (deprecated)
├── term_matcher.py         # Terminology matching utilities (deprecated)
├── term_processor.py       # Terminology processing logic (combines term_dict and term_matcher)
├── ocr.py                  # OCR entry point and results
└── yolo_models/            # YOLO vision model weights

Model Responsibilities:

  • session: Defines the AppSession class for managing application state across user interactions

  • ocr: Contains the DocTR OCR engine entry point, along with logic to reformat results into a more concise format

  • table_extractor: Implements table extraction logic using YOLO models for detection and structure recognition

  • table_reconstructor: Reconstructs detected tables into structured dataframes

  • coord_normalization: Provides utilities for normalizing coordinates between different document formats and resolutions
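
To give a flavor of the normalization, a minimal sketch (the function name and conventions are assumptions, not the module's actual API):

def normalize_bbox(bbox, page_width, page_height):
    """Convert an absolute (x0, y0, x1, y1) box to resolution-independent
    coordinates in [0, 1]."""
    x0, y0, x1, y1 = bbox
    return (x0 / page_width, y0 / page_height,
            x1 / page_width, y1 / page_height)

# The same region on a 1700x2200 px scan and an 850x1100 px render
# normalizes to identical relative coordinates:
assert normalize_bbox((170, 220, 340, 440), 1700, 2200) == \
       normalize_bbox((85, 110, 170, 220), 850, 1100)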

UI Pages

Warning

Due to a conflict between Sphinx (documentation tool) and Streamlit (UI framework), the UI pages are not documented in the API reference.

The UI pages are organized under the app/pages/ directory and serve as the top level of UI organization by leveraging Streamlit’s multi-page capabilities.

app/pages/
├── Dashboard.py             # Main dashboard page (currently just a placeholder populated with the memory dashboard)
├── File_Upload.py           # File upload page for PDF and image files
├── Manage_Configuration.py  # Configuration management page
├── Process_Reports.py       # Report processing page for managing reports
└── Test_Data.py             # Test data management page (uploads to the API)

Page Descriptions:

  • Dashboard: Should eventually contain useful metrics and statistics with visualizations; currently serves as a temporary home for the memory overview.

  • File Upload: Contains all the file upload and OCR UI components.

  • Manage Configuration: Provides an interface for managing application configuration settings.

  • Process Reports: Facilitates the processing and management of generated reports.

  • Test Data: Enables the uploading and management of test data for the application.

App Services

The services layer encapsulates the core logic of the OCR pipeline and the file upload process, managing the interaction between the UI and the underlying data processing components.

app/services/
├── file_service.py         # File management operations
├── ocr_service.py          # OCR and table extraction handler
└── digital_service.py      # Digital text extraction service (not in use)

Service Layer Functions:

  • file_service: Handles file upload, validation, format conversion, and metadata management

  • ocr_service: Coordinates OCR operations using DocTR engine, manages text/table extraction workflows and returns structured results (dataframes)
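
The sketch below shows the kind of work ocr_service coordinates: running DocTR on a document and flattening the word-level output into a dataframe. The DocTR calls are the library's public API; the surrounding structure is illustrative, not the actual service code.

from doctr.io import DocumentFile
from doctr.models import ocr_predictor
import pandas as pd

# Load the document and run the pretrained two-stage OCR pipeline
pages = DocumentFile.from_pdf("pvt_report.pdf")
model = ocr_predictor(pretrained=True)
result = model(pages)

# Flatten word-level results into rows of text, confidence, and geometry
rows = []
for page_idx, page in enumerate(result.pages):
    for block in page.blocks:
        for line in block.lines:
            for word in line.words:
                rows.append({
                    "page": page_idx,
                    "text": word.value,
                    "confidence": word.confidence,
                    "bbox": word.geometry,  # relative ((x0, y0), (x1, y1))
                })

df = pd.DataFrame(rows)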

User Interface Components

app/ui/
├── batch_processing.py            # Batch processing UI components (deprecated)
├── file_selection.py              # PDF and image file selection and upload UI
├── file_selection_excel.py        # Excel file selection and upload UI
├── process_file.py                # OCR processing and results (deprecated)
└── manual_selection.py            # Manual table extraction UI (deprecated)

UI Components:

  • file_selection: Provides file upload interface for PDF and image files, including validation and format checks

  • file_selection_excel: Specialized interface for Excel file uploads, handling specific validation and processing

Utility Functions

app/utils/
├── bbox_utils.py          # Bounding box utilities
├── file_utils.py          # File processing utilities
├── image_utils.py         # Image manipulation functions
├── ocr_utils.py           # OCR-related utilities
├── digital_utils.py       # Digital text extraction utilities (deprecated)
└── dictionary_utils.py    # Dictionary utilities for term matching (deprecated)

Utility Categories:

  • file_utils: File format detection, conversion, and metadata extraction

  • image_utils: Image preprocessing, enhancement, and format conversion for OCR optimization

  • ocr_utils: OCR-specific utilities

  • bbox_utils: Bounding box calculations and utilities for table detection and visualization
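
For example, intersection-over-union is a typical bounding-box calculation used when matching detected tables against OCR words (a generic sketch, not necessarily the exact bbox_utils implementation):

def iou(box_a, box_b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    x0 = max(box_a[0], box_b[0])
    y0 = max(box_a[1], box_b[1])
    x1 = min(box_a[2], box_b[2])
    y1 = min(box_a[3], box_b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0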

Testing Infrastructure

So far, unit tests have been limited to fdCommon, which contains utility and generic functions. Future work should add more comprehensive unit tests for the core library and services to ensure reliability and maintainability.

app/tests/
└── test_fdCommon.py      # Unit tests for common utilities

UX Library

The UX library provides user interface components and utilities for the data analysis portion of the application.

Note

Also a remnant of the original monolithic application, the UX library is not as modular as it could be; it should eventually be refactored and merged with the other UI components.

app/ux_library/
├── fdAuthorization.py        # User authentication and authorization UI components
├── fdEditTable.py            # Table editing UI components
├── fdExcelFileImport.py      # Excel file import UI components
├── fdManageSamples.py        # Sample management UI components
├── fdMapColumns.py           # Column mapping UI components
├── fdMapHeader.py            # Header mapping UI components
├── fdMappingProcess.py       # Mapping process UI components
├── fdNavigation.py           # Navigation UI components
├── fdReportFilters.py        # Report filtering UI components
├── fdReports.py              # Report generation UI components
├── fdReportTables.py         # Report table UI components
├── fdReviewData.py           # Review data UI components
├── fdSavedTable.py           # Saved table UI components
├── fdTestData.py             # Test data UI components (replaced with the Test Data page)
├── fdTestFiles.py            # Test files UI components
├── fdUIFunctions.py          # General UI helper functions
└── fdValidationUX.py         # Validation UX components

Upload/Processing Data Flow

The upload process (including OCR and table extraction) follows this high-level data flow:

┌─────── File Service ───────┐   ┌──── OCR Service ────┐
File Upload → Image Processing → OCR → Table Extraction

Detailed Flow:

  1. Input Processing: Files uploaded through Streamlit interface, validated by file_service

  2. Image Preparation: Images extracted and preprocessed using image_utils

  3. OCR Extraction: Text extraction performed by ocr_service using DocTR

  4. Structure Detection: YOLO models identify tables and document structure via table_extractor

  5. Data Reconstruction: table_reconstructor rebuilds tabular data from OCR results

Technology Integration

The application integrates several key technologies:

OCR Engine: DocTR (Document Text Recognition) [repo] [docs]
  • High-accuracy text extraction

  • Multi-language support

  • Confidence scoring for quality assessment

  • Two-stage processing (detection and recognition) with configurable parameters and models for each step
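
The two stages are configured independently when constructing the predictor; the identifiers below are DocTR's default architectures, and the application's actual selections may differ:

from doctr.models import ocr_predictor

model = ocr_predictor(
    det_arch="db_resnet50",     # text detection stage
    reco_arch="crnn_vgg16_bn",  # text recognition stage
    pretrained=True,
)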

Computer Vision: YOLO (You Only Look Once) [docs] [structure model] [detection model]
  • Real-time object detection for document structure

  • Table boundary identification

  • YOLOv10 for table detection and YOLOv8 for table structure recognition
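
With the Ultralytics API, loading a weights file and detecting table boundaries looks roughly like this (the weight filename is a placeholder for the files under app/models/yolo_models/):

from ultralytics import YOLO

detector = YOLO("app/models/yolo_models/table_detection.pt")  # placeholder path

# Detect table boundaries on a rendered page image
results = detector("page_1.png")
for box in results[0].boxes:
    print(box.xyxy, box.conf)  # pixel-space corners and confidence score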

Warning

These specifically trained models are open source and free to use internally and in other open-source projects. To use them in production code, either a license must be obtained (https://www.ultralytics.com/yolo) or a custom model must be trained from scratch.

Frontend Framework: Streamlit [docs]
  • Rapid prototyping capabilities

  • Interactive data visualization

  • Multi-page application support

Note

In the future, the dependency on Streamlit should be reduced or removed, either by wrapping the backend logic in an API or by reworking the entire UI as a stateful web application, as we are reaching the ceiling of what Streamlit can do.

Cloud Integration: Azure Blob Storage [docs]
  • Scalable file storage

  • Secure data handling

  • API-based access patterns
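
A minimal upload sketch using the azure-storage-blob SDK (container and blob names are illustrative):

import os
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)
container = service.get_container_client("pvt-reports")  # illustrative name

with open("pvt_report.pdf", "rb") as fh:
    container.upload_blob(name="reports/pvt_report.pdf", data=fh, overwrite=True)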

Data Processing: Pandas & NumPy [pandas docs] [numpy docs]
  • Efficient data manipulation

  • Statistical analysis capabilities

  • Export format flexibility

Performance Considerations

Image Caching:

Caching images in memory is one of the application's largest memory bottlenecks. In the session file, the LazyImageManager class handles image caching and ensures proper loading and cleanup. Similar logic could be extended to other memory-intensive components of the app to improve performance and reduce memory usage.
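
A stripped-down illustration of the lazy-loading idea (the real LazyImageManager is more involved; this sketch assumes Pillow):

from PIL import Image

class LazyImage:
    """Load an image from disk only on first access, and release it on demand."""

    def __init__(self, path: str):
        self.path = path
        self._image = None

    @property
    def image(self) -> Image.Image:
        if self._image is None:  # defer the expensive load until needed
            self._image = Image.open(self.path)
        return self._image

    def release(self) -> None:
        if self._image is not None:  # free memory when no longer needed
            self._image.close()
            self._image = None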

Batch Processing:

Batch processing was initially implemented but is currently deprecated. Future improvements could reintroduce this feature to improve throughput, particularly when accelerated hardware is available (see Installation for more details on hardware acceleration).

Security & Data Handling

Data Security
  • Secure file upload validation

  • Temporary file cleanup procedures

  • Access control for sensitive operations

Privacy Protection
  • Local processing options

  • Configurable data retention policies

  • Audit trails for data operations