File Handling

This section provides an overview of the file handling system in the Fluidsdata Digitization and OCR application, covering the architecture and workflows for file upload, storage, processing, and management.

System Architecture

The file handling system is designed around a three-tier architecture that provides scalable file management capabilities:

Core Components

  • FileService (file_service): Centralized file operations including upload, download, synchronization, and deletion

  • LazyImageManager (session): Intelligent image caching system with memory-efficient processing

  • FileProcessor (file_selection): UI workflow orchestration with real-time progress tracking

Supported File Formats

The system handles multiple document and image formats:

  • PDF Documents: Multi-page documents with automatic page extraction and metadata analysis

  • Image Files: PNG, JPG, and JPEG formats with quality preservation

  • Excel Files: XLSX format with support for table extraction and data analysis

Integration Points

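The file handling system serves as the foundation layer for the OCR and table reconstruction components, which consume its prepared images and shared session state.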

Note

All file operations are designed to work both locally and with Azure Blob Storage, providing automatic synchronization between local session state and cloud resources.

File Upload and Processing Workflow

Upload Process Overview

The file upload workflow follows a structured process designed for reliability and performance:

  1. Client Upload: Users upload files through the Streamlit interface with real-time validation

  2. Local Processing: Files are immediately processed and stored in the session’s LazyImageManager

  3. Metadata Extraction: Automatic extraction of page count, dimensions, and file type information

  4. Cloud Synchronization: Files are uploaded to Azure Blob Storage for persistence and sharing

  5. Session Integration: File references and processing containers are initialized for downstream operations

For detailed implementation, see the upload_local_files method in file_service.
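The five steps above can be sketched as a single function. This is a minimal illustration under stated assumptions, not the upload_local_files implementation: SketchSession, upload_file, and the upload_to_cloud callable are hypothetical names, and only the attribute names raw_files and file_metadata are taken from the session state described in this document.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Extensions accepted by this sketch, mirroring the formats listed in this document.
ALLOWED_EXTENSIONS = (".pdf", ".png", ".jpg", ".jpeg")


@dataclass
class SketchSession:
    """Hypothetical stand-in for the application's session object."""
    raw_files: Dict[str, bytes] = field(default_factory=dict)
    file_metadata: Dict[str, dict] = field(default_factory=dict)


def upload_file(session: SketchSession, name: str, data: bytes,
                upload_to_cloud: Callable[[str, bytes], None]) -> str:
    # 1. Client upload: validate before anything is stored.
    if not name.lower().endswith(ALLOWED_EXTENSIONS):
        raise ValueError(f"Unsupported file type: {name}")
    # 2. Local processing: keep the raw bytes in the session.
    session.raw_files[name] = data
    # 3. Metadata extraction (size only; page count and dimensions need a real parser).
    session.file_metadata[name] = {"size_bytes": len(data)}
    # 4. Cloud synchronization through the injected callable (assumed interface).
    upload_to_cloud(name, data)
    # 5. Session integration: return the reference that downstream steps would use.
    return name
```

Passing the cloud upload in as a callable keeps the sketch runnable without Azure credentials; in local mode the callable could simply do nothing.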

Preprocessing Pipeline

Before OCR or table detection, files undergo preprocessing:

  • Image Extraction: PDF pages are converted to high-quality images using PyMuPDF

  • Quality Enhancement: Automatic deskewing, upscaling, and sharpening for optimal OCR results

  • Caching Strategy: Processed images are cached using LRU eviction for performance

  • Memory Management: Intelligent cache sizing prevents memory overflow during batch operations

The preprocessing is handled by the prepare_images method, with progress callbacks for UI feedback.
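A progress callback of this kind can be sketched as follows. The two-argument (status, message) signature matches the image preparation callback shown later in this document; the prepare_pages function and the status strings "running" and "completed" are assumptions made for illustration only.

```python
from typing import Callable, List, Optional

# (status, message) callback, as used for image preparation in this document.
ProgressCallback = Callable[[str, str], None]


def prepare_pages(page_count: int,
                  progress_callback: Optional[ProgressCallback] = None) -> List[int]:
    """Pretend to preprocess each page, reporting progress after every page."""
    prepared = []
    for page_num in range(page_count):
        # Real preprocessing (deskewing, upscaling, sharpening) would happen here.
        prepared.append(page_num)
        if progress_callback is not None:
            progress_callback("running", f"Prepared page {page_num + 1} of {page_count}")
    if progress_callback is not None:
        progress_callback("completed", f"{page_count} pages prepared")
    return prepared
```

Keeping the callback optional lets the same function run in batch jobs or tests that have no UI to update.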

Azure Blob Storage Integration

Cloud Storage Architecture

The system integrates with Azure Blob Storage through three groups of operations: upload, synchronization, and file management.

Upload Operations

Files are uploaded to designated Azure containers with automatic error handling and retry logic. The upload_to_azure method handles batch uploads.
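The retry behaviour is not specified further here, so the following is only a sketch of one common approach, exponential backoff. The with_retries helper, its parameters, and the delay values are hypothetical and are not taken from upload_to_azure.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(operation: Callable[[], T], attempts: int = 3,
                 base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation(), retrying failed calls with exponentially growing delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))  # base_delay, then 2x, 4x, ...
    raise AssertionError("unreachable")
```

A batch upload could wrap each file's upload call in with_retries so that one transient network failure does not abort the whole batch. The sleep parameter is injectable so the helper can be exercised without real waiting.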

Download and Synchronization

The sync_files_from_azure method ensures local session state matches cloud storage, automatically downloading file metadata and preparing thumbnails for files stored in Azure.
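Conceptually, synchronization starts by comparing the file names known locally with those present in cloud storage. The sketch below shows only that comparison; plan_sync and its result keys are hypothetical, and what sync_files_from_azure does with files that exist only locally is not stated in this document.

```python
from typing import Dict, Iterable, List


def plan_sync(local_names: Iterable[str],
              remote_names: Iterable[str]) -> Dict[str, List[str]]:
    """Compare local and remote file names and report the differences."""
    local, remote = set(local_names), set(remote_names)
    return {
        # Present in cloud storage but missing from the local session.
        "to_download": sorted(remote - local),
        # Present locally but not in cloud storage; handling is application-specific.
        "only_local": sorted(local - remote),
    }
```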

File Management

The delete_file method provides comprehensive cleanup of all file-related data structures, including Azure storage references and local session state.

For complete Azure integration details, see Azure Backend. For the source code, see file_service and fdDataAccess.

Hybrid Operation Modes

The system supports two deployment configurations:

  • Local Mode: Files stored and processed entirely on local filesystem

  • Cloud Mode: Full Azure Blob Storage integration with local caching

Note

Cloud mode is the default for deployment and development.

LazyImageManager

Intelligent Caching System

The LazyImageManager (session) implements a sophisticated multi-tier caching strategy:

Cache Hierarchy

  • Raw File Storage: Original uploaded file data maintained in memory

  • Raw Image Cache: Extracted images before preprocessing (LRU managed)

  • Processed Image Cache: Enhanced images after deskewing and upscaling (LRU managed)

  • PDF Document Cache: Open PyMuPDF documents for efficient multi-page access

Memory Optimization

  • Lazy Loading: Images loaded only when requested, minimizing memory footprint

  • LRU Eviction: Least Recently Used cache eviction prevents memory overflow

  • Resource Management: Automatic closure of PDF documents when cache limits exceeded or the file is no longer needed

  • Thread Safety: Concurrent access protection for multi-threaded operations
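The combination of LRU eviction, resource cleanup, and thread safety listed above can be sketched with the standard library. This is an illustrative sketch, not the LazyImageManager code: the LRUCache class, its max_items parameter, and the on_evict hook are assumptions.

```python
import threading
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class LRUCache:
    """Small thread-safe LRU cache with an optional eviction hook."""

    def __init__(self, max_items: int,
                 on_evict: Optional[Callable[[Hashable, Any], None]] = None):
        if max_items < 1:
            raise ValueError("max_items must be at least 1")
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()
        self._max_items = max_items
        self._on_evict = on_evict
        # Re-entrant lock, so an eviction hook may safely call back into the cache.
        self._lock = threading.RLock()

    def get(self, key: Hashable, default: Any = None) -> Any:
        with self._lock:
            if key not in self._items:
                return default
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]

    def put(self, key: Hashable, value: Any) -> None:
        with self._lock:
            self._items[key] = value
            self._items.move_to_end(key)
            # Evict least recently used entries until the cache fits its limit.
            while len(self._items) > self._max_items:
                old_key, old_value = self._items.popitem(last=False)
                if self._on_evict is not None:
                    self._on_evict(old_key, old_value)
```

Keys such as (file_path, page_num) would give one entry per page. An on_evict hook is one way to release resources on eviction, for example by closing an open PDF document.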

Image Processing Pipeline

The manager handles format-specific processing:

  • PDF Processing: Converts PDF pages to images

  • Image Processing: Direct loading with format validation and error handling

  • Batch Operations: Optimized processing of multiple pages with progress tracking

  • Quality Enhancement: Automatic image preprocessing for optimal OCR results

See the LazyImageManager class documentation in session for implementation details.

User Interface Integration

FileProcessor Workflow Management

The FileProcessor class (file_selection) handles file processing workflows:

Progress Tracking

File processing reports progress in real time, with step-by-step status reporting and error handling. Users receive immediate feedback on upload status, preprocessing progress, and processing completion.

Batch Processing

Multiple files can be processed simultaneously, each with individual progress tracking and error isolation. A failed file does not affect the processing of other files in the batch.
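Error isolation of this kind comes down to catching exceptions per file rather than per batch. The sketch below processes files one after another for simplicity; process_batch and the process_one callable are hypothetical names, not part of FileProcessor.

```python
from typing import Callable, Dict, Iterable, Tuple


def process_batch(file_paths: Iterable[str],
                  process_one: Callable[[str], object]
                  ) -> Tuple[Dict[str, object], Dict[str, str]]:
    """Process every file, collecting results and errors separately."""
    results: Dict[str, object] = {}
    errors: Dict[str, str] = {}
    for path in file_paths:
        try:
            results[path] = process_one(path)
        except Exception as exc:
            # Record the failure and continue with the remaining files.
            errors[path] = f"{type(exc).__name__}: {exc}"
    return results, errors
```

Returning the errors alongside the results lets the UI show a per-file status instead of a single pass or fail for the whole batch.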

Error Recovery

Processing errors are captured and reported for the step in which they occur, so a failure does not affect system stability.

Status Management

The system uses structured data classes (ProcessStep, ProcessedFile) to track processing state and provide detailed status information to users.
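The class names ProcessStep and ProcessedFile come from the application, but their fields are not documented here. The sketch below shows one plausible shape; every field, method, and status value in it is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ProcessStep:
    name: str
    status: str = "pending"  # assumed values: pending, running, completed, error
    message: str = ""


@dataclass
class ProcessedFile:
    file_path: str
    steps: Dict[str, ProcessStep] = field(default_factory=dict)

    def update(self, step_name: str, status: str, message: str = "") -> None:
        """Create the step on first use, then record its latest status."""
        step = self.steps.setdefault(step_name, ProcessStep(step_name))
        step.status, step.message = status, message

    @property
    def failed(self) -> bool:
        """True if any step of this file ended in an error."""
        return any(step.status == "error" for step in self.steps.values())
```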

Integration with Processing Components

OCR Workflow Integration

The file handling system provides the foundation for OCR operations:

  • Image Preparation: Automatic preprocessing for optimal OCR accuracy

  • Format Optimization: Conversion of documents to formats suitable for OCR processing

  • Progress Coordination: Synchronized progress reporting between file handling and OCR operations

  • Results Management: Coordination of OCR results storage and retrieval

See OCR Process for detailed OCR integration workflows.

Table Detection Coordination

The file handling system also integrates with table detection and reconstruction:

  • Image Provisioning: Automatic provision of processed images to table detection models

  • Coordinate Management: Handling of table bounding boxes and coordinate transformations

  • Results Integration: Coordination between table detection results and file-specific storage

  • Manual Override Support: Support for manual table selection and refinement workflows

See Table Reconstruction for detailed table processing integration.
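One coordinate transformation that typically arises is mapping a bounding box detected on a processed (for example, upscaled) image back onto the original page image. The sketch below covers pure scaling only and assumes an (x0, y0, x1, y1) pixel layout; both function names are hypothetical, and rotation introduced by deskewing would need an additional step.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels; assumed layout


def scale_bbox(bbox: BBox, scale_x: float, scale_y: float) -> BBox:
    """Scale a bounding box independently along each axis."""
    x0, y0, x1, y1 = bbox
    return (x0 * scale_x, y0 * scale_y, x1 * scale_x, y1 * scale_y)


def to_original_coords(bbox: BBox, processed_size: Tuple[int, int],
                       original_size: Tuple[int, int]) -> BBox:
    """Map a box from processed-image pixels to original-image pixels.

    Sizes are (width, height) tuples.
    """
    (proc_w, proc_h), (orig_w, orig_h) = processed_size, original_size
    return scale_bbox(bbox, orig_w / proc_w, orig_h / proc_h)
```

For example, a box at (100, 200, 300, 400) on a 2000 x 3000 processed image maps to (50.0, 100.0, 150.0, 200.0) on a 1000 x 1500 original.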

Shared Session Coordination

All file handling operations coordinate through shared session state:

  • OCR Results: Page-specific OCR results stored with standardized naming conventions

  • Table Data: Table bounding boxes and reconstruction results linked to source pages

  • Processing Status: Coordinated status tracking across all processing components

  • Batch Operations: Unified batch processing coordination across all analysis types

Step-by-Step Processing

def _process_step(self, file_path: str, step: str, detect_tables: bool = False):
    """Execute a single processing step with progress tracking."""
    try:
        if step == 'image_preparation':
            # Forward preparation progress to the UI containers for this step.
            def image_preparation_progress_callback(status: str, message: str):
                self.update_step_status(self._current_containers, 'image_preparation',
                                      status, message)

            FileService.prepare_images(self.session, file_path,
                                     progress_callback=image_preparation_progress_callback)
            result = {'pages_prepared': len(self.session.active_processing[file_path])}

        elif step == 'ocr':
            # OCR reports an extra substep, which is prefixed to the message.
            def ocr_progress_callback(status: str, message: str, substep: str):
                self.update_step_status(self._current_containers, 'ocr', status,
                                      f"{substep}: {message}")

            OCRService.process_ocr_extraction(self.session, self.dig, self.term_processor,
                                            file_path, progress_callback=ocr_progress_callback,
                                            detect_tables=detect_tables)
            result = {
                'pages_processed': self.session.ocr_stats[file_path].total_pages_processed,
                'words_extracted': self.session.ocr_stats[file_path].total_words_extracted,
                'tables_detected': self.session.ocr_stats[file_path].total_tables_detected
            }

        else:
            # Guard added for this excerpt: without it, result would be undefined.
            raise ValueError(f"Unknown processing step: {step}")

        return result

    except Exception:
        # The error handling is not shown in this excerpt; re-raise so the
        # caller can report the failure.
        raise

Session State Structure

File State Transitions

Files progress through several states during processing:

  1. Upload: Added to raw_files and file_metadata

  2. Preparation: Images preprocessed and stored in active_processing

  3. Processing: OCR/table detection with results in ocr_results / table_bboxes

  4. Completion: Moved to processed_files_ocr
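The four states form a strictly forward sequence, which can be sketched as an enumeration. FileState and next_state are hypothetical names used only for illustration; the application tracks these states implicitly through the session containers named above.

```python
from enum import Enum


class FileState(Enum):
    UPLOAD = 1        # added to raw_files and file_metadata
    PREPARATION = 2   # images stored in active_processing
    PROCESSING = 3    # results in ocr_results / table_bboxes
    COMPLETION = 4    # moved to processed_files_ocr


def next_state(state: FileState) -> FileState:
    """Return the state that follows the given one in the processing sequence."""
    if state is FileState.COMPLETION:
        raise ValueError("file has already completed processing")
    return FileState(state.value + 1)
```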

Session Cleanup

def cleanup_processed_file(self, file_path: str):
    """Clean up processing data for a completed file."""
    if file_path in self.active_processing:
        del self.active_processing[file_path]

    self.image_manager.clear_cache()

File Format Support

The system supports multiple file formats with automatic type detection and specialized handling.

Supported Formats

  • PDF

  • PNG

  • JPG/JPEG

  • XLSX (Excel)

Note

There is a separate UI page for uploading and processing Excel files (file_selection_excel).

Format Detection

@staticmethod
def get_file_type(file_path: str) -> str:
    """Return 'pdf', 'image', or 'unknown' based on the file extension."""
    if file_path.lower().endswith('.pdf'):
        return 'pdf'
    elif file_path.lower().endswith(('.png', '.jpg', '.jpeg')):
        return 'image'
    else:
        return 'unknown'
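A short usage example makes the behaviour concrete. The function is repeated here as a standalone copy so the snippet runs on its own; the file names are made up for illustration.

```python
def get_file_type(file_path: str) -> str:
    """Standalone copy of the method above, so this example runs on its own."""
    lowered = file_path.lower()
    if lowered.endswith('.pdf'):
        return 'pdf'
    elif lowered.endswith(('.png', '.jpg', '.jpeg')):
        return 'image'
    return 'unknown'


for name in ("Report.PDF", "scan.jpeg", "data.xlsx", "archive.pdf.zip"):
    print(name, "->", get_file_type(name))
```

This prints pdf, image, unknown, and unknown respectively. Detection is case-insensitive but purely extension-based, so file contents are never inspected, and XLSX files are reported as 'unknown' by this method.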

PDF Handling

PDF files receive specialized handling for multi-page operations:

def _load_single_pdf_page(self, file_path: str, page_num: int) -> Optional[np.ndarray]:
    """Load a single page from a PDF document."""
    try:
        # Manage PDF document cache
        if file_path not in self._pdf_docs:
            if len(self._pdf_docs) >= self.max_open_pdfs:
                oldest_pdf = self._pdf_access_order.pop(0)
                self._pdf_docs[oldest_pdf].close()
                del self._pdf_docs[oldest_pdf]

            # Open new PDF document
            doc = pymupdf.open(stream=self._raw_files[file_path], filetype='pdf')
            self._pdf_docs[file_path] = doc
            self._pdf_access_order.append(file_path)

        # Load specific page
        doc = self._pdf_docs[file_path]
        if page_num < len(doc):
            page = doc.load_page(page_num)
            pix = page.get_pixmap()
            img_data = pix.tobytes("ppm")
            img = Image.open(io.BytesIO(img_data))
            return np.array(img)

    except Exception as e:
        print(f"Error loading PDF page {page_num} from {file_path}: {e}")

    return None

Image Loading

Single images use the load_images utility in image_utils:

def _load_single_image(self, file_path: str) -> Optional[np.ndarray]:
    """Load a single image file."""
    if file_path in self._raw_files:
        try:
            file_obj = io.BytesIO(self._raw_files[file_path])
            images = image_utils.load_images(file_obj)
            return images[0] if images else None
        except Exception as e:
            print(f"Error loading image {file_path}: {e}")

    return None

Integration with Processing Pipeline

The file handling system integrates with the OCR and table reconstruction components:

OCR Integration

# Prepare images for OCR processing
FileService.prepare_images(session, file_path, progress_callback=callback)

# Process OCR with preprocessed images
OCRService.process_ocr_extraction(session, dig, term_processor, file_path)

Table Detection Integration

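# Images are automatically available to table detection models.
# page_count is the file's page count, as extracted during upload.
table_extractor = session.table_extractor
tables_by_page = {}
for page_num in range(page_count):
    image = session.image_manager.get_image(file_path, page_num, preprocess=True)
    # Keep the detections for each page instead of overwriting them.
    tables_by_page[page_num] = table_extractor.detect_tables(image)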

Session State Coordination

The file handling system coordinates with other components through shared session state:

  • OCR Results: Stored in session.ocr_results with page-specific keys

  • Table Bounding Boxes: Stored in session.table_bboxes for manual refinement

  • Processing Status: Tracked in session.active_processing for UI updates

  • Batch Processing: Coordinated through session.batch_tagged_files

Note

The file handling system serves as the foundation for all document processing operations, providing file storage, caching, and integration between local operations and cloud storage.