File Handling¶

This section provides an overview of the file handling system in the Fluidsdata Digitization and OCR application, covering the architecture and workflows for file upload, storage, processing, and management.

System Architecture¶

The file handling system uses a three-tier architecture: a service layer for file operations, a caching layer for images, and a UI layer that orchestrates processing workflows.

Core Components

FileService (file_service): Centralized file operations including upload, download, synchronization, and deletion
LazyImageManager (session): Image cache that loads pages lazily and evicts least recently used entries to bound memory use
FileProcessor (file_selection): UI workflow orchestration with real-time progress tracking

Supported File Formats

The system handles multiple document and image formats:

PDF Documents: Multi-page documents with automatic page extraction and metadata analysis
Image Files: PNG and JPEG (.jpg, .jpeg) formats with quality preservation
Excel Files: XLSX format with support for table extraction and data analysis

Integration Points

The file handling system serves as the foundation layer for:

OCR text extraction workflows (OCR Process)
Table detection and reconstruction (Table Reconstruction)

Note

All file operations are designed to work both locally and with Azure Blob Storage, providing automatic synchronization between local session state and cloud resources.

File Upload and Processing Workflow¶

Upload Process Overview

The file upload workflow follows a structured process designed for reliability and performance:

Client Upload: Users upload files through the Streamlit interface with real-time validation
Local Processing: Files are immediately processed and stored in the session’s LazyImageManager
Metadata Extraction: Automatic extraction of page count, dimensions, and file type information
Cloud Synchronization: Files are uploaded to Azure Blob Storage for persistence and sharing
Session Integration: File references and processing containers are initialized for downstream operations

For detailed implementation, see the upload_local_files method in file_service.

Preprocessing Pipeline

Before OCR or table detection, files undergo preprocessing:

Image Extraction: PDF pages are converted to high-quality images using PyMuPDF
Quality Enhancement: Automatic deskewing, upscaling, and sharpening for optimal OCR results
Caching Strategy: Processed images are cached using LRU eviction for performance
Memory Management: Cache size limits prevent memory exhaustion during batch operations

The preprocessing is handled by the prepare_images method, with progress callbacks for UI feedback.
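
The callback contract can be illustrated with a minimal sketch. The function prepare_pages below is a hypothetical stand-in for prepare_images, not the real implementation; only the (status, message) callback signature is taken from the _process_step code in this section, and the status strings are assumptions.

```python
from typing import Callable, List, Optional

# Signature assumed from _process_step: callback(status, message)
ProgressCallback = Callable[[str, str], None]


def prepare_pages(page_count: int,
                  progress_callback: Optional[ProgressCallback] = None) -> List[int]:
    """Hypothetical stand-in for prepare_images that reports progress per page."""
    prepared = []
    for page_num in range(page_count):
        # Real preprocessing (deskewing, upscaling, sharpening) would happen here.
        prepared.append(page_num)
        if progress_callback is not None:
            progress_callback("running", f"Prepared page {page_num + 1} of {page_count}")
    if progress_callback is not None:
        progress_callback("complete", f"{page_count} pages prepared")
    return prepared


# Usage: collect the updates in a list instead of rendering them in the UI.
updates = []
pages = prepare_pages(3, progress_callback=lambda status, msg: updates.append((status, msg)))
```

Keeping the callback optional lets the same function run in batch jobs or tests where no UI container exists.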

Azure Blob Storage Integration¶

Cloud Storage Architecture

Azure Blob Storage integration covers three groups of operations:

Upload Operations: Files are uploaded to designated Azure containers with automatic error handling and retry logic. The upload_to_azure method handles batch uploads.
Download and Synchronization: The sync_files_from_azure method ensures local session state matches cloud storage, automatically downloading file metadata and preparing thumbnails for files stored in Azure.
File Management: The delete_file method provides comprehensive cleanup of all file-related data structures, including Azure storage references and local session state.
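
The retry behaviour mentioned for uploads can be sketched as follows. This is an illustration under assumptions, not the logic of upload_to_azure: the helper upload_with_retry, its parameters, and the choice of exponential backoff are all hypothetical, and the actual upload call is passed in as a plain function so that no Azure SDK call is implied.

```python
import time
from typing import Callable, Dict, Iterable, Tuple, Type


def upload_with_retry(upload_one: Callable[[str], None],
                      file_paths: Iterable[str],
                      max_attempts: int = 3,
                      base_delay: float = 0.5,
                      retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
                      sleep: Callable[[float], None] = time.sleep) -> Dict[str, str]:
    """Upload each file, retrying transient failures with exponential backoff.

    Returns a mapping of file path to 'uploaded' or a 'failed: ...' description.
    """
    results: Dict[str, str] = {}
    for path in file_paths:
        for attempt in range(1, max_attempts + 1):
            try:
                upload_one(path)
                results[path] = "uploaded"
                break
            except retry_on as exc:
                if attempt == max_attempts:
                    results[path] = f"failed: {exc}"
                else:
                    # Wait base_delay, 2 * base_delay, 4 * base_delay, ... between attempts.
                    sleep(base_delay * 2 ** (attempt - 1))
    return results


# Usage with a fake uploader that fails once for one file.
attempts: Dict[str, int] = {}


def flaky_upload(path: str) -> None:
    attempts[path] = attempts.get(path, 0) + 1
    if path == "b.pdf" and attempts[path] == 1:
        raise ConnectionError("temporary network error")


outcome = upload_with_retry(flaky_upload, ["a.pdf", "b.pdf"], sleep=lambda _: None)
```

Recording a per-file outcome, rather than raising on the first failure, keeps one bad upload from aborting the rest of the batch.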

For complete Azure integration details, see Azure Backend. For source code: file_service and fdDataAccess.

Hybrid Operation Modes

The system supports two deployment configurations:

Local Mode: Files are stored and processed entirely on the local filesystem
Cloud Mode: Full Azure Blob Storage integration with local caching

Note

Cloud mode is the default for deployment and development.

LazyImageManager¶

Intelligent Caching System

The LazyImageManager (session) implements a multi-tier caching strategy:

Cache Hierarchy

Raw File Storage: Original uploaded file data maintained in memory
Raw Image Cache: Extracted images before preprocessing (LRU managed)
Processed Image Cache: Enhanced images after deskewing and upscaling (LRU managed)
PDF Document Cache: Open PyMuPDF documents for efficient multi-page access

Memory Optimization

Lazy Loading: Images loaded only when requested, minimizing memory footprint
LRU Eviction: Least Recently Used cache eviction prevents memory overflow
Resource Management: Automatic closure of PDF documents when cache limits are exceeded or the file is no longer needed
Thread Safety: Concurrent access protection for multi-threaded operations
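
The combination of lazy loading, LRU eviction, and locking described above can be sketched with the standard library. The class LRUImageCache and its method names are hypothetical and are not the LazyImageManager API; the sketch only shows the general technique.

```python
import threading
from collections import OrderedDict
from typing import Any, Callable, Hashable


class LRUImageCache:
    """Thread-safe cache that loads values lazily and evicts the least recently used."""

    def __init__(self, max_size: int) -> None:
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.max_size = max_size
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()
        self._lock = threading.Lock()

    def get(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        """Return the cached value for key, calling loader() only on a miss."""
        with self._lock:
            if key in self._items:
                self._items.move_to_end(key)  # mark as most recently used
                return self._items[key]
            value = loader()  # lazy load: runs only when the entry is requested
            self._items[key] = value
            if len(self._items) > self.max_size:
                self._items.popitem(last=False)  # drop the least recently used entry
            return value

    def __contains__(self, key: Hashable) -> bool:
        with self._lock:
            return key in self._items


# Usage: keys are (file_path, page_num) pairs; strings stand in for image arrays.
cache = LRUImageCache(max_size=2)
cache.get(("report.pdf", 0), lambda: "page-0")
cache.get(("report.pdf", 1), lambda: "page-1")
cache.get(("report.pdf", 0), lambda: "unused")  # hit: page 0 becomes most recent
cache.get(("report.pdf", 2), lambda: "page-2")  # miss: evicts page 1
```

Holding the lock while the loader runs keeps the sketch simple but serializes page loads; a production cache might load outside the lock and accept occasional duplicate work.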

Image Processing Pipeline

The manager handles format-specific processing:

PDF Processing: Converts PDF pages to images
Image Processing: Direct loading with format validation and error handling
Batch Operations: Optimized processing of multiple pages with progress tracking
Quality Enhancement: Automatic image preprocessing for optimal OCR results

See the LazyImageManager class documentation in session for implementation details.

User Interface Integration¶

FileProcessor Workflow Management

The FileProcessor class (file_selection) handles file processing workflows:

Progress Tracking: Real-time progress updates during file processing with step-by-step status reporting and error handling. Users receive immediate feedback on upload status, preprocessing progress, and processing completion.
Batch Processing: Support for processing multiple files simultaneously with individual progress tracking and error isolation. Failed files don’t affect the processing of other files in the batch.
Error Recovery: Processing errors are captured and reported per file without destabilizing the rest of the session.
Status Management: The system uses structured data classes (ProcessStep, ProcessedFile) to track processing state and provide detailed status information to users.
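
A minimal sketch of how such data classes can support batch processing with error isolation is shown below. The class names ProcessStep and ProcessedFile come from the text above, but every field, the status strings, and the process_batch helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class ProcessStep:
    name: str
    status: str = "pending"  # assumed values: pending, complete, error
    message: str = ""


@dataclass
class ProcessedFile:
    file_path: str
    steps: List[ProcessStep] = field(default_factory=list)
    error: Optional[str] = None

    @property
    def succeeded(self) -> bool:
        return self.error is None


def process_batch(file_paths: Iterable[str],
                  step_names: List[str],
                  run_step: Callable[[str, str], None]) -> List[ProcessedFile]:
    """Run every step for every file; a failure only stops the file it occurred in."""
    results = []
    for path in file_paths:
        record = ProcessedFile(path, [ProcessStep(name) for name in step_names])
        for step in record.steps:
            try:
                run_step(path, step.name)
                step.status = "complete"
            except Exception as exc:
                step.status, step.message = "error", str(exc)
                record.error = f"{step.name}: {exc}"
                break  # skip this file's remaining steps, continue with the batch
        results.append(record)
    return results


# Usage: the second file fails during OCR, the others are unaffected.
def run_step(path: str, step: str) -> None:
    if path == "bad.pdf" and step == "ocr":
        raise ValueError("unreadable page")


batch = process_batch(["good.pdf", "bad.pdf", "scan.png"],
                      ["image_preparation", "ocr"], run_step)
```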

Integration with Processing Components¶

OCR Workflow Integration

The file handling system provides the foundation for OCR operations:

Image Preparation: Automatic preprocessing for optimal OCR accuracy
Format Optimization: Conversion of documents to formats suitable for OCR processing
Progress Coordination: Synchronized progress reporting between file handling and OCR operations
Results Management: Coordination of OCR results storage and retrieval

See OCR Process for detailed OCR integration workflows.

Table Detection Coordination

The file handling system also supplies images and storage to table detection and reconstruction:

Image Provisioning: Automatic provision of processed images to table detection models
Coordinate Management: Handling of table bounding boxes and coordinate transformations
Results Integration: Coordination between table detection results and file-specific storage
Manual Override Support: Support for manual table selection and refinement workflows
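
As an illustration of the coordinate transformations mentioned above, the sketch below maps a bounding box detected on a processed (for example, upscaled) image back to the original image. It assumes boxes are (x0, y0, x1, y1) pixel tuples and that preprocessing only rescales the image; rotation introduced by deskewing is not handled. The function names are hypothetical.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def scale_bbox(bbox: BBox, scale_x: float, scale_y: float) -> BBox:
    """Scale a bounding box by independent horizontal and vertical factors."""
    x0, y0, x1, y1 = bbox
    return (x0 * scale_x, y0 * scale_y, x1 * scale_x, y1 * scale_y)


def to_original_coords(bbox: BBox,
                       processed_size: Tuple[int, int],
                       original_size: Tuple[int, int]) -> BBox:
    """Map a box from processed-image coordinates to original-image coordinates.

    Sizes are (width, height) tuples.
    """
    proc_w, proc_h = processed_size
    orig_w, orig_h = original_size
    return scale_bbox(bbox, orig_w / proc_w, orig_h / proc_h)


# Usage: a table found on a 2x upscaled page is mapped back to the source page.
box = to_original_coords((200, 100, 1000, 600),
                         processed_size=(2000, 3000),
                         original_size=(1000, 1500))
```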

See Table Reconstruction for detailed table processing integration.

Shared Session Coordination

All file handling operations coordinate through shared session state:

OCR Results: Page-specific OCR results stored with standardized naming conventions
Table Data: Table bounding boxes and reconstruction results linked to source pages
Processing Status: Coordinated status tracking across all processing components
Batch Operations: Unified batch processing coordination across all analysis types

Step-by-Step Processing

def _process_step(self, file_path: str, step: str, detect_tables: bool = False):
    """Execute a single processing step with progress tracking."""
    try:
        if step == 'image_preparation':
            def image_preparation_progress_callback(status: str, message: str):
                self.update_step_status(self._current_containers, 'image_preparation',
                                      status, message)

            FileService.prepare_images(self.session, file_path,
                                     progress_callback=image_preparation_progress_callback)
            result = {'pages_prepared': len(self.session.active_processing[file_path])}

        elif step == 'ocr':
            def ocr_progress_callback(status: str, message: str, substep: str):
                self.update_step_status(self._current_containers, 'ocr', status,
                                      f"{substep}: {message}")

            OCRService.process_ocr_extraction(self.session, self.dig, self.term_processor,
                                            file_path, progress_callback=ocr_progress_callback,
                                            detect_tables=detect_tables)
            result = {
                'pages_processed': self.session.ocr_stats[file_path].total_pages_processed,
                'words_extracted': self.session.ocr_stats[file_path].total_words_extracted,
                'tables_detected': self.session.ocr_stats[file_path].total_tables_detected
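            }

        else:
            raise ValueError(f"Unknown processing step: {step}")

        return result

    except Exception as e:
        # Error handling is abbreviated in this excerpt; the 'error' status value
        # is an assumption. The failure is recorded, then re-raised for the caller.
        self.update_step_status(self._current_containers, step, 'error', str(e))
        raise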

Session State Structure¶

File State Transitions

Files progress through several states during processing:

Upload: Added to raw_files and file_metadata
Preparation: Images preprocessed and stored in active_processing
Processing: OCR/table detection with results in ocr_results / table_bboxes
Completion: Moved to processed_files_ocr
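
The transitions can be made concrete with a small sketch. The attribute names below are the ones listed above, but the container types, the (file_path, page_num) key format for ocr_results, and the state_of helper are assumptions; the real session object may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class SessionSketch:
    """Hypothetical, simplified model of the session state described above."""
    raw_files: Dict[str, bytes] = field(default_factory=dict)
    file_metadata: Dict[str, dict] = field(default_factory=dict)
    active_processing: Dict[str, list] = field(default_factory=dict)
    ocr_results: Dict[Tuple[str, int], Any] = field(default_factory=dict)
    processed_files_ocr: List[str] = field(default_factory=list)

    def state_of(self, file_path: str) -> str:
        """Infer the most advanced state a file has reached."""
        if file_path in self.processed_files_ocr:
            return "completion"
        if any(key[0] == file_path for key in self.ocr_results):
            return "processing"
        if file_path in self.active_processing:
            return "preparation"
        if file_path in self.raw_files:
            return "upload"
        return "unknown"


# Usage: walk one file through every state and record what state_of reports.
session = SessionSketch()
states = [session.state_of("report.pdf")]

session.raw_files["report.pdf"] = b"%PDF-"
session.file_metadata["report.pdf"] = {"page_count": 1}
states.append(session.state_of("report.pdf"))

session.active_processing["report.pdf"] = ["page-0-image"]
states.append(session.state_of("report.pdf"))

session.ocr_results[("report.pdf", 0)] = "recognized text"
states.append(session.state_of("report.pdf"))

session.processed_files_ocr.append("report.pdf")
states.append(session.state_of("report.pdf"))
```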

Session Cleanup

def cleanup_processed_file(self, file_path: str):
    """Clean up processing data for a completed file."""
    if file_path in self.active_processing:
        del self.active_processing[file_path]

    self.image_manager.clear_cache()

File Format Support¶

The system supports multiple file formats with automatic type detection and specialized handling.

Supported Formats¶

PDF
PNG
JPG/JPEG
XLSX (Excel)

Note

There is a separate UI page for uploading and processing Excel files (file_selection_excel).

Format Detection

@staticmethod
def get_file_type(file_path: str) -> str:
    """Determine the type of file (image or pdf)."""
    if file_path.lower().endswith('.pdf'):
        return 'pdf'
    elif file_path.lower().endswith(('.png', '.jpg', '.jpeg')):
        return 'image'
    else:
        return 'unknown'
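
The behaviour of this check on a few file names is shown below, using a standalone copy of the function so the example runs on its own. Note that matching is case-insensitive and that an .xlsx name is reported as 'unknown' by this function, which is consistent with Excel files being handled on their separate page.

```python
def get_file_type(file_path: str) -> str:
    """Standalone copy of the extension check shown above."""
    lowered = file_path.lower()
    if lowered.endswith('.pdf'):
        return 'pdf'
    elif lowered.endswith(('.png', '.jpg', '.jpeg')):
        return 'image'
    else:
        return 'unknown'


# Usage: classify a handful of example file names.
examples = ["Report.PDF", "scan.jpeg", "lab_data.xlsx", "notes.txt"]
detected = {name: get_file_type(name) for name in examples}
```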

PDF Handling

PDF files receive specialized handling for multi-page operations:

def _load_single_pdf_page(self, file_path: str, page_num: int) -> Optional[np.ndarray]:
    """Load a single page from a PDF document."""
    try:
        # Manage PDF document cache
        if file_path not in self._pdf_docs:
            if len(self._pdf_docs) >= self.max_open_pdfs:
                oldest_pdf = self._pdf_access_order.pop(0)
                self._pdf_docs[oldest_pdf].close()
                del self._pdf_docs[oldest_pdf]

            # Open new PDF document
            doc = pymupdf.open(stream=self._raw_files[file_path], filetype='pdf')
            self._pdf_docs[file_path] = doc
            self._pdf_access_order.append(file_path)

        # Load specific page
        doc = self._pdf_docs[file_path]
        if page_num < len(doc):
            page = doc.load_page(page_num)
            pix = page.get_pixmap()
            img_data = pix.tobytes("ppm")
            img = Image.open(io.BytesIO(img_data))
            return np.array(img)

    except Exception as e:
        print(f"Error loading PDF page {page_num} from {file_path}: {e}")

    return None

Image Loading

Single images use the load_images utility in image_utils:

def _load_single_image(self, file_path: str) -> Optional[np.ndarray]:
    """Load a single image file."""
    if file_path in self._raw_files:
        try:
            file_obj = io.BytesIO(self._raw_files[file_path])
            images = image_utils.load_images(file_obj)
            return images[0] if images else None
        except Exception as e:
            print(f"Error loading image {file_path}: {e}")

    return None

Integration with Processing Pipeline¶

The file handling system integrates with the OCR and table reconstruction components:

OCR Integration

# Prepare images for OCR processing
FileService.prepare_images(session, file_path, progress_callback=callback)

# Process OCR with preprocessed images
OCRService.process_ocr_extraction(session, dig, term_processor, file_path)

Table Detection Integration

# Preprocessed images are served to the table detection model page by page.
# page_count is assumed to be the page count recorded for this file at upload.
table_extractor = session.table_extractor
tables_by_page = {}
for page_num in range(page_count):
    image = session.image_manager.get_image(file_path, page_num, preprocess=True)
    tables_by_page[page_num] = table_extractor.detect_tables(image)

Session State Coordination

The file handling system coordinates with other components through shared session state:

OCR Results: Stored in session.ocr_results with page-specific keys
Table Bounding Boxes: Stored in session.table_bboxes for manual refinement
Processing Status: Tracked in session.active_processing for UI updates
Batch Processing: Coordinated through session.batch_tagged_files

Note

The file handling system serves as the foundation for all document processing operations, providing file storage, caching, and integration between local operations and cloud storage.