File Handling
=============

This section provides an overview of the file handling system in the Fluidsdata Digitization and OCR application, covering the architecture and workflows for file upload, storage, processing, and management.

System Architecture
-------------------

The file handling system is designed around a three-tier architecture that provides scalable file management capabilities:

**Core Components**

- **FileService** (:ref:`file_service`): Centralized file operations including upload, download, synchronization, and deletion
- **LazyImageManager** (:ref:`session`): Intelligent image caching system with memory-efficient processing
- **FileProcessor** (:ref:`file_selection`): UI workflow orchestration with real-time progress tracking

**Supported File Formats**

The system handles multiple document and image formats:

- **PDF Documents**: Multi-page documents with automatic page extraction and metadata analysis
- **Image Files**: PNG, JPG, and JPEG formats with quality preservation
- **Excel Files**: XLSX format with support for table extraction and data analysis

**Integration Points**

The file handling system serves as the foundation layer for:

- OCR text extraction workflows (:doc:`ocr`)
- Table detection and reconstruction (:doc:`table_reconstruction`)

.. note::

   All file operations are designed to work both locally and with Azure Blob Storage, providing automatic synchronization between local session state and cloud resources.

File Upload and Processing Workflow
-----------------------------------

**Upload Process Overview**

The file upload workflow follows a structured process designed for reliability and performance:

1. **Client Upload**: Users upload files through the Streamlit interface with real-time validation
2. **Local Processing**: Files are immediately processed and stored in the session's LazyImageManager
3. **Metadata Extraction**: Automatic extraction of page count, dimensions, and file type information
4. **Cloud Synchronization**: Files are uploaded to Azure Blob Storage for persistence and sharing
5. **Session Integration**: File references and processing containers are initialized for downstream operations

For detailed implementation, see the ``upload_local_files`` method in :ref:`file_service`.

**Preprocessing Pipeline**

Before OCR or table detection, files undergo preprocessing:

- **Image Extraction**: PDF pages are converted to high-quality images using PyMuPDF
- **Quality Enhancement**: Automatic deskewing, upscaling, and sharpening for optimal OCR results
- **Caching Strategy**: Processed images are cached using LRU eviction for performance
- **Memory Management**: Intelligent cache sizing prevents memory overflow during batch operations

The preprocessing is handled by the ``prepare_images`` method, with progress callbacks for UI feedback.

Azure Blob Storage Integration
------------------------------

**Cloud Storage Architecture**

The system provides Azure Blob Storage integration:

**Upload Operations**

Files are uploaded to designated Azure containers with automatic error handling and retry logic. The ``upload_to_azure`` method handles batch uploads.

**Download and Synchronization**

The ``sync_files_from_azure`` method ensures local session state matches cloud storage, automatically downloading file metadata and preparing thumbnails for files stored in Azure.

**File Management**

The ``delete_file`` method provides comprehensive cleanup of all file-related data structures, including Azure storage references and local session state.
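
The snippet below is a minimal, self-contained sketch of what a batch upload with simple retry logic can look like when built on the ``azure-storage-blob`` SDK. The function name, container name, environment variable, and backoff policy are illustrative assumptions for this sketch and do not reflect the actual ``upload_to_azure`` implementation; see :ref:`file_service` for the real code.

.. code-block:: python

    """Illustrative sketch only -- not the application's upload_to_azure method."""
    import os
    import time
    from typing import Dict

    from azure.storage.blob import BlobServiceClient


    def upload_files_with_retry(files: Dict[str, bytes],
                                container_name: str = "uploaded-documents",  # hypothetical container
                                max_retries: int = 3) -> Dict[str, bool]:
        """Upload raw file bytes to an Azure Blob Storage container with simple retries."""
        # Where the connection string lives is an assumption made for this sketch.
        service = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"])
        container = service.get_container_client(container_name)

        results: Dict[str, bool] = {}
        for file_path, data in files.items():
            for attempt in range(1, max_retries + 1):
                try:
                    # Blob name mirrors the local file name; overwrite keeps re-uploads idempotent.
                    container.upload_blob(name=os.path.basename(file_path), data=data, overwrite=True)
                    results[file_path] = True
                    break
                except Exception:
                    if attempt == max_retries:
                        results[file_path] = False   # give up after the final attempt
                    else:
                        time.sleep(2 ** attempt)     # exponential backoff before retrying
        return results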
For complete Azure integration details, see :doc:`/architecture/azure_backend`. For source code, see :ref:`file_service` and :ref:`fdDataAccess`.

**Hybrid Operation Modes**

The system supports two deployment configurations:

- **Local Mode**: Files stored and processed entirely on the local filesystem
- **Cloud Mode**: Full Azure Blob Storage integration with local caching

.. note::

   Cloud mode is the default for deployment and development.

LazyImageManager
----------------

**Intelligent Caching System**

The LazyImageManager (:ref:`session`) implements a sophisticated multi-tier caching strategy:

**Cache Hierarchy**

- **Raw File Storage**: Original uploaded file data maintained in memory
- **Raw Image Cache**: Extracted images before preprocessing (LRU managed)
- **Processed Image Cache**: Enhanced images after deskewing and upscaling (LRU managed)
- **PDF Document Cache**: Open PyMuPDF documents for efficient multi-page access

**Memory Optimization**

- **Lazy Loading**: Images loaded only when requested, minimizing memory footprint
- **LRU Eviction**: Least Recently Used cache eviction prevents memory overflow
- **Resource Management**: Automatic closure of PDF documents when cache limits are exceeded or the file is no longer needed
- **Thread Safety**: Concurrent access protection for multi-threaded operations

**Image Processing Pipeline**

The manager handles format-specific processing:

- **PDF Processing**: Converts PDF pages to images
- **Image Processing**: Direct loading with format validation and error handling
- **Batch Operations**: Optimized processing of multiple pages with progress tracking
- **Quality Enhancement**: Automatic image preprocessing for optimal OCR results

See the LazyImageManager class documentation in :ref:`session` for implementation details.
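
To make the lazy-loading and LRU-eviction behaviour described above concrete, here is a minimal, hypothetical sketch of a single cache tier. The class, attribute, and parameter names are illustrative only and are not the LazyImageManager API.

.. code-block:: python

    """Illustrative sketch only -- not the actual LazyImageManager implementation."""
    from collections import OrderedDict
    from typing import Callable, Hashable

    import numpy as np


    class LRUImageCache:
        """Caches decoded page images, evicting the least recently used entry when full."""

        def __init__(self, max_entries: int = 20):
            self._max_entries = max_entries
            self._cache: "OrderedDict[Hashable, np.ndarray]" = OrderedDict()

        def get(self, key: Hashable, loader: Callable[[], np.ndarray]) -> np.ndarray:
            """Return the cached image for the given key, loading it lazily on a miss."""
            if key in self._cache:
                self._cache.move_to_end(key)      # mark as most recently used
                return self._cache[key]

            image = loader()                      # lazy load: only runs on a cache miss
            self._cache[key] = image
            if len(self._cache) > self._max_entries:
                self._cache.popitem(last=False)   # evict the least recently used entry
            return image

The real manager layers several such tiers (raw images, processed images, and open PDF documents) behind a single interface and adds locking for thread safety.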
User Interface Integration
--------------------------

**FileProcessor Workflow Management**

The FileProcessor class (:ref:`file_selection`) handles file processing workflows:

**Progress Tracking**

Real-time progress updates during file processing with step-by-step status reporting and error handling. Users receive immediate feedback on upload status, preprocessing progress, and processing completion.

**Batch Processing**

Support for processing multiple files simultaneously with individual progress tracking and error isolation. Failed files don't affect the processing of other files in the batch.

**Error Recovery**

Comprehensive error handling ensures that processing errors are captured and reported without affecting system stability.

**Status Management**

The system uses structured data classes (ProcessStep, ProcessedFile) to track processing state and provide detailed status information to users.

Integration with Processing Components
--------------------------------------

**OCR Workflow Integration**

The file handling system provides the foundation for OCR operations:

- **Image Preparation**: Automatic preprocessing for optimal OCR accuracy
- **Format Optimization**: Conversion of documents to formats suitable for OCR processing
- **Progress Coordination**: Synchronized progress reporting between file handling and OCR operations
- **Results Management**: Coordination of OCR results storage and retrieval

See :doc:`ocr` for detailed OCR integration workflows.

**Table Detection Coordination**

Seamless integration with table detection and reconstruction:

- **Image Provisioning**: Automatic provision of processed images to table detection models
- **Coordinate Management**: Handling of table bounding boxes and coordinate transformations
- **Results Integration**: Coordination between table detection results and file-specific storage
- **Manual Override Support**: Support for manual table selection and refinement workflows

See :doc:`table_reconstruction` for detailed table processing integration.

**Shared Session Coordination**

All file handling operations coordinate through shared session state:

- **OCR Results**: Page-specific OCR results stored with standardized naming conventions
- **Table Data**: Table bounding boxes and reconstruction results linked to source pages
- **Processing Status**: Coordinated status tracking across all processing components
- **Batch Operations**: Unified batch processing coordination across all analysis types

**Step-by-Step Processing**

.. code-block:: python

    def _process_step(self, file_path: str, step: str, detect_tables: bool = False):
        """Execute a single processing step with progress tracking."""
        try:
            if step == 'image_preparation':
                def image_preparation_progress_callback(status: str, message: str):
                    self.update_step_status(self._current_containers, 'image_preparation', status, message)

                FileService.prepare_images(self.session, file_path,
                                           progress_callback=image_preparation_progress_callback)
                result = {'pages_prepared': len(self.session.active_processing[file_path])}

            elif step == 'ocr':
                def ocr_progress_callback(status: str, message: str, substep: str):
                    self.update_step_status(self._current_containers, 'ocr', status, f"{substep}: {message}")

                OCRService.process_ocr_extraction(self.session, self.dig, self.term_processor, file_path,
                                                  progress_callback=ocr_progress_callback,
                                                  detect_tables=detect_tables)
                result = {
                    'pages_processed': self.session.ocr_stats[file_path].total_pages_processed,
                    'words_extracted': self.session.ocr_stats[file_path].total_words_extracted,
                    'tables_detected': self.session.ocr_stats[file_path].total_tables_detected
                }

            return result
        except Exception:
            ...  # error handling elided from this excerpt

Session State Structure
~~~~~~~~~~~~~~~~~~~~~~~

**File State Transitions**

Files progress through several states during processing:

1. **Upload**: Added to ``raw_files`` and ``file_metadata``
2. **Preparation**: Images preprocessed and stored in ``active_processing``
3. **Processing**: OCR/table detection with results in ``ocr_results`` / ``table_bboxes``
4. **Completion**: Moved to ``processed_files_ocr``

**Session Cleanup**

.. code-block:: python

    def cleanup_processed_file(self, file_path: str):
        """Clean up processing data for a completed file."""
        if file_path in self.active_processing:
            del self.active_processing[file_path]
        self.image_manager.clear_cache()

File Format Support
-------------------

The system supports multiple file formats with automatic type detection and specialized handling.

Supported Formats
~~~~~~~~~~~~~~~~~

- PDF
- PNG
- JPG/JPEG
- XLSX (Excel)

.. note::

   There is a separate UI page for uploading and processing Excel files (:ref:`file_selection_excel`).

**Format Detection**

.. code-block:: python

    @staticmethod
    def get_file_type(file_path: str) -> str:
        """Determine the type of file (image or pdf)."""
        if file_path.lower().endswith('.pdf'):
            return 'pdf'
        elif file_path.lower().endswith(('.png', '.jpg', '.jpeg')):
            return 'image'
        else:
            return 'unknown'
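
For example, calling the helper directly (the owning class is not shown in the excerpt above, so the calls are written unqualified):

.. code-block:: python

    get_file_type("report.pdf")     # -> 'pdf'
    get_file_type("Scan01.JPEG")    # -> 'image' (the extension check is case-insensitive)
    get_file_type("tables.xlsx")    # -> 'unknown'

XLSX files are reported as ``'unknown'`` by this helper, which is consistent with Excel uploads being handled on the separate Excel page noted above rather than by the image/PDF pipeline.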
**PDF Handling**

PDF files receive specialized handling for multi-page operations:

.. code-block:: python

    def _load_single_pdf_page(self, file_path: str, page_num: int) -> Optional[np.ndarray]:
        """Load a single page from a PDF document."""
        try:
            # Manage PDF document cache
            if file_path not in self._pdf_docs:
                if len(self._pdf_docs) >= self.max_open_pdfs:
                    oldest_pdf = self._pdf_access_order.pop(0)
                    self._pdf_docs[oldest_pdf].close()
                    del self._pdf_docs[oldest_pdf]

                # Open new PDF document
                doc = pymupdf.open(stream=self._raw_files[file_path], filetype='pdf')
                self._pdf_docs[file_path] = doc
                self._pdf_access_order.append(file_path)

            # Load specific page
            doc = self._pdf_docs[file_path]
            if page_num < len(doc):
                page = doc.load_page(page_num)
                pix = page.get_pixmap()
                img_data = pix.tobytes("ppm")
                img = Image.open(io.BytesIO(img_data))
                return np.array(img)
        except Exception as e:
            print(f"Error loading PDF page {page_num} from {file_path}: {e}")
        return None

**Image Loading**

Single images use the ``load_images`` utility in :ref:`image_utils`:

.. code-block:: python

    def _load_single_image(self, file_path: str) -> Optional[np.ndarray]:
        """Load a single image file."""
        if file_path in self._raw_files:
            try:
                file_obj = io.BytesIO(self._raw_files[file_path])
                images = image_utils.load_images(file_obj)
                return images[0] if images else None
            except Exception as e:
                print(f"Error loading image {file_path}: {e}")
        return None

Integration with Processing Pipeline
------------------------------------

The file handling system integrates with the OCR and table reconstruction components:

**OCR Integration**

.. code-block:: python

    # Prepare images for OCR processing
    FileService.prepare_images(session, file_path, progress_callback=callback)

    # Process OCR with preprocessed images
    OCRService.process_ocr_extraction(session, dig, term_processor, file_path)

**Table Detection Integration**

.. code-block:: python

    # Images are automatically available to table detection models
    table_extractor = session.table_extractor
    for page_num in range(page_count):
        image = session.image_manager.get_image(file_path, page_num, preprocess=True)
        tables = table_extractor.detect_tables(image)

**Session State Coordination**

The file handling system coordinates with other components through shared session state:

- **OCR Results**: Stored in ``session.ocr_results`` with page-specific keys
- **Table Bounding Boxes**: Stored in ``session.table_bboxes`` for manual refinement
- **Processing Status**: Tracked in ``session.active_processing`` for UI updates
- **Batch Processing**: Coordinated through ``session.batch_tagged_files``

.. note::

   The file handling system serves as the foundation for all document processing operations, providing file storage, caching, and integration between local operations and cloud storage.