File Handling¶
This section provides an overview of the file handling system in the Fluidsdata Digitization and OCR application, covering the architecture and workflows for file upload, storage, processing, and management.
System Architecture¶
The file handling system is designed around a three-tier architecture that provides scalable file management capabilities:
Core Components
FileService (file_service): Centralized file operations including upload, download, synchronization, and deletion
LazyImageManager (session): Intelligent image caching system with memory-efficient processing
FileProcessor (file_selection): UI workflow orchestration with real-time progress tracking
Supported File Formats
The system handles multiple document and image formats:
PDF Documents: Multi-page documents with automatic page extraction and metadata analysis
Image Files: PNG, JPG, and JPEG formats with quality preservation
Excel Files: XLSX format with support for table extraction and data analysis
Integration Points
The file handling system serves as the foundation layer for:
OCR text extraction workflows (OCR Process)
Table detection and reconstruction (Table Reconstruction)
Note
All file operations are designed to work both locally and with Azure Blob Storage, providing automatic synchronization between local session state and cloud resources.
File Upload and Processing Workflow¶
Upload Process Overview
The file upload workflow follows a structured process designed for reliability and performance:
Client Upload: Users upload files through the Streamlit interface with real-time validation
Local Processing: Files are immediately processed and stored in the session’s LazyImageManager
Metadata Extraction: Automatic extraction of page count, dimensions, and file type information
Cloud Synchronization: Files are uploaded to Azure Blob Storage for persistence and sharing
Session Integration: File references and processing containers are initialized for downstream operations
For detailed implementation, see the upload_local_files method in file_service.
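As a rough sketch of the wiring (assuming the session object is available as st.session_state.session, that upload_local_files takes the session plus the uploaded file objects, and that the import path is illustrative):

import streamlit as st

# Module path below is an assumption; upload_local_files is documented in file_service.
from file_service import FileService

uploaded_files = st.file_uploader(
    "Upload documents",
    type=["pdf", "png", "jpg", "jpeg", "xlsx"],
    accept_multiple_files=True,
)

if uploaded_files:
    # Store raw bytes in the session's LazyImageManager, extract metadata,
    # and synchronize copies to Azure Blob Storage for persistence.
    FileService.upload_local_files(st.session_state.session, uploaded_files)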
Preprocessing Pipeline
Before OCR or table detection, files undergo preprocessing:
Image Extraction: PDF pages are converted to high-quality images using PyMuPDF
Quality Enhancement: Automatic deskewing, upscaling, and sharpening for optimal OCR results
Caching Strategy: Processed images are cached using LRU eviction for performance
Memory Management: Intelligent cache sizing prevents memory overflow during batch operations
The preprocessing is handled by the prepare_images method, with progress callbacks for UI feedback.
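In code, the call pattern looks like the snippet below; the callback signature mirrors the one used by FileProcessor._process_step later in this section, and the get_image call matches the usage shown under table detection integration (a sketch, not the full API):

# The (status, message) callback signature mirrors FileProcessor._process_step below.
def on_progress(status: str, message: str):
    print(f"[{status}] {message}")

FileService.prepare_images(session, file_path, progress_callback=on_progress)

# Prepared pages are then served from the LazyImageManager cache:
first_page = session.image_manager.get_image(file_path, 0, preprocess=True)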
Azure Blob Storage Integration¶
Cloud Storage Architecture
The system provides Azure Blob Storage integration:
- Upload Operations
Files are uploaded to designated Azure containers with automatic error handling and retry logic. The upload_to_azure method handles batch uploads.
- Download and Synchronization
The sync_files_from_azure method ensures local session state matches cloud storage, automatically downloading file metadata and preparing thumbnails for files stored in Azure.
- File Management
The delete_file method provides comprehensive cleanup of all file-related data structures, including Azure storage references and local session state.
For complete Azure integration details, see Azure Backend. For source code: file_service and fdDataAccess.
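The method names below come from file_service; the parameter lists are assumptions made for illustration only:

# Parameter lists are assumptions; only the method names are documented above.

# Batch-upload newly added files to the configured Azure container.
FileService.upload_to_azure(session)

# Pull file metadata and thumbnails so local session state matches Azure.
FileService.sync_files_from_azure(session)

# Remove a file everywhere: Azure blob, local caches, and session references.
FileService.delete_file(session, file_path)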
Hybrid Operation Modes
The system supports two deployment configurations:
Local Mode: Files stored and processed entirely on local filesystem
Cloud Mode: Full Azure Blob Storage integration with local caching
Note
Cloud mode is the default for deployment and development.
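As an illustration only (the setting name is hypothetical; the application reads its own configuration), the mode switch amounts to a single flag:

import os

# Hypothetical flag name; cloud mode is the default.
use_azure = os.getenv("FD_USE_AZURE_STORAGE", "true").lower() == "true"

if use_azure:
    storage_backend = "azure"   # Cloud mode: Azure Blob Storage with local caching
else:
    storage_backend = "local"   # Local mode: local filesystem only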
LazyImageManager¶
Intelligent Caching System
The LazyImageManager (session) implements a sophisticated multi-tier caching strategy:
- Cache Hierarchy
Raw File Storage: Original uploaded file data maintained in memory
Raw Image Cache: Extracted images before preprocessing (LRU managed)
Processed Image Cache: Enhanced images after deskewing and upscaling (LRU managed)
PDF Document Cache: Open PyMuPDF documents for efficient multi-page access
- Memory Optimization
Lazy Loading: Images loaded only when requested, minimizing memory footprint
LRU Eviction: Least Recently Used cache eviction prevents memory overflow (see the sketch after this list)
Resource Management: Automatic closure of PDF documents when cache limits are exceeded or the file is no longer needed
Thread Safety: Concurrent access protection for multi-threaded operations
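The eviction behaviour can be pictured with a minimal, generic LRU cache; the real LazyImageManager adds thread safety and keeps separate tiers for raw and processed images:

from collections import OrderedDict

class TinyLRUCache:
    """Minimal illustration of the LRU eviction used by the image caches."""

    def __init__(self, max_items: int = 20):
        self.max_items = max_items
        self._items = OrderedDict()   # key -> cached image array

    def get(self, key):
        if key not in self._items:
            return None               # caller falls back to lazy loading
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, image):
        self._items[key] = image
        self._items.move_to_end(key)
        if len(self._items) > self.max_items:
            self._items.popitem(last=False)  # evict the least recently used entry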
Image Processing Pipeline
The manager handles format-specific processing:
PDF Processing: Converts PDF pages to images
Image Processing: Direct loading with format validation and error handling
Batch Operations: Optimized processing of multiple pages with progress tracking
Quality Enhancement: Automatic image preprocessing for optimal OCR results
See the LazyImageManager class documentation in session for implementation details.
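Conceptually, loading a page dispatches on the detected file type; the private loader names mirror the excerpts in the File Format Support section below, and the orchestration here is illustrative (it assumes get_file_type is reachable from the manager):

# Illustrative dispatch only; the loaders themselves are excerpted later in this section.
def load_page(manager, file_path: str, page_num: int):
    file_type = manager.get_file_type(file_path)
    if file_type == 'pdf':
        return manager._load_single_pdf_page(file_path, page_num)
    if file_type == 'image':
        return manager._load_single_image(file_path)
    return None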
User Interface Integration¶
FileProcessor Workflow Management
The FileProcessor class (file_selection) handles file processing workflows:
- Progress Tracking
Real-time progress updates during file processing with step-by-step status reporting and error handling. Users receive immediate feedback on upload status, preprocessing progress, and processing completion.
- Batch Processing
Support for processing multiple files simultaneously with individual progress tracking and error isolation. Failed files don’t affect the processing of other files in the batch.
- Error Recovery
Comprehensive error handling ensures that processing errors are captured and reported without affecting system stability.
- Status Management
The system uses structured data classes (ProcessStep, ProcessedFile) to track processing state and provide detailed status information to users.
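The data classes live in file_selection; the fields below are illustrative assumptions about the kind of state they track:

from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative definitions only; the real ProcessStep and ProcessedFile
# classes in file_selection may use different fields and types.
@dataclass
class ProcessStep:
    name: str                       # e.g. 'image_preparation' or 'ocr'
    status: str = "pending"         # pending / running / complete / error
    message: str = ""               # latest progress message shown to the user

@dataclass
class ProcessedFile:
    file_path: str
    steps: List[ProcessStep] = field(default_factory=list)
    error: Optional[str] = None     # set when a step fails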
Integration with Processing Components¶
OCR Workflow Integration
The file handling system provides the foundation for OCR operations:
Image Preparation: Automatic preprocessing for optimal OCR accuracy
Format Optimization: Conversion of documents to formats suitable for OCR processing
Progress Coordination: Synchronized progress reporting between file handling and OCR operations
Results Management: Coordination of OCR results storage and retrieval
See OCR Process for detailed OCR integration workflows.
Table Detection Coordination
Seamless integration with table detection and reconstruction:
Image Provisioning: Automatic provision of processed images to table detection models
Coordinate Management: Handling of table bounding boxes and coordinate transformations
Results Integration: Coordination between table detection results and file-specific storage
Manual Override Support: Support for manual table selection and refinement workflows
See Table Reconstruction for detailed table processing integration.
Shared Session Coordination
All file handling operations coordinate through shared session state (sketched after this list):
OCR Results: Page-specific OCR results stored with standardized naming conventions
Table Data: Table bounding boxes and reconstruction results linked to source pages
Processing Status: Coordinated status tracking across all processing components
Batch Operations: Unified batch processing coordination across all analysis types
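The shared state can be pictured as a set of per-file mappings; the key names below are the ones referenced throughout this section, while the value shapes are assumptions:

# Conceptual layout only; value shapes are assumptions.
session_state_sketch = {
    "raw_files": {},            # file_path -> original uploaded bytes
    "file_metadata": {},        # file_path -> page count, dimensions, file type
    "active_processing": {},    # file_path -> prepared page images
    "ocr_results": {},          # page-specific keys -> extracted text
    "table_bboxes": {},         # file/page keys -> detected table bounding boxes
    "processed_files_ocr": {},  # completed files and their results
    "batch_tagged_files": [],   # files queued for batch operations
}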
Step-by-Step Processing
def _process_step(self, file_path: str, step: str, detect_tables: bool = False):
    """Execute a single processing step with progress tracking."""
    try:
        if step == 'image_preparation':
            # Relay image-preparation progress to the UI status containers
            def image_preparation_progress_callback(status: str, message: str):
                self.update_step_status(self._current_containers, 'image_preparation',
                                        status, message)

            FileService.prepare_images(self.session, file_path,
                                       progress_callback=image_preparation_progress_callback)
            result = {'pages_prepared': len(self.session.active_processing[file_path])}
        elif step == 'ocr':
            # Relay OCR progress, including the current substep, to the UI
            def ocr_progress_callback(status: str, message: str, substep: str):
                self.update_step_status(self._current_containers, 'ocr', status,
                                        f"{substep}: {message}")

            OCRService.process_ocr_extraction(self.session, self.dig, self.term_processor,
                                              file_path, progress_callback=ocr_progress_callback,
                                              detect_tables=detect_tables)
            result = {
                'pages_processed': self.session.ocr_stats[file_path].total_pages_processed,
                'words_extracted': self.session.ocr_stats[file_path].total_words_extracted,
                'tables_detected': self.session.ocr_stats[file_path].total_tables_detected
            }
        else:
            result = {}
        return result
    except Exception as e:
        # Error isolation: report the failure for this step without stopping
        # the rest of the batch (handler abridged here)
        self.update_step_status(self._current_containers, step, 'error', str(e))
        return {'error': str(e)}
Session State Structure¶
File State Transitions
Files progress through several states during processing:
Upload: Added to raw_files and file_metadata
Preparation: Images preprocessed and stored in active_processing
Processing: OCR/table detection with results in ocr_results / table_bboxes
Completion: Moved to processed_files_ocr
Session Cleanup
def cleanup_processed_file(self, file_path: str):
"""Clean up processing data for a completed file."""
if file_path in self.active_processing:
del self.active_processing[file_path]
self.image_manager.clear_cache()
File Format Support¶
The system supports multiple file formats with automatic type detection and specialized handling.
Supported Formats¶
PDF
PNG
JPG/JPEG
XLSX (Excel)
Note
There is a separate UI page for uploading and processing Excel files (file_selection_excel).
Format Detection
@staticmethod
def get_file_type(file_path: str) -> str:
"""Determine the type of file (image or pdf)."""
if file_path.lower().endswith('.pdf'):
return 'pdf'
elif file_path.lower().endswith(('.png', '.jpg', '.jpeg')):
return 'image'
else:
return 'unknown'
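For example (file names are hypothetical; calls are shown unqualified for brevity):

assert get_file_type("well_report.pdf") == 'pdf'
assert get_file_type("core_photo.JPG") == 'image'
# XLSX files fall through to 'unknown' here; Excel uploads use the separate Excel page.
assert get_file_type("pvt_summary.xlsx") == 'unknown'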
PDF Handling
PDF files receive specialized handling for multi-page operations:
def _load_single_pdf_page(self, file_path: str, page_num: int) -> Optional[np.ndarray]:
"""Load a single page from a PDF document."""
try:
# Manage PDF document cache
if file_path not in self._pdf_docs:
if len(self._pdf_docs) >= self.max_open_pdfs:
oldest_pdf = self._pdf_access_order.pop(0)
self._pdf_docs[oldest_pdf].close()
del self._pdf_docs[oldest_pdf]
# Open new PDF document
doc = pymupdf.open(stream=self._raw_files[file_path], filetype='pdf')
self._pdf_docs[file_path] = doc
self._pdf_access_order.append(file_path)
# Load specific page
doc = self._pdf_docs[file_path]
if page_num < len(doc):
page = doc.load_page(page_num)
pix = page.get_pixmap()
img_data = pix.tobytes("ppm")
img = Image.open(io.BytesIO(img_data))
return np.array(img)
except Exception as e:
print(f"Error loading PDF page {page_num} from {file_path}: {e}")
return None
Image Loading
Single images use the load_images utility in image_utils:
def _load_single_image(self, file_path: str) -> Optional[np.ndarray]:
"""Load a single image file."""
if file_path in self._raw_files:
try:
file_obj = io.BytesIO(self._raw_files[file_path])
images = image_utils.load_images(file_obj)
return images[0] if images else None
except Exception as e:
print(f"Error loading image {file_path}: {e}")
return None
Integration with Processing Pipeline¶
The file handling system integrates with the OCR and table reconstruction components:
OCR Integration
# Prepare images for OCR processing
FileService.prepare_images(session, file_path, progress_callback=callback)
# Process OCR with preprocessed images
OCRService.process_ocr_extraction(session, dig, term_processor, file_path)
Table Detection Integration
# Images are automatically available to table detection models
table_extractor = session.table_extractor
for page_num in range(page_count):
image = session.image_manager.get_image(file_path, page_num, preprocess=True)
tables = table_extractor.detect_tables(image)
Session State Coordination
The file handling system coordinates with other components through shared session state:
OCR Results: Stored in session.ocr_results with page-specific keys
Table Bounding Boxes: Stored in session.table_bboxes for manual refinement
Processing Status: Tracked in session.active_processing for UI updates
Batch Processing: Coordinated through session.batch_tagged_files
Note
The file handling system serves as the foundation for all document processing operations, providing file storage, caching, and integration between local operations and cloud storage.