Table Reconstruction
=====================

This section describes the table reconstruction capabilities of the Fluidsdata Digitization and OCR application.

Overview
--------

Table reconstruction is a critical component of the document digitization pipeline that converts detected table regions and OCR-extracted text into structured, machine-readable data formats (pandas DataFrames). The process combines computer vision techniques for spatial table detection and structure recognition with advanced clustering algorithms to accurately reconstruct tabular data from scanned documents.

The table reconstruction pipeline consists of several interconnected stages:

1. **Document Layout Analysis**: Detection of table regions within document pages
2. **Table Structure Recognition**: Identification of table components (rows, columns, cells)
3. **Word-to-Table Mapping**: Assignment of OCR text to appropriate table cells
4. **Data Reconstruction**: Generation of structured DataFrames from mapped content
5. **Quality Optimization**: Iterative refinement to improve reconstruction accuracy

Computer Vision Models
----------------------

The application employs two specialized YOLO (You Only Look Once) models for table detection and structure recognition:

YOLOv10 Layout Analyzer
~~~~~~~~~~~~~~~~~~~~~~~

**Purpose**: Document layout analysis and table region detection

**Model**: Custom-trained YOLOv10 architecture (``yolov10x_best.pt``)

**Capabilities**:

- Detects multiple document elements: tables, text blocks, captions, headers, footers, formulas, and images
- Operates at document page level to identify table boundaries
- Provides confidence scores for each detected element
- Handles rotated and complex table layouts

.. note::

   While this detection model has the capability to detect various document elements, we are currently focused on table detection as the primary target. Future enhancements may expand its use to other document elements such as headers, footers, titles, and captions.

**Detection Classes**:

The YOLOv10 model can detect the following object classes:

- **Table** (primary target)
- Caption
- Footnote
- Formula
- List-item
- Page_footer
- Page_header
- Picture
- Section_header
- Text
- Title

**Confidence Thresholds**:

Confidence thresholds are set to balance precision and recall for table detection. The model is very rarely 100% confident in its predictions, so we must navigate the trade-off between false positives and missed detections. The current strategy is to use a low threshold for the ``Table`` class (0.2, meaning that detections below 20% confidence are rejected) to avoid missing any valid tables, and then to apply additional filtering after the detection step that refines the results based on the context of the detected tables. See line 29 of :ref:`table_extractor` for implementation.

YOLOv8 Table Structure Extractor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Purpose**: Detailed table structure analysis within detected table regions

**Models**:

- Detection model: ``best_detection.pt`` (table boundary detection, currently replaced by YOLOv10)
- Structure model: ``best_structure.pt`` (internal table structure)

**Structure Detection Classes**:

- ``table column``: Vertical divisions within tables
- ``table row``: Horizontal divisions within tables
- ``table column header``: Header cells for columns
- ``table projected row header``: Row labels/headers
- ``table spanning cell``: Cells that span multiple rows/columns

**Confidence Thresholds**:

For the majority of documents, using stricter structure detection thresholds yields better results. There is post-model logic that iteratively improves the table structure based on detected components, but this primarily works by *adding* rows or columns rather than removing them. It is therefore ideal to start with a table that has exactly as many components as required, or fewer, rather than too many.
This is not a perfect solution, so a configuration option is provided to relax the structure detection thresholds. It is used for manual table detection in difficult documents.

.. code-block:: python

    # Strict mode (default)
    structure_class_thresholds = {
        'table column header': 0.5,
        'table column': 0.7,
        'table projected row header': 0.95,
        'table row': 0.7,
        'table spanning cell': 0.7,
    }

    # Relaxed mode (for difficult documents/manual table definition)
    structure_class_thresholds = {
        'table column': 0.5,  # Lower thresholds
        'table row': 0.5,
        # ... reduced requirements
    }

Table Detection Workflow
------------------------

Two-Stage Detection Process
~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Stage 1: Layout Analysis**

When ``detect_tables=True``, the system first uses YOLOv10 to identify table regions:

1. **Image Processing**: Document page converted to an upscaled/sharpened image suitable for model input
2. **Region Detection**: YOLOv10 model predicts bounding boxes for all document elements
3. **Table Filtering**: Regions classified as "Table" above confidence threshold are extracted
4. **Crop Generation**: Table regions are cropped with padding for detailed analysis, with the cropped images saved to temporary files for the next stage

**Stage 2: Structure Recognition**

Each detected table region undergoes detailed structure analysis:

1. **Crop Analysis**: YOLOv8 structure model analyzes the cropped table image
2. **Component Detection**: Identifies rows, columns, headers, and special cells
3. **Line Extraction**: Calculates row and column boundary lines from detected components
4. **Metadata Generation**: Creates structured information about table dimensions and layout

Alternative Single-Stage Process
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When ``detect_tables=False``, the entire page is treated as a potential table, bypassing detection and cropping:

1. **Full-Page Analysis**: YOLOv8 structure model analyzes the complete document page
2. **Structure Extraction**: Identifies table-like structures anywhere on the page
3. **Flexible Detection**: Captures tables that might not have clear boundaries
4. **Fallback Mode**: Useful for documents with unclear table demarcation, or when a catch-all approach is desired

Word-to-Table Mapping
---------------------

The ``TableReconstructor`` class in :ref:`table_reconstructor` handles the complex task of assigning OCR-detected words to their appropriate table cells using hybrid clustering techniques.

Coordinate Transformation
~~~~~~~~~~~~~~~~~~~~~~~~~

Because OCR operates directly on PDF documents while table detection works on images, OCR word coordinates must be transformed into image space before the two can be combined (implementation in :ref:`ocr_service`):

.. code-block:: python

    # The docTR DocumentFile dimensions are (height, width),
    # so index 1 is the width and index 0 is the height.
    transformer = CoordinateTransformer(
        pdf_width=pdf_size[1],
        pdf_height=pdf_size[0],
        image_width=image_size[0],
        image_height=image_size[1]
    )

    # Transform OCR word coordinates to match table detection space
    transformed_words = transformer.transform_words_list(page_words)

Spatial Clustering Algorithms
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Hybrid Clustering Approach**:

The system employs a two-stage clustering strategy combining DBSCAN and Agglomerative Clustering:

1. **DBSCAN Phase**:

   - Identifies core word clusters and noise points
   - Parameters: ``eps=10.0``, ``min_samples=2``
   - Handles irregular word distributions and outliers

   DBSCAN is particularly effective at high-level spatial clustering, allowing it to identify the core column structure while being less sensitive to noise and outliers.

2. **Agglomerative Phase**:

   - Provides structured grouping of non-noise points
   - Uses linkage criteria to form coherent rows and columns
   - Ensures consistent table structure

   Agglomerative clustering is more suited to refining the clusters formed by DBSCAN, ensuring that the word-to-word structures detected in rows and columns are coherent and well-defined.

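The two-phase strategy above can be illustrated with a minimal sketch built on scikit-learn. This is not the application's ``hybrid_clustering`` implementation: the function name ``hybrid_clustering_sketch``, the Ward linkage, and the choice to leave noise points labelled ``-1`` are assumptions made for illustration. Only ``eps=10.0`` and ``min_samples=2`` are taken from the description above.

.. code-block:: python

    import numpy as np
    from sklearn.cluster import DBSCAN, AgglomerativeClustering


    def hybrid_clustering_sketch(coords, num_clusters, eps=10.0, min_samples=2):
        """Cluster 1-D coordinates: DBSCAN flags noise, agglomerative groups the rest."""
        coords = np.asarray(coords, dtype=float).reshape(-1, 1)
        labels = np.full(len(coords), -1, dtype=int)

        # Phase 1: DBSCAN marks isolated points as noise (label -1).
        dbscan_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(coords)
        core_mask = dbscan_labels != -1
        n_core = int(core_mask.sum())
        if n_core == 0:
            return labels, 0

        # Phase 2: agglomerative clustering regroups the non-noise points
        # into at most num_clusters groups (Ward linkage is an assumption).
        k = max(1, min(num_clusters, n_core))
        if k == 1:
            labels[core_mask] = 0
        else:
            agg = AgglomerativeClustering(n_clusters=k, linkage="ward")
            labels[core_mask] = agg.fit_predict(coords[core_mask])
        return labels, k


    # Hypothetical word y-centers: two rows plus one stray word.
    y_centers = [10, 11, 12, 50, 51, 52, 200]
    labels, n_rows = hybrid_clustering_sketch(y_centers, num_clusters=2)

In this example the first three values and the next three form two separate groups, while the stray value at 200 has no neighbor within ``eps`` and keeps the noise label ``-1``. The numeric cluster labels themselves are arbitrary, so any downstream step would need to sort groups by position.
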
**Row Detection**:

.. code-block:: python

    def group_words_into_rows(self, words, num_rows):
        # Extract y-coordinates (vertical center of each word)
        y_coords = np.array([
            (word['geometry'][0][1] + word['geometry'][1][1]) / 2
            for word in words
        ]).reshape(-1, 1)

        # Apply hybrid clustering
        labels, actual_num_rows = self.hybrid_clustering(y_coords, num_clusters=num_rows)

        # Group and sort words by cluster
        # ...

**Column Boundary Detection**:

.. code-block:: python

    def identify_column_boundaries(self, words, num_cols=2):
        # Calculate word center points
        word_centers = np.array([
            (left + right) / 2
            for left, right in zip(word_left_coords, word_right_coords)
        ]).reshape(-1, 1)

        # Cluster word centers to identify column groupings
        labels, actual_num_cols = self.hybrid_clustering(word_centers, num_clusters=num_cols)

        # Calculate boundaries between column clusters
        # ...

Overlap-Based Word Assignment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A word is assigned to a table when its center point falls within the table's bounding box. Its position relative to the table is recorded for later cell assignment:

.. code-block:: python

    @staticmethod
    def match_words_to_tables(words, tables):
        for table_idx, table in enumerate(tables):
            table_bbox = table['bbox']
            table_words = []

            for word in words:
                word_bbox = word['geometry']
                word_mid_x = (word_bbox[0][0] + word_bbox[1][0]) / 2
                word_mid_y = (word_bbox[0][1] + word_bbox[1][1]) / 2

                # Check if word center falls within table boundaries
                if (table_bbox[0] <= word_mid_x <= table_bbox[2] and
                        table_bbox[1] <= word_mid_y <= table_bbox[3]):
                    # Add relative positioning for cell assignment
                    word_info = word.copy()
                    word_info['relative_x'] = (word_mid_x - table_bbox[0]) / (table_bbox[2] - table_bbox[0])
                    word_info['relative_y'] = (word_mid_y - table_bbox[1]) / (table_bbox[3] - table_bbox[1])
                    table_words.append(word_info)

Data Reconstruction Process
---------------------------

Table Structure Optimization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The reconstruction process includes optimization to handle various table layouts and OCR quality issues:

**Configuration Strategy System**:

The system tries multiple reconstruction strategies and selects the best result:

1. **Baseline Configuration**: Uses detected row/column counts as-is
2. **Row Adjustment**: Dynamically determines row count from word clustering
3. **Column Addition**: Incrementally adds columns to reduce cell overcrowding
4. **Combined Approach**: Applies both row and column adjustments

**Quality Scoring**:

Each configuration is evaluated using a "quality score" based on the number of cells that contain more than one numeric value:

.. code-block:: python

    def count_double_number_cells(self, rows, column_boundaries):
        # Identify cells with multiple numeric values
        # (usually indicates poor column separation)
        double_number_cells = 0

        for row in rows:
            columns = self.assign_words_to_columns(row, column_boundaries)
            for col in columns:
                if len(col) > 1 and all(word['text'].replace('.', '', 1).isdigit() for word in col):
                    double_number_cells += 1

        return double_number_cells

**Iterative Refinement**:

.. code-block:: python

    def iterative_refinement(start_cols, start_rows):
        # Simplified excerpt: try_configuration, double_cells and
        # MAX_REFINEMENT_STEPS come from the surrounding implementation.
        current_cols, current_rows = start_cols, start_rows
        current_score = float('inf')
        refinement_steps = 0

        while refinement_steps < MAX_REFINEMENT_STEPS:
            refinement_steps += 1
            improvements_made = False

            # Try current configuration
            config, score = try_configuration(current_cols, current_rows)
            if score < current_score:
                current_score = score

            # If problems detected, try improvements
            if double_cells > 0:
                # Try adding columns
                col_config, col_score = try_configuration(current_cols + 1, current_rows)
                if col_score < current_score:
                    current_cols += 1
                    improvements_made = True

            # Stop if no improvements or perfect score
            if not improvements_made or current_score == 0:
                break

As mentioned earlier, this refinement process performs best when the initial table structure has *fewer components than required*, rather than too many. It is not perfect, but could potentially be improved with more sophisticated heuristics or machine learning models trained on annotated datasets.

Error Handling and Edge Cases
-----------------------------

Exception Management
~~~~~~~~~~~~~~~~~~~~

**Model Inference Failures**:

When a very poor table or a false positive makes it through to the clustering stage, the system rejects any table that fails either of the two clustering methods and handles the resulting exception quietly. If you see a printed error message pertaining to a clustering or layout analysis failure, it is very likely not indicative of a bug, but rather a table that the system was unable to process and will ignore. Clearer error messages should be added in the future.

.. code-block:: python

    try:
        table_crops, _ = layout_analyzer.analyze_layout(page_image)
    except Exception as e:
        print(f"Error during layout analysis: {e}")
        table_crops = []  # Fallback to structure-only analysis

**Coordinate Mismatches**:

This error is no longer common, but can still occur if the bounding box is malformed or in the wrong format.

.. code-block:: python

    if isinstance(table['bbox'], (tuple, list)) and len(table['bbox']) == 4:
        bbox = tuple(map(float, table['bbox']))
        formatted_bboxes.append([bbox])
    else:
        print(f"Warning: Malformed bbox detected: {table['bbox']}")
        # Default bbox, wrapped in a list to match the shape of valid entries
        formatted_bboxes.append([(0.0, 0.0, 0.0, 0.0)])

**Empty Detection Results**:

.. code-block:: python

    # Inside the loop over matched tables
    if not words[1]:  # No words matched to table
        continue  # Skip empty tables

    # After table detection
    if not extracted_tables:  # No tables detected
        # Process entire page as potential table content
        table_info = self.analyze_single_table_YOLO(image, (0, 0, image.width, image.height))

Quality Assurance Mechanisms
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Confidence Filtering**:

All detections are filtered by confidence thresholds to ensure quality:

.. code-block:: python

    # Only process high-confidence detections
    if conf < self.structure_class_thresholds.get(cls_name, 0.5):
        continue

**Boundary Validation**:

.. code-block:: python

    # Ensure coordinates are within image bounds
    x1 = max(0, x1 - self.crop_padding)
    y1 = max(0, y1 - self.crop_padding)
    x2 = min(image.width, x2 + self.crop_padding)
    y2 = min(image.height, y2 + self.crop_padding)

**Structure Consistency Checks**:

.. code-block:: python

    # Validate detected table structure
    if num_columns < 1:
        num_columns = 2  # Minimum reasonable column count
    if num_rows < 1:
        num_rows = 1  # Minimum reasonable row count

Performance Optimization
------------------------

Memory Management
~~~~~~~~~~~~~~~~~

**Temporary File Handling**:

.. code-block:: python

    import os
    import uuid

    temp_files = []
    try:
        # Model processing
        temp_path = f"table_{uuid.uuid4()}.jpg"
        temp_files.append(temp_path)
        table_crop.save(temp_path)
        # ... processing
    finally:
        # Cleanup temporary files
        for temp_file in temp_files:
            if os.path.exists(temp_file):
                os.remove(temp_file)

**Model Caching**:

.. code-block:: python

    _cached_models = {}

    def get_model(model_name):
        # Load each model once and reuse it on later calls
        if model_name not in _cached_models:
            _cached_models[model_name] = load_model(model_name)
        return _cached_models[model_name]

Hardware Acceleration
~~~~~~~~~~~~~~~~~~~~~

**GPU Detection and Utilization**:

Much like the code in :ref:`ocr`, :ref:`table_extractor` supports cross-platform GPU acceleration for model inference and automatically detects the best available device, falling back to the CPU.

.. code-block:: python

    # Cross-platform GPU support
    if not _macOS:
        self.device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
    else:
        self.device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

    # Move models to appropriate device
    self.detection_model = YOLO(detection_model_path).to(self.device)
    self.structure_model = YOLO(structure_model_path).to(self.device)

**Batch Processing Support**:

The architecture supports future batch processing enhancements for improved throughput on multi-table documents.

Integration with OCR Pipeline
-----------------------------

.. code-block:: python

    # In OCRService.process_ocr_extraction()

    # 1. Get OCR results in PDF coordinate space
    ocr_results = perform_ocr(path)

    # 2. Transform to image coordinate space for table detection
    transformer = CoordinateTransformer(
        pdf_width=pdf_size[1],
        pdf_height=pdf_size[0],
        image_width=image_size[0],
        image_height=image_size[1]
    )
    transformed_words = transformer.transform_words_list(page_words)

    # 3. Perform table detection on image
    extracted_tables = table_extractor.extract_tables_YOLO(page_image)

    # 4. Match transformed words to detected tables
    table_reconstructor = TableReconstructor(transformed_words)
    matched_tables = table_reconstructor.match_words_to_tables(transformed_words, extracted_tables)

Manual Table Selection
~~~~~~~~~~~~~~~~~~~~~~~~~

In some cases, automatic table detection may not yield satisfactory results, especially for complex or poorly formatted tables.
In these situations, the user can manually define the bounding box for a table through the UI, which will then be processed as a table crop for structure recognition. This process also allows the user to specify the number of columns and/or rows in the table, enabling the system to reconstruct the table accurately without the need for iterative refinement.

.. note::

   Manual table selection does not require OCR to be rerun on the document.

Session State Integration
~~~~~~~~~~~~~~~~~~~~~~~~~

Reconstruction results are stored in multiple session dictionaries for different use cases:

.. code-block:: python

    # Core reconstruction data
    session.processed_tables[file_path][page_num] = processed_tables

    # Visualization components
    session.ocr_bboxes[file_path][page_num] = formatted_bboxes
    session.table_row_lines[file_path][page_num] = row_lines
    session.table_column_lines[file_path][page_num] = column_lines

    # Final structured output
    session.ocr_results[file_path][page_num] = dataframes

Future Enhancements
-------------------

Model Improvements
~~~~~~~~~~~~~~~~~~

- **Fine-tuning Opportunities**: The current models could benefit from domain-specific fine-tuning on annotated PVT report datasets
- **Full Utilization of YOLOv10**: Leveraging YOLOv10 for detecting more than just tables (headers, titles, etc.)
- **Ensemble Methods**: Combining multiple detection approaches for improved accuracy
- **Confidence Calibration**: Better alignment between model confidence scores and actual accuracy

Algorithm Enhancements
~~~~~~~~~~~~~~~~~~~~~~~

- **Semantic Understanding**: Incorporation of domain knowledge for better cell content interpretation
- **Multi-page Table Handling**: Support for tables that span multiple document pages. This is somewhat handled later in the pipeline, but could be integrated more seamlessly into the table reconstruction process.

Performance Optimizations
~~~~~~~~~~~~~~~~~~~~~~~~~

- **Parallel Processing**: Concurrent processing of multiple tables within a page
- **Model Quantization**: Reduced model size for faster inference
- **Model Variants**: Exploring different model architectures and sizes for specific use cases and system specs