Table Reconstruction

This section describes the table reconstruction capabilities of the Fluidsdata Digitization and OCR application.

Overview

Table reconstruction is a critical component of the document digitization pipeline that converts detected table regions and OCR-extracted text into structured, machine-readable data formats (pandas DataFrames). The process combines computer vision techniques for spatial table detection and structure recognition with advanced clustering algorithms to accurately reconstruct tabular data from scanned documents.

The table reconstruction pipeline consists of several interconnected stages:

  1. Document Layout Analysis: Detection of table regions within document pages

  2. Table Structure Recognition: Identification of table components (rows, columns, cells)

  3. Word-to-Table Mapping: Assignment of OCR text to appropriate table cells

  4. Data Reconstruction: Generation of structured DataFrames from mapped content

  5. Quality Optimization: Iterative refinement to improve reconstruction accuracy

Computer Vision Models

The application employs two specialized YOLO (You Only Look Once) models for table detection and structure recognition:

YOLOv10 Layout Analyzer

Purpose: Document layout analysis and table region detection

Model: Custom-trained YOLOv10 architecture (yolov10x_best.pt)

Capabilities:

  • Detects multiple document elements: tables, text blocks, captions, headers, footers, formulas, and images

  • Operates at document page level to identify table boundaries

  • Provides confidence scores for each detected element

  • Handles rotated and complex table layouts

Note

While this detection model has the capability to detect various document elements, we are currently focused on table detection as the primary target. Future enhancements may expand its use to other document elements such as headers, footers, titles, and captions.

Detection Classes: Here’s an overview of all objects detectable by the YOLOv10 model:

  • Table - primary target

  • Caption

  • Footnote

  • Formula

  • List-item

  • Page_footer

  • Page_header

  • Picture

  • Section_header

  • Text

  • Title

Confidence Thresholds:

Confidence thresholds are set to balance precision and recall for table detection. The model is very rarely 100% confident in its predictions, so we must navigate the trade-off between false positives and missed detections.

The current strategy is to use a low threshold (0.2, meaning any detection below 20% confidence is rejected) for the Table class so that no valid tables are missed, and then to apply additional filtering after the detection step to refine results based on the context of the detected tables.

See line 29 of table_extractor for implementation.
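
A minimal sketch of this strategy using the ultralytics YOLO API (the variable names and glue code here are illustrative, not the actual implementation):

from ultralytics import YOLO

TABLE_CONF_THRESHOLD = 0.2  # permissive: keep anything at or above 20% confidence

layout_model = YOLO("yolov10x_best.pt")
results = layout_model.predict(page_image, conf=TABLE_CONF_THRESHOLD)

# Keep only regions classified as "Table"; contextual filtering is applied
# downstream to discard false positives admitted by the low threshold
table_boxes = [
    box for box in results[0].boxes
    if results[0].names[int(box.cls)] == "Table"
]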

YOLOv8 Table Structure Extractor

Purpose: Detailed table structure analysis within detected table regions

Models:

  • Detection model: best_detection.pt (table boundary detection, currently replaced by YOLOv10)

  • Structure model: best_structure.pt (internal table structure)

Structure Detection Classes:

  • table column: Vertical divisions within tables

  • table row: Horizontal divisions within tables

  • table column header: Header cells for columns

  • table projected row header: Row labels/headers

  • table spanning cell: Cells that span multiple rows/columns

Confidence Thresholds:

For the majority of documents, stricter structure detection thresholds yield better results. There is post-model logic that iteratively improves the table structure based on detected components, but it primarily works by adding rows or columns rather than removing them. It is therefore ideal to start with a table that has exactly the required number of components, or fewer, rather than too many.

This is not a perfect solution, so a configuration option is included to relax the structure detection thresholds; this relaxed mode is used for manual table detection in difficult documents.

# Strict mode (default)
structure_class_thresholds = {
    'table column header': 0.5,
    'table column': 0.7,
    'table projected row header': 0.95,
    'table row': 0.7,
    'table spanning cell': 0.7,
}

# Relaxed mode (for difficult documents/manual table definition)
structure_class_thresholds = {
    'table column': 0.5,        # Lower thresholds
    'table row': 0.5,
    # ... reduced requirements
}

Table Detection Workflow

Two-Stage Detection Process

Stage 1: Layout Analysis

When detect_tables=True, the system first uses YOLOv10 to identify table regions:

  1. Image Processing: The document page is converted to an upscaled, sharpened image suitable for model input

  2. Region Detection: YOLOv10 model predicts bounding boxes for all document elements

  3. Table Filtering: Regions classified as “Table” above confidence threshold are extracted

  4. Crop Generation: Table regions are cropped with padding for detailed analysis, with the cropped images saved to tempfiles for the next stage

Stage 2: Structure Recognition

Each detected table region undergoes detailed structure analysis:

  1. Crop Analysis: YOLOv8 structure model analyzes the cropped table image

  2. Component Detection: Identifies rows, columns, headers, and special cells

  3. Line Extraction: Calculates row and column boundary lines from detected components

  4. Metadata Generation: Creates structured information about table dimensions and layout
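
Put together, the two stages chain roughly as in the following sketch (analyze_layout and analyze_single_table_YOLO appear elsewhere in this document; the second return value of analyze_layout is assumed here to be the crop bounding boxes):

# Stage 1: layout analysis yields cropped table regions
table_crops, table_bboxes = layout_analyzer.analyze_layout(page_image)

# Stage 2: structure recognition on each cropped region
extracted_tables = []
for crop, bbox in zip(table_crops, table_bboxes):
    table_info = table_extractor.analyze_single_table_YOLO(crop, bbox)
    extracted_tables.append(table_info)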

Alternative Single-Stage Process

When detect_tables=False, the entire page is treated as a potential table, bypassing detection and cropping:

  1. Full-Page Analysis: YOLOv8 structure model analyzes the complete document page

  2. Structure Extraction: Identifies table-like structures anywhere on the page

  3. Flexible Detection: Captures tables that might not have clear boundaries

  4. Fallback Mode: Useful for documents with unclear table demarcation, or when a catch-all approach is desired
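
In code, this mode reduces to the same structure-analysis call used in the empty-detection fallback later in this section, applied to the full page:

# Treat the entire page as a single table region
table_info = self.analyze_single_table_YOLO(
    page_image, (0, 0, page_image.width, page_image.height)
)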

Word-to-Table Mapping

The TableReconstructor class in table_reconstructor handles the complex task of assigning OCR-detected words to their appropriate table cells using hybrid clustering techniques.

Coordinate Transformation

Since OCR operates directly on PDF documents while table detection works on images, coordinate transformation is essential (implementation in ocr_service):

transformer = CoordinateTransformer(
    pdf_width=pdf_size[1], # docTR DocumentFile dimensions are (height, width), not (width, height)
    pdf_height=pdf_size[0],
    image_width=image_size[0],
    image_height=image_size[1]
)

# Transform OCR word coordinates to match table detection space
transformed_words = transformer.transform_words_list(page_words)

Spatial Clustering Algorithms

Hybrid Clustering Approach:

The system employs a two-stage clustering strategy combining DBSCAN and Agglomerative Clustering:

  1. DBSCAN Phase:

    • Identifies core word clusters and noise points

    • Parameters: eps=10.0, min_samples=2

    • Handles irregular word distributions and outliers

    DBSCAN is particularly effective at high-level spatial clustering, allowing it to identify the core column structure while being less sensitive to noise and outliers.

  2. Agglomerative Phase:

    • Provides structured grouping of non-noise points

    • Uses linkage criteria to form coherent rows and columns

    • Ensures consistent table structure

    Agglomerative clustering is more suited to refining the clusters formed by DBSCAN, ensuring that the word-to-word structures detected in rows and columns are coherent and well-defined.
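
A minimal sketch of how the two stages might compose (the actual hybrid_clustering implementation lives in table_reconstructor as a method; the fallback and noise handling shown here are assumptions):

from sklearn.cluster import DBSCAN, AgglomerativeClustering
import numpy as np

def hybrid_clustering(coords, num_clusters, eps=10.0, min_samples=2):
    # Stage 1: DBSCAN separates core points from noise/outliers
    db_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(coords)
    core_mask = db_labels != -1

    # If DBSCAN marks everything as noise, cluster all points instead
    if not core_mask.any():
        core_mask = np.ones(len(coords), dtype=bool)

    # Stage 2: agglomerative clustering imposes the requested structure
    # on the core points identified by DBSCAN
    n_clusters = min(num_clusters, int(core_mask.sum()))
    agg_labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(
        coords[core_mask]
    )

    # Noise points keep label -1 so callers can decide how to handle them
    labels = np.full(len(coords), -1)
    labels[core_mask] = agg_labels
    return labels, n_clusters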

Row Detection:

def group_words_into_rows(self, words, num_rows):
    # Extract y-coordinates (vertical position)
    y_coords = np.array([
        (word['geometry'][0][1] + word['geometry'][1][1]) / 2
        for word in words
    ]).reshape(-1, 1)

    # Apply hybrid clustering
    labels, actual_num_rows = self.hybrid_clustering(y_coords, num_clusters=num_rows)

    # Group and sort words by cluster
    # ...

Column Boundary Detection:

def identify_column_boundaries(self, words, num_cols=2):
    # Extract horizontal extents from each word's geometry
    word_left_coords = [word['geometry'][0][0] for word in words]
    word_right_coords = [word['geometry'][1][0] for word in words]

    # Calculate word center points
    word_centers = np.array([
        (left + right) / 2
        for left, right in zip(word_left_coords, word_right_coords)
    ]).reshape(-1, 1)

    # Cluster word centers to identify column groupings
    labels, actual_num_cols = self.hybrid_clustering(word_centers, num_clusters=num_cols)

    # Calculate boundaries between column clusters
    # ...

Overlap-Based Word Assignment

Words are assigned to table cells based on spatial overlap with detected table structures:

@staticmethod
def match_words_to_tables(words, tables):
    for table_idx, table in enumerate(tables):
        table_bbox = table['bbox']
        table_words = []

        for word in words:
            word_bbox = word['geometry']
            word_mid_x = (word_bbox[0][0] + word_bbox[1][0]) / 2
            word_mid_y = (word_bbox[0][1] + word_bbox[1][1]) / 2

            # Check if word center falls within table boundaries
            if (table_bbox[0] <= word_mid_x <= table_bbox[2] and
                table_bbox[1] <= word_mid_y <= table_bbox[3]):

                # Add relative positioning for cell assignment
                word_info = word.copy()
                word_info['relative_x'] = (word_mid_x - table_bbox[0]) / (table_bbox[2] - table_bbox[0])
                word_info['relative_y'] = (word_mid_y - table_bbox[1]) / (table_bbox[3] - table_bbox[1])
                table_words.append(word_info)

Data Reconstruction Process

Table Structure Optimization

The reconstruction process includes optimization to handle various table layouts and OCR quality issues:

Configuration Strategy System:

The system tries multiple reconstruction strategies and selects the best result:

  1. Baseline Configuration: Uses detected row/column counts as-is

  2. Row Adjustment: Dynamically determines row count from word clustering

  3. Column Addition: Incrementally adds columns to reduce cell overcrowding

  4. Combined Approach: Applies both row and column adjustments
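
A hypothetical sketch of the selection loop: each candidate configuration is reconstructed and scored, and the lowest-scoring result (fewest problem cells) wins. try_configuration is the same helper referenced in the refinement code below; the candidate list encoding is an assumption:

# Candidate (columns, rows) configurations
candidates = [
    (detected_cols, detected_rows),       # 1. baseline
    (detected_cols, clustered_rows),      # 2. row adjustment
    (detected_cols + 1, detected_rows),   # 3. column addition
    (detected_cols + 1, clustered_rows),  # 4. combined
]

best_config, best_score = None, float('inf')
for cols, rows in candidates:
    config, score = try_configuration(cols, rows)
    if score < best_score:
        best_config, best_score = config, score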

Quality Scoring:

Each configuration is evaluated using a “quality score” based on:

def count_double_number_cells(self, rows, column_boundaries):
    # Identify cells with multiple numeric values (usually indicates poor column separation)
    double_number_cells = 0
    for row in rows:
        columns = self.assign_words_to_columns(row, column_boundaries)
        for col in columns:
            if len(col) > 1 and all(word['text'].replace('.', '', 1).isdigit() for word in col):
                double_number_cells += 1

    return double_number_cells

Iterative Refinement:

def iterative_refinement(start_cols, start_rows):
    current_cols, current_rows = start_cols, start_rows
    current_score = float('inf')
    refinement_steps = 0

    while refinement_steps < MAX_REFINEMENT_STEPS:
        improvements_made = False

        # Try the current configuration
        config, score = try_configuration(current_cols, current_rows)
        if score < current_score:
            current_score = score

        # If overcrowded cells remain, try adding a column
        if current_score > 0:
            col_config, col_score = try_configuration(current_cols + 1, current_rows)
            if col_score < current_score:
                current_cols += 1
                current_score = col_score
                improvements_made = True

        # Stop if no improvements were made or the score is perfect
        if not improvements_made or current_score == 0:
            break

        refinement_steps += 1

As noted earlier, this refinement process performs best when the initial table structure has fewer components than required rather than too many. It is not perfect, but it could potentially be improved with more sophisticated heuristics or with machine learning models trained on annotated datasets.

Error Handling and Edge Cases

Exception Management

Model Inference Failures:

When a very poor table or a false positive makes it through to the clustering stage, the system rejects any table that fails either of the two clustering methods by raising an exception, which is caught and logged. A printed error message about a clustering or layout analysis failure is therefore very likely not indicative of a bug; it usually means the system encountered a table it could not process and has skipped it. Clearer error messages should be added in the future.

try:
    table_crops, _ = layout_analyzer.analyze_layout(page_image)
except Exception as e:
    print(f"Error during layout analysis: {e}")
    table_crops = []
    # Fallback to structure-only analysis

Coordinate Mismatches:

This error is no longer common, but can still occur if the bounding box format is incorrect or malformed.

if isinstance(table['bbox'], (tuple, list)) and len(table['bbox']) == 4:
    bbox = tuple(map(float, table['bbox']))
    formatted_bboxes.append([bbox])
else:
    print(f"Warning: Malformed bbox detected: {table['bbox']}")
    formatted_bboxes.append([(0.0, 0.0, 0.0, 0.0)])  # Placeholder bbox in the same nested format

Empty Detection Results:

if words[1] == []:  # No words matched to table
    continue  # Skip empty tables

if not extracted_tables:  # No tables detected
    # Process entire page as potential table content
    table_info = self.analyze_single_table_YOLO(image, (0, 0, image.width, image.height))

Quality Assurance Mechanisms

Confidence Filtering:

All detections are filtered by confidence thresholds to ensure quality:

# Only process high-confidence detections
if conf < self.structure_class_thresholds.get(cls_name, 0.5):
    continue

Boundary Validation:

# Ensure coordinates are within image bounds
x1 = max(0, x1 - self.crop_padding)
y1 = max(0, y1 - self.crop_padding)
x2 = min(image.width, x2 + self.crop_padding)
y2 = min(image.height, y2 + self.crop_padding)

Structure Consistency Checks:

# Validate detected table structure
if num_columns < 1:
    num_columns = 2  # Minimum reasonable column count

if num_rows < 1:
    num_rows = 1  # Minimum reasonable row count

Performance Optimization

Memory Management

Temporary File Handling:

import os
import uuid

temp_files = []
try:
    # Model processing
    temp_path = f"table_{uuid.uuid4()}.jpg"
    temp_files.append(temp_path)
    table_crop.save(temp_path)
    # ... processing
finally:
    # Cleanup temporary files
    for temp_file in temp_files:
        if os.path.exists(temp_file):
            os.remove(temp_file)

Model Caching:

_cached_models = {}

def get_model(model_name):
    if model_name not in _cached_models:
        _cached_models[model_name] = load_model(model_name)
    return _cached_models[model_name]

Hardware Acceleration

GPU Detection and Utilization:

Much like the code in the ocr module, table_extractor supports cross-platform GPU acceleration for model inference and automatically selects the best available device, falling back to CPU.

# Cross-platform GPU support
if not _macOS:
    self.device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
else:
    self.device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

# Move models to appropriate device
self.detection_model = YOLO(detection_model_path).to(self.device)
self.structure_model = YOLO(structure_model_path).to(self.device)

Batch Processing Support:

The architecture supports future batch processing enhancements for improved throughput on multi-table documents.

Integration with OCR Pipeline

# In OCRService.process_ocr_extraction()

# 1. Get OCR results in PDF coordinate space
ocr_results = perform_ocr(path)

# 2. Transform to image coordinate space for table detection
transformer = CoordinateTransformer(
    pdf_width=pdf_size[1],
    pdf_height=pdf_size[0],
    image_width=image_size[0],
    image_height=image_size[1]
)
transformed_words = transformer.transform_words_list(page_words)

# 3. Perform table detection on image
extracted_tables = table_extractor.extract_tables_YOLO(page_image)

# 4. Match transformed words to detected tables
table_reconstructor = TableReconstructor(transformed_words)
matched_tables = table_reconstructor.match_words_to_tables(transformed_words, extracted_tables)

Manual Table Selection

In some cases, automatic table detection may not yield satisfactory results, especially for complex or poorly formatted tables. In these situations, the user can manually define the bounding box for a table through the UI, and the selection is then processed as a table crop for structure recognition.

This process also allows the user to specify the number of columns and/or rows in the table, enabling the system to reconstruct the table accurately without the need for iterative refinement.
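
A hypothetical sketch of this flow (UI plumbing omitted; the names here are assumptions):

# Bounding box drawn by the user in the UI (image coordinates)
manual_bbox = (x1, y1, x2, y2)

# Process the selection as a table crop for structure recognition
table_info = table_extractor.analyze_single_table_YOLO(page_image, manual_bbox)

# If the user supplied exact dimensions, reconstruct directly and skip
# iterative refinement
if user_num_cols is not None and user_num_rows is not None:
    config, _ = try_configuration(user_num_cols, user_num_rows)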

Note

Manual table selection does not require OCR to be rerun on the document.

Session State Integration

Reconstruction results are stored in multiple session dictionaries for different use cases:

# Core reconstruction data
session.processed_tables[file_path][page_num] = processed_tables

# Visualization components
session.ocr_bboxes[file_path][page_num] = formatted_bboxes
session.table_row_lines[file_path][page_num] = row_lines
session.table_column_lines[file_path][page_num] = column_lines

# Final structured output
session.ocr_results[file_path][page_num] = dataframes

Future Enhancements

Model Improvements

  • Fine-tuning Opportunities: The current models could benefit from domain-specific fine-tuning on annotated PVT report datasets

  • Full Utilization of YOLOv10: Leveraging YOLOv10 for detecting more than just tables (headers, titles, etc.)

  • Ensemble Methods: Combining multiple detection approaches for improved accuracy

  • Confidence Calibration: Better alignment between model confidence scores and actual accuracy

Algorithm Enhancements

  • Semantic Understanding: Incorporation of domain knowledge for better cell content interpretation

  • Multi-page Table Handling: Support for tables that span multiple document pages. This is somewhat handled later in the pipeline, but could be integrated more seamlessly into the table reconstruction process.

Performance Optimizations

  • Parallel Processing: Concurrent processing of multiple tables within a page

  • Model Quantization: Reduced model size for faster inference

  • Model Variants: Exploring different model architectures and sizes for specific use cases and system specs