API Reference: Connected Component Cleanup¶
Overview¶
Connected component cleanup functions remove floating/disconnected blobs from segmentations. Two approaches are provided based on different use cases:
- Per-Label Cleanup (
keep_largest_component) - Clean each label independently - Structure Cleanup (
keep_largest_structure) - Treat multiple labels as one connected structure
Common use case: Neural networks (e.g., CardioForm) produce segmentations with valid anatomy plus floating garbage chunks that share labels with the main anatomy. These functions clean up such artifacts before refinement.
Quick Start¶
from pycemrg_image_analysis.utilities.postprocessing import (
keep_largest_structure_by_name
)
# Clean entire CardioForm output (all labels)
cleaned = keep_largest_structure_by_name(cardioform_seg)
# Or clean specific multi-label structure
cleaned = keep_largest_structure_by_name(
seg,
label_names=["LA_BP_label", "LPV1_label", "RPV1_label"],
label_manager=label_manager
)
When to Use Which Function¶
Use keep_largest_component when:¶
- ✅ Labels are separate anatomical structures (don't touch each other)
- ✅ Each label independently has small floating blobs
- ✅ Example: Aorta (7) has floating pieces AND LA (4) has floating pieces, but they're unrelated
Use keep_largest_structure when:¶
- ✅ Labels form ONE connected anatomical structure
- ✅ Floating chunks share labels with main anatomy
- ✅ You want to remove anything not connected to main structure
- ✅ Example: LA + PVs form connected atrium, but there are lung chunks also labeled LA/PV
Spatial Operations Layer¶
Module: pycemrg_image_analysis.utilities.components
These functions operate directly on sitk.Image objects and handle spatial connectivity using SimpleITK algorithms. Metadata (spacing, origin, direction) is preserved.
keep_largest_component()¶
Keep only the largest connected component for each label independently.
Signature:
Parameters:
- image: Input segmentation image
- label_values: List of integer label values to process independently
Returns: New image with only largest component per label preserved
Behavior:
- Each label in label_values is processed separately
- For each label: finds all connected components, keeps only largest
- Labels not in label_values are completely untouched
- Labels with single component are unchanged (no wasted computation)
- Labels in label_values but not in image are silently skipped
Example:
from pycemrg_image_analysis.utilities.components import keep_largest_component
# Label 7 has main aorta + small floating blob
# Label 4 has main LA + small floating blob
# Clean both independently
cleaned = keep_largest_component(seg_image, [7, 4])
Algorithm: For each label: 1. Extract binary mask for that label 2. Run connected component analysis 3. Relabel components by size (largest = 1) 4. Keep only component 1 5. Restore to original label value
keep_largest_structure()¶
Keep only the largest connected structure across multiple labels.
Signature:
def keep_largest_structure(
image: sitk.Image,
label_values: Optional[List[int]] = None
) -> sitk.Image
Parameters:
- image: Input segmentation image
- label_values: Labels forming the structure. If None, uses all non-zero labels.
Returns: New image with only largest structure, preserving internal labels
Behavior:
- Treats all specified labels as ONE anatomical structure
- Finds largest connected blob considering ALL labels together
- Removes floating chunks even if they have valid label values
- Preserves all internal label values within kept structure
- Default (label_values=None) processes entire segmentation
Example:
from pycemrg_image_analysis.utilities.components import keep_largest_structure
# CardioForm output has LA (4) + PVs (8,9,10,11) connected,
# plus garbage in lungs also labeled 4,8,9,10,11
cleaned = keep_largest_structure(seg, [4, 8, 9, 10, 11])
# Keeps only main LA+PV structure
# Clean entire segmentation (all labels)
cleaned = keep_largest_structure(cardioform_seg) # label_values=None
Algorithm: 1. Create binary mask (any specified label = foreground) 2. Run connected component analysis on binary mask 3. Relabel components by size (largest = 1) 4. Multiply original image by largest component mask 5. Result: main structure with all labels, floating chunks zeroed
Semantic Wrappers Layer¶
Module: pycemrg_image_analysis.utilities.postprocessing
These functions translate semantic label names to integer values, then delegate to the spatial operations layer. Designed for orchestrator use.
keep_largest_component_by_name()¶
Keep largest component per label using semantic names.
Signature:
def keep_largest_component_by_name(
image: sitk.Image,
label_names: List[str],
label_manager: LabelManager
) -> sitk.Image
Parameters:
- image: Input segmentation image
- label_names: List of label names to process (e.g., ["Ao_BP_label", "LA_BP_label"])
- label_manager: LabelManager for name → value lookup
Returns: New image with only largest components per label
Raises: KeyError if label name not found in LabelManager (logged as warning, label skipped)
Example:
from pycemrg_image_analysis.utilities.postprocessing import (
keep_largest_component_by_name
)
cleaned = keep_largest_component_by_name(
seg_image,
["Ao_BP_label", "LA_BP_label"],
label_manager
)
keep_largest_structure_by_name()¶
Keep largest multi-label structure using semantic names.
Signature:
def keep_largest_structure_by_name(
image: sitk.Image,
label_names: Optional[List[str]] = None,
label_manager: Optional[LabelManager] = None
) -> sitk.Image
Parameters:
- image: Input segmentation image
- label_names: List of label names forming structure. If None, uses all non-zero labels.
- label_manager: LabelManager for name → value lookup. Required if label_names provided.
Returns: New image with only largest structure
Raises:
- ValueError if label_names provided but label_manager is None
- KeyError if label name not found (logged as warning, label skipped)
Example:
from pycemrg_image_analysis.utilities.postprocessing import (
keep_largest_structure_by_name
)
# Clean entire segmentation (all labels)
cleaned = keep_largest_structure_by_name(cardioform_seg)
# Clean specific structure
cleaned = keep_largest_structure_by_name(
seg_image,
["LA_BP_label", "LPV1_label", "LPV2_label", "RPV1_label", "RPV2_label"],
label_manager
)
Orchestrator Usage Patterns¶
Pattern 1: Clean Entire CardioForm Output¶
from pycemrg_image_analysis.utilities.postprocessing import (
keep_largest_structure_by_name
)
# In AnatomyRefinementRunner.run()
cardioform_seg = load_image(subject.paths.seg_whole_heart_label4_handled)
# Clean all floating components
cleaned_seg = keep_largest_structure_by_name(cardioform_seg)
# Continue with relabeling and refinement
working_image = relabel_image(cleaned_seg, mapping)
Pattern 2: Clean Specific Problematic Labels¶
# Clean only known problematic labels (configurable)
labels_to_clean = ["Ao", "LA_bp", "RA_bp"] # From config
cleaned_seg = keep_largest_structure_by_name(
cardioform_seg,
label_names=labels_to_clean,
label_manager=cardioform_labels
)
Pattern 3: Per-Label Cleanup for Independent Structures¶
# If aorta and pulmonary artery are separate and both have floating blobs
cleaned_seg = keep_largest_component_by_name(
seg,
["Ao_BP_label", "PArt_BP_label"],
label_manager
)
Pattern 4: Conditional Cleanup Based on Tracker Flags¶
# Only clean if tracker indicates floating components
if subject.tracker.has_flag("FLOATING_ANATOMY"):
labels_to_clean = subject.tracker.get_floating_labels()
cleaned_seg = keep_largest_structure_by_name(
cardioform_seg,
label_names=labels_to_clean,
label_manager=cardioform_labels
)
else:
cleaned_seg = cardioform_seg
Complete Pipeline Example¶
from pycemrg.data import LabelManager, LabelMapper
from pycemrg_image_analysis.utilities import load_image, save_image
from pycemrg_image_analysis.utilities.postprocessing import (
keep_largest_structure_by_name,
relabel_image
)
def run_anatomy_refinement(subject):
"""Complete pipeline with component cleanup."""
# 1. Load CardioForm output
cardioform_seg = load_image(subject.paths.seg_whole_heart_label4_handled)
cardioform_mgr = LabelManager("config/anatomy_labels.yaml")
# 2. Clean floating components
cleaned_seg = keep_largest_structure_by_name(cardioform_seg)
# 3. Translate to image-analysis standard
image_analysis_mgr = LabelManager("config/labels.yaml")
mapper = LabelMapper(source=cardioform_mgr, target=image_analysis_mgr)
mapping = mapper.get_source_to_target_mapping()
working_image = relabel_image(cleaned_seg, mapping)
# 4. Continue with refinement
working_image = run_myocardium(working_image, ...)
working_image = run_valves(working_image, ...)
# 5. Save result
save_image(working_image, subject.paths.seg_anatomy_refined)
Edge Cases and Behavior¶
Missing Labels¶
# Labels in list but not in image are silently skipped (no error)
cleaned = keep_largest_component(seg, [5, 99]) # 99 doesn't exist
# → Label 5 cleaned, label 99 ignored
Single Component¶
# Labels with only one component are returned unchanged
# (No wasted computation running connected components)
cleaned = keep_largest_component(seg, [5]) # Label 5 has one blob
# → Image unchanged for label 5
Empty Label List¶
# Empty list returns image unchanged
cleaned = keep_largest_component(seg, [])
# → Original image returned
All Labels Default¶
# None = all non-zero labels
cleaned = keep_largest_structure(seg) # label_values=None
# → Uses all labels in image (excluding 0)
Tie in Component Size¶
# If two components have same size, keeps first one found
# (Determined by SimpleITK's internal ordering)
Performance Notes¶
- Per-label (
keep_largest_component): O(N × L) where N = image size, L = number of labels - Structure (
keep_largest_structure): O(N) - single connected component pass - Uses SimpleITK's optimized ConnectedComponent filter (efficient for 3D medical images)
- Handles anisotropic voxels correctly via metadata preservation
- Face connectivity (not diagonal) - standard for medical segmentations
Common Pitfalls¶
❌ Wrong: Using per-label for multi-label structure¶
# Don't do this for LA+PV structure with floating chunks:
cleaned = keep_largest_component(seg, [4, 8, 9, 10, 11])
# → Cleans each label independently, won't remove chunks that are
# the largest blob of their specific label
✅ Right: Use structure cleanup¶
# Do this instead:
cleaned = keep_largest_structure(seg, [4, 8, 9, 10, 11])
# → Treats 4+8+9+10+11 as ONE structure, removes floating chunks
❌ Wrong: Forgetting to specify label_manager with names¶
# This will raise ValueError:
cleaned = keep_largest_structure_by_name(
seg,
label_names=["LA_BP_label"],
label_manager=None # ← Error!
)
✅ Right: Provide label_manager¶
cleaned = keep_largest_structure_by_name(
seg,
label_names=["LA_BP_label"],
label_manager=label_manager
)
See Also¶
- Examples:
examples/connected_component_cleanup.py - Tests:
tests/unit/test_components.py - Related:
remove_labels(),keep_labels()for label-based cleanup without connectivity