
Conversation

@JanuszL
Contributor

@JanuszL JanuszL commented Dec 5, 2025

  • Introduces a lightweight diagnostic tool for identifying data loading
    bottlenecks in PyTorch training pipelines.
  • This change adds the Loader Evaluator to the PyTorch DALI plugin, together
    with a Jupyter notebook tutorial, a documentation page, and tests.
  • The LoaderEvaluator class wraps a PyTorch DataLoader with performance
    monitoring and offers two operation modes: 'log' (normal iteration with
    metrics) and 'replay' (cached batches for ideal performance simulation).
  • PerformanceMetrics class for detailed performance tracking and bottleneck
    analysis
  • In-memory batch caching for replay mode to simulate ideal data loading
  • Comprehensive test suite and documentation with example notebook
  • The tool helps users compare real vs. ideal data loading performance and
    identify optimization opportunities.

Authored-by: Albert Wolant [email protected]

Category:

New feature (non-breaking change which adds functionality)

Description:

  • Introduces a lightweight diagnostic tool for identifying data loading
    bottlenecks in PyTorch training pipelines.
  • This change adds the Loader Evaluator to the PyTorch DALI plugin, together
    with a Jupyter notebook tutorial, a documentation page, and tests.
  • The LoaderEvaluator class wraps a PyTorch DataLoader with performance
    monitoring and offers two operation modes: 'log' (normal iteration with
    metrics) and 'replay' (cached batches for ideal performance simulation).
  • PerformanceMetrics class for detailed performance tracking and bottleneck
    analysis
  • In-memory batch caching for replay mode to simulate ideal data loading
  • Comprehensive test suite and documentation with example notebook
  • The tool helps users compare real vs. ideal data loading performance and
    identify optimization opportunities; a minimal usage sketch follows below.
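
For illustration, here is a minimal usage sketch of the wrapper described above. It assumes the constructor arguments discussed later in this conversation (mode and num_cached_batches) and that LoaderEvaluator is importable from the new plugin package; the exact API may differ from the final implementation:

import torch
from torch.utils.data import DataLoader, TensorDataset
from nvidia.dali.plugin.pytorch.loader_evaluator import LoaderEvaluator

dataset = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,)))
dataloader = DataLoader(dataset, batch_size=32, num_workers=4)

# 'log' mode: iterate normally while per-batch timings are recorded.
logged_loader = LoaderEvaluator(dataloader, mode="log")
for images, labels in logged_loader:
    pass  # training step goes here

# 'replay' mode: a small number of batches is cached up front and then
# replayed, simulating ideal (near-zero-overhead) data loading.
replayed_loader = LoaderEvaluator(
    dataloader, mode="replay", num_cached_batches=len(dataloader) // 10
)
for images, labels in replayed_loader:
    pass  # training step goes here

Comparing the wall-clock time of the two runs shows how much of the iteration time is spent on data loading rather than on the training step itself.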

Additional information:

Affected modules and functionalities:

  • new module in the PyTorch plugin
  • new example
  • new test for it
  • new documentation page describing the overall idea

Key points relevant for the review:

  • overall idea and flow

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
      • test_pytorch_loader_evaluator.py
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
      • pytorch_data_loader_evaluator.rst
    • Jupyter
      • pytorch_data_loader_evaluator.ipynb
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: DALI-4299

@review-notebook-app
Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

@JanuszL JanuszL marked this pull request as draft December 5, 2025 10:39
@JanuszL JanuszL force-pushed the loader_Evaluator branch 2 times, most recently from 6766528 to a008b7c on December 5, 2025 10:45
@dali-automaton
Collaborator

CI MESSAGE: [39670512]: BUILD STARTED

@JanuszL
Contributor Author

JanuszL commented Dec 5, 2025

!build

@dali-automaton
Collaborator

CI MESSAGE: [39670654]: BUILD STARTED

@greptile-apps

greptile-apps bot commented Dec 5, 2025

Greptile Overview

Greptile Summary

This PR adds a new LoaderEvaluator diagnostic tool to the PyTorch plugin for identifying data loading bottlenecks in training pipelines.

Key Changes:

  • LoaderEvaluator class wraps PyTorch DataLoader with two modes: "log" (normal iteration with metrics) and "replay" (cached batches for ideal performance simulation)
  • Comprehensive test suite covering initialization, modes, metrics collection, and edge cases
  • Documentation includes RST pages and a Jupyter notebook tutorial demonstrating the bottleneck detection workflow

Implementation Notes:

  • The tool caches batches during construction in replay mode, then replays them to simulate zero-overhead data loading
  • Performance metrics include batch times, total time, and cache status
  • Thread-safe caching using threading.Lock

Minor Issue Found:

  • Unused variable batch_count in _replay_mode_iter() method

Confidence Score: 4/5

  • This PR is safe to merge - adds a new self-contained diagnostic tool with comprehensive tests and no changes to existing functionality.
  • Score of 4 reflects a well-structured new feature with good test coverage and documentation. Only minor issue is an unused variable that doesn't affect functionality.
  • Minor cleanup needed in dali/python/nvidia/dali/plugin/pytorch/loader_evaluator/loader.py (unused variable).

Important Files Changed

File Analysis

Filename | Score | Overview
dali/python/nvidia/dali/plugin/pytorch/loader_evaluator/loader.py | 4/5 | Core LoaderEvaluator class with caching and metrics collection. Contains one unused variable batch_count.
dali/test/python/test_pytorch_loader_evaluator.py | 5/5 | Comprehensive test suite covering log/replay modes, edge cases, and metrics collection.
docs/examples/frameworks/pytorch/loader_evaluator/pytorch_data_loader_evaluator.ipynb | 5/5 | Well-documented Jupyter notebook demonstrating bottleneck detection workflow.
docs/plugins/pytorch_data_loader_evaluator.rst | 5/5 | Detailed documentation explaining the tool's purpose, usage, and comparison with other profiling tools.

Sequence Diagram

sequenceDiagram
    participant User as Training Code
    participant LE as LoaderEvaluator
    participant DL as PyTorch DataLoader
    participant Cache as Batch Cache

    alt Log Mode
        User->>LE: __iter__()
        LE->>DL: iter(dataloader)
        loop For each batch
            LE->>DL: next()
            DL-->>LE: batch
            Note over LE: Record batch_time
            LE-->>User: yield batch
        end
        Note over LE: Set end_time
    else Replay Mode (Construction)
        User->>LE: __init__(mode="replay")
        LE->>DL: iter(dataloader)
        loop Cache batches
            DL-->>LE: batch
            LE->>Cache: append(batch)
        end
        Note over LE: cache_ready = True
    end

    alt Replay Mode (Iteration)
        User->>LE: __iter__()
        loop For original_length iterations
            LE->>Cache: get batch[i % cache_size]
            Cache-->>LE: cached batch
            Note over LE: Record batch_time
            LE-->>User: yield batch
        end
    end
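
To make the diagram concrete, here is a simplified, hypothetical sketch of the replay path it describes: the cache is filled during construction under a threading.Lock, and batches are then replayed by modular indexing for as many steps as the original loader would run. This is an illustration of the flow above, not the actual DALI source:

import threading
import time


class ReplaySketch:
    """Illustrative stand-in for the replay path of LoaderEvaluator."""

    def __init__(self, dataloader, num_cached_batches):
        self._lock = threading.Lock()
        self._cache = []
        self._original_length = len(dataloader)
        self.batch_times = []
        with self._lock:  # thread-safe cache fill during construction
            for i, batch in enumerate(dataloader):
                if i >= num_cached_batches:
                    break
                self._cache.append(batch)

    def __iter__(self):
        # Replay cached batches for the original number of iterations.
        for i in range(self._original_length):
            start = time.perf_counter()
            batch = self._cache[i % len(self._cache)]
            self.batch_times.append(time.perf_counter() - start)
            yield batch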

@greptile-apps greptile-apps bot left a comment

Additional Comments (3)

  1. dali/python/nvidia/dali/plugin/pytorch/loader_evaluator/metrics.py, line 137-145 (link)

    logic: json.dump() will fail with TypeError because get_summary() returns numpy types (np.float64 from np.mean(), np.std(), etc.) which are not JSON serializable.

    Consider converting numpy types to native Python types before serialization, e.g.:

    def save_metrics(self, filename: str):
        """Save metrics to a JSON file."""
        metrics = self.get_summary()
        bottlenecks = self.identify_bottlenecks()
    
        # Convert numpy types to native Python types for JSON serialization
        def convert_numpy(obj):
            if isinstance(obj, np.ndarray):
                return obj.tolist()
            elif isinstance(obj, (np.floating, np.integer)):
                return obj.item()
            elif isinstance(obj, dict):
                return {k: convert_numpy(v) for k, v in obj.items()}
            elif isinstance(obj, list):
                return [convert_numpy(i) for i in obj]
            return obj
    
        output = convert_numpy({"metrics": metrics, "bottlenecks": bottlenecks})
    
        with open(filename, "w") as f:
            json.dump(output, f, indent=2)
    
        print(f"Performance metrics saved to: {filename}")
  2. dali/python/nvidia/dali/plugin/pytorch/loader_evaluator/metrics.py, line 100-102 (link)

    style: Potential division by zero if all batch_times are 0 (edge case). A possible guard is sketched after this list.

  3. dali/python/nvidia/dali/plugin/pytorch/loader_evaluator/__init__.py, line 1 (link)

    style: Missing Apache 2.0 license header that other files in this repository have. Same applies to loader.py and metrics.py.
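
Regarding comment 2 above, a possible guard for the division, using hypothetical names rather than the actual ones in metrics.py:

def throughput(num_samples: int, batch_times: list) -> float:
    """Samples per second; returns 0.0 when no time was recorded, avoiding ZeroDivisionError."""
    total_time = sum(batch_times)
    return num_samples / total_time if total_time > 0 else 0.0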

9 files reviewed, 3 comments


@mdabek-nvidia mdabek-nvidia self-assigned this Dec 5, 2025
- Introduces a lightweight diagnostic tool for identifying data loading
  bottlenecks in PyTorch training pipelines.
- This change adds the Loader Evaluator to the PyTorch DALI plugin, together
  with a Jupyter notebook tutorial, a documentation page, and tests.
- The LoaderEvaluator class wraps a PyTorch DataLoader with performance
  monitoring and offers two operation modes: 'log' (normal iteration with
  metrics) and 'replay' (cached batches for ideal performance simulation).
- PerformanceMetrics class for detailed performance tracking and bottleneck
  analysis
- In-memory batch caching for replay mode to simulate ideal data loading
- Comprehensive test suite and documentation with example notebook
- The tool helps users compare real vs. ideal data loading performance and
  identify optimization opportunities.

Authored-by: Albert Wolant <[email protected]>
Signed-off-by: Janusz Lisiecki <[email protected]>
@JanuszL
Contributor Author

JanuszL commented Dec 5, 2025

@greptileai please review again.

@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. dali/python/nvidia/dali/plugin/pytorch/loader_evaluator/__init__.py, line 16 (link)

    syntax: Missing metrics.py module. This import will raise ModuleNotFoundError: No module named 'nvidia.dali.plugin.pytorch.loader_evaluator.metrics' at runtime.

    Either create the missing metrics.py file with the PerformanceMetrics class, or remove this import and the corresponding __all__ entry.
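
    For reference, a hypothetical sketch of what the __init__.py presumably contains, with names taken from this review (the actual file may differ); the second import is the one that fails while metrics.py is absent:

    from nvidia.dali.plugin.pytorch.loader_evaluator.loader import LoaderEvaluator
    # Raises ModuleNotFoundError while metrics.py does not exist:
    from nvidia.dali.plugin.pytorch.loader_evaluator.metrics import PerformanceMetrics

    __all__ = ["LoaderEvaluator", "PerformanceMetrics"]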

8 files reviewed, 1 comment


@JanuszL JanuszL force-pushed the loader_Evaluator branch 3 times, most recently from c414911 to b7d6958 on December 5, 2025 12:43
Signed-off-by: Janusz Lisiecki <[email protected]>
@JanuszL
Contributor Author

JanuszL commented Dec 5, 2025

!build

@dali-automaton
Collaborator

CI MESSAGE: [39675479]: BUILD STARTED

@JanuszL
Contributor Author

JanuszL commented Dec 5, 2025

@greptileai please review again.

@greptile-apps greptile-apps bot left a comment

Additional Comments (2)

  1. dali/python/nvidia/dali/plugin/pytorch/loader_evaluator/loader.py, line 85 (link)

    style: Type hint assumes specific batch structure but DataLoader can return any type. Should this support generic batch types instead of assuming Tuple[TorchTensor, TorchTensor]? (One possible generic signature is sketched after this list.)

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  2. dali/python/nvidia/dali/plugin/pytorch/loader_evaluator/loader.py, line 146-163 (link)

    style: Double break logic could be simplified with a single loop condition

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
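
Regarding comment 1 above, one way the signature could be made generic over the batch type, shown as a hypothetical sketch rather than the project's actual code:

from typing import Any, Generic, Iterator, TypeVar

BatchT = TypeVar("BatchT")  # whatever the wrapped DataLoader yields


class GenericLoaderWrapper(Generic[BatchT]):
    def __init__(self, dataloader: Any) -> None:
        self._dataloader = dataloader

    def __iter__(self) -> Iterator[BatchT]:
        # Pass batches through unchanged, regardless of their structure.
        yield from self._dataloader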

8 files reviewed, 2 comments


@JanuszL
Contributor Author

JanuszL commented Dec 5, 2025

@greptileai please review again.

Signed-off-by: Janusz Lisiecki <[email protected]>
@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. dali/python/nvidia/dali/plugin/pytorch/loader_evaluator/loader.py, line 143 (link)

    style: Unused variable batch_count - initialized but never used.

8 files reviewed, 1 comment


@JanuszL
Contributor Author

JanuszL commented Dec 5, 2025

!build

@dali-automaton
Collaborator

CI MESSAGE: [39676718]: BUILD STARTED

Collaborator

@mdabek-nvidia mdabek-nvidia Dec 5, 2025


Line #10.    import sys

I think that sys is not used anywhere.


Reply via ReviewNB

Contributor Author

Fixed

Collaborator

@mdabek-nvidia mdabek-nvidia Dec 5, 2025

Line #1.    # Wrap your dataloader with LoaderEvaluator (this is the only change!)

Nitpick: there is also a change below, where dataloader became monitored_dataloader.


Reply via ReviewNB

Contributor Author

Do you think it should be:

dataloader = LoaderEvaluator(
    dataloader, mode="replay", num_cached_batches=len(dataloader) // 10
)

instead?

Collaborator

Yes, this would make it clear that we are just "updating" the dataloader.

Contributor Author

Fixed

Signed-off-by: Janusz Lisiecki <[email protected]>
Signed-off-by: Janusz Lisiecki <[email protected]>
@dali-automaton
Collaborator

CI MESSAGE: [39670654]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [39676718]: BUILD FAILED

@JanuszL
Contributor Author

JanuszL commented Dec 5, 2025

!build

@dali-automaton
Collaborator

CI MESSAGE: [39693133]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [39693133]: BUILD FAILED
