Performance insights report (Beta)
The performance insights report highlights performance bottlenecks: whether the model is bound by the Host or by the RDU, which RDU sections are critical, and which stages within those sections are critical. It also suggests initial steps for a deeper analysis of each issue.
Locate the report
The performance insights report is available in JSON format at the following location in your output folder:
/reports/analysis/summary.json
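As a minimal sketch of reading the file programmatically, the following Python helper joins an output folder with the report path given above and parses the JSON. The function name `load_insights_report` and its `output_dir` parameter are hypothetical names chosen for this example and are not part of SambaTune.

```python
import json
from pathlib import Path


def load_insights_report(output_dir):
    """Return the parsed performance insights report as a dict.

    `output_dir` is assumed to be the output folder that contains
    reports/analysis/summary.json (hypothetical parameter name).
    """
    report_path = Path(output_dir) / "reports" / "analysis" / "summary.json"
    with report_path.open(encoding="utf-8") as f:
        return json.load(f)
```

Calling `load_insights_report("<your_output_folder>")` raises `FileNotFoundError` if the report has not been generated, which makes a missing report easy to detect.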
Read the report
{
  "Overview": [
    "Host takes 66.5 % of the total latency.",
    "RDU takes 33.5 % of the total latency.",
    "This model could be Host bound."
  ],
  "Host": [
    {
      "Top 3 time-consuming processes": [
        "SAMBA takes 42.2 % of the host latency. You may take a look at PYINSTRUMENT DATA tab in SambaTune UI OVERVIEW page.",
        "XFER takes 35.6 % of the host latency. You may check tensors in <path_to_report>.",
        "RUN SETUP takes 13.3 % of the host latency."
      ]
    }
  ],
  "RDU": [
    {
      "chip 0, section 2 takes 56.3 % of the RDU latency. The total DDR bandwidth is 95.41 GB/s.": {
        "Analysis": [
          "Detailed information can be found in <path_to_report>."
        ],
        "Top 3 time-consuming stages": [
          "Detailed information can be found in <path_to_report>."
        ]
      }
    },
    {
      "chip 0, section 1 takes 28.8 % of the RDU latency. The total DDR bandwidth is 88.92 GB/s.": {
        "Analysis": [
          "Detailed information can be found in <path_to_report>."
        ],
        "Top 3 time-consuming stages": [
          "stage 140 takes 19.83 % of the section latency. Template names are dlrm__top_mlp__2__linear_bwd_loss_grad_b. Node names are tlir.Buffer1228, tlir.Linear1558, tlir.Buffer1235.",
          "stage 132 takes 14.28 % of the section latency. Template names are dlrm__top_mlp__6__linear_bwd_loss_grad_b. Node names are tlir.Buffer1210, tlir.Linear1548, tlir.Buffer1217.",
          "stage 136 takes 5.83 % of the section latency. Template names are dlrm__top_mlp__4__linear_bwd_loss_grad_b. Node names are tlir.Buffer1219, tlir.Linear1553, tlir.Buffer1226.",
          "Detailed information can be found in <path_to_report>."
        ]
      }
    },
    {
      "chip 0, section 0 takes 15.0 % of the RDU latency. The total DDR bandwidth is 93.36 GB/s.": {
        "Analysis": [
          "Detailed information can be found in <path_to_report>."
        ],
        "Top 3 time-consuming stages": [
          "stage 18 takes 3.72 % of the section latency. Node names are tlir.Buffer950, tlir.Buffer967.",
          "stage 3 takes 3.72 % of the section latency. Node names are tlir.Buffer965, tlir.Buffer967.",
          "stage 25 takes 3.72 % of the section latency. Node names are tlir.Buffer943, tlir.Buffer967.",
          "Detailed information can be found in <path_to_report>."
        ]
      }
    }
  ]
}
The report is structured into three parts: Overview, Host, and RDU.
- The Overview shows the percentage of total latency spent on the Host and on the RDU, and states which of the two is the likely bottleneck.
- The Host section lists the top 3 time-consuming processes, each expressed as a percentage of total Host latency. Where applicable, it suggests a next step, such as reviewing the pyinstrument data in the SambaTune UI (pyinstrument is a call-stack profiler for Python) or checking tensor data.
- The RDU section lists the top 3 time-consuming sections, each with its share of RDU latency and its total DDR bandwidth. For each section, it provides an analysis and the top 3 time-consuming stages, along with their template names and node names.
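Because the report stores its figures inside sentences, extracting numbers requires some text parsing. The following Python sketch pulls the Host and RDU latency shares from the Overview and the per-section figures from the keys of the RDU entries. The regular expressions are inferred from the wording of the example report above and are an assumption, not a documented schema, so they may need adjusting if the wording differs in your version. The names `summarize`, `SHARE_RE`, `SECTION_RE`, and `sample_report` are hypothetical.

```python
import re

# Patterns inferred from the example report's wording (assumption, not a spec).
SHARE_RE = re.compile(r"^(Host|RDU) takes ([\d.]+) % of the total latency")
SECTION_RE = re.compile(
    r"chip (\d+), section (\d+) takes ([\d.]+) % of the RDU latency\. "
    r"The total DDR bandwidth is ([\d.]+) GB/s"
)


def summarize(report):
    """Return (shares, sections) extracted from a parsed report dict."""
    shares = {}
    for line in report.get("Overview", []):
        m = SHARE_RE.match(line)
        if m:
            shares[m.group(1)] = float(m.group(2))

    sections = []
    for entry in report.get("RDU", []):
        # Each RDU entry is a dict whose single key is the summary sentence.
        for key in entry:
            m = SECTION_RE.match(key)
            if m:
                sections.append({
                    "chip": int(m.group(1)),
                    "section": int(m.group(2)),
                    "rdu_latency_pct": float(m.group(3)),
                    "ddr_bandwidth_gb_s": float(m.group(4)),
                })
    return shares, sections


# A trimmed copy of the example report, used only for illustration.
sample_report = {
    "Overview": [
        "Host takes 66.5 % of the total latency.",
        "RDU takes 33.5 % of the total latency.",
        "This model could be Host bound.",
    ],
    "RDU": [
        {
            "chip 0, section 2 takes 56.3 % of the RDU latency. "
            "The total DDR bandwidth is 95.41 GB/s.": {
                "Analysis": [],
                "Top 3 time-consuming stages": [],
            }
        },
    ],
}

shares, sections = summarize(sample_report)
```

Lines that do not match either pattern, such as "This model could be Host bound.", are skipped rather than treated as errors, so the sketch tolerates additional free-text entries.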