PBSNet: A Pseudo Bilateral Segmentation Network for Real-Time Semantic Segmentation

"This work proposes PBSNet for real-time semantic segmentation. While certain components of the overall model architecture present some novelty, it's not clear to this reviewer the key problem this work is trying to address compared to state of the art models. More clarity on this in the introduction section will be helpful for the readers. Additionally the proposed original contributions are not technically justified in the model description. Eg. 1) The authors mention that the proposed bi-directional vertical propagation structure improves communication between low-level and high-level information, but in the description it's not clear how. 2) It's mentioned that the proposed approach improves the segmentation accuracy of similar targets, but nothing is later described to demonstrate this. In Section 3 that has the description of the proposed approach in 3.1 - 3.4 while the technical descriptions are well done, it's not very clear how each of the modules impact the outcome of the subsequent modules. Results, generally are limited but convincing enough. However the shown examples don't fully substantiate the claims made by the authors original contributions. It would be good to take 1-2 examples and walk through the impact of the each of the modules of the PBSnet to understand their significance and individual impact. Currently while the results show improvement its not really clear to readers which aspects of the PBSnet are really contributing to them. In this manuscript, authors have proposed a pseudo bilateral segmentation network (PBSNet) that can extract rich spatial and semantic features from a single path, without adding computational cost or time consumption. 
Their scheme uses a SEM to explore the relationship between high-level semantic features, an interchange module (IM) to enhance feature representation through bi-directional vertical propagation and adaptive spatial attention, and an attention fusion module (AFM) to aggregate multi-scale features and produce the final segmentation prediction. They claim that their approach outperforms state-of-the-art methods. This is interesting work, and the manuscript is generally well written. However, it needs revision based on the following comments to improve the quality and visibility of the work.

1. The Fig. 1 caption is inappropriate, especially "Some segmentation results...".
2. Fig. 2 is confusing. Could you use different colors for the 3x3 conv maps? Expand all abbreviations used (e.g., GAPS, SAB, CAB) in the caption.
3. Expand all abbreviations that appear in the figures in every figure caption.
4. Is it possible for the approach to run on video, streaming, or real-time imaging scenarios? Currently the authors have shown its capability only on static images.
5. I believe the authors should mention the potential applicability of their approach to other scenarios, such as medical imaging, for example ophthalmic OCT imaging, where automatic and precise segmentation of retinal layers could be assessed using their approach [1-2]. The authors may add these comments to the text with the suggested references. [(1) https://doi.org/10.1167/tvst.11.8.11 (2) https://doi.org/10.1038/s41598-021-95320-z]
6. Could the authors explain the specific need for the Adaptive Spatial Attention Module in Fig. 4?"