Road segmentation


High-resolution radar road segmentation using weakly supervised learning




We thank H. Damari for assembling the dataset and H. Omer, Z. Iluz, Y. Avargel, L. Korkidi, M. Raifel, K. Twizer, P. Fidelman and N. Orr for their insights and advice.

Author information


  1. Faculty of Engineering and the Institute for Nanotechnology and Advanced Materials, Bar-Ilan University, Ramat-Gan, Israel

    Itai Orr & Zeev Zalevsky

  2. Wisense Technologies, Tel Aviv, Israel

    Itai Orr & Moshik Cohen


I.O. conceived the study and conducted training. All authors contributed to the design of the study, interpreting the results and writing the manuscript.

Corresponding author

Correspondence to Itai Orr.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Nature Machine Intelligence thanks Zdenka Babić and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Additional sample results from the validation dataset.

Blurry images were caused by rain droplets on the camera lens. Left column shows range-Doppler maps in dB. Middle column shows the suggested radar-based DNN prediction overlaid on a camera image; values are confidence levels on a scale of (0,1). Right column shows the corresponding camera pseudo label generated from a camera-based DNN; values are confidence levels on a scale of (0,1).

Extended Data Fig. 2 Camera label projection to RADAR coordinate frame.

A sample urban scene from the validation dataset showing the camera pseudo label projected onto Cartesian coordinates. (a) Camera image overlaid with a camera pseudo label; values are confidence levels on a scale of (0,1). (b) The associated radar data in Cartesian representation with values displayed in dB. (c) Radar data in Cartesian coordinates with values displayed in dB; overlaid on top (in black) is the projected camera pseudo label. This sample frame further illustrates the lack of distinguishable features associated with common road delimiters such as sidewalks and curbstones. Note that the projected label minimum range is 4.5 m due to the camera’s ground clearance.

Extended Data Fig. 3 Filter correlation heatmap.

Comparison between conventional CFAR and the suggested perception filter. The results were averaged over the validation dataset. Y axis represents the conventional CFAR threshold in dB and X axis represents the perception filter on a scale of (0,1). High correlation between the two filters would have created a diagonal heatmap with IoU values close to 1. However, these results show low correlation with low IoU values between the two methods which further suggests the perception filter eliminates data based on context as well as SNR.

Extended Data Fig. 4 Training methodology.

A camera-based DNN is trained on a publicly available dataset and used to create pseudo labels for a RADAR-based DNN. The radar and camera data are temporally synced and spatially overlapped. Radar pre-processing includes windowing and 2D FFT on the sweeps and samples dimensions to create a complex 2D array of range-Doppler maps. The radar model is trained using segmentation loss to identify the drivable area.
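The radar pre-processing step in this caption (windowing followed by a 2D FFT over the sweeps and samples dimensions) can be illustrated with a minimal NumPy sketch. It assumes one receive channel stored as an array of shape (sweeps, samples) and a Hann window on both axes; the caption does not say which window was used, and the function names `range_doppler_map` and `to_db` are hypothetical.

```python
import numpy as np

def range_doppler_map(frame: np.ndarray) -> np.ndarray:
    """Convert one radar frame of shape (n_sweeps, n_samples) into a
    complex range-Doppler map of the same shape.

    Assumption: Hann windows on both axes (the source only says "windowing").
    """
    n_sweeps, n_samples = frame.shape
    # Separable 2D window: outer product of the per-axis Hann windows.
    window = np.outer(np.hanning(n_sweeps), np.hanning(n_samples))
    # FFT over samples (fast time) resolves range;
    # FFT over sweeps (slow time) resolves Doppler.
    rd = np.fft.fft2(frame * window)
    # Move zero Doppler to the centre of the sweep axis for display.
    return np.fft.fftshift(rd, axes=0)

def to_db(rd: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Magnitude in dB; eps avoids log(0) for empty cells."""
    return 20.0 * np.log10(np.abs(rd) + eps)
```

The complex output can be fed to a network as is (for example as real and imaginary channels), while `to_db` is only needed for plotting maps like those in the left column of Extended Data Fig. 1.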

Extended Data Fig. 5 Model architecture.

Based on an encoder-decoder U-Net architecture with a channel attention mechanism to encourage learnable cross-channel correlations.

About this article


Cite this article

Orr, I., Cohen, M. & Zalevsky, Z. High-resolution radar road segmentation using weakly supervised learning. Nat Mach Intell 3, 239–246 (2021).



This benchmark has been created in collaboration with Jannik Fritsch and Tobias Kuehnl from Honda Research Institute Europe GmbH. The road and lane estimation benchmark consists of 289 training and 290 test images. It contains three different categories of road scenes:

  • uu - urban unmarked (98/100)
  • um - urban marked (95/96)
  • umm - urban multiple marked lanes (96/94)
  • urban - combination of the three above

Ground truth has been generated by manual annotation of the images and is available for two different road terrain types: road - the road area, i.e, the composition of all lanes, and lane - the ego-lane, i.e., the lane the vehicle is currently driving on (only available for category "um"). Ground truth is provided for training images only.

We evaluate road and lane estimation performance in the bird's-eye-view space. For the classical pixel-based evaluation we use established measures as discussed in our ITSC 2013 publication. MaxF: Maximum F1-measure, AP: Average precision as used in PASCAL VOC challenges, PRE: Precision, REC: Recall, FPR: False Positive Rate, FNR: False Negative Rate (the four latter measures are evaluated at the working point MaxF), F1: F1 score, HR: Hit rate. For the novel behavior-based evaluation a corridor with the vehicle width (2.2m) is fitted to the lane estimation processing result and evaluation is performed for 3 different distance values: 20 m, 30 m, and 40 m. We refer to our ITSC 2013 publication for more details.
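As a rough illustration of the pixel-based measures listed above, the sketch below sweeps a threshold over a confidence map and reports precision, recall, FPR and FNR at the working point with the highest F1 (MaxF). It is not the benchmark devkit: the function name `max_f_measure`, the 101 evenly spaced thresholds and the plain (non bird's-eye-view) pixel comparison are all assumptions made for the example, and AP is not computed.

```python
import numpy as np

def max_f_measure(confidence, ground_truth, n_thresholds=101):
    """Return MaxF and the PRE/REC/FPR/FNR values at that working point.

    confidence: per-pixel road confidence in [0, 1]
    ground_truth: binary road mask of the same shape
    """
    conf = np.asarray(confidence, dtype=float).ravel()
    gt = np.asarray(ground_truth).astype(bool).ravel()
    best = {"MaxF": -1.0}
    for t in np.linspace(0.0, 1.0, n_thresholds):
        pred = conf >= t
        tp = int(np.sum(pred & gt))
        fp = int(np.sum(pred & ~gt))
        fn = int(np.sum(~pred & gt))
        tn = int(np.sum(~pred & ~gt))
        pre = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
        if f1 > best["MaxF"]:  # keep the first threshold reaching the maximum
            best = {
                "MaxF": f1, "threshold": float(t), "PRE": pre, "REC": rec,
                "FPR": fp / (fp + tn) if fp + tn else 0.0,
                "FNR": fn / (fn + tp) if fn + tp else 0.0,
            }
    return best
```

For a map with confidences `[0.9, 0.2, 0.8, 0.1]` and ground truth `[1, 1, 0, 0]`, the best working point keeps both road pixels and one false positive, giving a MaxF of 0.8.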
IMPORTANT NOTE: On 09.02.2015 we improved the accuracy of the ground truth and re-calculated the results for all methods. If you downloaded the files prior to 09.02.2015, please download the devkit and the dataset with the improved ground truth for training again. Please consider reporting these new numbers for all future submissions. The last leaderboards right before the changes can be found here!

Important Policy Update: As more and more non-published work and re-implementations of existing work are submitted to KITTI, we have established a new policy: from now on, only submissions with significant novelty that are leading to a peer-reviewed paper in a conference or journal are allowed. Minor modifications of existing algorithms or student research projects are not allowed. Such work must be evaluated on a split of the training set. To ensure that our policy is adopted, new users must detail their status, describe their work and specify the targeted venue during registration. Furthermore, we will regularly delete all entries that are 6 months old but are still anonymous or do not have a paper associated with them. For conferences, 6 months is enough to determine if a paper has been accepted and to add the bibliography information. For longer review cycles, you need to resubmit your results.


Segmentation of Roads in Aerial Images.

The onset of Convolutional Neural Networks (C.N.N.s) was a breakthrough in the field of computer vision as they radically changed the way computers “looked at” images. Machine vision has come a long way from where it began, but it is still at the bleeding edge of research today. Semantic Segmentation is the process of attributing every pixel in an image to a certain class. This class can be a dog, a car, or in our case roads.

The combined length of all the roads on our planet is about 33.5 million kilometres. Let me rephrase that — If we could arrange all the roads into a straight road, then we would have covered a quarter of the distance between Earth and Sun. Manually annotating each strand of the road is a Herculean task, if not an impossible one. This is where Deep Learning comes into the picture, and this is what we will accomplish through this project. To put it simply, we will train a Deep Learning model to identify roads in aerial images.

You can find a link to the source code at the end of this article. Please refer to the table of contents if you want to discern the scope of this article.
All the resources used in the project are publicly available, so I recommend that you follow along. This article covers both the practical and theoretical aspects of this project, and I hope that this will be an enjoyable learning experience for you.

  1. Data
    i. The type of data we need.
    ii. The dataset
    iii. Downloading the dataset.
  2. Preprocessing
  3. Neural Modelling
    i. About F.C.N
    ii. Network Architecture
  4. Training the Model
    i. Loss Function and Optimiser
    ii. Callbacks
    iii. Training the Model
  5. Testing the model
  6. Scopes of improvement
  7. Conclusion
  8. Links and References

Let’s get started.

Different types of machine learning models require different kinds of data, and the more data, the merrier. More data to train on means that our model will be able to learn more underlying patterns, and it will be able to distinguish outliers better as well.

i. The type of data we need.

Usually, for segmentation challenges, we need images along with their respective (preferably hand-drawn) maps. For this project, we require aerial images, along with their segmentation maps, where only the roads are indicated. The notion is that our model will focus on the white pixels which represent roads, and it will learn a correlation between the input image, and the output maps.

ii. The dataset

For this project, we will be using the Massachusetts Roads Dataset. This dataset contains 1171 aerial images, along with their respective maps. They are 1500 x 1500 pixels in size and are in .tiff format. Please have a look at the following sample.

iii. Downloading the dataset.

You can start by cloning my GitHub repo and then use the script in the Src folder to download the dataset. If you have an unreliable internet connection that keeps fluctuating, please use Academic Torrents to acquire the dataset. You can find the dataset here.

The quality of data hugely impacts our model’s performance, and therefore pre-processing is an important step towards making sure that our model receives the data in the right form. I tried multiple pre-processing techniques, and the following methods yielded the best results:

i. Hand-picking: There are a few images (~50) in the dataset where a big chunk of the aerial image is missing. The majority of each such image consists of white pixels, yet its segmentation map is complete. Since this can throw the model off, I manually removed them.

ii. Cropping instead of resizing: Training our model on large images is not only resource-intensive but is bound to take a lot of time as well. Resizing images to lower dimensions can be an answer, but resizing comes at a cost. Regardless of the interpolation method we choose while resizing, we end up losing information.

Therefore, we will crop out smaller, 256 x 256 images from the large images. Doing so leaves us with about 22,000 useful images and maps.
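A minimal sketch of the cropping step, assuming non-overlapping tiles taken from the top-left corner and a discarded partial strip at the right and bottom edges (the article does not say how the crops were laid out; `crop_tiles` is a hypothetical helper name):

```python
import numpy as np

def crop_tiles(image, tile=256):
    """Split an (H, W) or (H, W, C) array into non-overlapping
    tile x tile crops, dropping any partial strip at the edges."""
    h, w = image.shape[:2]
    tiles = []
    for top in range(0, h - tile + 1, tile):
        for left in range(0, w - tile + 1, tile):
            tiles.append(image[top:top + tile, left:left + tile])
    return tiles

# A 1500 x 1500 image yields a 5 x 5 grid of 256 x 256 crops.
dummy = np.zeros((1500, 1500, 3), dtype=np.uint8)
crops = crop_tiles(dummy)
```

Applying the same function to an image and to its map keeps the two aligned, because both are cut at identical offsets.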

iii. Thresholding and binarizing the maps: Grayscale images are single-channel images that contain varying shades of grey. There are 256 possible grey intensity values that each pixel can take up with 0 representing a black pixel, and 255 representing a white one. In Semantic segmentation, we essentially predict this value for each pixel. Rather than giving 256 discrete options for the model to choose from, we will provide only two. As you would have noticed, our maps have just two colours: black and white. While the white pixels represent roads, the black pixels represent everything that isn’t a road.

A closer look at our dichromatic segmentation maps reveals that there are a lot of grey pixels when all we want are black and white. We will start by thresholding the pixel values at 100, so that all pixels with a value above the threshold are assigned the maximum value of 255 and all other pixels are assigned zero. Doing so ensures that there are only two unique pixel values in the segmentation masks. Now, 0 to 255 is a wide range. By dividing all the maps by 255, we normalize them and end up with only two values: 0 and 1.
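The thresholding and normalisation described here fit in a few lines of NumPy. This is a sketch rather than the code used in the project; `binarize_mask` is a hypothetical name, and the strict "greater than 100" comparison follows the wording above.

```python
import numpy as np

def binarize_mask(mask, threshold=100):
    """Map a grayscale mask with values in 0..255 to a float mask in {0, 1}."""
    mask = np.asarray(mask)
    # Step 1: threshold to pure black (0) and white (255).
    binary = np.where(mask > threshold, 255, 0)
    # Step 2: normalise so that road pixels are 1.0 and the rest 0.0.
    return (binary / 255).astype(np.float32)

example = np.array([[0, 100, 101],
                    [180, 255, 37]], dtype=np.uint8)
result = binarize_mask(example)
```

Here the grey values 101 and 180 become road pixels, while 37 and 100 fall back to background.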

iv. Packaging (Optional): I trained my model on Google Colab. A big, hearty thanks to Google for providing resources to thousands of Data Scientists and Machine Learning Engineers.

I have noticed that supplying images to the model from Google Drive during training (using ImageDataGenerator) consumes extra time. This is not the case if you train the model on your own system, as loading files is much faster there. I packaged the entire image and map arrays into two separate HDF5 files (using h5py) and loaded them into RAM. Doing so sped up the training process.

Now that we have dealt with the data, it’s time to start modelling our neural network. To accomplish our segmentation task, we will be using a Fully Convolutional Network. These kinds of networks are mostly composed of convolutional layers, and unlike the more traditional neural networks, fully connected layers are absent.

i. About F.C.N

U-Net, the fully convolutional network used in this project, was developed for biomedical image segmentation at the Computer Science Department of the University of Freiburg, Germany [1]. It was later realised that the scope of these networks extends well beyond the medical realm. These networks can perform multiclass segmentation of any kind of object, be it people, cars or even buildings.

ii. Network Architecture

This project uses U-net, a fully convolutional neural network which is quite intuitively named. This network takes a 256x256 multichannel image and outputs a single-channel map of the same dimension.

A U-net has two parts — The encoder or the downsampling section, and the decoder or the up-sampling section. Just have a look at the following image.

Encoder: Also known as the downsampling section. This segment uses convolutional layers to learn the spatial features in an image and pooling layers to downsample it. This part is responsible for learning about the objects in an image; in this case, it learns what a road looks like and how to detect it. I added dropout layers, which randomly ignore neurons to prevent overfitting, and BatchNormalization, which normalises the inputs to each layer so that it is less sensitive to shifts in the output of the previous one.

Decoder: It is a.k.a. the Upsampling segment. Continuous pooling operations result in the loss of spatial information of the image. The model does know about the contents of the image, but it doesn’t know where it is. The whole idea behind the decoder network is to reconstruct the spatial data using the feature maps which we extracted in the previous step. We use Transposed convolutions to upsample the image. Unlike plain interpolation, Conv2DTranspose has learnable parameters.

Skip Connections: Direct connections between the layers in the encoder segment to the layers in the decoder section are called Skip connections. They are called skip connections because they bridge two layers while ignoring all the intermediate layers. Skip connections provide the spatial information to the upsampling layers and help them reconstruct the image and “put things into place” (quite literally).

Please use the following code to replicate the U-net.

Our U-net in all its shining glory.

i. Loss Function and Hyper-parameters

At a pixel level, this segmentation challenge can be considered a binary classification problem where the model classifies whether each pixel is white (road) or black (not road). But we need a balanced dataset to facilitate proper segmentation, and since the black pixels in these images greatly outnumber the white ones, we have an imbalanced dataset.

There are a few different approaches to deal with the imbalanced data issue. In this challenge, we will use the Soft Dice Loss as it is based on the Dice Coefficient. Dice Coefficient is the measure of overlap between the predicted sample and the ground truth sample, and this value ranges between 0 and 1. Where 0 represents no overlap and 1 represents complete overlap.

Soft Dice Loss is simply 1 − Dice Coefficient; this is done to create a minimizable loss function [2]. Please have a look at the following code for Dice Loss.

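# Assumes the Keras backend is available as K.
from tensorflow.keras import backend as K

def dice_coef(y_true, y_pred, smooth=1):
    # Flatten both masks so the overlap is computed over every pixel.
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)

    intersection = K.sum(y_true_f * y_pred_f)
    # smooth keeps the ratio defined when both masks are empty.
    dice = (2. * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)
    return dice

def soft_dice_loss(y_true, y_pred):
    return 1 - dice_coef(y_true, y_pred)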

As you can see, we are using a parameter called smooth, which has a default value of 1. By adding 1 to both the numerator and the denominator, we ensure that a division by zero never occurs.
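To see the effect of the smoothing term on concrete numbers, here is a NumPy re-statement of the same formula. It is only an illustration on toy arrays (`dice_coef_np` is a hypothetical name), not a replacement for the Keras version used during training.

```python
import numpy as np

def dice_coef_np(y_true, y_pred, smooth=1.0):
    """Dice coefficient with the same smoothing term as the Keras version."""
    y_true_f = np.asarray(y_true, dtype=float).ravel()
    y_pred_f = np.asarray(y_pred, dtype=float).ravel()
    intersection = np.sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (np.sum(y_true_f) + np.sum(y_pred_f) + smooth)

# Two road pixels in the label, one of them predicted:
# (2*1 + 1) / (2 + 1 + 1) = 0.75
partial = dice_coef_np([[1, 1, 0, 0]], [[1, 0, 0, 0]])

# A crop without any road, predicted as empty:
# (0 + 1) / (0 + 0 + 1) = 1.0 instead of 0/0
empty = dice_coef_np(np.zeros((2, 2)), np.zeros((2, 2)))

loss = 1.0 - partial  # soft Dice loss for the first case
```

The second case is the one that matters for crops containing no road at all: without the smoothing term the ratio would be undefined, whereas with it a correct all-black prediction scores a perfect 1.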

Accuracy Metric: Accuracy metrics tell us about the correctness of the generated segmentation maps. We will be using the Jaccard Index, aka Intersection over Union, to tell us how accurate the generated maps are. As the name suggests, Intersection over Union is the measure of the correctness of the segmentation maps. The numerator is the intersection between the predicted map and the ground truth label, while the denominator is the total area of both the ground truth label and segmentation map (calculated using Union operation). The following code snippet is used to calculate the Jaccard Index.

import tensorflow as tf

def IoU(y_pred, y_true):
    # Per-image intersection and union, summed over height and width.
    I = tf.reduce_sum(y_pred * y_true, axis=(1, 2))
    U = tf.reduce_sum(y_pred + y_true, axis=(1, 2)) - I
    return tf.reduce_mean(I / U)
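The same computation on a toy batch, written in NumPy so that the numbers can be checked by hand. The `eps` term is an addition for this sketch only: it guards against a zero union when both the prediction and the label are empty, a case the snippet above does not handle. `iou_np` is a hypothetical name.

```python
import numpy as np

def iou_np(y_pred, y_true, eps=1e-7):
    """Mean IoU over a batch of binary masks with shape (batch, H, W)."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    inter = np.sum(y_pred * y_true, axis=(1, 2))
    union = np.sum(y_pred + y_true, axis=(1, 2)) - inter
    return float(np.mean(inter / (union + eps)))

pred = np.array([[[1, 1], [0, 0]],    # image 1: two pixels predicted
                 [[0, 1], [0, 1]]])   # image 2: two pixels predicted
true = np.array([[[1, 0], [0, 0]],    # image 1: one true pixel  -> IoU 1/2
                 [[0, 1], [0, 1]]])   # image 2: identical masks -> IoU 1
batch_iou = iou_np(pred, true)        # approximately (0.5 + 1.0) / 2
```

Note that with this guard an empty prediction for an empty label scores 0 rather than producing NaN; whether that case should count as a hit or a miss is a design choice.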

We compile the model using Adam as the optimizer. We will start with a learning rate of 0.00001 and we will set it to run for 100 epochs. We use Soft Dice Loss as the loss function and the Jaccard Index as the accuracy metric.

ii. Callbacks

Functions that can be invoked at set points during the training process are called callbacks. In this project, we will be using four callbacks:

  1. ModelCheckpoint: Monitors validation loss and saves the weights of the model with the lowest validation loss.
  2. EarlyStopping: Monitors the validation loss and stops the training process if the validation loss does not decrease after a certain number of epochs.
  3. ReduceLROnPlateau: Monitors Validation loss and reduces the learning rate if the validation loss doesn’t go lower after a certain number of epochs.
  4. TensorBoardColab: is a special version of Tensorboard tailored to work on Google Colab. We can monitor the accuracy and other metrics during the training process.
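The first three callbacks all follow the same "track the best validation loss and count epochs without improvement" pattern. The pure-Python sketch below replays a list of validation losses to show that logic; it is a simplification (no minimum delta, no cooldown), it does not call Keras, and the patience and factor values are illustrative rather than the settings used in this project.

```python
def simulate_callbacks(val_losses, lr=1e-5, stop_patience=10,
                       lr_patience=3, lr_factor=0.1):
    """Replay per-epoch validation losses and report what checkpointing,
    learning-rate reduction and early stopping would have done."""
    best_loss = float("inf")
    best_epoch = None
    epochs_since_best = 0      # drives early stopping
    epochs_since_lr_cut = 0    # drives learning-rate reduction
    stopped_epoch = None
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # ModelCheckpoint: this is where the weights would be saved.
            best_loss, best_epoch = loss, epoch
            epochs_since_best = 0
            epochs_since_lr_cut = 0
            continue
        epochs_since_best += 1
        epochs_since_lr_cut += 1
        if epochs_since_lr_cut >= lr_patience:
            lr *= lr_factor    # ReduceLROnPlateau
            epochs_since_lr_cut = 0
        if epochs_since_best >= stop_patience:
            stopped_epoch = epoch  # EarlyStopping
            break
    return {"best_epoch": best_epoch, "best_loss": best_loss,
            "stopped_epoch": stopped_epoch, "lr": lr}

history = [0.5, 0.4, 0.3, 0.31, 0.32, 0.33, 0.34, 0.35]
summary = simulate_callbacks(history, lr=1e-3, stop_patience=4, lr_patience=2)
```

In this toy run the best weights come from epoch 3, the learning rate is cut twice, and training stops at epoch 7, before the last loss value is ever seen.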

iii. Training the Model

We have done all the homework, and now it is time to fit the model. But before that, we will use train_test_split() to split the data into a train set and a test set, which contain 17,780 and 4446 images respectively. Once the model starts training on the train data, you can maybe go for a run, because this is going to take some time. The good thing is that we won’t have to babysit the model, and you can come back to a trained model and exported weights.
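The reported set sizes are consistent with holding out 20% of the 22,226 crops and rounding the test count up. The test fraction is not stated in the article, so treat the 0.2 below as an assumption. This sketch reproduces the split on indices with NumPy instead of calling scikit-learn; `split_indices` is a hypothetical helper.

```python
import math
import numpy as np

def split_indices(n_samples, test_size=0.2, seed=0):
    """Shuffle sample indices and hold out a fraction for testing."""
    n_test = math.ceil(test_size * n_samples)  # round the test count up
    order = np.random.default_rng(seed).permutation(n_samples)
    return order[n_test:], order[:n_test]      # (train, test)

n_total = 17_780 + 4_446
train_idx, test_idx = split_indices(n_total)
```

Indexing the image array and the map array with the same `train_idx` and `test_idx` keeps every image paired with its own map.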

The model ran for 57 epochs before EarlyStopping kicked in and halted the training process. The minimum validation loss was 0.2352. You can observe the trend in the validation and training metrics in the following graph.

Our test set contains 4446 images, and our model can predict their segmentation maps in almost no time. Our model’s performance on the test set can be gauged using the Dice coefficient, which comes to 0.59 (this value lies between 0 and 1). There certainly is room for improvement. You can observe a few of the predicted outputs in the following image.

On a second look, you will notice that our model can segment parts of roads which the annotators missed out. In the following image, the square at the bottom right was skipped by the annotators, while our model was able to capture it. Our model successfully segments driveways, parking lots, and cul-de-sacs.

In certain maps the roads weren’t completely visible; look at the following example. Our model was not able to detect the road on the left. No model can churn out 100% accurate results, and there is always room for improvement.

We can improve the performance of our model by taking certain measures, and they are as follows:

  1. Image Data Augmentation: This is the method of slightly distorting the images by applying operations such as colour shift and rotation to generate more data.
  2. Use loss multipliers to deal with class imbalance: As mentioned earlier, we had a class imbalance problem, and we used Soft Dice Loss to deal with it. We want to maximise our Dice coefficient, but Binary Cross-entropy has better-behaved gradients, so it can serve as a good proxy for our custom loss function and is easier to optimise. The only problem is that Binary Cross-entropy, unlike Soft Dice Loss, is not built to deal with class imbalance, and this results in jet-black segmentation maps. However, if we apply class multipliers so that the model is discouraged from simply predicting the most frequent class, then we can use Binary Cross-entropy instead of Dice Loss. This will result in a smoother training experience.
  3. Using Pretrained models: pre-trained models can be fine-tuned for this problem, and they will act as the best feature extractors. Using transfer learning results in faster training times, and often yields superior segmentation maps.
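The loss-multiplier idea from item 2 can be sketched as a weighted binary cross-entropy in NumPy. The function name `weighted_bce` and the choice of a single `pos_weight` for road pixels are assumptions for the example; a common heuristic is to set it near the ratio of background to road pixels, but the article does not prescribe a value.

```python
import numpy as np

def weighted_bce(y_true, y_pred, pos_weight=1.0, eps=1e-7):
    """Binary cross-entropy where errors on road pixels (y_true == 1)
    are multiplied by pos_weight."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions so that log() never receives 0.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    per_pixel = -(pos_weight * y_true * np.log(y_pred)
                  + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(per_pixel))

# One road pixel among four, and a nearly all-black prediction.
label = np.array([1.0, 0.0, 0.0, 0.0])
dark_prediction = np.full(4, 0.1)

plain = weighted_bce(label, dark_prediction)                     # pos_weight = 1
weighted = weighted_bce(label, dark_prediction, pos_weight=3.0)  # 3 background : 1 road
```

With the multiplier, the missed road pixel contributes three times as much to the loss, so an all-black map is no longer a cheap solution.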

In this project, we created a deep learning model that can segment roads from aerial images. We acquired the images and processed them to suit our needs. We created a U-net and learnt how it works. We used soft dice loss as our cost function and trained the model for 57 epochs. We then tested our model on the test set and observed a few samples.

Few Takeaways from this project:

  1. Cropping images instead of resizing them preserves the spatial information.
  2. Binarizing the segmentation maps reduces the number of distinct values in the map to two.
  3. Using the ModelCheckpoint callback to save the model weights is a good idea. If the program crashes during the training process, you can always reload the weights and resume training.
  4. Finally, if you ever hit a dead-end, then Slav Ivanov has written a comprehensive article which will help you overcome any deep learning related roadblocks.

This challenge was surely fun to work on and thank you for reading through this article. If you have any feedback or questions, please feel free to type it out in the comments section below.



  1. Source Code.
  2. CS231n: Convolutional Neural Networks for Visual Recognition


[1]U-Net — Wikipedia

[2]Evaluating image segmentation models — Jeremy Jordan


Road Surface Semantic Segmentation

Hello there! This post is about an approach to road surface semantic segmentation. The focus here is on road surface patterns: what kind of pavement the vehicle is driving on, whether there is any damage on the road, and also road markings, speed bumps and other things that can be relevant for a vehicular navigation task.

Here I will show you the step-by-step approach based on the paper available at Autonomous Robots (Springer) [1]. The Ground Truth and the experiments were made using the RTK dataset [2], with images captured with a low-cost camera, containing images of roads with different types of pavement and different conditions of pavement quality.

It was fun to work on it and I’m excited to share it, I hope you enjoy it too. 🤗

The purpose of this approach is to verify the effectiveness of using passive vision (a camera) to detect different patterns on the road, for example, to identify whether the road surface is asphalt, cobblestone or unpaved (dirt). This may be relevant for an intelligent vehicle, whether it is an autonomous vehicle or one with an Advanced Driver-Assistance System (ADAS). Depending on the type of pavement, it may be necessary to adapt the way the vehicle is driven, whether for the safety of users, the conservation of the vehicle or the comfort of the people inside it.

Another relevant factor of this approach is related to the detection of potholes and water-puddles, which could generate accidents, damage the vehicles and can be quite common in developing countries. This approach can also be useful for departments or organizations responsible for maintaining highways and roads.

To achieve these objectives, Convolutional Neural Networks (CNN) were used for the semantic segmentation of the road surface, I’ll talk more about that in next sections.

To train the neural network and to test and validate the results, a Ground Truth (GT) was created with 701 images from the RTK dataset. This GT is available on the dataset page and is composed of the following classes:

Everything here was done using Google Colab, a free Jupyter notebook environment that gives us free access to GPUs, is super easy to use, and is very helpful for organization and configuration. The amazing deep learning library fastai [3] was also used. To be more precise, the step-by-step that I will present is heavily based on one of the lessons given by Jeremy Howard in one of his deep learning courses, in this case lesson3-camvid.

The CNN architecture used was U-Net [4], an architecture designed for semantic segmentation of medical images but successfully applied in many other domains. A ResNet-based [5] encoder is used together with a decoder. The experiments for this approach were done with resnet34 and resnet50.

For the data augmentation step, standard options from the fastai library were used, with horizontal rotations and perspective distortion being applied. fastai can apply the same augmentation to both the original image and its mask (GT) image, so the two stay aligned.

A relevant point, which was of great importance for the definition of this approach, is that the classes of the GT are quite unbalanced: there are far more pixels of background or surface types (e.g. asphalt, paved or unpaved) than of the other classes. In an image classification problem, replicating certain images from the dataset might help to balance the classes. Here, replicating an image would further increase the difference in pixel counts between the largest and the smallest classes, since the replicated image also contains pixels of the large classes. So, in the defined approach, class weights were used for balancing. 🤔
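To make the idea of class weights concrete, here is a minimal sketch of a class-weighted cross-entropy over a handful of pixels, written in plain Python. It follows the usual convention of dividing the weighted sum by the sum of the weights; the function name, the toy probabilities and the weight values are all assumptions for illustration and are not taken from the original notebook.

```python
import math

def weighted_cross_entropy(probs, targets, class_weights):
    """Weighted mean of per-pixel negative log-likelihoods.

    probs: one list of class probabilities per pixel.
    targets: the true class id of each pixel.
    class_weights: one weight per class (hypothetical values below).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for pixel_probs, target in zip(probs, targets):
        weight = class_weights[target]
        weighted_sum += -weight * math.log(pixel_probs[target])
        weight_total += weight
    return weighted_sum / weight_total

# Toy example: class 0 is a common surface, class 1 a rare class such as a pothole.
probs = [[0.9, 0.1], [0.9, 0.1], [0.6, 0.4]]
targets = [0, 0, 1]

unweighted = weighted_cross_entropy(probs, targets, [1.0, 1.0])
weighted = weighted_cross_entropy(probs, targets, [1.0, 5.0])
```

With the larger weight on class 1, the poorly predicted rare pixel contributes more to the loss, so `weighted` comes out higher than `unweighted` and the optimizer is pushed harder to fix that pixel.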

Different experiments showed that just applying the weights is not enough: when the accuracy improved for the classes with fewer pixels, the classes with more pixels (e.g. asphalt, paved and unpaved) lost accuracy.

The best accuracy values across all classes, without losing much quality in the detection of surface types, came from the following configuration: first train a model without weights, which gives good accuracy for the surface types; then use that trained model as the starting point for a second model that uses the proportional weights for the classes. And that’s it!

You can check the complete code, which I will comment on throughout this post, on GitHub:

Are you ready?

Cool, so we start with our initial settings, importing the fastai library and the pathlib module. Let’s call this Step 1.

Step 1 — Initial settings

As we’ll use our dataset from Google Drive, we need to mount it, so in the next cell type:

You’ll see something like the next image. Click on the link to get an authorization code, then copy and paste it into the expected field.

Now just access your Google Drive as a file system. This is the start of Step 2, loading our data.

Step 2 — Preparing the data

Here, “image” is the folder containing the original images. “labels” is the folder containing the masks that we’ll use for training and validation; these images have 8-bit pixels after a colormap removal process. In “colorLabels” I’ve put the original colored masks, which we can use later for visual comparison. The “valid.txt” file contains a list of image names randomly selected for validation. Finally, the “codes.txt” file contains the list of class names.
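As a rough illustration of how the two text files can be used, the sketch below reads a list of class names and a list of validation image names, then splits the image names accordingly. It uses only the standard library; the helper names, the class names and the file names in the demo are invented for the example and are not the dataset's actual contents.

```python
import tempfile
from pathlib import Path

def read_names(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]

def split_train_valid(image_names, valid_names):
    """Split image names into (train, valid) using the names listed for validation."""
    valid_set = set(valid_names)
    train = [name for name in image_names if name not in valid_set]
    valid = [name for name in image_names if name in valid_set]
    return train, valid

# Demo with throwaway files; the class and file names are invented.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "codes.txt").write_text("background\nasphalt\npothole\n", encoding="utf-8")
    (root / "valid.txt").write_text("img_002.png\n\n", encoding="utf-8")
    codes = read_names(root / "codes.txt")
    valid_names = read_names(root / "valid.txt")

train, valid = split_train_valid(["img_001.png", "img_002.png", "img_003.png"], valid_names)

# The position of a class name in codes.txt is the pixel value used for it in the masks.
name_to_id = {name: index for index, name in enumerate(codes)}
```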

Now, we define the paths for the original images and for the GT mask images, enabling access to all images in each folder to be used later.

We can see an example, image 139 from the dataset.

Next, as shown in the fastai lesson, we use a function to infer the mask filename from the original image filename. The mask is what provides the class coding of each pixel.
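A minimal sketch of such a function, assuming each mask has exactly the same file name as its image and simply lives in the labels folder. That naming rule, the function name and the example paths are assumptions; a dataset with a suffix in the mask names would need a different format string.

```python
from pathlib import Path

def mask_path_for(image_path, labels_dir):
    """Return the path of the mask that belongs to image_path.

    Assumes the mask shares the image's stem and suffix (hypothetical rule).
    """
    image_path = Path(image_path)
    return Path(labels_dir) / f"{image_path.stem}{image_path.suffix}"

# Hypothetical paths, for illustration only.
example_mask = mask_path_for("data/image/000000139.png", "data/labels")
```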

Step 3 — First Step — Without weights

Here we are at Step 3. Let’s create the DataBunch for training our first model using the data block API, defining where our images come from, which images will be used for validation, and the mask corresponding to each original image. For data augmentation the fastai library offers several options, but here we’ll use only the default transforms, which consist of random horizontal rotations and perspective warping. Remember to apply the transforms to the masks as well in the transform call, so that the data augmentation is the same for each mask and its original image. Imagine if we rotated the original image but the corresponding mask was not rotated: what a mess it would be! 😵

We continue using the lesson3-camvid example from the fastai course to define the accuracy metric and the weight decay. I’ve used the resnet34 model, since resnet50 didn’t make much of a difference in this approach with this dataset. We can find a suitable learning rate using fastai’s learning rate finder and then set it for training.
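The accuracy metric in the camvid lesson is a pixel accuracy that skips pixels labelled with a void class. The plain-Python sketch below shows that idea on small nested lists. Whether this post keeps the void exclusion, and which class id would play that role, is an assumption; the function name and the toy masks are illustrative.

```python
def pixel_accuracy(pred, target, void_code=None):
    """Fraction of correctly labelled pixels, ignoring pixels whose target is void_code."""
    correct = 0
    counted = 0
    for pred_row, target_row in zip(pred, target):
        for predicted, actual in zip(pred_row, target_row):
            if actual == void_code:
                continue  # void pixels do not count for or against the model
            counted += 1
            correct += int(predicted == actual)
    return correct / counted if counted else 0.0

# Toy 2x2 masks of class ids.
target = [[0, 1], [2, 1]]
pred = [[0, 1], [1, 1]]

acc_all = pixel_accuracy(pred, target)                  # every pixel counts
acc_no_void = pixel_accuracy(pred, target, void_code=0) # class 0 treated as void
```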

Next we run the training for 10 epochs to check how our model is doing.

Using the confusion matrix we can see how good (or bad) the model is for each class so far…

Don’t forget to save the model we’ve trained so far.
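For readers who want to see what the confusion matrix measures at pixel level, here is a small stand-alone sketch. Rows are true classes and columns are predicted classes; the per-class value computed from each row is the share of that class's pixels that were labelled correctly. The function names and the toy masks are assumptions for illustration.

```python
def confusion_matrix(pred, target, n_classes):
    """Count pixels by (true class, predicted class)."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for pred_row, target_row in zip(pred, target):
        for predicted, actual in zip(pred_row, target_row):
            matrix[actual][predicted] += 1
    return matrix

def per_class_recall(matrix):
    """Share of each true class that was predicted correctly (None if the class is absent)."""
    recalls = []
    for class_id, row in enumerate(matrix):
        total = sum(row)
        recalls.append(row[class_id] / total if total else None)
    return recalls

# Toy masks with three classes: 0 is frequent, 1 and 2 are rare.
target = [[0, 0, 1], [0, 2, 2]]
pred = [[0, 0, 0], [0, 2, 1]]

matrix = confusion_matrix(pred, target, n_classes=3)
recalls = per_class_recall(matrix)
```

In this toy case the frequent class is perfect while the rare classes are not, which is the pattern that overall accuracy alone tends to hide.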

Now we just train the model over more epochs to improve the learning, and remember to save our final model. The slice keyword takes a start and a stop value: the first layers are trained with the start learning rate, the last layers with the stop learning rate, and the layers in between with intermediate values.
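In fastai v1, as far as I can tell, a learning rate given as slice(start, stop) is expanded into one rate per layer group, spaced by a constant multiplicative factor. The sketch below reproduces that spacing in plain Python; the function name and the example values are hypothetical, and the number of layer groups depends on the model.

```python
def spread_learning_rates(start, stop, n_groups):
    """Return n_groups learning rates from start to stop, evenly spaced on a log scale."""
    if n_groups == 1:
        return [stop]
    step = (stop / start) ** (1 / (n_groups - 1))
    return [start * step ** index for index in range(n_groups)]

# Hypothetical values: three layer groups, earliest layers get the smallest rate.
rates = spread_learning_rates(1e-5, 1e-3, 3)
```

The earliest layers, which hold the most general pretrained features, get the smallest updates, and the layers closest to the output get the largest.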

This is our first model, without weights, which works fine for road surfaces but doesn’t work for the small classes.

Step 4 — Second Part — With weights

We’ll use the first model in our next step. This part is almost identical to Step 3, starting from the DataBunch; we just need to remember to load our previous model.

And, before we start the training process, we need to put weights on the classes. I defined these weights to try to be proportional to how much each class appears in the dataset (number of pixels). I ran a Python script with OpenCV to count the number of pixels in each class over the GT’s 701 images, to get a sense of the proportion of each class… 😓
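The counting step can be sketched without OpenCV by treating each mask as a nested list of class ids (with OpenCV, the same loop would run over the arrays returned when reading the mask files). The post does not give the formula used to turn counts into weights, so the inverse-frequency rule below is only one common choice and is an assumption, as are the function names and the toy masks.

```python
from collections import Counter

def count_pixels(masks):
    """Count how many pixels of each class id appear across all masks."""
    counts = Counter()
    for mask in masks:
        for row in mask:
            counts.update(row)
    return counts

def inverse_frequency_weights(counts, n_classes):
    """Give rarer classes larger weights; classes that never appear get weight 0."""
    total = sum(counts.values())
    n_present = sum(1 for class_id in range(n_classes) if counts[class_id] > 0)
    weights = []
    for class_id in range(n_classes):
        if counts[class_id] == 0:
            weights.append(0.0)
        else:
            weights.append(total / (n_present * counts[class_id]))
    return weights

# Two toy 2x4 masks: class 0 dominates, classes 1 and 2 are rare.
masks = [
    [[0, 0, 0, 1], [0, 0, 2, 0]],
    [[0, 0, 0, 0], [0, 1, 0, 0]],
]
counts = count_pixels(masks)
weights = inverse_frequency_weights(counts, n_classes=3)
```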

The remainder is exactly like step three presented before. What changes are the results obtained. 😬

Now, it looks like we have a more reasonable result for all classes. Remember to save it!

Finally, let’s see our images, right? Before anything else, it is better to save the results for our test images.

But wait! The images all look completely black. Where are my results??? 😱 Calm down: these are the results, just without a color map. If you open one of these images on the entire screen, with high brightness, you can see the small variations, “Eleven Shades of Grey” 🙃. So let’s color our results to make them more presentable. We’ll use OpenCV and create a new folder to save our colored results.

So we create a function to identify each grey level and colorize each pixel accordingly.
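A minimal sketch of the colorizing idea, using a lookup table from class id to color instead of a chain of comparisons. The palette below is invented for the example and is not the dataset's color map; if the result is written with OpenCV, the tuples would need to be in BGR order.

```python
# Hypothetical palette: class id -> (R, G, B). Not the dataset's real color map.
PALETTE = {
    0: (0, 0, 0),        # e.g. background
    1: (128, 128, 128),  # e.g. a road surface class
    2: (255, 0, 0),      # e.g. a damage class
}

def colorize_mask(mask, palette):
    """Replace each class id in a 2D mask with its color tuple."""
    return [[palette[class_id] for class_id in row] for row in mask]

# A tiny 2x3 mask: its raw values (0, 1, 2) would look almost black when viewed directly.
grey_mask = [[0, 1, 1], [2, 1, 0]]
colored = colorize_mask(grey_mask, PALETTE)
```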

Next, we read each image, call the function and save our final result.

But this process could take more time than necessary. Timing it, we get a performance like:

What if we need to test with more images? We can speed up this step using Cython. So, let’s put a pinch of Cython on that!

So, we edit our function to identify each variation and to colorize each pixel, but this time, using Cython.

And we just read each image and call the function and save our final result as we did before.

And voilà! Now we have a performance like:

Much better, right?

Some results samples

In the image below are some results. In the left column are the original images, in the middle column the GT and in the right column the result with this approach.

Video with the results

Identifying road surface conditions is important in any scenario. Based on this, the vehicle or driver can adapt and make decisions that make driving safer, more comfortable and more efficient. This is particularly relevant in developing countries, which may have more road maintenance problems or a considerable number of unpaved roads.

This approach looks promising for dealing with environments with variations in the road surface. This can also be useful for highway analysis and maintenance departments, in order to automate part of their work in assessing road quality and identifying where maintenance is needed.

However, some points were identified and analyzed as subject to improvement.

For the segmentation GT, it may be interesting to divide some classes into more specific ones. The Cracks class, for example, is used for different kinds of damage regardless of the type of road. It could be split into variations of Cracks for each type of surface, because different surfaces have different types of damage, and also into separate classes that categorize the different kinds of damage.

That’s all for now. Feel free to reach out to me. 🤘

This experiment is part of a project on visual perception for vehicular navigation from LAPiX (Image Processing and Computer Graphics Lab).

If you are going to talk about this approach, please cite as:

author = {Thiago Rateke and Aldo von Wangenheim},
title = {Road surface detection and differentiation considering surface damages},
year = {2020},
eprint = {2006.13377},
archivePrefix = {arXiv},
primaryClass = {cs.CV},

[1] T. Rateke, A. von Wangenheim. Road surface detection and differentiation considering surface damages, (2020), Autonomous Robots (Springer).

[2] T. Rateke, K. A. Justen and A. von Wangenheim. Road Surface Classification with Images Captured From Low-cost Cameras — Road Traversing Knowledge (RTK) Dataset, (2019), Revista de Informática Teórica e Aplicada (RITA).

[3] J. Howard, et al. fastai (2018).

[4] O. Ronneberger, P. Fischer, T. Brox. U-net: Convolutional networks for biomedical image segmentation, (2015), NAVAB, N. et al. (Ed.). Medical Image Computing and Computer-Assisted Intervention — MICCAI 2015. Cham: Springer International Publishing.

[5] K. He, et al. Deep residual learning for image recognition, (2016), IEEE Conference on Computer Vision and Pattern Recognition (CVPR).



Paved and Unpaved Road Segmentation Using Deep Neural Network

  • Dabeen Lee
  • Seunghyun Kim
  • Hongjun Lee
  • Chung Choo Chung
  • Whoi-Yul Kim (email author)
Conference paper


Part of the Communications in Computer and Information Science book series (CCIS, volume 1180)


Semantic segmentation is essential for autonomous driving, which classifies roads and other objects in the image and provides pixel-level information. For high quality autonomous driving, it is necessary to consider the driving environment of the vehicle, and the vehicle speed should be controlled according to types of road. For this purpose, the semantic segmentation module has to classify types of road. However, current public datasets do not provide annotation data for these road types. In this paper, we propose a method to train the semantic segmentation model for classifying road types. We analyzed the problems that can occur when using a public dataset like KITTI or Cityscapes for training, and used Mapillary Vistas data as training data to get generalized performance. In addition, we use focal loss and over-sampling techniques to alleviate the class imbalance problem caused by relatively small class data.
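As a brief illustration of the focal loss mentioned above, the sketch below implements the standard per-sample form, which scales the cross-entropy by (1 - p)^gamma so that well-classified samples contribute less. The function name, the default parameter values and the example probabilities are assumptions for illustration and do not come from the paper's experiments.

```python
import math

def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one sample, given the predicted probability of its true class."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

# An easy sample (p = 0.9) versus a hard sample (p = 0.3), with hypothetical values.
easy = focal_loss(0.9)
hard = focal_loss(0.3)
plain_ce_easy = focal_loss(0.9, gamma=0.0)  # gamma = 0 reduces to cross-entropy
```

With gamma above zero, the easy sample's loss shrinks far more than the hard sample's, which shifts the training signal towards the rare, poorly predicted classes.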


Keywords: Semantic segmentation, Class imbalance, Road type, Autonomous driving




This work was supported in part by a Korea Evaluation Institute of Industrial Technology (KEIT) grant funded by the Korea government (MOTIE) (No. 20000293, Road Surface Condition Detection using Environmental and In-vehicle Sensors).



Copyright information

© Springer Nature Singapore Pte Ltd. 2020

Authors and Affiliations

  • Dabeen Lee
  • Seunghyun Kim
  • Hongjun Lee
  • Chung Choo Chung
  • Whoi-Yul Kim (email author)
  1. Hanyang University, Seoul, South Korea