DATASET



We have curated a diverse dataset of two-wheeler rider intentions captured by monocular cameras (GoPro), encompassing various maneuvers on Indian roads as shown in Fig. 1. Our dataset comprises rider maneuvers such as left turn, right turn, left lane change, right lane change, straight, and slow-stop. Each video clip in the dataset consists of a single maneuver. The rider maneuvering videos are captured on residential, market, rural, and suburban streets, as well as narrow interior roads that three-wheelers and four-wheelers cannot traverse, covering different times of day and night and varying traffic densities.


Fig. 1. Examples from the proposed dataset, illustrating diverse two-wheeler rider maneuvers under unstructured Indian road conditions.

We selected these classes of maneuvers based on two factors. Firstly, they are frequently observed on Indian roads. Secondly, they are commonly observed without recurring interference from the surrounding traffic context (i.e., other traffic agents are not the cause of the maneuver, for example an obstruction, yield, or cut-in).

The dataset comprises 1,000 multi-view video clips capturing rider maneuvers, each with a corresponding label and pre-extracted video embedding features. For the contest, we provide video features extracted with three pre-trained models, namely VGG16, ResNet50, and R(2+1)D, so that participants are not restricted by limited compute.
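For illustration, the snippet below shows one way such clip-level embeddings can be obtained with an off-the-shelf R(2+1)D backbone from torchvision. This is not the organizers' extraction pipeline; the clip length, resolution, and pooling used for the provided features may differ.

# Illustrative only: one way to obtain clip-level R(2+1)D embeddings with torchvision.
# The organizers' exact extraction pipeline (frame rate, resolution, pooling) may differ.
import torch
from torchvision.models.video import r2plus1d_18, R2Plus1D_18_Weights

weights = R2Plus1D_18_Weights.KINETICS400_V1
model = r2plus1d_18(weights=weights)
model.fc = torch.nn.Identity()           # drop the classifier head -> 512-d embedding
model.eval()

preprocess = weights.transforms()        # resize, crop, and normalization expected by the model

# clip: uint8 tensor of shape (T, H, W, C) read from a video, e.g. via torchvision.io.read_video
clip = torch.randint(0, 256, (16, 112, 112, 3), dtype=torch.uint8)
batch = preprocess(clip.permute(0, 3, 1, 2)).unsqueeze(0)  # (T, C, H, W) -> (1, C, T, H, W)

with torch.no_grad():
    feature = model(batch)               # (1, 512) clip embedding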


Training Set:

Task 1 and Task 2 have 500 clips of video data. Participants are required to develop corresponding models according to the requirements of each task. We provide three sets of extracted video features for the 500 training clips, with labels, for Task 1 and Task 2, which can be downloaded from here.

Note: You can use any one of these training feature sets to build your models for Task 1 and Task 2.
The training set consists of three subfolders, namely frontal_view, left_side_mirror_view, and right_side_mirror_view. Each subfolder contains one label folder per maneuver, namely Left Lane Change, Right Lane Change, Left Turn, Right Turn, Slow-Stop, and Straight, and each label folder holds the video features for that maneuver. For example, the Left Lane Change folder contains 29 video features of the left lane change maneuver.
For Task 1, only the frontal_view video features are to be used; for Task 2, the video features from all three views (frontal_view, left_side_mirror_view, and right_side_mirror_view) are to be used in developing the model. A minimal feature-loading sketch is given below.
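The following sketch walks the folder layout described above. The root folder name and the feature file format (assumed .npy here) are placeholders; adjust them to match the downloaded feature set.

# A minimal loading sketch. Paths follow the structure described above; the feature
# file format (assumed .npy here) and root folder name are placeholders.
import os
import numpy as np

ROOT = "train_features"   # hypothetical folder name for one of the three feature sets
VIEWS = ["frontal_view", "left_side_mirror_view", "right_side_mirror_view"]
LABELS = ["Left Lane Change", "Right Lane Change", "Left Turn",
          "Right Turn", "Slow-Stop", "Straight"]

def load_view(view):
    """Return (features, label_indices) for every clip under one view folder."""
    feats, targets = [], []
    for idx, label in enumerate(LABELS):
        folder = os.path.join(ROOT, view, label)
        for fname in sorted(os.listdir(folder)):
            feats.append(np.load(os.path.join(folder, fname)))
            targets.append(idx)
    return feats, targets

# Task 1 uses only the frontal view; Task 2 additionally loads both mirror views.
frontal_feats, frontal_labels = load_view("frontal_view")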



Validation Set:

Task 1 and Task 2 have 200 clips of video data. Participants predict the label of each video and submit the prediction results. We provide three sets of extracted video features for the 200 validation clips, with labels, for Task 1 and Task 2, which can be downloaded from here.


Output Format Specification:

The output should be saved as 'task1_val_result_format.csv' for Task 1 and 'task2_val_result_format.csv' for Task 2. The first column of the CSV file contains the name of the video, and the next six columns contain the maneuver names in the specified format. The classification results should be uploaded as one-hot encoded values (a 1 for the predicted class and a 0 for all other classes), as in the sketch below.
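A minimal sketch of writing such a one-hot CSV follows. The exact column headers should be copied from the provided result-format file; the video names and predicted indices here are placeholders.

# Minimal sketch of the expected one-hot CSV layout. Column headers and video names
# are placeholders; copy the exact headers from the provided result-format file.
import numpy as np
import pandas as pd

LABELS = ["Left Lane Change", "Right Lane Change", "Left Turn",
          "Right Turn", "Slow-Stop", "Straight"]

video_names = ["video_001", "video_002"]   # placeholder clip identifiers
pred_indices = [2, 5]                      # predicted class index per clip

rows = []
for name, idx in zip(video_names, pred_indices):
    one_hot = np.zeros(len(LABELS), dtype=int)
    one_hot[idx] = 1                       # 1 for the predicted class, 0 elsewhere
    rows.append([name, *one_hot])

pd.DataFrame(rows, columns=["video_name", *LABELS]).to_csv(
    "task1_val_result_format.csv", index=False)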


Submission Form for Validation Set:

Participants are requested to fill out the Validation Results Submission form and upload the CSV files for Task 1 and Task 2.


Note: A maximum of THREE submissions will be allowed for validation results upload. Both Task 1 and Task 2 results must be uploaded simultaneously.



Test Set: (New)

Task 1 and Task 2 have 300 clips of video data. Participants predict the label of each video and submit the prediction results. We provide three sets of extracted video features for the 300 test clips, WITHOUT labels, for Task 1 and Task 2, which can be downloaded from here.


Output Format Specification: (New)

The output should be saved as 'task1_test_result_format.csv' for Task 1 and 'task2_test_result_format.csv' for Task 2. The first column of the CSV file contains the name of the video, and the next six columns contain the maneuver names in the specified format. The classification results should be uploaded as one-hot encoded values (a 1 for the predicted class and a 0 for all other classes).


Submission Form for Test Set: (New)

Participants are requested to fill out the Test Results Submission form and upload the CSV files for Task 1 and Task 2. The organizers will evaluate the F1 score and accuracy, and rank the submissions on the leaderboard based on the prediction results on the test set.

Test Results Submission form: Test Results Upload Form (Closed).

Note: A maximum of THREE submissions will be allowed for test results upload. Both Task 1 and Task 2 results must be uploaded simultaneously.




Baseline Code


We provide runnable Colab baseline notebooks for Task 1, i.e., single/frontal-view rider intention prediction (Baseline_Single_View), and Task 2, i.e., multi-view rider intention prediction (Baseline_MV_early_fusion).

The GitHub repo for the baselines can be found here: https://github.com/wasilone11/ICPR-RIP-2024
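For Task 2, early fusion means combining the per-view feature vectors before any learned layers and classifying the fused representation. The snippet below is a rough illustration of that idea; the feature dimension, layer sizes, and hyperparameters are assumptions and do not reproduce the exact architecture in Baseline_MV_early_fusion.

# Rough sketch of multi-view early fusion: concatenate per-view features, then classify.
# Feature dimension and layer sizes are assumptions, not the baseline's exact settings.
import torch
import torch.nn as nn

FEAT_DIM = 512          # assumed per-view feature size; depends on the chosen backbone
NUM_CLASSES = 6

class EarlyFusionClassifier(nn.Module):
    def __init__(self, feat_dim=FEAT_DIM, num_views=3, num_classes=NUM_CLASSES):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim * num_views, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes),
        )

    def forward(self, frontal, left_mirror, right_mirror):
        # Early fusion: join the views at the input before any learned layers.
        fused = torch.cat([frontal, left_mirror, right_mirror], dim=-1)
        return self.net(fused)

model = EarlyFusionClassifier()
logits = model(torch.randn(4, FEAT_DIM), torch.randn(4, FEAT_DIM), torch.randn(4, FEAT_DIM))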


Evaluation Metrics


Participants are required to output a maneuver for each video. We evaluate RIP with two metrics for Task 1 and Task 2: the classification accuracy, treating the driving maneuvers as classes, and the \(F_1\) score for detecting the maneuvers (i.e., the harmonic mean of precision and recall).
The accuracy metric is defined as follows:

$$Acc := \frac{1}{n}\sum_{i=1}^{n} \sigma(p(s_i), t_i), \qquad \sigma(i,j) := \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$
where \(p(s_i)\) is the prediction of the classifier for the sample \(s_i\), \(t_i\) is the corresponding target label, and \(n\) is the total number of samples in the dataset. Here, each sample \(s_i\) is a video of a specific maneuver.

The other metric we use for measuring model performance is the \(F_1\) score, calculated from precision and recall as $$ F_1 = \frac{2 \cdot P \cdot R}{P + R} $$
where the precision \(P\) and recall \(R\) are $$ P = \frac{tp}{tp + fp + fpp}, \qquad R = \frac{tp}{tp + fp + mp} $$ with the counts defined as follows (a sketch of the computation is given after the list):

  • true prediction (\(tp\)): correct prediction of the maneuver in a video
  • false prediction (\(fp\)): the predicted maneuver differs from the actually performed maneuver
  • false positive prediction (\(fpp\)): a maneuver is predicted, but the rider is driving straight
  • missed prediction (\(mp\)): driving straight is predicted, but a maneuver is performed
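The following is a minimal sketch of how these metrics can be computed from predicted and target labels, assuming the Straight class plays the role of the no-maneuver ("driving straight") class; the official evaluation script may differ in details.

# Sketch of the accuracy and F1 metrics as defined above. Assumes "Straight" is the
# no-maneuver class; the official evaluation script may differ in details.
def evaluate(preds, targets, straight="Straight"):
    n = len(targets)
    acc = sum(p == t for p, t in zip(preds, targets)) / n

    tp = fp = fpp = mp = 0
    for p, t in zip(preds, targets):
        if p == t and p != straight:
            tp += 1                      # correct maneuver prediction
        elif p != straight and t == straight:
            fpp += 1                     # maneuver predicted, but rider drives straight
        elif p == straight and t != straight:
            mp += 1                      # straight predicted, but a maneuver is performed
        elif p != t and p != straight and t != straight:
            fp += 1                      # wrong maneuver predicted

    precision = tp / (tp + fp + fpp) if (tp + fp + fpp) else 0.0
    recall = tp / (tp + fp + mp) if (tp + fp + mp) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return acc, f1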