This repository contains the VLEngagement dataset and the helper functions/ tools required to work with the dataset.


Video lectures dataset

This repository contains the dataset and source code of the experiments conducted and reported using the VLEngagement dataset. The VLEngagement dataset provides a set of statistics aimed at studying population-based (context-agnostic) engagement in video lectures, together with other conventional metrics of subjective assessment such as average star ratings and number of views. We believe the dataset will help the community applying AI in Education to further understand which features of educational material make it engaging for learners.


The dataset is particularly suited to solving the cold-start problem found in educational recommender systems, in two settings: i) user cold-start, where new users join the system and too little is known about their context, so population-based engaging lectures can simply be recommended for a specific query topic; and ii) item cold-start, where new educational content is released for which no user engagement data is available yet, so an engagement prediction model is necessary. To the best of our knowledge, this is the first dataset to tackle such a task in education/scientific recommendations at this scale.

The dataset is a milestone towards improving the sustainability of future knowledge systems, with direct impact on scalable, automatic quality assurance and personalised education. It improves transparency by allowing the interpretation of humanly intuitive features and their influence on population-based engagement prediction.


The VLEngagement dataset can be considered a highly impactful resource contribution to the information retrieval, multimedia analysis, educational data mining, learning analytics and AI in education research communities. It enables a new line of research geared towards next-generation information and knowledge management within educational repositories, Massive Open Online Course (MOOC) platforms and other video/document platforms. The dataset complements the ongoing effort to understand learner engagement in video lectures. It also advances the research landscape by formally establishing two objectively measurable novel tasks related to predicting the engagement of educational videos, while making a significantly larger, more focused dataset and its baselines available to the research community, with more relevance to AI in Education.

The AI in Education, Intelligent Tutoring Systems and Educational Data Mining communities are growing rapidly and will benefit from this dataset, as it directly addresses issues related to their respective fields. The simultaneously growing need for scalable, personalised learning solutions makes this dataset a central piece within the community, one that will help improve scalable quality assurance and personalised educational recommendation in the years to come. The value of this dataset to the field is expected to be long-lasting and to increase as subsequent versions become available with more videos and more features.

Using the Dataset and Its Tools

The resource is developed in a way that any researcher with very basic technological literacy can start building on top of this dataset.

  • The dataset is provided in Comma-Separated Values (CSV) format, making it human-readable while being accessible through a wide range of data manipulation and statistical software suites.
  • The resource includes helper_tools, which provides a set of functions that any researcher with basic Python knowledge can use to interact with the dataset and to evaluate the models they build.
  • models.regression provides well-documented example code snippets that 1) enable the researcher to reproduce the results reported for the baseline models and 2) show how to build novel models using the VLEngagement dataset.
  • The feature_extraction module presents the programming logic used to calculate the features in the dataset. The feature extraction logic is presented in the form of well-documented (PEP-8 standard, Google Docstrings format) Python functions that can be used to 1) understand the logic behind feature extraction or 2) apply the feature extraction logic to your own lecture records to generate more data.

Structure of the Resource

The repository is divided into two distinct components at the top level.

  1. VLEngagement_datasets: This section stores the different versions of the VLEngagement datasets (current version: v1).
  2. context_agnostic_engagement: This module stores all the code related to manipulating and managing the datasets.

In addition, there are two files:

  • The main source of information for understanding and working with the VLEngagement datasets.
  • Python setup file that will install the support tools to your local python environment.

VLEngagement Datasets

This section makes the VLEngagement datasets publicly available. The VLEngagement dataset is constructed using the aggregated video lecture consumption data coming from a popular OER repository, VideoLectures.Net. These videos are recorded when researchers present their work at peer-reviewed conferences. Lectures are reviewed, and hence the material is controlled for correctness of knowledge and pedagogical robustness. Specifically, the dataset is comparatively more useful when building e-learning systems for Artificial Intelligence and Computer Science education, as the majority of lectures in the dataset belong to these topics.


All the relevant datasets are available as Comma-Separated Values (CSV) files within a dataset subdirectory (e.g. v1/VLEngagement_dataset_v1.csv). At present, a dataset consisting of around 12,000 lectures is publicly available.

| Dataset | Number of Lectures | Number of Users | Number of Star Ratings | Log Recency | URL |
| --- | --- | --- | --- | --- | --- |
| v1 | 11568 | Over 1.1 Million | 2127 | Until February 01, 2021 | /VLEngagement_datasets/12k |

The latest dataset of this collection is v1. The tools required to load and manipulate the datasets are found in the context_agnostic_engagement.utils.io_utils module.
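As a minimal sketch of reading a dataset file without the repository's own helpers, the CSV can be loaded with Python's standard csv module. The function name load_lectures is hypothetical and is not part of io_utils; the column names are whatever the file header provides.

```python
import csv


def load_lectures(csv_path):
    """Read a VLEngagement CSV file into a list of row dictionaries.

    Hypothetical helper for illustration only; it is not the io_utils API.
    Every value is returned as a string, exactly as stored in the file.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

Calling load_lectures("v1/VLEngagement_dataset_v1.csv") from the dataset directory would return one dictionary per lecture, keyed by the column names in the header row.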


We restrict the final dataset to lectures that have been viewed by at least 5 unique users, to preserve the anonymity of users and to have reliable engagement measurements. Additionally, a range of techniques is used to preserve the anonymity of the data authors using the remaining features. Rarely occurring values in the Lecture Type feature were grouped together to create the other category. The Language feature is grouped into en and non-en categories. Similarly, the Domain category groups the Life Sciences, Physics, Technology, Mathematics, Computer Science, Data Science and Computers subjects into the stem category and the other subjects into the misc category. Published Date is rounded to the nearest 10 days and Lecture Duration is rounded to the nearest 10 seconds. Gaussian white noise (10%) is added to the Title Word Count feature, which is then rounded to the nearest integer.
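The anonymisation steps described above can be sketched roughly as follows. This is an illustration rather than the code used to build the dataset: all function names are hypothetical, the 10% noise is assumed to mean a standard deviation equal to 10% of the original value, and clamping the noisy count at zero is an added safeguard.

```python
import math
import random

# Subjects mapped to the "stem" group, as listed in the text above.
STEM_SUBJECTS = {
    "Life Sciences", "Physics", "Technology", "Mathematics",
    "Computer Science", "Data Science", "Computers",
}


def round_to_nearest(value, step):
    """Round a non-negative number to the nearest multiple of step (ties go up)."""
    return int(math.floor(value / step + 0.5)) * step


def group_domain(subject):
    """Collapse a subject into the 'stem' or 'misc' category."""
    return "stem" if subject in STEM_SUBJECTS else "misc"


def group_language(language_code):
    """Collapse a language code into 'en' or 'non-en'."""
    return "en" if language_code == "en" else "non-en"


def add_gaussian_noise(count, relative_std=0.10, rng=random):
    """Add zero-mean Gaussian noise and round to the nearest integer.

    Assumption: the noise standard deviation is relative_std * count.
    """
    noisy = rng.gauss(count, relative_std * count)
    return max(0, int(round(noisy)))
```

For example, round_to_nearest(days_since_epoch, 10) would give the rounded Published Date and round_to_nearest(duration_seconds, 10) the rounded Lecture Duration.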


There are 4 main types of features extracted from the video lectures. These features can be categorised into six quality verticals.

All the features that are included in the dataset are summarised in Table 1.

Table 1: Features extracted and available in the VLEngagement dataset with their variable type (Continuous vs. Categorical) and their quality vertical.

| Variable Type | Name | Quality Vertical | Description |
| --- | --- | --- | --- |
| Metadata-based Features | | | |
| cat. | Language | - | Language of instruction of the video lecture |
| cat. | Domain | - | Subject area (STEM or Miscellaneous) |
| Content-based Features | | | |
| con. | Word Count | Topic Coverage | Word Count of Transcript |
| con. | Title Word Count | Topic Coverage | Word Count of Title |
| con. | Document Entropy | Topic Coverage | Document Entropy of Transcript |
| con. | Easiness (FK Easiness) | Understandability | Easiness of Transcript text based on FK Easiness |
| con. | Stop-word Presence Rate | Understandability | Stop-word Presence Rate of Transcript text |
| con. | Stop-word Coverage Rate | Understandability | Stop-word Coverage Rate of Transcript text |
| con. | Preposition Rate | Presentation | Preposition Rate of Transcript text |
| con. | Auxiliary Rate | Presentation | Auxiliary Rate of Transcript text |
| con. | To Be Rate | Presentation | To-Be Verb Rate of Transcript text |
| con. | Conjunction Rate | Presentation | Conjunction Rate of Transcript text |
| con. | Normalisation Rate | Presentation | Normalisation Rate of Transcript text |
| con. | Pronoun Rate | Presentation | Pronoun Rate of Transcript text |
| con. | Published Date | Freshness | Duration between 01/01/1970 and the lecture published date (in days) |
| Wikipedia-based Features | | | |
| cat. | Top-5 Authoritative Topic URLs | Authority | 5 Most Authoritative Topic URLs based on PageRank Score. 5 features in this group |
| con. | Top-5 PageRank Scores | Authority | PageRank Scores of the top-5 most authoritative topics |
| cat. | Top-5 Covered Topic URLs | Topic Coverage | 5 Most Covered Topic URLs based on Cosine Similarity Score. 5 features in this group |
| con. | Top-5 Cosine Similarities | Topic Coverage | Cosine Similarity Scores of the top-5 most covered topics |
| Video-based Features | | | |
| con. | Lecture Duration | Topic Coverage | Duration of the video (in seconds) |
| cat. | Is Chunked | Presentation | Whether the lecture consists of multiple videos |
| cat. | Lecture Type | Presentation | Type of lecture (lecture, tutorial, invited talk etc.) |
| con. | Speaker Speed | Presentation | Speaker speed (words per minute) |
| con. | Silence Period Rate (SPR) | Presentation | Fraction of silence in the lecture video |

General Features

Features extracted from the lecture metadata that are associated with the language and subject of the materials.

Content-based Features

Features extracted from the content discussed within the lecture. For English lectures, these features are extracted from the content transcript. For non-English lectures, they are extracted from the English translation. The transcription and translation services are provided by the TransLectures project.

Textual Feature Extraction

Different groups of word tokens are used when calculating features such as Preposition Rate, Auxiliary Rate etc., as proposed by Dalip et al.

The features are calculated using the formulae listed below:

The tokens used during feature extraction are listed below:
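As a rough sketch of how rate-style textual features of this kind could be computed, the example below assumes that each rate is the share of transcript words falling into a token group, that stop-word coverage is the share of the stop-word list that appears in the text, and that document entropy is the Shannon entropy (base 2) of the word distribution. These definitions, the function names and the tiny token sets are illustrative assumptions, not the exact formulae or token lists behind the dataset.

```python
import math
import re
from collections import Counter

# Tiny illustrative token groups; the lists used for the dataset are larger.
PREPOSITIONS = {"in", "on", "at", "of", "to", "with", "for", "by"}
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def token_rate(words, group):
    """Share of words that belong to the given token group (assumed formula)."""
    if not words:
        return 0.0
    return sum(1 for word in words if word in group) / len(words)


def stopword_coverage_rate(words, stopwords):
    """Share of the stop-word list that occurs at least once (assumed formula)."""
    if not stopwords:
        return 0.0
    return len(set(words) & stopwords) / len(stopwords)


def document_entropy(words):
    """Shannon entropy (base 2) of the word frequency distribution."""
    if not words:
        return 0.0
    total = len(words)
    counts = Counter(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

With these helpers, token_rate(words, PREPOSITIONS) plays the role of a Preposition Rate and token_rate(words, STOPWORDS) the role of a Stop-word Presence Rate; the other rates differ only in the token group passed in.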

Wikipedia-based Features

Two feature groups associated with content authority and topic coverage are extracted by connecting the lecture transcript to Wikipedia. Entity Linking technology is used to identify Wikipedia concepts that are associated with the lecture contents.

  • Most Authoritative Topics: The Wikipedia topics in the lecture are used to build a semantic graph of the lecture, where the Semantic Relatedness is calculated using the Milne and Witten method (4). PageRank is run on the semantic graph to identify the most authoritative topics within the lecture. The top-5 most authoritative topic URLs and their respective PageRank values are included in the dataset.

  • Most Covered Topics: Similarly, the Cosine Similarity between the Wikipedia topic page and the lecture transcript is used to rank the Wikipedia topics that are most covered in the video lecture. The top-5 most covered topic URLs and their respective cosine similarity values are included in the dataset.
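The two rankings above can be sketched in plain Python. This is a simplified illustration, not the repository's implementation: it assumes the semantic relatedness scores are already available as positive, symmetric edge weights, runs a basic power-iteration PageRank with a damping factor of 0.85, and uses raw term-frequency vectors for the cosine similarity. All function names are hypothetical.

```python
import math
from collections import Counter


def pagerank(weights, damping=0.85, iterations=100):
    """Weighted PageRank on an undirected graph.

    weights maps (topic_a, topic_b) pairs to a positive relatedness score.
    """
    neighbours = {}
    for (a, b), weight in weights.items():
        if weight <= 0:
            continue  # ignore non-positive relatedness scores
        neighbours.setdefault(a, {})[b] = weight
        neighbours.setdefault(b, {})[a] = weight
    if not neighbours:
        return {}
    nodes = sorted(neighbours)
    n = len(nodes)
    out_weight = {node: sum(neighbours[node].values()) for node in nodes}
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        rank = {
            node: (1 - damping) / n + damping * sum(
                rank[other] * weight / out_weight[other]
                for other, weight in neighbours[node].items()
            )
            for node in nodes
        }
    return rank


def cosine_similarity(words_a, words_b):
    """Cosine similarity between two bags of words (term-frequency vectors)."""
    tf_a, tf_b = Counter(words_a), Counter(words_b)
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(scores, k=5):
    """Return the k highest-scoring (topic, score) pairs."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
```

Passing the PageRank result to top_k gives a top-5 list of authoritative topics, and passing a dictionary of topic-to-cosine-similarity scores gives a top-5 list of covered topics.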

Video-specific Features

Video-specific features are extracted and included in the dataset. Most of the features in this category are motivated by prior analyses of engagement in video lectures (5).


There are several target labels available in the VLEngagement dataset. These target labels are created by aggregating the explicit and implicit feedback measures available in the repository. The labels fall into three different types of quantification of learners' subjective assessment of a video lecture. The labels available with the dataset are outlined in Table 2:

Table 2: Labels in VLEngagement dataset with their variable type (Continuous vs. Categorical), value interval and category.

| Type | Label | Range Interval | Category |
| --- | --- | --- | --- |
| cont. | Mean Star Rating | [1,5) | Explicit Rating |
| cont. | View Count | (5,∞) | Popularity |
| cont. | SMNET | [0,1) | Watch Time |
| cont. | SANET | [0,1) | Watch Time |
| cont. | Std. of NET | (0,1) | Watch Time |
| cont. | Number of User Sessions | (5,∞) | Watch Time |

Explicit Rating

In terms of rating labels, Mean Star Rating is provided for the video lecture using a star rating scale from 1 to 5 stars. As expected, explicit ratings are scarce and thus only populated in a subset of resources (1250 lectures). Lecture records are labelled with -1 where star rating labels are missing. The data source does not provide access to ratings from individual users. Instead, only the aggregated average rating is available.
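Because missing ratings are stored as -1, they should be filtered out before the rating is used as a target. A minimal sketch, assuming rows are dictionaries and that the rating column is called mean_star_rating (a hypothetical name; check the CSV header for the actual one):

```python
MISSING_RATING = -1.0  # sentinel used for lectures without star ratings


def rated_lectures(rows, rating_key="mean_star_rating"):
    """Keep only the rows that carry a real star rating.

    rating_key is a hypothetical column name used for illustration.
    """
    return [row for row in rows if float(row[rating_key]) != MISSING_RATING]
```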


Popularity

A popularity-based target label is created by extracting the View Count of the lectures. The total number of views for each video lecture as of February 17, 2018 is extracted from the metadata and provided with the dataset.

Watch Time/Engagement

The majority of learner engagement labels in the VLEngagement dataset are based on watch time. We aggregate the user view logs and use the Normalised Engagement Time (NET) to compute the Median of Normalised Engagement (MNET), as it has been proposed as the gold standard for engagement with educational materials in previous work (5). We also calculate the Average of Normalised Engagement (ANET).
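The aggregation can be illustrated with a short sketch. It assumes that NET is the watch time of a session divided by the lecture duration, capped at 1, and that the standard deviation is the population standard deviation; the function names and dictionary keys are hypothetical and the dataset's exact computation may differ.

```python
from statistics import mean, median, pstdev


def normalised_engagement_time(watch_seconds, duration_seconds):
    """Watch time as a fraction of lecture duration, capped at 1 (assumption)."""
    return min(watch_seconds / duration_seconds, 1.0)


def engagement_labels(watch_times, duration_seconds):
    """Aggregate per-session watch times into lecture-level labels."""
    nets = [normalised_engagement_time(w, duration_seconds) for w in watch_times]
    return {
        "mnet": median(nets),      # Median of Normalised Engagement
        "anet": mean(nets),        # Average of Normalised Engagement
        "std_net": pstdev(nets),   # population std. dev. (assumption)
        "sessions": len(nets),     # number of user sessions
    }
```

The median is less sensitive than the mean to a few very short or very long sessions, which is one reason it is a natural choice for a population-level engagement label.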

VLEngagement 12k Dataset

The VLEngagement 12k Dataset is the latest addition of video lecture engagement data to this collection. This dataset contains all the English lectures from our previous release, together with additional lectures.

Lecture Duration Distribution

Lecture duration has been shown to be one of the most influential features when it comes to engagement with video lectures. Similar to the observations of our previous work (2), the new dataset also has a bimodal distribution for duration. The density plot below presents this.

Lecture Categories

There are lectures belonging to diverse topic categories in the dataset. For preserving anonymity, we have grouped these lectures into stem and misc groups. The original data source has around 21 categories on the top level of which the distribution is presented below.

Although the majority of the lectures belong to the Computer Science category, the dataset also covers a diverse set of other categories. The predictive performance on non-CS lectures has also been empirically tested.

context_agnostic_engagement Module

This section contains the code that enables the research community to work with the VLEngagement dataset. The folder structure in this section logically separates the code into three modules.


This section contains the programming logic of the functions used for feature extraction. The main use of this module is when one is interested in populating the features for their own lecture corpus using the exact programming logic used to populate VLEngagement data. Several files with feature extraction related functions are found in this module.

  • Internal functions relevant to making API calls to the Wikifier.
  • Internal functions relevant to utility functions for handling text.
  • content_based_features: Functions and logic associated with extracting content-based features.
  • wikipedia_based_features: Functions and logic associated with extracting Wikipedia-based features.


This module includes the helper tools that are useful in working with the dataset. The two main submodules contain helper functions relating to evaluation and input-output operations.

  • evaluation_metrics: contains the helper functions to compute Root Mean Square Error (RMSE), Spearman's Rank Order Correlation Coefficient (SROCC) and Pairwise Ranking Accuracy (Pairwise).
  • io_utils: contains the helper functions that are required for loading and manipulating the dataset.


This module contains the Python scripts that have been used to create the current baseline. Currently, regression models have been proposed as baseline models for the tasks. The scripts in models/regression/ can be used to reproduce the baseline performance of the Random Forests (RF) models.

