Deduplication: A Technical Overview

As an EdTech platform, Embibe curates and manages a huge pool of learning objects that can be served to students to fulfil their learning requirements. This pool primarily holds content like videos, explainers, and interactive learning elements that teach academic concepts. It also contains questions that can be bundled together intelligently to provide gamified practice and test experiences. User engagement with practice and tests gives us crucial signals about a student's academic performance, behaviour, test-taking skills, and effort, which help us drive the user journey and help the student unlock their maximum potential. Given the importance of the practice and test features, maximising user engagement and retention on them is a priority for us.

The pool of questions is prepared from various sources: in-house faculty, subject matter experts, academic consultants, and various other personnel are involved in this process. The pool also contains questions from renowned textbooks and reference materials. Given the number of entities contributing to the content pool and the importance of the content in driving engagement, it becomes necessary to keep track of content quality. Content curation at scale involves various quality issues, like Content Duplication, Question Correctness Issues, Incomplete Questions, and Incorrect Meta Tagging, to name a few. In this article, we discuss the Content Duplication issue and the intelligent system used at Embibe to tackle it.

Content Duplication and Resolution

Content Duplication among test and practice questions is one of the issues that adversely impacts user engagement. Think of Facebook or Instagram repeatedly showing the same video or image while a user scrolls: it hampers engagement, and at worst the user leaves the platform for good. Similarly, if the same question is served to a student within the same practice or test session, it will certainly contribute to user drop-off.

At Embibe, to deal with this issue, we have employed a hybrid approach that combines syntactic (edit-distance-based) measures with deep-learning-based dense vector similarities (using the ResNet-18 convolutional neural network architecture) to identify duplicate questions. We implement the deduplication pipeline on Elasticsearch (Lucene), using its full-text queries on textual content and its more recent script score queries on dense vector fields.
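
To make the retrieval side concrete, the sketch below shows the two query types in Python with the official Elasticsearch client. It is an illustrative sketch, not Embibe's production code: the index name ("questions"), the field names ("question_text", "image_embedding"), and the embedding size are assumptions.

# A minimal sketch, assuming a "questions" index with a "question_text"
# text field and an "image_embedding" dense_vector field (e.g. 512-d).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def text_candidates(query_text, k=10):
    """Full-text (match) query over the question text."""
    body = {
        "size": k,
        "query": {
            "match": {
                "question_text": {
                    "query": query_text,
                    "fuzziness": "AUTO",  # tolerates small edit-distance variations
                }
            }
        },
    }
    return es.search(index="questions", body=body)["hits"]["hits"]

def image_candidates(query_vector, k=10):
    """script_score query: cosine similarity over the dense vector field."""
    body = {
        "size": k,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # cosineSimilarity returns [-1, 1]; +1.0 keeps scores non-negative
                    "source": "cosineSimilarity(params.query_vector, 'image_embedding') + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        },
    }
    return es.search(index="questions", body=body)["hits"]["hits"]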

Our learning objects (questions) contain textual information (question text, answer text) as well as images (figures, diagrams, etc.), and the pipeline considers both when identifying duplicate counterparts in the content pool. We have also built a real-time utility around the same approach that prevents duplicate questions from being created and ingested into the system; it acts as a gatekeeper for deduplication.
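
For the image side, a dense embedding can be obtained from ResNet-18 by dropping its final classification layer and keeping the 512-dimensional pooled features. The sketch below shows one common way to do this with torchvision; the ImageNet preprocessing values are the library's standard ones, assumed here rather than taken from Embibe's pipeline.

# A minimal sketch of extracting a dense image embedding with ResNet-18.
import torch
from torchvision import models, transforms
from PIL import Image

# Drop the final classification layer; keep the 512-d pooled features.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # standard ImageNet stats
                         std=[0.229, 0.224, 0.225]),
])

def image_embedding(path):
    """Return a 512-d vector suitable for a dense_vector field."""
    img = Image.open(path).convert("RGB")
    with torch.no_grad():
        feats = backbone(preprocess(img).unsqueeze(0))  # shape (1, 512, 1, 1)
    return feats.flatten().tolist()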

Text semantic similarity could be further enhanced by using knowledge-aware models and by extracting interpretable information from deep learning models [1][2]. Explainable models could foster academicians' trust in the models' outcomes [3].

We summarise this pipeline in the Data Flow Diagram below:

Threshold Selection

For the content deduplication pipeline, threshold selection and tuning are at the core of the problem: the threshold separates merely similar, non-duplicate questions from true duplicates. To identify appropriate thresholds, we asked subject matter experts to prepare a labelled dataset: given an anchor question and a list of candidates, they marked each pair as Duplicate or Not-Duplicate. For candidate generation, the top k candidates were selected from the content pool using Elasticsearch's full-text queries and script score queries on the image dense vectors, as sketched below.
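
A hypothetical helper for this candidate-generation step, reusing the text_candidates and image_candidates sketches above (the anchor's field names are assumptions):

def generate_candidates(anchor, k=10):
    """Collect top-k candidates from both retrieval routes for one anchor."""
    candidates = {}
    for hit in text_candidates(anchor["question_text"], k=k):
        candidates[hit["_id"]] = hit
    if anchor.get("image_vector") is not None:
        for hit in image_candidates(anchor["image_vector"], k=k):
            candidates.setdefault(hit["_id"], hit)
    # Each (anchor, candidate) pair is sent to a subject matter expert
    # to be labelled Duplicate or Not-Duplicate.
    return [(anchor, hit) for hit in candidates.values()]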

To select the right threshold value, we ran a grid search over candidate thresholds (range: 0.5 to 1.0, step size: 0.05) with the objective of maximising accuracy against the labelled dataset: top k candidates were generated for each anchor question, accuracy was measured at each threshold, and the similarity-score threshold yielding the maximum accuracy was chosen as the final value.
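
A minimal sketch of this grid search, assuming each labelled pair has already been reduced to a normalised similarity score in [0, 1] together with the experts' Duplicate / Not-Duplicate label:

import numpy as np

def tune_threshold(labelled_pairs):
    """labelled_pairs: list of (similarity_score, is_duplicate) tuples."""
    best_threshold, best_accuracy = None, -1.0
    for t in np.arange(0.5, 1.0 + 1e-9, 0.05):  # range 0.5..1.0, step 0.05
        correct = sum((score >= t) == is_dup for score, is_dup in labelled_pairs)
        accuracy = correct / len(labelled_pairs)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = t, accuracy
    return best_threshold, best_accuracy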

Benchmarking Process

We benchmarked the duplicate-identification process described above against a held-out labelled set. The table below gives the specifics:

Data Set                                                                 | Size | Accuracy (marked correctly)
Labelled Question Pairs containing: Only Text, Text + Image, Only Image  | 5114 | 83.1% (4250)
Labelled Question Pairs containing: Text + Image, Only Image             | 2710 | 80.1% (2193)

Conclusion and Future Work

Though 80%+ accuracy is sufficient for many machine learning tasks, the scale at which Embibe operates requires more accurate models to further reduce manual verification. Building on current developments in semantic-similarity-based text mining, Embibe is developing a dense-vector (image and text embedding) content similarity algorithm with a target of 90%+ accuracy.

References:

[1] Faldu, Keyur, Amit Sheth, Prashant Kikani, and Hemang Akabari. “KI-BERT: Infusing Knowledge Context for Better Language and Domain Understanding.” arXiv preprint arXiv:2104.08145 (2021).

[2] Gaur, Manas, Keyur Faldu, and Amit Sheth. “Semantics of the Black-Box: Can knowledge graphs help make deep learning systems more interpretable and explainable?.” IEEE Internet Computing 25, no. 1 (2021): 51-59.

[3] Gaur, Manas, Ankit Desai, Keyur Faldu, and Amit Sheth. “Explainable AI Using Knowledge Graphs.” In ACM CoDS-COMAD Conference. 2020.