{"id":995,"date":"2023-01-20T07:10:20","date_gmt":"2023-01-20T07:10:20","guid":{"rendered":"https:\/\/www.embibe.com\/in-en\/?post_type=positions&#038;p=995"},"modified":"2023-01-20T07:10:41","modified_gmt":"2023-01-20T07:10:41","slug":"deduplication-technical-overview","status":"publish","type":"positions","link":"https:\/\/www.embibe.com\/in-en\/joinus\/deduplication-technical-overview\/","title":{"rendered":"Deduplication: Technical Overview"},"content":{"rendered":"<h4>Introduction:<\/h4>\n<p>As an Ed-tech platform, Embibe curates and manages a huge pool of learning objects that can be served to students to fulfil their learning requirements. This content pool primarily holds content such as videos, explainers, and interactive learning elements that teach the user any academic concept. It also contains questions that can be bundled together intelligently to provide gamified practice and test experiences. At Embibe, user engagement under the practice and test storyline provides us with crucial academic, behavioural, test-taking, test-level, and user-effort-related specifics that help us drive the user journey and help the student unlock her maximum potential. Given the importance of the practice and test features, we aim for maximum user engagement and retention.<br>The pool of questions is prepared from various sources: in-house faculty and Subject Matter Experts, Academic Consultants, and various other personnel are involved in this process. The pool also contains questions from renowned textbooks and reference materials. Given the number of entities contributing to the content pool and the importance of the content in driving engagement, it becomes necessary to keep track of content quality. Content curation at scale involves various quality-related issues, such as Content Duplication, Question Correctness issues, Incomplete Questions, and Incorrect Meta Tagging, to name a few. 
In this article, we discuss the Content Duplication issue and the intelligent system used at Embibe to tackle it.<\/p>\n<h4>Content Duplication and Resolution:<\/h4>\n<p>Duplicated content (Test\/Practice Problems\/Questions) in the system is one of the issues that adversely impact user engagement. To understand it better, compare it with Facebook or Instagram displaying the same video\/image repeatedly while a user is busy scrolling: admit it, it hampers user engagement, and at worst the user can bow out of the platform forever. Similarly, if the same question gets served to the student more than once in the same practice or test session, it will certainly contribute to user drop-off.<br>At Embibe, to deal with this issue, we have employed a hybrid approach that combines syntax-based (edit-distance) measures with Deep Learning-based (ResNet-18 Convolutional Neural Network architecture) dense vector similarities to identify duplicate questions. We use Elasticsearch\u2019s (Lucene) core functionalities, such as Full-Text Queries on textual content and the more recent Script Score queries on Dense Vector Fields, to implement the deduplication pipeline. Our learning objects (questions) contain Textual (question text, answer text) as well as Image\/Pictorial information (Figures, Diagrams, etc.), and the pipeline considers both to identify the exact duplicate counterparts in the content pool. 
We have also enabled a real-time utility built around the same approach to prevent the creation and ingestion of duplicate questions into the system; it works as a gatekeeper for deduplication.<\/p>\n<h4>We summarise this pipeline in the Data Flow Diagram below:<\/h4>\n<figure>\n<img decoding=\"async\" src=\"https:\/\/indicmicrosites-assets.embibe.com\/in-en\/wp-content\/uploads\/2023\/01\/09114801\/WhatsApp-Image-2023-01-09-at-5.15.37-PM.jpeg\"><\/figure>\n<h4>Threshold Selection:<\/h4>\n<p>For the content deduplication pipeline, threshold selection\/tuning is at the core of the problem. It helps separate merely similar, non-duplicate questions from true duplicates. To identify appropriate thresholds, we took the help of Subject Matter Experts in preparing a labelled dataset: they were given an anchor question and a list of candidates, and were asked to mark each anchor-candidate pair as Duplicate or Not-Duplicate. For candidate generation, the top-k candidates were selected from the content pool using Elasticsearch\u2019s Full-Text queries and Script Score queries on the image dense vectors.<br>To select the right threshold value, a grid search was run over candidate threshold values (range: 0.5 to 1.0, step-size: 0.05), with the objective of maximising the accuracy score against the labelled dataset. Top-k candidates were generated for the anchor questions, and the accuracy was recorded at each threshold value. 
The similarity score threshold that yielded the maximum accuracy was chosen as the final threshold value.<\/p>\n<h4>Benchmarking Process:<\/h4>\n<p>The duplicate identification process described above was benchmarked against the hold-out labelled set; the table below gives the specifics:<\/p>\n\n\n<table style=\"margin-bottom: 1rem;\"><tbody><tr><th>Data<\/th><th>Set Size<\/th><th>Accuracy (pairs marked correctly)<\/th><\/tr><tr><td>Labelled Question Pairs containing: Only Text, Text + Image, Only Image<\/td><td>5114<\/td><td>83.1% (4250)<\/td><\/tr><tr><td>Labelled Question Pairs containing: Text + Image, Only Image<\/td><td>2710<\/td><td>80.1% (2193)<\/td><\/tr><\/tbody><\/table>\n\n\n<h4>Future Improvements:<\/h4>\n\n\n<p>A few changes can be made to improve the performance of the pipeline.<\/p>\n\n\n\n<ol>\n<li>Using trained language models, we can introduce semantic representations for the textual content and for the CNN-based image part of the content. With this, we can identify not only syntactically identical duplicates but also semantically equivalent (paraphrased) ones in the content pool.<\/li>\n\n\n\n<li>The candidate selection, thresholding, and graph clustering processes operate in bulk and repeatedly; this can be optimized by using an incremental graph clustering algorithm. 
This addition can make the pipeline entirely incremental.<\/li>\n<\/ol>\n","protected":false},"template":"","yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v21.1 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Deduplication: Technical Overview - EMBIBE - The most powerful AI-powered learning platform<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.embibe.com\/in-en\/joinus\/deduplication-technical-overview\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Deduplication: Technical Overview - EMBIBE - The most powerful AI-powered learning platform\" \/>\n<meta property=\"og:description\" content=\"Introduction: As an Ed-tech platform, Embibe curates and manages a huge pool of learning objects which can be served to the students to fulfil their learning requirements. This content pool primarily holds content like videos, explainers, interactive learning elements to educate the user with any academic concept. Also, it contains questions that can be bundled together intelligently to provide gamified practice and test experiences. 
At Embibe, the user engagement under the practice and test storyline provide us with the crucial....\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.embibe.com\/in-en\/joinus\/deduplication-technical-overview\/\" \/>\n<meta property=\"og:site_name\" content=\"EMBIBE - The most powerful AI-powered learning platform\" \/>\n<meta property=\"article:modified_time\" content=\"2023-01-20T07:10:41+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/indicmicrosites-assets.embibe.com\/in-en\/wp-content\/uploads\/2023\/01\/09114801\/WhatsApp-Image-2023-01-09-at-5.15.37-PM.jpeg\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"4 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.embibe.com\/in-en\/joinus\/deduplication-technical-overview\/\",\"url\":\"https:\/\/www.embibe.com\/in-en\/joinus\/deduplication-technical-overview\/\",\"name\":\"Deduplication: Technical Overview - EMBIBE - The most powerful AI-powered learning platform\",\"isPartOf\":{\"@id\":\"https:\/\/www.embibe.com\/in-en\/#website\"},\"datePublished\":\"2023-01-20T07:10:20+00:00\",\"dateModified\":\"2023-01-20T07:10:41+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/www.embibe.com\/in-en\/joinus\/deduplication-technical-overview\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.embibe.com\/in-en\/joinus\/deduplication-technical-overview\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.embibe.com\/in-en\/joinus\/deduplication-technical-overview\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.embibe.com\/in-en\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Deduplication: Technical 
Overview\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.embibe.com\/in-en\/#website\",\"url\":\"https:\/\/www.embibe.com\/in-en\/\",\"name\":\"EMBIBE - The most powerful AI-powered learning platform\",\"description\":\"Just another WordPress site\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.embibe.com\/in-en\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Deduplication: Technical Overview - EMBIBE - The most powerful AI-powered learning platform","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.embibe.com\/in-en\/joinus\/deduplication-technical-overview\/","og_locale":"en_US","og_type":"article","og_title":"Deduplication: Technical Overview - EMBIBE - The most powerful AI-powered learning platform","og_description":"Introduction: As an Ed-tech platform, Embibe curates and manages a huge pool of learning objects which can be served to the students to fulfil their learning requirements. This content pool primarily holds content like videos, explainers, interactive learning elements to educate the user with any academic concept. Also, it contains questions that can be bundled together intelligently to provide gamified practice and test experiences. 
At Embibe, the user engagement under the practice and test storyline provide us with the crucial....","og_url":"https:\/\/www.embibe.com\/in-en\/joinus\/deduplication-technical-overview\/","og_site_name":"EMBIBE - The most powerful AI-powered learning platform","article_modified_time":"2023-01-20T07:10:41+00:00","og_image":[{"url":"https:\/\/indicmicrosites-assets.embibe.com\/in-en\/wp-content\/uploads\/2023\/01\/09114801\/WhatsApp-Image-2023-01-09-at-5.15.37-PM.jpeg"}],"twitter_card":"summary_large_image","twitter_misc":{"Est. reading time":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/www.embibe.com\/in-en\/joinus\/deduplication-technical-overview\/","url":"https:\/\/www.embibe.com\/in-en\/joinus\/deduplication-technical-overview\/","name":"Deduplication: Technical Overview - EMBIBE - The most powerful AI-powered learning platform","isPartOf":{"@id":"https:\/\/www.embibe.com\/in-en\/#website"},"datePublished":"2023-01-20T07:10:20+00:00","dateModified":"2023-01-20T07:10:41+00:00","breadcrumb":{"@id":"https:\/\/www.embibe.com\/in-en\/joinus\/deduplication-technical-overview\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.embibe.com\/in-en\/joinus\/deduplication-technical-overview\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.embibe.com\/in-en\/joinus\/deduplication-technical-overview\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.embibe.com\/in-en\/"},{"@type":"ListItem","position":2,"name":"Deduplication: Technical Overview"}]},{"@type":"WebSite","@id":"https:\/\/www.embibe.com\/in-en\/#website","url":"https:\/\/www.embibe.com\/in-en\/","name":"EMBIBE - The most powerful AI-powered learning platform","description":"Just another WordPress 
site","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.embibe.com\/in-en\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"}]}},"_links":{"self":[{"href":"https:\/\/www.embibe.com\/in-en\/wp-json\/wp\/v2\/positions\/995"}],"collection":[{"href":"https:\/\/www.embibe.com\/in-en\/wp-json\/wp\/v2\/positions"}],"about":[{"href":"https:\/\/www.embibe.com\/in-en\/wp-json\/wp\/v2\/types\/positions"}],"wp:attachment":[{"href":"https:\/\/www.embibe.com\/in-en\/wp-json\/wp\/v2\/media?parent=995"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}