Abstract:
|
Getting correctly labelled data is an important preliminary stage for many supervised machine learning problems, but it can also be very difficult to carry out. Obtaining a good model can require tens or hundreds of thousands of examples, and examples usually do not come with labels. Active Learning is a methodology that can radically accelerate the labelling process and reduce costs for many machine learning projects. This methodology prioritises the data that is most confusing to the model and requests just those labels, instead of collecting all the labels for all the data at once. The aim of this work is to implement an Active Learning methodology that helps to automate the capture of relevant information from Key Investor Information Documents (KIIDs). These documents aim to help investors understand the nature and key risks of investment products in order to make a more informed investment decision. Until now, data extraction was done through a specific labelling application developed by the company where this project was carried out. This application is able to extract pieces of text from KIIDs, to attach their surrounding information and to query a human oracle (or expert) for the correct label (whether or not the piece of text is associated with a specific type of information). The database obtained after this process is used in a supervised machine learning task to obtain a model that recognises the type of information contained in a piece of text. Incorporating the Active Learning methodology into this application will drastically reduce the query phase, i.e. the most expensive part of the whole process. The machine learning algorithm used for classification is the well-known Support Vector Machine.
The examples are translated into binary vectors through uni-grams and bi-grams, which means generating a vocabulary of one-word and two-word sequences and identifying whether or not the text feature contains each of these N-grams (N = 1 or 2). Three of the most common Active Learning strategies have been implemented in order to compare them with each other and with the passive learning method. The Active Learning strategies implemented are uncertainty sampling, query by committee and density-weighted methods. All of them are compared with the passive learning method, random sampling. 10-fold cross-validation is used to assess each of the approaches implemented, and three evaluation measures are applied: recall, accuracy and computational time. |
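
As a minimal sketch of the binary N-gram representation and of uncertainty sampling described above (not the implementation used in this work), the following Python uses only the standard library. The function names, the whitespace tokenisation, the lower-casing and the toy texts are assumptions made for illustration. The `decision_function` argument stands in for the signed distance to the separating hyperplane that a trained Support Vector Machine would supply; `toy_score` is a hypothetical placeholder for it.

```python
from typing import Callable, List, Sequence


def ngrams(text: str, n: int) -> List[str]:
    """Return the word n-grams of `text` (lower-cased, split on whitespace)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(texts: Sequence[str]) -> List[str]:
    """Collect every uni-gram and bi-gram seen in `texts`, in sorted order."""
    vocab = set()
    for text in texts:
        for n in (1, 2):
            vocab.update(ngrams(text, n))
    return sorted(vocab)


def to_binary_vector(text: str, vocabulary: Sequence[str]) -> List[int]:
    """Mark each vocabulary entry 1 if it occurs in `text`, else 0."""
    present = set(ngrams(text, 1)) | set(ngrams(text, 2))
    return [1 if term in present else 0 for term in vocabulary]


def most_uncertain(
    pool: Sequence,
    decision_function: Callable[[object], float],
    k: int = 1,
) -> List[int]:
    """Uncertainty sampling: indices of the k pool items whose decision
    value is closest to zero, i.e. closest to the decision boundary."""
    ranked = sorted(range(len(pool)), key=lambda i: abs(decision_function(pool[i])))
    return ranked[:k]


# Toy texts, invented for illustration only.
texts = ["ongoing charges 1.5%", "entry charge none", "risk indicator 5"]
vocabulary = build_vocabulary(texts)
vectors = [to_binary_vector(t, vocabulary) for t in texts]


def toy_score(vector: List[int]) -> float:
    """Hypothetical stand-in for an SVM decision value."""
    return sum(vector) - 2.0


# Index of the pool item that would be sent to the human oracle next.
query_indices = most_uncertain(vectors, toy_score, k=1)
print(query_indices)
```

Random sampling, the passive baseline, would instead draw the indices with `random.sample(range(len(pool)), k)`, so the two approaches differ only in how the next items to label are chosen.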