Abstract:
|
Mediktor is a clinically validated symptom-checker that provides an accurate pre-diagnosis based on a personalized interactive questionnaire that users answer. Learning to predict the most fitting next question to ask a user, given the previously answered ones, is key to adding value to Mediktor's evaluator system. Current solutions for similar tasks, i.e. predicting a word given a specific context, are based on representing words in vector spaces. These vectors are called word embeddings*, and the methods that learn them have been shown to give outstanding results in Natural Language Processing tasks. For this particular project, the Continuous Bag of Words (CBOW) model from the Word2Vec models by Mikolov et al. (1) was the most suitable approach. It has been demonstrated (2) that these models are able to learn excellent vector representations of words; the challenge is to make them work for datasets with great variability and complexity, as in our case. This work has been valuable for understanding the similarity between groups of questions that are asked together, across every question in the vocabulary. As this is the first attempt to apply machine learning techniques to learn similarities among questions for this particular dataset, the results are satisfactory. The predictions obtained for the testing data and the visualization of the word embeddings in the multidimensional space are reasonable. During graphical validation of the model, we found that, as expected, the learned word embeddings form clusters (groups of similar questions) in the vector space. Nevertheless, further research should be done on this subject to optimize the training and obtain better prediction results.

*Word embeddings are representations of words in a lower-dimensional vector space. They make it possible to learn features about words and how they interact with each other; words with similar meanings should have similar representations.