Abstract:
|
Easy access to a wide range of information available online encourages people to explore
it in search of ever more interesting content.
This opportunity often creates the problem of finding interesting and relevant information
in a sea of knowledge. This is often referred to as the
information overload problem, which becomes harder to deal with
as the amount of information available online grows. In this thesis, one source of
information is exploited and organized in such a way that the task of discovering
new content is made easier.
We use Really Simple Syndication (RSS) as our source of information and two
methods to categorize it: document clustering with K-Means and Latent Dirichlet
Allocation (LDA). We use the textual information that the RSS feeds contain; each
feed usually covers a specific set of topics. Our first goal is to perform
document clustering on the data in order to generate meaningful clusters, using
natural language processing (NLP) techniques to preprocess the data.
Our second goal is to analyze the clustered RSS feeds and exploit the similarities
between the documents to generate meaningful user models based on user feed
subscriptions. The third goal is to provide relevant recommendations based on
the user models we have learned. We combine the current state-of-the-art methods
and present novel methods to compare feeds. We exploit WordNet shallow
ontologies in our novel method to create generalized representations of the feeds.
The final goal is to develop a functional application that can leverage the methods
we developed with the help of machine learning libraries. The method we
propose is a combination of document clustering techniques, text similarity, feed
modeling, and a recommendation system.
The results of our experiments show that K-Means clustered documents combined
with recommendations based on the feed contents yield the best results.
Using WordNet to measure the similarity of words also provides promising results.
Further exploring the advantages of using semantic similarities would be
an interesting research topic for document similarity measures. |