Abstract:
|
The objective of this project is to create a program that processes a set of ancient
Chinese poems, reading them from a text file and storing them in data structures, so
that they can be used to find sentences similar to a text entered by the user. To
achieve this, the poems are broken into sentences, which are clustered (always keeping
track of which poem each sentence belongs to), using tf-idf scores to establish the
similarity between them.
Similar sentences are found by comparing the words they contain with the words of the
provided text.
The clusters are calculated with a modification of hierarchical clustering that
follows the same principles but limits each cluster to a maximum of four sentences.
This way, a small set of similar sentences can be provided to the user instead of
just one sentence similar to the input text. Four clusters are returned: those to
which the most similar sentences belong. |
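
The pipeline summarized above can be illustrated with a minimal Python sketch. It is not the project's actual implementation: the character-level tokenization, the smoothed idf formula, the average-linkage merge rule, the stop condition, and all function names are assumptions made for illustration. The two poems are sample data only.

```python
import math
from collections import Counter

MAX_CLUSTER_SIZE = 4  # cap on sentences per cluster, as stated in the abstract
TOP_CLUSTERS = 4      # number of clusters returned to the user


def tokenize(sentence):
    # Assumption: every non-space character counts as one word.
    return [ch for ch in sentence if not ch.isspace()]


def build_idf(texts):
    # Smoothed inverse document frequency, one "document" per sentence.
    n = len(texts)
    df = Counter(tok for t in texts for tok in set(tokenize(t)))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def tfidf(sentence, idf):
    # Sparse tf-idf vector; tokens unseen in the corpus are ignored.
    tf = Counter(tokenize(sentence))
    return {tok: freq * idf[tok] for tok, freq in tf.items() if tok in idf}


def cosine(a, b):
    dot = sum(w * b[tok] for tok, w in a.items() if tok in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def capped_clustering(vectors, max_size=MAX_CLUSTER_SIZE):
    # Agglomerative clustering that refuses any merge exceeding max_size.
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Assumption: average pairwise similarity between the two clusters.
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while True:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                if len(clusters[x]) + len(clusters[y]) > max_size:
                    continue
                score = linkage(clusters[x], clusters[y])
                # Assumption: clusters with no shared words are never merged.
                if score > 0 and (best is None or score > best[0]):
                    best = (score, x, y)
        if best is None:
            return clusters
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected


def find_similar(query, records, vectors, clusters, idf, top=TOP_CLUSTERS):
    # Rank clusters by their single most similar sentence to the query.
    q = tfidf(query, idf)

    def best_match(cluster):
        return max(cosine(q, vectors[i]) for i in cluster)

    ranked = sorted(clusters, key=best_match, reverse=True)
    return [[records[i] for i in c] for c in ranked[:top]]


# Each record keeps the poem title next to the sentence.
records = [
    ("静夜思", "床前明月光"), ("静夜思", "疑是地上霜"),
    ("静夜思", "举头望明月"), ("静夜思", "低头思故乡"),
    ("春晓", "春眠不觉晓"), ("春晓", "处处闻啼鸟"),
    ("春晓", "夜来风雨声"), ("春晓", "花落知多少"),
]
texts = [text for _, text in records]
idf = build_idf(texts)
vectors = [tfidf(text, idf) for text in texts]
clusters = capped_clustering(vectors)
for group in find_similar("明月", records, vectors, clusters, idf):
    print(group)
```

Storing each sentence as a (poem title, sentence) pair is one simple way to keep track of the poem it came from. Ranking clusters by their best-matching sentence is likewise only one possible reading of "the clusters to which the most similar sentences belong".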