Abstract:
|
Heterogeneous sources of information, such as images, videos, text
and metadata, are often used to describe different or complementary views of
the same multimedia object, especially in the online news domain and in large
annotated image collections. The retrieval of multimedia objects, given a
multimodal query, requires the combination of several sources of information in
an efficient and scalable way. To this end, we present a novel unsupervised
framework for multimodal fusion of visual and textual similarities, which
are based on visual features, visual concepts and textual metadata, integrating
non-linear graph-based fusion and Partial Least Squares Regression. The
fusion strategy is based on the construction of a multimodal contextual similarity
matrix and the non-linear combination of relevance scores from query-based
similarity vectors. Our framework can employ more than two modalities and
high-level information without an increase in memory complexity compared
to state-of-the-art baseline methods. The experimental comparison is
performed on three public multimedia collections for the multimedia retrieval task.
The results show that the proposed method outperforms the baseline
methods in terms of Mean Average Precision and Precision@20. |