To access the full-text documents, please follow this link: http://hdl.handle.net/2117/119271

Approximating the schema of a set of documents by means of resemblance
Abelló Gamazo, Alberto; Palol, Xavier de; Hacid, Mohand-Saïd
Universitat Politècnica de Catalunya. Departament d'Enginyeria de Serveis i Sistemes d'Informació; Universitat Politècnica de Catalunya. inSSIDE - integrated Software, Service, Information and Data Engineering
The WWW contains a huge number of documents. Some of them share the same subject but are generated by different people or even by different organizations. A semi-structured model allows sharing documents that do not have exactly the same structure. However, it does not make such heterogeneous documents easier to understand. In this paper, we offer a characterization and an algorithm to obtain a representative (in terms of a resemblance function) of a set of heterogeneous semi-structured documents. We approximate the representative so that the resemblance function is maximized. Then, the algorithm is generalized to deal with repetitions and different classes of documents. Although an exact representative could always be found using an unlimited number of optional elements, it would cause an overfitting problem. The size of an exact representative for a set of heterogeneous documents may even make it useless. Our experiments show that users find smaller representatives easier and faster to work with, which even compensates for the loss in the approximation.
Peer Reviewed
-UPC subject areas::Computer science::Information systems
-Data mining
-Automatic data collection systems
-Document
-Design
-XML
-Automatic classification
Article - Submitted version
Article
Springer
         
