Yiming Yang, Sean Slattery and Rayid Ghani
Journal of Intelligent Information Systems—Special Issue on Automatic Text Categorization, 2001
Abstract: Hyperlinks, HTML tags, category labels distributed over linked documents, and meta data extracted from related web sites all provide rich information for classifying hypertext documents. How to appropriately represent that information and automatically learn statistical patterns for solving hypertext classification problems is an open question. This paper seeks a principled approach to providing the answers. Specifically, we define five hypertext regularities which may (or may not) hold in a particular application domain, and whose presence (or absence) may significantly influence the optimal design of a classifier. Using three hypertext datasets and three well-known learning algorithms (Naive Bayes, Nearest Neighbor, and First Order Inductive Learner), we examine these regularities in different domains, and compare alternative ways to exploit them. Our experimental results suggest that a naive use of linked pages, such as treating the words in the linked neighborhood of a page as local to that page, can be more harmful than helpful when the linked neighborhood is highly "noisy". This is especially true if the classifier is not sufficiently robust in discriminating informative words from noisy ones. It is also evident in our results that extracting meta data (when available) from related web sites can be extremely useful for improving classification accuracy. Finally, the relative performance of the classifiers being tested provides insights into their strengths and limitations for solving classification problems involving diverse and often noisy web pages.
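To make the "naive use of linked pages" concrete, the following is a minimal sketch of what treating the words of a linked neighborhood as local to a page might look like with a bag-of-words representation. The page texts, the link map, and the function names (`local_features`, `neighborhood_features`) are hypothetical and chosen only for illustration; this is not the representation or the code used in the paper.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, whitespace-tokenized word counts (an assumed, simplistic tokenizer)."""
    return Counter(text.lower().split())

def local_features(page_id, pages):
    """Features drawn only from the page's own text."""
    return bag_of_words(pages[page_id])

def neighborhood_features(page_id, pages, links):
    """Naive variant: words of linked pages are added as if they were local."""
    bag = bag_of_words(pages[page_id])
    for neighbor_id in links.get(page_id, []):
        bag += bag_of_words(pages[neighbor_id])
    return bag

# Hypothetical toy data: page "a" links to two topically unrelated pages.
pages = {
    "a": "machine learning course syllabus",
    "b": "campus parking map",
    "c": "football schedule tickets",
}
links = {"a": ["b", "c"]}

local = local_features("a", pages)
naive = neighborhood_features("a", pages, links)
```

In this toy case the naive representation of page "a" contains more words from its unrelated neighbors than from the page itself, which is the kind of noisy neighborhood the abstract warns about.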