Automated Blog Classification: Challenges and Pitfalls


Hong QU, Andrea La Pietra, and Sarah S Poon. 2006. “Automated Blog Classification: Challenges and Pitfalls.” In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, pp. 184-186. Stanford, California: Association for the Advancement of Artificial Intelligence.


Blogs are difficult for humans and machines alike to categorize, because they are written in a capricious style. In the early days of the web, directories maintained by humans could not keep up with the millions of websites; likewise, blog directories cannot keep up with the explosive growth of the blogosphere. This paper investigates the efficacy of using machine learning to categorize blogs. We design a text classification experiment to categorize one hundred and twenty blogs into four topics: personal diary, news, political, and sports. The baseline feature set is unigrams weighted by TF-IDF, which yielded 84% accuracy. We analyze the corpus, the features, and the resulting data. Our analysis leads us to believe that blog taxonomies need to support polyhierarchy: a given blog may be correctly classified under more than one category.
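To make the baseline feature concrete, the following is a minimal sketch of unigram TF-IDF weighting in plain Python. It is not the authors' code: the whitespace tokenizer, the particular TF-IDF variant (relative term frequency times log(N / df)), the function names, and the three-sentence toy corpus are all assumptions chosen for illustration, and the abstract does not say which classifier consumed these features.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive unigram tokenizer (an assumption): lowercase, split on whitespace."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    weight = (count of term in doc / doc length) * log(N / document frequency)
    This is one common TF-IDF variant; the paper may have used another.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens)
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


# Hypothetical toy corpus standing in for blog text.
docs = [
    "the team won the match",
    "the senate passed the bill",
    "today i felt tired",
]
vectors = tfidf_vectors(docs)
```

Under this weighting, a word that appears in every document gets weight zero, while a word confined to one document (such as "team" above) is weighted more heavily than a common word like "the". Vectors of this kind would then be passed to whatever classifier is used.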
Last updated on 09/09/2019