Web Retrieval and Mining

Spring 2025

Description

The Web has become the largest data repository in the world. This course introduces basic and advanced techniques in (1) Web information retrieval (IR): how to search large-scale Web data, and (2) Web mining: how to discover knowledge from the diverse data resources on the Web.

The lectures will cover (1) Web IR, including the fundamentals of modern IR systems, crawling, ranking algorithms, Web page classification and clustering, Chinese IR, multimedia IR, and case studies of search engines, and (2) Web mining, including Web content/text mining, Web structure mining, Web query log mining, information extraction, and taxonomy generation.

Students in this course are expected to read research papers on a topic relevant to Web IR or Web mining, complete a project, and present their work in class.

Instructor

Pu-Jen Cheng

Email: pjcheng@csie.ntu.edu.tw

Homepage: http://www.csie.ntu.edu.tw/~pjcheng

Office hours: R323, 9:00 am ~ 11:00 am, Tuesday

Class Hours: 9:10 am ~ 12:10 pm, Friday

Classroom: R102, CSIE building

Prerequisites: Programming experience is required for the assignments and project.

Readings:

Introduction to Information Retrieval (IIR), by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Cambridge University Press, 2008. (Selected Chapters)

Search Engines: Information Retrieval in Practice, by W. Bruce Croft, Donald Metzler, and Trevor Strohman, 2009. (Selected Chapters)

Information Retrieval: Implementing and Evaluating Search Engines, by Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack, MIT Press, 2010. (Selected Chapters)

Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison-Wesley, 1999. (Selected Chapters)

Mining the Web: Discovering Knowledge from Hypertext Data, by Soumen Chakrabarti, Morgan Kaufmann, 2002. (Selected Chapters)

Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, by Bing Liu, Springer, 2006. (Selected Chapters)

Selected papers (mainly from SIGIR, WWW, CIKM, WSDM, RecSys, ACL, NAACL-HLT, EMNLP, ICLR, NIPS, ICML)

Grading:

Assignments (50%): handwritten + programming

Midterm Exam (20%)

Term Project (30%)