"Interestingness" is the most useful heuristics in my research

Generally speaking, I am interested in exploiting computers to do smart things (things used to be doable only by humans, or even those cannot be done by humans), which generally requires an integration of ideas and approaches from the following areas:

  • Knowledge Discovery and Data Mining (KDD)
  • Natural Language Processing (NLP)
  • Machine Learning (ML)
  • Artificial Intelligence (AI)
  • Social Network Analysis (SNA)

See below for an overview (click for a higher resolution version):

Detailed descriptions about the research projects I have been working on:

"Interestingness" research in Data Mining

Everybody loves interesting things. People enjoy watching interesting movies, meeting with interesting people, working on interesting tasks, etc. However, what after all does this magic term "interesting" really mean? Is there any universal property that makes one thing more interesting than another? Is there any rule of thumb for finding interesting things from data?

These are questions drawing attention from people in different disciplines such as philosophy, psychology and biology. As a computer scientist, I am more interested in designing a general computational model to model interestingness so that one can use it to identify interesting things from data. This is a fairly new research direction in KDD. Please see [Lin2003a, KDDExploration], [Lin2003b, ICDM], [Lin2004, AAAI] for our previous works.

Discovering Abnormal Instances in Complex Networks

Complex networks are everywhere: A social network represents people and their relationship with each other; a biology network could capture the interaction information between genes and proteins, a thesaurus network such as WORDNET contains information about the interaction between words, etc. Abnormal instances play an important role in a complex network. For example, an abnormal individual in a homeland security network can imply potential threat personnel. A kind of food that is abnormally associated to some disease in the food chain can represent the cause or cure of the disease. An abnormal connection between a person and a company can indicate certain insider trading.

A system that is capable of identifying abnormal or suspicious instances in a complex network will then have a wide range of applications in fraud detection, law enforcement, homeland security, and scientific discovery. The goal of this project is to develop an "Electronic Sherlock Holmes" type of tool that suggests abnormal evidences from complex networks to humans. We have developed one of the pioneering algorithms for this purpose, and will keep working on the related problem in the future. Please refer to [Lin 2006, Ph.D. thesis] for more details.

Natural Language Processing

Allen Turing once predicted that the computer should be able to communicate with humans in natural language by the end of 20th century, just like we have seen in tons of cartoons and movies. Unfortunately, this never happened, and we are not even close. Does this mean that we, as the followers of Turing, should quit trying? Surely not. On the contrary, it is exciting to learn that there is so much room we can work on toward this amazing goal (the world simply won't be the same with computers that speak and understand human language). In these years researchers turned to smaller problems and have decent accomplishments. The best example is the success of search engines such as Google. Besides information retrieval, now NLP researchers (including Google) are working on harder, system-level problems (including machine translation, information extraction, summarization, semantic role labeling, etc) with the goal to eventually fulfill Turing's dream.

I am generally interested in a variety of natural language problems such as the ones described above, in particular those of general language model learning and analysis, unsupervised information extraction and word sense disambiguation, semantic analysis, and natural language generation. Please see [Lin 2007] for our unsupervised word sense disambiguation system.

Applications Information Retrieval

Search engine has become a ubiquitous tool in these days. What can we do using a search engine? Can we perform intelligent tasks exploiting a search engine? How can we compare the results of two search engines? How can we improve the searched ranking of a certain pages given keywords? How can we estimate the size of the whole Web? Can a search engine assist us finding plagiarisms?

We have interests to research on how to answer the above questions.

Machine Learning Theory and Application

I am interested in unsupervised learning methods and how they can be integrated with supervised learning methods to alleviate the curse of dimension and the need of large data. In particular how EM can be combined with CRF to perform sequential labeling in an unsupervised manner accurately and efficiently. I am also interested in applying ML theories in challenging environment such as imbalanced data classification, outlier detection and explanation, and graph-based clustering & pattern recognition.

Explanation-based Knowledge Discovery

I am excited to have identified a new research area that requires the integration of NLP and KDD research, which I call the explanation-based knowledge discovery.

One of the most notorious issues that hinder the progress of KDD research is the difficulty of verification. That is to say, it is generally hard to evaluate a knowledge discovery system, since its finding is previously unknown (otherwise it is should not be regarded as "discovery"). To deal with this problem, we propose "Explanation-based Knowledge Discovery", whose goal is to equip a knowledge discovery system the capability of explaining it's findings to the users. The idea is that since it is not possible to verify the results directly, we can try to verify the "process".

The practical goal of this project is to identify a set of popular knowledge discovery algorithms and for each of them to come up with an explanation system that describes the results in a meaningful and understandable manner to the users. Please see [Lin 2006, Ph.D. thesis], [Lin, 2006, CL], [Lin 2004, LinkKDD], [Lin 2004, AAAI] for more details.

Ancient Script Decipherment

I am very interested in applying modern computer science technologies to handle humanity problems (e.g. solve historical mysteries). Deciphering ancient script is an ideal example of applying machine learning and NLP technologies to assist dealing with an philology/archeology problem. The script I have been working on for years is the Hieroglyphic Luwian (see sample). Our intelligent learning program successfully uncovered the linear writing order of this script, according to which we can then move on to the next stage of decipherment, such as word clustering and semantic labeling. This work demonstrates that an intelligent program can sometimes even handle tasks that are believed to be hard for normal people. Besides Luwian, there are still many un-deciphered mysterious ancient scripts (including the notorious Voynich Manuscript) that one can play with. This work has been published in Artificial Intelligence [Lin, 2006, AIJ].


I will be interested in working on (and collaborating with other researchers) in the realm of intelligent information processing, including (but not exclusively) Knowledge Discovery and Data Mining for National Digital Archives (there are tons of data there waiting to be mined), Artificial Life, game AI, AI and NL for education, etc.