Ramsey Numbers and Decision Trees for Entity Identification
Dr. Sambuddha Roy
IBM India Research Lab, Delhi.
We consider the problem of constructing decision trees for entity identification from a given relational table. The input is a table containing information about a set of entities over a fixed set of attributes. The goal is to construct a decision tree that identifies each entity unambiguously by testing the attribute values such that the average number of tests is minimized. This classical problem finds such diverse applications as efficient fault detection, species identification in biology, and efficient diagnosis in the field of medicine. Prior work mainly deals with the special case where the input table is binary. We study the general problem involving arbitrary input tables and present approximation algorithms and hardness results. We consider a natural greedy algorithm and prove an approximation guarantee of $r_K(1+ ln N)$, where $N$ is the number of entities, $K$ is the maximum number of distinct values of an attribute, and $r_K$ is a suitably defined Ramsey number. In addition, our analysis indicates a possible way of resolving a Ramsey theoretic conjecture by Erd?os. We also show that it is NP-Hard to approximate the general version of the problem within a factor of $(4 + \epsilon)$ and the special case of binary tables within $(2 + \epsilon)$, for any $\epsilon > 0$. The latter result improves the previously best known lowerbound of $(1 + \epsilon)$, for some $\epsilon > 0$.