Compression for Similarity Queries

Dr. Amir Ingber
Postdoctoal Research Fellow, Stanford University
Given on: Dec. 5th, 2013

Abstract

Traditionally, data compression deals with concisely representing data, e.g., a sequence of letters, for the purpose of eventual reproduction (either exact or approximate). In this work, we are interested in compression when the goal is to identify whether the original sequence is similar to a given query sequence. This problem arises in an increasing variety of fields ranging from forensics to genomics to internet search ranking.

We study the fundamental trade-off between compression rate, sequence length, and reliability of the answers to the queries. We focus on the practically motivated case where false negatives in the similarity identification are not allowed, and wish to minimize the probability of false positives. We characterize the minimal compression rate that allows reliable query answers (i.e., a vanishing probability of false positives) for the case of source and query sequences drawn from an i.i.d. distribution. We also characterize the best achievable exponential rate of decay of the false positive probability when working at rates above the minimum required for reliability.

Finally, guided by the above findings, we describe the construction of practical compressors for queries, some based on off-the-shelf lossy compression algorithms. We exhibit results of experimentation with such compression schemes on both simulated and real data. For the simulated data, the performance of these schemes is seen to approach the theoretical limits. When tested on a database of real DNA sequences, these schemes achieve significant compression while allowing highly reliable query answers.

Joint work with Thomas Courtade, Idoia Ochoa and Tsachy Weissman.

Biography

Amir Ingber is a postdoctoral research fellow at the Information Systems Lab at the EE Dept., Stanford University. His research interests include information theory, data compression and its applications in similarity search in databases. He received a B.Sc. in electrical engineering and computer science (summa cum laude), an M.Sc. in electrical engineering (summa cum laude), and a Ph.D. in electrical engineering, all from Tel-Aviv University, Israel, in 2005, 2006 and 2011 respectively.

Between 2002–2011 he served as a DSP and Algorithms Engineer and consultant at several companies, including Intel Corporation, Comsys Communication and Signal Processing, and Amimon Ltd.

Dr. Ingber was a recipient of the Best Student Paper Award at the 2011 IEEE International Symposium on Information Theory (ISIT), St. Petersburg, Russia. Among his other recent awards are the Rothschild Fellowship for postdoctoral studies (2011-), the Intel award for excellence in academic studies and research (2011) and the Adams Fellowship (2008–2011) awarded by the Israel Academy of Sciences and Humanities.