In today's digital era, information is easy to collect and store. People store text documents, images, audio recordings, scientific observations and measurements, business information, intelligence data, and so on. The hope is that valuable information and knowledge can be extracted from the raw data.
Organizing and searching the data can be a tedious and nontrivial task. One obvious difficulty is the size of the datasets. For instance, some estimates of the number of WWW sites on the Internet exceed two billion; good search engines store information on as many as a hundred million documents, of which some ten million are updated daily. Further, the data can be noisy, incomplete or missing, and the complexities of human languages (polysemy, synonymy) create additional problems. In addition, one sometimes needs advanced concept-based searching that relies on the semantic structure of the dataset.
These issues pose several challenging problems for mathematics. How do we design good mathematical models for various types of data? What algorithms can be used to lower the dimension of the problem and reduce the noise? How do we implement the algorithms on state-of-the-art computer architectures?
We will discuss how mathematical techniques from linear algebra (vector space models and matrix decompositions), discrete mathematics (graph partitioning) and statistics enter the area of data mining and provide valuable tools for managing and searching databases. As data mining is a relatively new area for applied mathematics, the best methods are yet to be discovered. The applications are important and very attractive.
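As a rough illustration of how a vector space model and a matrix decomposition can work together, the following Python sketch builds a small term-document matrix and scores a query against the documents before and after a rank-2 truncated SVD (the idea behind latent semantic indexing). The corpus, the helper names, the raw term-count weighting and the choice k = 2 are all assumptions made for this example; none of them comes from the text above.

```python
import numpy as np

# Hypothetical toy corpus, invented for illustration only.
docs = [
    "car engine repair",
    "automobile engine repair",
    "car automobile dealer",
    "apple banana fruit",
    "banana fruit salad",
]

# Vocabulary and a term -> row index map.
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}


def to_vector(text):
    """Map a text to a raw term-count vector (assumed weighting scheme)."""
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in index:
            v[index[w]] += 1.0
    return v


def cosine(a, b):
    """Cosine similarity, defined as 0 when either vector is zero."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 0.0 if na == 0 or nb == 0 else float(a @ b / (na * nb))


# Vector space model: one column per document, one row per term.
A = np.column_stack([to_vector(d) for d in docs])

# Matrix decomposition: keep only the k leading singular directions.
k = 2  # assumed target dimension
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk = U[:, :k]

# Project documents and the query into the k-dimensional subspace.
doc_coords = (Uk.T @ A).T          # row j holds document j
q = to_vector("car")
q_coords = Uk.T @ q

raw_scores = [cosine(q, A[:, j]) for j in range(len(docs))]
lsi_scores = [cosine(q_coords, doc_coords[j]) for j in range(len(docs))]

for d, r, l in zip(docs, raw_scores, lsi_scores):
    print(f"{d!r:28} raw={r:.2f} reduced={l:.2f}")
```

In this toy corpus the query "car" shares no term with the second document, so the raw cosine score is zero, whereas in the reduced space the two are scored as similar because "car" and "automobile" co-occur with the same terms. This is one way a decomposition can soften the synonymy problem; realistic collections would need sparse matrices and iterative solvers, not a dense SVD.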