Data mining

From Wikipedia, the free encyclopedia

Data mining (DM), also called knowledge discovery in databases (KDD) or knowledge discovery and data mining, is the process of automatically searching large volumes of data for patterns, using techniques such as association rules. It is a fairly recent topic in computer science but applies many older computational techniques from statistics, information retrieval, machine learning and pattern recognition.

Example

A simple example of data mining, often called market basket analysis, is its use in retail sales. If a clothing store records its customers' purchases, a data mining system could identify those customers who favour silk shirts over cotton ones.

Another example is that of a supermarket chain that, through analysis of transactions over a long period of time, found that beer and diapers were often bought together. Although explaining this relationship may be difficult, taking advantage of it is easier, for example by placing the high-profit diapers close to the high-profit beers in the store. (This example is questioned at Beer and Nappies -- A Data Mining Urban Legend.)
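
Relationships of the "beer and diapers" kind are usually quantified with association-rule measures. The following is a minimal sketch, not the method of any particular retailer: the transactions are invented for illustration, the function name `pair_statistics` is hypothetical, and the three measures used (support, confidence and lift) are the standard textbook definitions rather than anything stated above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, invented purely for illustration.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"diapers", "milk", "beer"},
    {"bread", "milk", "diapers"},
]


def pair_statistics(transactions):
    """Return {(a, b): (support, confidence of a -> b, lift)} for each item pair."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    # Sorting each basket gives every pair a canonical (alphabetical) order.
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    stats = {}
    for (a, b), count in pair_counts.items():
        support = count / n                       # share of baskets with both items
        confidence = count / item_counts[a]       # share of a-baskets that also hold b
        lift = confidence / (item_counts[b] / n)  # > 1 means bought together more than chance
        stats[(a, b)] = (support, confidence, lift)
    return stats


stats = pair_statistics(transactions)
support, confidence, lift = stats[("beer", "diapers")]
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 only says the two items co-occur more often than independence would predict in this sample; it says nothing about why.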

Use of the term

Data mining has been defined as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" <ref>W. Frawley and G. Piatetsky-Shapiro and C. Matheus, Knowledge Discovery in Databases: An Overview. AI Magazine, Fall 1992, pp. 213-228.</ref> and "the science of extracting useful information from large data sets or databases" <ref>D. Hand, H. Mannila, P. Smyth: Principles of Data Mining. MIT Press, Cambridge, MA, 2001. ISBN 0-262-08290-X</ref>.

It involves sorting through large amounts of data and picking out relevant information.

It is usually used by businesses and other organizations, but is increasingly used in the sciences to extract information from the enormous data sets generated by modern experimentation.

Metadata, or data about a given set of data, are often expressed in a condensed, data-minable format, that is, one that facilitates data mining. Common examples include executive summaries and scientific abstracts.

Although data mining is a relatively new term, the technology is not. Companies have long used powerful computers to sift through volumes of data, such as supermarket scanner data, and produce market research reports. Continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy and usefulness of analysis.

Data mining identifies trends within data that go beyond simple analysis. Through the use of sophisticated algorithms, users have the ability to identify key attributes of business processes and target opportunities.

Related terms

Although the term "data mining" is usually used in relation to analysis of data, it is, like artificial intelligence, an umbrella term with varied meanings in a wide range of contexts. Unlike data analysis, data mining is not based or focused on an existing model that is to be tested or whose parameters are to be optimized.

In statistical analyses where there is no underlying theoretical model, data mining is often approximated via stepwise regression methods, in which the space of 2^k possible relationships between a single outcome variable and k potential explanatory variables is searched heuristically. With the advent of parallel computing, it became possible (when k is less than approximately 40) to examine all 2^k models. This procedure is called all-subsets or exhaustive regression. Some of the first applications of exhaustive regression involved the study of plant data.<ref>A.G. Ivakhnenko, Heuristic Self-Organization in Problems of Engineering Cybernetics, GMDH library, Automatica, 6, 1970, pp.207–219.</ref>
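
To make the 2^k count concrete, the sketch below enumerates every subset of k = 3 explanatory variables and fits each one by ordinary least squares. It is an illustration under stated assumptions, not a reconstruction of any cited method: the data are invented, the helper names (`solve`, `residual_sum_of_squares`, `all_subsets`) are hypothetical, and the fit uses the normal equations solved by plain Gauss-Jordan elimination, which is adequate only for small, well-conditioned examples.

```python
from itertools import combinations


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def residual_sum_of_squares(rows, y, subset):
    """Fit y ~ intercept + the chosen columns by least squares; return the RSS."""
    design = [[1.0] + [row[j] for j in subset] for row in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum(
        (t - sum(c * v for c, v in zip(beta, d))) ** 2 for d, t in zip(design, y)
    )


def all_subsets(rows, y):
    """Evaluate all 2^k subsets of the k explanatory variables."""
    k = len(rows[0])
    return {
        subset: residual_sum_of_squares(rows, y, subset)
        for size in range(k + 1)
        for subset in combinations(range(k), size)
    }


# Invented data: the outcome depends only on the first variable.
rows = [[1.0, 5.0, 2.0], [2.0, 3.0, 7.0], [3.0, 8.0, 1.0],
        [4.0, 1.0, 6.0], [5.0, 9.0, 4.0], [6.0, 2.0, 3.0]]
y = [2.0 * row[0] + 1.0 for row in rows]

results = all_subsets(rows, y)
print(len(results), "models examined")
for subset, rss in sorted(results.items(), key=lambda item: item[1]):
    print(subset, round(rss, 6))
```

Because the residual sum of squares never increases when a variable is added, a real comparison would rank models with a criterion that penalises model size. The doubling of work with each extra variable is what limits the exhaustive approach to modest k.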

Data dredging

Data dredging or data fishing are terms one may use to criticize someone's data mining efforts when it is felt the patterns or causal relationships discovered are unfounded.

Data dredging is the practice of scanning data for any relationships and then, when one is found, coming up with an interesting explanation. The conclusions may be suspect because data sets with large numbers of variables will contain some "interesting" relationships by chance alone. Fred Schwed <ref>Fred Schwed, Jr, Where Are the Customers' Yachts? ISBN 0-471-11979-2 (1940).</ref> said:

"There have always been a considerable number of people who busy themselves examining the last thousand numbers which have appeared on a roulette wheel, in search of some repeating pattern. Sadly enough, they have usually found it."

Nevertheless, determining correlations in investment analysis has proven to be very profitable for statistical arbitrage operations (such as pairs trading strategies), and correlation analysis has been shown to be very useful in risk management. Indeed, finding correlations in the financial markets, when done properly, is not the same as finding false patterns in roulette wheels.

Some exploratory data work is always required in any applied statistical analysis to get a feel for the data, so sometimes the line between good statistical practice and data dredging is less than clear.

Most data mining efforts are focused on developing highly detailed models of some large data set. Other researchers have described an alternative method that involves finding the minimal differences between elements in a data set, with the goal of developing simpler models that represent relevant data.<ref>T. Menzies, Y. Hu, Data Mining For Very Busy People. IEEE Computer, October 2003, pp. 18-25.</ref>

When a data set contains a large number of variables, the significance level should be adjusted for the number of patterns that were tested. For example, if we test 100 random patterns, we expect about one of them to appear "interesting" at the 0.01 significance level by chance alone.
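
The arithmetic behind this example can be sketched as follows. The functions are hypothetical helpers written for this illustration, the tests are assumed to be independent, and the Bonferroni correction shown is one common adjustment, brought in here as an example rather than taken from the text.

```python
def expected_false_positives(num_tests, alpha):
    """Expected number of purely random patterns that pass at level alpha."""
    return num_tests * alpha


def prob_at_least_one_false_positive(num_tests, alpha):
    """Chance that at least one random pattern passes, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests, family_alpha):
    """Per-test level that keeps the overall false-positive rate near family_alpha."""
    return family_alpha / num_tests


num_tests, alpha = 100, 0.01
print(expected_false_positives(num_tests, alpha))
print(prob_at_least_one_false_positive(num_tests, alpha))
print(bonferroni_threshold(num_tests, alpha))
```

With 100 tests at the 0.01 level, the expected count of spurious findings is one, and the chance of seeing at least one is well over half; tightening the per-test level in proportion to the number of tests is the simplest remedy.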

Cross validation is a common approach to evaluating the fitness of a model generated via data mining, where the data is divided into a training subset and a test subset to respectively build and then test the model. Common cross validation techniques include the holdout method, k-fold cross validation, and the leave-one-out method.
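
A minimal sketch of k-fold cross validation follows. The names `k_fold_splits` and `cross_validate` are hypothetical, the data are invented, and the "model" is deliberately trivial (it predicts the mean of the training values) so that the splitting logic stays in view. Setting k equal to the number of samples gives the leave-one-out method.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint, near-equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(values, k):
    """Mean squared error of a mean-only predictor, averaged over k folds."""
    errors = []
    for train, test in k_fold_splits(len(values), k):
        prediction = sum(values[i] for i in train) / len(train)  # "build" step
        fold_error = sum((values[i] - prediction) ** 2 for i in test) / len(test)
        errors.append(fold_error)                                # "test" step
    return sum(errors) / len(errors)


# Invented measurements, for illustration only.
values = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 10.0, 11.0, 12.0, 15.0]
print(cross_validate(values, k=5))
print(cross_validate(values, k=len(values)))  # leave-one-out
```

Every observation is used for testing exactly once and is never part of the training subset for the fold in which it is tested.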

Privacy concerns

There are also privacy concerns associated with data mining, specifically regarding the source of the data analyzed. For example, an employer with access to medical or other personal records could screen out people who have diabetes or have had any legal problems.

Data mining government or commercial data sets for national security or law enforcement purposes has also raised privacy concerns. <ref>K.A. Taipale, Data Mining and Domestic Security: Connecting the Dots to Make Sense of Data, Center for Advanced Studies in Science and Technology Policy. 5 Colum. Sci. & Tech. L. Rev. 2 (December 2003).</ref>

There are many legitimate uses of data mining. For example, a database of prescription drugs taken by a group of people could be used to find combinations of drugs exhibiting harmful interactions. Since any particular combination may occur in only 1 out of 1000 people, a great deal of data would need to be examined to discover such an interaction. A project involving pharmacies could reduce the number of drug reactions and potentially save lives. Unfortunately, there is also a huge potential for abuse of such a database.

Essentially, data mining gives information that would not be available otherwise. It must be properly interpreted to be useful. When the data collected involves individual people, there are many questions concerning privacy, legality, and ethics.

Combinatorial game data mining

Since the early 1990s, oracles (also called tablebases) have become available for certain combinatorial games: for example, 3x3 chess with any beginning configuration, small-board dots-and-boxes, small-board hex, and certain endgames in chess, dots-and-boxes, and hex. Their availability has opened up a new area for data mining: the extraction of human-usable strategies from these oracles. This is pattern recognition at too high a level of abstraction for known statistical pattern recognition algorithms, or any other algorithmic approaches, to be applied; at least, no one knows how to do it yet (as of January 2005). The method used instead is the scientific method in full: extensive experimentation with the tablebases, combined with intensive study of tablebase answers to well-designed problems and with knowledge of prior art (that is, pre-tablebase knowledge), leading to flashes of insight. Berlekamp in dots-and-boxes and other games, and John Nunn in chess endgames, are notable examples of people doing this work, though they were not and are not involved in tablebase generation.

Software

  • Funnelback is a commercial search engine that crawls and indexes unstructured data, including XML, PDF, Word documents, spreadsheets, TRIM databases, SQL databases, text, and HTML. Plugins can be added to read other formats.
  • Essbase has data mining capabilities, including PMML support and a Data Mining Wizard.
  • Java Data Mining.
  • Microsoft Analysis Services is a full suite of data mining algorithms and tools.
  • MicroStrategy allows importing PMML models and applying them to large quantities of data.
  • Neural network software.
  • Oracle Data Mining enables scalable, secure in-database mining, i.e. model creation, management, and scoring of data within the Oracle10g database. The suite consists of a guided-mining GUI tool (Oracle Data Miner), a Java API compliant with Java Data Mining, a native PL/SQL API, and Oracle SQL functions for fast, embedded analytics and scoring.
  • Point Horizon is an integrated data exploration, analysis, visualization and forecasting application with an emphasis on dynamical methods.
  • R is an open-source statistical environment and programming language that is well suited to machine learning and data mining.
  • ROOT, a package originally developed for physics data analysis, can also be used for data mining.
  • Talend Open Studio (www.talend.com) is an ETL tool that uses the Eclipse Rich Client Platform (RCP) for its GUI. The GUI is used to create graphical transformations and mappings, which ultimately generate underlying Perl code. The platform is distributed under the terms of the GPL v2.
  • Teradata contains data mining tools for data exploration, data preprocessing, analytic modelling, scoring and deployment within a database.
  • Vitalnet extracts useful information from large, complex health-related data sets.
  • Weka is freely available open-source data mining software written in Java, featuring numerous clustering, classification, regression, and meta-learning operators.
  • YALE is an integrated, freely available open-source software environment for data exploration, data preprocessing, intelligent data analysis, knowledge discovery, data mining, machine learning, prediction, visualization, etc. It is written in Java, has more than 350 data mining operators, fully integrates Weka, and features a graphical user interface as well as an XML-based scripting language for data mining.
  • XmlMiner is a class library, toolkit and free web service specialising in data, text and structure mining of XML data sources, and in handling semi-structured data. Both the scripting language and the model representation language, Metarule, are XML based; Metarule represents knowledge as a collection of fuzzy logic production rules.

References

<references />

General references

  • Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Introduction to Data Mining (2005), ISBN 0-321-32136-7.
  • Kurt Thearling, An Introduction to Data Mining.
  • Richard O. Duda, Peter E. Hart and David G. Stork, Pattern Classification, Wiley Interscience, ISBN 0-471-05669-3.
  • Phiroz Bhagat, Pattern Recognition in Industry, Elsevier, ISBN 0-08-044538-1.
  • Ian Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations (2000), ISBN 1-55860-552-5.
  • Yike Guo and Robert Grossman, editors, High Performance Data Mining: Scaling Algorithms, Applications and Systems, Kluwer Academic Publishers, 1999.
  • Dean W. Abbott, I. Philip Matkovsky and John Elder IV, An Evaluation of High-end Data Mining Tools for Fraud Detection, a comparative analysis of major high-end data mining software tools, presented at the 1998 IEEE International Conference on Systems, Man, and Cybernetics, San Diego, CA, October 12-14, 1998.
  • Ingo Mierswa, Michael Wurst, Ralf Klinkenberg, Martin Scholz and Timm Euler, YALE: Rapid Prototyping for Complex Data Mining Tasks, in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-06), 2006.
