What are the differences between Data/Text Mining and Statistics?
• Statistical analysis is designed to deal with structured data in order to
solve structured problems:
– Results are software and researcher independent
– Inference reflects statistical hypothesis testing
• Data mining is designed to deal with structured data in order to solve
unstructured business problems
– Results are software and researcher dependent (absence of
implementation standards)
– Inference reflects computational properties of the data mining
algorithm at hand
• Text mining is designed to deal with unstructured data in order to solve
unstructured problems
– Results are software and researcher dependent
– Inference reflects computational properties and visualization
capabilities of the text mining algorithm at hand
When is data mining technology appropriate?
• Data mining technology is appropriate if:
– The business problem is unstructured
– Accurate prediction is more important than explanation
– The data include a mixture of interval, nominal, ordinal, count, and text variables, and the non-numeric variables are both numerous and important
– Many of those variables are irrelevant or redundant attributes
– The relationships among variables may be nonlinear, with nonlinearities that cannot be characterized in advance
– The data are highly heterogeneous with a large percentage of outliers, leverage points, and missing values
– The sample size is relatively large
• Important marketing and sales studies/projects have most of
these features
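Several of the criteria above (mixed variable types, missing values, outliers, sample size) can be checked directly on a data set before choosing a method. The following is a minimal sketch, not a standard procedure: the function name `profile`, the sample records, and the use of the 1.5 × IQR rule to flag outliers are all illustrative assumptions that do not come from the slides.

```python
from statistics import quantiles


def profile(rows, missing=(None, "")):
    """Summarize a list of dict records against a few of the criteria above.

    Hypothetical helper: the outlier rule (1.5 x IQR) is an assumption.
    """
    columns = sorted({key for row in rows for key in row})
    missing_cells = 0
    non_numeric = []
    outlier_share = {}

    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v not in missing]
        missing_cells += len(values) - len(present)

        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        if len(numeric) < len(present):
            # At least one present value is not a number: nominal/text column.
            non_numeric.append(col)
            continue

        if len(numeric) >= 4:
            # Quartiles with linear interpolation over the observed range.
            q1, _, q3 = quantiles(numeric, n=4, method="inclusive")
            spread = q3 - q1
            low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
            flagged = sum(1 for v in numeric if v < low or v > high)
            outlier_share[col] = flagged / len(numeric)

    total_cells = len(rows) * len(columns)
    return {
        "sample_size": len(rows),
        "non_numeric_columns": non_numeric,
        "missing_share": missing_cells / total_cells if total_cells else 0.0,
        "outlier_share": outlier_share,
    }


# Made-up records, for illustration only.
rows = [
    {"age": 31, "segment": "A", "spend": 120.0},
    {"age": 35, "segment": "B", "spend": 130.0},
    {"age": 29, "segment": None, "spend": 110.0},
    {"age": 33, "segment": "A", "spend": 125.0},
    {"age": 30, "segment": "C", "spend": 5000.0},
    {"age": 34, "segment": "B", "spend": 115.0},
]
summary = profile(rows)
print(summary)
```

A high missing or outlier share together with several non-numeric columns would point toward the data mining setting described above. The remaining criteria (an unstructured business problem, prediction valued over explanation) are judgment calls that no such summary can settle.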

