SVM, neural network and decision tree
After reading a post concerning the PAKDD 2007 competition on Abbott's Analytics, I was curious about the trends of some data mining methods. I decided to play with Google Trends using three common methods: Support Vector Machine (SVM), Artificial Neural Network (ANN) and Decision Tree (DT). The following picture shows the trends in search on Google for the three terms "svm", "neural network" and "decision tree" since 2004:
The main observation is that SVM and ANN seem to have become less popular in recent years. It is interesting to see that interest in DT has remained constant over the years. These are the first conclusions we could draw from this picture. However, it is always dangerous to draw conclusions from a few numbers. In the above case, several factors have to be taken into account before drawing such conclusions:
- The way of writing the searched terms. For example, SVM could be found under "support vector machine", "support vector", "svm", etc. However, it seems that "svm" is most often used. The same remark applies to neural networks.
- The diversity of search engines. Although it is the most popular, Google is not the only search engine on the web. A lot of people may use other engines such as Yahoo!, Live Search or All the Web. Only searches on Google are considered in this picture.
- The difference between "searching" and "using". In other words, people may search for some methods but finally decide to use another one. Therefore, the fact that a keyword is often searched on Google does not mean that the corresponding method is used.
7 comments:
I'd like to elaborate on the following, important point made in this post:
For example, SVM could be found under "support vector machine", "support vector", "svm", etc. However, it seems that "svm" is most often used. The same remark for neural networks is also valid.
False positives as well as false negatives will arise in such analysis. "SVM" may refer to other things, such as the exchange symbols for Servicemaster Company or SilverCorp Metals Inc.
Tree-induction is a difficult concept to search for on-line, as "decision tree" refers to many methods which use a tree structure, but do not learn anything. Even names of individual tree-induction algorithms can be tricky (such as ID3 or CART).
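The false positive / false negative point can be made concrete with a minimal sketch. The queries and their relevance labels below are invented for illustration, and plain substring matching is only an assumption about how a keyword might be matched; it is not a description of how Google Trends works.

```python
def matches(query, keywords):
    """True if any keyword occurs in the query (case-insensitive substring match)."""
    q = query.lower()
    return any(k in q for k in keywords)


def count_errors(labelled_queries, keywords):
    """Return (false_positives, false_negatives) for a keyword list.

    labelled_queries: list of (query, is_about_the_method) pairs.
    """
    fp = sum(1 for q, relevant in labelled_queries
             if matches(q, keywords) and not relevant)
    fn = sum(1 for q, relevant in labelled_queries
             if not matches(q, keywords) and relevant)
    return fp, fn


# Hypothetical queries, labelled by hand: True means "about the SVM method".
queries = [
    ("svm tutorial", True),
    ("support vector machine kernel", True),
    ("svm stock price", False),          # ticker symbol, not the method
    ("decision tree template", False),
]

print(count_errors(queries, ["svm"]))
print(count_errors(queries, ["svm", "support vector"]))
```

With these made-up queries, adding "support vector" to the keyword list removes the false negative, but the ticker-symbol query is still counted as a false positive.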
Very interesting analysis. I'm not sure what to think of the trends. While I agree with Will that the FPs and FNs are a problem, it also isn't clear that these do more than add a bit of noise. Trees are the most problematic of the three in my opinion, as I often see CART or C5 instead of "decision trees", and frankly see just the term "trees" as much or more than "decision trees".
Nevertheless, the SVM/NN results in particular are quite interesting.
So Sandro, I wonder if there is a way to repeat this with "ensembles" as the search word (or "model ensembles" or "bagging" or "boosting", etc.)
Very interesting thread. I think we have overlooked the fact that SVM is often regarded as a type of ANN. Check out the comment:
"Neural networks can be used to solve classification problems, typically through Multi-Layer Perceptron (MLP) and Support Vector Machines (SVM) type networks. " at
http://www.nd.com/apps/trading.html
I also wanted to point out that, recognizing the importance of data mining in the world of Business Intelligence, database vendors like Oracle have provided SVM and Decision Tree algorithms in the database itself, as the Oracle Data Mining (ODM) option in the Oracle RDBMS. The idea is to apply data mining right next to where the data resides, rather than having to pull the data out of its natural store (whether that is a database, a flat file or Excel), as with tools like Weka.
Apart from the use of ANN, SVM and Decision Tree, I am often surprised that Naive Bayes is so popular for classification as well; practically speaking, the algorithm works in spite of the rigid requirement of independence of attributes. One of the forums that often discusses the data mining features of the database is http://OracleBIWA.org
Thanks
...I am often surprised that Naive Bayes is so popular as well for classification and practically speaking the algorithm works in spite of the rigid requirement of independence of attributes.
I would say rather that Naive Bayes assumes, rather than requires, independence among the attributes. As you say, Naive Bayes often works well, despite the violation of this assumption. Additionally, Naive Bayes: 1. frequently handles very large numbers of attributes well, 2. capably deals with missing attributes, and 3. can be updated with new exemplars without starting from scratch.
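Points 2 and 3 follow from the fact that a categorical Naive Bayes model is just a table of counts. The class below is a minimal sketch, not any particular library's implementation: the name `CountingNaiveBayes`, the use of `None` for a missing attribute, and the Laplace smoothing constant are all assumptions made for illustration.

```python
import math
from collections import defaultdict


class CountingNaiveBayes:
    """Categorical Naive Bayes that stores only counts (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (an assumption)
        self.class_counts = defaultdict(int)
        # value_counts[(label, attribute_index)][value] -> count
        self.value_counts = defaultdict(lambda: defaultdict(int))
        # distinct values seen per attribute, used for smoothing
        self.values_seen = defaultdict(set)

    def update(self, attributes, label):
        """Add one exemplar; earlier exemplars are never revisited."""
        self.class_counts[label] += 1
        for i, value in enumerate(attributes):
            if value is None:  # missing attribute: contributes nothing
                continue
            self.value_counts[(label, i)][value] += 1
            self.values_seen[i].add(value)

    def predict(self, attributes):
        """Return the label with the highest log posterior."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            score = math.log(n_label / total)
            for i, value in enumerate(attributes):
                if value is None:  # skip missing attributes at prediction time
                    continue
                counts = self.value_counts[(label, i)]
                n_observed = sum(counts.values())
                n_values = len(self.values_seen[i] | {value})
                score += math.log((counts.get(value, 0) + self.alpha)
                                  / (n_observed + self.alpha * n_values))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up exemplars: (outlook, temperature) -> play?
model = CountingNaiveBayes()
model.update(("sunny", "hot"), "no")
model.update(("sunny", None), "no")      # temperature is missing
model.update(("rainy", "mild"), "yes")
model.update(("rainy", "hot"), "yes")
print(model.predict(("sunny", None)))    # prediction with a missing attribute
model.update(("sunny", "mild"), "yes")   # incremental update, no retraining
```

A missing attribute is simply left out of the product of probabilities, and a new exemplar only increments a few counters, which is why neither case requires starting from scratch.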
Thanks for your relevant comments to this post.
Will: I agree and I have now found how to use the OR for combining keywords with Google Trends.
Dean: Good idea, I will check that soon.
Shyam: Naive Bayes is definitely a learning technique to consider. I can incorporate it for a more extensive Google Trends analysis.
Had not used Google Trends before so was nice to see your graphs. Was also interested to see comments on synonyms etc. However before getting too drawn into the particulars of the methods/trends shown I have a more fundamental question about Google Trends itself...
I tried a few examples and was not totally convinced of the outcomes. So I went back to a very simple single term trend for a domain I happen to know fairly well - epidemiology. (You can see the output I got by entering - http://www.google.com/trends?q=epidemiology&ctab=0&geo=all&date=all&sort=0).
Now, as I have no way of knowing exactly what the y-axis represents I have to take it as reasonable that the subject is now drawing only about half as much interest (as many searches) as in Q3/Q4 of 2005 - though I can think of no rational explanation for this? (Are these trend lines based on absolute numbers or are they 'discounted' against some overall volume data?)
Much more worryingly and what caused me to write this note and question fundamentally what Google Trends is actually picking up was the 'Region' breakdown shown. I find it VERY difficult to believe that over a long timeframe (3 years) and using a pretty 'mainstream' term, that Kenya and Nigeria should show around 4 and 2 times as many searches as any other country. Also looking at the languages I see that English is in 5th place - again puzzling (though I have no idea what "Tagalog" is?! Also I realise that the language and region results are interlinked and so the anomaly in the one may largely explain the other.)
In any case there is something very odd going on with the snap-shots Google is using for these trends and whatever that bias is must be understood before you read too much into the results?
Crawford.
(PS - I just entered "Neural Network" and see that Sri Lanka and Iran head the regional tables with none of the larger OECD countries in the top 10 - does that not seem to indicate some strange bias?)
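On the question of absolute numbers versus 'discounted' numbers: the short sketch below assumes, purely for illustration, that the y-axis is the term's share of all searches, rescaled so the peak equals 100. The comment leaves open what the axis actually represents, and all counts here are invented. Under that assumption, a curve can halve even while the absolute number of searches grows.

```python
def relative_index(term_counts, total_counts):
    """Share of all searches per period, rescaled so the peak share is 100."""
    shares = [t / total for t, total in zip(term_counts, total_counts)]
    peak = max(shares)
    return [round(100 * s / peak) for s in shares]


# Hypothetical yearly figures: searches for the term, and all searches.
term_counts = [1000, 1200, 1500]          # absolute interest grows by 50%
total_counts = [100000, 200000, 300000]   # overall volume triples

print(relative_index(term_counts, total_counts))
```

The same effect could contribute to the regional tables: a small country with few searches overall can rank first on share even if its absolute count is tiny.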
I think that proper variable selection and careful model fitting are more important than choosing the best model.
Most interesting recent developments are in this area (proliferation of shrinkage methods, transforms like MARS, validation using boosting, etc.)