Data Mining Research - dataminingblog.com: decision tree


Tuesday, November 25, 2008

Readability of Decision Trees

One of the most often cited advantages of decision trees is their readability. Many data miners (myself included) justify the use of this technique because the resulting model is easy to understand (no black box). However, several issues can make decision trees much less readable in practice.

First, there is normalization (or standardization). In most projects, the data have to be normalized before a decision tree is built. As a result, once you plot the tree, the split values are expressed on the normalized scale and are no longer directly meaningful. Of course, you can map the values back to the original scale, but it is an extra step that has to be done.
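As a rough illustration, here is a minimal Python sketch of mapping a split threshold learned on z-score standardized data back to the original units; the variable names and numbers are hypothetical, not taken from the project:

# Minimal sketch: map a split threshold learned on z-score normalized
# data back to the original scale (hypothetical values throughout).

def denormalize_threshold(threshold_norm, mean, std):
    """Convert a threshold from standardized units back to raw units."""
    return threshold_norm * std + mean

# Example: the tree splits on "volume <= 1.37" in normalized units.
mean_volume = 2_500_000   # training-set mean (hypothetical)
std_volume = 800_000      # training-set standard deviation (hypothetical)

raw = denormalize_threshold(1.37, mean_volume, std_volume)
print(f"volume <= {raw:,.0f}")   # the rule becomes readable in original units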

Second is the number of trees. In the project I carry out at work, I can have 100 or more decision trees per month (see this post for more details). It is clearly impossible to read all of these trees, even if each one is understandable on its own. The same happens with random forests: when 1000 trees vote for a given class, how can one understand the process (or rules) that produces the class output?
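To make the scale of the problem concrete, here is a small sketch using scikit-learn (not mentioned in the post, so purely an illustration): a random forest combines the predictions of a thousand individual trees, and inspecting each of those trees is exactly what becomes impractical.

# Illustrative sketch (scikit-learn assumed, not part of the original post):
# a random forest prediction aggregates the outputs of many trees, which is
# why reading the underlying rules quickly becomes impractical.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=1000, random_state=0).fit(X, y)

sample = X[:1]
votes = np.array([tree.predict(sample)[0] for tree in forest.estimators_])
print(f"{int((votes == 1).sum())} of {len(votes)} trees predict class 1")
print("forest prediction:", forest.predict(sample)[0])  # aggregated result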

Decision trees still have a lot of advantages. However, the "readability" advantage must be taken with care. It may be valid in some applications, but can often be a mirage.


Monday, November 10, 2008

Stock Prediction using Decision Tree: Risk Management

This post is the last one of a series on using decision trees for stock prediction. Here are the first, second, third and fourth posts. This final post is dedicated to the risk management part of the system.

At this step, we have as input a list of transactions, each containing the ticker name (e.g. "MSFT US Equity" for Microsoft), the transaction date and the quantity. Since the system is "long only", the quantity is always positive (+1 means buy one share of the stock).

Risk management is informally defined here as a way of managing a portfolio by closing it in certain specific situations. In this system, a simple take-profit/stop-loss at the portfolio level is used. Every day, a number of transactions are carried out, and all transactions of the same day are gathered into one portfolio. At any time, the portfolio return is computed and checked against both the take-profit and the stop-loss limits. If the portfolio return reaches the take-profit (e.g. +5%) or the stop-loss (e.g. -20%), the portfolio is closed.
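A minimal sketch of this exit rule, with hypothetical names and thresholds (the post gives the idea, not the code):

# Minimal sketch of the portfolio-level exit rule described above.
# Function name and default thresholds are illustrative assumptions.

def should_close(portfolio_return, take_profit=0.05, stop_loss=-0.20):
    """Return True if the portfolio must be closed today."""
    if portfolio_return >= take_profit:   # take-profit reached (e.g. +5%)
        return True
    if portfolio_return <= stop_loss:     # stop-loss reached (e.g. -20%)
        return True
    return False

print(should_close(0.062))    # True: take-profit hit
print(should_close(-0.04))    # False: keep the portfolio open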

In addition, each portfolio is subject to a maximum number of days of conservation (a maximum holding period). When this number of days is reached, the portfolio is closed. To find the best values for the take-profit, the stop-loss and the number of days of conservation, a three-dimensional grid search is performed. Note that nothing prevents the system from overfitting at this step of the methodology (unlike the decision tree step, which uses cross-validation). This could of course be improved.
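The grid search itself can be sketched as follows; the backtest function is a dummy placeholder standing in for a replay of the historical portfolios, since the post gives no implementation details:

# Sketch of the three-dimensional grid search over take-profit,
# stop-loss and maximum number of days of conservation.
# backtest() is a dummy placeholder; the real one would replay the
# historical portfolios with the given parameters and return a score.
from itertools import product

def backtest(take_profit, stop_loss, max_days):
    """Placeholder objective standing in for the historical replay."""
    return take_profit + 0.5 * stop_loss - 0.001 * max_days

take_profits = [0.02, 0.05, 0.10]
stop_losses = [-0.10, -0.20, -0.30]
max_days_grid = [5, 10, 20, 40]

best = max(product(take_profits, stop_losses, max_days_grid),
           key=lambda params: backtest(*params))
print("best (take_profit, stop_loss, max_days):", best)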


Wednesday, September 24, 2008

Stock Prediction using Decision Tree

This is the first post in a series on using Decision Tree for Stock Prediction. Here are the second, third, fourth and fifth posts.

I have been applying data mining to finance for a few months now, so I will give you an insight into my main project on stock market prediction. When I started at my company, I saw several projects (so-called "screeners", i.e. sets of stock-picking rules based on technical indicators, with no use of data mining). Most of them make two assumptions:

  • The rules based on technical indicators don't evolve in time
  • Stocks are selected (and sometimes processed) differently according to the sector they belong to (e.g. health and care, industry, etc.)
Since I am not comfortable with these two assumptions, I have started a new project based on the following idea:

Each technical indicator may work for a particular stock and at a certain moment in time

This means that i) rules based on indicators should evolve in time and ii) each stock should be processed independently. Note that the second point doesn't mean that there is no correlation between a particular stock and the sector it belongs to. It only means that stocks may behave differently and should therefore be treated independently. However, information from the sector could still be used in the forecasting process.

When seen as a black box, the system takes information about a specific stock (such as open, high, low, close, volume, etc.) as input and produces a class value as output. The class is defined as follows:

1 if close[j+n] > (x% * close[j]) + close[j]
-1 otherwise


where n is the number of days between the current day and the predicted day, and x is a value chosen to take transaction fees into account (note that a fixed amount could also be used instead of a percentage). The class predictions are thus made for each stock independently. One year of daily data is used for training and the following month for testing. A shifting-window process is used so that the system adapts itself to the current market.
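As a rough sketch (the window lengths of 252 and 21 trading days and the helper names are my own assumptions, not from the post), the labeling rule and the shifting window could look like this:

# Sketch of the labeling rule and the shifting-window split.
# Window lengths and names are assumptions (252 trading days ~ 1 year,
# 21 trading days ~ 1 month); the post does not give concrete code.
import pandas as pd

def label(close: pd.Series, n: int = 5, x: float = 0.02) -> pd.Series:
    """+1 if the close n days ahead exceeds today's close by more than a fraction x, else -1."""
    future = close.shift(-n)                       # close[j+n]
    # Rows with no future price (the last n days) default to -1 here.
    return (future > close * (1 + x)).astype(int) * 2 - 1

def shifting_windows(dates, train_days=252, test_days=21):
    """Yield (train, test) date ranges: one year for training, the next month for testing."""
    start = 0
    while start + train_days + test_days <= len(dates):
        yield (dates[start:start + train_days],
               dates[start + train_days:start + train_days + test_days])
        start += test_days                         # shift by one test period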

Here are the different steps of the overall methodology that makes use of decision trees for stock prediction:

1. Stock filtering
2. Data preprocessing
3. Classification tree
4. Risk management

In the following posts, I will explain each of these steps in detail.


Monday, May 14, 2007

SVM, neural network and decision tree

After reading a post concerning the PAKDD 2007 competition on Abbott's Analytics, I was curious about the search trends for some data mining methods. I decided to play with Google Trends using three common methods: Support Vector Machine (SVM), Artificial Neural Network (ANN) and Decision Tree (DT). The following picture shows the trends in searches on Google for the three terms "svm", "neural network" and "decision tree" since 2004:


Red = "neural network", blue = "svm", orange = "decision tree"

The main observation is that searches for SVM and ANN seem to have become less frequent over the last few years, while searches for DT have remained roughly constant. These are the first conclusions one could draw from this picture. However, it is always dangerous to draw conclusions from a few numbers. In this case, several factors have to be taken into account:
  • The way the search terms are written. For example, SVM could be searched as "support vector machine", "support vector", "svm", etc. However, it seems that "svm" is the most common form. The same remark also applies to neural networks.

  • The diversity of search engines. Although the most popular, Google is not the only search engine on the web. A lot of people may use other engines such as Yahoo!, Live Search or All the Web. Only searches on Google are considered in this picture.

  • The difference between "searching" and "using". In other words, people may search for some methods but finally decide to use another one. Therefore, the fact that a keyword is often searched on Google does not mean that the corresponding method is used.
Consequently, even if this kind of plot looks nice, interpreting the information it gives, and the context in which that information is valid, is not an easy task.
