
Data Mining Research - dataminingblog.com: data mining stock market

Showing posts with label data mining stock market. Show all posts

Tuesday, January 06, 2009

Data Mining Blog: Neural Market Trends

If you're interested in data mining and the financial markets, this blog is definitely for you. In Neural Market Trends, Thomas Ott writes about how he applies data mining to the financial markets. His blog emphasizes the use of the RapidMiner tool, and he describes himself as a RapidMiner evangelist. Here is an excerpt from his blog description:

Welcome! I use Rapidminer to create models of the financial markets. I share tutorials and videos on how to use Rapidminer to build your own. Want to learn more? Just click on my tutorial page for more information. Thanks for stopping by!

Feel free to visit this blog if you're interested in financial applications of data mining. The big list of data mining blogs has been updated, as well as its OPML version.



Friday, December 12, 2008

Stock Picking using Data Mining: Parameter Tuning

It is well known that in data mining projects, one can spend 80% of the time on data preprocessing and only the remaining 20% on the data mining task itself. However, when data mining is integrated into an overall system (such as a stock picking system), another important task is tuning the parameters of the overall system.

For example, the above-mentioned system has several parameters that must be set in order to obtain satisfying results. Here is a list of these parameters:

  • Number of stocks to analyze (depends on the computational resources)
  • Number of stocks to select as the best ones (fixed number or with a threshold on the validation accuracy and the minimum number of trades)
  • Short or long term prediction (predict increase/decrease of given stocks in X days)
  • Confusion matrix for the classifier (how to penalize the errors of the classifier)
  • Size of the shifting window (i.e. size of the training/validation set)
These parameters vary from project to project. For example, you can have a look at the parameters mentioned in a post by Themos Kalafatis. Feel free to comment and give examples of parameters that you have had to tune.
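A tuning loop over such parameters can be sketched as a plain grid search. The parameter names, candidate values and the `backtest_score` function below are illustrative placeholders, not the actual system:

```python
from itertools import product

# Hypothetical backtest: returns a score for one parameter combination.
# In a real system this would run the whole stock-picking pipeline.
def backtest_score(n_stocks, n_selected, horizon_days, window_size):
    # Toy stand-in: favour a mid-sized selection and a 10-day horizon.
    return -abs(n_selected - 5) - abs(horizon_days - 10) / 10

grid = {
    "n_stocks": [50, 100],        # number of stocks to analyze
    "n_selected": [3, 5, 10],     # number of "best" stocks to keep
    "horizon_days": [5, 10, 20],  # predict increase/decrease in X days
    "window_size": [252, 504],    # size of the shifting window (days)
}

best_params, best_score = None, float("-inf")
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    score = backtest_score(**params)
    if score > best_score:
        best_params, best_score = params, score

print(best_params)
```

The same loop works for any scoring function, which is exactly why these system-level parameters are so easy (and tempting) to over-tune.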


Monday, November 10, 2008

Stock Prediction using Decision Tree: Risk Management

This post is the last in a series on using decision trees for stock prediction. Here are the first, second, third and fourth posts. This final post is dedicated to the risk management part of the system.

At this step, we have as input a list of transactions, each containing the ticker name (e.g. "MSFT US Equity" for Microsoft), the date of the transaction and the quantity. As the system is "long only", the quantity is always positive (+1 means buy one share).

Risk management is informally defined here as a way of managing a portfolio by closing it in specific situations. In this system, a simple take-profit/stop-loss rule at the portfolio level is used. Every day, a number of transactions are carried out, and all transactions of the same day are gathered into one portfolio. At any time, the portfolio return is computed and checked against both the take-profit and stop-loss limits. If the portfolio return reaches the take-profit (e.g. +5%) or the stop-loss (e.g. -20%), the portfolio is closed.

In addition, each portfolio is subject to a maximum holding period: when this number of days is reached, the portfolio is closed. To find the best values for the take-profit, stop-loss and maximum holding period, a three-dimensional grid search is performed. Note that, unlike the decision tree step, which uses cross-validation, nothing prevents the system from overfitting at this stage of the methodology. This could of course be improved.
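The closing rule described above can be sketched as a single daily check. The default take-profit, stop-loss and maximum holding period below are example values (the post tunes them by grid search):

```python
def should_close(portfolio_return, days_held, take_profit=0.05,
                 stop_loss=-0.20, max_days=30):
    """Daily portfolio check: close on take-profit, stop-loss, or when
    the maximum holding period has elapsed; otherwise keep it open."""
    if portfolio_return >= take_profit:
        return "take-profit"
    if portfolio_return <= stop_loss:
        return "stop-loss"
    if days_held >= max_days:
        return "max-days"
    return None  # portfolio stays open

print(should_close(0.06, 3))    # take-profit hit
print(should_close(-0.25, 3))   # stop-loss hit
print(should_close(0.01, 30))   # holding period elapsed
```

The three keyword arguments are exactly the three axes of the grid search mentioned above.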


Thursday, November 06, 2008

Artificial Intelligence Applied to Stock Picking

The Herald Tribune has an interesting article by Charles Duhigg about using artificial intelligence in the stock market. Among others, the author interviewed Ray Kurzweil, an artificial intelligence pioneer who also runs a hedge fund.

In this article, as is often the case when applying AI to finance, the two buzzwords are "neural networks" and "genetic algorithms". Nothing about decision trees or support vector machines, for example. Maybe SVMs are too recent to have been applied in finance yet. But what about decision trees? Not trendy enough?

If you read the article, you will notice that the author highlights one important drawback of such techniques: their "black box" nature. This is true, but another drawback is certainly more important: overfitting. Technical analysts may manually overfit their data when predicting future trends, but it is much easier to overfit the data when using data mining techniques.

Link to the Herald Tribune article


Monday, October 27, 2008

Stock Prediction using Decision Tree: Classification Tree

This is the fourth post in a series on using decision trees for stock prediction. For more information, feel free to read post 1, post 2 and post 3 of the series.

Once the data have been preprocessed, we obtain a matrix in which each row is a different day (since we work with daily data) and each column is one of the possible variables (close, volume, technical indicators, combinations of some indicators, etc.). The reason I started with decision trees instead of the more "trendy" neural networks or support vector machines is that I prefer to begin with simple methods and move to more complex ones only if necessary.

One big advantage of decision trees is that the model can be understood by inspection (i.e. by reading the tree). It is very useful to understand why, on a given day, MSFT (the ticker for Microsoft) was predicted to increase or decrease. However, this readability is only applicable as a pre-study in the project. Indeed, since the project makes one prediction per day (over the whole backtesting period) for each selected stock, there are too many different models for a human being to inspect them all.
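As a toy illustration of this readability, a small tree over two hypothetical indicators can be rendered as if-then rules. The features, thresholds and classes below are invented for the example, not taken from the actual system:

```python
# A toy decision tree as nested dicts: each internal node tests one
# indicator against a threshold; leaves are the predicted class.
tree = {
    "feature": "RSI", "threshold": 70,
    "left":  {"feature": "ROC", "threshold": 0,   # RSI <= 70
              "left": -1, "right": +1},
    "right": -1,                                  # RSI > 70: overbought
}

def predict(node, sample):
    """Walk the tree until a leaf (-1 or +1) is reached."""
    while isinstance(node, dict):
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node

def rules(node, path=""):
    """Render the tree as human-readable if-then rules."""
    if not isinstance(node, dict):
        return [f"{path} => {'+1' if node > 0 else '-1'}"]
    f, t = node["feature"], node["threshold"]
    return (rules(node["left"], f"{path} {f}<={t}") +
            rules(node["right"], f"{path} {f}>{t}"))

print("\n".join(rules(tree)))
print(predict(tree, {"RSI": 55, "ROC": 2.0}))  # -> 1 (predicted increase)
```

A single tree like this is easy to read; thousands of them, one per day and per stock, are not, which is exactly the point made above.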

The high number of models is due to the following nested processes, all of which have to be carried out:

For each year to backtest
  For each open day in the year
    For each stock that has been selected
      For each hyper-parameter value of the tree
        For each fold of the cross-validation
          Build a decision tree and evaluate it


If we assume that building a decision tree takes 1 second, then for a backtest on 100 stocks from 2001 to 2008 we need:

8 * 252 * 100 * (10*10) * 10 = 201'600'000 seconds

This means more than 6 years of CPU time (around 1.6 years on a 4-CPU computer). At this stage, there are mainly two possibilities:

  • Grid computing
  • Computing the trees each month instead of each day
By applying these two ideas, it is possible to bring the processing time down to around 3 hours of calculation (on a grid of 6 computers with 4 CPUs each). The next post in the series will discuss the risk management of the system.
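The back-of-the-envelope calculation above can be checked directly; the factors are the ones from the formula (8 years, 252 trading days, 100 stocks, a 10 x 10 hyper-parameter grid, 10 cross-validation folds, 1 second per tree):

```python
years, trading_days, stocks = 8, 252, 100
hyperparam_grid, cv_folds = 10 * 10, 10
seconds_per_tree = 1

total_seconds = (years * trading_days * stocks
                 * hyperparam_grid * cv_folds * seconds_per_tree)
print(total_seconds)                                # 201600000
print(round(total_seconds / (365 * 24 * 3600), 1))  # 6.4 years of CPU time
```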


Monday, October 13, 2008

Decision Tree for Stock Prediction: Data Preprocessing

This post is part of a series on decision trees for stock prediction. For more details, feel free to read part 1 and part 2 of the series.

Once the stocks have been filtered, a list of stocks is available for every month of the shifting-window system. Then two steps need to be undertaken: calculating the technical indicators and standardizing the data. First, the data are separated into training and test sets: one year of data is used for training and one month for testing. For example:

Training set: August 31st, 2004 -> August 31st, 2005
Test set: September 1st, 2005 -> September 30th, 2005


In fact, the end date of August 31st, 2005 is not exact: since we predict n days in advance, we need to remove the last n days from the training set.
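A sketch of this split, assuming one year of training data, one month of test data and an n-day gap removed from the end of the training period (the function name and the calendar-day windows are illustrative; the real system works on trading days):

```python
from datetime import date, timedelta

def shifting_window(end_of_training, horizon_days, train_days=365, test_days=30):
    """Build one training/test split of the shifting window.

    The last `horizon_days` are dropped from the training period because
    the class label (up/down in n days) is not yet known for those days.
    """
    train_start = end_of_training - timedelta(days=train_days)
    train_end = end_of_training - timedelta(days=horizon_days)
    test_start = end_of_training + timedelta(days=1)
    test_end = test_start + timedelta(days=test_days - 1)
    return (train_start, train_end), (test_start, test_end)

train, test = shifting_window(date(2005, 8, 31), horizon_days=5)
print(train)  # training period, last 5 days removed
print(test)   # the following month
```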

Then, the technical indicators can be calculated. For most of them, we need data older than August 31st, 2004; an example is the calculation of a 20-day simple moving average (SMA). Below is a non-exhaustive list of the basic and technical indicators used in the system:

  • Close price
  • Volume
  • Simple Moving Average (SMA)
  • Relative Strength Index (RSI)
  • Momentum
  • Rate Of Change (ROC)
  • On Balance Volume (OBV)
  • Etc.
In addition to these indicators, combinations of them are used. We thus obtain a matrix for both the training and the test data, where each row is a day and each column is one of the indicators. For the training matrix, an additional column containing the output class (-1 or +1) is added.
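A few of the listed indicators are easy to sketch in plain Python, assuming simple lists of daily closes and volumes (real implementations would of course handle more history and edge cases):

```python
def sma(close, window=20):
    """Simple Moving Average; None until enough history is available."""
    return [None if i + 1 < window
            else sum(close[i + 1 - window:i + 1]) / window
            for i in range(len(close))]

def roc(close, n=10):
    """Rate Of Change over n days, in percent."""
    return [None if i < n
            else (close[i] - close[i - n]) / close[i - n] * 100
            for i in range(len(close))]

def obv(close, volume):
    """On Balance Volume: cumulative volume signed by the price move."""
    out = [0]
    for i in range(1, len(close)):
        if close[i] > close[i - 1]:
            step = volume[i]
        elif close[i] < close[i - 1]:
            step = -volume[i]
        else:
            step = 0
        out.append(out[-1] + step)
    return out

close = [10, 11, 12, 11, 13]
volume = [100, 150, 120, 80, 200]
print(sma(close, window=2))  # [None, 10.5, 11.5, 11.5, 12.0]
print(obv(close, volume))    # [0, 150, 270, 190, 390]
```

Note the leading `None` values: this is why indicators need data older than the start of the training period.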

Once this is done, the data are standardized to zero mean and unit standard deviation. This allows the decision tree to properly handle variables that are not in the same unit (e.g. close prices and volumes). The next post will discuss the main part of the system: the classification tree.
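A minimal sketch of the standardization step, computing the mean and standard deviation on the training data only and applying them to both sets (so no information from the test period leaks into the model):

```python
from statistics import mean, stdev

def standardize(train_column, test_column):
    """Zero mean / unit standard deviation, using training statistics only."""
    m, s = mean(train_column), stdev(train_column)
    scale = lambda xs: [(x - m) / s for x in xs]
    return scale(train_column), scale(test_column)

train_vol = [100.0, 150.0, 120.0, 80.0, 200.0]
test_vol = [130.0, 90.0]
train_z, test_z = standardize(train_vol, test_vol)
print(round(mean(train_z), 10))  # approximately 0.0
```

In practice this is done column by column on the indicator matrix described above.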
