Data Mining Research


Wednesday, December 17, 2008

Data Mining Research Interview: Thomas A. Rathburn

It is my pleasure to welcome to Data Mining Research Thomas A. "Tony" Rathburn, a senior consultant at The Modeling Agency. I have recently read two of his articles about data mining. One of them, my favorite, is Data Mining the Financial Markets. He kindly agreed to answer four questions from Data Mining Research.

Data Mining Research: Who are you and how did you enter the data mining field?

Thomas Rathburn: I did my PhD work in Management Information Systems in the 1980s and taught Computer Science and Statistics at Kent State for seven years. Most of my research was related to Artificial Intelligence. At the time, I also was doing some preliminary applied consulting work in the area... primarily with banks and insurance. In the early 90s I took a position as Director of Training & Consulting with NeuralWare, in Pittsburgh, where I expanded into Finance and Marketing. I left that position a couple of years later to trade the 30-year Treasury Bond, and the corresponding futures and options contracts, on the Chicago Board of Trade with Lakeshore Trading. That was followed by a time with Hull Trading doing similar work with the S&P 500. I've been engaged in a general consulting practice modeling business applications of human behavior since. I currently teach for The Modeling Agency, Unica Software, SPSS, Group 1 Software and The Data Warehousing Institute (TDWI), as well as doing direct consulting work for a number of clients and subcontract consulting for Capgemini and AT Kearney.

DMR: If you could give only one piece of advice to someone starting a new data mining project, what would it be?

TR: Understand the differences between data mining and traditional statistical analysis. Stats is primarily concerned with measures of central tendency and conducts its modeling from that point of departure. While data mining uses similar techniques, the conceptualization of the problem is different. In data mining, I am concerned with sub-groups that display a behavior of interest at a rate different from the mean. In developing models that consistently and reliably identify these sub-groups, I am able to adapt my resource allocation strategies in a way that enhances performance.

First, and foremost, make sure your project definition and performance metrics are clearly and completely stated at the inception of the project. Absolutely everything that follows should be done to enhance performance as stated in your project definition.

Understand the differences between human behavior modeling and physical systems modeling.

Understand Low-Risk/High ROI project design and incremental development.

Understand that you don't have to know everything to enhance performance.

Understand that performance enhancement comes from enhanced project conceptualization and efficient utilization of data. Advanced mathematical techniques have minimal impact if you get those two things right.

DMR: Can you give examples of common pitfalls you encountered during your data mining projects?

TR: The single biggest issue is not understanding what data mining is... it's analysis... to enhance performance... your specific metrics of success. It is not weird math to develop a magical solution.

The second issue is not appropriately completing project definition.

The third is not understanding how to extract the required information content from your data.

The fourth is learning algorithms and analytics rather than focusing on the reality of what you are trying to actually achieve and the goals in measuring that reality.

DMR: What is "The Modeling Agency" and what do you provide to your clients?

TR: The Modeling Agency is a group of senior level consultants, coordinated by Eric King, to provide training and consulting services to clients on technology projects. The best description is available through our website, or with direct conversations with Eric. His contact information is available on the website.

DMR: Thanks a lot for your answers.

For more information, you can visit The Modeling Agency website.


Monday, October 27, 2008

Stock Prediction using Decision Tree: Classification Tree

This is the fourth post in a series on using Decision Tree for Stock Prediction. For more information, feel free to read post 1, post 2 and post 3 of the series.

Once the data have been preprocessed, we obtain a matrix in which each row is a different day (since we work with daily data) and each column is one of the possible variables (close, volume, technical indicators, combinations of some indicators, etc.). I started with decision trees instead of more "trendy" neural networks or support vector machines because I prefer to begin with simple methods and then, if necessary, move to more complex ones.

One big advantage of decision trees is that one can understand the model by looking at the tree. It is very useful to understand why, on a given day, MSFT (the ticker symbol for Microsoft) has been predicted to increase or decrease. However, this readability only helps during the preliminary study of the project. Since the project makes one prediction a day for each selected stock over the whole backtesting period, there are far too many models for a human being to inspect them all.

The high number of models comes from the nested loops that have to be run:

For each year to backtest
  For each open day in the year
    For each stock that has been selected
      For each hyper-parameter value of the tree
        For each fold of the cross-validation
          Build a decision tree and evaluate it

If we consider that building a decision tree takes 1 second, then, for a backtest on 100 stocks from 2001 to 2008, we need:

8 * 252 * 100 * (10*10) * 10 = 201'600'000 seconds
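As a minimal sketch, the same estimate can be reproduced in a few lines of Python. The variable names are illustrative only, and reading (10*10) as a grid of two hyper-parameters with 10 values each is an assumption; the 1-second build time is the working assumption stated above, not a measurement.

```python
# Back-of-the-envelope count of the trees built during the backtest.
# The figures are those of the estimate above; names are illustrative.
years = 8                     # 2001 to 2008 inclusive
open_days_per_year = 252      # open (trading) days per year
stocks = 100
hyper_param_values = 10 * 10  # assumed: two hyper-parameters, 10 values each
cv_folds = 10
seconds_per_tree = 1          # working assumption, not a measurement

trees = years * open_days_per_year * stocks * hyper_param_values * cv_folds
total_seconds = trees * seconds_per_tree
total_years = total_seconds / (365 * 24 * 3600)

print(f"{trees:,} trees, about {total_years:.1f} years if built one after another")
```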

This means more than 6 years of computation on a 4 CPU computer. At this stage, there are mainly two possibilities:

Grid computing
Computing the trees each month instead of each day

By applying these two ideas, it is possible to bring the processing time to around 3 hours of calculation (with a 6 x 4 CPU grid of computers). The next post of the series will discuss the risk management of the system.


Monday, October 13, 2008

Decision Tree for Stock Prediction: Data Preprocessing

This post is part of a series on Decision Tree for Stock Prediction. For more details feel free to read part 1 and part 2 of the series.

Once the stocks have been filtered, a list of stocks for every month of the shifting window system is available. Then, two steps need to be undertaken: calculation of technical indicators and standardization of data. First, data is separated into training and testing sets. One year is taken as training data and one month as test data. For example:

Training set: August 31st, 2004 -> August 31st, 2005
Test set: September 1st, 2005 -> September 1st, 2005

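In fact, the end date of August 31st, 2005 is not exact. Since we predict n days in advance, the output class of each of the last n training days depends on a price that falls in the test period, so these n days have to be removed from the training set.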
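The split can be sketched as follows. This is only an illustration under stated assumptions: `weekdays` and `split_window` are hypothetical helpers, trading days are approximated by Monday-to-Friday dates (holidays are ignored), n is counted in trading days, and the test month is assumed to run until September 30th, 2005.

```python
from datetime import date, timedelta

def weekdays(start, end):
    """Monday-to-Friday dates from start to end inclusive (holidays ignored)."""
    days = []
    current = start
    while current <= end:
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            days.append(current)
        current += timedelta(days=1)
    return days

def split_window(days, train_start, train_end, test_start, test_end, horizon):
    """Split days into training and test sets, dropping the last
    `horizon` training days because their labels look into the test period."""
    train = [d for d in days if train_start <= d <= train_end]
    test = [d for d in days if test_start <= d <= test_end]
    if horizon > 0:
        train = train[:-horizon]
    return train, test

days = weekdays(date(2004, 8, 31), date(2005, 9, 30))
train_days, test_days = split_window(
    days,
    train_start=date(2004, 8, 31),
    train_end=date(2005, 8, 31),
    test_start=date(2005, 9, 1),
    test_end=date(2005, 9, 30),  # assumed end of the one-month test window
    horizon=5,                   # hypothetical value of n
)
print(train_days[-1], test_days[0], len(test_days))
```

With a horizon of 5 trading days, the last usable training day moves from August 31st back to August 24th, 2005.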

Then, technical indicators can be calculated. Most of them need data older than August 31st, 2004. For example, a 20-day simple moving average (SMA) for the first training day needs the closes of the 19 preceding days. Below is a non-exhaustive list of basic and technical indicators that are used in the system:

Close price
Volume
Simple Moving Average (SMA)
Relative Strength Index (RSI)
Momentum
Rate Of Change (ROC)
On Balance Volume (OBV)
Etc.
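To make the need for older data concrete, here is a minimal sketch of three of the indicators listed above. The functions use common textbook definitions, which are an assumption: the exact formulas and periods used in the system are not given here.

```python
def sma(values, window):
    """Simple moving average; None until `window` values are available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out

def momentum(values, period):
    """Difference between today's value and the value `period` days earlier."""
    return [None if i < period else values[i] - values[i - period]
            for i in range(len(values))]

def rate_of_change(values, period):
    """Percentage change relative to the value `period` days earlier."""
    return [None if i < period
            else 100.0 * (values[i] - values[i - period]) / values[i - period]
            for i in range(len(values))]

closes = [10.0, 11.0, 12.0, 11.0, 13.0]  # made-up prices for illustration
print(sma(closes, 3))
print(momentum(closes, 2))
print(rate_of_change(closes, 2))
```

The leading `None` values are the days for which the indicator cannot be computed without earlier prices, which is why the price history has to start before the first training day.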

In addition to these indicators, combinations of them are used. We thus obtain a matrix for both training and test data, where each row is a day in the year and each column is one of the possible indicators. For the training data matrix, an additional column representing the output class (-1 or +1) is added.
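The output class column could be built along these lines. The labeling rule is an assumption made for illustration: +1 when the close n days later is higher than today's close, -1 otherwise (ties count as -1). `make_labels` is a hypothetical helper.

```python
def make_labels(closes, horizon):
    """Return +1/-1 per day by comparing the close `horizon` days ahead
    with today's close; None when the future close is not available."""
    labels = []
    for i in range(len(closes)):
        if i + horizon >= len(closes):
            labels.append(None)  # future price unknown
        else:
            labels.append(1 if closes[i + horizon] > closes[i] else -1)
    return labels

closes = [10.0, 11.0, 12.0, 11.0, 13.0]  # made-up prices for illustration
labels = make_labels(closes, horizon=2)
print(labels)
```

The trailing `None` values correspond to the last n days, which cannot be labeled without prices from beyond the end of the window.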

Once this is done, data are standardized to obtain a zero mean and unit standard deviation. This puts variables that are not expressed in the same unit (e.g. close values and volumes) on a common scale before the decision tree chooses among them. The next post will discuss the main part of the system: the classification tree process.
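A minimal sketch of the standardization step for one column is given below. Computing the mean and standard deviation on the training data only and reusing them for the test data is an assumption, as is the use of the population standard deviation; `fit_standardizer` and `standardize` are hypothetical names.

```python
from statistics import mean, pstdev

def fit_standardizer(column):
    """Mean and population standard deviation of a training column."""
    return mean(column), pstdev(column)

def standardize(column, mu, sigma):
    """Rescale a column to (x - mu) / sigma; a constant column maps to zeros."""
    if sigma == 0:
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

train_close = [10.0, 12.0, 14.0]  # made-up training values
test_close = [16.0]               # made-up test value

mu, sigma = fit_standardizer(train_close)
train_z = standardize(train_close, mu, sigma)
test_z = standardize(test_close, mu, sigma)
print(train_z, test_z)
```

After this step the training column has zero mean and unit standard deviation, while the test column is expressed on the same scale without having influenced it.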


Wednesday, August 27, 2008

Data Mining on the NIFTY

I'm just back from a business trip to India (Delhi). I went there to meet MarkeTopper, a company that uses data mining for stock market predictions. My first impression was the seriousness of the company. They have an excellent internal structure and their employees are very qualified. Unlike me, they're not making direct predictions of future stock market increases or decreases. Instead, they use data mining algorithms to build their strategies and portfolios.

I can't go into details about the algorithms they use and how they use them, for obvious confidentiality reasons. The meeting was very interesting for several reasons. One of them was the way they approached the problem of using data mining algorithms in the stock market. It is completely different from my personal approach. I always thought of using these algorithms to predict a value (close price) or a class (increase/decrease) representing price evolution five, ten or twenty days ahead and then applying this process in the past to backtest the system. Their approach is to use the same kind of algorithms to tune a strategy and then backtest it to see its results in the past. Very interesting!

MarkeTopper website


Sunday, May 25, 2008

Influencing predictions

As you may know, I have started applying data mining in a small financial company in Switzerland. I have thus read some books about technical analysis. The aim of technical analysis is to use technical indicators (that are based on daily stock prices) to predict the future trends of stocks in the short or mid term. I was surprised to read in a book that the author compares stock market predictions with weather forecasting.

I personally think that there is a big difference between these two tasks (at least conceptually). It concerns the influence of your predictions. With the stock market, people use technical analysis to buy and sell, and thus influence what they are predicting. Millions of traders and quantitative analysts in the world are using the same set of tools and acting accordingly. This is not the case with weather forecasting. Even if millions of people could predict the next day's weather exactly, it would not change tomorrow's weather. This is an interesting difference between these two tasks.

Of course, the stock market does not depend only on traders who believe in technical analysis. Most of the influence certainly comes from supply and demand as well as from everyday news. Also, many people around the world do not base their strategies on technical analysis but rather on fundamentals (information coming directly from companies). What do you think of this issue? Are technical analysts really influencing their own predictions? Feel free to post your opinion.