Thursday, 12. May 2005

12.5.05 (II)

I studied the paper 'Relationship-Based Clustering and Visualization for High-Dimensional Data Mining' in depth. Detailed notes will follow as soon as I have finished the paper.

12.5.05 (I)

Some first findings after playing around with gCluto and a dataset of 1000 twoday stories.

what I did
* I exported the latest 1000 stories from the twoday.net MySQL database to a file named 'twoday.txt', each row containing the story ID and the content of that story.
* I used the program doc2mat to build a data matrix of term frequencies in a format that gCluto can import. Since doc2mat can only parse ASCII, I had to recode 'twoday.txt' from UTF-8 to ISO-8859-1 with the UNIX command 'recode UTF-8 twoday.txt' and substitute each of the umlauts (sed -i -e 's/ä/ae/g' twoday.txt). The doc2mat call was:
doc2mat -nostop -minwlen=2 -nlskip=1 -tokfile twoday.txt twoday.mat
I did not use the stop word list, since imo such terms should be filtered out as unimportant by the clustering algorithm itself. Note that doc2mat uses Porter's stemming algorithm, which is designed for English documents.
* I installed gCluto on my laptop and imported the twoday.mat file together with its corresponding row and column labels.
* I ran a first clustering in gCluto with the default settings, i.e. 'Repeated Bisection' with cosine as the similarity measure.
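
The preprocessing steps above can be sketched in a few lines of Python. This is only an illustration, not what doc2mat does internally: there is no stemming, the umlaut table goes beyond the single 'ä' substitution shown above, and all function names are made up for this example. The output layout (a header line with rows, columns and non-zeros, then one line per document with 1-based column/value pairs) is how I understand the sparse matrix format from the CLUTO manual, so check it against the manual before relying on it.

```python
import re
from collections import Counter

# Assumed transliteration table; the post itself only shows the 'ä' case.
UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue",
           "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}


def transliterate(text):
    """Replace umlauts with ASCII digraphs, then drop any other non-ASCII."""
    for src, dst in UMLAUTS.items():
        text = text.replace(src, dst)
    return text.encode("ascii", "ignore").decode("ascii")


def tokenize(text, min_len=2):
    """Lowercase, split on non-letters, keep tokens of at least min_len chars."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) >= min_len]


def term_frequency_matrix(docs):
    """Return (vocab, rows): vocab maps term -> 1-based column index,
    rows is a list of {column index: term frequency} dicts."""
    vocab = {}
    rows = []
    for doc in docs:
        counts = Counter(tokenize(transliterate(doc)))
        row = {}
        for term, n in counts.items():
            col = vocab.setdefault(term, len(vocab) + 1)
            row[col] = n
        rows.append(row)
    return vocab, rows


def to_sparse_text(vocab, rows):
    """Serialize as 'nrows ncols nnz' plus one 'col value ...' line per row."""
    nnz = sum(len(row) for row in rows)
    lines = [f"{len(rows)} {len(vocab)} {nnz}"]
    for row in rows:
        lines.append(" ".join(f"{c} {v}" for c, v in sorted(row.items())))
    return "\n".join(lines) + "\n"
```

Calling term_frequency_matrix on a list of story texts and passing the result to to_sparse_text gives a string that could be written to a .mat file. No stop words are removed here, which mirrors the decision above to leave them in.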

what I found out
* Cluster sizes range from 48 to 323, which is acceptable.
* One of the clusters seems to consist of English documents.
* The most descriptive and also the most discriminating terms are all very common terms, i.e. words that would normally be in a stop word list.
* The most descriptive terms are nearly identical to the most discriminating terms, which is not something I expected.
* The matrix visualization does not show any significant 'bands', only some isolated red 'spots'. It remains unclear from the visualization what really distinguishes the clusters found (see the screenshots below).
* The mountain visualization looks nice, but at the moment it remains unclear how that plot is generated and how it is to be interpreted.
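
To make the stop word finding a bit more concrete, here is a small Python sketch of how very common terms can dominate cosine similarity on raw term frequencies, and how an inverse document frequency (IDF) weighting would damp them. This is a generic illustration with made-up toy documents, not a description of what gCluto computes; whether and how gCluto applies such a weighting is something I still have to check in the manual.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def idf_weights(docs):
    """IDF per term: log(number of docs / number of docs containing the term)."""
    df = Counter()
    for doc in docs:
        df.update(doc.keys())
    n_docs = len(docs)
    return {term: math.log(n_docs / count) for term, count in df.items()}


def apply_idf(doc, idf):
    """Scale each term frequency by the IDF weight of its term."""
    return {term: tf * idf[term] for term, tf in doc.items()}


# Toy term-frequency vectors (hypothetical data): 'und' occurs in every document.
docs = [
    {"und": 5, "katze": 1},
    {"und": 5, "hund": 1},
    {"und": 4, "katze": 2},
]

idf = idf_weights(docs)
weighted = [apply_idf(doc, idf) for doc in docs]

raw_sim = cosine(docs[0], docs[1])               # dominated by 'und'
weighted_sim = cosine(weighted[0], weighted[1])  # 'und' now has weight 0
```

With raw frequencies the first two toy documents look almost identical, only because they share 'und'. After IDF weighting a term that occurs in every document gets weight log(1) = 0, so only the rarer terms contribute to the similarity.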

what I still don't know
* A lot more playing around with the clustering options needs to be done, once I have studied the gCluto manual.
* ...

some screenshots I made

Software du Jour

CLUTO
CLUTO is a software package for clustering low- and high-dimensional datasets and for analyzing the characteristics of the various clusters. CLUTO is well-suited for clustering data sets arising in many diverse application areas including information retrieval, customer purchasing transactions, web, GIS, science, and biology.
Status: not yet evaluated
http://www-users.cs.umn.edu/~karypis/cluto/index.html

gCLUTO
a graphical Cluster Toolkit based upon CLUTO
http://www-users.cs.umn.edu/~karypis/cluto/gcluto/index.html (manual)
Status: first experiments with demo dataset and twoday dataset (also see here)
Note: quite impressive little program to generate matrix visualizations and mountain views of clusterings.

METIS
METIS is a family of programs for partitioning unstructured graphs and hypergraphs and computing fill-reducing orderings of sparse matrices.
Status: not yet evaluated
http://www-users.cs.umn.edu/~karypis/metis/index.html

cviz
A visualization tool designed for analyzing high-dimensional data in large, complex data sets.
Status: evaluated
Note: CViz is a deprecated, unimpressive, non-interactive cluster tool with a restrictive user license (see screenshot).
http://www.alphaworks.ibm.com/tech/cviz
