diary

Monday, 15. August 2005

...

The clustering and visualisation algorithm of EigenCluster can be used through a web form.

Using the data on Irish polls resulted in the following image:
bertin-after
Using the first 100 entries of the twoday dataset resulted in the following image:
twoday-after
I assume the poor result is caused by the large number of variables which have no entries at all (because I only used the first 100 lines of that dataset). Trying to upload the whole dataset failed, probably because it was too large to be processed by the server.
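If the empty variables really are the cause, one thing to try would be dropping every column that has no entries in the 100-row subset before uploading it. A minimal sketch of that filtering step follows; the function name and the toy data are made up for illustration, and the input format expected by the EigenCluster web form is not covered here.

```python
def drop_empty_columns(rows, labels):
    """Keep only the columns that have at least one non-zero entry."""
    keep = [j for j in range(len(labels)) if any(row[j] != 0 for row in rows)]
    kept_rows = [[row[j] for j in keep] for row in rows]
    kept_labels = [labels[j] for j in keep]
    return kept_rows, kept_labels


# Toy stand-in for a subset of the data: variable "c" never occurs in it.
labels = ["a", "b", "c"]
rows = [[1, 0, 0], [0, 2, 0]]
subset, subset_labels = drop_empty_columns(rows, labels)
```

Whether this improves the picture is an open question; it only removes the variables that cannot carry any information in the subset.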

Tuesday, 31. May 2005

some random notes

two (random) ideas at the moment:
* Kohonen map on a sphere (also see this related article)
* investigate Jacques Bertin's work on the reorderable matrix further (also see this related article)

Thursday, 12. May 2005

12.5.05 (II)

I studied the paper 'Relationship-Based Clustering and Visualization for High-Dimension Data Mining' in depth. Detailed notes will follow as soon as I'm finished with the paper.

12.5.05 (I)

Some first findings after playing around with gCluto and a dataset of 1000 twoday stories.

what i did
* i exported the latest 1000 stories from the MySQL twoday.net database to a file named 'twoday.txt'; each row containing the story ID and the content of that story;
* i used the program doc2mat to build a data matrix containing the term frequencies in a format that can be imported by gCluto. Since doc2mat can only parse ASCII, i had to recode 'twoday.txt' from 'UTF-8' to 'ISO-8859-1' with the UNIX command 'recode UTF-8 twoday.txt', and had to substitute each of the umlauts (sed -i -e 's/ä/ae/g' twoday.txt);
* the doc2mat call was: doc2mat -nostop -minwlen=2 -nlskip=1 -tokfile twoday.txt twoday.mat. i did not use the stop word list, since imo such terms should be filtered out as unimportant by the clustering algorithm itself. doc2mat uses Porter's stemming algorithm (which is designed for English documents);
* i installed gCluto on my laptop, and imported the twoday.mat file together with its corresponding row and column labels.
* i did a first clustering in gCluto with the default settings, that is 'Repeated Bisection' with the cosine as the similarity measure.
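The preprocessing steps above can be sketched in a few lines of Python. This is not doc2mat: there is no stemming and no token file, the function names are made up, and it assumes that each row of 'twoday.txt' is the story ID, a single space, and then the story text. The output follows the CLUTO sparse matrix layout (a header line with rows, columns and non-zeros, then one line per row with 1-based 'column value' pairs).

```python
import re
from collections import Counter

# Transliterations for German umlauts and sharp s
# (the sed call in the notes only shows the 'ä' case).
UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}


def transliterate(text):
    for char, repl in UMLAUTS.items():
        text = text.replace(char, repl)
    return text


def tokenize(text, min_len=2):
    """Lowercase ASCII-letter tokens with at least min_len characters."""
    words = re.findall(r"[a-z]+", transliterate(text).lower())
    return [w for w in words if len(w) >= min_len]


def build_matrix(lines):
    """Return (row_labels, column_labels, rows); each row maps column index to count."""
    vocab = {}
    row_labels, rows = [], []
    for line in lines:
        # Assumption: story ID and content are separated by the first space.
        story_id, _, content = line.partition(" ")
        row = {}
        for term, count in Counter(tokenize(content)).items():
            col = vocab.setdefault(term, len(vocab) + 1)  # 1-based column index
            row[col] = count
        row_labels.append(story_id)
        rows.append(row)
    return row_labels, list(vocab), rows


def to_cluto_sparse(rows, n_cols):
    """Serialize the rows as a CLUTO sparse matrix string."""
    nnz = sum(len(row) for row in rows)
    out = [f"{len(rows)} {n_cols} {nnz}"]
    for row in rows:
        out.append(" ".join(f"{c} {v}" for c, v in sorted(row.items())))
    return "\n".join(out) + "\n"


# Two made-up stories standing in for rows of 'twoday.txt'.
stories = ["1 Das Wetter ist schön", "2 schön ist das"]
row_labels, col_labels, rows = build_matrix(stories)
matrix_text = to_cluto_sparse(rows, len(col_labels))
```

Doing the transliteration inside the tokenizer avoids the separate recode and sed passes, at the price of silently dropping any other non-ASCII characters.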

what i found out
* Cluster sizes range from 48 to 323, which is acceptable.
* One of the clusters seems to consist of English documents;
* The most descriptive and also the most discriminating terms are all very common terms, i.e. words that are supposed to be in a stop word list.
* The most descriptive terms are nearly identical to the most discriminating terms. Not something that i expected.
* The matrix visualization does not show any significant 'bands', rather some singular red 'spots'. It remains unclear from the visualization what really distinguishes the found clusters. (see the screenshots below)
* The mountain visualization looks nice, but at the moment it remains unclear how that plot is generated and how it is to be interpreted.
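To see why very common words can end up as the 'most descriptive' terms, here is a small sketch. It assumes that 'descriptive' means the terms that explain the largest share of the average cosine similarity inside a cluster; this is my reading, not gCluto's actual code. With rows scaled to unit length, that average (counting self-pairs) equals the squared length of the cluster centroid, so the share of term j is its squared centroid entry divided by the sum of all squared entries. The function name and the toy counts are invented.

```python
import math


def unit(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def descriptive_shares(cluster_rows, terms):
    """Share of the within-cluster cosine similarity explained by each term."""
    rows = [unit(r) for r in cluster_rows]
    n = len(rows)
    centroid = [sum(r[j] for r in rows) / n for j in range(len(terms))]
    total = sum(c * c for c in centroid)
    shares = {t: c * c / total for t, c in zip(terms, centroid)}
    return sorted(shares.items(), key=lambda kv: kv[1], reverse=True)


# Raw term counts for three made-up stories; 'und' stands in for a stop word.
terms = ["und", "wetter", "regen", "sonne"]
cluster = [
    [9, 2, 1, 0],
    [8, 1, 0, 2],
    [7, 0, 2, 1],
]
ranking = descriptive_shares(cluster, terms)
```

In this toy cluster the stop-word-like term comes out on top by a wide margin, simply because raw counts give it most of the weight in every row. That would be consistent with the observation above, given that the matrix was built with -nostop.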

what i still don't know
* A lot more playing around with the clustering options needs to be done, after i have studied the gCluto manual.
* ...

some screenshots i made
