Enterprise Search support for Apache Lucene and Solr by Lucid Imagination

Secondary links

  • Contact Us
  • Log in
  • Downloads
  • Solutions
    • Software |
    • Services |
    • Training |
    • White Papers & Case Studies |
    • Webinars & Events |
  • Developers
    • Blog |
    • Tech Articles |
    • Community |
    • Documentation |
    • Downloads |
    • Webcasts & Podcasts |
  • About
    • Market Overview |
    • Management |
    • Company News |
    • In the Media |
    • Contact |

beta

Start new search

Back to search results

  1. FromDate
  2. Grant Ingersoll2009-06-16 17:35
  3. Shashikant Kore2009-06-16 23:43
  4. Ted Dunning2009-06-17 02:51
  5. Grant Ingersoll2009-06-17 09:14
  6. Grant Ingersoll2009-06-17 09:32
  7. Shashikant Kore2009-06-18 06:17
  8. Grant Ingersoll2009-07-14 09:41
  9. Ted Dunning2009-07-27 21:42
  10. Benson Margulies2009-07-27 21:51
  11. Ted Dunning2009-07-28 00:48
  12. Grant Ingersoll2009-07-28 06:55
  13. Benson Margulies2009-07-28 14:49
  14. Ted Dunning2009-07-28 16:36
  15. Grant Ingersoll2009-08-18 09:55
  16. Grant Ingersoll2009-08-18 10:32
  17. Benjamin Dageroth2009-08-18 11:37
  18. Ted Dunning2009-08-18 13:04
  19. Grant Ingersoll2009-08-18 13:49
  20. Jack Tanner2009-08-18 17:40
  21. Grant Ingersoll2010-01-09 12:18
  22. Grant Ingersoll2010-01-09 13:57
  23. Ted Dunning2010-01-09 15:31
  24. Ted Dunning2010-01-09 15:32

[mahout-user] Validating clustering output

Subject:
Re: Validating clustering output
From:
Benson Margulies <bimargulies@...>
Date:
2009-07-27 21:51
Brown and DiPietro's algorithm for clustering based on entropy is
somewhat infamous for the difficulty of achieving usable performance.
Mike Collins was responsible for a famously speedy version. Having
build one that is just barely fast enough in C++, I wouldn't recommend
trying it in Java. Of course, you aren't proposing that, just
recommending the bigram entropy metric or something like it.

On Mon, Jul 27, 2009 at 9:42 PM, Ted Dunning<ted.dunning@gmail.com> wrote:
(vastly delayed response ... huge distractions competing with more than 2 minutes answers are to blame) Grant, For evaluating clustering for symbol sequences: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.7275 Most of the other references I have found talk about quality relative to gold standard judgments about whether exemplars are in the same class or relative to similarity/distinctiveness ratios.  Neither is all that satisfactory. My preference is an entropic measure that describes how much of the information in your data is captured by the clustering vs how much residual info there is. The other reference I am looking for may be in David Mackay's book.  The idea is that you measure the quality of the approximation by looking at the entropy in the cluster assignment relative to the residual required to precisely specify the original data relative to the quantized value. This is also related to trading off signal/noise in a vector quantizer. David,  do you have a moment to talk about this with me?  I can't free up the time to chase these final references and come up with a nice formula for this.  I think you could do it in 10-20 minutes. On Tue, Jul 14, 2009 at 6:41 AM, Grant Ingersoll <gsingers@apache.org>wrote:
On Jun 17, 2009, at 2:51 AM, Ted Dunning wrote:  A principled approach to cluster evaluation is to measure how well the
cluster membership captures the structure of unseen data.  A natural measure for this is to measure how much of the entropy of the data is captured by cluster membership.  For k-means and its natural L_2 metric, the natural cluster quality metric is the squared distance from the nearest centroid adjusted by the log_2 of the number of clusters.  This can be compared to the squared magnitude of the original data or the squared deviation from the centroid for all of the data.  The idea is that you are changing the representation of the data by allocating some of the bits in your original representation to represent which cluster each point is in.  If those bits aren't made up by the residue being small then your clustering is making a bad trade-off. In the past, I have used other more heuristic measures as well.  One of the key characteristics that I would like to see out of a clustering is a degree of stability.  Thus, I look at the fractions of points that are assigned to each cluster or the distribution of distances from the cluster centroid. These values should be relatively stable when applied to held-out data. For text, you can actually compute perplexity which measures how well cluster membership predicts what words are used.  This is nice because you don't have to worry about the entropy of real valued numbers.
Do you have any references on any of the above approaches?
-- Ted Dunning, CTO DeepDyve 111 West Evelyn Ave. Ste. 202 Sunnyvale, CA 94086 http://www.deepdyve.com 858-414-0013 (m) 408-773-0220 (fax)

Solr Powered

Give us your feedback

  • Lucene
  • Solr
  • Nutch
  • Tika
  • Mahout
  • Droids
  • PyLucene
  • Lucene.Net
  • Lucy
  • Lucene4c
  • Open Relevance Project
  • How We Can Help:
    • Getting Started |
    • Support Subscriptions |
    • White Papers |
    • Training |
    • Consulting |
    • Contact Us |
  • Developers:
    • Blog |
    • Documentation |
    • Tech Articles |
    • Podcasts and Videos |
    • Community |
  • Downloads:
    • LucidWorks for Solr |
    • LucidWorks for Lucene |
    • LucidGaze for Solr |
    • LucidGaze for Lucene |
  • Products:
  • Services:

Contact | Privacy Policy | Legal Terms of Use | Copyrights and Disclaimers | Admin

Apache Solr, Apache Lucene, ApacheCon and their logos are trademarks of the Apache Software Foundation.

© 2010 Lucid Imagination. All Right reserved.