org.apache.mahout.math.stats
Class LogLikelihood

java.lang.Object
  extended by org.apache.mahout.math.stats.LogLikelihood

public final class LogLikelihood
extends Object

Utility methods for working with log-likelihood.


Nested Class Summary
static class LogLikelihood.ScoredItem<T>
           
 
Method Summary
static <T> List<LogLikelihood.ScoredItem<T>> compareFrequencies(com.google.common.collect.Multiset<T> a, com.google.common.collect.Multiset<T> b, int maxReturn, double threshold)
          Compares two sets of counts to see which items are interestingly over-represented in the first set.
static double entropy(long... elements)
          Calculates the unnormalized Shannon entropy.
static double logLikelihoodRatio(long k11, long k12, long k21, long k22)
          Calculates the raw log-likelihood ratio for two events, A and B.
static double rootLogLikelihoodRatio(long k11, long k12, long k21, long k22)
          Calculates the root log-likelihood ratio for two events.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

entropy

public static double entropy(long... elements)
Calculates the unnormalized Shannon entropy. This is -sum_i x_i log(x_i / N) = -N sum_i (x_i / N) log(x_i / N), where N = sum_i x_i. If the x_i sum to 1, this is the same as the usual normalized expression. Leaving the value unnormalized makes it easier to work with counts and to compute the LLR (log-likelihood ratio).

Returns:
The entropy value for the elements
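
The formula above can be sketched in a few lines of plain Java. This is an illustrative reimplementation, not the Mahout source: the class name EntropySketch and the helper xLogX are hypothetical, the natural logarithm is assumed (this page does not state the base), and the sum is rearranged as N log N - sum_i x_i log x_i so that no division by N is needed.

```java
// Illustrative sketch only; not the Mahout source.
final class EntropySketch {

    // x * ln(x), using the usual convention that 0 * ln(0) = 0.
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy: -sum_i x_i * ln(x_i / N), where N = sum_i x_i.
    // Algebraically equal to N ln N - sum_i x_i ln x_i.
    static double entropy(long... counts) {
        long total = 0;
        double sumXLogX = 0.0;
        for (long x : counts) {
            if (x < 0) {
                throw new IllegalArgumentException("counts must be non-negative");
            }
            total += x;
            sumXLogX += xLogX(x);
        }
        return xLogX(total) - sumXLogX;
    }

    public static void main(String[] args) {
        // Two equally likely outcomes: 2 * ln(2), about 1.386.
        System.out.println(entropy(1, 1));
        // All mass on one outcome: no uncertainty, so the result is 0.
        System.out.println(entropy(5, 0));
    }
}
```

Dividing the result by N would give the normalized entropy in nats, under the same natural-log assumption.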

logLikelihoodRatio

public static double logLikelihoodRatio(long k11,
                                        long k12,
                                        long k21,
                                        long k22)
Calculates the raw log-likelihood ratio for two events, A and B. The four counts form the following contingency table:

                   Event A                   Everything but A
Event B            A and B together (k_11)   B, but not A (k_12)
Everything but B   A without B (k_21)        Neither A nor B (k_22)

Parameters:
k11 - The number of times the two events occurred together
k12 - The number of times the second event occurred WITHOUT the first event
k21 - The number of times the first event occurred WITHOUT the second event
k22 - The number of times something else occurred (i.e. neither of these events occurred)
Returns:
The raw log-likelihood ratio

Credit to http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html for the table and the descriptions.
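
One common way to compute this score from the four counts is through unnormalized entropies of the row sums, the column sums, and the full table: LLR = 2 * (H(rows) + H(columns) - H(table)). The sketch below follows that identity. It is an assumption that the library computes the value the same way; the class name LlrSketch and its helpers are hypothetical, and natural logarithms are assumed.

```java
// Illustrative sketch only; not the Mahout source.
final class LlrSketch {

    // x * ln(x), with 0 * ln(0) taken as 0.
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy: N ln N - sum_i x_i ln x_i.
    static double entropy(long... counts) {
        long total = 0;
        double sumXLogX = 0.0;
        for (long x : counts) {
            total += x;
            sumXLogX += xLogX(x);
        }
        return xLogX(total) - sumXLogX;
    }

    // Raw log-likelihood ratio for a 2x2 contingency table.
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Rounding can push a true zero slightly negative, so clamp at 0.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    public static void main(String[] args) {
        // Independent events: the score is 0 (up to rounding).
        System.out.println(logLikelihoodRatio(1, 1, 1, 1));
        // Perfectly associated events: 40 * ln(2), about 27.73.
        System.out.println(logLikelihoodRatio(10, 0, 0, 10));
    }
}
```

The score is zero when the events are independent and grows as the table departs from independence; on its own it does not say whether the association is positive or negative.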


rootLogLikelihoodRatio

public static double rootLogLikelihoodRatio(long k11,
                                            long k12,
                                            long k21,
                                            long k22)
Calculates the root log-likelihood ratio for two events. See logLikelihoodRatio(long, long, long, long).

Parameters:
k11 - The number of times the two events occurred together
k12 - The number of times the second event occurred WITHOUT the first event
k21 - The number of times the first event occurred WITHOUT the second event
k22 - The number of times something else occurred (i.e. neither of these events occurred)
Returns:
The root log-likelihood ratio

More discussion is available at http://s.apache.org/CGL and in the response to Wataru's comment at http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
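
A minimal sketch of how a root score can be derived from the raw ratio: take the square root and attach a sign that shows the direction of the association. The sign rule used here (negative when k11 / (k11 + k12) is smaller than k21 / (k21 + k22)) is an assumption and is not stated on this page. The class RootLlrSketch and the method signedRoot are hypothetical, and the raw ratio is passed in as an argument to keep the block short.

```java
// Illustrative sketch only; the sign convention is an assumption.
final class RootLlrSketch {

    // Signed square root of a raw log-likelihood ratio.
    // llr is the (non-negative) raw ratio for the table k11, k12, k21, k22.
    static double signedRoot(double llr, long k11, long k12, long k21, long k22) {
        double root = Math.sqrt(llr);
        double firstRowRate = (double) k11 / (k11 + k12);
        double secondRowRate = (double) k21 / (k21 + k22);
        // Negative when k11 is under-represented relative to the second row.
        return firstRowRate < secondRowRate ? -root : root;
    }

    public static void main(String[] args) {
        // The raw ratio for the table (10, 0, 0, 10) is 40 * ln(2); the same
        // value applies to the mirrored table (0, 10, 10, 0).
        double llr = 40 * Math.log(2);
        System.out.println(signedRoot(llr, 10, 0, 0, 10) > 0);  // true
        System.out.println(signedRoot(llr, 0, 10, 10, 0) < 0);  // true
    }
}
```

A signed root lets two tables with the same raw ratio be told apart when one shows over-representation and the other under-representation.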


compareFrequencies

public static <T> List<LogLikelihood.ScoredItem<T>> compareFrequencies(com.google.common.collect.Multiset<T> a,
                                                                       com.google.common.collect.Multiset<T> b,
                                                                       int maxReturn,
                                                                       double threshold)
Compares two sets of counts to see which items are interestingly over-represented in the first set.

Parameters:
a - The first counts.
b - The reference counts.
maxReturn - The maximum number of items to return. Use maxReturn >= a.elementSet().size() to return all scores above the threshold.
threshold - The minimum score for items to be returned. Use 0 to return all items more common in a than b. Use -Double.MAX_VALUE (not Double.MIN_VALUE, which is the smallest positive double) to disable the threshold.
Returns:
A list of scored items with their scores.
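
The warning about the threshold parameter rests on how Java defines its Double constants: Double.MIN_VALUE is the smallest positive double, not the most negative one. The sketch below shows the effect on a score filter. The class ThresholdSketch and the method passes are hypothetical, and the use of >= is an assumption about how a threshold might be applied, not a statement about the library's comparison.

```java
// Illustrative sketch only; passes() is a hypothetical stand-in for a filter.
final class ThresholdSketch {

    // True when a score is at or above the threshold.
    static boolean passes(double score, double threshold) {
        return score >= threshold;
    }

    public static void main(String[] args) {
        double negativeScore = -3.2;

        // Double.MIN_VALUE is a tiny positive number.
        System.out.println(Double.MIN_VALUE > 0.0);                     // true

        // So it rejects every negative score and a score of zero.
        System.out.println(passes(negativeScore, Double.MIN_VALUE));    // false

        // -Double.MAX_VALUE is the most negative finite double.
        System.out.println(passes(negativeScore, -Double.MAX_VALUE));   // true
    }
}
```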


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.