org.apache.mahout.cf.taste.impl.model.file
Class FileDataModel

java.lang.Object
  extended by org.apache.mahout.cf.taste.impl.model.AbstractDataModel
      extended by org.apache.mahout.cf.taste.impl.model.file.FileDataModel
All Implemented Interfaces:
Serializable, Refreshable, DataModel

public class FileDataModel
extends AbstractDataModel

A DataModel backed by a delimited file. This class expects a file where each line contains a user ID, followed by item ID, followed by optional preference value, followed by optional timestamp. Commas or tabs delimit fields:

userID,itemID[,preference[,timestamp]]

Preference value is optional to accommodate applications that have no notion of a preference value (that is, the user simply expresses a preference for an item, but no degree of preference).

The preference value is assumed to be parseable as a double. The user IDs and item IDs are parsed as longs. The timestamp, if present, is assumed to be parseable as a long, though this can be overridden via readTimestampFromString(String). The preference value may be empty, to indicate "no preference value", but the field itself cannot be omitted when a timestamp follows. That is, this is legal:

123,456,,129050099059

But this isn't:

123,456,129050099059

It is also acceptable for the lines to contain additional fields. Fields beyond the fourth (the timestamp) will be ignored. An empty line, or one that begins with '#', will be ignored as a comment.
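
The format rules above can be illustrated with a small stand-alone parser. This is only a sketch: the class PreferenceLineSketch and its fields are hypothetical names invented for the example, and the code is not Mahout's own parsing logic. It shows why the preference field may be blank but must still be present when a timestamp follows.

```java
import java.util.OptionalDouble;
import java.util.OptionalLong;

/** Hypothetical illustration of the line format; not Mahout's parser. */
final class PreferenceLineSketch {
  final long userID;
  final long itemID;
  final OptionalDouble preference;
  final OptionalLong timestamp;

  private PreferenceLineSketch(long userID, long itemID,
                               OptionalDouble preference, OptionalLong timestamp) {
    this.userID = userID;
    this.itemID = itemID;
    this.preference = preference;
    this.timestamp = timestamp;
  }

  /** Parses one line; returns null for empty lines and '#' comments. */
  static PreferenceLineSketch parse(String line) {
    if (line.isEmpty() || line.charAt(0) == '#') {
      return null;
    }
    // A limit of -1 keeps empty fields, so "123,456,,99" yields four fields.
    String[] fields = line.split("[,\t]", -1);
    if (fields.length < 2) {
      throw new IllegalArgumentException("Need at least userID and itemID: " + line);
    }
    long user = Long.parseLong(fields[0]);
    long item = Long.parseLong(fields[1]);
    OptionalDouble pref = fields.length > 2 && !fields[2].isEmpty()
        ? OptionalDouble.of(Double.parseDouble(fields[2]))
        : OptionalDouble.empty();
    OptionalLong time = fields.length > 3 && !fields[3].isEmpty()
        ? OptionalLong.of(Long.parseLong(fields[3]))
        : OptionalLong.empty();
    // Any further fields are simply not read.
    return new PreferenceLineSketch(user, item, pref, time);
  }

  public static void main(String[] args) {
    PreferenceLineSketch legal = parse("123,456,,129050099059");
    System.out.println(legal.preference.isPresent() + " " + legal.timestamp.isPresent());
    // Without the blank field, the timestamp lands in the preference position.
    PreferenceLineSketch misread = parse("123,456,129050099059");
    System.out.println(misread.preference.isPresent() + " " + misread.timestamp.isPresent());
  }
}
```

In the second call the value 129050099059 is read as a preference rather than a timestamp, which is the mistake the "this isn't legal" example warns about.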

This class will reload data from the data file when refresh(Collection) is called, unless the file has been reloaded very recently already.
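
As a rough illustration of this throttling, the sketch below skips a reload when the previous one happened less than a minimum interval ago. The class name, the explicit nowMs argument, and the clock-based check are assumptions made for the example; FileDataModel's actual decision may rely on other signals, such as file modification times.

```java
/** Hypothetical sketch of reload throttling; not FileDataModel's implementation. */
final class ReloadThrottleSketch {
  private final long minReloadIntervalMs;
  private long lastReloadMs;
  private boolean loadedOnce;
  private int reloadCount;

  ReloadThrottleSketch(long minReloadIntervalMs) {
    this.minReloadIntervalMs = minReloadIntervalMs;
  }

  /**
   * Reloads unless a reload already happened within the minimum interval.
   * The current time is passed in so the sketch stays deterministic.
   */
  boolean refresh(long nowMs) {
    if (loadedOnce && nowMs - lastReloadMs < minReloadIntervalMs) {
      return false; // reloaded very recently, so skip
    }
    lastReloadMs = nowMs;
    loadedOnce = true;
    reloadCount++; // a real model would re-read the data file here
    return true;
  }

  int getReloadCount() {
    return reloadCount;
  }

  public static void main(String[] args) {
    ReloadThrottleSketch model = new ReloadThrottleSketch(60_000L);
    System.out.println(model.refresh(0L));      // true: first load
    System.out.println(model.refresh(30_000L)); // false: only 30 s have passed
    System.out.println(model.refresh(60_000L)); // true: interval has elapsed
  }
}
```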

This class will also look for update "delta" files in the same directory, with file names that start the same way (up to the first period). These files have the same format, and provide updated data that supersedes what is in the main data file. This is a mechanism that allows an application to push updates to FileDataModel without re-copying the entire data file.
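
The naming rule can be sketched as a simple prefix comparison. The helper names and the file names below are hypothetical, and the check is an assumption about what "start the same way (up to the first period)" means in practice; the real class may apply further conditions.

```java
/** Hypothetical sketch of matching update files by name; not Mahout's code. */
final class UpdateFileNameSketch {
  /** Returns the part of a file name before its first period, or the whole name. */
  static String prefixOf(String fileName) {
    int period = fileName.indexOf('.');
    return period < 0 ? fileName : fileName.substring(0, period);
  }

  /** True if candidateName looks like an update file for dataFileName. */
  static boolean isUpdateFile(String dataFileName, String candidateName) {
    return !candidateName.equals(dataFileName)
        && candidateName.startsWith(prefixOf(dataFileName));
  }

  public static void main(String[] args) {
    String dataFileName = "ratings.csv";
    String[] candidates = {"ratings.csv", "ratings.1.csv", "ratings-delta.txt", "other.csv"};
    for (String name : candidates) {
      System.out.println(name + " -> " + isUpdateFile(dataFileName, name));
    }
  }
}
```

Under this reading, any file in the directory whose name begins with the same prefix would be picked up, so it may be worth keeping unrelated files with similar names elsewhere.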

One small format difference exists. Update files must also be able to express deletes. This is done by ending the line with a blank preference value, as in "123,456,".

Note that it's all-or-nothing -- all of the items in the file must express no preference, or all of them must express one. These cannot be mixed. Put another way, there will always be the same number of delimiters on every line of the file!
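
To make the delete convention concrete, the sketch below applies update lines to an in-memory map, treating a blank preference value as a delete. It assumes a file that carries preference values; the class name and the nested-map representation are hypothetical and much simpler than the structures FileDataModel uses internally.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of applying update lines; not Mahout's data structures. */
final class DeltaApplySketch {
  /** Applies one update line to a userID -> (itemID -> preference) map. */
  static void applyUpdateLine(String line, Map<Long, Map<Long, Double>> prefs) {
    // A limit of -1 keeps the trailing empty field in "123,456,".
    String[] fields = line.split("[,\t]", -1);
    long user = Long.parseLong(fields[0]);
    long item = Long.parseLong(fields[1]);
    String value = fields.length > 2 ? fields[2] : "";
    if (value.isEmpty()) {
      // Blank preference value: remove the existing preference, if any.
      Map<Long, Double> userPrefs = prefs.get(user);
      if (userPrefs != null) {
        userPrefs.remove(item);
        if (userPrefs.isEmpty()) {
          prefs.remove(user);
        }
      }
    } else {
      // Otherwise the update supersedes whatever value was there before.
      prefs.computeIfAbsent(user, k -> new HashMap<>()).put(item, Double.parseDouble(value));
    }
  }

  public static void main(String[] args) {
    Map<Long, Map<Long, Double>> prefs = new HashMap<>();
    applyUpdateLine("123,456,3.5", prefs);
    applyUpdateLine("123,789,4.0", prefs);
    applyUpdateLine("123,456,", prefs); // delete
    System.out.println(prefs);
  }
}
```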

This class is not intended for use with very large amounts of data (over, say, tens of millions of rows). For that, a JDBC-backed DataModel and a database are more appropriate.

It is possible and likely useful to subclass this class and customize its behavior to accommodate application-specific needs and input formats. See processLine(String, FastByIDMap, FastByIDMap, boolean) and processLineWithoutID(String, FastByIDMap, FastByIDMap).

See Also:
Serialized Form

Field Summary
static long DEFAULT_MIN_RELOAD_INTERVAL_MS
           
 
Constructor Summary
FileDataModel(File dataFile)
           
FileDataModel(File dataFile, boolean transpose, long minReloadIntervalMS)
           
FileDataModel(File dataFile, boolean transpose, long minReloadIntervalMS, String delimiterRegex)
           
FileDataModel(File dataFile, String delimiterRegex)
           
 
Method Summary
protected  DataModel buildModel()
           
static char determineDelimiter(String line)
           
 File getDataFile()
           
 LongPrimitiveIterator getItemIDs()
           
 FastIDSet getItemIDsFromUser(long userID)
           
 float getMaxPreference()
           
 float getMinPreference()
           
 int getNumItems()
           
 int getNumUsers()
           
 int getNumUsersWithPreferenceFor(long itemID)
           
 int getNumUsersWithPreferenceFor(long itemID1, long itemID2)
           
 PreferenceArray getPreferencesForItem(long itemID)
           
 PreferenceArray getPreferencesFromUser(long userID)
           
 Long getPreferenceTime(long userID, long itemID)
          Retrieves the time at which a preference value from a user and item was set, if known.
 Float getPreferenceValue(long userID, long itemID)
          Retrieves the preference value for a single user and item.
 LongPrimitiveIterator getUserIDs()
           
 boolean hasPreferenceValues()
           
protected  void processFile(FileLineIterator dataOrUpdateFileIterator, FastByIDMap<?> data, FastByIDMap<FastByIDMap<Long>> timestamps, boolean fromPriorData)
           
protected  void processFileWithoutID(FileLineIterator dataOrUpdateFileIterator, FastByIDMap<FastIDSet> data, FastByIDMap<FastByIDMap<Long>> timestamps)
           
protected  void processLine(String line, FastByIDMap<?> data, FastByIDMap<FastByIDMap<Long>> timestamps, boolean fromPriorData)
           Reads one line from the input file and adds the data to a FastByIDMap data structure which maps user IDs to preferences.
protected  void processLineWithoutID(String line, FastByIDMap<FastIDSet> data, FastByIDMap<FastByIDMap<Long>> timestamps)
           
protected  long readItemIDFromString(String value)
          Subclasses may wish to override this if ID values in the file are not numeric.
protected  long readTimestampFromString(String value)
          Subclasses may wish to override this to change how time values in the input file are parsed.
protected  long readUserIDFromString(String value)
          Subclasses may wish to override this if ID values in the file are not numeric.
 void refresh(Collection<Refreshable> alreadyRefreshed)
           Triggers "refresh" -- whatever that means -- of the implementation.
protected  void reload()
           
 void removePreference(long userID, long itemID)
          See the warning at setPreference(long, long, float).
 void setPreference(long userID, long itemID, float value)
          Note that this method only updates the in-memory preference data that this FileDataModel maintains; it does not modify any data on disk.
 String toString()
           
 
Methods inherited from class org.apache.mahout.cf.taste.impl.model.AbstractDataModel
setMaxPreference, setMinPreference
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

DEFAULT_MIN_RELOAD_INTERVAL_MS

public static final long DEFAULT_MIN_RELOAD_INTERVAL_MS
See Also:
Constant Field Values
Constructor Detail

FileDataModel

public FileDataModel(File dataFile)
              throws IOException
Parameters:
dataFile - file containing preferences data. If the file is compressed (and its name ends in .gz or .zip accordingly), it will be decompressed as it is read.
Throws:
FileNotFoundException - if dataFile does not exist
IOException - if file can't be read

FileDataModel

public FileDataModel(File dataFile,
                     String delimiterRegex)
              throws IOException
Parameters:
delimiterRegex - If your data file doesn't use '\t' or ',' as its delimiter, you can specify a custom regex pattern.
Throws:
IOException
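
When choosing a value for delimiterRegex, it may help to try the pattern on a sample line first. The sketch below assumes the argument is interpreted as a java.util.regex pattern; the class name and the sample lines are made up for illustration. Characters with a special meaning in regular expressions, such as '|', need to be quoted.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

/** Hypothetical helper for trying out a delimiter regex on a sample line. */
final class DelimiterRegexSketch {
  static String[] split(String line, String delimiterRegex) {
    // A limit of -1 keeps trailing empty fields, such as a blank preference value.
    return Pattern.compile(delimiterRegex).split(line, -1);
  }

  public static void main(String[] args) {
    // A multi-character delimiter works as a plain regex.
    System.out.println(Arrays.toString(split("1::1193::5::978300760", "::")));
    // '|' means alternation in a regex, so it is quoted here.
    System.out.println(Arrays.toString(split("1|1193|5", Pattern.quote("|"))));
  }
}
```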

FileDataModel

public FileDataModel(File dataFile,
                     boolean transpose,
                     long minReloadIntervalMS)
              throws IOException
Parameters:
transpose - transposes user IDs and item IDs -- convenient for 'flipping' the data model this way
minReloadIntervalMS - the minimum interval in milliseconds after which a full reload of the original data file is done when refresh() is called
Throws:
IOException
See Also:
FileDataModel(File)

FileDataModel

public FileDataModel(File dataFile,
                     boolean transpose,
                     long minReloadIntervalMS,
                     String delimiterRegex)
              throws IOException
Parameters:
delimiterRegex - If your data file doesn't use '\t' or ',' as delimiters, you can specify your own using a regex pattern.
Throws:
IOException
Method Detail

getDataFile

public File getDataFile()

reload

protected void reload()

buildModel

protected DataModel buildModel()
                        throws IOException
Throws:
IOException

determineDelimiter

public static char determineDelimiter(String line)

processFile

protected void processFile(FileLineIterator dataOrUpdateFileIterator,
                           FastByIDMap<?> data,
                           FastByIDMap<FastByIDMap<Long>> timestamps,
                           boolean fromPriorData)

processLine

protected void processLine(String line,
                           FastByIDMap<?> data,
                           FastByIDMap<FastByIDMap<Long>> timestamps,
                           boolean fromPriorData)

Reads one line from the input file and adds the data to a FastByIDMap data structure which maps user IDs to preferences. This assumes that each line of the input file corresponds to one preference. After reading a line and determining which user and item the preference pertains to, the method should look to see if the data contains a mapping for the user ID already, and if not, add an empty data structure of preferences as appropriate to the data.

Note that if the line is empty or begins with '#' it will be ignored as a comment.

Parameters:
line - line from input data file
data - all data read so far, as a mapping from user IDs to preferences
fromPriorData - an implementation detail -- if true, data will map IDs to PreferenceArray since the framework is attempting to read and update raw data that is already in memory. Otherwise it maps to Collections of Preferences, since it's reading fresh data. Subclasses must be prepared to handle this wrinkle.

processFileWithoutID

protected void processFileWithoutID(FileLineIterator dataOrUpdateFileIterator,
                                    FastByIDMap<FastIDSet> data,
                                    FastByIDMap<FastByIDMap<Long>> timestamps)

processLineWithoutID

protected void processLineWithoutID(String line,
                                    FastByIDMap<FastIDSet> data,
                                    FastByIDMap<FastByIDMap<Long>> timestamps)

readUserIDFromString

protected long readUserIDFromString(String value)
Subclasses may wish to override this if ID values in the file are not numeric. This provides a hook by which subclasses can inject an IDMigrator to perform translation.


readItemIDFromString

protected long readItemIDFromString(String value)
Subclasses may wish to override this if ID values in the file are not numeric. This provides a hook by which subclasses can inject an IDMigrator to perform translation.
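
For files with non-numeric IDs, the translation behind readUserIDFromString and readItemIDFromString could look roughly like the sketch below. It is a stand-alone stand-in for the IDMigrator idea, not Mahout's class: the name StringIdMapperSketch, the MD5-based hashing, and the in-memory reverse map are all assumptions chosen for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical string-to-long ID mapper; a stand-in for an IDMigrator. */
final class StringIdMapperSketch {
  // Remembers the original string for each generated long ID.
  private final Map<Long, String> longToString = new HashMap<>();

  /** Maps a string ID to a long using the first 8 bytes of its MD5 digest. */
  long toLongID(String stringID) {
    byte[] digest;
    try {
      digest = MessageDigest.getInstance("MD5")
          .digest(stringID.getBytes(StandardCharsets.UTF_8));
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 is not available", e);
    }
    long id = 0L;
    for (int i = 0; i < 8; i++) {
      id = (id << 8) | (digest[i] & 0xFFL);
    }
    longToString.put(id, stringID);
    return id;
  }

  /** Returns the original string for a long ID, or null if it was never mapped. */
  String toStringID(long longID) {
    return longToString.get(longID);
  }

  public static void main(String[] args) {
    StringIdMapperSketch mapper = new StringIdMapperSketch();
    long id = mapper.toLongID("alice");
    System.out.println(id + " -> " + mapper.toStringID(id));
  }
}
```

A subclass would call something like toLongID from its overridden hook and use the reverse lookup when presenting results. Hash collisions are possible in principle with any such scheme.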


readTimestampFromString

protected long readTimestampFromString(String value)
Subclasses may wish to override this to change how time values in the input file are parsed. By default they are expected to be numeric, expressing a time as milliseconds since the epoch.
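
As one example of what an override might do, the sketch below accepts either a numeric value or an ISO-8601 instant and returns milliseconds since the epoch. The class and method names are hypothetical, and the fallback to ISO-8601 is an assumption for illustration rather than behavior of FileDataModel itself.

```java
import java.time.Instant;

/** Hypothetical timestamp parsing that an overriding subclass might delegate to. */
final class TimestampParseSketch {
  /** Parses epoch milliseconds, falling back to an ISO-8601 instant. */
  static long readTimestamp(String value) {
    try {
      return Long.parseLong(value);
    } catch (NumberFormatException e) {
      // Throws DateTimeParseException if the value is not ISO-8601 either.
      return Instant.parse(value).toEpochMilli();
    }
  }

  public static void main(String[] args) {
    System.out.println(readTimestamp("129050099059"));
    System.out.println(readTimestamp("1970-01-01T00:00:01Z")); // 1000
  }
}
```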


getUserIDs

public LongPrimitiveIterator getUserIDs()
                                 throws TasteException
Returns:
all user IDs in the model, in order
Throws:
TasteException - if an error occurs while accessing the data

getPreferencesFromUser

public PreferenceArray getPreferencesFromUser(long userID)
                                       throws TasteException
Parameters:
userID - ID of user to get prefs for
Returns:
user's preferences, ordered by item ID
Throws:
NoSuchUserException - if the user does not exist
TasteException - if an error occurs while accessing the data

getItemIDsFromUser

public FastIDSet getItemIDsFromUser(long userID)
                             throws TasteException
Parameters:
userID - ID of user to get prefs for
Returns:
IDs of items user expresses a preference for
Throws:
NoSuchUserException - if the user does not exist
TasteException - if an error occurs while accessing the data

getItemIDs

public LongPrimitiveIterator getItemIDs()
                                 throws TasteException
Returns:
a LongPrimitiveIterator of all item IDs in the model, in order
Throws:
TasteException - if an error occurs while accessing the data

getPreferencesForItem

public PreferenceArray getPreferencesForItem(long itemID)
                                      throws TasteException
Parameters:
itemID - item ID
Returns:
all existing Preferences expressed for that item, ordered by user ID, as an array
Throws:
NoSuchItemException - if the item does not exist
TasteException - if an error occurs while accessing the data

getPreferenceValue

public Float getPreferenceValue(long userID,
                                long itemID)
                         throws TasteException
Description copied from interface: DataModel
Retrieves the preference value for a single user and item.

Parameters:
userID - user ID to get pref value from
itemID - item ID to get pref value for
Returns:
preference value from the given user for the given item or null if none exists
Throws:
NoSuchUserException - if the user does not exist
TasteException - if an error occurs while accessing the data

getPreferenceTime

public Long getPreferenceTime(long userID,
                              long itemID)
                       throws TasteException
Description copied from interface: DataModel
Retrieves the time at which a preference value from a user and item was set, if known. Time is expressed in the usual way, as a number of milliseconds since the epoch.

Parameters:
userID - user ID for preference in question
itemID - item ID for preference in question
Returns:
time at which preference was set or null if no preference exists or its time is not known
Throws:
NoSuchUserException - if the user does not exist
TasteException - if an error occurs while accessing the data

getNumItems

public int getNumItems()
                throws TasteException
Returns:
total number of items known to the model. This is generally the union of all items preferred by at least one user but could include more.
Throws:
TasteException - if an error occurs while accessing the data

getNumUsers

public int getNumUsers()
                throws TasteException
Returns:
total number of users known to the model.
Throws:
TasteException - if an error occurs while accessing the data

getNumUsersWithPreferenceFor

public int getNumUsersWithPreferenceFor(long itemID)
                                 throws TasteException
Parameters:
itemID - item ID to check for
Returns:
the number of users who have expressed a preference for the item
Throws:
TasteException - if an error occurs while accessing the data

getNumUsersWithPreferenceFor

public int getNumUsersWithPreferenceFor(long itemID1,
                                        long itemID2)
                                 throws TasteException
Parameters:
itemID1 - first item ID to check for
itemID2 - second item ID to check for
Returns:
the number of users who have expressed a preference for the items
Throws:
TasteException - if an error occurs while accessing the data

setPreference

public void setPreference(long userID,
                          long itemID,
                          float value)
                   throws TasteException
Note that this method only updates the in-memory preference data that this FileDataModel maintains; it does not modify any data on disk. Therefore any updates from this method are only temporary, and lost when data is reloaded from a file. This method should also be considered relatively slow.

Parameters:
userID - user to set preference for
itemID - item to set preference for
value - preference value
Throws:
NoSuchItemException - if the item does not exist
NoSuchUserException - if the user does not exist
TasteException - if an error occurs while accessing the data

removePreference

public void removePreference(long userID,
                             long itemID)
                      throws TasteException
See the warning at setPreference(long, long, float).

Parameters:
userID - user from which to remove preference
itemID - item to remove preference for
Throws:
NoSuchItemException - if the item does not exist
NoSuchUserException - if the user does not exist
TasteException - if an error occurs while accessing the data

refresh

public void refresh(Collection<Refreshable> alreadyRefreshed)
Description copied from interface: Refreshable

Triggers "refresh" -- whatever that means -- of the implementation. The general contract is that any Refreshable should always leave itself in a consistent, operational state, and that the refresh atomically updates internal state from old to new.

Parameters:
alreadyRefreshed - Refreshables that are known to have already been refreshed as a result of an initial call to a Refreshable.refresh(Collection) method on some object. This ensures that objects in a refresh dependency graph aren't refreshed twice needlessly.
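
The role of alreadyRefreshed can be shown with a small dependency graph in which one object is reachable along two paths. The Node interface and CountingNode class below are hypothetical stand-ins for the Refreshable contract, written only for this sketch; they are not Mahout types.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;

/** Hypothetical sketch of refreshing a dependency graph without repeats. */
final class RefreshGraphSketch {
  /** Stand-in for the Refreshable contract. */
  interface Node {
    void refresh(Collection<Node> alreadyRefreshed);
  }

  static final class CountingNode implements Node {
    final List<Node> dependencies = new ArrayList<>();
    int refreshCount;

    @Override
    public void refresh(Collection<Node> alreadyRefreshed) {
      Collection<Node> seen = alreadyRefreshed == null ? new HashSet<Node>() : alreadyRefreshed;
      if (seen.contains(this)) {
        return; // already handled during this refresh pass
      }
      seen.add(this); // mark first, which also guards against cycles
      for (Node dependency : dependencies) {
        dependency.refresh(seen);
      }
      refreshCount++; // a real implementation would rebuild its state here
    }
  }

  public static void main(String[] args) {
    CountingNode shared = new CountingNode();
    CountingNode left = new CountingNode();
    CountingNode right = new CountingNode();
    CountingNode top = new CountingNode();
    left.dependencies.add(shared);
    right.dependencies.add(shared);
    top.dependencies.add(left);
    top.dependencies.add(right);
    top.refresh(null);
    System.out.println(shared.refreshCount); // 1, despite two paths to it
  }
}
```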

hasPreferenceValues

public boolean hasPreferenceValues()
Returns:
true if this implementation actually stores and returns distinct preference values; that is, if it is not a 'boolean' DataModel

getMaxPreference

public float getMaxPreference()
Specified by:
getMaxPreference in interface DataModel
Overrides:
getMaxPreference in class AbstractDataModel
Returns:
the maximum preference value that is possible in the current problem domain being evaluated. For example, if the domain is movie ratings on a scale of 1 to 5, this should be 5. While a Recommender may estimate a preference value above 5.0, it isn't "fair" to consider that the system is actually suggesting an impossible rating of, say, 5.4 stars. In practice the application would cap this estimate to 5.0. Since evaluators evaluate the difference between estimated and actual value, this at least prevents this effect from unfairly penalizing a Recommender.
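
The capping described here amounts to clamping an estimate to the model's range before comparing it with the actual value. The sketch below uses a hypothetical helper class and made-up numbers on the 1 to 5 scale from the example; it is not taken from Mahout's evaluators.

```java
/** Hypothetical sketch of capping an estimate to the allowed preference range. */
final class PreferenceCapSketch {
  /** Clamps an estimated preference to [min, max]. */
  static float cap(float estimate, float min, float max) {
    return Math.max(min, Math.min(max, estimate));
  }

  public static void main(String[] args) {
    float actual = 5.0f;
    float estimate = 5.4f;
    float capped = cap(estimate, 1.0f, 5.0f);
    // Without capping the absolute error would be about 0.4; with capping it is 0.
    System.out.println(Math.abs(capped - actual));
  }
}
```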

getMinPreference

public float getMinPreference()
Specified by:
getMinPreference in interface DataModel
Overrides:
getMinPreference in class AbstractDataModel
See Also:
DataModel.getMaxPreference()

toString

public String toString()
Overrides:
toString in class Object


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.