[LibRec] 1 Importing the project and running a test class
1 There are two ways to run an algorithm in LibRec: specifying configuration properties in code, or reading a configuration file.
1.1 Specifying configuration properties in code
1.1.1 Official example code:
public class MainTest {
    public static void main(String[] args) throws Exception {
        // build data model
        Configuration conf = new Configuration();
        conf.set("dfs.data.dir", "D:\\syt\\librec\\data");
        // load the training data
        TextDataModel dataModel = new TextDataModel(conf);
        // build the data model
        dataModel.buildDataModel();
        // build recommender context
        RecommenderContext context = new RecommenderContext(conf, dataModel);
        // build similarity
        conf.set("rec.recommender.similarity.key", "item");
        conf.setBoolean("rec.recommender.isranking", true);
        conf.setInt("rec.similarity.shrinkage", 10);
        RecommenderSimilarity similarity = new CosineSimilarity();
        similarity.buildSimilarityMatrix(dataModel);
        context.setSimilarity(similarity);
        // build recommender
        conf.set("rec.neighbors.knn.number", "200");
        Recommender recommender = new ItemKNNRecommender();
        recommender.setContext(context);
        // run recommender algorithm
        recommender.train(context);
        // evaluate the recommended result
        EvalContext evalContext = new EvalContext(conf, recommender, dataModel.getTestDataSet(), context.getSimilarity().getSimilarityMatrix(), context.getSimilarities());
        RecommenderEvaluator ndcgEvaluator = new NormalizedDCGEvaluator();
        ndcgEvaluator.setTopN(10);
        double ndcgValue = ndcgEvaluator.evaluate(evalContext);
        System.out.println("ndcg:" + ndcgValue);
    }
}
1.1.2 Running the MainTest class fails with an error
<1>

The error occurs while loading the data file. The path in the error message is the value of 【dfs.data.dir】 plus movielens/ml-100k/ratings.txt, and the trailing part is appended automatically by the project.
A global search shows that this trailing path is the value of the 【data.input.path】 property in the configuration file librec-default.properties.
Looking at the model-loading code:
TextDataModel dataModel = new TextDataModel(conf);
The TextDataModel class has a buildConvert method,

data file path = 【dfs.data.dir】 + 【data.input.path】
So when is buildConvert actually called?
buildConvert is invoked from inside buildDataModel.
【Solution:】
Do not simply comment out 【data.input.path】; if the property cannot be read, a NullPointerException is thrown.
It is enough to make sure that 【dfs.data.dir】 + 【data.input.path】 points at where your data actually lives.
【Result:】 Once the data path is fixed, the test class runs normally.
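As a sanity check, the path rule above can be reproduced with a few lines of plain Java. DataPathDemo and the sample values are illustrative, not LibRec code:

```java
import java.nio.file.Paths;

// Hypothetical sketch (not LibRec's actual resolution code): the data file
// path is the concatenation of dfs.data.dir and data.input.path.
public class DataPathDemo {
    static String resolve(String dfsDataDir, String dataInputPath) {
        // LibRec joins the two properties to locate the rating file
        return Paths.get(dfsDataDir, dataInputPath).toString();
    }

    public static void main(String[] args) {
        String dir = "/tmp/librec/data";                  // dfs.data.dir
        String input = "movielens/ml-100k/ratings.txt";   // data.input.path
        System.out.println(resolve(dir, input));
    }
}
```

If the joined path does not exist on disk, you get exactly the loading error described above.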
1.1.3 Debug walkthrough of the test class
<1> The data-model construction method 【dataModel.buildDataModel()】
@Override
public void buildDataModel() throws LibrecException {
    context = new DataContext(conf);
    if (!conf.getBoolean("data.convert.read.ready")) {
        buildConvert();
        LOG.info("Transform data to Convertor successfully!");
        conf.setBoolean("data.convert.read.ready", true);
    }
    buildSplitter();
    if (StringUtils.isNotBlank(conf.get("data.appender.class")) && !conf.getBoolean("data.appender.read.ready")) {
        buildFeature();
        LOG.info("Transform data to Feature successfully!");
        conf.setBoolean("data.appender.read.ready", true);
    }
    LOG.info("Split data to train Set and test Set successfully!");
    if (trainDataSet != null && trainDataSet.size() > 0 && testDataSet != null && testDataSet.size() > 0) {
        LOG.info("Data cardinality of training is " + trainDataSet.size());
        LOG.info("Data cardinality of testing is " + testDataSet.size());
    }
}
(1) buildConvert: reading the dataset
This method has three implementations. Currently there are TextDataModel (the comments say CSV is supported; in practice txt works as well) and ArffDataModel (use the ARFF format when there are more than four data columns), plus JDBCDataModel (not yet implemented; judging by the name it should support reading data from a database). Analyzing the buildConvert method of TextDataModel:

【Process:】
a. First obtain the data file path: 【dfs.data.dir】 (dataset directory) + 【data.input.path】 (dataset name);
b. Obtain dataColumnFormat, which specifies the layout of the data file:
UIR: user-item-rating, i.e. the most common user-item-rating matrix
UIRT: user-item-rating-datetime, i.e. a user-item-rating-timestamp matrix
c. Build the Convertor and read the data. The readData method of TextDataConvertor:
private void readData(String... inputDataPath) throws IOException {
    LOG.info(String.format("Dataset: %s", Arrays.toString(inputDataPath)));
    matrix = new DataFrame();
    if (Objects.isNull(header)) {
        if (dataColumnFormat.toLowerCase().equals("uirt")) {
            header = new String[]{"user", "item", "rating", "datetime"};
            attr = new String[]{"STRING", "STRING", "NUMERIC", "DATE"};
        } else {
            header = new String[]{"user", "item", "rating"};
            attr = new String[]{"STRING", "STRING", "NUMERIC"};
        }
    }
    matrix.setAttrType(attr);
    matrix.setHeader(header);
    List<File> files = new ArrayList<>();
    SimpleFileVisitor<Path> finder = new SimpleFileVisitor<Path>() {
        @Override
        public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
            files.add(file.toFile());
            return super.visitFile(file, attrs);
        }
    };
    for (String path : inputDataPath) {
        Files.walkFileTree(Paths.get(path.trim()), finder);
    }
    int numFiles = files.size();
    int cur = 0;
    Pattern pattern = Pattern.compile(sep);
    for (File file : files) {
        try (Source fileSource = Okio.source(file);
             BufferedSource bufferedSource = Okio.buffer(fileSource)) {
            String temp;
            while ((temp = bufferedSource.readUtf8Line()) != null) {
                if ("".equals(temp.trim())) {
                    break;
                }
                String[] eachRow = pattern.split(temp);
                for (int i = 0; i < header.length; i++) {
                    if (Objects.equals(attr[i], "STRING")) {
                        DataFrame.setId(eachRow[i], matrix.getHeader(i));
                    }
                }
                matrix.add(eachRow);
            }
            LOG.info(String.format("DataSet: %s is finished", StringUtil.last(file.toString(), 38)));
            cur++;
            fileRate = cur / numFiles;
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    List<Double> ratingScale = matrix.getRatingScale();
    if (ratingScale != null) {
        LOG.info(String.format("rating Scale: %s", ratingScale.toString()));
    }
    LOG.info(String.format("user number: %d,\t item number is: %d", matrix.numUsers(), matrix.numItems()));
}
Here getRatingScale() collects the distinct rating values occurring in user-item-rating and sorts them in ascending order.
BiMap comes from Guava; it is a data structure with a bidirectional key-value association, which requires values to be unique as well as keys.
What does UserMappingData mean?
The getUserMappingData() method returns a BiMap<String, Integer> that records, for each UserId, its position of first appearance in the (de-duplicated) dataset.
What does ItemMappingData mean?
The getItemMappingData() method returns a BiMap<String, Integer> that records, for each ItemId, its position of first appearance in the (de-duplicated) dataset.
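The semantics of these mappings and of the rating scale can be sketched with a few lines of plain Java (a HashMap stands in for Guava's BiMap; MappingDemo and its sample ids are illustrative):

```java
import java.util.*;

// Sketch of what getUserMappingData()/getItemMappingData() and
// getRatingScale() compute: ids are numbered by order of first appearance
// after de-duplication; the rating scale is the sorted set of distinct ratings.
public class MappingDemo {
    static Map<String, Integer> mappingByFirstAppearance(List<String> ids) {
        Map<String, Integer> mapping = new LinkedHashMap<>();
        for (String id : ids) {
            // index = current size, i.e. how many distinct ids came before
            mapping.putIfAbsent(id, mapping.size());
        }
        return mapping;
    }

    static List<Double> ratingScale(List<Double> ratings) {
        // TreeSet de-duplicates and sorts ascending in one step
        return new ArrayList<>(new TreeSet<>(ratings));
    }

    public static void main(String[] args) {
        Map<String, Integer> users = mappingByFirstAppearance(
                Arrays.asList("u3", "u1", "u3", "u2"));
        System.out.println(users);                                // {u3=0, u1=1, u2=2}
        System.out.println(ratingScale(Arrays.asList(4.0, 1.0, 4.0, 3.0)));
    }
}
```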
(2) buildSplitter: splitting the data into a training set and a test set
All three DataModel implementations currently use the buildSplitter method of the parent class AbstractDataModel:
protected void buildSplitter() throws LibrecException {
    String splitter = conf.get("data.model.splitter");
    try {
        if (dataSplitter == null) {
            dataSplitter = (DataSplitter) ReflectionUtil.newInstance(DriverClassUtil.getClass(splitter), conf);
        }
        if (dataSplitter != null) {
            dataSplitter.setDataConvertor(dataConvertor);
            dataSplitter.splitData();
            trainDataSet = dataSplitter.getTrainData();
            testDataSet = dataSplitter.getTestData();
        }
    } catch (ClassNotFoundException e) {
        throw new LibrecException(e);
    }
}
A reflection utility instantiates one of the 5 Splitter classes, depending on the property value.
The 【data.model.splitter】 property has 5 possible values; see DataModel.md under doc/wiki in the project for details.
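The reflection dispatch can be sketched as follows. The driver table here maps to a JDK class purely for illustration; in LibRec the lookup is done by DriverClassUtil against its own driver table:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of reflection-based dispatch: a config value is looked up in a
// driver table and the named class is instantiated. The mapping below is an
// illustrative stand-in, not LibRec's real table.
public class ReflectionDemo {
    static final Map<String, String> DRIVERS = new HashMap<>();
    static {
        DRIVERS.put("ratio", "java.util.ArrayList"); // illustrative stand-in
    }

    static Object newInstanceFor(String key) {
        try {
            Class<?> clazz = Class.forName(DRIVERS.get(key));
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(newInstanceFor("ratio").getClass().getName());
    }
}
```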
a. ratio
Splits the dataset by a given ratio.
data.model.splitter=ratio
data.splitter.ratio=rating # by rating
data.splitter.trainset.ratio=0.8 # resting data used as test set
b. loocv
Leave-one-out: holds out one random rating, or the last rating, per user or item as test data; the rest is training data.
data.model.splitter=loocv
data.splitter.loocv=user
c. givenn
Holds out n ratings per user or item as test data; the rest is training data.
data.model.splitter=givenn
data.splitter.givenn=user
data.splitter.givenn.n=10
d. kcv
K-fold cross validation: the dataset is partitioned into K folds; each fold serves once as the test set while the remaining data is used for training, for K runs in total. An evaluation is performed after each fold, and an aggregate evaluation over all K folds is performed at the end.
data.model.splitter=kcv
data.splitter.cv.number=5 # K-fold
e. testset
Reserves part of the dataset as the test set. This requires setting the property data.testset.path, i.e. the path of the held-out test data. Unlike version 2.0, the training set and the test set must be specified separately. Multiple paths within one property can be separated by colons. See the configuration file of the constantGuess recommender for a concrete example.
data.model.splitter=testset
data.input.path=filmtrust/rating/ratings_0.txt:filmtrust/rating/ratings_1.txt
data.testset.path=filmtrust/rating/ratings_2.txt:filmtrust/rating/ratings_3.txt
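The idea behind the "ratio" splitter (option a above) can be sketched in a few lines; RatioSplitDemo is a conceptual illustration, not LibRec's RatioDataSplitter code:

```java
import java.util.*;

// Conceptual sketch of a ratio split: shuffle the ratings, then take the
// first trainRatio share as the training set and the rest as the test set.
public class RatioSplitDemo {
    static List<List<int[]>> split(List<int[]> ratings, double trainRatio, long seed) {
        List<int[]> shuffled = new ArrayList<>(ratings);
        Collections.shuffle(shuffled, new Random(seed)); // seeded for reproducibility
        int cut = (int) (shuffled.size() * trainRatio);
        List<List<int[]>> result = new ArrayList<>();
        result.add(shuffled.subList(0, cut));                 // train set
        result.add(shuffled.subList(cut, shuffled.size()));   // test set
        return result;
    }

    public static void main(String[] args) {
        List<int[]> ratings = new ArrayList<>();
        for (int i = 0; i < 10; i++) ratings.add(new int[]{i, i, 1}); // {user, item, rating}
        List<List<int[]>> sets = split(ratings, 0.8, 42L);
        System.out.println("train=" + sets.get(0).size() + " test=" + sets.get(1).size());
    }
}
```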
(3) buildFeature(): some algorithms bring in a user-user-relation or item-item-relation matrix
protected void buildFeature() throws LibrecException {
    String feature = conf.get("data.appender.class");
    if (StringUtils.isNotBlank(feature)) {
        try {
            dataAppender = (DataAppender) ReflectionUtil.newInstance(DriverClassUtil.getClass(feature), conf);
            dataAppender.setUserMappingData(getUserMappingData());
            dataAppender.setItemMappingData(getItemMappingData());
            dataAppender.processData();
        } catch (ClassNotFoundException e) {
            throw new LibrecException(e);
        } catch (IOException e) {
            throw new LibrecException(e);
        }
    }
}
The 【data.appender.class】 property has 4 possible values:
social, document, auxiliary, location
data.appender.class=social
data.appender.path=directory/to/relationData
<2> Creating the recommender context and computing the similarity matrix
The similarity matrix holds the user-user or item-item distances within the dataset.
// build recommender context
RecommenderContext context = new RecommenderContext(conf, dataModel);
// build similarity
conf.set("rec.recommender.similarity.key" ,"item");
conf.setBoolean("rec.recommender.isranking", true);
conf.setInt("rec.similarity.shrinkage", 10);
RecommenderSimilarity similarity = new CosineSimilarity();
similarity.buildSimilarityMatrix(dataModel);
context.setSimilarity(similarity);
The distance/similarity measures already implemented in LibRec:

【rec.recommender.similarity.key】: possible values are 'social', 'user' and 'item', corresponding to social-, user- and item-based similarity.
【rec.recommender.isranking】: whether this is a ranking task; the flag is not used while building the similarity matrix, but later when the recommender's train method runs.
【rec.similarity.shrinkage】: similarity shrinkage (left unexplained in the original note; shrinkage commonly damps similarities that are estimated from only a few co-rated entries).
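Cosine similarity itself, plus the shrinkage correction as we read it (sim · n/(n + shrinkage), n being the number of co-rated users), can be sketched as follows. This is our interpretation of rec.similarity.shrinkage, not a quote of LibRec's CosineSimilarity code:

```java
// Sketch of item-item cosine similarity with a shrinkage correction.
public class CosineDemo {
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // damp similarities backed by few co-rated entries: sim * n / (n + shrinkage)
    static double shrunk(double sim, int coRatedCount, int shrinkage) {
        return sim * coRatedCount / (coRatedCount + shrinkage);
    }

    public static void main(String[] args) {
        double[] item1 = {5, 3, 0, 1};   // each entry: one user's rating of item1
        double[] item2 = {4, 0, 0, 1};
        double sim = cosine(item1, item2);
        System.out.println("cosine=" + sim + " shrunk=" + shrunk(sim, 2, 10));
    }
}
```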
<3> Creating a Recommender implementation and calling its train method (using ItemKNNRecommender as the example)
(1) Walking through the train method:
a. The train method of AbstractRecommender

b. The call to the setup method
The setup method of AbstractRecommender

The setup method of MatrixRecommender, an AbstractRecommender subclass:
protected void setup() throws LibrecException {
    super.setup();
    trainMatrix = (SequentialAccessSparseMatrix) getDataModel().getTrainDataSet();
    testMatrix = (SequentialAccessSparseMatrix) getDataModel().getTestDataSet();
    validMatrix = (SequentialAccessSparseMatrix) getDataModel().getValidDataSet();
    numUsers = trainMatrix.rowSize();
    numItems = trainMatrix.columnSize();
    numRates = trainMatrix.size();
    Set<Double> ratingSet = new HashSet<>();
    for (MatrixEntry matrixEntry : trainMatrix) {
        ratingSet.add(matrixEntry.get());
    }
    ratingScale = new ArrayList<>(ratingSet);
    Collections.sort(ratingScale);
    maxRate = Collections.max(ratingScale);
    minRate = Collections.min(ratingScale);
    if (minRate == maxRate) {
        minRate = 0;
    }
    globalMean = trainMatrix.mean();
    int[] numDroppedItemsArray = new int[numUsers]; // for AUCEvaluator
    int maxNumTestItemsByUser = 0; // for idcg
    for (int userIdx = 0; userIdx < numUsers; ++userIdx) {
        numDroppedItemsArray[userIdx] = numItems - trainMatrix.row(userIdx).getNumEntries();
        int numTestItemsByUser = testMatrix.row(userIdx).getNumEntries();
        maxNumTestItemsByUser = maxNumTestItemsByUser < numTestItemsByUser ? numTestItemsByUser : maxNumTestItemsByUser;
    }
    int[] itemPurchasedCount = new int[numItems]; // for NoveltyEvaluator
    for (int itemIdx = 0; itemIdx < numItems; ++itemIdx) {
        itemPurchasedCount[itemIdx] = trainMatrix.column(itemIdx).getNumEntries()
                + testMatrix.column(itemIdx).getNumEntries();
    }
    conf.setInts("rec.eval.auc.dropped.num", numDroppedItemsArray);
    conf.setInt("rec.eval.key.test.max.num", maxNumTestItemsByUser); // for nDCGEvaluator
    conf.setInt("rec.eval.item.num", testMatrix.columnSize()); // for EntropyEvaluator
    conf.setInts("rec.eval.item.purchase.num", itemPurchasedCount); // for NoveltyEvaluator
}
The setup method of ItemKNNRecommender, a MatrixRecommender subclass:

So the setup() call inside AbstractRecommender actually dispatches to the subclass ItemKNNRecommender's setup method; and because each override calls super.setup(), every parent-class implementation is executed as well.
c. The call to the trainModel method
trainModel is abstract in AbstractRecommender, so the call goes straight to the trainModel() implementation in the subclass ItemKNNRecommender.
d. The call to the cleanup method
cleanup in AbstractRecommender is empty, and no subclass overrides it.
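This dispatch pattern (a template method calling setup(), which resolves to the subclass override and then chains back up via super.setup()) can be sketched with stand-in classes; the names below are illustrative substitutes for AbstractRecommender/MatrixRecommender/ItemKNNRecommender:

```java
// Sketch of the template-method dispatch: train() calls setup(), dynamic
// dispatch picks the most-derived override, and super.setup() calls walk
// back up the hierarchy, so every level's setup runs.
public class SetupChainDemo {
    static StringBuilder trace = new StringBuilder();

    static abstract class AbstractRec {
        void train() { setup(); }                     // template method
        void setup() { trace.append("abstract;"); }
    }

    static class MatrixRec extends AbstractRec {
        @Override void setup() { super.setup(); trace.append("matrix;"); }
    }

    static class ItemKnnRec extends MatrixRec {
        @Override void setup() { super.setup(); trace.append("itemknn;"); }
    }

    public static void main(String[] args) {
        new ItemKnnRec().train();
        System.out.println(trace); // abstract;matrix;itemknn;
    }
}
```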
<4> Evaluating the recommendation result
// evaluate the recommended result
EvalContext evalContext = new EvalContext(conf, recommender, dataModel.getTestDataSet(), context.getSimilarity().getSimilarityMatrix(), context.getSimilarities());
RecommenderEvaluator ndcgEvaluator = new NormalizedDCGEvaluator();
ndcgEvaluator.setTopN(10);
double ndcgValue = ndcgEvaluator.evaluate(evalContext);
System.out.println("ndcg:" + ndcgValue);
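The metric computed above can be worked through by hand. A minimal nDCG@N sketch, assuming binary relevance for simplicity (this is the standard formula, not NormalizedDCGEvaluator's code; NdcgDemo and the sample items are illustrative):

```java
import java.util.*;

// nDCG@N: DCG sums 1/log2(rank+1) over the relevant items in the ranked
// top-N list; IDCG is the DCG of an ideal ranking; nDCG = DCG / IDCG.
public class NdcgDemo {
    static double ndcgAtN(List<String> ranked, Set<String> relevant, int n) {
        double dcg = 0;
        for (int i = 0; i < Math.min(n, ranked.size()); i++) {
            if (relevant.contains(ranked.get(i))) {
                dcg += 1.0 / (Math.log(i + 2) / Math.log(2)); // 1/log2(rank+1)
            }
        }
        double idcg = 0;
        int ideal = Math.min(n, relevant.size());
        for (int i = 0; i < ideal; i++) {
            idcg += 1.0 / (Math.log(i + 2) / Math.log(2));
        }
        return idcg == 0 ? 0 : dcg / idcg;
    }

    public static void main(String[] args) {
        List<String> ranked = Arrays.asList("i3", "i1", "i7"); // recommended order
        Set<String> relevant = new HashSet<>(Arrays.asList("i1", "i7")); // test-set hits
        System.out.println("ndcg@3 = " + ndcgAtN(ranked, relevant, 3));
    }
}
```

A perfect ranking scores 1.0; pushing the relevant items further down the list lowers the score.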