scikit-surprise Recommendation Module Usage Guide
Contents
1. Introduction
2. Algorithms
3. Datasets
3.1 Three built-in datasets are available:
3.2 Load a dataset from a pandas dataframe.
3.3 Load a dataset from a (custom) file.
3.4 Load a dataset where folds (for cross-validation) are predefined by some files.
4. Prediction
4.1 SVD & load_builtin("ml-100k")
4.2 KNNBasic & load_builtin("ml-100k")
4.3 BaselineOnly & custom dataset
5. Accuracy evaluation
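1. Introduction
Surprise provides a set of built-in recommendation algorithms, together with datasets for practicing on them.
Reference: "The model_selection package", Surprise 1 documentation
Installation: pip install scikit-surprise -i https://pypi.org/simple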
2. Algorithms
The available prediction algorithms are:
| Algorithm | Description |
| --- | --- |
| random_pred.NormalPredictor | Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal. |
| baseline_only.BaselineOnly | Algorithm predicting the baseline estimate for given user and item. |
| knns.KNNBasic | A basic collaborative filtering algorithm. |
| knns.KNNWithMeans | A basic collaborative filtering algorithm, taking into account the mean ratings of each user. |
| knns.KNNWithZScore | A basic collaborative filtering algorithm, taking into account the z-score normalization of each user. |
| knns.KNNBaseline | A basic collaborative filtering algorithm taking into account a baseline rating. |
| matrix_factorization.SVD | The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize. |
| matrix_factorization.SVDpp | The SVD++ algorithm, an extension of SVD taking into account implicit ratings. |
| matrix_factorization.NMF | A collaborative filtering algorithm based on Non-negative Matrix Factorization. |
| slope_one.SlopeOne | A simple yet accurate collaborative filtering algorithm. |
| co_clustering.CoClustering | A collaborative filtering algorithm based on co-clustering. |
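As a rough illustration of the simplest entry in the table, the sketch below mimics the idea behind NormalPredictor in plain Python: estimate a mean and a standard deviation from the training ratings, then draw predictions from that normal distribution. The function names, the sample ratings, and the clipping to a 1 to 5 scale are assumptions made for this example; this is not Surprise's implementation.

```python
import math
import random


def fit_normal(ratings):
    """Return the mean and the population standard deviation of the ratings."""
    mu = sum(ratings) / len(ratings)
    sigma = math.sqrt(sum((r - mu) ** 2 for r in ratings) / len(ratings))
    return mu, sigma


def predict_random(mu, sigma, rng, low=1.0, high=5.0):
    """Draw one rating from N(mu, sigma^2) and clip it to [low, high]."""
    return min(high, max(low, rng.gauss(mu, sigma)))


train_ratings = [3, 2, 4, 3, 1]  # hypothetical training ratings
mu, sigma = fit_normal(train_ratings)
rng = random.Random(0)  # seeded so that the run is repeatable
samples = [predict_random(mu, sigma, rng) for _ in range(5)]
print(f"mu={mu:.2f}, sigma={sigma:.2f}")
print([round(s, 2) for s in samples])
```

Because the prediction ignores the user and the item entirely, a predictor of this kind is mostly useful as a floor against which the other algorithms can be compared.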
3. Datasets
3.1 Three built-in datasets are available:
-  The movielens-100k dataset. 
-  The movielens-1m dataset. 
-  The Jester dataset 2. 
Built-in datasets can all be loaded (or downloaded if you haven't already) using the Dataset.load_builtin() method. Summary:
-  Dataset.load_builtin: Load a built-in dataset.
classmethod:
load_builtin(name='ml-100k', prompt=True)
Example:
from surprise import Dataset
from surprise.model_selection import train_test_split
# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin("ml-100k")
# Sample a random trainset and testset;
# the test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=0.25)
 
3.2 Load a dataset from a pandas dataframe.
You can also use a custom dataset that is stored in a pandas dataframe.
classmethod:
load_from_df(df, reader)
Example:
import pandas as pd
from surprise import Dataset, NormalPredictor, Reader
from surprise.model_selection import cross_validate
# Creation of the dataframe. Column names are irrelevant.
ratings_dict = {
    "itemID": [1, 1, 1, 2, 2],
    "userID": [9, 32, 2, 45, "user_foo"],
    "rating": [3, 2, 4, 3, 1],
}
df = pd.DataFrame(ratings_dict)
# A reader is still needed but only the rating_scale param is required.
reader = Reader(rating_scale=(1, 5))
# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(df[["userID", "itemID", "rating"]], reader)
# The loaded data can now be used like any other dataset, e.g. 2-fold cross-validation.
cross_validate(NormalPredictor(), data, cv=2)
 
3.3 Load a dataset from a (custom) file.
classmethod:
load_from_file(file_path, reader)
Use this if you want to use a custom dataset and all of the ratings are stored in one file. The dataset is not split when it is loaded; you will have to split it yourself, for example with train_test_split or cross_validate from surprise.model_selection.
Parameters:
-  file_path (string) – The path to the file containing ratings.
-  reader (Reader) – A reader to read the file.
Example:
import os
from surprise import BaselineOnly, Dataset, Reader
from surprise.model_selection import cross_validate
# Path to the dataset file.
file_path = os.path.expanduser("~/.surprise_data/ml-100k/ml-100k/u.data")
# As we're loading a custom dataset, we need to define a reader. In the
# movielens-100k dataset, each line has the following format:
# 'user item rating timestamp', separated by '\t' characters.
reader = Reader(line_format="user item rating timestamp", sep="\t")
data = Dataset.load_from_file(file_path, reader=reader)
# The loaded data can now be used, e.g. in a cross-validation run.
cross_validate(BaselineOnly(), data, verbose=True)
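To make the role of line_format and sep concrete, here is a small plain-Python sketch of how one line in the u.data layout maps onto the named fields. It only imitates what a reader conceptually does; parse_line is a hypothetical helper written for this example and is not part of Surprise.

```python
def parse_line(line, line_format="user item rating timestamp", sep="\t"):
    """Split one ratings line and label its fields by position."""
    names = line_format.split()
    values = line.strip().split(sep)
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} fields, got {len(values)}")
    record = dict(zip(names, values))
    # Ids are kept as strings (raw ids); only the rating becomes a number.
    record["rating"] = float(record["rating"])
    return record


sample = "196\t242\t3\t881250949\n"  # a sample line in the u.data layout
record = parse_line(sample)
print(record)
```

The point to take away is that user and item ids read from a file are strings, which matters later when calling predict with raw ids.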
3.4 Load a dataset where folds (for cross-validation) are predefined by some files.
classmethod:
load_from_folds(folds_files, reader)
The purpose of this method is to cover a common use case where a dataset is already split into predefined folds, such as the movielens-100k dataset which defines files u1.base, u1.test, u2.base, u2.test, etc… It can also be used when you don’t want to perform cross-validation but still want to specify your training and testing data (which comes down to 1-fold cross-validation anyway).
Parameters:
-  folds_files (iterable of tuples) – The list of the folds. A fold is a tuple of the form (path_to_train_file, path_to_test_file).
-  reader (Reader) – A reader to read the files. 
class surprise.dataset.DatasetAutoFolds(ratings_file=None, reader=None, df=None)
A derived class from Dataset for which folds (for cross-validation) are not predefined. (Or for when there are no folds at all).
build_full_trainset()
Do not split the dataset into folds and just return a trainset as is, built from the whole dataset. The user can then fit an algorithm on it and query for predictions.
4. Prediction
4.1 SVD & load_builtin("ml-100k")
from surprise import accuracy, Dataset, SVD
from surprise.model_selection import train_test_split
# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin("ml-100k")
# Sample a random trainset and testset;
# the test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=0.25)
# We'll use the famous SVD algorithm.
algo = SVD()
# Train the algorithm on the trainset, and predict ratings for the testset.
algo.fit(trainset)
predictions = algo.test(testset)  # test() takes a whole testset
accuracy.rmse(predictions)  # accuracy evaluation
# predict() scores a single sample. Raw ids in ml-100k are strings;
# the ids and the true rating r_ui below are illustrative values only.
uid, iid = "196", "302"
pred = algo.predict(uid, iid, r_ui=4, verbose=True)
4.2 KNNBasic & load_builtin("ml-100k")
from surprise import Dataset, KNNBasic
# Load the movielens-100k dataset.
data = Dataset.load_builtin("ml-100k")
# Retrieve the trainset (built from the whole dataset).
trainset = data.build_full_trainset()
# Build an algorithm, and train it.
algo = KNNBasic()
algo.fit(trainset)
# Predictions can then be requested, for example:
# algo.test(trainset.build_testset())
# algo.predict(uid, iid)
4.3 BaselineOnly & custom dataset
import os
from surprise import BaselineOnly, Dataset, Reader
from surprise.model_selection import train_test_split
# Path to the dataset file.
file_path = os.path.expanduser("~/.surprise_data/ml-100k/ml-100k/u.data")
# As we're loading a custom dataset, we need to define a reader. In the
# movielens-100k dataset, each line has the following format:
# 'user item rating timestamp', separated by '\t' characters.
reader = Reader(line_format="user item rating timestamp", sep="\t")
data = Dataset.load_from_file(file_path, reader=reader)
trainset, testset = train_test_split(data, test_size=0.25)
algo = BaselineOnly()
# fit() returns the algorithm itself, so the two calls can be chained.
predictions = algo.fit(trainset).test(testset)
# algo.predict(uid, iid)  # single sample
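To show what a baseline estimate looks like, the sketch below computes b_ui = mu + b_u + b_i on a toy dataset with a few rounds of regularized alternating updates. It is a simplified illustration, not Surprise's code: the data, the helper names fit_baselines and baseline, and the regularization constants and epoch count are assumptions chosen for the example.

```python
from collections import defaultdict


def fit_baselines(ratings, reg_u=15.0, reg_i=10.0, n_epochs=10):
    """Estimate the global mean and user/item biases from (user, item, rating) triples."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    b_u = defaultdict(float)
    b_i = defaultdict(float)
    for _ in range(n_epochs):
        # Update item biases with user biases held fixed, then the reverse.
        for i, rated in by_item.items():
            b_i[i] = sum(r - mu - b_u[u] for u, r in rated) / (reg_i + len(rated))
        for u, rated in by_user.items():
            b_u[u] = sum(r - mu - b_i[i] for i, r in rated) / (reg_u + len(rated))
    return mu, b_u, b_i


def baseline(mu, b_u, b_i, user, item):
    """Baseline estimate; an unknown user or item contributes a zero bias."""
    return mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)


# Hypothetical (user, item, rating) triples.
toy = [("u1", "i1", 4), ("u1", "i2", 3), ("u2", "i1", 5),
       ("u2", "i3", 2), ("u3", "i2", 4), ("u3", "i3", 1)]
mu, b_u, b_i = fit_baselines(toy)
est = baseline(mu, b_u, b_i, "u1", "i3")
print(round(mu, 3), round(est, 3))
```

The regularization terms shrink the biases of users and items with few ratings toward zero, so sparse entries fall back toward the global mean.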
5. Accuracy evaluation
Available accuracy metrics:
| Metric | Description |
| --- | --- |
| rmse | Compute RMSE (Root Mean Squared Error). |
| mse | Compute MSE (Mean Squared Error). |
| mae | Compute MAE (Mean Absolute Error). |
| fcp | Compute FCP (Fraction of Concordant Pairs). |
# predictions is the list returned by algo.test(testset) in section 4.
accuracy.rmse(predictions, verbose=True)  # accuracy evaluation (RMSE)
accuracy.mae(predictions, verbose=True)
accuracy.mse(predictions, verbose=True)
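For reference, the three error metrics called above reduce to a few lines of arithmetic. The sketch below computes them in plain Python from hypothetical (true rating, estimated rating) pairs; it does not use Surprise, and the numbers are made up.

```python
import math

# Hypothetical (true rating, estimated rating) pairs.
pairs = [(4.0, 3.5), (3.0, 3.0), (5.0, 4.0), (2.0, 3.0)]

errors = [true - est for true, est in pairs]
mse = sum(e ** 2 for e in errors) / len(errors)  # Mean Squared Error
rmse = math.sqrt(mse)  # Root Mean Squared Error
mae = sum(abs(e) for e in errors) / len(errors)  # Mean Absolute Error

print(f"MSE={mse:.4f} RMSE={rmse:.4f} MAE={mae:.4f}")
# prints: MSE=0.5625 RMSE=0.7500 MAE=0.6250
```

RMSE penalizes large individual errors more heavily than MAE, which is why the two can rank algorithms differently on the same predictions.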
