Machine Learning in DuckDB with the mlpack Extension
First, install the mlpack extension:
D load httpfs;
D INSTALL mlpack FROM community;
100% ▕██████████████████████████████████████▏ (00:00:06.43 elapsed)
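The Iris dataset is one of the classic introductory datasets in machine learning.
It covers three iris species (Setosa, Versicolor, Virginica) and records four features for each flower: sepal length, sepal width, petal length, and petal width.
Our task is to predict the species from these features.
The upstream example script contains one error: the function mlpack_adaboost_train is written as mlpack_adaboost. This has been corrected in the SQL script below.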
load httpfs;
load mlpack;
.timer on
-- Perform AdaBoost (using weak learner 'Perceptron' by default)
-- Read 'features' into 'X', 'labels' into 'Y', use optional parameters
-- from 'Z', and prepare model storage in 'M'
CREATE TABLE X AS SELECT * FROM read_csv("https://eddelbuettel.github.io/duckdb-mlpack/data/iris.csv");
CREATE TABLE Y AS SELECT * FROM read_csv("https://eddelbuettel.github.io/duckdb-mlpack/data/iris_labels.csv");
CREATE TABLE Z (name VARCHAR, value VARCHAR);
INSERT INTO Z VALUES ('iterations', '50'), ('tolerance', '1e-7');
CREATE TABLE M (key VARCHAR, json VARCHAR);
-- Train model for 'Y' on 'X' using parameters 'Z', store in 'M'
CREATE TEMP TABLE A AS SELECT * FROM mlpack_adaboost_train("X", "Y", "Z", "M");
-- Count by predicted group
SELECT COUNT(*) as n, predicted FROM A GROUP BY predicted;
-- Model 'M' can be used to predict
CREATE TABLE N (x1 DOUBLE, x2 DOUBLE, x3 DOUBLE, x4 DOUBLE);
-- inserting approximate column mean values
INSERT INTO N VALUES (5.843, 3.054, 3.759, 1.199);
-- inserting approximate column mean values, min values, max values
INSERT INTO N VALUES (5.843, 3.054, 3.759, 1.199), (4.3, 2.0, 1.0, 0.1), (7.9, 4.4, 6.9, 2.5);
-- and this predicts one class per input row
SELECT * FROM mlpack_adaboost_pred("N", "M");
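The SQL script hands the actual training to mlpack's AdaBoost. As a rough illustration of what boosting does, here is a minimal Python sketch. It is not mlpack's implementation: it swaps the Perceptron weak learner for one-dimensional threshold stumps, handles only two classes labelled -1 and +1 instead of three, and runs on a made-up six-point dataset. All function names are hypothetical.

```python
import math


def stump_predict(x, threshold, polarity):
    """Weak learner: answer `polarity` below the threshold, `-polarity` otherwise."""
    return polarity if x < threshold else -polarity


def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the lowest-error stump."""
    best = None
    for threshold in sorted(set(xs)) + [max(xs) + 1.0]:
        for polarity in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost_train(xs, ys, iterations):
    """Binary AdaBoost; returns a list of (alpha, threshold, polarity)."""
    weights = [1.0 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(iterations):
        error, threshold, polarity = best_stump(xs, ys, weights)
        if error >= 0.5:  # weak learner is no better than chance: stop
            break
        error = max(error, 1e-12)  # avoid division by zero for a perfect stump
        alpha = 0.5 * math.log((1.0 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Raise the weight of misclassified points, lower the rest, renormalize.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical toy data: no single threshold separates the two classes.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost_train(xs, ys, iterations=3)
print([adaboost_predict(ensemble, x) for x in xs])
```

With a single round the sketch can only predict one class for every point. After three rounds the weighted vote of the three stumps reproduces all six labels, which is the effect the 'iterations' parameter controls in the SQL script.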
Running the SQL script in the DuckDB CLI produces the following output:
root@66d4e20ec1d7:/par# ./duckdb141 mlpack
DuckDB v1.4.1 (Andium) b390a7c376
Enter ".help" for usage hints.
D .read ml.txt
Run Time (s): real 1.646 user 0.012000 sys 0.004000
Run Time (s): real 2.675 user 0.008000 sys 0.004000
Run Time (s): real 0.042 user 0.000000 sys 0.000000
Run Time (s): real 0.042 user 0.000000 sys 0.000000
Run Time (s): real 0.041 user 0.000000 sys 0.000000
Misclassified: 1
Run Time (s): real 0.118 user 0.192000 sys 0.000000
┌───────┬───────────┐
│   n   │ predicted │
│ int64 │   int32   │
├───────┼───────────┤
│    50 │         0 │
│    49 │         1 │
│    51 │         2 │
└───────┴───────────┘
Run Time (s): real 0.001 user 0.000000 sys 0.000000
Run Time (s): real 0.040 user 0.000000 sys 0.000000
Run Time (s): real 0.042 user 0.004000 sys 0.000000
Run Time (s): real 0.041 user 0.000000 sys 0.000000
┌───────────┐
│ predicted │
│   int32   │
├───────────┤
│         1 │
│         1 │
│         0 │
│         2 │
└───────────┘
Run Time (s): real 0.003 user 0.004000 sys 0.000000
Inspect the data in each table:
D from x;
┌─────────┬─────────┬─────────┬─────────┐
│ column0 │ column1 │ column2 │ column3 │
│ double  │ double  │ double  │ double  │
├─────────┼─────────┼─────────┼─────────┤
│     5.1 │     3.5 │     1.4 │     0.2 │
│    ·    │    ·    │    ·    │    ·    │
│     5.9 │     3.0 │     5.1 │     1.8 │
├─────────┴─────────┴─────────┴─────────┤
│ 150 rows (40 shown)         4 columns │
└───────────────────────────────────────┘
Run Time (s): real 0.146 user 0.016000 sys 0.000000
D from y;
┌────────────┐
│  column0   │
│   int64    │
├────────────┤
│          0 │
│     ·      │
│          2 │
├────────────┤
│  150 rows  │
│ (40 shown) │
└────────────┘
Run Time (s): real 0.001 user 0.000000 sys 0.000000
D from z;
┌────────────┬─────────┐
│    name    │  value  │
│  varchar   │ varchar │
├────────────┼─────────┤
│ iterations │ 50      │
│ tolerance  │ 1e-7    │
└────────────┴─────────┘
Run Time (s): real 0.001 user 0.000000 sys 0.000000
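Both columns of Z are VARCHAR, so numeric options such as 'iterations' and 'tolerance' travel as text and have to be converted by whoever consumes them. How the extension does this internally is not shown in this post. The following Python sketch only illustrates the idea, and `parse_params` is a hypothetical helper.

```python
def parse_params(rows):
    """Turn (name, value) string pairs into a dict of typed numbers.

    Integers are tried first, then floats. This is an illustrative
    assumption, not the extension's actual conversion logic.
    """
    typed = {}
    for name, value in rows:
        try:
            typed[name] = int(value)
        except ValueError:
            typed[name] = float(value)
    return typed


# The two rows stored in table Z above.
params = parse_params([("iterations", "50"), ("tolerance", "1e-7")])
print(params)  # {'iterations': 50, 'tolerance': 1e-07}
```

A name/value table keeps the SQL signature of the training function stable when new options are added, at the cost of losing type checking on the SQL side.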
D from m;
┌─────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ key │ json │
│ varchar │ varchar │
├─────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ model │ {\n "x": {\n "cereal_class_version": 1,\n "numClasses": 3,\n "tolerance": 1e-7,\n "maxIterations": 50,\n "alpha": [\n 1.68364… │
└─────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Run Time (s): real 0.001 user 0.000000 sys 0.000000
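The 'json' column holds the serialized model as text, and the visible part of it echoes the parameters from Z (maxIterations 50, tolerance 1e-7) together with the number of classes. A minimal Python sketch of reading those fields follows. The string below is a hand-made excerpt containing only the fields that are fully visible in the truncated output above. It is not the real model, and `model_settings` is a hypothetical helper.

```python
import json

# Hand-made excerpt: only fields fully visible in the truncated cell above.
# The real model contains more (for example the "alpha" array, cut off above).
EXCERPT = (
    '{"x": {"cereal_class_version": 1, "numClasses": 3, '
    '"tolerance": 1e-7, "maxIterations": 50}}'
)


def model_settings(json_text):
    """Return the training settings stored under the top-level "x" key."""
    body = json.loads(json_text)["x"]
    return {
        "numClasses": body["numClasses"],
        "tolerance": body["tolerance"],
        "maxIterations": body["maxIterations"],
    }


print(model_settings(EXCERPT))
```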
D from n;
┌────────┬────────┬────────┬────────┐
│   x1   │   x2   │   x3   │   x4   │
│ double │ double │ double │ double │
├────────┼────────┼────────┼────────┤
│  5.843 │  3.054 │  3.759 │  1.199 │
│  5.843 │  3.054 │  3.759 │  1.199 │
│    4.3 │    2.0 │    1.0 │    0.1 │
│    7.9 │    4.4 │    6.9 │    2.5 │
└────────┴────────┴────────┴────────┘
Run Time (s): real 0.001 user 0.004000 sys 0.000000
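The script's comments describe the rows of N as approximate column means, minima, and maxima of X (the mean row appears twice because it is inserted by two separate statements). A Python sketch of how such probe rows can be derived is shown below. `probe_rows` is a hypothetical helper, and the sample uses only the two rows of X that are visible in the output above, so its results will not match the values in N.

```python
from statistics import fmean


def probe_rows(rows):
    """Return (means, mins, maxs), each a tuple with one entry per column."""
    columns = list(zip(*rows))  # transpose rows into columns
    means = tuple(round(fmean(col), 3) for col in columns)
    mins = tuple(min(col) for col in columns)
    maxs = tuple(max(col) for col in columns)
    return means, mins, maxs


# Only the first and last rows of X shown above, not the full 150 rows.
sample = [(5.1, 3.5, 1.4, 0.2), (5.9, 3.0, 5.1, 1.8)]
means, mins, maxs = probe_rows(sample)
print(means)
print(mins)
print(maxs)
```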
D from a;
┌────────────┐
│ predicted  │
│   int32    │
├────────────┤
│          0 │
│     ·      │
│          2 │
├────────────┤
│  150 rows  │
│ (40 shown) │
└────────────┘
Run Time (s): real 0.001 user 0.000000 sys 0.000000
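Because the dataset is tiny (only 150 rows), training and prediction are both very fast, even with 50 iterations. The model's accuracy is also reasonable: only 1 of the 150 training rows is misclassified.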
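The single misclassification can be turned into an accuracy figure with a few lines of Python. The counts are the ones printed by the GROUP BY query above, and `training_accuracy` is a hypothetical helper. Note that this is accuracy on the same rows the model was trained on, since the script has no held-out test set, so it says little about how the model would do on unseen flowers.

```python
def training_accuracy(n_rows, n_misclassified):
    """Fraction of rows whose predicted class equals the label."""
    return (n_rows - n_misclassified) / n_rows


# Predicted class counts from the GROUP BY output above.
predicted_counts = {0: 50, 1: 49, 2: 51}
n_rows = sum(predicted_counts.values())

print(n_rows)                                 # 150
print(f"{training_accuracy(n_rows, 1):.2%}")  # 99.33%
```

Assuming the label file follows the standard Iris layout of 50 flowers per species, the counts 50/49/51 combined with exactly one error imply that the misclassified row is a class-1 flower predicted as class 2.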
