Data mining is the process of extracting knowledge from large volumes of data; it spans statistics, machine learning, artificial intelligence, and related fields.
It typically relies on computer programs to analyze data, uncover hidden relationships or rules, and turn them into useful information.
Its common techniques include clustering, for example K-Means with scikit-learn:
from sklearn.cluster import KMeans
import numpy as np

# Six 2-D points forming two obvious groups (x = 1 and x = 4)
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# Partition the points into two clusters
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print(kmeans.labels_)  # cluster label assigned to each point
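As a quick follow-up, the fitted model also exposes the cluster centers and can assign new points to clusters; a minimal sketch reusing the kmeans object fitted above:
print(kmeans.cluster_centers_)           # coordinates of the two cluster centers
print(kmeans.predict([[0, 0], [4, 4]]))  # nearest-cluster assignment for two new points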
Apriori algorithm: used for frequent itemset mining and association rule mining; it can be implemented with the apyori library.
FP-growth algorithm: similar to Apriori, but faster and more efficient; it can be implemented with the pyfpgrowth library.
Association rule mining with the Apriori algorithm:
from apyori import apriori

transactions = [['apple', 'banana'], ['banana', 'orange'], ['apple', 'banana', 'orange'], ['banana', 'orange']]
# Mine frequent itemsets with support >= 0.5,
# and output a rule whenever its confidence exceeds 0.7.
rules = apriori(transactions, min_support=0.5, min_confidence=0.7)
for rule in rules:
    print(rule)
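apyori yields RelationRecord objects, which print rather verbosely; below is a minimal sketch for extracting the antecedent, consequent, support, and confidence, assuming apyori's RelationRecord fields (items, support, ordered_statistics):
# Unpack each RelationRecord into readable rules (field names per apyori's RelationRecord)
for record in apriori(transactions, min_support=0.5, min_confidence=0.7):
    for stat in record.ordered_statistics:
        print(list(stat.items_base), "=>", list(stat.items_add),
              "(support=%.2f, confidence=%.2f)" % (record.support, stat.confidence))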
Association rule mining with the FP-Growth algorithm.
Install the pyfpgrowth library first: !pip install pyfpgrowth
import pandas as pd

# Read the online-retail transaction data
df = pd.read_csv('https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/online-retail/online_retail.csv',
                 header=0, parse_dates=[4], encoding='unicode_escape')

# Keep UK orders and pivot to an invoice x item quantity table
basket = (df[df['Country'] == "United Kingdom"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

# Binarize quantities, then turn each invoice into a list of the items it contains,
# since pyfpgrowth expects a list of transactions rather than a one-hot matrix
basket_sets = basket.applymap(lambda x: 1 if x > 0 else 0).astype(int)
transactions = [list(basket_sets.columns[row.values == 1]) for _, row in basket_sets.iterrows()]
import pyfpgrowth

# Minimum support (as a fraction of transactions) and minimum confidence
min_support = 0.02
min_confidence = 0.7

# Mine frequent itemsets; find_frequent_patterns takes an absolute count threshold
patterns = pyfpgrowth.find_frequent_patterns(transactions, int(len(transactions) * min_support))

# Generate association rules; each entry maps an antecedent to (consequent, confidence)
rules = pyfpgrowth.generate_association_rules(patterns, min_confidence)

# Print each rule with its antecedent support count and confidence, sorted by confidence
for antecedent, (consequent, confidence) in sorted(rules.items(), key=lambda x: x[1][1], reverse=True):
    support = patterns.get(antecedent, 0)
    print(f"{list(antecedent)} => {list(consequent)} (support={support}, confidence={confidence})")
['JUMBO BAG RED RETROSPOT'] => ['JUMBO STORAGE BAG SUKI'] (support=219, confidence=0.7777777777777778)
['REGENCY CAKESTAND 3 TIER'] => ['WHITE HANGING HEART T-LIGHT HOLDER'] (support=233, confidence=0.7209302325581395)
['JUMBO BAG PINK POLKADOT'] => ['JUMBO STORAGE BAG SUKI'] (support=242, confidence=0.7560975609756098)
['LUNCH BAG BLACK SKULL.'] => ['LUNCH BAG RED RETROSPOT'] (support=139, confidence=0.765625)
['PARTY BUNTING'] => ['JUMBO BAG RED RETROSPOT'] (support=184, confidence=0.7551020408163266)
['JUMBO STORAGE BAG SUKI'] => ['JUMBO BAG PINK POLKADOT'] (support=242, confidence=0.7023255813953488)
['LUNCH BAG PINK POLKADOT'] => ['LUNCH BAG RED RETROSPOT'] (support=180, confidence=0.7843137254901961)
['LUNCH BAG CARS BLUE'] => ['LUNCH BAG RED RETROSPOT'] (support=177, confidence=0.8240740740740741)
['LUNCH BAG SPACEBOY DESIGN'] => ['LUNCH BAG RED RETROSPOT'] (support=161, confidence=0.7385321100917431)
['WOODEN PICTURE FRAME WHITE FINISH'] => ['WOODEN FRAME ANTIQUE WHITE '] (support=144, confidence=0.7972027972027972)
['JUMBO BAG RED RETROSPOT'] => ['JUMBO BAG PINK POLKADOT'] (support=219, confidence=0.7777777777777778)
['LUNCH BAG WOODLAND'] => ['LUNCH BAG RED RETROSPOT'] (support=158, confidence=0.7939698492462312)
['LUNCH BAG RED SPOTTY'] => ['LUNCH BAG RED RETROSPOT'] (support=228, confidence=0.9421487603305785)
['LUNCH BAG SUKI DESIGN '] => ['LUNCH BAG RED RETROSPOT'] (support=176, confidence=0.839905352113281)
['SET OF 3 CAKE TINS PANTRY DESIGN '] => ['SET OF 3 RETROSPOT CAKE TINS'] (support=134, confidence=0.8564102564102564)
['JUMBO STORAGE BAG SKULLS'] => ['JUMBO BAG RED RETROSPOT'] (support=125, confidence=0.7515151515151515)
['JUMBO BAG APPLES'] => ['JUMBO BAG RED RETROSPOT'] (support=174, confidence=0.8536585365853658)
['LUNCH BAG APPLE DESIGN'] => ['LUNCH BAG RED RETROSPOT'] (support=168, confidence=0.7428571428571429)
['RECYCLING BAG RETROSPOT'] => ['JUMBO BAG RED RETROSPOT'] (support=155, confidence=0.825531914893617)
['PANTRY ROLLING PIN'] => ['SET OF 3 RETROSPOT CAKE TINS'] (support=136, confidence=0.7431693989071039)
['PLASTERS IN TIN SPACEBOY'] => ['PLASTERS IN TIN WOODLAND ANIMALS'] (support=119, confidence=0.8415492957746478)
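Apriori is also available outside Python; for example, the Weka library can be used from Java: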
import weka.associations.Apriori;
import weka.core.Instances;
import java.io.BufferedReader;
import java.io.FileReader;

public class AssociationRuleMining {
    public static void main(String[] args) throws Exception {
        // Load the transaction data in ARFF format
        BufferedReader reader = new BufferedReader(new FileReader("transactions.arff"));
        Instances data = new Instances(reader);
        reader.close();

        // Configure Apriori with a minimum confidence of 0.5 (-C option)
        Apriori model = new Apriori();
        String[] options = {"-C", "0.5"};
        model.setOptions(options);

        // Mine the association rules and print them
        model.buildAssociations(data);
        System.out.println(model);
    }
}
Decision tree classification, implemented with the sklearn library:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train a decision tree classifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
# Predict on the test set
y_pred = clf.predict(X_test)
# Report classifier accuracy
print("Accuracy: %f" % clf.score(X_test, y_test))
Naive Bayes classification, implemented with the sklearn library:
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train a Gaussian naive Bayes classifier
clf = GaussianNB()
clf.fit(X_train, y_train)
# Predict on the test set
y_pred = clf.predict(X_test)
# Report classifier accuracy
print("Accuracy: %f" % clf.score(X_test, y_test))
Support vector machine (SVM) classification, implemented with the sklearn library:
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train a support vector machine classifier
clf = SVC()
clf.fit(X_train, y_train)
# Predict on the test set
y_pred = clf.predict(X_test)
# Report classifier accuracy
print("Accuracy: %f" % clf.score(X_test, y_test))
Regression analysis is a statistical method for predicting numeric values by studying the relationships between the variables in a dataset.
Its methods include:
Linear regression is the most basic regression method; it models the linear relationship between the independent and dependent variables.
It can be implemented with the LinearRegression model from the scikit-learn library.
First, prepare a dataset (a synthetic one here, standing in for a task such as predicting student scores):
import numpy as np
# Generate a synthetic dataset: y = 2x plus Gaussian noise
np.random.seed(0)
X = np.random.normal(0, 1, size=(100, 1))
y = 2*X[:, 0] + np.random.normal(0, 0.5, size=100)
Next, build a LinearRegression model and fit it to the data.
from sklearn.linear_model import LinearRegression
# Build the model
model = LinearRegression()
# Fit it to the data
model.fit(X, y)
Once the model is built, it can be used for prediction, and its goodness of fit (R-squared) can be computed.
# Predict
y_pred = model.predict(X)
# Compute R-squared
r_squared = model.score(X, y)
print('R-square is:', r_squared)
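Since the data were generated with a true slope of 2, inspecting the fitted parameters is a quick sanity check; a minimal sketch using the fitted model:
# The slope should come out close to 2 and the intercept close to 0
print('coefficients:', model.coef_)
print('intercept:', model.intercept_)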
Polynomial regression builds on linear regression: it fits non-linear relationships by adding higher-order terms (quadratic, cubic, and so on).
It can be implemented by generating the higher-order features with the PolynomialFeatures class from the scikit-learn library and then fitting a LinearRegression model.
First, prepare the dataset:
import numpy as np
import matplotlib.pyplot as plt
# Generate a synthetic dataset: a sine curve plus Gaussian noise
np.random.seed(0)
X = np.linspace(-1,1,100)
y = np.sin(3*np.pi*X) + np.random.normal(0, 0.1, size=100)
Next, generate quadratic features with the PolynomialFeatures class and fit a LinearRegression model.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
# Generate quadratic (degree-2) features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X.reshape(-1,1))
# Build the model
model = LinearRegression()
# Fit it to the polynomial features
model.fit(X_poly, y)
Once the model is built, it can be used for prediction, and the fitted curve can be plotted.
# Predict over the same interval
X_test = np.linspace(-1,1,100)
X_test_poly = poly.transform(X_test.reshape(-1,1))
y_pred = model.predict(X_test_poly)
# Plot the data and the fitted curve
plt.scatter(X, y, color='b')
plt.plot(X_test, y_pred, color='r')
plt.show()
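A degree-2 polynomial cannot follow sin(3πx) closely, so it can be instructive to compare a few degrees; a short sketch (the degrees chosen here are only illustrative) that reuses X, y and the classes imported above:
# Report in-sample R-squared for several polynomial degrees
for degree in (2, 5, 9, 15):
    poly_d = PolynomialFeatures(degree=degree)
    X_d = poly_d.fit_transform(X.reshape(-1, 1))
    score = LinearRegression().fit(X_d, y).score(X_d, y)
    print('degree %d: R-square %.3f' % (degree, score))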
Ridge regression is a regularized form of linear regression: it adds an L2 penalty that shrinks the feature coefficients, which helps prevent overfitting when there are many features.
It can be implemented with the Ridge model from the scikit-learn library.
First, prepare the dataset:
import numpy as np
# Generate a synthetic dataset with 10 features, only the first two of which matter
np.random.seed(0)
X = np.random.normal(0, 1, size=(100, 10))
y = 2*X[:, 0] + 3*X[:, 1] + np.random.normal(0, 0.5, size=100)
Next, build a Ridge model and fit it to the data.
from sklearn.linear_model import Ridge
# Build the model with regularization strength alpha=1
model = Ridge(alpha=1)
# Fit it to the data
model.fit(X, y)
Once the model is built, it can be used for prediction, and its goodness of fit (R-squared) can be computed.
# Predict
y_pred = model.predict(X)
# Compute R-squared
r_squared = model.score(X, y)
print('R-square is:', r_squared)
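The amount of shrinkage is controlled by alpha; a small sketch that fits Ridge with a few alpha values (chosen arbitrarily for illustration) and shows the coefficients shrinking toward zero:
import numpy as np
from sklearn.linear_model import Ridge
# Larger alpha means a stronger L2 penalty and a smaller coefficient norm
for alpha in (0.1, 1, 10, 100):
    coefs = Ridge(alpha=alpha).fit(X, y).coef_
    print('alpha=%g, coefficient norm=%.3f' % (alpha, np.linalg.norm(coefs)))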
Lasso regression is another regularized form of linear regression: it adds an L1 penalty that shrinks the feature coefficients and can drive some of them exactly to zero, which also performs feature selection.
It can be implemented with the Lasso model from the scikit-learn library.
First, prepare the dataset:
import numpy as np
# Generate a synthetic dataset with 10 features, only the first two of which matter
np.random.seed(0)
X = np.random.normal(0, 1, size=(100, 10))
y = 2*X[:, 0] + 3*X[:, 1] + np.random.normal(0, 0.5, size=100)
Next, build a Lasso model and fit it to the data.
from sklearn.linear_model import Lasso
# Build the model with regularization strength alpha=0.1
model = Lasso(alpha=0.1)
# Fit it to the data
model.fit(X, y)
Once the model is built, it can be used for prediction, and its goodness of fit (R-squared) can be computed.
# Predict
y_pred = model.predict(X)
# Compute R-squared
r_squared = model.score(X, y)
print('R-square is:', r_squared)
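To see the feature-selection effect mentioned above, inspect which coefficients Lasso drove exactly to zero; a minimal sketch using the fitted model:
import numpy as np
# Only the informative features (the first two) should keep clearly non-zero coefficients
print('coefficients:', np.round(model.coef_, 3))
print('selected feature indices:', np.flatnonzero(model.coef_))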
In practice, choose the regression method that suits the characteristics of the data and tune the model appropriately to improve its predictive accuracy.
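As one way to do that tuning, the regularization strength can be chosen by cross-validated grid search; a hedged sketch using sklearn's GridSearchCV on the Ridge model (the alpha grid here is arbitrary), reusing X and y from above:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
# Search a small grid of alpha values with 5-fold cross-validation (default R-squared scoring for regressors)
search = GridSearchCV(Ridge(), param_grid={'alpha': [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X, y)
print('best alpha:', search.best_params_['alpha'])
print('best CV R-square:', search.best_score_)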