ML.NET Tutorial: Customer Segmentation (a Clustering Problem)

Category: Front end | 2024-05-05 15:23 | Source: 云博客

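Understanding the problem

Customer segmentation means dividing customers into groups according to the characteristics they share. The starting point is that no predefined list of customer categories is available; all we have are the customers' profiles.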

Dataset

The available data consists of the company's historical marketing offers and the customers' purchase records.
offers.csv:

    Offer #,Campaign,Varietal,Minimum Qty (kg),Discount (%),Origin,Past Peak
    1,January,Malbec,72,56,France,FALSE
    2,January,Pinot Noir,72,17,France,FALSE
    3,February,Espumante,144,32,Oregon,TRUE
    4,February,Champagne,72,48,France,TRUE
    5,February,Cabernet Sauvignon,144,44,New Zealand,TRUE
    6,March,Prosecco,144,86,Chile,FALSE
    7,March,Prosecco,6,40,Australia,TRUE
    8,March,Espumante,6,45,South Africa,FALSE
    9,April,Chardonnay,144,57,Chile,FALSE
    10,April,Prosecco,72,52,California,FALSE
    11,May,Champagne,72,85,France,FALSE
    12,May,Prosecco,72,83,Australia,FALSE
    13,May,Merlot,6,43,Chile,FALSE
    14,June,Merlot,72,64,Chile,FALSE
    15,June,Cabernet Sauvignon,144,19,Italy,FALSE
    16,June,Merlot,72,88,California,FALSE
    17,July,Pinot Noir,12,47,Germany,FALSE
    18,July,Espumante,6,50,Oregon,FALSE
    19,July,Champagne,12,66,Germany,FALSE
    20,August,Cabernet Sauvignon,72,82,Italy,FALSE
    21,August,Champagne,12,50,California,FALSE
    22,August,Champagne,72,63,France,FALSE
    23,September,Chardonnay,144,39,South Africa,FALSE
    24,September,Pinot Noir,6,34,Italy,FALSE
    25,October,Cabernet Sauvignon,72,59,Oregon,TRUE
    26,October,Pinot Noir,144,83,Australia,FALSE
    27,October,Champagne,72,88,New Zealand,FALSE
    28,November,Cabernet Sauvignon,12,56,France,TRUE
    29,November,Pinot Grigio,6,87,France,FALSE
    30,December,Malbec,6,54,France,FALSE
    31,December,Champagne,72,89,France,FALSE
    32,December,Cabernet Sauvignon,72,45,Germany,TRUE

transactions.csv:

    Customer Last Name,Offer #
    Smith,2
    Smith,24
    Johnson,17
    Johnson,24
    Johnson,26
    Williams,18
    Williams,22
    Williams,31
    Brown,7
    Brown,29
    Brown,30
    Jones,8
    Miller,6
    Miller,10
    Miller,14
    Miller,15
    Miller,22
    Miller,23
    Miller,31
    Davis,12
    Davis,22
    Davis,25
    Garcia,14
    Garcia,15
    Rodriguez,2
    Rodriguez,26
    Wilson,8
    Wilson,30
    Martinez,12
    Martinez,25
    Martinez,28
    Anderson,24
    Anderson,26
    Taylor,7
    Taylor,18
    Taylor,29
    Taylor,30
    Thomas,1
    Thomas,4
    Thomas,9
    Thomas,11
    Thomas,14
    Thomas,26
    Hernandez,28
    Hernandez,29
    Moore,17
    Moore,24
    Martin,2
    Martin,11
    Martin,28
    Jackson,1
    Jackson,2
    Jackson,11
    Jackson,15
    Jackson,22
    Thompson,9
    Thompson,16
    Thompson,25
    Thompson,30
    White,14
    White,22
    White,25
    White,30
    Lopez,9
    Lopez,11
    Lopez,15
    Lopez,16
    Lopez,27
    Lee,3
    Lee,4
    Lee,6
    Lee,22
    Lee,27
    Gonzalez,9
    Gonzalez,31
    Harris,4
    Harris,6
    Harris,7
    Harris,19
    Harris,22
    Harris,27
    Clark,4
    Clark,11
    Clark,28
    Clark,31
    Lewis,7
    Lewis,8
    Lewis,30
    Robinson,7
    Robinson,29
    Walker,18
    Walker,29
    Perez,18
    Perez,30
    Hall,11
    Hall,22
    Young,6
    Young,9
    Young,15
    Young,22
    Young,31
    Young,32
    Allen,9
    Allen,27
    Sanchez,4
    Sanchez,5
    Sanchez,14
    Sanchez,15
    Sanchez,20
    Sanchez,22
    Sanchez,26
    Wright,4
    Wright,6
    Wright,21
    Wright,27
    King,7
    King,13
    King,18
    King,29
    Scott,6
    Scott,14
    Scott,23
    Green,7
    Baker,7
    Baker,10
    Baker,19
    Baker,31
    Adams,18
    Adams,29
    Adams,30
    Nelson,3
    Nelson,4
    Nelson,8
    Nelson,31
    Hill,8
    Hill,13
    Hill,18
    Hill,30
    Ramirez,9
    Campbell,2
    Campbell,24
    Campbell,26
    Mitchell,1
    Mitchell,2
    Roberts,31
    Carter,7
    Carter,13
    Carter,29
    Carter,30
    Phillips,17
    Phillips,24
    Evans,22
    Evans,27
    Turner,4
    Turner,6
    Turner,27
    Turner,31
    Torres,8
    Parker,11
    Parker,16
    Parker,20
    Parker,29
    Parker,31
    Collins,11
    Collins,30
    Edwards,8
    Edwards,27
    Stewart,8
    Stewart,29
    Stewart,30
    Flores,17
    Flores,24
    Morris,17
    Morris,24
    Morris,26
    Nguyen,19
    Nguyen,31
    Murphy,7
    Murphy,12
    Rivera,7
    Rivera,18
    Cook,24
    Cook,26
    Rogers,3
    Rogers,7
    Rogers,8
    Rogers,19
    Rogers,21
    Rogers,22
    Morgan,8
    Morgan,29
    Peterson,1
    Peterson,2
    Peterson,10
    Peterson,23
    Peterson,26
    Peterson,27
    Cooper,4
    Cooper,16
    Cooper,20
    Cooper,32
    Reed,5
    Reed,14
    Bailey,7
    Bailey,30
    Bell,2
    Bell,17
    Bell,24
    Bell,26
    Gomez,11
    Gomez,20
    Gomez,25
    Gomez,32
    Kelly,6
    Kelly,20
    Kelly,31
    Kelly,32
    Howard,11
    Howard,12
    Howard,22
    Ward,4
    Cox,2
    Cox,17
    Cox,24
    Cox,26
    Diaz,7
    Diaz,8
    Diaz,29
    Diaz,30
    Richardson,3
    Richardson,6
    Richardson,22
    Wood,1
    Wood,10
    Wood,14
    Wood,31
    Watson,7
    Watson,29
    Brooks,3
    Brooks,8
    Brooks,11
    Brooks,22
    Bennett,8
    Bennett,29
    Gray,12
    Gray,16
    Gray,26
    James,7
    James,8
    James,13
    James,18
    James,30
    Reyes,9
    Reyes,23
    Cruz,29
    Cruz,30
    Hughes,7
    Hughes,8
    Hughes,13
    Hughes,29
    Hughes,30
    Price,1
    Price,22
    Price,30
    Price,31
    Myers,18
    Myers,30
    Long,3
    Long,7
    Long,10
    Foster,1
    Foster,9
    Foster,14
    Foster,22
    Foster,23
    Sanders,1
    Sanders,4
    Sanders,5
    Sanders,6
    Sanders,9
    Sanders,11
    Sanders,20
    Sanders,25
    Sanders,26
    Ross,18
    Ross,21
    Morales,6
    Morales,7
    Morales,8
    Morales,19
    Morales,22
    Morales,31
    Powell,5
    Sullivan,8
    Sullivan,13
    Sullivan,18
    Russell,26
    Ortiz,8
    Jenkins,24
    Jenkins,26
    Gutierrez,6
    Gutierrez,8
    Gutierrez,10
    Gutierrez,18
    Perry,8
    Perry,18
    Perry,29
    Perry,30
    Butler,1
    Butler,4
    Butler,22
    Butler,28
    Butler,30
    Barnes,10
    Barnes,21
    Barnes,22
    Barnes,31
    Fisher,1
    Fisher,2
    Fisher,11
    Fisher,22
    Fisher,28
    Fisher,30
    Fisher,31

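Preprocessing

The two datasets have to be joined to obtain a single view. Because we want to compare customers by the transactions they made, we also build a pivot table: each row is a customer, each column is an offer, and each cell records whether the customer bought that offer (the code stores the purchase count).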

    var offers = Offer.ReadFromCsv(_offersCsv);
    var transactions = Transaction.ReadFromCsv(_transactionsCsv);

    // Join every transaction to the offer it refers to.
    var clusterData = (from of in offers
                       join tr in transactions on of.OfferId equals tr.OfferId
                       select new
                       {
                           of.OfferId,
                           of.Campaign,
                           of.Discount,
                           tr.LastName,
                           of.LastPeak,
                           of.Minimum,
                           of.Origin,
                           of.Varietal,
                           Count = 1,
                       }).ToArray();

    // Number of offers = number of columns in the pivot table.
    var count = offers.Count();

    // One pivot row per customer.
    var pivotDataArray = (from c in clusterData
                          group c by c.LastName into gcs
                          let lookup = gcs.ToLookup(y => y.OfferId, y => y.Count)
                          select new PivotData()
                          {
                              LastName = gcs.Key,
                              Features = ToFeatures(lookup, count)
                          }).ToArray();

The ToFeatures method builds the feature array for one customer; its length is the number of offers.

    private static float[] ToFeatures(ILookup<string, int> lookup, int count)
    {
        var result = new float[count];
        foreach (var item in lookup)
        {
            // Offer ids start at 1, array indexes at 0.
            var key = Convert.ToInt32(item.Key) - 1;
            result[key] = item.Sum();
        }
        return result;
    }
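To make the pivot step concrete, here is a minimal, self-contained sketch that does not depend on ML.NET. The class name `PivotSketch` and the tuple-based input are assumptions made for illustration; they are not part of the original sample, which works on its own `Offer` and `Transaction` classes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class PivotSketch
{
    // Builds one pivot row: slot (offerId - 1) counts how often that offer was bought.
    public static float[] ToFeatures(IEnumerable<string> offerIds, int offerCount)
    {
        var row = new float[offerCount];
        foreach (var id in offerIds)
        {
            row[int.Parse(id) - 1] += 1f;
        }
        return row;
    }

    // Groups (LastName, OfferId) pairs into one feature row per customer.
    public static Dictionary<string, float[]> Pivot(
        IEnumerable<(string LastName, string OfferId)> transactions, int offerCount)
    {
        return transactions
            .GroupBy(t => t.LastName)
            .ToDictionary(
                g => g.Key,
                g => ToFeatures(g.Select(t => t.OfferId), offerCount));
    }
}
```

With the first rows of transactions.csv, Smith bought offers 2 and 24, so Smith's 32-element row has a 1 at indexes 1 and 23 and 0 everywhere else.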

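Data view

Once the array is ready, the CreateStreamingDataView method builds the data view from it. Because the Features property is an array, its size must be declared explicitly in the schema. The code in this article targets an early preview release of ML.NET (note the Microsoft.ML.Runtime.* namespaces), so several names differ in ML.NET 1.0 and later.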

    var mlContext = new MLContext();

    var schemaDef = SchemaDefinition.Create(typeof(PivotData));
    schemaDef["Features"].ColumnType = new VectorType(NumberType.R4, count);

    var pivotDataView = mlContext.CreateStreamingDataView(pivotDataArray, schemaDef);

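PCA

PCA (Principal Component Analysis) reduces a large number of dimensions to a range that is easier to analyze; here the features are reduced to two dimensions. In this sample the two-dimensional PCAFeatures column supplies the point coordinates for the plot, while the trainer still clusters on the full Features column.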

    new PrincipalComponentAnalysisEstimator(mlContext, "Features", "PCAFeatures", rank: 2)
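For readers who want the math behind the reduction, textbook PCA can be written as below. This is the standard formulation, given as a sketch: the symbols $x_i$ (the feature vector of customer $i$), $\mu$, $C$, $w_k$ and $z_i$ are introduced here and do not appear in the original article, and ML.NET may compute the projection with a different (for example approximate) algorithm.

```latex
% Mean and sample covariance of the n feature vectors x_i \in \mathbb{R}^d (here d = 32 offers)
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
C = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^{\top}

% Unit-length eigenvectors of C, ordered by decreasing eigenvalue
C\,w_k = \lambda_k w_k,
\qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0

% Rank-2 projection: the coordinates along the two leading eigenvectors
z_i =
\begin{pmatrix}
  w_1^{\top}(x_i - \mu) \\
  w_2^{\top}(x_i - \mu)
\end{pmatrix}
\in \mathbb{R}^{2}
```

The two components of $z_i$ play the role of the x and y coordinates when a customer is drawn as a point.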

OneHotEncoding

One-hot encoding is used here to convert LastName from a string into a numeric indicator vector.

    new OneHotEncodingEstimator(mlContext, new[]
    {
        new OneHotEncodingEstimator.ColumnInfo("LastName", "LastNameKey", OneHotEncodingTransformer.OutputKind.Ind)
    })
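The idea of one-hot encoding can be sketched in a few lines of plain C#. `OneHotSketch` is a hypothetical helper written for this explanation, not the ML.NET implementation; in particular, mapping unseen values to an all-zero vector is an assumption of the sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class OneHotSketch
{
    // Assigns each distinct value an index, in order of first appearance.
    public static Dictionary<string, int> BuildVocabulary(IEnumerable<string> values)
    {
        var vocabulary = new Dictionary<string, int>();
        foreach (var value in values)
        {
            if (!vocabulary.ContainsKey(value))
            {
                vocabulary.Add(value, vocabulary.Count);
            }
        }
        return vocabulary;
    }

    // Returns a vector with a single 1 at the value's index.
    // Values missing from the vocabulary yield an all-zero vector (sketch assumption).
    public static float[] Encode(string value, IReadOnlyDictionary<string, int> vocabulary)
    {
        var vector = new float[vocabulary.Count];
        if (vocabulary.TryGetValue(value, out var index))
        {
            vector[index] = 1f;
        }
        return vector;
    }
}
```

For the names Smith, Johnson, Williams (in that order), encoding "Johnson" gives the vector 0, 1, 0.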

Trainer

K-Means is a commonly used trainer for clustering problems. Here we assume the customers should be split into three clusters.

    mlContext.Clustering.Trainers.KMeans("Features", clustersCount: 3)
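To show what a K-Means trainer does conceptually, here is a minimal sketch of Lloyd's algorithm in plain C#. It is not the ML.NET trainer: the class `KMeansSketch` is hypothetical, and it seeds the centroids with the first k points for simplicity, whereas the trainer named in the evaluation output (KMeansPlusPlusTrainer) uses a different initialization.

```csharp
using System;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class KMeansSketch
{
    public static double SquaredDistance(float[] a, float[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++)
        {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Index of the centroid closest to the point.
    public static int Nearest(float[] point, float[][] centroids)
    {
        int best = 0;
        for (int c = 1; c < centroids.Length; c++)
        {
            if (SquaredDistance(point, centroids[c]) < SquaredDistance(point, centroids[best]))
            {
                best = c;
            }
        }
        return best;
    }

    // Lloyd's algorithm. Returns the cluster index (0..k-1) of every point.
    public static int[] Cluster(float[][] points, int k, int maxIterations = 100)
    {
        if (k < 1 || k > points.Length)
        {
            throw new ArgumentException("k must be between 1 and the number of points.");
        }

        // Simplified seeding: copy the first k points (not k-means++).
        var centroids = points.Take(k).Select(p => (float[])p.Clone()).ToArray();
        var labels = new int[points.Length];

        for (int iteration = 0; iteration < maxIterations; iteration++)
        {
            // Assignment step: attach every point to its nearest centroid.
            bool changed = false;
            for (int i = 0; i < points.Length; i++)
            {
                int nearest = Nearest(points[i], centroids);
                if (nearest != labels[i])
                {
                    labels[i] = nearest;
                    changed = true;
                }
            }
            if (!changed && iteration > 0) break;

            // Update step: move every centroid to the mean of its members.
            for (int c = 0; c < k; c++)
            {
                var members = points.Where((p, i) => labels[i] == c).ToArray();
                if (members.Length == 0) continue; // an empty cluster keeps its old centroid
                for (int d = 0; d < centroids[c].Length; d++)
                {
                    centroids[c][d] = members.Average(m => m[d]);
                }
            }
        }
        return labels;
    }
}
```

Calling `KMeansSketch.Cluster(points, 3)` on the pivot rows would give a three-way split comparable in spirit to the trainer above, though not necessarily the same assignment.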

Training the model

    ITransformer trainedModel = trainingPipeline.Fit(pivotDataView);

Evaluating the model

    var predictions = trainedModel.Transform(pivotDataView);
    var metrics = mlContext.Clustering.Evaluate(predictions, score: "Score", features: "Features");

    Console.WriteLine($"*************************************************");
    Console.WriteLine($"* Metrics for {trainer} clustering model ");
    Console.WriteLine($"*------------------------------------------------");
    Console.WriteLine($"* AvgMinScore: {metrics.AvgMinScore}");
    Console.WriteLine($"* DBI is: {metrics.Dbi}");
    Console.WriteLine($"*************************************************");

This produces the following evaluation results.

    *************************************************
    * Metrics for Microsoft.ML.Trainers.KMeans.KMeansPlusPlusTrainer clustering model
    *------------------------------------------------
    * AvgMinScore: 2.3154067927599
    * DBI is: 2.69100740819456
    *************************************************
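DBI stands for the Davies-Bouldin index, for which lower values indicate tighter, better separated clusters. The sketch below computes the textbook version of the index in plain C#. The class `ClusterMetricsSketch` is hypothetical, and the article does not state exactly how ML.NET computes its metric, so the numbers are not guaranteed to match the output above.

```csharp
using System;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class ClusterMetricsSketch
{
    public static double Distance(float[] a, float[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++)
        {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.Sqrt(sum);
    }

    // Textbook Davies-Bouldin index:
    // the mean over clusters i of max over j != i of (s_i + s_j) / d(c_i, c_j),
    // where s_i is the mean distance of cluster i's members to its centroid c_i.
    // Coinciding centroids would make a ratio infinite; the sketch does not guard against that.
    public static double DaviesBouldin(float[][] points, int[] labels, float[][] centroids)
    {
        int k = centroids.Length;

        var scatter = new double[k];
        for (int c = 0; c < k; c++)
        {
            var members = points.Where((p, i) => labels[i] == c).ToArray();
            scatter[c] = members.Length == 0
                ? 0
                : members.Average(m => Distance(m, centroids[c]));
        }

        double total = 0;
        for (int i = 0; i < k; i++)
        {
            double worst = 0;
            for (int j = 0; j < k; j++)
            {
                if (i == j) continue;
                double ratio = (scatter[i] + scatter[j]) / Distance(centroids[i], centroids[j]);
                worst = Math.Max(worst, ratio);
            }
            total += worst;
        }
        return total / k;
    }
}
```

As a quick check, take two clusters whose members each lie at distance 1 from their centroid, with the centroids 10 apart: the index is (1 + 1) / 10 = 0.2.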

Using the model

    var clusteringPredictions = predictions
        .AsEnumerable<ClusteringPrediction>(mlContext, false)
        .ToArray();
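Once every customer has a cluster id, a natural next step is to list the customers in each segment. The sketch below shows one way to do that; `SegmentReport` is a hypothetical helper, and it takes plain (LastName, ClusterId) tuples rather than the sample's ClusteringPrediction objects so that it runs without ML.NET.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class SegmentReport
{
    // Groups customer names by cluster id, with the cluster ids in ascending order.
    public static SortedDictionary<uint, List<string>> ByCluster(
        IEnumerable<(string LastName, uint ClusterId)> predictions)
    {
        var segments = new SortedDictionary<uint, List<string>>();
        foreach (var (lastName, clusterId) in predictions)
        {
            if (!segments.TryGetValue(clusterId, out var names))
            {
                names = new List<string>();
                segments[clusterId] = names;
            }
            names.Add(lastName);
        }
        return segments;
    }
}
```

With the sample's types, the input could be produced by `clusteringPredictions.Select(p => (p.LastName, p.SelectedClusterId))`.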

Plotting

For a more intuitive view, the OxyPlot library can be used to render the results as an image.

Add the package:

    dotnet add package OxyPlot.Core

Building the plot:

    var plot = new PlotModel { Title = "Customer Segmentation", IsLegendVisible = true };

    var clusters = clusteringPredictions.Select(p => p.SelectedClusterId).Distinct().OrderBy(x => x);

    // One scatter series per cluster.
    foreach (var cluster in clusters)
    {
        var scatter = new ScatterSeries
        {
            MarkerType = MarkerType.Circle,
            MarkerStrokeThickness = 2,
            Title = $"Cluster: {cluster}",
            RenderInLegend = true
        };

        // The two PCA components are the x and y coordinates of each customer.
        var series = clusteringPredictions
            .Where(p => p.SelectedClusterId == cluster)
            .Select(p => new ScatterPoint(p.Location[0], p.Location[1])).ToArray();

        scatter.Points.AddRange(series);
        plot.Series.Add(scatter);
    }

    plot.DefaultColors = OxyPalettes.HueDistinct(plot.Series.Count).Colors;

    // Write the plot to an SVG file.
    var exporter = new SvgExporter { Width = 600, Height = 400 };
    using (var fs = new System.IO.FileStream(_plotSvg, System.IO.FileMode.Create))
    {
        exporter.Export(plot, fs);
    }

The resulting image looks like this:

[Figure: scatter plot of the customers, one color per cluster]

Complete sample code

Program class:

    using CustomerSegmentation.DataStructures;
    using Microsoft.ML;
    using System;
    using System.IO;
    using System.Linq;
    using Microsoft.ML.Runtime.Api;
    using Microsoft.ML.Transforms.Projections;
    using Microsoft.ML.Transforms.Categorical;
    using Microsoft.ML.Runtime.Data;
    using OxyPlot;
    using OxyPlot.Series;
    using Microsoft.ML.Core.Data;

    namespace CustomerSegmentation
    {
        class Program
        {
            private static float[] ToFeatures(ILookup<string, int> lookup, int count)
            {
                var result = new float[count];
                foreach (var item in lookup)
                {
                    var key = Convert.ToInt32(item.Key) - 1;
                    result[key] = item.Sum();
                }
                return result;
            }

            static readonly string _offersCsv = Path.Combine(Environment.CurrentDirectory, "assets", "offers.csv");
            static readonly string _transactionsCsv = Path.Combine(Environment.CurrentDirectory, "assets", "transactions.csv");
            static readonly string _plotSvg = Path.Combine(Environment.CurrentDirectory, "assets", "customerSegmentation.svg");

            static void Main(string[] args)
            {
                var offers = Offer.ReadFromCsv(_offersCsv);
                var transactions = Transaction.ReadFromCsv(_transactionsCsv);

                var clusterData = (from of in offers
                                   join tr in transactions on of.OfferId equals tr.OfferId
                                   select new
                                   {
                                       of.OfferId,
                                       of.Campaign,
                                       of.Discount,
                                       tr.LastName,
                                       of.LastPeak,
                                       of.Minimum,
                                       of.Origin,
                                       of.Varietal,
                                       Count = 1,
                                   }).ToArray();

                var count = offers.Count();

                var pivotDataArray = (from c in clusterData
                                      group c by c.LastName into gcs
                                      let lookup = gcs.ToLookup(y => y.OfferId, y => y.Count)
                                      select new PivotData()
                                      {
                                          LastName = gcs.Key,
                                          Features = ToFeatures(lookup, count)
                                      }).ToArray();

                var mlContext = new MLContext();

                var schemaDef = SchemaDefinition.Create(typeof(PivotData));
                schemaDef["Features"].ColumnType = new VectorType(NumberType.R4, count);
                var pivotDataView = mlContext.CreateStreamingDataView(pivotDataArray, schemaDef);

                var dataProcessPipeline = new PrincipalComponentAnalysisEstimator(mlContext, "Features", "PCAFeatures", rank: 2)
                    .Append(new OneHotEncodingEstimator(mlContext, new[]
                    {
                        new OneHotEncodingEstimator.ColumnInfo("LastName", "LastNameKey", OneHotEncodingTransformer.OutputKind.Ind)
                    }));

                var trainer = mlContext.Clustering.Trainers.KMeans("Features", clustersCount: 3);
                var trainingPipeline = dataProcessPipeline.Append(trainer);

                ITransformer trainedModel = trainingPipeline.Fit(pivotDataView);

                var predictions = trainedModel.Transform(pivotDataView);
                var metrics = mlContext.Clustering.Evaluate(predictions, score: "Score", features: "Features");

                Console.WriteLine($"*************************************************");
                Console.WriteLine($"* Metrics for {trainer} clustering model ");
                Console.WriteLine($"*------------------------------------------------");
                Console.WriteLine($"* AvgMinScore: {metrics.AvgMinScore}");
                Console.WriteLine($"* DBI is: {metrics.Dbi}");
                Console.WriteLine($"*************************************************");

                var clusteringPredictions = predictions
                    .AsEnumerable<ClusteringPrediction>(mlContext, false)
                    .ToArray();

                var plot = new PlotModel { Title = "Customer Segmentation", IsLegendVisible = true };

                var clusters = clusteringPredictions.Select(p => p.SelectedClusterId).Distinct().OrderBy(x => x);

                foreach (var cluster in clusters)
                {
                    var scatter = new ScatterSeries
                    {
                        MarkerType = MarkerType.Circle,
                        MarkerStrokeThickness = 2,
                        Title = $"Cluster: {cluster}",
                        RenderInLegend = true
                    };

                    var series = clusteringPredictions
                        .Where(p => p.SelectedClusterId == cluster)
                        .Select(p => new ScatterPoint(p.Location[0], p.Location[1])).ToArray();

                    scatter.Points.AddRange(series);
                    plot.Series.Add(scatter);
                }

                plot.DefaultColors = OxyPalettes.HueDistinct(plot.Series.Count).Colors;

                var exporter = new SvgExporter { Width = 600, Height = 400 };
                using (var fs = new System.IO.FileStream(_plotSvg, System.IO.FileMode.Create))
                {
                    exporter.Export(plot, fs);
                }

                Console.Read();
            }
        }
    }

Offer class:

    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    namespace CustomerSegmentation.DataStructures
    {
        public class Offer
        {
            //Offer #,Campaign,Varietal,Minimum Qty (kg),Discount (%),Origin,Past Peak
            public string OfferId { get; set; }
            public string Campaign { get; set; }
            public string Varietal { get; set; }
            public float Minimum { get; set; }
            public float Discount { get; set; }
            public string Origin { get; set; }
            public string LastPeak { get; set; }

            public static IEnumerable<Offer> ReadFromCsv(string file)
            {
                return File.ReadAllLines(file)
                    .Skip(1) // skip header
                    .Select(x => x.Split(','))
                    .Select(x => new Offer()
                    {
                        OfferId = x[0],
                        Campaign = x[1],
                        Varietal = x[2],
                        Minimum = float.Parse(x[3]),
                        Discount = float.Parse(x[4]),
                        Origin = x[5],
                        LastPeak = x[6]
                    });
            }
        }
    }

Transaction class:

    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    namespace CustomerSegmentation.DataStructures
    {
        public class Transaction
        {
            //Customer Last Name,Offer #
            //Smith,2
            public string LastName { get; set; }
            public string OfferId { get; set; }

            public static IEnumerable<Transaction> ReadFromCsv(string file)
            {
                return File.ReadAllLines(file)
                    .Skip(1) // skip header
                    .Select(x => x.Split(','))
                    .Select(x => new Transaction()
                    {
                        LastName = x[0],
                        OfferId = x[1],
                    });
            }
        }
    }

PivotData class:

    namespace CustomerSegmentation.DataStructures
    {
        public class PivotData
        {
            public float[] Features;

            public string LastName;
        }
    }

ClusteringPrediction class:

    using Microsoft.ML.Runtime.Api;
    using System;
    using System.Collections.Generic;
    using System.Text;

    namespace CustomerSegmentation.DataStructures
    {
        public class ClusteringPrediction
        {
            [ColumnName("PredictedLabel")]
            public uint SelectedClusterId;

            [ColumnName("Score")]
            public float[] Distance;

            [ColumnName("PCAFeatures")]
            public float[] Location;

            [ColumnName("LastName")]
            public string LastName;
        }
    }