  • clustering-main
  • us-states.json
  • .ipynb_checkpoints
  • project-checkpoint.ipynb
  • chipotle_stores.csv
  • README.md
  • project.ipynb
# Chipotle clustering challenge

![clustering](https://files.realpython.com/media/K-Means-Clustering-in-Python_Watermarked.14dc56523461.jpg)

## Context

Name of the project: Chipotle clustering challenge

Context of the project: BeCode, Liège Campus, AI/Data Operator Bootcamp, February 2021

Objectives of the project:

- be able to use geopandas, matplotlib (and seaborn) to visualize clustered data on a map.
- be able to determine appropriate clustering methods and variables to cluster.
- be able to allot the right amount of time to the tasks and present the results on a readable GitHub page.

### Description

In this project we create a Python script to run, test and improve our clustering model.

Clustering is useful for exploring data. If there are many cases and no obvious groupings, clustering algorithms can be used to find natural groupings. Clustering can also serve as a useful data-preprocessing step to identify homogeneous groups on which to build supervised models.

After loading the Chipotle store data and plotting it on the US map, we decided to perform a clustering analysis. The first method was to plot a dendrogram to see whether we could gain some useful insights. Unfortunately, the information displayed was not conclusive.

As a second step, we therefore narrowed down the number of states, keeping only those above a threshold we fixed at 70 locations per state. We narrowed the search further to the cities in those states with the highest density of locations. Finally, we kept only the cities where the number of Chipotle locations was greater than 10. This gave us a more interesting dendrogram, but still not a sufficiently conclusive one.

The next method was KMeans on the whole dataset, with k=24 as the number of clusters for our model. We displayed the resulting clusters on the US map as well, color-coding each cluster.
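The KMeans step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: the coordinates are synthetic stand-ins for the store latitudes and longitudes in `chipotle_stores.csv`, and `k = 3` is chosen to match the toy data rather than the k=24 used on the full dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic (latitude, longitude) pairs standing in for store coordinates.
# Assumption: the real values would be read from chipotle_stores.csv.
rng = np.random.default_rng(0)
true_centers = np.array([[40.7, -74.0], [34.0, -118.2], [41.9, -87.6]])
coords = np.vstack(
    [c + rng.normal(scale=0.3, size=(50, 2)) for c in true_centers]
)

# The project used k=24 on the full dataset; 3 matches this toy data.
k = 3
model = KMeans(n_clusters=k, n_init=10, random_state=0)
labels = model.fit_predict(coords)

print(model.cluster_centers_.shape)  # one (lat, lon) center per cluster
print(np.bincount(labels))           # number of points assigned to each cluster
```

The `labels` array can then be used as the color argument when plotting the points on the map, which is one way to get the color-coded view mentioned above.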
We then needed a function to calculate the Euclidean distance between the locations in a cluster and its center point, the goal being to minimize that distance and so find a location as close as possible to the outlets.

We then used silhouette_score to evaluate the clustering and achieved a score of 0.72. We also built an elbow curve to get a graphical view of the optimal number of clusters.

Another method we decided to use was DBSCAN, as a comparison with what had been done so far. It achieved a clustering score of 0.3478.

### Usage

You can run the corresponding notebook to tune the hyperparameters and measure the performance of our model. Otherwise, you only have to run the main.ipynb file, which computes and displays everything.

### Steps

- Plot the US map.
- Visualize our data on this map.
- Plot a dendrogram of our data to help us decide the appropriate clustering resolution.
- Compare and analyse different clustering methods using intrinsic analysis to decide on a chosen method.
- Choose a centroid/address to live at.

### Libraries and environment

```bash
pip install -r requirements.txt
pip install scikit-learn
pip install pandas
pip install matplotlib
pip install geopandas
pip install numpy
pip install seaborn
```

### Authors

`Abdellah El Ghilbzouri` `Imad HajRashid` `Jonathan Decleire` `Ousmane Diop`
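
### Example: distance to a cluster center

The distance-to-center calculation mentioned in the description could look roughly like this. It is a sketch only: the function name `distances_to_center` and the sample coordinates are hypothetical, and the center is taken as the mean of the points, which is how KMeans defines a centroid. Note that Euclidean distance on raw latitude/longitude is measured in degrees, so it only approximates ground distance.

```python
import numpy as np

def distances_to_center(points, center):
    """Euclidean distance from each (lat, lon) row to a single center."""
    points = np.asarray(points, dtype=float)
    center = np.asarray(center, dtype=float)
    return np.linalg.norm(points - center, axis=1)

# Hypothetical coordinates for the stores of one cluster.
cluster_points = np.array([[40.0, -75.0], [40.3, -75.4], [39.7, -74.6]])

# Centroid of the cluster (mean latitude, mean longitude).
center = cluster_points.mean(axis=0)

d = distances_to_center(cluster_points, center)

# The store nearest to the center is the one with the smallest distance.
closest = cluster_points[np.argmin(d)]
print(d)
print(closest)
```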