EY-Data-Science-Competition

Category: Other
Development tool: Jupyter Notebook
File size: 0KB
Downloads: 0
Upload date: 2019-07-30 17:38:26
Uploader: sh-1993
Description: Solution for the EY NextWave Data Science Competition 2019

File list:
challenge.ipynb (135003, 2019-07-30)
challenge.py (10735, 2019-07-30)
data/ (0, 2019-07-30)
data/processed/ (0, 2019-07-30)
data/processed/train_dataset_complete.csv (134, 2019-07-30)
data/raw/ (0, 2019-07-30)
data/raw/data_test.csv (133, 2019-07-30)
data/raw/data_train.csv (134, 2019-07-30)
reports/ (0, 2019-07-30)
reports/Challenge_Manual.pdf (613422, 2019-07-30)
reports/EY_PRESENTATION.pdf (6134427, 2019-07-30)
reports/images/ (0, 2019-07-30)
reports/images/change_grain.png (91368, 2019-07-30)
reports/images/distance_center_seaborn.png (32869, 2019-07-30)

# EY-Data-Science-Competition

Solution for the EY NextWave Data Science Competition 2019.

## Challenge

This competition was about improving smart cities with data science. The goal was to predict how many people are in the city center of Atlanta between 15:00 and 16:00, using geolocation records.

## Results

The code in this repository is a clean, summarized version of my work. I put here only the main approaches used in the final submitted version; other techniques were omitted.

Final ranking, by F1-score:

|Leaderboard |Score   |Brazilian Ranking |Global Ranking |
|------------|--------|------------------|---------------|
|Public      |0.89547 |1st               |17th           |
|Private     |0.88571 |2nd               |25th           |

I presented my project in the Brazilian final and finished in 4th place. It was an amazing experience!

## Methodology

### Change the grain: turn trajectories into features

First, the most important approach. In the initial data set, the grain is the trajectory, but each prediction is made per hash (user or device). So, to build the final data set, I treated each hash's trajectories as a sequence and turned them into features, indexed in descending order. The intention is to preserve the order of the trajectories relative to the last one and give more importance to the most recent ones (a minimal sketch of this reshaping appears after the feature lists below).

![Approach](https://github.com/miltongneto/EY-Data-Science-Competition/blob/master/reports/images/change_grain.png)

Each trajectory carries several pieces of information, which generate many attributes, or columns, to be more exact.

### Feature Engineering

This phase is very important for the final result: creating good features can improve performance, and it is where human knowledge enters the model. Most features were created per trajectory. Note that for the last trajectory (traj_last) some features cannot be created, because its exit point is unknown (a posteriori information).

#### Standard features

These features are called "standard" because they keep the raw information as-is:

- Vmax
- Vmin
- Vmean
- x_entry
- y_entry
- x_exit
- y_exit

#### Time features

- Duration of the trajectory
- Hour and minute
- Period of the day: night, early morning, morning, afternoon

#### Working with the points

From the entry and exit points, several features were created by manipulating the coordinates and adding human knowledge to answer the main question: "is the user (hash) in the center?". Again, for traj_last some of these features cannot be created, because the exit point is unknown.

![Points on Map](https://github.com/miltongneto/EY-Data-Science-Competition/blob/master/reports/images/distance_center_seaborn.png)

- Is it in the center?
- Travelled distance
- Distance from the center
- Distance from the boundary of the center
- Distance from the center point
- Approach to the center
- Approach to the center point
- My own velocity mean: My_Vmean = travelled distance ÷ duration of the trajectory

#### Aggregation features

Each user (hash) has a collection of trajectories, and these become features. But some information about the set as a whole is also important, so aggregation functions were applied to extract it:

- Number of trajectories
- Average duration
- Average distance
- Average velocity (using My_Vmean)

The sketches below illustrate the feature computations described in this section.
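First, the change of grain. This is a minimal sketch on a toy trajectory-level frame; the column names are illustrative, not the competition's exact schema:

```python
import pandas as pd

# Toy trajectory-level frame: one row per trajectory, several per hash.
traj_df = pd.DataFrame({
    "hash":    ["u1", "u1", "u1", "u2", "u2"],
    "x_entry": [1.0, 2.0, 3.0, 5.0, 6.0],
    "vmean":   [10.0, 12.0, 9.0, 20.0, 18.0],
})

# Index trajectories per hash in DESCENDING order, so the last
# trajectory always gets index 0 and stays aligned across hashes.
traj_df["traj_idx"] = traj_df.groupby("hash").cumcount(ascending=False)

# Pivot to one row per hash: each (feature, trajectory index) pair
# becomes its own column; hashes with fewer trajectories get NaN.
wide = traj_df.pivot(index="hash", columns="traj_idx",
                     values=["x_entry", "vmean"])
wide.columns = [f"{feat}_traj{i}" for feat, i in wide.columns]
print(wide)
```

Indexing from the last trajectory backwards is what keeps traj_last (index 0) aligned across hashes with different numbers of trajectories.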
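For the time features, a hedged sketch. It assumes hypothetical `time_entry` / `time_exit` datetime columns, and the four-period bucketing shown is one possible choice, not necessarily the exact one used in the competition:

```python
import pandas as pd

def time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive the time features listed above (sketch)."""
    out = df.copy()
    # Trajectory duration in seconds.
    out["duration"] = (out["time_exit"] - out["time_entry"]).dt.total_seconds()
    out["hour"] = out["time_entry"].dt.hour
    out["minute"] = out["time_entry"].dt.minute
    # One possible bucketing of the entry hour into periods of the day.
    out["period"] = pd.cut(
        out["hour"],
        bins=[-1, 5, 11, 17, 23],
        labels=["early_morning", "morning", "afternoon", "night"],
    )
    return out
```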
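For the point features, a sketch of some of the distance-based columns. The center coordinates and radius below are placeholders (the real center area is defined in reports/Challenge_Manual.pdf), and a circular center is assumed purely for simplicity:

```python
import numpy as np
import pandas as pd

# Placeholder center definition; the real challenge specifies the
# city-center area precisely in the manual.
CENTER_X, CENTER_Y = 0.0, 0.0
CENTER_RADIUS = 1.0

def point_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive some of the point features listed above (sketch)."""
    out = df.copy()
    # Straight-line distance between entry and exit points.
    out["travelled_distance"] = np.hypot(out["x_exit"] - out["x_entry"],
                                         out["y_exit"] - out["y_entry"])
    # Distance from the center point at entry and at exit.
    out["dist_center_entry"] = np.hypot(out["x_entry"] - CENTER_X,
                                        out["y_entry"] - CENTER_Y)
    out["dist_center_exit"] = np.hypot(out["x_exit"] - CENTER_X,
                                       out["y_exit"] - CENTER_Y)
    # Positive when the trajectory moved toward the center point.
    out["approach_center"] = out["dist_center_entry"] - out["dist_center_exit"]
    # Is the exit point inside the (assumed circular) center?
    out["is_center"] = (out["dist_center_exit"] <= CENTER_RADIUS).astype(int)
    # My_Vmean: travelled distance divided by duration (see time_features).
    out["my_vmean"] = out["travelled_distance"] / out["duration"]
    return out
```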
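And for the aggregation features, a sketch of the per-hash aggregates, reusing the hypothetical column names from the sketches above:

```python
import pandas as pd

# traj_df is the trajectory-level frame after the feature steps above.
agg_features = (
    traj_df.groupby("hash")
    .agg(
        n_trajectories=("hash", "size"),
        mean_duration=("duration", "mean"),
        mean_distance=("travelled_distance", "mean"),
        mean_velocity=("my_vmean", "mean"),
    )
    .reset_index()
)
```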
### Modeling

This is the machine learning phase. Many classifiers and techniques were tested; the final model was built with XGBoost (Extreme Gradient Boosting), using the following parameters (a sketch follows the list):

- n_estimators=1000
- learning_rate=0.05
- max_depth=6
- colsample_bytree=0.9
- subsample=0.8
- min_child_weight=4
- reg_alpha=0.005
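A minimal sketch of that configuration using the scikit-learn-style xgboost API; `X_train`, `y_train` and `X_test` are assumed to be the already-prepared feature matrices and labels:

```python
from xgboost import XGBClassifier

# Final parameters listed above; everything else is left at defaults.
model = XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=6,
    colsample_bytree=0.9,
    subsample=0.8,
    min_child_weight=4,
    reg_alpha=0.005,
)
model.fit(X_train, y_train)           # assumed prepared training data
predictions = model.predict(X_test)   # assumed prepared test features
```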

### Notes

> - The data set after the preparation phase is in [data/processed/](https://github.com/miltongneto/EY-Data-Science-Competition/blob/master/data/processed/). You can use it if you just want to investigate the modeling phase and do not want to wait for the data preparation processing time.
> - This repository contains both a Python script and a notebook. They do the same thing, but the notebook explains the work in more detail, while the script is faster.
> - The challenge manual and the final presentation are in [reports/](https://github.com/miltongneto/EY-Data-Science-Competition/blob/master/reports/).