EY-DS-Competition

所属分类:数学计算
开发工具:Python
文件大小:0KB
下载次数:0
上传日期:2022-06-21 21:58:37
上 传 者sh-1993
说明:  Vopaaz和Xiaochr2019安永Nextwave数据科学挑战解决方案。
(Solution for 2019 EY Nextwave Data Science Challenge by Vopaaz and Xiaochr.)

文件列表:
Doc/ (0, 2020-09-25)
Doc/Solution/ (0, 2020-09-25)
Doc/Solution/Deep/ (0, 2020-09-25)
Doc/Solution/Deep/CNN.html (18583, 2020-09-25)
Doc/Solution/Deep/FullMatrixSolution.html (10015, 2020-09-25)
Doc/Solution/Deep/index.html (6457, 2020-09-25)
Doc/Solution/FinalResult.html (9467, 2020-09-25)
Doc/Solution/Machine/ (0, 2020-09-25)
Doc/Solution/Machine/Coordination.html (46942, 2020-09-25)
Doc/Solution/Machine/Preprocessing.html (17522, 2020-09-25)
Doc/Solution/Machine/Training.html (21089, 2020-09-25)
Doc/Solution/Machine/index.html (7432, 2020-09-25)
Doc/Solution/Machine/params.html (8887, 2020-09-25)
Doc/Solution/deeputil/ (0, 2020-09-25)
Doc/Solution/deeputil/MatrixProvider.html (19859, 2020-09-25)
Doc/Solution/deeputil/Matrixfy.html (27635, 2020-09-25)
Doc/Solution/deeputil/ValueFunc.html (6542, 2020-09-25)
Doc/Solution/deeputil/index.html (6835, 2020-09-25)
Doc/Solution/index.html (7378, 2020-09-25)
Doc/Solution/util/ (0, 2020-09-25)
Doc/Solution/util/BaseUtil.html (25789, 2020-09-25)
Doc/Solution/util/DFPreparation.html (20564, 2020-09-25)
Doc/Solution/util/Labelling.html (9566, 2020-09-25)
Doc/Solution/util/NaiveFeature.html (36992, 2020-09-25)
Doc/Solution/util/PathFilling.html (12132, 2020-09-25)
Doc/Solution/util/Submission.html (12971, 2020-09-25)
Doc/Solution/util/index.html (8236, 2020-09-25)
Doc/Solution/util/initLogging.html (7329, 2020-09-25)
Solution/ (0, 2020-09-25)
Solution/Deep/ (0, 2020-09-25)
Solution/Deep/CNN.py (4554, 2020-09-25)
Solution/Deep/FullMatrixSolution.py (2080, 2020-09-25)
Solution/Deep/__init__.py (102, 2020-09-25)
Solution/FinalResult.py (1483, 2020-09-25)
Solution/Machine/ (0, 2020-09-25)
Solution/Machine/Coordination.py (11806, 2020-09-25)
Solution/Machine/Preprocessing.py (2246, 2020-09-25)
Solution/Machine/Training.py (3430, 2020-09-25)
... ...

# EY Nextwave Data Science Challenge 2019 Solution Solution for 2019 EY Nextwave Data Science Challenge by Vopaaz and Xiaochr. - [EY Nextwave Data Science Challenge 2019 Solution](https://github.com/Vopaaz/EY-DS-Competition/blob/master/#ey-nextwave-data-science-challenge-2019-solution) - [Getting the Final Result](https://github.com/Vopaaz/EY-DS-Competition/blob/master/#getting-the-final-result) - [Prerequisites](https://github.com/Vopaaz/EY-DS-Competition/blob/master/#prerequisites) - [Environment](https://github.com/Vopaaz/EY-DS-Competition/blob/master/#environment) - [Installing Dependencies](https://github.com/Vopaaz/EY-DS-Competition/blob/master/#installing-dependencies) - [Prepare Data](https://github.com/Vopaaz/EY-DS-Competition/blob/master/#prepare-data) - [Running](https://github.com/Vopaaz/EY-DS-Competition/blob/master/#running) - [Methodology](https://github.com/Vopaaz/EY-DS-Competition/blob/master/#methodology) - [Feature Engineering](https://github.com/Vopaaz/EY-DS-Competition/blob/master/#feature-engineering) - [Null Value Feature Handling](https://github.com/Vopaaz/EY-DS-Competition/blob/master/#null-value-feature-handling) - [Algorithm Design](https://github.com/Vopaaz/EY-DS-Competition/blob/master/#algorithm-design) - [Preprocessing](https://github.com/Vopaaz/EY-DS-Competition/blob/master/#preprocessing) - [Model Training and Selection](https://github.com/Vopaaz/EY-DS-Competition/blob/master/#model-training-and-selection) - [Module Documentation](https://github.com/Vopaaz/EY-DS-Competition/blob/master/#module-documentation) - [Approach Explored but not Used](https://github.com/Vopaaz/EY-DS-Competition/blob/master/#approach-explored-but-not-used) - [Final Presentation](https://github.com/Vopaaz/EY-DS-Competition/blob/master/#final-presentation) ## Getting the Final Result ### Prerequisites #### Environment `Python 3.7.2 64-bit` #### Installing Dependencies Use virtrual environment if necessary. ```bash $ pip install -r requirements.txt ``` ### Prepare Data Place the train dataset in `OriginalFile/data_train/data_train.csv`. Place the test dataset in `OriginalFile/data_test/data_test.csv`. ### Running ```bash $ python Solution/FinalResult.py ``` This shall take about 1-2 hours. Then the `.csv` file to be submitted can be found in directory `Result/`. Note that as we neither saved the model nor set the `random_state` variable, the produced file may be slightly different from our last submission. ## Methodology ### Feature Engineering The paths given in the datasets are not fully connected. However, logically they should be connected end to end. Thus we firstly joined all the disconnected paths and do the following feature extraction. We listed features of each device that we think may affect the prediction target (i.e. whether the last exit point of this device is within the city center or not), they are: - The difference between 3 p.m. and the starting / ending time point of the unknown path. (in seconds) - The difference between the starting and ending time point of the unknown path. (in seconds) - The max, min, average level of the distance to the central area of all the points recorded by a device. - The difference between the distance to the central area of the entry of the first path and the exit of the last known path. - The difference between the distance to the central area of the entry and the exit of the last known path. - The min, max, average level of the length of all the paths recorded by a device - The min, max, average level of the average velocity of all the paths recorded by a device - The coordinate of the start point of the unknown path All the *distance to the central area* are measured by the l1 distance to the border of the central area. There are some devices which only records one path (the path to be predicted). Hence some of the above-mentioned features cannot be extracted. They are `Null` values in the Feature Panel. We came up with several strategies to deal with them (see the [Null Value Feature Handling](https://github.com/Vopaaz/EY-DS-Competition/blob/master/#null-value-feature-handling) part). In the best prediction result, we used the `drop` strategy, that is, to remove these features. ### Null Value Feature Handling The traditional way of dealing with `Null` values are filling them with zeros or dropping them. Nevertheless, the `Null` values in our Feature Panel are not caused by common problems such as record error, but rather because the number of path recorded is only one. Based on the fact that there are considerable numbers of devices which are in this case. We develop another two strategies to deal with the `Null` values. Example DataFrame (`v` means valid value and `N` means `Null`, each column is a feature and each row is a device): ``` A B C 0 v v v 1 v v v 2 v v N 3 v v N 4 v N N 5 v N N ``` `separate_all` strategy: - Use `(0-5).A` to train the model and predict those whose non-null feature is only `A` - Use `(0-3).AB` to train the model and predict those whose non-null feature is `A, B` - Use `(0-1).ABC` to train the model and predict those whose non-null feature is `A, B, C` `separate_part` strategy: - Use `(4-5).A` to train the model and predict those whose non-null feature is only `A` - Use `(2-3).AB` to train the model and predict those whose non-null feature is `A, B` - Use `(0-1).ABC` to train the model and predict those whose non-null feature is `A, B, C` In this way we make full use of the extracted features. In some models, the `separate_all` strategy do provides a better result. ### Algorithm Design After having the features, we applied general machine learning approaches to do the classification. #### Preprocessing We used the `Isolation Forest` to detect and removed the 5% outliers in the train set, and then standardized the rest of them to have a standard normal distribution. The test data was also standardized. #### Model Training and Selection We have tried the following models: - Support Vector Machine - Random Forest - Gradient Boosting - XGBoosting For each model, we ran several runs of grid search, each adjusted 3-4 hyper-parameters and gradually narrowed down the range. The cross validation splitting was set to `k=5`. Finally, we selected the `XGBoosting` framework to provide the final submission based on the validation result. The best hyper-parameters we found are: - gamma: 0.01 - learning_rate: 0.1 - max_depth: 7 - n_estimators: 100 - others: default Although we have found the best parameters, the script we provide still runs the final round of grid search because we try to reproduct the final submission. Also, the time required is relatively acceptable. ## Module Documentation Please use the browser to open `Doc/Solution/index.html`. ## Approach Explored but not Used The feature engineering process described in [Feature Engineering](https://github.com/Vopaaz/EY-DS-Competition/blob/master/#feature-engineering) were effective, but obviously some information are still lost. We believe that if we can preserve more information, the classification result will be better. The intuitive method would be extract the location (coordinate) of each device at a series of time points. However, there are considerable numbers of devices who have only one path record. In this case, most coordinate values in the time series will be `Null`. Typical models cannot deal with it. Another approach we have come up with is to convert the paths of one device into an image, by connecting the entry and exit point of one path on the map panel with straight line. The shade of each pixel on the line is determined by the esimated time when the device is located there. This is also intuitive. We did not actually produce any image file but rather generated the equivalent matrix directly. After having these matrices (equivalent images), we apply the convolutional neural network for the classical image classification problem. The final result was not satisfing, though. The best result we got from this method is around 0.801. According to our analysis, the reason may be that CNN is designed to deal with real pictures, where the equivalent matrix is dense. Thus it can learn different local features from it. Our matrix, nevertheless, is quite sparse. It does not fit the application of CNN. The code implementing this approach are in `Solution/Deep` and `Solution/deeputil`. # Final Presentation Our presentation slides can be found at https://github.com/Vopaaz/EY-DS-Competition-Slides.

近期下载者

相关文件


收藏者