Description: the KDD CUP99 data set, a data set used for intrusion detection, with processing code attached.
File list:
* insert_into_sqlite_database.py
* KDD_pyspark.py
* LICENSE
* merge_show.py
* replace_string_to_value.py
* 论文.docx (the paper)
# How-to-process-KDD-99-dataset
## Usage introduction
### An optional detour: databases
You can insert the dataset into a database such as MySQL or SQLite3 (see `insert_into_sqlite_database.py`), but I do not recommend doing this.
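If you want to try it anyway, here is a minimal sketch of the idea, assuming a SQLite database; the table layout, column names, and input path are my own illustrative choices, not necessarily what `insert_into_sqlite_database.py` does.
```
# Hypothetical sketch: load KDD-99 CSV rows into SQLite.
# Table name, column names, and input path are assumptions for illustration.
import csv
import sqlite3

conn = sqlite3.connect("kdd99.db")
cur = conn.cursor()
# A KDD-99 record has 41 features plus a label; store everything as text
# here for simplicity and convert types later as needed.
cols = ", ".join("f%d TEXT" % i for i in range(41)) + ", label TEXT"
cur.execute("CREATE TABLE IF NOT EXISTS kdd (%s)" % cols)

placeholders = ", ".join("?" * 42)
with open("kddcup.data_10_percent_corrected") as f:
    for row in csv.reader(f):
        cur.execute("INSERT INTO kdd VALUES (%s)" % placeholders, row)

conn.commit()
conn.close()
```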
## Replacing the KDD-99 string fields with numeric values
The original data looks like this:
```
0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.
```
We can replace the strings at positions 2, 3, and 4 (the `protocol_type`, `service`, and `flag` fields), and the result will look like this:
```
0,1,22,10,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.
```
If we do this, the accuracy of clustering is `improved`:
when I use the initial data, without replacing the strings with values, the clustering accuracy is `97.79%`,
and when I use the data with the strings replaced by values, the clustering accuracy is `***.47%`.
This confirms that the accuracy of clustering can be improved with this method.
The replacement method we used is simple. For the `protocol_type` field:
```
#example TCP->1
#example UDP->2
#example ICMP->3
```
and for the `service` field:
```
#example aol->1
#example Z39_50->70
```
and for the `flag` field:
```
#example OTH->1
#example SH->11
```
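Putting the three mappings together, a minimal self-contained sketch of the replacement looks like this. Only the entries quoted above are shown (the `http -> 22` and `SF -> 10` values come from the example rows earlier); the full tables live in `replace_string_to_value.py`.
```
# Sketch of the string-to-value replacement; only a few mapping entries
# are shown here, the full tables are in replace_string_to_value.py.
protocol_map = {"tcp": 1, "udp": 2, "icmp": 3}
service_map = {"aol": 1, "http": 22, "Z39_50": 70}  # ... one entry per service
flag_map = {"OTH": 1, "SF": 10, "SH": 11}           # ... one entry per flag

def replace_line(line):
    fields = line.strip().split(",")
    # Positions 2, 3, 4 (1-based) hold protocol_type, service, and flag.
    fields[1] = str(protocol_map[fields[1]])
    fields[2] = str(service_map[fields[2]])
    fields[3] = str(flag_map[fields[3]])
    return ",".join(fields)

print(replace_line("0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,normal."))
# -> 0,1,22,10,181,5450,0,0,0,0,0,1,0,normal.
```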
---
### Before you start
Make sure the KDD dataset and the program `replace_string_to_value.py` are in the same directory,
then open the program in an IDE
and add your file name to `You_file_path` at the beginning.
Then run the program and enter your file path and name while it is running.
## Next, we use Spark to cluster the data
We use `KDD_pyspark.py`.
I strongly recommend that you do not set the `max_k` value too large,
because if you do,
### the later processing will take a lot of time
So you can try a `max_k` of `10` or `30`.
Unless you have sufficient time and computing resources, setting `max_k` to 60 or 120 is not recommended.
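As a rough illustration of the clustering step, here is my own minimal sketch of k-means over the numeric data with `pyspark.mllib`; it is not the exact contents of `KDD_pyspark.py`, and the input path is an assumption.
```
# Minimal sketch: k-means over the numeric KDD-99 data for several k.
# The input path is an illustrative assumption.
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="KDD99-KMeans")
max_k = 10  # keep this small; large values make the later steps very slow

# Each line is a numeric CSV record produced by replace_string_to_value.py;
# drop the trailing label column before clustering.
data = sc.textFile("kdd99_numeric.csv")
points = data.map(lambda line: [float(x) for x in line.split(",")[:-1]]).cache()

for k in range(2, max_k + 1):
    model = KMeans.train(points, k, maxIterations=20)
    cost = model.computeCost(points)  # within-cluster sum of squared errors
    print("k=%d cost=%.2f" % (k, cost))

sc.stop()
```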
If you run the program `KDD_pyspark.py`, two directories are generated in the main directory of Spark:
* the `sample_standardized` directory, containing files named `re0`, `re1`, `re2`, ...
* and the `labels` directory, containing files named `la0`, `la1`, `la2`, ...
* They must be named in this order.
## Now put them all in another directory
You can name this directory `Text` or `Workstation`.
Put the program `merge_show.py` into the `Text` or `Workstation` directory,
together with the files `re0` to `rex` and `la0` to `lax`,
and run `merge_show.py`.
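For reference, a minimal sketch of merging the part files in order might look like this; the output file names are my own assumptions, not necessarily what `merge_show.py` produces.
```
# Sketch: concatenate the re*/la* part files in numeric order.
# The output file names are illustrative assumptions.
import glob
import re

def merge(prefix, out_name):
    # Sort re0, re1, ..., re10 numerically rather than lexically.
    names = sorted(glob.glob(prefix + "*"),
                   key=lambda n: int(re.sub(r"\D", "", n)))
    with open(out_name, "w") as out:
        for name in names:
            with open(name) as part:
                out.write(part.read())

merge("re", "merged_samples.txt")
merge("la", "merged_labels.txt")
```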
### It will be a long process...
### In the meantime you can
* Watch a movie you have wanted to see for a long time
* Have some fun with your family
* Learn a new programming language
* ....
Once you have finished merging all the data,
## Let's move on to the next step
put `statistical_result.py` in the directory,
then run the program and output the result.
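How the statistics are computed is up to `statistical_result.py`; as a hedged guess at the kind of calculation behind the accuracy figures quoted above, one common approach is to give each cluster its majority label and count the points that match it:
```
# Sketch: score a clustering against the true labels by assigning each
# cluster its majority label. Whether statistical_result.py computes the
# accuracy exactly this way is an assumption on my part.
from collections import Counter, defaultdict

def clustering_accuracy(cluster_ids, labels):
    by_cluster = defaultdict(list)
    for cid, label in zip(cluster_ids, labels):
        by_cluster[cid].append(label)
    # A point counts as correct if it carries its cluster's majority label.
    correct = sum(Counter(members).most_common(1)[0][1]
                  for members in by_cluster.values())
    return correct / float(len(labels))

print(clustering_accuracy([0, 0, 1, 1, 1],
                          ["normal.", "normal.", "smurf.", "smurf.", "normal."]))
# -> 0.8
```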
## That's all
* If you have any questions about this introduction, send me an email: super_big_hero@sina.com
* See you!
* For concrete implementation details, see my paper from the year before last (a copy, 论文.docx, is included here; it is also the first article in the July 2016 issue of 信息安全学报) (2018-1-12)