semi-supervised-svm: Data science assignment. Semi-supervised classification algorithm

  • j1_373707
    Author
  • 2.4KB
    File size
  • zip
    File format
  • 2022-06-14 04:04
    Upload date
semi-supervised-svm-master.zip
  • semi-supervised-svm-master
  • TrainClassifier.py
    2.8KB
  • README.md
    2.5KB
Description
# semi-supervised-svm

Data science assignment solution. Implementation of a semi-supervised classifier using Support Vector Machines as the base classifier. The dataset is randomly generated in the code.

## Dependencies:

- numpy
- sklearn

## Classification problem

Given data:

- a large amount of unlabeled data
- a small amount of labeled data
- a human expert able to correctly label any sample in the unlabeled dataset for a cost proportional to the number of newly labeled samples

Goal:

- minimize the cost
- improve the accuracy of the classifier

## Solution

The solution adds the predicted labels with the highest confidence to the labeled dataset. Predictions with the lowest confidence indicate that the classifier needs help from the human expert; the expert's true labels for those samples are added to the dataset and the cost is incremented. The number of hints from the human expert cannot exceed the initial number of labeled samples, so the expert can at most double the amount of truly labeled data.

The algorithm terminates if the accuracy is 100%, if the cost reaches the limit explained above, or if no samples have been added to the labeled dataset.

## Example

Setup:

- Dataset: 10000 samples, 3 classes, 2 clusters per class, 3 informative features.
- Max. number of iterations: 100
- Percentage of unlabeled data in the dataset: 98%
- Min. confidence at which the predicted label is considered accurate: 98%
- Max. confidence at which the human expert will be asked to replace a bad label: 36.33%
- Number of folds for cross-validation: 4

Output:

```
SVM (fully labeled): 0.82
Labeled data at the start: 192
SVM (small labeled): 0.66
SVM (with added pseudo-labels): 0.60 Cost: 46
SVM (with added pseudo-labels): 0.75 Cost: 60
SVM (with added pseudo-labels): 0.75 Cost: 74
SVM (with added pseudo-labels): 0.75 Cost: 82
SVM (with added pseudo-labels): 0.78 Cost: 91
SVM (with added pseudo-labels): 0.77 Cost: 101
SVM (with added pseudo-labels): 0.76 Cost: 109
SVM (with added pseudo-labels): 0.77 Cost: 111
SVM (with added pseudo-labels): 0.76 Cost: 115
SVM (with added pseudo-labels): 0.75 Cost: 117
SVM (with added pseudo-labels): 0.75 Cost: 118
SVM (with added pseudo-labels): 0.76 Cost: 122
SVM (with added pseudo-labels): 0.77 Cost: 123
SVM (with added pseudo-labels): 0.77 Cost: 126
SVM (with added pseudo-labels): 0.78 Cost: 127
SVM (with added pseudo-labels): 0.77 Cost: 128
SVM (with added pseudo-labels): 0.77 Cost: 128
SVM (with added pseudo-labels): 0.77 Cost: 128
Accuracy cannot be further improved!
Improvement: 11.32 %
```
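
## Sketch of the loop

The loop described under Solution can be illustrated with a minimal Python sketch. This is not the code from `TrainClassifier.py`, which is not reproduced on this page. The function name `self_train`, the use of `SVC(probability=True)` to obtain confidence scores, the `y_oracle` array standing in for the human expert, and the smaller dataset size are all assumptions made for the example. The accuracy-based stopping rule and the 4-fold cross-validation are left out for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC


def self_train(X_lab, y_lab, X_unl, y_oracle, hi=0.98, lo=0.3633, max_iter=100):
    """Hypothetical sketch: grow the labeled set with pseudo-labels and expert hints.

    y_oracle holds the true labels of X_unl and plays the role of the human expert.
    Assumes hi > lo, so a sample is never both pseudo-labeled and sent to the expert.
    """
    budget = len(y_lab)  # expert hints may not exceed the initial labeled count
    cost = 0
    for _ in range(max_iter):
        if len(X_unl) == 0 or cost >= budget:
            break
        clf = SVC(probability=True, random_state=0).fit(X_lab, y_lab)
        proba = clf.predict_proba(X_unl)
        conf = proba.max(axis=1)                   # confidence of the top class
        pred = clf.classes_[proba.argmax(axis=1)]  # predicted (pseudo) labels

        confident = conf >= hi
        # The least confident samples go to the expert, within the remaining budget.
        ask = np.flatnonzero(conf <= lo)
        ask = ask[np.argsort(conf[ask])][: budget - cost]
        asked = np.zeros(len(X_unl), dtype=bool)
        asked[ask] = True

        take = confident | asked
        if not take.any():
            break  # nothing was added, so another round would not change the model

        new_y = np.where(asked, y_oracle, pred)  # true label if asked, else pseudo-label
        X_lab = np.vstack([X_lab, X_unl[take]])
        y_lab = np.concatenate([y_lab, new_y[take]])
        X_unl, y_oracle = X_unl[~take], y_oracle[~take]
        cost += int(asked.sum())
    return X_lab, y_lab, cost


# Assumed setup: a smaller dataset than in the example above, 2% labeled at the start.
X, y = make_classification(n_samples=2000, n_features=5, n_informative=3,
                           n_redundant=0, n_classes=3, n_clusters_per_class=2,
                           random_state=0)
idx = np.random.default_rng(0).permutation(len(y))
n_lab = 40
lab, unl = idx[:n_lab], idx[n_lab:]
X_new, y_new, cost = self_train(X[lab], y[lab], X[unl], y[unl], max_iter=10)
print("labeled samples:", len(y_new), "cost:", cost)
```

The thresholds `hi` and `lo` default to the 98% and 36.33% values from the example setup. With three classes the top-class probability is never below one third, so a `lo` just above that value only sends near-chance predictions to the expert.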