OCR_DataSet-master
OCR 

Category: Pattern recognition (vision, speech, etc.)
Development tool: Python
File size: 9366 KB
Downloads: 1
Upload date: 2021-03-19 11:40:27
Uploader: 7089762
Description: Text detection is a form of object detection, and because object detection algorithms have advanced quickly, text detection algorithms have advanced quickly as well. Early text detection borrowed ideas from object detection, and YOLO V3 and Faster R-CNN achieved reasonable results.

File list:
convert (0, 2020-06-19)
convert\__init__.py (73, 2020-06-19)
convert\crop_rec.py (4599, 2020-06-19)
convert\det (0, 2020-06-19)
convert\det\ArtS2json.py (1869, 2020-06-19)
convert\det\LSVT2json.py (1429, 2020-06-19)
convert\det\MTWI20182json.py (1742, 2020-06-19)
convert\det\RcCTS2json.py (3184, 2020-06-19)
convert\det\SROIE2json.py (1809, 2020-06-19)
convert\det\SynthText800k2json.py (2667, 2020-06-19)
convert\det\__init__.py (73, 2020-06-19)
convert\det\check_json.py (648, 2020-06-19)
convert\det\coco_text.py (10249, 2020-06-19)
convert\det\coco_text2json.py (2574, 2020-06-19)
convert\det\convert2jpg.py (538, 2020-06-19)
convert\det\icdar20152json.py (1773, 2020-06-19)
convert\det\icdar2017rctw2json.py (1674, 2020-06-19)
convert\det\mlt20192json.py (1859, 2020-06-19)
convert\move_imgs.py (607, 2020-06-19)
convert\rec (0, 2020-06-19)
convert\rec\360w2txt.py (1112, 2020-06-19)
convert\rec\__init__.py (73, 2020-06-19)
convert\rec\baidu2txt.py (1141, 2020-06-19)
convert\rec\mjsyhtn2txt.py (1023, 2020-06-19)
convert\simsun.ttc (18214472, 2020-06-19)
convert\utils.py (4529, 2020-06-19)
dataset (0, 2020-06-19)
dataset\__init__.py (73, 2020-06-19)
dataset\convert_det2lmdb.py (3646, 2020-06-19)
dataset\det.py (4015, 2020-06-19)
dataset\det_lmdb.py (3351, 2020-06-19)
dataset\rec.py (1947, 2020-06-19)
gt_detection.json (1050, 2020-06-19)
ocr公开数据集信息.xlsx (12388, 2020-06-19)

# Todo

- [x] Provide a Baidu Cloud link to the datasets
- [x] Convert the datasets to a unified format (detection and recognition)
    - [x] icdar2015
    - [x] MLT2019
    - [x] COCO-Text_v2
    - [x] ReCTS
    - [x] SROIE
    - [x] ArT
    - [x] LSVT
    - [x] Synth800k
    - [x] icdar2017rctw
    - [x] MTWI 2018
    - [x] Baidu Chinese Scene Text Recognition
    - [x] mjsynth
    - [x] Synthetic Chinese String Dataset (3.6 million Chinese samples)
- [x] Provide reading scripts

# Download

[Baidu Cloud](https://pan.baidu.com/s/1mRepVEvMa-U4e9ThiskVXg) extraction code: 9s4x

# Datasets

| Dataset | Homepage | Use | Data | Annotation format | Notes |
| --- | --- | --- | --- | --- | --- |
| ICDAR2015 | https://rrc.cvc.uab.es/?ch=4 | Detection & recognition | Language: English; train: 1,000; test: 500 | `x1, y1, x2, y2, x3, y3, x4, y4, transcription` | Coordinates: x1, y1, x2, y2, x3, y3, x4, y4; transcription: text inside the box |
| MLT2019 | https://rrc.cvc.uab.es/?ch=15 | Detection & recognition | Language: mixed; train: 10,000; test: 10,000 | `x1,y1,x2,y2,x3,y3,x4,y4,script,transcription` | Coordinates: x1, y1, x2, y2, x3, y3, x4, y4; script: language of the text; transcription: text inside the box |
| COCO-Text_v2 | https://bgshih.github.io/cocotext/ | Detection & recognition | Language: mixed; train: 43,686; validation: 10,000; test: 10,000 | json | |
| ReCTS | https://rrc.cvc.uab.es/?ch=12&com=introduction | Detection & recognition | Language: mixed; train: 20,000; test: 5,000 | `{"chars": [{"points": [x1,y1,x2,y2,x3,y3,x4,y4], "transcription": "trans1", "ignore": 0}, {"points": [x1,y1,x2,y2,x3,y3,x4,y4], "transcription": "trans2", "ignore": 0}], "lines": [{"points": [x1,y1,x2,y2,x3,y3,x4,y4], "transcription": "trans3", "ignore": 0}]}` | points: x1,y1,x2,y2,x3,y3,x4,y4; chars: character-level annotations; lines: line-level annotations; transcription: text inside the box; ignore: 0 = do not ignore, 1 = ignore |
| SROIE | https://rrc.cvc.uab.es/?ch=13 | Detection & recognition | Language: English; train: 699; test: 400 | `x1, y1, x2, y2, x3, y3, x4, y4, transcription` | Coordinates: x1, y1, x2, y2, x3, y3, x4, y4; transcription: text inside the box |
| ArT (includes Total-Text and SCUT-CTW1500) | https://rrc.cvc.uab.es/?ch=14 | Detection & recognition | Language: mixed; train: 5,603; test: 4,563 | `{"gt_1": [{"points": [[x1, y1], [x2, y2], ..., [xn, yn]], "transcription": "trans1", "language": "Latin", "illegibility": false}, {"points": [[x1, y1], [x2, y2], ..., [xn, yn]], "transcription": "trans2", "language": "Chinese", "illegibility": false}]}` | points: x1,y1,x2,y2,x3,y3,x4,y4...xn,yn; transcription: text inside the box; language: language of the text; illegibility: whether the text is blurred |
| LSVT | https://rrc.cvc.uab.es/?ch=16 | Detection & recognition | Language: mixed; fully annotated: train: 30,000, test: 20,000; text-only annotation: 400,000 | `{"gt_1": [{"points": [[x1, y1], [x2, y2], ..., [xn, yn]], "transcription": "trans1", "illegibility": false}, {"points": [[x1, y1], [x2, y2], ..., [xn, yn]], "transcription": "trans2", "illegibility": false}]}` | points: x1,y1,x2,y2,x3,y3,x4,y4...xn,yn; transcription: text inside the box; illegibility: whether the text is blurred |
| Synth800k | http://www.robots.ox.ac.uk/~vgg/data/scenetext/ | Detection & recognition | Language: English; 800,000 | imnames, wordBB, charBB, txt | imnames: file names; wordBB: `2*4*n`, text boxes in each image; charBB: `2*4*n`, character boxes in each image; txt: text strings in each image |
| icdar2017rctw | https://blog.csdn.net/wl1710582732/article/details/89761818 | Detection & recognition | Language: mixed; train: 8,034; test: 4,229 | `x1,y1,x2,y2,x3,y3,x4,y4,<recognition difficulty>,transcription` | Coordinates: x1, y1, x2, y2, x3, y3, x4, y4; transcription: text inside the box |
| MTWI 2018 | [Recognition: https://tianchi.aliyun.com/competition/entrance/231684/introduction](https://tianchi.aliyun.com/competition/entrance/231684/introduction) [Detection: https://tianchi.aliyun.com/competition/entrance/231685/introduction](https://tianchi.aliyun.com/competition/entrance/231685/introduction) | Detection & recognition | Language: mixed; train: 10,000; test: 10,000 | `x1, y1, x2, y2, x3, y3, x4, y4, transcription` | Coordinates: x1, y1, x2, y2, x3, y3, x4, y4; transcription: text inside the box |
| Baidu Chinese Scene Text Recognition | https://aistudio.baidu.com/aistudio/competition/detail/20 | Recognition | Language: mixed; train: not counted; test: not counted | `h,w,name,value` | h: image height; w: image width; name: image file name; value: text in the image |
| mjsynth | http://www.robots.ox.ac.uk/~vgg/data/text/ | Recognition | Language: English; 9,000,000 | - | - |
| Synthetic Chinese String Dataset (3.6 million Chinese samples) | Link: https://pan.baidu.com/s/1jefn4Jh4jHjQdiWoanjKpQ extraction code: spyi | Recognition | Language: mixed; 300k | - | - |
| English recognition data bundle (https://github.com/clovaai/deep-text-recognition-benchmark); training: MJSynth and SynthText; validation: IIIT, SVT, IC03, IC13, IC15, SVTP, CUTE | Link: https://pan.baidu.com/s/1KSNLv4EY3zFWHpBYlpFCBQ extraction code: rryk | Recognition | Language: English | - | - |

# Data generation tools

https://github.com/TianzhongSong/awesome-SynthText

# Dataset reading scripts

- [Detection reading script](dataset/det.py)
- [Recognition reading script](dataset/rec.py)
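
Several datasets in the table (ICDAR2015, SROIE, MTWI 2018) share the comma-separated `x1, y1, x2, y2, x3, y3, x4, y4, transcription` annotation line, and MLT2019 and icdar2017rctw add one extra field before the transcription. The snippet below is a minimal sketch of how such a line could be parsed. It is not the code in `convert/det`; the function name `parse_quad_line`, the `n_meta` parameter, and the returned dictionary layout are assumptions made for illustration, and the repository's own unified JSON format may differ.

```python
from typing import Dict, List, Tuple, Union

Point = Tuple[float, float]


def parse_quad_line(line: str, n_meta: int = 0) -> Dict[str, Union[List[Point], List[str], str]]:
    """Parse one 'x1,y1,...,x4,y4[,meta...],transcription' annotation line.

    n_meta (hypothetical parameter) is the number of extra fields that sit
    between the eight coordinates and the transcription, e.g. 1 for the
    'script' field of MLT2019 or the difficulty field of icdar2017rctw.
    """
    # Some label files start with a UTF-8 byte-order mark; drop it if present.
    cleaned = line.lstrip("\ufeff").strip()

    # Limit the number of splits so commas inside the transcription survive.
    parts = cleaned.split(",", 8 + n_meta)
    if len(parts) < 9 + n_meta:
        raise ValueError(f"expected at least {9 + n_meta} fields, got {len(parts)}")

    coords = [float(p) for p in parts[:8]]
    points = [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]
    meta = [p.strip() for p in parts[8:8 + n_meta]]
    transcription = parts[8 + n_meta].strip()

    return {"points": points, "meta": meta, "transcription": transcription}


if __name__ == "__main__":
    # Illustrative lines only, not taken from any dataset.
    print(parse_quad_line("10, 20, 110, 20, 110, 60, 10, 60, hello, world"))
    print(parse_quad_line("10,20,110,20,110,60,10,60,Latin,hello", n_meta=1))
```

Capping the split count keeps a transcription that itself contains commas in one piece, which a plain `split(",")` would break apart.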
