mm-ocr

Category: Content generation
Development tool: Shell
File size: 1318KB
Downloads: 0
Upload date: 2021-09-18 15:49:56
Uploader: sh-1993
Description: no intro
(This project is an image-to-text recognition implementation for the Myanmar language using the Tesseract 4.0 OCR engine.)

File list:
langdata_lstm (0, 2021-09-18)
langdata_lstm\01_train_font (0, 2021-09-18)
langdata_lstm\01_train_font\Myanmar3_V1.358.ttf (224661, 2021-09-18)
langdata_lstm\01_train_font\NotoSansMyanmar-Bold.ttf (154420, 2021-09-18)
langdata_lstm\01_train_font\NotoSansMyanmar-Regular.ttf (153036, 2021-09-18)
langdata_lstm\01_train_font\Padauk-Bold.ttf (570813, 2021-09-18)
langdata_lstm\01_train_font\Padauk-Regular.ttf (545505, 2021-09-18)
langdata_lstm\01_train_font\Pyidaungsu-2.5.3_Bold.ttf (234480, 2021-09-18)
langdata_lstm\01_train_font\Pyidaungsu-2.5.3_Regular.ttf (189468, 2021-09-18)
langdata_lstm\01_train_font\myanmarsanspro-regular.ttf (34568, 2021-09-18)
langdata_lstm\01_train_font\tharlon-regular.ttf (353228, 2021-09-18)
langdata_lstm\02_train_text (0, 2021-09-18)
langdata_lstm\02_train_text\mya.charlist.txt (17810, 2021-09-18)
langdata_lstm\02_train_text\mya.number.txt (40, 2021-09-18)
langdata_lstm\02_train_text\mya.pali.txt (1260, 2021-09-18)
langdata_lstm\02_train_text\mya.punc.txt (97, 2021-09-18)
langdata_lstm\02_train_text\mya.training_text (109345, 2021-09-18)
langdata_lstm\02_train_text\mya.traintext.exp0.txt (109345, 2021-09-18)
langdata_lstm\02_train_text\mya.wordlist.txt (343757, 2021-09-18)
langdata_lstm\start_train.sh (1610, 2021-09-18)
sample_langdata (0, 2021-09-18)
sample_langdata\mya (0, 2021-09-18)
sample_langdata\mya\mya.numbers (681, 2021-09-18)
sample_langdata\mya\mya.punc (97, 2021-09-18)
sample_langdata\mya\mya.training_text (109345, 2021-09-18)
sample_langdata\mya\mya.unicharset (15980, 2021-09-18)
sample_langdata\mya\mya.wordlist (343757, 2021-09-18)
test_image (0, 2021-09-18)
test_image\input1.jpg (216103, 2021-09-18)
test_image\input2.jpg (61070, 2021-09-18)
test_image\output1.txt (420, 2021-09-18)
test_image\output2.txt (1042, 2021-09-18)

# mm-ocr

This project trains data for image-to-text recognition of the Myanmar language with the Tesseract 4.00 OCR engine.

## 1. Installation

### 1.1 Requirement

The test environment is Ubuntu Bionic 18.04.

    sudo apt update
    sudo apt upgrade

*Note: If a proxy is used, the sudoers file must be modified with visudo.*

    sudo visudo

*Add the following line below `Defaults env_reset`.*

    Defaults env_keep="http_proxy https_proxy ftp_proxy"

**Tesseract 4 packages with LSTM engine and related traineddata.**

* [Ubuntu Bionic 18.04](https://packages.ubuntu.com/bionic/tesseract-ocr-all)
* [Ubuntu Bionic 18.04 - PPA](https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr?field.series_filter=bionic)

Add the PPA repository.

    sudo add-apt-repository ppa:alex-p/tesseract-ocr
    sudo apt update

*Note: If a proxy is used, the proxy must be configured before the PPA can be added.*

Add the following to the `/etc/profile` file (for a system-wide change) or to `~/.profile` (for the local user). Either file can be opened with `gedit`.

    export http_proxy=http://username:password@proxy:port
    export https_proxy=http://username:password@proxy:port
    export ftp_proxy=http://username:password@proxy:port

Run the following command to apply the change.

    source ~/.profile

*Run the following commands to add the repository.*

    sudo su
    add-apt-repository ppa:alex-p/tesseract-ocr
    sudo apt update

### 1.2 Installing Tesseract

There are two ways to install Tesseract. For testing only, use the first way (pre-built binary package). To train language data, use the second way (build from source).

**1. Install Tesseract via pre-built binary package**

    sudo apt install tesseract-ocr
    sudo apt install libtesseract-dev
    sudo apt install tesseract-ocr-mya
    sudo apt install tesseract-ocr-script-mymr

*Note: The installation directory is /usr/share/tesseract-ocr/4.00*

**2.
Build it from source**

Install the environment dependencies.

    sudo apt install g++ # or clang++ (presumably)
    sudo apt install autoconf automake libtool
    sudo apt install pkg-config
    sudo apt install libpng-dev
    sudo apt install libjpeg8-dev
    sudo apt install libtiff5-dev
    sudo apt install zlib1g-dev
    sudo apt install libtesseract-dev
    sudo apt install libleptonica-dev
    sudo apt install openjdk-8-jdk
    sudo apt install curl

Install Tesseract from Git.

    mkdir ~/tesstutorial
    cd ~/tesstutorial
    git clone --depth 1 https://github.com/tesseract-ocr/tesseract.git
    cd tesseract/tessdata
    wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
    wget https://github.com/tesseract-ocr/tessdata/raw/master/mya.traineddata
    wget https://github.com/tesseract-ocr/tessdata/raw/master/osd.traineddata
    mkdir best
    cd best
    wget https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata
    wget https://github.com/tesseract-ocr/tessdata_best/raw/master/mya.traineddata
    cd ~/tesstutorial/tesseract
    ./autogen.sh
    ./configure
    make
    sudo make install
    sudo ldconfig

Add the following to the `/etc/profile` file (for a system-wide change) or to `~/.profile` (for the local user). Either file can be opened with `gedit`.

    export TESSDATA_PREFIX=~/tesstutorial/tesseract/tessdata/

Run the following command to apply the change.

    source ~/.profile

*Note: The installation directory is ~/tesstutorial/tesseract*

### 1.3 Running Tesseract

    tesseract input.jpg output -l eng+mya

## 2. Training

### 2.1 Install Additional Libraries Required

    sudo apt install libicu-dev
    sudo apt install libpango1.0-dev
    sudo apt install libcairo2-dev

### 2.2 Install Sublime Text & Character Map

    wget -qO - https://download.sublimetext.com/sublimehq-pub.gpg | sudo apt-key add -
    echo "deb https://download.sublimetext.com/ apt/stable/" | sudo tee /etc/apt/sources.list.d/sublime-text.list
    sudo apt update
    sudo apt install sublime-text
    sudo apt install kcharselect

### 2.3 Build Training Environment

    cd ~/tesstutorial/tesseract
    make training
    sudo make training-install
    make ScrollView.jar

Add the following to the `/etc/profile` file (for a system-wide change) or to `~/.profile` (for the local user). Either file can be opened with `gedit`.

    export SCROLLVIEW_PATH=~/tesstutorial/tesseract/java/

Run the following command to apply the change.

    source ~/.profile

### 2.4 Create sample langdata from Git

    cd ~/tesstutorial
    mkdir langdata
    cd langdata
    wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/radical-stroke.txt
    wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/common.punc
    wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/font_properties
    wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/Latin.unicharset
    wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/Latin.xheights
    wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/Myanmar.unicharset
    wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/Myanmar.xheights
    mkdir mya
    cd mya
    wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/mya/mya.training_text
    wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/mya/mya.punc
    wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/mya/mya.numbers
    wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/mya/mya.wordlist

### 2.5 Install MS fonts

The locations of your fonts are defined in `gedit
/etc/fonts/fonts.conf`.

    /usr/share/fonts
    /usr/local/share/fonts
    /home/<username>/.fonts

Commands to install MS fonts:

    sudo apt update
    sudo apt install ttf-mscorefonts-installer
    sudo apt install fonts-dejavu
    fc-cache -vf

### 2.6 Install Myanmar fonts

    mkdir mmfonts
    cd mmfonts
    wget https://github.com/kaungyeehein/mm-ocr/raw/master/langdata_lstm/01_train_font/Myanmar3_V1.358.ttf
    wget https://github.com/kaungyeehein/mm-ocr/raw/master/langdata_lstm/01_train_font/NotoSansMyanmar-Bold.ttf
    wget https://github.com/kaungyeehein/mm-ocr/raw/master/langdata_lstm/01_train_font/NotoSansMyanmar-Regular.ttf
    wget https://github.com/kaungyeehein/mm-ocr/raw/master/langdata_lstm/01_train_font/Padauk-Bold.ttf
    wget https://github.com/kaungyeehein/mm-ocr/raw/master/langdata_lstm/01_train_font/Padauk-Regular.ttf
    wget https://github.com/kaungyeehein/mm-ocr/raw/master/langdata_lstm/01_train_font/Pyidaungsu-2.5.3_Bold.ttf
    wget https://github.com/kaungyeehein/mm-ocr/raw/master/langdata_lstm/01_train_font/Pyidaungsu-2.5.3_Regular.ttf
    wget https://github.com/kaungyeehein/mm-ocr/raw/master/langdata_lstm/01_train_font/myanmarsanspro-regular.ttf
    wget https://github.com/kaungyeehein/mm-ocr/raw/master/langdata_lstm/01_train_font/tharlon-regular.ttf
    cd ..
    sudo mv mmfonts /usr/share/fonts/mmfonts
    fc-cache -vf

## 3 General Knowledge

### 3.1 Render text to image (auto)

Create the following directories.

- `train/train_font` holds the training font file, e.g. `times.ttf`.
- `train/train_text` holds the training text file, e.g. `eng.imp0.txt`.
- `train/train_tif` receives the auto-generated tif and box files, e.g. `*.tif`, `*.box`.

Check the list of available fonts.

    cd training/
    text2image --list_available_fonts --fonts_dir='./train_font'

Command Option:

    text2image --text=[lang].imp0.txt --outputbase=[lang].[fontname].exp0 --font='Font Name' --fonts_dir=/path/to/your/fonts

Command Example:

```Shell
N=1 # set according to the number of files that you have (count from 0 to N)
for i in $(seq 0 "$N"); do
    # Render one training text file to a tif image plus its box file.
    text2image --text="./train_text/eng.imp$i.txt" --outputbase="./train_tif/eng.time_new_roman_regular.exp$i" --font='Times New Roman' --fonts_dir=./train_font
done
```

*Note: `eng.time_new_roman_regular.exp0.tif`, `eng.time_new_roman_regular.exp1.tif`, `eng.time_new_roman_regular.exp0.box` and `eng.time_new_roman_regular.exp1.box` are written to the `train_tif` directory.*

The following command also generates a box file named `image.box` for the image in the current directory.

    tesseract image.png image lstmbox

### 3.2 Install qt-box-editor to check box file

A box file contains information in the following format.
`<symbol> <left> <bottom> <right> <top> <page>`
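
Before opening a GUI, a quick command-line check can catch malformed box lines. The sketch below assumes the per-character layout of six whitespace-separated fields (symbol, left, bottom, right, top, page) with coordinates measured from the bottom-left corner of the image. The function name `check_box_file` is hypothetical and not part of Tesseract, and other box variants such as `WordStr` lines are not handled.

```shell
#!/usr/bin/env bash
# check_box_file is a hypothetical helper, not a Tesseract tool.
# Assumed line layout: <symbol> <left> <bottom> <right> <top> <page>
# Usage: check_box_file FILE   (exit status 0 if every line looks valid)
check_box_file() {
    awk '
        {
            n = NF
            # A space or tab symbol is swallowed by field splitting,
            # leaving five numeric fields on a line that starts with whitespace.
            if (n != 6 && !(n == 5 && $0 ~ /^[ \t]/)) {
                printf "line %d: expected 6 fields, got %d\n", NR, n
                bad++
                next
            }
            # The last five fields must be non-negative integers.
            for (i = n - 4; i <= n; i++) {
                if ($i !~ /^[0-9]+$/) {
                    printf "line %d: field \"%s\" is not an integer\n", NR, $i
                    bad++
                    next
                }
            }
            left = $(n - 4); bottom = $(n - 3); right = $(n - 2); top = $(n - 1)
            # Origin is bottom-left, so right >= left and top >= bottom.
            if (left + 0 > right + 0 || bottom + 0 > top + 0) {
                printf "line %d: box has negative width or height\n", NR
                bad++
            }
        }
        END { exit (bad > 0) }
    ' "$1"
}
```

For example, `check_box_file ./train_tif/eng.time_new_roman_regular.exp0.box && echo "box file looks well-formed"` prints the message only when no line is flagged. The check is purely structural and says nothing about whether each box actually surrounds the right glyph.
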
Install `qt-box-editor` to check in a GUI whether the box file is correct.

    sudo apt install qt-box-editor

Run qt-box-editor from the command line to correct the character box file.

    qt-box-editor

### 3.3 Install Myanmar language pack on Ubuntu

    sudo apt install ubuntu-restricted-extras
    check-language-support -l my
    sudo apt install language-pack-my

### 3.4 Combine Image and Box into Training Data set `*.lstmf`

Combine multiple image and box files into lstmf files.

```shell
cd path/to/dataset
for file in *.tif; do
    echo "$file"
    # Strip the .tif extension to get the output base name.
    base=$(basename "$file" .tif)
    # Produces $base.lstmf from the image and its matching box file.
    tesseract "$file" "$base" lstm.train
done
```

Generate a shuffled list of the lstmf files.

    ls -1 *.lstmf | sort -R > all-lstmf

### 3.5 Temp

```
unicharset_extractor => unicharset
  * lang.fontname.exp0.box

eg. unicharset_extractor --output_unicharset mya.unicharset --norm_mode 2 ./train_tif/mya.pyidaungsu.exp0.box

eg.
110
NULL 0 NULL 0
N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
a 3 58,65,186,1***,85,1***,0,26,97,185 Latin 56 0 5 a
...

training/set_unicharset_properties -U input_unicharset -O output_unicharset --script_dir=training/langdata
  * script_dir : the relevant .unicharset file(s)

eg.
...
; 10 ...
b 3 ...
W 5 ...
7 8 ...
= 0 ...
...

combine_lang_model => lstm-recoder
  * input_unicharset
  * script_dir
    - langdata
  * lang
  * word list files (optional)
  # NormalizeCleanAndSegmentUTF8
  # --pass_through_recoder

wordlist2dawg mya.number.txt mya.number-dawg mya.unicharset
wordlist2dawg mya.punc.txt mya.punc-dawg mya.unicharset
wordlist2dawg mya.wordlist.txt mya.word-dawg mya.unicharset
wordlist2dawg mya.frequencylist.txt mya.freq-dawg mya.unicharset

lang.punc-dawg
  (Optional) A dawg made from punctuation patterns found around words.
  The "word" part is replaced by a single space.

lang.number-dawg
  (Optional) A dawg made from tokens which originally contained digits.
  Each digit is replaced by a space character.

lstmtraining
  * traineddata
    - lstm-unicharset
    - lstm-recoder
    - lstm-punc-dawg
    - lstm-word-dawg
    - lstm-number-dawg
    - config (optional)
```

## This is not complete. To be continued; contributors are needed to support Myanmar OCR.
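
One step the notes in section 3.4 stop short of is dividing the shuffled `all-lstmf` list into a training subset and an evaluation subset, which `lstmtraining` can read through its `--train_listfile` and `--eval_listfile` options. The following is only a sketch under assumptions: the helper name `split_lstmf_list`, the `.train`/`.eval` suffixes, and the default 90% training share are all hypothetical choices, not something this project prescribes.

```shell
#!/usr/bin/env bash
# split_lstmf_list is a hypothetical helper, not a Tesseract tool.
# Usage: split_lstmf_list LIST [TRAIN_PERCENT]
# Writes LIST.train and LIST.eval next to LIST; LIST is assumed to be
# already shuffled (for example by `sort -R`), one .lstmf path per line.
split_lstmf_list() {
    local list=$1
    local percent=${2:-90}   # assumed default share used for training
    local total train_count

    total=$(wc -l < "$list")
    # Integer arithmetic: the training share is rounded down.
    train_count=$(( total * percent / 100 ))

    # First train_count lines go to training, the rest to evaluation.
    head -n "$train_count" "$list" > "$list.train"
    tail -n "+$(( train_count + 1 ))" "$list" > "$list.eval"
}
```

Calling `split_lstmf_list all-lstmf 80` on a list of ten entries would leave eight lines in `all-lstmf.train` and two in `all-lstmf.eval`. Because the split is positional, it is only as random as the ordering of the input list.
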
