Implementing simple OCR with tess-two and cv4j

Posted by Tony沈哲 on 2017-10-02

OCR

Optical character recognition (OCR) is the process of analyzing and recognizing image files of text documents in order to extract the text and its layout information.

Tesseract

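Tesseract is an OCR engine that Ray Smith developed at HP Labs Bristol between 1985 and 1995. It ranked among the top engines in the 1995 UNLV accuracy test, but development largely stopped after 1996. In 2006 Google invited Smith to join and restarted the project. The project is now licensed under Apache 2.0 and supports the mainstream platforms, including Windows, Linux, and Mac OS. As an engine, though, it only ships a command-line tool, not a graphical application. Tesseract is currently maintained by Google, is one of the best open-source OCR engines, and supports Chinese.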

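tess-two is a port of Tesseract to the Android platform.

Add the tess-two dependency:

compile 'com.rmtheis:tess-two:8.0.0'

Then put the trained eng.traineddata file into the Android project's assets folder (the code below reads it from assets/tessdata), and English text can be recognized.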

1. Simple English recognition

Initialize tess-two and load the trained tessdata:

    private void prepareTesseract() {
        try {
            prepareDirectory(DATA_PATH + TESSDATA);
        } catch (Exception e) {
            Log.e(TAG, "Unable to prepare the tessdata directory", e);
        }

        copyTessDataFiles(TESSDATA);
    }

    /**
     * Prepare the tessdata directory on external storage.
     *
     * @param path absolute path of the directory to create
     */
    private void prepareDirectory(String path) {
        File dir = new File(path);
        if (dir.exists()) {
            Log.i(TAG, "Directory " + path + " already exists");
        } else if (dir.mkdirs()) {
            Log.i(TAG, "Created directory " + path);
        } else {
            Log.e(TAG, "ERROR: Creation of directory " + path + " failed, check that the Android Manifest has permission to write to external storage.");
        }
    }

    /**
     * Copy tessdata files (located in assets/tessdata) to the destination directory.
     *
     * @param path name of the assets directory with the .traineddata files
     */
    private void copyTessDataFiles(String path) {
        try {
            String[] fileList = getAssets().list(path);
            if (fileList == null) {
                return;
            }

            for (String fileName : fileList) {
                // Copy the asset to storage only if it is not already there
                String pathToDataFile = DATA_PATH + path + "/" + fileName;
                if (new File(pathToDataFile).exists()) {
                    continue;
                }

                // try-with-resources closes both streams even if the copy fails
                try (InputStream in = getAssets().open(path + "/" + fileName);
                     OutputStream out = new FileOutputStream(pathToDataFile)) {
                    byte[] buf = new byte[1024];
                    int len;
                    // read() returns -1 at the end of the stream
                    while ((len = in.read(buf)) != -1) {
                        out.write(buf, 0, len);
                    }
                }

                Log.d(TAG, "Copied " + fileName + " to tessdata");
            }
        } catch (IOException e) {
            Log.e(TAG, "Unable to copy files to tessdata " + e.toString());
        }
    }
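The copy step can also be written against plain Java I/O, which makes it easy to try outside an Android project. The following is a minimal sketch, not the project's code: `TessDataCopier` and `copyIfMissing` are hypothetical names, and it assumes `java.nio.file` is available (on Android that means API level 26 or later). On a device the input stream would come from `getAssets().open(...)`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, for illustration only.
final class TessDataCopier {
    private TessDataCopier() {}

    /**
     * Copies the stream to targetDir/fileName unless that file already exists.
     * The stream is always closed. Returns true if a copy was made.
     */
    static boolean copyIfMissing(InputStream source, Path targetDir, String fileName)
            throws IOException {
        try (InputStream in = source) {
            Files.createDirectories(targetDir);
            Path target = targetDir.resolve(fileName);
            if (Files.exists(target)) {
                return false; // already copied on an earlier run
            }
            Files.copy(in, target);
            return true;
        }
    }
}
```

Calling `copyIfMissing` twice with the same file name copies the data once and skips the second call, which is the same "copy only if absent" behavior as `copyTessDataFiles` above.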

After taking a photo, call the startOCR method.

    private void startOCR(Uri imgUri) {
        try {
            BitmapFactory.Options options = new BitmapFactory.Options();
            // inSampleSize = 1 decodes at full size; 4 decodes at 1/4 of the width and height.
            // Smaller values need more heap memory to hold the bitmap.
            options.inSampleSize = 4;
            Bitmap bitmap = BitmapFactory.decodeFile(imgUri.getPath(), options);

            String result = extractText(bitmap);
            resultView.setText(result);

        } catch (Exception e) {
            // Pass the exception itself: e.getMessage() may be null
            Log.e(TAG, "OCR failed", e);
        }
    }
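Instead of hard-coding `inSampleSize = 4`, the value can be derived from the photo's dimensions. The following is a minimal sketch, assuming a hypothetical helper `SampleSize.forTarget` and target dimensions chosen by the caller. It returns the largest power of two that keeps both decoded dimensions at or above the requested size. On Android the source dimensions would typically be read first by decoding with `options.inJustDecodeBounds = true`.

```java
// Hypothetical helper, for illustration only.
final class SampleSize {
    private SampleSize() {}

    /**
     * Returns a power-of-two sample size such that the decoded image is
     * still at least reqWidth x reqHeight, or 1 if the source is already
     * no larger than the requested size.
     */
    static int forTarget(int width, int height, int reqWidth, int reqHeight) {
        int inSampleSize = 1;
        if (height > reqHeight || width > reqWidth) {
            int halfHeight = height / 2;
            int halfWidth = width / 2;
            // Keep doubling while the next step still satisfies the request
            while ((halfHeight / inSampleSize) >= reqHeight
                    && (halfWidth / inSampleSize) >= reqWidth) {
                inSampleSize *= 2;
            }
        }
        return inSampleSize;
    }
}
```

For OCR the target should stay large enough for the characters to remain legible, so the requested size is a trade-off between heap usage and recognition quality.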

extractText() calls the tess-two API to perform the OCR text recognition.

    private String extractText(Bitmap bitmap) {
        String extractedText = "empty result";
        tessBaseApi = new TessBaseAPI();
        try {
            // DATA_PATH is the parent directory of the tessdata directory
            if (!tessBaseApi.init(DATA_PATH, lang)) {
                Log.e(TAG, "Could not initialize Tesseract for language " + lang);
                return extractedText;
            }
            tessBaseApi.setImage(bitmap);
            extractedText = tessBaseApi.getUTF8Text();
        } catch (Exception e) {
            Log.e(TAG, "Error in recognizing text.", e);
        } finally {
            // Release the native resources whether or not recognition succeeded
            tessBaseApi.end();
        }
        return extractedText;
    }

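Finally, the recognized text is displayed. The result at this stage is reasonably good.

2. Recognizing code

Next, try using the program above to recognize a piece of code.

This time the result is a mess. Let's refactor startOCR() and add local (adaptive) binarization.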

    private void startOCR(Uri imgUri) {
        try {
            BitmapFactory.Options options = new BitmapFactory.Options();
            // inSampleSize = 1 decodes at full size; 4 decodes at 1/4 of the width and height.
            // Smaller values need more heap memory to hold the bitmap.
            options.inSampleSize = 4;
            Bitmap bitmap = BitmapFactory.decodeFile(imgUri.getPath(), options);

            // Convert to grayscale, then apply cv4j's adaptive (local) threshold
            CV4JImage cv4JImage = new CV4JImage(bitmap);
            Threshold threshold = new Threshold();
            threshold.adaptiveThresh((ByteProcessor) (cv4JImage.convert2Gray().getProcessor()),
                    Threshold.ADAPTIVE_C_MEANS_THRESH, 12, 30, Threshold.METHOD_THRESH_BINARY);
            Bitmap newBitmap = cv4JImage.getProcessor().getImage().toBitmap(Bitmap.Config.ARGB_8888);

            // Show the binarized image, then run OCR on it
            ivImage2.setImageBitmap(newBitmap);

            String result = extractText(newBitmap);
            resultView.setText(result);

        } catch (Exception e) {
            // Pass the exception itself: e.getMessage() may be null
            Log.e(TAG, "OCR failed", e);
        }
    }

Here cv4j is used to binarize the image.

    CV4JImage cv4JImage = new CV4JImage(bitmap);
    Threshold threshold = new Threshold();
    threshold.adaptiveThresh((ByteProcessor) (cv4JImage.convert2Gray().getProcessor()),
            Threshold.ADAPTIVE_C_MEANS_THRESH, 12, 30, Threshold.METHOD_THRESH_BINARY);
    Bitmap newBitmap = cv4JImage.getProcessor().getImage().toBitmap(Bitmap.Config.ARGB_8888);

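Image binarization sets the gray value of every pixel in the image to either 0 or 255, so that the whole image becomes clearly black and white. Binarization helps further processing: it simplifies the image, reduces the amount of data, and makes the outline of the objects of interest stand out.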
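To make the idea of local binarization concrete, here is a minimal sketch of a mean-based adaptive threshold in plain Java. It is not cv4j's implementation: `AdaptiveMeanThreshold`, `radius`, and `c` are names chosen for this illustration and are not mapped to the `12, 30` arguments passed to `adaptiveThresh` above. Each pixel is compared with the mean of its neighborhood minus a constant, so the threshold follows uneven lighting across the photo.

```java
// Hypothetical helper, for illustration only.
final class AdaptiveMeanThreshold {
    private AdaptiveMeanThreshold() {}

    /**
     * gray holds gray values in [0, 255], indexed as gray[y][x].
     * A pixel becomes 255 if it is brighter than (local mean - c), else 0.
     * The neighborhood is a (2 * radius + 1) square, clipped at the borders.
     */
    static int[][] binarize(int[][] gray, int radius, int c) {
        int h = gray.length;
        if (h == 0) {
            return new int[0][0];
        }
        int w = gray[0].length;
        int[][] out = new int[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                long sum = 0;
                int count = 0;
                for (int yy = Math.max(0, y - radius); yy <= Math.min(h - 1, y + radius); yy++) {
                    for (int xx = Math.max(0, x - radius); xx <= Math.min(w - 1, x + radius); xx++) {
                        sum += gray[yy][xx];
                        count++;
                    }
                }
                double localMean = (double) sum / count;
                out[y][x] = gray[y][x] > localMean - c ? 255 : 0;
            }
        }
        return out;
    }
}
```

The nested loops keep the sketch short but cost O(radius²) per pixel; a summed-area table would give the same result faster for large neighborhoods.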

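cv4j on GitHub: github.com/imageproces…

cv4j is an image processing library that gloomyfish and I develop together, implemented in pure Java.

Trying again: the middle part of the picture shows the binarized image, and this time the content of the code is basically recognized.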

3. Recognizing Chinese

To recognize Chinese text, you need the Chinese data packages, which can be downloaded from the site below.

github.com/tesseract-o…

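The Chinese-related data packages are chi_sim.traineddata and chi_tra.traineddata, for Simplified Chinese and Traditional Chinese respectively.

tessBaseApi.init(DATA_PATH, lang);

The earlier examples all recognized English, so lang was "eng". To recognize Simplified Chinese, change it to "chi_sim".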
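Tesseract looks for `<lang>.traineddata` inside the `tessdata` directory under the data path, and several languages can be combined with `+` (for example `eng+chi_sim`). Before switching the language it can help to check that the matching files were actually copied. The following is a minimal sketch under that assumption; `TrainedDataCheck` and `missingLanguages` are hypothetical names, not part of tess-two.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class TrainedDataCheck {
    private TrainedDataCheck() {}

    /**
     * Returns the language codes in lang (separated by '+') that have no
     * <code>.traineddata file in dataPath/tessdata.
     */
    static List<String> missingLanguages(File dataPath, String lang) {
        File tessdata = new File(dataPath, "tessdata");
        List<String> missing = new ArrayList<>();
        for (String code : lang.split("\\+")) {
            if (!new File(tessdata, code + ".traineddata").isFile()) {
                missing.add(code);
            }
        }
        return missing;
    }
}
```

If the returned list is empty, the data files for the requested languages are in place; otherwise the listed codes are the ones still to be added to assets/tessdata.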

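Closing notes

This project is only a demo and is still far from production use.
The project on GitHub: github.com/fengzhizi71…

Why only demo level?

The data packages are large; the Chinese one in particular is over 50 MB, which is not suitable to ship in a mobile app. The usual approach is to run recognition in the cloud.
Text recognition is slow, especially for Chinese, and there is a lot of room for engineering optimization.
OCR needs a lot of preprocessing beforehand. This example only uses binarization, but there are many other preprocessing steps, such as skew correction and character segmentation.
To improve the recognition rate of tess-two, you can train your own data set.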
