Using PHP word segmentation libraries (jieba and scws) in Laravel

Posted by bestcyt on 2018-06-24

Stay open source, keep sharing.
This post introduces two PHP word segmentation libraries I have used and shows basic usage of each.

  • Goal: segment a paragraph of text into words

1. Jieba segmentation library

The Jieba segmentation library is hosted on GitHub (fukuball/jieba-php).
Installation:

composer require fukuball/jieba-php:dev-master

Main code:

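    // Raise the memory limit here, otherwise the process runs out of memory
    ini_set('memory_limit', '1024M');

    // Initialize
    $this->jieba = new Jieba();
    $this->finalseg = new Finalseg();

    $this->jieba->init();
    $this->finalseg->init();

    // Usage (the second argument, false, disables full "cut all" mode)
    $cut_array = $this->jieba->cut('text to segment', false);
    // The segmentation result is an array of words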

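Notes:

  1. Jieba lets you add keywords, that is, custom vocabulary that is used during segmentation; see the GitHub repository if you need this.
  2. The part-of-speech tags for words are listed in 'src/dict/pos_tag_readable.txt'.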
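
To illustrate note 1, here is a minimal sketch of loading custom vocabulary. It assumes, from my reading of the jieba-php README, that a user dictionary is a text file with one "word frequency tag" entry per line and that it is loaded with Jieba::loadUserDict(); please check both against the README before relying on them. The helper buildUserDict(), the file name, and the sample entries are hypothetical and only there for illustration.

```php
<?php
// Build the text of a user dictionary: one "word frequency tag" per line.
// (Assumed format; the entries passed in below are made-up examples.)
function buildUserDict(array $entries): string
{
    $lines = [];
    foreach ($entries as [$word, $freq, $tag]) {
        $lines[] = sprintf('%s %d %s', $word, $freq, $tag);
    }
    return implode("\n", $lines) . "\n";
}

// Write the dictionary to a temporary file (hypothetical file name).
$dictPath = sys_get_temp_dir() . '/user_dict.txt';
file_put_contents($dictPath, buildUserDict([
    ['端午節', 5, 'n'],
    ['屈原', 5, 'nr'],
]));

// This part only runs when fukuball/jieba-php is installed and autoloaded.
if (class_exists('Fukuball\Jieba\Jieba')) {
    ini_set('memory_limit', '1024M');
    \Fukuball\Jieba\Jieba::init();
    \Fukuball\Jieba\Finalseg::init();
    // Assumed API: load the custom words before cutting.
    \Fukuball\Jieba\Jieba::loadUserDict($dictPath);
    $words = \Fukuball\Jieba\Jieba::cut('端午節是為了紀念屈原', false);
    print_r($words);
}
```

The calls are written statically here because that is how I remember the README using them; the instance-style calls in this post reach the same methods.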

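2. SCWS segmentation

Official demo site: scws4.

This library feels fast to me, and it does not need nearly as much memory as Jieba; I was quite happy with it when I used it. I chose PSCWS4, the pure-PHP implementation, rather than the PHP extension. It does not support composer.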

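1. Download and install:

  1. pscws4
  2. Dictionary (Simplified Chinese, UTF-8)
  3. Unpack pscws4 into a new http/Help/scws directory (note that the namespace App\Help\scws used in the next step corresponds to app/Help/scws under Laravel's default PSR-4 autoloading, so the directory and the namespace must match)
  4. Put the dictionary file in the public directory (the code below also reads rules.ini from there)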

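2. Preparation:

  1. Rename pscws4.class.php from the unpacked pscws4 to PSCWS4.php, and replace its require of the other file with use App\Help\scws\XDB_R;
  2. Rename xdb_r.class.php from the unpacked pscws4 to XDB_R.php
  3. Add the namespace declaration namespace App\Help\scws; to both class files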

3. Code and test
Minimal implementation (the full code is in the appendix)

    // Initialize with UTF-8, then set the dictionary path and the rules path
    $this->pscws = new PSCWS4('utf8');
    $this->pscws->set_charset('utf-8');
    $this->pscws->set_dict(public_path().'/dict.utf8.xdb');
    $this->pscws->set_rule(public_path().'/rules.ini');

    // Usage: collect every segmented word into $article
    $article = [];
    $this->pscws->send_text("text to segment...");
    while ($some = $this->pscws->get_result())
    {
        foreach ($some as $word)
        {
            $article[] = $word['word'];
        }
    }
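
If punctuation tokens are not wanted in the result, the loop above can be wrapped in a small helper. This is only a sketch: collectWords() is a hypothetical name, and the sample batches are hand-written arrays shaped like what the loop reads through $word['word'], so it runs without PSCWS4 installed.

```php
<?php
/**
 * Flatten result batches into a flat list of words, skipping tokens that
 * consist only of punctuation, symbols, or whitespace.
 */
function collectWords(iterable $batches): array
{
    $words = [];
    foreach ($batches as $batch) {
        foreach ($batch as $token) {
            // \p{P} = punctuation, \p{S} = symbols; /u enables UTF-8 matching
            if (preg_match('/^[\p{P}\p{S}\s]+$/u', $token['word']) === 1) {
                continue;
            }
            $words[] = $token['word'];
        }
    }
    return $words;
}

// Hand-written sample data, not real PSCWS4 output.
$batches = [
    [['word' => '端午節'], ['word' => '，'], ['word' => '屈原']],
    [['word' => 'Dragon'], ['word' => '.'], ['word' => 'Boat']],
];

print_r(collectWords($batches));
```

With the real library, each array returned by get_result() in the while loop would play the role of one batch.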

4. Results
jieba output:
[image]
pscws4 output:
[image]

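As the screenshots above show, jieba does not split some English punctuation cleanly (for example, country at index 42), whereas scws splits off every punctuation mark. For my needs scws is the better fit; which one to choose depends on your own requirements.
jieba: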

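  • Pros: you can add keywords and use a custom dictionary
  • Cons: needs a lot of memory; English segmentation and punctuation handling are not very good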

scws:

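  • Pros: large dictionary (about 280,000 entries); can split off every character precisely
  • Cons: you cannot extend it yourself, and doing so seems to require payment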
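
To put numbers on "fast" and "needs a lot of memory", one option is to time a segmenter and read PHP's peak memory afterwards. The sketch below is an assumption of mine, not something measured in this post: profileSegmenter() is a hypothetical helper, and a whitespace splitter stands in for the real libraries so the code runs on its own.

```php
<?php
/**
 * Run a segmenter callable once and report token count, wall time,
 * and peak memory. Peak memory is process-wide, so each library
 * should be measured in a fresh request or process.
 */
function profileSegmenter(callable $segment, string $text): array
{
    $start = microtime(true);
    $tokens = $segment($text);
    return [
        'tokens'  => count($tokens),
        'seconds' => microtime(true) - $start,
        'peak_mb' => memory_get_peak_usage(true) / 1048576,
    ];
}

// Stand-in segmenter: splits on whitespace only. Replace it with a
// closure that calls jieba or PSCWS4 to profile the real libraries.
$whitespaceSplit = function (string $text): array {
    return preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
};

$report = profileSegmenter($whitespaceSplit, 'Dragon Boat Festival 端午節');
printf(
    "%d tokens, %.4f s, %.1f MB peak\n",
    $report['tokens'],
    $report['seconds'],
    $report['peak_mb']
);
```

For jieba, the initialization step should be included inside the measured closure as well, since loading the dictionary is where I would expect most of the memory to go.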

Appendix: code

Routes: web.php

Route::get('/scws', 'WordCutController@scwsCut');
Route::get('/jieba', 'WordCutController@jieBaCut');

Controller: WordCutController

<?php

namespace App\Http\Controllers;

use App\Help\scws\PSCWS4;
use Fukuball\Jieba\Finalseg;
use Fukuball\Jieba\Jieba;
use Illuminate\Http\Request;

class WordCutController extends Controller
{
    public $pscws;
    public $jieba;
    public $finalseg;

    /*
     * pscws4 segmentation example
     */
    public function scwsCut(){
        $this->pscws = new PSCWS4('utf8');
        $this->pscws->set_charset('utf-8');
        $this->pscws->set_dict(public_path().'/dict.utf8.xdb');
        $this->pscws->set_rule(public_path().'/rules.ini');

        // Usage: collect every segmented word into $article
        $article = [];
        $this->pscws->send_text("Dragon Boat Festival is one the very classic traditional festivals, which has been celebrated since the old China. Firstly, it is to in honor of the great poet Qu Yuan, who jumped into the water and ended his life for loving the country. Nowadays, different places have different ways to celebrate.
    端午節是一個非常經典的傳統節日,自古以來就一直被人們所慶祝。首先,是為了紀念偉大的詩人屈原,屈原跳入水自殺,以此來表達了對這個國家的愛。如今,不同的地方有不同的慶祝方式。");
        while ($some = $this->pscws->get_result())
        {
            foreach ($some as $word)
            {
                $article[] = $word['word'];
            }
        }
        dd($article);

    }

    /*
     * jieba segmentation example
     */
    public function jieBaCut(){
        ini_set('memory_limit', '1024M');

        // Initialize
        $this->jieba = new Jieba();
        $this->finalseg = new Finalseg();

        $this->jieba->init();
        $this->finalseg->init();

        // Usage
        $cut_array = $this->jieba->cut('Dragon Boat Festival is one the very classic traditional festivals, which has been celebrated since the old China. Firstly, it is to in honor of the great poet Qu Yuan, who jumped into the water and ended his life for loving the country. Nowadays, different places have different ways to celebrate.
端午節是一個非常經典的傳統節日,自古以來就一直被人們所慶祝。首先,是為了紀念偉大的詩人屈原,屈原跳入水自殺,以此來表達了對這個國家的愛。如今,不同的地方有不同的慶祝方式。',false);

        dd($cut_array);
    }
}

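If anything here is wrong or missing, please point it out; I only did enough to meet my own needs and have not studied these libraries in depth. Thanks, everyone :)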
