NLP: Chinese Data Preprocessing

Posted by ckxllf on 2019-12-02

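  This post walks through the preprocessing of Chinese text data step by step, with code for each step as an example.

  Loading the data (CSV format by default)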

  import pandas as pd

  datas = pd.read_csv("./test.csv", header=0, index_col=0) # DataFrame

  n_datas = datas.to_numpy()  # ndarray; NumPy is easier to work with (personal preference)

  Removing blank lines

  def delete_blank_lines(sentences):
      # Keep only the entries that contain at least one non-whitespace character
      return [s for s in sentences if s.split()]

  no_line_datas = delete_blank_lines(n_datas)
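As a quick check of the filter above, the following sketch runs it on a small hand-made list. The sample strings are made up for illustration, and the input is assumed to be a flat sequence of strings (a 2-D array returned by `to_numpy()` would first need its text column selected).

```python
def delete_blank_lines(sentences):
    # str.split() with no arguments returns [] for empty or whitespace-only strings
    return [s for s in sentences if s.split()]

# Hypothetical sample data, not taken from the original CSV
samples = ["今天天氣很好。", "", "   ", "\n", "\u3000", "我愛自然語言處理"]
cleaned = delete_blank_lines(samples)
print(cleaned)
```

Because `str.split()` treats any Unicode whitespace as a separator, entries that contain only a full-width space (`\u3000`) are dropped as well.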

  Removing digits

  import re

  DIGIT_RE = re.compile(r'\d+')

  def delete_digit(sentences):
      # Strip every run of digits from each sentence
      return [DIGIT_RE.sub('', s) for s in sentences]

  no_digit_datas = delete_digit(no_line_datas)
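To make the behaviour of the digit filter concrete, here is a small sketch on made-up sample strings. In Python 3, `\d` on a `str` pattern matches any Unicode decimal digit, so full-width digits are removed too, while Chinese numerals such as 一二三 are left alone.

```python
import re

DIGIT_RE = re.compile(r'\d+')

def delete_digit(sentences):
    # Remove every run of decimal digits (half-width or full-width)
    return [DIGIT_RE.sub('', s) for s in sentences]

# Hypothetical sample data for illustration
samples = ["2019年12月2日", "共123人", "一二三"]
no_digits = delete_digit(samples)
print(no_digits)
```

If Chinese numerals also need to go, they would have to be added to the pattern explicitly; that is not part of the original step.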

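  Classifying the sentence form (simple or complex)

  STOPS = ['。', '.', '?', '?', '!', '!']  # Chinese and English sentence-ending characters

  def is_sample_sentence(sentence):
      # A sentence counts as simple if it has at most one sentence-ending character
      count = 0
      for word in sentence:
          if word in STOPS:
              count += 1
              if count > 1:
                  return False
      return True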
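The same rule (at most one sentence-ending character means a simple sentence) can also be written with a regular expression. This is only a sketch: the names `STOP_RE` and `is_simple` are hypothetical, and the character set is assumed to cover both half-width and full-width terminators.

```python
import re

# Sentence-ending characters, half-width and full-width (assumed set)
STOP_RE = re.compile(r'[。.??!!]')

def is_simple(sentence):
    # Simple: the sentence contains at most one sentence-ending character
    return len(STOP_RE.findall(sentence)) <= 1

print(is_simple("今天天氣很好。"))
print(is_simple("你好!你是誰?"))
```

One caveat of this heuristic: a half-width `.` inside a decimal number or an abbreviation is counted as a terminator too, so such sentences may be classified as complex.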

  Removing Chinese and English punctuation

  from string import punctuation

  import re

  punc = punctuation + u'

  def delete_punc(sentences):
      # re.escape keeps characters such as ] and \ from breaking the character class
      return [re.sub(r"[{}]+".format(re.escape(punc)), '', s) for s in sentences]
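A runnable sketch of this step is given below. The Chinese punctuation string `CN_PUNC` is an assumption made for illustration (a short, non-exhaustive set), not the list used in the original post; the pattern is compiled once instead of on every call.

```python
import re
from string import punctuation

# Assumed, non-exhaustive set of Chinese punctuation marks; extend as needed
CN_PUNC = ",。、;:?!“”‘’()《》【】…"
PUNC_RE = re.compile("[{}]+".format(re.escape(punctuation + CN_PUNC)))

def delete_punc(sentences):
    # Remove every run of ASCII or Chinese punctuation from each sentence
    return [PUNC_RE.sub('', s) for s in sentences]

result = delete_punc(["你好,世界!", "Hello, world! (測試)"])
print(result)
```

Whitespace is not punctuation, so the space in the second sample is kept.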

  Removing English (keeping only Chinese characters)

  ENGLISH_RE = re.compile(r'[a-zA-Z]+')

  def delete_e_word(sentences):
      # Strip every run of ASCII letters from each sentence
      return [ENGLISH_RE.sub('', s) for s in sentences]

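  Removing garbled text and special symbols

  Use a regular expression to strip useless symbols and garbled characters.

  # This step removes every symbol and punctuation mark. Punctuation may still be needed
  # earlier to decide whether a sentence is simple, so run this step last.
  # Note: \w also matches Latin letters, digits and underscores, so English text
  # is removed only by delete_e_word above.

  SPECIAL_SYMBOL_RE = re.compile(r'[^\w\s\u4e00-\u9fa5]+')

  def delete_special_symbol(sentences):
      return [SPECIAL_SYMBOL_RE.sub('', s) for s in sentences]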
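The sketch below applies the pattern to two made-up strings to show what survives. Symbols, punctuation and the Unicode replacement character (`\ufffd`, a common sign of decoding errors) are removed, while letters, digits, underscores and whitespace are kept because `\w` and `\s` match them.

```python
import re

SPECIAL_SYMBOL_RE = re.compile(r'[^\w\s\u4e00-\u9fa5]+')

def delete_special_symbol(sentences):
    # Drop every character that is not a word character, whitespace or CJK ideograph
    return [SPECIAL_SYMBOL_RE.sub('', s) for s in sentences]

# Hypothetical sample data for illustration
result = delete_special_symbol(["你好★world☆123!!", "價格:¥100\ufffd"])
print(result)
```

To end up with Chinese characters only, this step would be combined with the digit and English filters shown earlier.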

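  Chinese word segmentation

  # Using jieba
  import jieba

  def seg_sentences(sentences):
      # jieba.cut returns a generator, so materialize each result as a list
      return [list(jieba.cut(s)) for s in sentences]

  # Using pyltp (an alternative to the jieba version above)
  from pyltp import Segmentor

  def seg_sentences(sentences):
      segmentor = Segmentor()
      segmentor.load('./cws.model')  # load the segmentation model parameters
      seg_sents = [list(segmentor.segment(sent)) for sent in sentences]
      segmentor.release()
      return seg_sents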

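  Removing stop words

  # The stop-word list has to be downloaded separately
  stopwords = []

  def delete_stop_word(sentences):
      # sentences is a list of token lists, as produced by seg_sentences
      return [[word for word in s if word not in stopwords] for s in sentences]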
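Since the stop-word list comes from an external file, one possible way to load it is sketched here. The helper `load_stopwords`, the one-word-per-line file layout and the file name are assumptions for illustration; the demo writes a throwaway file rather than using a real downloaded list, and the stop words are passed in as a parameter instead of read from a global.

```python
import os
import tempfile

def load_stopwords(path):
    # Assumes one stop word per line; a set gives fast membership tests
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def delete_stop_word(sentences, stopwords):
    # sentences is a list of token lists (the output of word segmentation)
    return [[word for word in s if word not in stopwords] for s in sentences]

# Demo with a temporary file standing in for a downloaded stop-word list
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stopwords.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("的\n了\n是\n")
    stopwords = load_stopwords(path)

filtered = delete_stop_word([["我", "的", "書"], ["他", "是", "學生", "了"]], stopwords)
print(filtered)
```

Using a set rather than a list matters once the stop-word list grows to thousands of entries, because every token is checked against it.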


Source: ITPUB blog, link: http://blog.itpub.net/69945560/viewspace-2666467/. If you repost this article, please credit the source; otherwise legal action may be taken.
