Python Multiprocessing: How to Fit a Large Amount of Data into Limited Memory

Posted by Daetalus on 2014-12-17

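Introduction

This is a brief tutorial on how to fit a large amount of data into a limited amount of memory.

When working with clients, you sometimes find that their "database" is really just a warehouse of CSV or Excel files. You have to make do with it, and often need to get the job done without updating their data warehouse. In most cases it would be better to load these files into a simple database framework, but there may not be time for that. The approach described here is shaped by the time, the machine hardware and the environment you have to work with.

Here is a good example: suppose you have a pile of tables (no Neo4j, MongoDB or any other kind of database, just tables stored as CSVs, TSVs and similar formats), and combining them all would produce a dataframe too large to fit in memory. The first idea is to split the work into pieces and store them one at a time. That sounds fine, but it is slow, unless we use a multi-core processor.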
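The idea of splitting the work into pieces can be sketched as follows. This is a minimal illustration using only the standard library, not the code from the original project; the helper name `iter_chunks`, the column names and the chunk size are all assumptions made for the example.

```python
import csv
import io
from itertools import islice


def iter_chunks(file_obj, chunk_size):
    """Yield lists of at most chunk_size rows (as dicts) from a CSV file object.

    Only one chunk is held in memory at a time. (Hypothetical helper.)
    """
    reader = csv.DictReader(file_obj)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# Tiny in-memory stand-in for a large CSV file on disk.
sample = io.StringIO(
    "title,state\nSales Associate,TX\nData Analyst,CA\nChief Executive,NY\n"
)

chunk_sizes = [len(chunk) for chunk in iter_chunks(sample, chunk_size=2)]
print(chunk_sizes)  # [2, 1]
```

With pandas, passing `chunksize=N` to `pd.read_csv` gives a comparable iterator of smaller dataframes.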

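Goal

The goal here is to find the related job titles among all job titles (roughly 10,000). These titles are combined with the job codes issued by the government. The combined result is then joined with the corresponding state (administrative unit) information. Finally, attribute information generated with word2vec is used to enhance the existing attributes in our client's pipeline.

The task has to be finished quickly, because nobody wants to wait. Think of it as joining several tables without using a standard relational database.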
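To make the "join without a database" idea concrete, here is a small sketch of a hash join in plain Python. It is only an illustration of the general technique; the `hash_join` helper and the toy rows are made up for this example and are not part of the original script, which uses `pd.merge`.

```python
from collections import defaultdict


def hash_join(left_rows, right_rows, key):
    """Inner-join two lists of dicts on `key` (hypothetical helper).

    The right table is indexed once in a dict, so each left row is matched
    by lookup instead of by scanning the whole right table.
    """
    index = defaultdict(list)
    for row in right_rows:
        index[row[key]].append(row)

    joined = []
    for left in left_rows:
        for right in index.get(left[key], []):
            joined.append({**right, **left})
    return joined


# Toy rows, made up for this example.
titles = [
    {"code": "11-1011", "title": "Chief Executives"},
    {"code": "00-0001", "title": "Example Title"},
]
wages = [
    {"code": "11-1011", "state": "Texas", "a_median": "150,000"},
    {"code": "11-1011", "state": "Ohio", "a_median": "140,000"},
]

result = hash_join(titles, wages, key="code")
print(len(result))  # 2
```

Only the smaller table needs to be indexed in memory; the larger one can be streamed through row by row.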

Data

Job data

referencenumber	title	postdate	url	company	city	state	description
1652398203	Sales Associate	2014-07-09 13:47:18	URL link	Company Name	City	State	Our Sales Associates are…

"The table is too long; please see the original article."

Title data

ID	Title
82	Pediatricians, General

OES data

area	area_title	area_type	naics	naics_title	own_code	(remaining columns omitted)…
99	U.S.	1	000000	Cross-industry	1235	00-0000

"The table is too long; please see the original article."

SOC table

2010 SOC Code	2010 SOC Title	2010 SOC Direct Match Title	Illustrative Example
11-1011	Chief Executives	CEO

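Example script

Below is an example script showing how to use multiprocessing to speed up the work within a limited amount of memory. The first part of the script is specific to the task at hand and can be skipped freely. Focus on the second part, which concentrates on the multiprocessing engine.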

# import the necessary packages
import pandas as pd
import us
import numpy as np
from multiprocessing import Pool,cpu_count,Queue,Manager

# the data in one particular column held numbers in that horrible Excel format
# where '12000' is written '12,000', with that beautiful useless comma in there.
# did I mention that Excel bothers me?
# instead of converting the numbers right away, we only convert them when we need to
def median_maker(column):
    return np.median([int(x.replace(',','')) for x in column])

# dictionary_of_dataframes contains a dataframe with information for each title; e.g. title is 'Data Scientist'
# related_title_score_df is the dataframe of information for the title; columns = ['title','score']
### where title is a similar_title and score is how closely the two are related, e.g. 'Data Analyst', 0.871
# code_title_df contains columns ['code','title']
# oes_data_df is a HUGE dataframe with all of the Bureau of Labor Statistics (BLS) data for a given time period (YAY FREE DATA, BOO BAD CENSUS DATA!)

def job_title_location_matcher(title,location):
    try:
        related_title_score_df = dictionary_of_dataframes[title]
        # we limit the dataframe to only those related titles that are above
        # a previously established threshold
        # NOTE: the example score above is 0.871 while the threshold here is 80;
        # make sure both are on the same scale for your data
        related_title_score_df = related_title_score_df[related_title_score_df['score']>80]

        # we merge the related titles with another table and its codes
        codes_relTitles_scores = pd.merge(code_title_df,related_title_score_df)
        codes_relTitles_scores = codes_relTitles_scores.drop_duplicates()

        # merge the two dataframes by the codes
        merged_df = pd.merge(codes_relTitles_scores, oes_data_df)
        # limit the BLS data to the state we want
        all_merged = merged_df[merged_df['area_title']==str(us.states.lookup(location).name)]

        # calculate some summary statistics for the time we want
        group_med_emp,group_mean,group_pct10,group_pct25,group_median,group_pct75,group_pct90 = all_merged[['tot_emp','a_mean','a_pct10','a_pct25','a_median','a_pct75','a_pct90']].apply(median_maker)
        row = [title,location,group_med_emp,group_mean,group_pct10,group_pct25, group_median, group_pct75, group_pct90]
        # convert it all to strings so we can combine them all when writing to file
        row_string = [str(x) for x in row]
        return row_string
    except Exception:
        # if it doesn't work for a particular title/state just throw it out,
        # there are enough to make this insignificant
        return None
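The comma handling in `median_maker` can be checked in isolation. The sketch below swaps in the standard library's `statistics.median` for `np.median` so that it runs without NumPy; the helper name `median_from_strings` and the sample values are made up for illustration.

```python
from statistics import median


def median_from_strings(column):
    """Median of numbers stored as strings such as '12,000' (hypothetical helper)."""
    return median(int(value.replace(",", "")) for value in column)


# Made-up values in the Excel-style format described above.
wages = ["12,000", "8,500", "30,250"]
print(median_from_strings(wages))  # 12000
```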

This is where the magic happens:

# runs the function and puts the answer in the queue
def worker(row, q):
    ans = job_title_location_matcher(row[0],row[1])
    # the matcher returns None for title/state pairs it threw out; skip those
    if ans is not None:
        q.put(ans)

# this writes to the file while there are still things that could be in the queue
# this allows multiple processes to feed the same file without blocking each other
def listener(q, filename):
    with open(filename,'w') as f:
        while True:
            m = q.get()
            if m == 'kill':
                break
            f.write(','.join(m) + '\n')
            f.flush()

def main():
    # load all your data, then throw out all unnecessary tables/columns
    filename = 'skill_TEST_POOL.txt'

    # sets up the necessary multiprocessing tasks
    manager = Manager()
    q = manager.Queue()
    pool = Pool(cpu_count() + 2)
    # the listener occupies one pool process for the whole run
    watcher = pool.apply_async(listener,(q,filename))

    jobs = []
    # titles_states is a dataframe of millions of job titles and states they were found in
    # (assumed here to have the title in the first column and the state in the second);
    # name=None yields plain tuples, which can be pickled and sent to the workers
    for row in titles_states.itertuples(index=False, name=None):
        job = pool.apply_async(worker, (row, q))
        jobs.append(job)

    for job in jobs:
        job.get()
    q.put('kill')
    watcher.get()  # wait until the listener has written everything
    pool.close()
    pool.join()

if __name__ == "__main__":
    main()
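The worker/listener pattern can also be tried on its own, without any of the job data. The following is a minimal, self-contained sketch under stated assumptions: the squaring "work", the `run_pipeline` helper, the `STOP` sentinel name and the temporary output path are all invented for the example, and only standard-library calls are used.

```python
import os
import tempfile
from multiprocessing import Manager, Pool

STOP = "kill"  # sentinel telling the listener to stop


def worker(item, q):
    """Do some work on one item and hand the result row to the writer."""
    name, value = item
    q.put([name, str(value * value)])


def listener(q, path):
    """Single writer: drain the queue to disk until the sentinel arrives."""
    written = 0
    with open(path, "w") as f:
        while True:
            row = q.get()
            if row == STOP:
                break
            f.write(",".join(row) + "\n")
            written += 1
    return written


def run_pipeline(items, path, processes=3):
    """Fan items out to a pool while one pool process writes the results.

    `processes` must be at least 2, because the listener occupies one
    pool process for the whole run.
    """
    with Manager() as manager, Pool(processes) as pool:
        q = manager.Queue()
        writer = pool.apply_async(listener, (q, path))
        jobs = [pool.apply_async(worker, (item, q)) for item in items]
        for job in jobs:
            job.get()   # re-raises any exception from a worker
        q.put(STOP)     # all workers are done, so stop the writer
        return writer.get()


if __name__ == "__main__":
    out_path = os.path.join(tempfile.mkdtemp(), "results.csv")
    count = run_pipeline([("a", 2), ("b", 3), ("c", 4)], out_path)
    print(count, "rows written to", out_path)
```

Because only the listener touches the file, the workers never compete for it, and the complete result set exists only on disk. The order of the rows in the file depends on which worker finishes first.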

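Because the dataframes are all different sizes (about 100 GB in total), fitting all of the data into memory is impossible. By writing the final dataframe out to a file line by line, and never holding the complete dataframe in memory, we can still do all of the calculations and joins. The "standard approach" here would be to add a "write_line" method at the end of "job_title_location_matcher", but that would only process one instance at a time. Given the number of title/state pairs we need to process, that would take about 2 days. With multiprocessing, it takes only 2 hours.

Although readers may never encounter the kind of task environment this tutorial deals with, multiprocessing lets you get past many hardware limits. This example ran on a c3.8xl Ubuntu EC2 instance with 32 cores and 60 GB of memory (a lot of memory, but still not enough to hold all of the data at once). The key point is that we effectively processed about 100 GB of data on a machine with 60 GB of memory, while speeding things up by roughly 25 times. Using multiprocessing to automate large-scale processes on multi-core machines makes much better use of the hardware. Some readers may already know this technique, but for everyone else, multiprocessing can bring very large gains. By the way, this post is a continuation of the blog post "skill assets in the job-market".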
