Big Data Technology and Applications In-Class Test: Data Cleaning and Synchronization

Posted by 方自然 on 2024-04-09

Original URL: https://www.cnblogs.com/kk4458/p/18124668

Big Data

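I. Data structure analysis:

(1) The source tables hold the 2015 science and technology achievement data for Beijing, Tianjin, and Hebei, and are stored as Access databases;

(2) The achievement data from the three regions must be consolidated into a single table (the result table must be a MySQL table);

(3) The three source tables have different structures. The result table must include every field; fields with the same or similar meaning must be merged, and no field may be dropped (if a field exists in only one source table, the rows from the other two tables get a null value in that field).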
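The merging rule in (3) can be sketched in Python as follows. This is only an illustration: the source column names and the synonym map are hypothetical, since the real Access schemas are not shown in this post. The idea is to rename synonymous columns to one unified name, take the ordered union of all columns, and fill gaps with `None`.

```python
# Hypothetical source schemas (assumptions, not the real Access column names).
SOURCE_COLUMNS = {
    "beijing": ["chengguo_mingcheng", "wancheng_danwei", "niandu"],
    "tianjin": ["name", "danwei", "lianxiren"],
    "hebei": ["xiangmu_mingcheng", "danwei", "niandu", "yingyonghangye"],
}

# Hypothetical synonym map: source column -> unified column name.
SYNONYMS = {
    "chengguo_mingcheng": "name",
    "xiangmu_mingcheng": "name",
    "wancheng_danwei": "danwei",
}


def unified_columns(source_columns, synonyms):
    """Return the ordered union of all columns after merging synonyms."""
    result = []
    for columns in source_columns.values():
        for column in columns:
            target = synonyms.get(column, column)
            if target not in result:
                result.append(target)
    return result


def to_unified_row(row, columns, synonyms):
    """Map one source row (a dict) onto the unified columns; gaps become None."""
    renamed = {synonyms.get(key, key): value for key, value in row.items()}
    return {column: renamed.get(column) for column in columns}


COLUMNS = unified_columns(SOURCE_COLUMNS, SYNONYMS)
```

No column is dropped: a column that appears in only one source table still ends up in `COLUMNS`, and rows from the other tables carry `None` there.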

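II. Data synchronization exercise: write a program that synchronizes the three source tables, copying the data from all three into a single result table.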
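A minimal sketch of the load step is given here, using Python's built-in `sqlite3` with an in-memory database as a stand-in for MySQL so that it runs without a server. The table name `huizongbiao` and the columns `name`, `danwei`, `shengshiqu`, and `yingyonghangye` appear in the code later in this post; `niandu` and the exact column list are assumptions.

```python
import sqlite3

# Assumed unified column list; only some of these names appear in the post.
RESULT_COLUMNS = ["name", "danwei", "shengshiqu", "niandu", "yingyonghangye"]


def create_result_table(conn):
    """Create the result table with an auto-incrementing ID."""
    column_sql = ", ".join(f"{column} TEXT" for column in RESULT_COLUMNS)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS huizongbiao "
        f"(ID INTEGER PRIMARY KEY AUTOINCREMENT, {column_sql})"
    )


def sync_rows(conn, rows):
    """Insert a list of dict rows; keys missing from a row are stored as NULL."""
    placeholders = ", ".join("?" for _ in RESULT_COLUMNS)
    sql = (
        f"INSERT INTO huizongbiao ({', '.join(RESULT_COLUMNS)}) "
        f"VALUES ({placeholders})"
    )
    values = [tuple(row.get(column) for column in RESULT_COLUMNS) for row in rows]
    conn.executemany(sql, values)
    conn.commit()
    return len(values)
```

Calling `sync_rows` once per source table, with rows already mapped to the unified column names, fills the single result table. Against a real MySQL server the same parameterized-insert pattern applies, although the placeholder style depends on the driver.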

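III. Data cleaning exercise:

(1) Duplicate-record cleaning: check whether the result table contains duplicate records. Records with the same region and the same achievement name count as duplicates. Keep one record, fill it in with any field values that appear only in the other duplicates, then delete the remaining records.

(2) Add two standard dimension fields, year and region, to the result table. If the source table already has the field, convert it directly into the dimension field; if not, determine the region from the unit name. The Tianjin achievement table has no year field, so its year dimension is set directly to 2015.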
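Requirement (2) can be sketched in Python as follows. The keyword list is a small hypothetical sample, and `niandu`, `dim_year`, and `dim_region` are assumed field names; `danwei` and `shengshiqu` are the unit and region columns used in the code in this post.

```python
import re

DEFAULT_YEAR = 2015  # the exercise fixes the year at 2015 when no year field exists

# Hypothetical keyword -> region lookup; extend with more place names as needed.
REGION_KEYWORDS = {
    "北京": "北京",
    "天津": "天津",
    "河北": "河北",
    "唐山": "河北",
    "保定": "河北",
}


def infer_region(danwei):
    """Guess the region from the unit name; return None if no keyword matches."""
    for keyword, region in REGION_KEYWORDS.items():
        if keyword in danwei:
            return region
    return None


def add_dimensions(row):
    """Return a copy of row with dim_year and dim_region added."""
    out = dict(row)
    # Use the first four-digit number in the year field, else the default year.
    match = re.search(r"\d{4}", str(row.get("niandu") or ""))
    out["dim_year"] = int(match.group()) if match else DEFAULT_YEAR
    # Prefer an existing region value; otherwise infer it from the unit name.
    out["dim_region"] = row.get("shengshiqu") or infer_region(row.get("danwei") or "")
    return out
```

Rows whose unit name matches no keyword end up with `dim_region` set to `None`, which makes them easy to find and review by hand.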

import java.sql.*;

public class thedataqingxi {

    public static void main(String[] args) {
        // Database connection settings
        String url = "jdbc:mysql://localhost:3306/2024.2.28test";
        String username = "root";
        String password = "123456";

        // try-with-resources closes the connection even if cleaning fails
        try (Connection connection = DriverManager.getConnection(url, username, password)) {
            // Run the data-cleaning step
            cleanData(connection);
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

    private static void cleanData(Connection connection) throws SQLException {
        // Find each group of duplicate records and the smallest ID in it (the row to keep).
        // Note: duplicates are grouped here by name and danwei (unit).
        String findDuplicatesSQL = "SELECT MIN(ID) AS minID, name, danwei " +
                "FROM huizongbiao " +
                "GROUP BY name, danwei " +
                "HAVING COUNT(*) > 1";

        // Delete every record in a duplicate group except the one with the smallest ID
        String deleteDuplicatesSQL =
                "DELETE FROM huizongbiao WHERE ID <> ? AND name = ? AND danwei = ?";

        // Statement, ResultSet, and PreparedStatement are all closed automatically;
        // the PreparedStatement is prepared once and reused for every group.
        try (Statement statement = connection.createStatement();
             ResultSet resultSet = statement.executeQuery(findDuplicatesSQL);
             PreparedStatement preparedStatement =
                     connection.prepareStatement(deleteDuplicatesSQL)) {

            // Walk through the duplicate groups
            while (resultSet.next()) {
                int minID = resultSet.getInt("minID");
                String name = resultSet.getString("name");
                String danwei = resultSet.getString("danwei");

                // Delete the other records in this group
                preparedStatement.setInt(1, minID);
                preparedStatement.setString(2, name);
                preparedStatement.setString(3, danwei);
                preparedStatement.executeUpdate();
            }
        }
    }
}
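The Java routine above deletes the extra rows but does not first copy over the field values that only the duplicates carry, which requirement (1) also asks for. A minimal in-memory sketch of that merge step, in Python, might look like this. It assumes the rows are dicts and uses region (`shengshiqu`) plus achievement name (`name`) as the duplicate key, following the wording of the exercise rather than the `name`/`danwei` grouping in the Java code.

```python
def merge_duplicates(rows, key_fields=("shengshiqu", "name")):
    """Keep the first row per key and fill its empty fields from later duplicates."""
    merged = {}
    for row in rows:
        key = tuple(row.get(field) for field in key_fields)
        if key not in merged:
            # First occurrence: this is the record that will be kept.
            merged[key] = dict(row)
            continue
        kept = merged[key]
        for field, value in row.items():
            # Only fill gaps; never overwrite a value the kept record already has.
            if kept.get(field) in (None, "") and value not in (None, ""):
                kept[field] = value
    # dicts preserve insertion order, so the original row order is retained
    return list(merged.values())
```

Because existing values are never overwritten, conflicting values between duplicates resolve in favor of the first record seen; whether that is the right policy depends on the data.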

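IV. Data analysis:

Using the known field names provided, automatically classify the science and technology achievements, and analyze the scientific and technological strengths of Beijing, Tianjin, and Hebei.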

import pandas as pd
import matplotlib.pyplot as plt

# Configure matplotlib so that Chinese text renders correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

# Read the CSV file
df = pd.read_csv('huizongbiao.csv')

# Select each region's rows; industry shares are computed per region below
hebei_data = df[df['shengshiqu'] == '河北']
beijing_data = df[df['shengshiqu'] == '北京']
tianjin_data = df[df['shengshiqu'] == '天津']


def plot_pie_and_bar(data, title):
    # Count the records in each application industry
    industry_counts = data['yingyonghangye'].value_counts()

    # Keep the top five and group the rest under "Other"
    # (copy first so the new label is not added to a slice of industry_counts)
    top_industries = industry_counts.head(5).copy()
    other_count = industry_counts.iloc[5:].sum()
    top_industries['Other'] = other_count

    total_count = len(data)
    industry_percentages = top_industries / total_count * 100

    # Draw the pie chart
    plt.figure(figsize=(10, 6))
    plt.subplot(1, 2, 1)
    plt.pie(industry_percentages, labels=industry_percentages.index,
            autopct='%1.1f%%', startangle=140)
    plt.title(f'{title} - Industry share')

    # Draw the bar chart
    plt.subplot(1, 2, 2)
    top_industries.plot(kind='bar')
    plt.title(f'{title} - Industry distribution (top five)')
    plt.xlabel('Application industry')
    plt.ylabel('Count')

    plt.tight_layout()
    plt.show()


# Plot the charts for Hebei
plot_pie_and_bar(hebei_data, 'Hebei')

# Plot the charts for Beijing
plot_pie_and_bar(beijing_data, 'Beijing')

# Plot the charts for Tianjin
plot_pie_and_bar(tianjin_data, 'Tianjin')
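The script above charts the existing `yingyonghangye` (application industry) values but does not show the automatic classification step. One simple way to approach it is keyword matching on the achievement name, sketched here. The categories and keywords are hypothetical examples, and classifying on the `name` field is an assumption; a real rule set would need to be built from the actual data.

```python
from collections import Counter, defaultdict

# Hypothetical category -> keyword rules; checked in order, first match wins.
CATEGORY_KEYWORDS = {
    "Agriculture": ["玉米", "土壤", "水稻"],
    "Medicine": ["疫苗", "病毒", "血液"],
    "Materials": ["合金", "塑料", "水泥"],
}


def classify(name):
    """Return the first category whose keywords appear in the achievement name."""
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in name for keyword in keywords):
            return category
    return "Other"


def region_strengths(rows):
    """Return, per region, the most frequent category and its record count."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["shengshiqu"]][classify(row["name"])] += 1
    return {region: counter.most_common(1)[0] for region, counter in counts.items()}
```

Each region maps to a `(category, count)` pair. A large "Other" share would be a sign that the keyword rules need extending before any conclusion about regional strengths is drawn.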
