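Crawling LeetCode Data to Generate a README and Tidy Up Your GitHub Repository

Posted by Zhaoxi_Zhang on 2018-12-15

Project repository: LeetCodeCrawler

Overview

  Most of us practice on LeetCode to some degree and keep the solutions on GitHub. I do the same; my solutions live in my Algorithm repository. Re-editing README.md by hand after every solved problem is tedious, so I wrote a tool, LeetCodeCrawler. It crawls LeetCode problem statements and your accepted (AC) submissions, and can generate a matching README.md to tidy up the README of your LeetCode repository.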

Usage

Download LeetCodeCrawler.jar to your machine.

Create a config.json file like the one below (you can edit the config.json in the repo directly). It must sit in the same directory as LeetCodeCrawler.jar:

{
    "username": "leetcode@leetcode",
    "password": "leetcode",
    "language": ["cpp", "java"],
    "outputDir": "."
}
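  • username and password are your LeetCode account name and password
  • language lists the programming languages you use on LeetCode. Multiple values are allowed; the accepted values are listed below (use them exactly as written):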
    • cpp
    • java
    • c
    • csharp
    • javascript
    • python
    • python3
    • ruby
    • swift
    • golang
    • scala
    • kotlin
  • outputDir is the directory where the source files will be saved; the default is ., the current directory
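
Because the language values must match the list above exactly, a small pre-flight check can catch typos before the crawler runs. The following is only a sketch and is not part of LeetCodeCrawler; ConfigLanguageCheck and unsupported are hypothetical names, and the input is assumed to be the already-parsed language array from config.json.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of LeetCodeCrawler: checks "language" values
// from config.json against the identifiers listed in this article.
final class ConfigLanguageCheck {
    // The twelve identifiers listed above.
    static final Set<String> SUPPORTED = new LinkedHashSet<>(Arrays.asList(
            "cpp", "java", "c", "csharp", "javascript", "python",
            "python3", "ruby", "swift", "golang", "scala", "kotlin"));

    // Returns the configured values that are not in the supported list,
    // in their original order. An empty result means the field is valid.
    static List<String> unsupported(List<String> configured) {
        List<String> bad = new ArrayList<>();
        for (String lang : configured) {
            if (!SUPPORTED.contains(lang)) {
                bad.add(lang);
            }
        }
        return bad;
    }
}
```

The comparison is case-sensitive on purpose, so a value such as Java or go would be reported rather than silently ignored.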

Run java -jar LeetCodeCrawler.jar

Results

[Images: three screenshots of the results were shown here]

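Crawling and Parsing

Relevant APIs

  The data we want comes from two kinds of endpoints: (1) a RESTful API and (2) GraphQL. The following APIs are useful during crawling:

  • Information on all problems: https://leetcode.com/api/problems/all/. The response looks roughly like this: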
{
    "user_name": "",
    "num_solved": 0,
    "num_total": 949,
    "ac_easy": 0,
    "ac_medium": 0,
    "ac_hard": 0,
    "stat_status_pairs": [
    {
        "stat":
        {
            "question_id": 993,
            "question__article__live": true,
            "question__article__slug": "tallest-billboard",
            "question__title": "Tallest Billboard",
            "question__title_slug": "tallest-billboard",
            "question__hide": false,
            "total_acs": 1361,
            "total_submitted": 4295,
            "frontend_question_id": 956,
            "is_new_question": false
        },
        "status": null,
        "difficulty":
        {
            "level": 3
        },
        "paid_only": false,
        "is_favor": false,
        "frequency": 0,
        "progress": 0
    },
    ... (omitted)
    ],
    "frequency_high": 0,
    "frequency_mid": 0,
    "category_slug": "all"
}
  • Submissions for a given problem: https://leetcode.com/api/submissions/two-sum/?offset=0&limit=10&lastkey=. The list of submissions may span more than one page, so the crawler needs paging logic. The response looks roughly like this:
{
    "submissions_dump": [
    {
        "id": xxx,
        "lang": "java",
        "time": "2 weeks, 5 days",
        "timestamp": 154****320,
        "status_display": "Accepted",
        "runtime": "4 ms",
        "url": "/submissions/detail/19****359/",
        "is_pending": "Not Pending",
        "title": ""
    },
    ... (omitted)
    ],
    "has_next": true,
    "last_key": "xxx"
}
  • GraphQL: https://leetcode.com/graphql. Send a query request to this URL to fetch the data we want.
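
To make the two REST endpoints more concrete, here is a minimal sketch of helpers that build a submissions page URL and interpret two fields of the problems response. The class and method names are hypothetical. The mapping of difficulty.level 1, 2, 3 to Easy, Medium, Hard is an assumption that is not stated in the responses above, and the lastKey value is inserted without URL-encoding.

```java
import java.util.Locale;

// Hypothetical helpers illustrating the two REST endpoints listed above.
final class LeetCodeRestSketch {
    static final String BASE = "https://leetcode.com";

    // Builds the submissions URL for one page, following the query
    // parameters in the example URL (offset, limit, lastkey).
    // Note: lastKey is inserted as-is, without URL-encoding.
    static String submissionsUrl(String titleSlug, int offset, int limit, String lastKey) {
        return String.format(Locale.ROOT,
                "%s/api/submissions/%s/?offset=%d&limit=%d&lastkey=%s",
                BASE, titleSlug, offset, limit, lastKey);
    }

    // Assumption: difficulty.level 1, 2, 3 means Easy, Medium, Hard.
    static String difficultyLabel(int level) {
        switch (level) {
            case 1: return "Easy";
            case 2: return "Medium";
            case 3: return "Hard";
            default: return "Unknown";
        }
    }

    // Acceptance rate in percent, computed from total_acs and total_submitted.
    static double acceptanceRate(long totalAcs, long totalSubmitted) {
        return totalSubmitted == 0 ? 0.0 : 100.0 * totalAcs / totalSubmitted;
    }
}
```

For the sample entry above, acceptanceRate(1361, 4295) comes out at roughly 31.7 percent, which is the kind of figure a generated README table could show next to each problem.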

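Simulating Login

  I previously wrote a post on simulating login, "Simulating LeetCode Login with OkHttp", which has the details; here is a brief summary. A packet capture of the login request shows the following:

[Image: packet capture of the login request]
So all we need to do is build a request whose Content-Type is multipart/form-data and attach the Cookie values returned when the login page is first opened; that completes the simulated login.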
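
The multipart/form-data body described above can be assembled with a small helper instead of manual string concatenation. This is only a sketch: MultipartFormSketch and its methods are hypothetical names, and the four field names are taken from the doLogin() listing in this section.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: assembles a multipart/form-data body from text fields.
final class MultipartFormSketch {
    // Each field becomes one part: delimiter line, Content-Disposition header,
    // an empty line, then the value. Lines end with CRLF.
    static String build(String boundary, Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.append("--").append(boundary).append("\r\n")
                .append("Content-Disposition: form-data; name=\"")
                .append(field.getKey()).append("\"\r\n\r\n")
                .append(field.getValue()).append("\r\n");
        }
        // The closing delimiter carries two extra trailing hyphens.
        body.append("--").append(boundary).append("--");
        return body.toString();
    }

    // The four fields sent by doLogin(), in the same order.
    static Map<String, String> loginFields(String csrfToken, String login, String password) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("csrfmiddlewaretoken", csrfToken);
        fields.put("login", login);
        fields.put("password", password);
        fields.put("next", "/problems");
        return fields;
    }
}
```

A LinkedHashMap is used so the parts keep their insertion order. The boundary passed to build must be the same string that appears in the Content-Type header.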


/**
  * Simulates logging in to LeetCode. For an analysis of the login flow, see: https://www.cnblogs.com/ZhaoxiCheung/p/9302510.html
  */
public boolean doLogin() throws IOException {
    boolean success;
    Connection.Response response = Jsoup.connect(URL.LOGIN)
                                   .method(Connection.Method.GET)
                                   .execute();

    csrftoken = response.cookie("csrftoken");
    __cfduid = response.cookie("__cfduid");

    OkHttpClient client = new OkHttpClient.Builder()
                      .followRedirects(false)
                      .followSslRedirects(false)
                      .cookieJar(new MyCookieJar())
                      .connectTimeout(30, TimeUnit.SECONDS)
                      .readTimeout(30, TimeUnit.SECONDS)
                      .writeTimeout(30, TimeUnit.SECONDS)
                      .build();

    String form_data = "--" + boundary + "\r\n"
                       + "Content-Disposition: form-data; name=\"csrfmiddlewaretoken\"" + "\r\n\r\n"
                       + csrftoken + "\r\n"
                       + "--" + boundary + "\r\n"
                       + "Content-Disposition: form-data; name=\"login\"" + "\r\n\r\n"
                       + usrname + "\r\n"
                       + "--" + boundary + "\r\n"
                       + "Content-Disposition: form-data; name=\"password\"" + "\r\n\r\n"
                       + passwd + "\r\n"
                       + "--" + boundary + "\r\n"
                       + "Content-Disposition: form-data; name=\"next\"" + "\r\n\r\n"
                       + "/problems" + "\r\n"
                       + "--" + boundary + "--";

    RequestBody requestBody = RequestBody.create(MULTIPART, form_data);

    Request request = new Request.Builder()
                    .addHeader("Content-Type", "multipart/form-data; boundary=" + boundary)
                    .addHeader("Connection", "keep-alive")
                    .addHeader("Accept", "*/*")
                    .addHeader("Origin", "https://leetcode.com")
                    .addHeader("Referer", URL.LOGIN)
                    .addHeader("Cookie", "__cfduid=" + __cfduid + ";" + "csrftoken=" + csrftoken)
                    .post(requestBody)
                    .url(URL.LOGIN)
                    .build();

    Response loginResponse = client.newCall(request).execute();

    if (Main.isDebug)   out.println(loginResponse.message());

    Headers headers = loginResponse.headers();
    List<String>cookies = headers.values("Set-Cookie");
    for (String cookie : cookies) {
        int found = cookie.indexOf("LEETCODE_SESSION");
        if (found > -1) {
            if (Main.isDebug)   out.println(cookie);
            int last = cookie.indexOf(";");
            LEETCODE_SESSION = cookie.substring("LEETCODE_SESSION".length() + 1, last);
            if (Main.isDebug)   out.println(LEETCODE_SESSION);
        }
    }


    if (LEETCODE_SESSION != null) {
        success = true;
        out.println("Login Successfully");
    } else {
        success = false;
        out.println("Login Unsuccessfully");
    }
    loginResponse.close();

    return success;
}
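
The loop at the end of doLogin() locates LEETCODE_SESSION with indexOf and assumes a semicolon follows the value. A slightly more defensive variant is sketched below; SetCookieSketch and valueOf are hypothetical names, and the sketch assumes each Set-Cookie header line begins with the cookie's name=value pair.

```java
// Hypothetical helper: pulls one cookie's value out of a Set-Cookie header line.
final class SetCookieSketch {
    // Returns the value of cookie `name` if the header line starts with
    // "name=", otherwise null. Attributes after the first ';' are ignored,
    // and a line without any ';' is treated as consisting of the pair only.
    static String valueOf(String setCookieLine, String name) {
        String prefix = name + "=";
        if (!setCookieLine.startsWith(prefix)) {
            return null;
        }
        int end = setCookieLine.indexOf(';');
        if (end < 0) {
            end = setCookieLine.length();
        }
        return setCookieLine.substring(prefix.length(), end);
    }
}
```

Matching on the prefix rather than anywhere in the line avoids picking up a different cookie whose attributes happen to contain the same text.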

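Fetching Data with GraphQL

  Not all data is available through the RESTful API; for some of it, such as a problem's description, LeetCode uses GraphQL. I also wrote an earlier post on this, "Crawling LeetCode Problems: How to Send a GraphQL Query to Fetch Data", which has more detail. The main point here is how to work out which query to send. In Chrome, press F12 and go to the Network panel. In the request's Headers, the Request Payload contains a query field, which is the key piece of information for constructing our GraphQL query:

[Image: the query field in the Request Payload, as seen in Chrome DevTools]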
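
For illustration, the sketch below wraps a query in a JSON request payload with a query field, as seen in the browser. It assumes the endpoint accepts the common JSON form of a GraphQL request; the listing later in this article sends the raw query with an application/graphql Content-Type instead. The class and method names are hypothetical, and no JSON library is used so that the example stays self-contained.

```java
// Hypothetical helper: wraps a GraphQL query in a JSON request payload.
final class GraphQlPayloadSketch {
    // The query this article uses to fetch a problem's content.
    // The slug is inserted as-is; LeetCode slugs such as "two-sum"
    // contain no characters that need escaping.
    static String questionContentQuery(String titleSlug) {
        return "query{question(titleSlug:\"" + titleSlug + "\") {content}}";
    }

    // Minimal JSON string escaping: backslash, double quote, control characters.
    static String jsonEscape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\\': out.append("\\\\"); break;
                case '"':  out.append("\\\""); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    // Produces a JSON object with a single "query" field.
    static String payload(String query) {
        return "{\"query\":\"" + jsonEscape(query) + "\"}";
    }
}
```

With this form the request would carry Content-Type: application/json, and the quotes inside the query must be escaped because the query itself is a JSON string value.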

Other

Fetching a problem's description

public String getProblemDescription(String problemTitle) throws IOException {
    String problemDescriptionString = "";
    String postBody = "query{question(titleSlug:\"" + problemTitle + "\") {content}}\n";
    RequestBody requestBody = RequestBody.create(MediaType.parse("application/graphql; charset=utf-8"), postBody);
    Headers headers = new Headers.Builder()
                .add("Content-Type", "application/graphql")
                .add("Referer", "https://leetcode.com/problems/" + problemTitle)
                .add("Cookie", "__cfduid=" + Login.__cfduid + ";" + "csrftoken=" + Login.csrftoken + ";" + "LEETCODE_SESSION=" + Login.LEETCODE_SESSION)
                .add("x-csrftoken", Login.csrftoken)
                .build();

    Response graphqlResponse = okHttpHelper.post(URL.GRAPHQL, requestBody, headers);

    if (graphqlResponse != null) {
        ProblemContentBean problemContentBean = okHttpHelper.fromJson(graphqlResponse.body().string(), ProblemContentBean.class);
        problemDescriptionString = problemContentBean.getData().getQuestion().getContent();

        graphqlResponse.close();
    } else {
        // TODO: report the error
    }
    return problemDescriptionString;
}

Fetching the code submitted for a problem in a given language

public String getSubmissionCode(String submissionUrl) throws IOException {
    String url = URL.LEETCODE + submissionUrl;
    if (Main.isDebug)   out.println(url);
    String codeString = null;

    Headers headers = new Headers.Builder()
            .add("Cookie", "__cfduid=" + Login.__cfduid + ";" + "csrftoken=" + Login.csrftoken + ";" + "LEETCODE_SESSION=" + Login.LEETCODE_SESSION)
            .build();

    Response response = okHttpHelper.get(url, headers);

    if (response != null) {
        String htmlString = response.body().string();

        Document document = Jsoup.parse(htmlString);
        Elements elements = document.getElementsByTag("script");
        for (Element element : elements) {
            int indexStart = element.toString().indexOf("submissionCode: '");
            if (indexStart > -1) {
                int indexTo = element.toString().indexOf("editCodeUrl");
                codeString = element.toString().substring(indexStart + ("submissionCode: '").length(), indexTo - 5);
                break;
            }
        }

        response.close();
    } else {
        // TODO: handle the error
    }

    codeString = encode(codeString);

    return codeString;
}
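
The body of encode() is not shown in this article. If the submissionCode string embedded in the page carries \uXXXX escapes (an assumption), a post-processing step along the following lines would turn them back into plain characters. UnicodeUnescapeSketch is a hypothetical stand-in, not the author's implementation.

```java
// Hypothetical stand-in for a post-processing step such as encode().
final class UnicodeUnescapeSketch {
    // Parses four hex digits starting at `from`; returns -1 if any is not hex.
    static int hex4(String s, int from) {
        int value = 0;
        for (int k = from; k < from + 4; k++) {
            int digit = Character.digit(s.charAt(k), 16);
            if (digit < 0) {
                return -1;
            }
            value = value * 16 + digit;
        }
        return value;
    }

    // Replaces each backslash-u sequence followed by four hex digits with
    // the character it denotes; everything else is copied unchanged.
    static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            if (i + 6 <= s.length() && s.charAt(i) == '\\' && s.charAt(i + 1) == 'u') {
                int code = hex4(s, i + 2);
                if (code >= 0) {
                    out.append((char) code);
                    i += 6;
                    continue;
                }
            }
            out.append(s.charAt(i));
            i++;
        }
        return out.toString();
    }
}
```

Malformed sequences are left untouched rather than rejected, so a stray backslash in the submitted code does not abort the crawl.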

Fetching the submitted code for the languages specified in the config file

public synchronized Map<String, String> getSubmissions(String problemTitle, ResultBean resultBean) throws IOException {
    if (Main.isDebug)   out.println("pre problemTitle = " + problemTitle);
    // Maps each language to its submitted code
    Map<String, String> submissionMap = new HashMap<>();
    int offset = 0;
    int limit = 10;
    boolean hasNext = true;
    String lastKey = "";

    List<String> languageList = Config.getSingleton().getLanguageList();
    // Languages whose code is already saved locally
    List<String> savedLanguageList = resultBean != null ? resultBean.getLanguage() : new ArrayList<>(0);

    // Tracks whether the code for each language has been fetched yet
    Map<String, Boolean> languageMap = new HashMap<>();
    for (int i = 0; i < languageList.size(); i++) {
        boolean hasExist = false;
        // The lists are small, so a brute-force search is fine
        for (int j = 0; j < savedLanguageList.size(); j++) {
            if (languageList.get(i).equals(savedLanguageList.get(j))) {
                hasExist = true;
                break;
            }
        }
        if (!hasExist)  languageMap.put(languageList.get(i), false);
    }

    // The code for every requested language of this problem is already saved locally
    if (languageMap.size() == 0)    return submissionMap;

    while(hasNext) {
        String submissionsUrl = String.format(URL.SUBMISSIONS_FORMAT, problemTitle, offset, limit, lastKey);

        Headers headers = new Headers.Builder()
                    .add("Cookie", "__cfduid=" + Login.__cfduid + ";" + "csrftoken=" + Login.csrftoken + ";" + "LEETCODE_SESSION=" + Login.LEETCODE_SESSION)
                    .build();

        Response response = okHttpHelper.get(submissionsUrl, headers);

        if (response != null) {
            String responseData = response.body().string();

            SubmissionBean submissionBean = okHttpHelper.fromJson(responseData, SubmissionBean.class);
            List<SubmissionBean.SubmissionsDumpBean> submissionsDumpList = submissionBean.getSubmissions_dump();

            if (submissionsDumpList == null) {
                if (Main.isDebug) {
                    out.println("submissionsUrl = " + submissionsUrl);
                    out.println("problemTitle = " + problemTitle);
                    out.println("responseData = " + responseData);
                    out.println("status message = " + response.message());
                    out.println("message code = " + response.code());
                }

                // Retry the same page; close the response first so the connection is not leaked
                response.close();
                continue;
            }

            for (int i = 0; i < submissionsDumpList.size(); i++) {
                SubmissionBean.SubmissionsDumpBean submission = submissionsDumpList.get(i);
                String language = submission.getLang();
                if (languageMap.containsKey(language) && !languageMap.get(language) && "Accepted".equals(submission.getStatus_display())) {
                    submissionMap.put(language, getSubmissionCode(submission.getUrl()));
                    languageMap.put(language, true);
                }
            }

            // Paging logic: advance the offset by one page
            hasNext = submissionBean.isHas_next();
            offset += limit;
            lastKey = submissionBean.getLast_key();

            response.close();
        } else {
            // TODO: handle the failed request; stop paging so the loop cannot spin forever
            hasNext = false;
        }
    }

    return submissionMap;
}
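
The selection step inside the loop above can be isolated and tested without any network access. The sketch below uses a hypothetical Submission class as a stand-in for one submissions_dump entry and returns a flag that a caller could use to stop paging as soon as every wanted language has been found; that early exit is a suggestion, not something the listing above does.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the per-page selection step in getSubmissions().
final class SubmissionPickSketch {
    // Minimal stand-in for one entry of submissions_dump.
    static final class Submission {
        final String lang;
        final String statusDisplay;
        final String url;

        Submission(String lang, String statusDisplay, String url) {
            this.lang = lang;
            this.statusDisplay = statusDisplay;
            this.url = url;
        }
    }

    // For each wanted language not yet picked, records the URL of the first
    // Accepted submission encountered on this page. Returns true once every
    // wanted language has an entry, so the caller may stop paging early.
    static boolean pickAccepted(List<Submission> page, Set<String> wanted,
                                Map<String, String> picked) {
        for (Submission s : page) {
            if (wanted.contains(s.lang)
                    && !picked.containsKey(s.lang)
                    && "Accepted".equals(s.statusDisplay)) {
                picked.put(s.lang, s.url);
            }
        }
        return picked.keySet().containsAll(wanted);
    }

    // Offset of the next page: one page further, not multiplied.
    static int nextOffset(int offset, int limit) {
        return offset + limit;
    }
}
```

With limit set to 10, nextOffset yields 10, 20, 30 and so on, which matches the offset and limit parameters of the submissions URL.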

  More detailed code is available on GitHub: LeetCodeCrawler
