Crawling World Wild Web at Scale

Posted by Omni-Space on 2017-07-11

In this post we discuss some of the existing technologies for scraping, parsing and analyzing web pages. We also talk about some of the challenges software engineers might face while scraping dynamic web pages.

Scraping/Parsing/Mining Web Pages

In September 2012, the iPhone 5 was released. We wanted to find out how people reacted to the new iPhone, so we set out to write a simple Python script that scrapes and parses online reviews and runs a sentiment analysis on them. Automated opinion mining of this kind has many applications: companies are often interested in their customers' reactions to a new product release.

To scrape the reviews we used Python's urllib module. To parse the page contents and grab the required HTML elements we used a Python library called BeautifulSoup. For sentiment analysis, we found an API built on top of the Python NLTK library for classification. Here is the link to the sentiment API. Finally, to send HTTP requests to the text-processing website, we used another Python library called PycURL, which is essentially a Python interface to libcurl. You can view or pull the opinion mining code from its GitHub repo: Opinion Mining.
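As a rough illustration of the last step, the sketch below builds a form-encoded POST request that carries one review. It uses the standard library's urllib.request instead of PycURL. The endpoint URL and the "text" field name are assumptions made for illustration, not details taken from the original code.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint: replace with the real URL of the sentiment API you use.
SENTIMENT_ENDPOINT = "https://sentiment.example.com/api/sentiment/"


def build_sentiment_request(review_text, endpoint=SENTIMENT_ENDPOINT):
    """Build (but do not send) a form-encoded POST request carrying one review."""
    # Assumption: the API expects the review in a form field named "text".
    body = urlencode({"text": review_text}).encode("utf-8")
    return Request(endpoint, data=body, method="POST")


request = build_sentiment_request("The new phone is great")
print(request.get_method(), request.data)
```

To actually send the request you would pass it to urllib.request.urlopen and parse the response in whatever format the chosen API returns.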

Please note that there are other solutions for scraping the web. For instance, Scrapy is another open source framework for scraping websites and extracting the needed information.

Scraping Amazon Reviews

For another research project, we were interested in scraping reviews of the George Foreman Grill from the Amazon website. To scrape this page, open it in the Chrome browser and use the Chrome inspector to examine the review elements and figure out which HTML elements you need to grab when parsing the page. If you inspect one of the reviews, you will see that it is wrapped in a 'div' element with the style class 'reviewText'.

The code snippet for scraping and parsing Amazon reviews is shown below:


          import urllib.request
          from bs4 import BeautifulSoup

          amazon_url = "..."  # add the link to the Amazon page here
          ur = urllib.request.urlopen(amazon_url)
          soup = BeautifulSoup(ur.read(), "html.parser")
          posts = soup.select("div.reviewText")
          print(posts[0].text)  # this prints the first review

Note how we grab the div elements for the reviews by filtering on the style class. You can check the Beautiful Soup documentation for more details. With the above snippet, one can get all the reviews successfully.
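To make the idea of filtering by style class concrete, here is a dependency-free sketch that does a similar job with the standard library's html.parser instead of BeautifulSoup. The class name ClassTextExtractor and the sample HTML are made up for illustration. It is a simplified stand-in, not how BeautifulSoup works internally.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every <tag> element that carries css_class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0   # greater than 0 while we are inside a matching element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            # Track nested tags of the same name so we know when the match closes.
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag and self.css_class in classes:
            self.depth = 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


# Made-up sample markup with two reviews and one unrelated element.
sample_html = (
    '<div class="reviewText">Great grill.</div>'
    '<div class="other">ad</div>'
    '<div class="reviewText big">Heats <b>fast</b>.</div>'
)
extractor = ClassTextExtractor("div", "reviewText")
extractor.feed(sample_html)
print(extractor.texts)
```

For real pages BeautifulSoup remains the more robust choice, since it tolerates malformed markup and supports full CSS selectors.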

Scraping Macy's Reviews

We also wanted to scrape the reviews for the same product from the Macy's website. So, let's try the same approach as before and see what we get. The only difference is that when you inspect the Macy's page, you see that each review is wrapped in a 'span' element with the style class 'BVRRReviewText'. So, we make the following change to our snippet:


          macys_url = "..."  # add the link to the Macy's page
          ur = urllib.request.urlopen(macys_url)
          soup = BeautifulSoup(ur.read(), "html.parser")
          posts = soup.select("span.BVRRReviewText")
          print(posts[0].text)  # in theory, this should print the first review

If you try the above code, you won't get any review content. More interestingly, if you print ur.read() right after the second line and ignore the rest of the code, you will see that the HTML returned by the server does not contain the review text at all. Why?

The issue is that the Macy's reviews are populated by Ajax calls to their web server after the page has loaded. In other words, this is not a statically loaded HTML page, so fetching it with urllib alone does not work here.
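A quick way to check for this situation is to count how often the review class appears in the raw HTML. The sketch below is a rough heuristic and not part of the original code. The function name and both sample pages are made up for illustration.

```python
import re


def count_class_occurrences(html, css_class):
    """Count class attributes in raw HTML that mention css_class."""
    pattern = (
        r'class\s*=\s*["\'][^"\']*\b'
        + re.escape(css_class)
        + r'\b[^"\']*["\']'
    )
    return len(re.findall(pattern, html))


# Made-up pages: one with reviews in the markup, one filled in later by JavaScript.
static_page = '<div class="reviewText">Great grill.</div><div class="reviewText">Too small.</div>'
dynamic_page = '<div id="BVRRContainer"></div><script src="reviews.js"></script>'

print(count_class_occurrences(static_page, "reviewText"))       # prints 2
print(count_class_occurrences(dynamic_page, "BVRRReviewText"))  # prints 0
```

A count of zero for a class that is clearly visible in the browser inspector suggests that the content is injected by JavaScript after the initial page load.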

How to Scrape Dynamically Loaded Web Pages?

To resolve the above issue, you can figure out which POST call to the Macy's web server populates the reviews, and then make that POST request yourself to retrieve them. The other possible solution is to use a framework or library that simulates the operation of a browser. Here, we are going to use PhantomJS, a headless WebKit browser scriptable with a JavaScript API, to scrape the reviews from Macy's.

You can download and use PhantomJS on your machine by following these instructions: How to build PhantomJS. The code below is our workaround for getting the Macy's reviews using PhantomJS:


// Get Macys reviews
var page = require('webpage').create(),
    url = 'http://www1.macys.com/shop/product/george-foreman-grp95r-grill-6-servings?ID=797879';
page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var results = page.evaluate(function () {
            var allSpans = document.getElementsByTagName('span');
            var reviews = [];
            for (var i = 0; i < allSpans.length; i++) {
                if (allSpans[i].className === 'BVRRReviewText') {
                    reviews.push(allSpans[i].innerHTML);
                }
            }
            return reviews;
        });
        console.log(results.join('\n'));
    }
    phantom.exit();
});

The above code grabs the reviews by collecting the span elements with the style class 'BVRRReviewText'. Another possible solution that we found is ghost.py, but we did not get a chance to test it.

To check out the simple crawler and sentiment analyzer that we developed for crawling Amazon and Macy's reviews, visit our GitHub repository here: opinion-mining.

Scraping Web at Scale

One problem that we often come across when scraping the web at scale is how time-consuming it is. Most of the time we need to scrape billions of pages. This gives rise to the need for a distributed solution that reduces the scraping time.

One simple solution that we came up with for designing a distributed web crawler was to use a work queue to distribute time-consuming tasks among multiple workers (i.e. web crawlers). The basic idea is to use work queues to schedule the scraping and parsing of many pages across multiple workers running simultaneously.

You can view a simple example of a distributed web crawler here: Distributed Crawling with RMQ
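The work-queue idea can also be tried out on a single machine. The sketch below is a local stand-in that uses Python's queue.Queue and threads rather than a message broker. The names crawl_with_workers and fake_fetch are made up for illustration, and fake_fetch replaces a real download so that the example runs without network access.

```python
import queue
import threading


def crawl_with_workers(urls, fetch, num_workers=4):
    """Distribute URLs over worker threads through a shared work queue."""
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            url = tasks.get()
            if url is None:           # sentinel: no more work for this worker
                tasks.task_done()
                break
            try:
                page = fetch(url)
            except Exception as exc:  # keep the worker alive on a failed fetch
                page = exc
            with lock:
                results[url] = page
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)               # one sentinel per worker
    tasks.join()
    for t in threads:
        t.join()
    return results


def fake_fetch(url):
    """Stand-in for a real download so the sketch runs without network access."""
    return "<html>content of %s</html>" % url


pages = crawl_with_workers(
    ["http://example.com/a", "http://example.com/b"], fake_fetch, num_workers=2
)
print(sorted(pages))
```

In a truly distributed setup the in-process queue would be replaced by a broker such as RabbitMQ, and each worker would run as its own process, possibly on a different machine.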

Last words

So, this was a quick review of scraping web pages and some of the challenges you may encounter when you go out into the world wild web.

Source: http://www.aioptify.com/crawling.php
