Crawling World Wild Web at Scale
In this post we discuss some of the existing technologies for scraping, parsing and analyzing web pages. We also talk about some of the challenges software engineers might face while scraping dynamic web pages.
Scraping/Parsing/Mining Web Pages
In September 2012, the iPhone 5 was released, and we wanted to find out how people reacted to it. Our plan was to write a simple Python script that scrapes and parses online reviews and then runs sentiment analysis on the collected text. Automated opinion mining of this kind has many applications: companies often want to know how their customers react to a new product release.
For scraping reviews we used the Python urllib module. For parsing page contents and grabbing the required HTML elements we used a Python library called BeautifulSoup. For sentiment analysis, we found an API built on top of the Python NLTK library for classification (the sentiment API on the text-processing website). Finally, for sending HTTP requests to the text-processing website, we used another Python library called PycURL, which is basically a Python interface to libcurl. You can pull or view the opinion mining code from its GitHub repo: Opinion Mining.
Please note that there are other solutions for scraping the web. For instance, another open source framework for scraping websites and extracting the needed information is Scrapy.
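To make the sentiment step above more concrete, here is a minimal sketch of sending one review to a sentiment service and reading back its label. It uses the standard library's urllib.request rather than PycURL, which is a swap made only to keep the example dependency-free. The endpoint URL and the JSON response shape (a top-level "label" field) are assumptions, and the function names are ours, not taken from the opinion mining repo.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint: the original post's link to the sentiment API is not preserved.
SENTIMENT_URL = "http://text-processing.com/api/sentiment/"


def build_sentiment_request(review_text, url=SENTIMENT_URL):
    """Build a form-encoded POST request carrying one review."""
    body = urlencode({"text": review_text}).encode("utf-8")
    return Request(url, data=body, method="POST")


def parse_sentiment_label(response_body):
    """Read the label from a JSON body assumed to look like {"label": "pos", ...}."""
    return json.loads(response_body)["label"]


def classify(review_text):
    """Send one review to the service and return its label (needs network access)."""
    with urlopen(build_sentiment_request(review_text), timeout=10) as resp:
        return parse_sentiment_label(resp.read().decode("utf-8"))
```

Keeping the request building and the response parsing in separate functions makes both easy to check without touching the network; only classify() actually sends anything.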
Scraping Amazon Reviews
For another research project, we were interested in scraping reviews of the George Foreman Grill from the Amazon website. To scrape this page, open it in Chrome and use the Chrome inspector to examine the review elements and figure out which HTML elements you need to grab when parsing the page. If you inspect one of the reviews, you will see that it is wrapped in a 'div' element with the style class 'reviewText'.
The code snippet for scraping and parsing Amazon reviews is shown below:
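# Python 3 version; BeautifulSoup comes from the bs4 package
from urllib.request import urlopen
from bs4 import BeautifulSoup

amazon_url = "..."  # add the link to the Amazon page here
ur = urlopen(amazon_url)
soup = BeautifulSoup(ur.read(), "html.parser")
posts = soup.select("div.reviewText")  # every div with class 'reviewText'
print(posts[0].text)  # this prints the first review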
Notice how we grab the div elements for the reviews by filtering on the style class. You can check the Beautiful Soup documentation for more details. With the above snippet, one can get all the reviews successfully.
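If you are curious what filtering by style class amounts to, the sketch below does the same job with only the standard library's html.parser. It is a dependency-free illustration rather than a replacement for BeautifulSoup, and the class and function names (ReviewTextParser, extract_reviews) are made up for this example.

```python
from html.parser import HTMLParser


class ReviewTextParser(HTMLParser):
    """Collect the text of every <div> whose class list contains class_name."""

    def __init__(self, class_name="reviewText"):
        super().__init__()
        self.class_name = class_name
        self.reviews = []
        self._depth = 0    # how many <div>s deep we are inside a matching div
        self._chunks = []  # text pieces of the review being collected

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            self._depth += 1  # nested div inside a review
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:  # the matching div just closed
                self.reviews.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_reviews(html, class_name="reviewText"):
    """Return the text of all divs carrying the given style class."""
    parser = ReviewTextParser(class_name)
    parser.feed(html)
    return parser.reviews
```

For example, extract_reviews('<div class="reviewText">Works well</div>') returns a one-element list containing the review text. In practice the one-line select() call is the more convenient choice.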
Scraping Macys Reviews
We also wanted to scrape the reviews for the same product from the Macys website. So, let's try the same approach as shown earlier and see what we get. The only difference is that when you inspect the Macys page, you see that each review is wrapped in a span element with the style class 'BVRRReviewText'. So, we make the following change to our snippet:
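# Python 3 version; BeautifulSoup comes from the bs4 package
from urllib.request import urlopen
from bs4 import BeautifulSoup

macys_url = "..."  # add the link to the Macys page
ur = urlopen(macys_url)
soup = BeautifulSoup(ur.read(), "html.parser")
posts = soup.select("span.BVRRReviewText")
print(posts[0].text)  # this should print the first review in theory!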
If you try the above code, you won't get any review content: the selector matches nothing. More interestingly, if you print the result of ur.read() right after the urlopen call and ignore the rest of the code, you will see that the returned HTML contains none of the review text. Why?
The issue is that the Macys reviews are populated by Ajax calls to their web server after the page has loaded. In other words, the reviews are not part of the statically served HTML page, so fetching the page with urllib alone does not work here.
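A quick way to confirm this diagnosis is to look at the raw HTML before parsing it. The sketch below is only a heuristic under our own assumptions: it counts class attributes in the raw markup that include the review class, and flags a page as probably script-populated when there are none but the page does load scripts. The function names are hypothetical.

```python
import re

# Matches class="..." or class='...' attributes in raw markup.
CLASS_ATTR = re.compile(r'class\s*=\s*["\']([^"\']*)["\']', re.IGNORECASE)


def count_class_occurrences(html, class_name):
    """Count class attributes in the raw HTML that include class_name."""
    return sum(class_name in m.group(1).split() for m in CLASS_ATTR.finditer(html))


def looks_script_populated(html, class_name):
    """Heuristic: no element carries the class, but the page does load scripts."""
    return count_class_occurrences(html, class_name) == 0 and "<script" in html.lower()
```

If looks_script_populated(raw_html, "BVRRReviewText") comes back True for the Macys page, a plain HTTP fetch will not be enough and something has to execute the page's JavaScript.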
How to Scrape Dynamically Loaded Web Pages?
To resolve the above issue, one option is to figure out which link on their web server Macys calls with a POST request to populate the reviews, and then make that POST request yourself. The other possible solution is to use a framework or library that simulates the operation of a browser. Here, we are going to use PhantomJS, a headless WebKit scriptable with a JavaScript API, to scrape reviews from Macys.
You can download and use PhantomJS on your machine by following these instructions: How to build PhantomJS. The code below is our hack for getting the Macys reviews using PhantomJS:
// Get Macys reviews
var page = require('webpage').create(),
    url = 'http://www1.macys.com/shop/product/george-foreman-grp95r-grill-6-servings?ID=797879';
page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var results = page.evaluate(function() {
            var allSpans = document.getElementsByTagName('span');
            var reviews = [];
            for(var i = 0; i < allSpans.length; i++) {
                if(allSpans[i].className === 'BVRRReviewText') {
                    reviews.push(allSpans[i].innerHTML);
                }
            }
            return reviews;
        });
        console.log(results.join('\n'));
    }
    phantom.exit();
});
The code above grabs the reviews by collecting the span elements with the style class 'BVRRReviewText'. Another possible solution we found is ghost.py, but we did not get a chance to test it.
To check out the simple crawler and sentiment analyzer that we developed for crawling Amazon and Macys reviews, visit our GitHub repository here: opinion-mining.
Scraping Web at Scale
One problem that we often come across when scraping the web at scale is how time-consuming it is. Most of the time we need to scrape billions of pages, which raises the need for a distributed solution that cuts the scraping time.
One simple solution that we came up with for designing a distributed web crawler was to use a work queue to distribute time-consuming tasks among multiple workers (i.e. web crawlers). The basic idea is to use work queues to schedule the scraping and parsing of many pages across multiple workers running simultaneously.
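The work queue idea can be tried out on a single machine before bringing in a message broker. The sketch below is an in-process stand-in built on Python's queue and threading modules, not the distributed setup from the linked example: URLs go into a shared queue and a fixed number of worker threads pull from it. The crawl function and its fetch parameter are hypothetical names; fetch is whatever function you supply to download and parse one page.

```python
import queue
import threading


def crawl(urls, fetch, num_workers=4):
    """Spread URLs over worker threads through a shared work queue.

    fetch is a caller-supplied function mapping a URL to a result.
    Returns a dict of URL -> result (or the exception raised for that URL).
    """
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            url = tasks.get()
            if url is None:  # sentinel: no more work for this worker
                tasks.task_done()
                return
            try:
                outcome = fetch(url)
            except Exception as exc:  # one bad page should not kill the worker
                outcome = exc
            with lock:
                results[url] = outcome
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    tasks.join()
    for t in threads:
        t.join()
    return results
```

In a truly distributed crawler the in-memory queue would be replaced by a broker-backed one and the threads by separate worker processes or machines, but the shape stays the same: producers enqueue URLs, workers dequeue and scrape them.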
You can view a simple example for a distributed web crawler here: Distributed Crawling with RMQ
Last words
So, this was a quick review of scraping web pages and some of the challenges you may encounter when you go out into the world wild web.
Source: http://www.aioptify.com/crawling.php