Tuesday, 30 June 2015

Tuesday, 23 June 2015

Making data on the web useful: scraping


Many times data is not easily accessible – although it does exist. As much as we wish everything was available in CSV or the format of our choice – most data is published in different forms on the web. What if you want to use the data to combine it with other datasets and explore it independently?

Scraping to the rescue!

Scraping is the practice of extracting data hidden in documents, such as web pages and PDFs, and making it usable for further processing. It is among the most useful skills if you set out to investigate data, and most of the time it’s not especially challenging. For the simplest kinds of scraping you don’t even need to know how to write code.

This example relies heavily on Google Chrome for the first part. Some things work well with other browsers; however, we will be using one specific browser extension that is only available on Chrome. If you can’t install Chrome, don’t worry: the principles remain similar.

Code-free Scraping in 5 minutes using Google Spreadsheets & Google Chrome

Knowing the structure of a website is the first step towards extracting and using the data. Let’s get our data into a spreadsheet – so we can use it further. An easy way to do this is provided by a special formula in Google Spreadsheets.

Save yourselves hours of time in copy-paste agony with the ImportHTML command in Google Spreadsheets. It really is magic!


In order to complete the next challenge, take a look in the Handbook at one of the following recipes:

    Extracting data from HTML tables.

    Scraping using the Scraper Extension for Chrome

Both methods are useful for:

    Extracting individual lists or tables from single webpages

The latter can do slightly more complex tasks, such as extracting nested information. Take a look at the recipe for more details.

Neither will work for:

    Extracting data spread across multiple webpages


Task: Find a website with a table and scrape the information from it. Share your result on datahub.io (make sure to tag your dataset with schoolofdata.org)


Once you’ve got your table into the spreadsheet, you may want to move it around, or put it in another sheet. Right click the top left cell and select “paste special” – “paste values only”.

Scraping more than one webpage: Scraperwiki

Note: Before proceeding into full scraping mode, it’s helpful to understand the flesh and bones of what makes up a webpage. Read the Introduction to HTML recipe in the handbook.

Until now we’ve only scraped data from a single webpage. What if there are more? Or you want to scrape complex databases? You’ll need to learn how to program – at least a bit.

It’s beyond the scope of this course to teach you how to scrape. Our aim here is to help you understand whether it is worth investing your time to learn, and to point you at some useful resources to help you on your way!

Structure of a scraper

Scrapers comprise three core parts:

1.    A queue of pages to scrape
2.    An area for structured data to be stored, such as a database
3.    A downloader and parser that adds URLs to the queue and/or structured information to the database.
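To make these three parts concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not how ScraperWiki itself works: the names `crawl`, `LinkAndTitleParser` and `FAKE_SITE` are made up, a plain dictionary stands in for the database, and the "downloader" reads from an in-memory fake site so the example runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Collects the page <title> and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(start_url, fetch, max_pages=50):
    queue = deque([start_url])  # 1. the queue of pages to scrape
    database = {}               # 2. structured data store (a dict stands in for a database)
    seen = {start_url}
    while queue and len(database) < max_pages:
        url = queue.popleft()
        html = fetch(url)       # 3. the downloader ...
        if html is None:
            continue
        parser = LinkAndTitleParser()
        parser.feed(html)       # ... and the parser
        database[url] = parser.title.strip()
        for href in parser.links:
            link = urljoin(url, href)
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return database


# Hypothetical two-page site held in memory, so the sketch runs offline.
FAKE_SITE = {
    "http://example.org/": '<title>Home</title><a href="/page2">next</a>',
    "http://example.org/page2": '<title>Page 2</title><a href="/">back</a>',
}

result = crawl("http://example.org/", FAKE_SITE.get)
print(result)
```

In a real scraper you would swap `FAKE_SITE.get` for a function that downloads the page, and store more than just the title.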

Fortunately for you there is a good website for programming scrapers: ScraperWiki.com

ScraperWiki has two main functions: you can write scrapers yourself, which can optionally run on a schedule and whose data is available to everyone visiting, or you can ask ScraperWiki to write scrapers for you. The latter costs some money; however, it helps to contact the ScraperWiki community (Google Group), as someone might get excited about your project and help you!

If you are interested in writing scrapers with Scraperwiki, check out this sample scraper – scraping some data about Parliament. Click View source to see the details. Also check out the Scraperwiki documentation: https://scraperwiki.com/docs/python/

When should I make the investment to learn how to scrape?

A few reasons (non-exhaustive list!):

1.    If you regularly have to extract data where there are numerous tables in one page.

2.    If your information is spread across numerous pages.

3.    If you want to run the scraper regularly (e.g. if information is released every week or month).

4.    If you want things like email alerts if information on a particular webpage changes.

…And you don’t want to pay someone else to do it for you!


In this course we’ve covered Web scraping and how to extract data from websites. The main function of scraping is to convert data that is semi-structured into structured data and make it easily useable for further processing. While this is a relatively simple task with a bit of programming – for single webpages it is also feasible without any programming at all. We’ve introduced =importHTML and the Scraper extension for your scraping needs.

Further Reading

1.    Scraping for Journalism: A Guide for Collecting Data: ProPublica Guides

2.    Scraping for Journalists (ebook): Paul Bradshaw

3.    Scrape the Web: Strategies for programming websites that don’t expect it: Talk from PyCon

4.    An Introduction to Compassionate Screen Scraping: Will Larson

Any questions? Got stuck? Ask School of Data!

Thursday, 18 June 2015

Scraping Services - Assuring Scraping Success with Proxy Data Scraping

Have you ever heard of "Data Scraping?" Data Scraping is the process of collecting useful data that has been placed in the public domain of the internet (private areas too if conditions are met) and storing it in databases or spreadsheets for later use in various applications. Data Scraping technology is not new and many a successful businessman has made his fortune by taking advantage of data scraping technology.

Sometimes website owners may not derive much pleasure from automated harvesting of their data. Webmasters have learned to deny web scrapers access to their websites by using tools or methods that block certain IP addresses from retrieving website content. Data scrapers are left with the choice to either target a different website, or to move the harvesting script from computer to computer, using a different IP address each time and extracting as much data as possible until all of the scraper's computers are eventually blocked.

Thankfully there is a modern solution to this problem. Proxy Data Scraping technology solves the problem by using proxy IP addresses. Every time your data scraping program executes an extraction from a website, the website thinks it is coming from a different IP address. To the website owner, proxy data scraping simply looks like a short period of increased traffic from all around the world. They have very limited and tedious ways of blocking such a script but more importantly -- most of the time, they simply won't know they are being scraped.

You may now be asking yourself, "Where can I get Proxy Data Scraping Technology for my project?" The "do-it-yourself" solution is, rather unfortunately, not simple at all. Setting up a proxy data scraping network takes a lot of time and requires that you own a bunch of IP addresses and suitable servers to be used as proxies, not to mention the IT guru you need to get everything configured properly. You could consider renting proxy servers from select hosting providers; that option tends to be quite pricey, but it is arguably better than the alternative: dangerous and unreliable (but free) public proxy servers.

There are literally thousands of free proxy servers located around the globe that are simple enough to use. The trick however is finding them. Many sites list hundreds of servers, but locating one that is working, open, and supports the type of protocols you need can be a lesson in persistence, trial, and error. However if you do succeed in discovering a pool of working public proxies, there are still inherent dangers of using them. First off, you don't know who the server belongs to or what activities are going on elsewhere on the server. Sending sensitive requests or data through a public proxy is a bad idea. It is fairly easy for a proxy server to capture any information you send through it or that it sends back to you. If you choose the public proxy method, make sure you never send any transaction through that might compromise you or anyone else in case disreputable people are made aware of the data.

A less risky scenario for proxy data scraping is to rent a rotating proxy connection that cycles through a large number of private IP addresses. Several companies offer this and claim to delete all web traffic logs, which allows you to harvest the web anonymously with minimal threat of reprisal. Such companies offer large-scale anonymous proxy solutions, but they often charge a fairly hefty setup fee to get you going.

The other advantage is that companies who own such networks can often help you design and implement a custom proxy data scraping program instead of trying to work with a generic scraping bot. After performing a simple Google search, I quickly found one company (www.ScrapeGoat.com) that provides anonymous proxy server access for data scraping purposes. Or, according to their website, if you want to make your life even easier, ScrapeGoat can extract the data for you and deliver it in a variety of different formats, often before you could even finish configuring your off-the-shelf data scraping program.

Whichever path you choose for your proxy data scraping needs, don't let a few simple tricks thwart you from accessing all the wonderful information stored on the world wide web!

Saturday, 6 June 2015

Twitter Scraper Python Library

I wanted to save the tweets from Transparency Camp. This prompted me to turn Anna’s basic Twitter scraper into a library. Here’s how you use it.

Import it. (It only works on ScraperWiki, unfortunately.)

from scraperwiki import swimport

search = swimport('twitter_search').search

Then search for terms.

search(['picnic #tcamp12', 'from:TCampDC', '@TCampDC', '#tcamp12', '#viphack'])

A separate search will be run on each of these phrases. That’s it.

A more complete search

Searching for #tcamp12 and #viphack didn’t get me all of the tweets because I waited like a week to do this. In order to get a more complete list of the tweets, I looked at the tweets returned from that first search; I searched for tweets referencing the users who had tweeted those tweets.

from scraperwiki.sqlite import save, select, get_var, save_var
from time import sleep

# Search by user to get some more.
# Only pick up users that sort after the last one we processed,
# so the script can resume where it left off.
users = [row['from_user'] + ' tcamp12' for row in
         select('distinct from_user from swdata where from_user > "%s"'
                % get_var('previous_from_user', ''))]

for user in users:
    search([user], num_pages = 2)
    # Remember how far we got in case the run is interrupted.
    save_var('previous_from_user', user)


By default, the search function retrieves 15 pages of results, which is the maximum. In order to save some time, I limited this second phase of searching to two pages, or 200 results; I doubted that there would be more than 200 relevant results mentioning a particular user.

The full script also counts how many tweets were made by each user.


Remember, this is a library, so you can easily reuse it in your own scripts, like Max Richman did.

Sunday, 31 May 2015

Data Scraping Services - Things to take care while doing Web Scraping!!!

Web scraping has become one of the most popular terms in data science. Basically, web scraping means extracting information from websites using pre-written programs and scripts. Many organizations have successfully used web scraping to build relevant and useful databases that they use on a daily basis to further their business interests. This is the age of Big Data, and web scraping is one of the trending techniques in data science.

Throughout my journey of learning web scraping and implementing many successful scraping projects, I have come across some great experiences we can learn from.  In this post, I’m going to discuss some of the approaches to take and approaches to avoid while executing web scraping.

Use Proxies: Anonymously scraping data from websites

You should not scrape a website from a single IP address. When you repeatedly request pages from the same address, there is a chance that the remote web server will block your IP address, preventing further requests. To overcome this, scrape websites through proxy servers (anonymous scraping), which hide your identity (network details) from the remote web server. This minimizes the risk of getting trapped and blacklisted by a website. You may also use a VPN instead of proxies to scrape websites anonymously.

Take maximum data and store it.

Do not process each web page on the fly as it arrives from the remote server. Instead, take all the information and store it to disk first. This approach pays off when your scraping algorithm breaks in the middle of a run, because you don’t have to start scraping again. Never download the same content more than once, as you are just wasting bandwidth. Try to download all content to disk in one go and then do the processing.
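One way to follow this advice is to put a small disk cache in front of the downloader. The sketch below is only an illustration: `cached_fetch` and `fake_download` are hypothetical names, and the stand-in downloader returns a fixed page so the example runs without a network connection.

```python
import hashlib
import tempfile
from pathlib import Path


def cached_fetch(url, download, cache_dir):
    """Return the raw page for url, downloading it only if it is not on disk yet."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL to get a safe, fixed-length file name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    path = cache_dir / name
    if path.exists():
        return path.read_bytes()
    raw = download(url)
    path.write_bytes(raw)  # store the untouched bytes; parse later
    return raw


# Demo with a stand-in downloader that records how often it is called.
calls = []


def fake_download(url):
    calls.append(url)
    return b"<html><body>raw page</body></html>"


cache = tempfile.mkdtemp()
first = cached_fetch("http://example.org/a", fake_download, cache)
second = cached_fetch("http://example.org/a", fake_download, cache)
print(len(calls))  # the second call is served from disk
```

If the parsing step crashes later, you can rerun it against the cached files without hitting the website again.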

Follow strict rules in parsing:

Check various rules while parsing the information from the web site. For example, if you expect a value to be a date, check that it really is a date. This can greatly improve the quality of the information. When you get unexpected data, the algorithm needs to be changed accordingly.
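The date check can be sketched in a few lines of Python. This is a minimal example, assuming the site publishes dates as YYYY-MM-DD; the function names and the sample rows are invented for illustration.

```python
from datetime import datetime


def parse_date(text, fmt="%Y-%m-%d"):
    """Return a date if text matches fmt exactly, otherwise None."""
    try:
        return datetime.strptime(text.strip(), fmt).date()
    except ValueError:
        return None


def validate_rows(rows):
    """Split scraped rows into (good, bad) depending on whether the date parses."""
    good, bad = [], []
    for row in rows:
        parsed = parse_date(row["date"])
        if parsed is None:
            bad.append(row)  # keep rejects so you can inspect them
        else:
            good.append({**row, "date": parsed})
    return good, bad


# Made-up scraped rows: one valid, one with an unexpected value.
rows = [
    {"date": "2015-06-30", "value": "12"},
    {"date": "n/a", "value": "7"},
]
good, bad = validate_rows(rows)
print(len(good), len(bad))
```

Keeping the rejected rows, rather than silently dropping them, tells you when the site has changed and the parser needs attention.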

Respect Robots.txt

Robots.txt specifies the set of rules that web crawlers and robots should follow. I strongly advise you to adjust your crawler to fully respect robots.txt. The file states which pages each user-agent is allowed to crawl and, on some sites, the required interval between page requests. Following these instructions minimizes the chance of getting blacklisted and banned by the website owner.
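Python ships a robots.txt parser in its standard library, `urllib.robotparser`. The sketch below feeds it a made-up robots.txt so it runs offline; the crawler name `example-scraper` and the URLs are assumptions for illustration. A real crawler would point the parser at the site's own file with `set_url()` and `read()`.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for the demo.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "example-scraper"  # hypothetical crawler name

# Ask before every request whether this user-agent may fetch the page.
allowed = robots.can_fetch(USER_AGENT, "http://example.org/data/table.html")
blocked = robots.can_fetch(USER_AGENT, "http://example.org/private/report.html")

# Crawl-delay is optional; crawl_delay() returns None when the site sets none.
delay = robots.crawl_delay(USER_AGENT)
print(allowed, blocked, delay)
```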

Use XPath Smartly

XPath is a nice option for selecting elements of an HTML document, and it is more flexible than CSS selectors. Be careful: HTML structure can change from page to page, so an XPath expression written for one page may fail to extract data on another.
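Here is a small sketch of that failure mode and one way to guard against it. It uses the limited XPath subset in Python's standard `xml.etree.ElementTree`, which only accepts well-formed markup; real-world HTML usually needs a more forgiving parser. The two sample pages and the `extract_price` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Two made-up pages from the same site; the second wraps the price differently.
PAGE_A = "<html><body><div class='item'><span class='price'>10</span></div></body></html>"
PAGE_B = "<html><body><div class='item'><p><b class='price'>12</b></p></div></body></html>"


def extract_price(page, paths):
    """Try each XPath in turn and return the first match, or None if none match."""
    root = ET.fromstring(page)
    for path in paths:
        node = root.find(path)
        if node is not None:
            return node.text
    return None  # signal a miss instead of crashing or returning garbage


# An expression tied to the exact structure of the first page ...
strict = [".//div[@class='item']/span[@class='price']"]
# ... and a fallback that only relies on the class name.
tolerant = strict + [".//div[@class='item']//*[@class='price']"]

print(extract_price(PAGE_A, strict))
print(extract_price(PAGE_B, strict))
print(extract_price(PAGE_B, tolerant))
```

Checking for `None` on every page lets you notice structure changes early rather than discovering holes in the data afterwards.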

Obey Website T&C:

Some websites make it absolutely clear in their terms and conditions that they are against web scraping of their content. Ignoring such terms can expose you to ethical and legal consequences.

Test sample scrape and verify the data with actual scrape

Once you are done setting up the web scraping project, you need to test it for some time. Check the extracted data. If something is wrong, find the cause, make changes accordingly, and repeat until the scraper works reliably.

Thursday, 28 May 2015

Web Scraping Services : What are the ethics of web scraping?

Someone recently asked: "Is web scraping an ethical concept?" I believe that web scraping is absolutely an ethical concept. Web scraping (or screen scraping) is a mechanism to have a computer read a website. There is absolutely no technical difference between an automated computer viewing a website and a human-driven computer viewing a website. Furthermore, if done correctly, scraping can provide many benefits to all involved.

There are a bunch of great uses for web scraping. First, services like Instapaper, which allow saving content for reading on the go, use screen scraping to save a copy of the website to your phone. Second, services like Mint.com, an app which tells you where and how you are spending your money, use screen scraping to access your bank's website (all with your permission). This is useful because banks do not provide many ways for programmers to access your financial data, even if you want them to. By getting access to your data, programmers can provide really interesting visualizations and insight into your spending habits, which can help you save money.

That said, web scraping can veer into unethical territory. This can take the form of reading websites much quicker than a human could, which can cause difficulty for the servers to handle it. This can cause degraded performance in the website. Malicious hackers use this tactic in what’s known as a "Denial of Service" attack.

Another aspect of unethical web scraping comes in what you do with that data. Some people will scrape the contents of a website and post it as their own, in effect stealing this content. This is a big no-no for the same reasons that taking someone else's book and putting your name on it is a bad idea. Intellectual property, copyright and trademark laws still apply on the internet and your legal recourse is much the same. People engaging in web scraping should make every effort to comply with the stated terms of service for a website. Even when in compliance with those terms, you should take special care in ensuring your activity doesn't affect other users of a website.

One of the downsides to screen scraping is it can be a brittle process. Minor changes to the backing website can often leave a scraper completely broken. Herein lies the mechanism for prevention: making changes to the structure of the code of your website can wreak havoc on a screen scraper's ability to extract information. Periodically making changes that are invisible to the user but affect the content of the code being returned is the most effective mechanism to thwart screen scrapers. That said, this is only a set-back. Authors of screen scrapers can always update them and, as there is no technical difference between a computer-backed browser and a human-backed browser, there's no way to 100% prevent access.

Going forward, I expect screen scraping to increase. One of the main reasons for screen scraping is that the underlying website doesn't have a way for programmers to get access to the data they want. As the number of programmers (and the need for programmers) increases over time, so too will the need for data sources. It is unreasonable to expect every company to dedicate the resources to build a programmer-friendly access point. Screen scraping puts the onus of data extraction on the programmer, not the company with the data, which can work out well for all involved.

Monday, 25 May 2015

Data Scraping - One application or multiple?

I have 30+ sources of data I scrape daily in various formats (xml, html, csv). Over the last three years I've built 20 or so C# console applications that go out, download the data and re-format it into a database. But I'm curious what other people are doing for this type of task. Are people building one tool that has a lot of variables and inputs, or are people designing 20+ programs to scrape and parse this data? Everything is hard-coded into each console application and run through Windows Task Scheduler.

Added a couple additional thoughts/details:

    Of the 30 sources, they all have unique properties, all are uploaded into individual MySQL tables and all have varying frequencies. For example, one data source is hit once a minute, another on 5 minute intervals. Majority are once an hour and once a day.

Currently I download the formats (xml, csv, html), parse them into a formatted csv and put them into staging folders. Within each folder, I run an application that reads a config file specific to the folder. When a new csv is added to the folder, the application then uploads the data into the specific MySQL tables designated in the config file.

I'm wondering if it is worth re-building all this into a larger, more complex program that is more capable of dynamically adding content and scrapes and adjusting to format changes.

Looking for outside thoughts.

5 Answers

What you are working on is basically ETL. So at a high level you need an extract component (get stuff), a transform component (map to known format) and a load component (take known format and put stuff somewhere). If you are comfortable being tied to an RDBMS you could use something like SQL Server SSIS packages. What I would do is create a host application that manages common aspects of the overall process (errors and pipeline processing). Then make the specifics of the E, T, and L pluggable. A low-ceremony way to get this would be to host the PowerShell runtime and create each session with common context objects that the scripts will use to communicate. You get a built-in pipe and filter model for scripts and easy, safe extensibility. This design has worked extremely well for my team in a similar situation.

Resist the temptation to rewrite.

However, for new code, you could plan for what you know has already happened. Write a retrieval mechanism that you can reuse through configuration. Write a translation mechanism that you can reuse (maybe in a library that you can call with very little code). Write a saving mechanism that can be called or configured.
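As a rough illustration of "reuse through configuration", here is a minimal sketch. It is written in Python for brevity rather than the C# used in the question, and every name in it (`EXTRACTORS`, `run_source`, the sample sources and field mappings) is hypothetical. An in-memory dict stands in for the MySQL tables.

```python
import csv
import io
import json


def extract_csv(raw):
    """Parse CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(raw)))


def extract_json(raw):
    """Parse a JSON array of objects."""
    return json.loads(raw)


# Retrieval/parsing is chosen by name, so a new format is one more entry here.
EXTRACTORS = {"csv": extract_csv, "json": extract_json}


def run_source(config, raw, load):
    """Run one source: extract by format, rename fields per config, then load."""
    rows = EXTRACTORS[config["format"]](raw)
    mapping = config["fields"]  # source field name -> target column name
    for row in rows:
        record = {target: row[source] for source, target in mapping.items()}
        load(config["table"], record)


# Hypothetical source definitions; in practice these could live in config files.
SOURCES = [
    {"format": "csv", "table": "prices", "fields": {"Item": "name", "Cost": "price"}},
    {"format": "json", "table": "stock", "fields": {"sku": "name", "qty": "quantity"}},
]
RAW = ["Item,Cost\nwidget,3\n", '[{"sku": "gear", "qty": 8}]']

tables = {}


def load_to_memory(table, record):
    """Stand-in for the database upload step."""
    tables.setdefault(table, []).append(record)


for config, raw in zip(SOURCES, RAW):
    run_source(config, raw, load_to_memory)

print(tables)
```

With this shape, source #22 is a new config entry plus, at most, one new extractor function.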

At this point, you've written #21(+). Now, the following ones can be handled with a tiny bit of code and configuration. Yay!

(You may want to implement this in a service that handles multiple conversions, but weigh the benefits of it versus the ability to separate errors in one module from the rest.)


It depends - if you need the scrapers to feed into a single application/database and have a uniform data format, it makes sense to have them all in a single program (possibly inheriting from a common base scraper).

If not and they are completely unrelated to each other, might as well keep them separate so changes in one have no effect on another.

Update, following edits to question:

Don't change things just for the sake of change. You have something that works, don't mess with it too much.

Since your data sources and data sinks are all separate from each other, combining them into one application will simply create a very complicated application that will be very difficult to change when needed.

Since the scrapers are separate, keep the separation as you have it now.

As sbrenton said, this mostly falls under ETL. You should check out Talend Open Studio. It specializes in handling data flows like I imagine yours are, as well as other things like duplicate removal and normalization of fields. It has tens or hundreds of drag-and-drop ETL components, and you can also write custom code, since Talend is a code generator as well; either Java or Perl are options. You can also use Talend to execute system commands. I use it for my ETL work, although not in production; in production we will use SSIS, mostly due to lots of other Microsoft products in house.

You may want to use some good scheduling library, like Quartz.NET.

In a few words, here's what you can expect:

  •     Your tasks are represented by classes and not processes
  •     You can set and forget tasks and scale across multiple servers
  •     You have an out-of-the-box system to actually take care of what is needed to be run when, what failed and needs to be re-run, etc. etc.

Sunday, 24 May 2015

Web scraping using Python without using large frameworks like Scrapy

If you need publicly available data from scraping the Internet, before creating a webscraper, it is best to check if this data is already available from public data sources or APIs. Check the site’s FAQ section or Google for their API endpoints and public data.

Even if their API endpoints are available you have to create some parser for fetching and structuring the data according to your needs.

Scrapy is a well established framework for scraping, but it is also a very heavy framework. For smaller jobs, it may be overkill and for extremely large jobs it is very slow.

So if you would like to roll up your sleeves and build your own scraper, continue reading.

Here are some basic steps performed by most webspiders:

1) Start with a URL and use an HTTP GET or POST request to access the URL
2) Fetch all the contents in it and parse the data
3) Store the data in any database or put it into any data warehouse
4) Enqueue all the URLs in a page
5) Use the URLs in queue and repeat from step 1

Here are the 3 major modules in every web crawler:

1) Request/Response handler.
2) Data parsing/data cleansing/data munging process.
3) Data serialization/data pipelines.

Lets look at each of these modules and see what they do and how to use them.

Request/Response handler

Request/response handlers are managers that make HTTP requests to a URL or a group of URLs, fetch the response objects as HTML content, and pass this data to the next module. If you use Python, the following libraries are most commonly used for this request/response process:

1) urllib (20.5. urllib – Open arbitrary resources by URL – Python v2.7.8 documentation) – Basic Python library, yet a high-level interface for fetching data across the World Wide Web.

2) urllib2 (20.6. urllib2 – extensible library for opening URLs – Python v2.7.8 documentation) – An extensible counterpart to urllib, which handles basic HTTP requests, digest authentication, redirections, cookies and more.

3) requests (Requests: HTTP for Humans) – A much more advanced request library, built on top of basic request handling libraries.
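As a minimal sketch of a request/response handler, the example below uses `urllib.request`, the Python 3 module into which the Python 2 `urllib` and `urllib2` libraries listed above were merged. The `fetch` helper and the user-agent string are illustrative assumptions. The demo call uses a `data:` URL so it runs without network access; a real spider would pass an http(s) URL.

```python
import urllib.request


def fetch(url, user_agent="example-scraper/0.1", timeout=10):
    """Fetch a URL and return its body decoded as text."""
    # Identify the spider honestly so site owners can see who is calling.
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset the server declares, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# A data: URL keeps the demo offline.
body = fetch("data:text/plain,hello%20scraper")
print(body)
```

The returned text is what gets handed to the parsing module described next.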

Data parsing/data cleansing/data munging process

This is the module where the fetched data is processed and cleaned. Unstructured data is transformed into structured during this processing. Usually  a set of Regular Expressions (regexes) which perform pattern matching and text processing tasks on the html data are used for this processing.

In addition to regexes, basic string manipulation and search methods are also used to perform this cleaning and transformation. You need a thorough knowledge of regular expressions so that you can design the regex patterns.
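To give a feel for this step, here is a small sketch that turns a fragment of HTML into structured records using one regex plus basic string cleanup. The HTML fragment and the pattern are made up for illustration; a pattern like this is tied to the exact markup and will need adjusting for any real site.

```python
import re

# Made-up HTML fragment standing in for a fetched page.
HTML = """
<ul>
  <li><a href="/report/2014.pdf">Annual   report 2014</a></li>
  <li><a href="/report/2015.pdf">Annual report 2015</a></li>
</ul>
"""

# Capture the href and the link text of each anchor.
LINK_RE = re.compile(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)

records = [
    # " ".join(text.split()) collapses stray runs of whitespace.
    {"url": href, "title": " ".join(text.split())}
    for href, text in LINK_RE.findall(HTML)
]
print(records)
```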

Data serialization/data pipelines

Once you get the cleaned data from the parsing and cleaning module, the data serialization module will be used to serialize the data according to the data models that you require. This is the final module that will output data in a standard format that can be stored in databases, JSON/CSV files or passed to any data warehouses for storage. These tasks are usually performed by libraries listed below

1) pickle (pickle – Python object serialization) – This module implements a fundamental, but powerful algorithm for serializing and de-serializing a Python object structure

2) JSON (JSON encoder and decoder)

3) CSV (https://docs.python.org/2/library/csv.html)

4) Basic database interface libraries like pymongo (Tutorial – PyMongo), mysqldb (on python.org), sqlite3 (sqlite3 – DB-API interface for SQLite databases)

And many more such libraries based on the format and database/data storage.
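As a short sketch of this last module, the example below serializes the same cleaned records to JSON and to CSV with the standard library. The records are invented sample data, and the output goes to in-memory strings rather than files so the example is self-contained.

```python
import csv
import io
import json

# Made-up cleaned records coming out of the parsing module.
records = [
    {"name": "widget", "price": 3.5},
    {"name": "gear", "price": 8.0},
]

# JSON: one call turns the list of dicts into text.
json_text = json.dumps(records, indent=2)

# CSV: DictWriter needs the column order spelled out.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "price"], lineterminator="\n")
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

print(json_text)
print(csv_text)
```

To write files instead, open them with `open(path, "w", newline="")` and pass the file object in place of the `StringIO` buffer.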

Basic spider rules

The rules to follow while building a spider are to be nice to the sites you are scraping and follow the rules in the site’s spider policies outlined in the site’s robots.txt.

Limit the  number of requests in a second and build enough delays in the spiders so that  you don’t adversely affect the site.

It just makes sense to be nice.
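One simple way to build in such delays is a small throttle that every request passes through. This is a sketch only: the `Throttle` class is a made-up helper, and the 0.2 second delay is an arbitrary demo value, far shorter than what most sites would consider polite.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_delay, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to keep min_delay seconds between calls."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_delay=0.2)  # arbitrary demo value
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # a real spider would fetch a page after each wait
elapsed = time.monotonic() - start
print(round(elapsed, 1))
```

The first call returns immediately; each later call waits until the minimum delay has passed since the previous one.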

We will cover more techniques in future articles

Friday, 22 May 2015

Social Media Crawling & Scraping services for Brand Monitoring

Crawling social media sites for extracting information is a fairly new concept – mainly due to the fact that most of the social media networking sites have cropped up in the last decade or so. But it’s equally (if not more) important to grab this ever-expanding User-Generated-Content (UGC) as this is the data that companies are interested in the most – such as product/service reviews, feedback, complaints, brand monitoring, brand analysis, competitor analysis, overall sentiment towards the brand, and so on.

Scraping social networking sites such as Twitter, Linkedin, Google Plus, Instagram etc. is not an easy task for in-house data acquisition departments of most companies as these sites have complex structures and also restrict the amount and frequency of the data that they let out to crawlers. This kind of a task is best left to an expert, such as PromptCloud’s Social Media Data Acquisition Service – which can take care of your end-to-end requirements and provide you with the desired data in a minimal turnaround time. Most of the popular social networking sites such as Twitter and Facebook let crawlers extract data only through their own API (Application Programming Interface), so as to control the amount of information about their users and their activities.

PromptCloud respects all these restrictions with respect to access to content and frequency of hitting their servers to make sure that user information is not compromised and their experience with the site is unhindered.

Social Media Scraping Experts

At PromptCloud, we have developed an expertise in crawling and scraping social media data in real-time. Such data can be from diverse sources such as – Twitter, Linkedin groups, blogs, news, reviews etc. Popular usage of this data is in brand monitoring, trend watching, sentiment/competitor analysis & customer service, among others.

Our low-latency component can extract data on the basis of specific keywords, categories, geographies, or a combination of these. We can also take care of complexities such as multiple languages as well as tweets and profiles of specific users (based on keywords or geographies). Sample XML data can be accessed through this link – demo.promptcloud.com.

Structured data is delivered via a single REST-based API and every time new content is published, the feed gets updated automatically. We also provide data in any other preferred formats (XML, CSV, XLS etc.).

If you have a social media data acquisition problem that you want to get solved, please do get in touch with us.

Monday, 18 May 2015

Scraping Twitter Lists To Boost Social Outreach (+ Free Tool!)

I published a post a few weeks ago describing how to build your own twitter custom audience list, outlining a variety of techniques to build up your list.

This post outlines another method (hat tip to Ade Lewis for the idea) which requires you to scrape Twitter directly.

If you want to skip all the explanations and just want to download the Twitter List Scraper tool, here you go…

Download the Twitter Scraper Tool for Windows or Mac (completely free)

Disclaimer: Scraping Twitter is against their Terms of Service, so if you decide to do this you do it at your own risk.

Some Benchmarks

Building custom audiences on Twitter requires you to identify Twitter usernames that might be interested in your service or product.

In my previous posts, one of the methods I employed was to pull a competitor’s link profile and scrape social accounts from the linking domains.

Once you upload a custom list, Twitter goes through a process of ‘matching’ against profiles in their system, to make sure the user exists and hasn’t opted out of tailored ads.

As our data was scraped from a list of unqualified websites, the data matching wasn’t likely to be perfect.


Since I published that post, I have been experimenting a fair bit with list building, and have built up around 10 custom audience lists. I’ve uploaded a total of 48,857 Twitter usernames using this method, but only 29,260 were matched by Twitter (a match rate of just under 60%).

From some other experiments where I have had better control over the input data, this match rate was between 70-80%.

Since we’ll be scraping Twitter directly, I expect our match rate to be much higher – 90%+

Finding Relevant Twitter Lists

So, we’re going to scrape Twitter, and the first step is to find Twitter lists that will contain users potentially interested in what we have to offer.

As an example, we’ll pretend we’re marketing a music website, and we’ve produced a survey we want to collect responses for.

An advanced Google query can give us lists of music bloggers: site:twitter.com inurl:lists inurl:members inurl:music “music blogger”

Thursday, 14 May 2015

Web Scraping - Data Collection or Illegal Activity?

Web Scraping Defined

We've all heard the term "web scraping" but what is this thing and why should we really care about it?  Web scraping refers to an application that is programmed to simulate human web surfing by accessing websites on behalf of its "user" and collecting large amounts of data that would typically be difficult for the end user to access.  Web scrapers process the unstructured or semi-structured data pages of targeted websites and convert the data into a structured format.  Once the data is in a structured format, the user can extract or manipulate the data with ease.  Web scraping is very similar to web indexing (used by most search engines), but the end motivation is typically much different.  Whereas web indexing is used to help make search engines more efficient, web scraping is typically used for different reasons like change detection, market research, data monitoring, and in some cases, theft.

Why Web Scrape?

There are lots of reasons people (or companies) want to scrape websites, and there are tons of web scraping applications available today.  A quick Internet search will yield numerous web scraping tools written in just about any programming language you prefer.  In today's information-hungry environment, individuals and companies alike are willing to go to great lengths to gather information about all sorts of topics.  Imagine a company that would really like to gather some market research on one of their leading competitors...might they be tempted to invoke a web scraper that gathers all the information for them?  Or, what if someone wanted to find a vulnerable site that allowed otherwise not-so-free downloads?  Or, maybe a less than honest person might want to find a list of account numbers on a site that failed to properly secure them.  The list goes on and on.

I should mention that web scraping is not always a bad thing.  Some websites allow web scraping, but many do not.  It's important to know what a website allows and prohibits before you scrape it.

The Problem With Web Scraping

Web scraping rides a fine line between collecting information and stealing information.  Most websites have a copyright disclosure statement that legally protects their website information.  It's up to the reader/user/scraper to read these disclosure statements and follow along legally and ethically.  In fact, the F5.com website presents the following copyright disclosure:  "All content included on this site, such as text, graphics, logos, button icons, images, audio clips, and software, including the compilation thereof (meaning the collection, arrangement, and assembly), is the property of F5 Networks, Inc., or its content and software suppliers, except as may be stated otherwise, and is protected by U.S. and international copyright laws."  It goes on to say, "We reserve the right to make changes to our site and these disclaimers, terms, and conditions at any time."

So, scraper beware!  There have been many court cases where web scraping turned into felony offenses.  One case involved an online activist who scraped the MIT website and ultimately downloaded millions of academic articles.  This guy is now free on bond, but faces dozens of years in prison and $1 million if convicted.  Another case involves a real estate company that illegally scraped listings and photos from a competitor in an attempt to gain a lead in the market.  Then, there's the case of a regional software company that was convicted of illegally scraping a major database company's websites in order to gain a competitive edge.  The software company had to pay a $20 million fine and the guilty scraper is serving three years' probation.  Finally, there's the case of a medical website that hosted sensitive patient information.  In this case, several patients had posted personal drug listings and other private information on closed forums located on the medical website.  The website was scraped by a media-research firm, and all this information was suddenly public.

While many illegal web scrapers have been caught by the authorities, many more have never been caught and still run loose on websites around the world.  As you can see, it's increasingly important to guard against this activity.  After all, the information on your website belongs to you, and you don't want anyone else taking it without your permission.

The Good News

As we've noted, web scraping is a real problem for many companies today.  The good news is that F5 has web scraping protection built into the Application Security Manager (ASM) of its BIG-IP product family.  As you can see in the screenshot below, the ASM provides web scraping protection against bots, session opening anomalies, session transaction anomalies, and IP address whitelisting.

The bot detection works with clients that accept cookies and process JavaScript.  It measures the client's page consumption speed and declares a client a bot if a certain number of page changes happen within a given time interval.  The session opening anomaly spots web scrapers that do not accept cookies or process JavaScript.  It counts the number of sessions opened during a given time interval and declares the client a scraper if the maximum threshold is exceeded.  The session transaction anomaly detects valid sessions that visit the site much more than other clients.  This defense looks at a bigger picture, and it blocks sessions that exceed a calculated baseline number derived from the current session table.  The IP address whitelist allows known friendly bots and crawlers (e.g. Google, Bing, Yahoo, Ask), and this list can be populated as needed to fit the needs of your organization.
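
To make the "number of page changes within a given time interval" idea concrete, here is a minimal sketch of a sliding-window counter.  It is not the ASM's implementation; the class name, thresholds, and client identifiers are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque

class PageRateDetector:
    """Flag a client once it views more than max_pages within window_seconds."""

    def __init__(self, max_pages, window_seconds):
        self.max_pages = max_pages
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> recent page-view times

    def record(self, client_id, timestamp):
        """Record a page view; return True if the client now looks like a bot."""
        hits = self._hits[client_id]
        hits.append(timestamp)
        # Drop page views that have fallen out of the time window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_pages

detector = PageRateDetector(max_pages=5, window_seconds=10)
# One client loads a page every half second, the other every five seconds.
fast = [detector.record("fast-client", t * 0.5) for t in range(8)]
slow = [detector.record("slow-client", t * 5.0) for t in range(8)]
print(fast)  # the sixth and later page views are flagged
print(slow)  # never flagged
```

The same counting structure could be applied to session openings instead of page views, which is roughly what the session opening anomaly describes.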

I won't go into all the details here because I'll have some future articles that dive into the details of how the ASM protects against these types of web scraping capabilities.  But, suffice it to say, ASM does a great job of protecting your website against the problem of web scraping.

I'm sure as you studied the screenshot above you also noticed lots of other protection capabilities the ASM provides...brute force attack prevention, customized attack signatures, Denial of Service protection, etc.  You might be wondering how it does all that stuff as well.  Give us a little feedback on the topics you would like to see, and we'll start posting some targeted tech tips for you!

Thanks for reading this introductory web scraping article...and, be sure to come back for the deeper look into how the ASM is configured to handle this problem. For more information, check out this video from Peter Silva where he discusses ASM botnet and web scraping defense.

Monday, 4 May 2015

Lawyers & Attorneys Website Data Scraping Services

There are many instances where one ends up needing information about lawyers or bar associations. However, if you approach them directly or look for other ways to get the information, it might be difficult, or you might not get the information you are looking for. Thus, the best way to go about it is scraping lawyer data.

Scraping lawyer data allows you to get information from various attorney websites, bar association websites, or other related websites. Using web scraping tools makes it much easier to gather all the relevant and important information without having to collect it by hand.

If you scrape data about lawyers, you can obtain information such as lawyer names, firm names, addresses, contact details, the lawyers’ history, educational qualifications, the bar associations they are part of, and much more.

Scraping lawyer data can also give you images of the lawyers you are concentrating on. The scraped data can be delivered in whatever format the user wants, such as CSV, Excel, or MySQL. Scraping tools can also ensure that none of the information provided is repetitive or redundant.

If you need information about any lawyer, such as contact details or address, getting it manually could end up being a huge and difficult task. Thus, taking the help of scraping tools ensures that you get all the needed information with far less effort. The presence of lots of attorney websites, and the fact that more and more lawyers are moving to the internet, makes getting information easy with the help of some great tools. Scraping data is a very useful and handy method for getting all the required and relevant information in a very easy-to-read format, which makes the method even more worthwhile.

There are quite a few tools or services that can help you get lawyer data scraped. Most of these services also provide a sample demo free of cost. From the sample, one can decide whether to continue with the service or try another one. Thus, if you want any information from attorney websites or about any lawyers, data scraping is a great way to get it.

Wednesday, 29 April 2015

Web Scraping – An Illegal Activity or Simple Data Collection?

Gone are the days when skillful extraction of information pertaining to real estate such as foreclosures, homes for sale, or mortgage records was considered difficult. Now, it is not only easy to extract data from real estate websites but also to scrape real estate data on a consistent basis to add more value to your portal, or ensure that updated data is available to your visitors at all times. From downloading actual scanned documents in the form of PDF files to scraping websites for deeds or mortgages, smartly designed data extraction tools can do it all.

However, the one question that still comes to the front of the minds of those who scrape real estate listings and other data is whether the act is illegal in nature or a simple way of collecting data.

Take a look.

Web Scraping—What is it?

Generally speaking, web scraping refers to programs that are designed to simulate human internet surfing and access websites on behalf of their users. These tools are effective in collecting large quantities of data that are otherwise difficult for end users to access. They process semi-structured or unstructured data pages of targeted websites and transform available data into a more structured format that can be extracted or manipulated by the user easily.

Although web scraping is quite similar to the web indexing used by search engines, its end motivation is much different. While web indexing makes search engines far more efficient, web scraping is used for reasons like market research, change detection, data monitoring, or in some cases, theft. But then, it is not always a bad thing. You just need to know if a website allows web scraping before proceeding with the act.

Fine Line between Stealing and Collecting Information

Web scraping rides an extremely fine line between the acts of collecting relevant information and stealing the same. The websites that have copyright disclosure statements in place to protect their website information are offended by outsiders raiding their data without due permission. In other words, it amounts to trespassing on their portal, which is unacceptable—both ethically and legally. So, it is very important for you to read all disclosure statements carefully and follow along in the right way. As web scraping cases may turn into felony offenses, it is best to guard against any kind of unscrupulous activity and take permission before scraping data.

The Good News

However, all is not grey in data extraction processes. Reputed agencies are helping their clients scrape valuable data for gaining more value through legal means and carefully used tools. If you are looking for such services, then do get in touch with a reliable web scraping company of your choice and take your business to the next levels of success.

Monday, 27 April 2015

I Don’t Need No Stinking API: Web Scraping For Fun and Profit

If you’ve ever needed to pull data from a third party website, chances are you started by checking to see if they had an official API. But did you know that there’s a source of structured data that virtually every website on the internet supports automatically, by default?

That’s right, we’re talking about pulling our data straight out of HTML, otherwise known as web scraping. Here’s why web scraping is awesome:

Any content that can be viewed on a webpage can be scraped. Period.

If a website provides a way for a visitor’s browser to download content and render that content in a structured way, then almost by definition, that content can be accessed programmatically. In this article, I’ll show you how.

Over the past few years, I’ve scraped dozens of websites — from music blogs and fashion retailers to the USPTO and undocumented JSON endpoints I found by inspecting network traffic in my browser.

There are some tricks that site owners will use to thwart this type of access — which we’ll dive into later — but they almost all have simple work-arounds.

Why You Should Scrape

But first we’ll start with some great reasons why you should consider web scraping first, before you start looking for APIs or RSS feeds or other, more traditional forms of structured data.

Websites are More Important Than APIs

The biggest one is that site owners generally care way more about maintaining their public-facing visitor website than they do about their structured data feeds.

We’ve seen it very publicly with Twitter clamping down on their developer ecosystem, and I’ve seen it multiple times in my projects where APIs change or feeds move without warning.

Sometimes it’s deliberate, but most of the time these sorts of problems happen because no one at the organization really cares or maintains the structured data. If it goes offline or gets horribly mangled, no one really notices.

Whereas if the website goes down or is having issues, that’s more of an in-your-face, drop-everything-until-this-is-fixed kind of problem, and gets dealt with quickly.

No Rate-Limiting

Another thing to think about is that the concept of rate-limiting is virtually non-existent for public websites.

Aside from the occasional captchas on sign up pages, most businesses generally don’t build a lot of defenses against automated access. I’ve scraped a single site for over 4 hours at a time and not seen any issues.

Unless you’re making concurrent requests, you probably won’t be viewed as a DDoS attack; you’ll just show up as a super-avid visitor in the logs, in case anyone’s looking.

Anonymous Access

There are also fewer ways for the website’s administrators to track your behavior, which can be useful if you want to gather data more privately.

With APIs, you often have to register to get a key and then send along that key with every request. But with simple HTTP requests, you’re basically anonymous besides your IP address and cookies, which can be easily spoofed.

The Data’s Already in Your Face

Web scraping is also universally available, as I mentioned earlier. You don’t have to wait for a site to open up an API or even contact anyone at the organization. Just spend some time browsing the site until you find the data you need and figure out some basic access patterns — which we’ll talk about next.

Let’s Get to Scraping

So you’ve decided you want to dive in and start grabbing data like a true hacker. Awesome.

Just like reading API docs, it takes a bit of work up front to figure out how the data is structured and how you can access it. Unlike APIs however, there’s really no documentation so you have to be a little clever about it.

I’ll share some of the tips I’ve learned along the way.

Fetching the Data

So the first thing you’re going to need to do is fetch the data. You’ll need to start by finding your “endpoints” — the URL or URLs that return the data you need.

If you know you need your information organized in a certain way — or only need a specific subset of it — you can browse through the site using their navigation. Pay attention to the URLs and how they change as you click between sections and drill down into sub-sections.

The other option for getting started is to go straight to the site’s search functionality. Try typing in a few different terms and again, pay attention to the URL and how it changes depending on what you search for. You’ll probably see a GET parameter like q= that always changes based on your search term.

Try removing other unnecessary GET parameters from the URL, until you’re left with only the ones you need to load your data. Make sure that there’s always a beginning ? to start the query string and a & between each key/value pair.
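
Here’s a rough sketch of that trimming step in Python, using only the standard library. The URL and parameter names are made up for illustration, and `keep_params` is a hypothetical helper, not part of any library.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def keep_params(url, wanted):
    """Return url with every GET parameter dropped except those in wanted."""
    parts = urlparse(url)
    kept = [(key, value) for key, value in parse_qsl(parts.query) if key in wanted]
    # urlencode rebuilds the query string with & between each key/value pair.
    return urlunparse(parts._replace(query=urlencode(kept)))

messy = "http://example.com/search?q=music+blogs&utm_source=feed&sessionid=abc123&sort=date"
clean = keep_params(messy, {"q", "sort"})
print(clean)  # http://example.com/search?q=music+blogs&sort=date
```

Letting the library rebuild the query string means you don’t have to worry about the leading ? or the & separators yourself.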

Dealing with Pagination

At this point, you should be starting to see the data you want access to, but there’s usually some sort of pagination issue keeping you from seeing all of it at once. Most regular APIs do this as well, to keep single requests from slamming the database.

Usually, clicking to page 2 adds some sort of offset= parameter to the URL, which is usually either the page number or else the number of items displayed on the page. Try changing this to some really high number and see what response you get when you “fall off the end” of the data.

With this information, you can now iterate over every page of results, incrementing the offset parameter as necessary, until you hit that “end of data” condition.
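
The loop looks roughly like the sketch below. To keep it runnable offline, `fetch_page` is a hypothetical stand-in that slices a fake list instead of making an HTTP request; in a real scraper it would request the URL with the given offset and parse the items out of the response. The page size of 10 is also an assumption.

```python
def fetch_page(offset, page_size=10, total_items=25):
    """Hypothetical stand-in for an HTTP request to ...?offset=N."""
    items = [f"item-{i}" for i in range(total_items)]
    return items[offset:offset + page_size]

def scrape_all(fetch, page_size=10):
    """Keep requesting pages until one comes back empty."""
    results = []
    offset = 0
    while True:
        page = fetch(offset)
        if not page:  # we "fell off the end" of the data
            break
        results.extend(page)
        offset += page_size
    return results

all_items = scrape_all(fetch_page)
print(len(all_items))  # 25
```

If the site’s offset parameter is a page number rather than an item count, increment by 1 instead of by the page size.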

The other thing you can try doing is changing the “Display X Per Page” which most pagination UIs now have. Again, look for a new GET parameter to be appended to the URL which indicates how many items are on the page.

Try setting this to some arbitrarily large number to see if the server will return all the information you need in a single request. Sometimes there’ll be some limits enforced server-side that you can’t get around by tampering with this, but it’s still worth a shot since it can cut down on the number of pages you must paginate through to get all the data you need.

AJAX Isn’t That Bad!

Sometimes people see web pages with URL fragments # and AJAX content loading and think a site can’t be scraped. On the contrary! If a site is using AJAX to load the data, that probably makes it even easier to pull the information you need.

The AJAX response is probably coming back in some nicely-structured way (probably JSON!) in order to be rendered on the page with JavaScript.

All you have to do is pull up the network tab in Web Inspector or Firebug and look through the XHR requests for the ones that seem to be pulling in your data.

Once you find it, you can leave the crufty HTML behind and focus instead on this endpoint, which is essentially an undocumented API.
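
Once you have the JSON in hand, pulling out fields takes a couple of lines. In this sketch the response body is a hard-coded string so it runs without a network connection, and the field names ("results", "title", "url") are invented; a real endpoint will have its own structure that you’d discover in the network tab.

```python
import json

# Stand-in for the body of an XHR response copied from the network tab.
raw_response = (
    '{"results": ['
    '{"title": "First post", "url": "/posts/1"}, '
    '{"title": "Second post", "url": "/posts/2"}'
    '], "total": 2}'
)

payload = json.loads(raw_response)
titles = [item["title"] for item in payload["results"]]
print(titles)  # ['First post', 'Second post']
```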

(Un)structured Data?

Now that you’ve figured out how to get the data you need from the server, the somewhat tricky part is getting the data you need out of the page’s markup.

Use CSS Hooks

In my experience, this is usually straightforward since most web designers litter the markup with tons of classes and ids to provide hooks for their CSS.

You can piggyback on these to jump to the parts of the markup that contain the data you need.

Just right click on a section of information you need and pull up the Web Inspector or Firebug to look at it. Zoom up and down through the DOM tree until you find the outermost <div> around the item you want.

This <div> should be the outer wrapper around a single item you want access to. It probably has some class attribute which you can use to easily pull out all of the other wrapper elements on the page. You can then iterate over these just as you would iterate over the items returned by an API response.
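
As a rough illustration, here’s how you might collect every wrapper with a given class and the text inside it. I’m using Python’s built-in html.parser here only so the sketch runs without installing anything; a dedicated parsing library would make this much shorter. The markup and the "listing" class name are made up.

```python
from html.parser import HTMLParser

class ItemExtractor(HTMLParser):
    """Collect the text inside every <div> that carries a given class."""

    def __init__(self, wrapper_class):
        super().__init__()
        self.wrapper_class = wrapper_class
        self.items = []   # one list of text fragments per wrapper
        self._depth = 0   # <div> nesting depth inside the current wrapper

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            self._depth += 1  # nested <div> inside a wrapper
        elif self.wrapper_class in (dict(attrs).get("class") or "").split():
            self._depth = 1   # entering a new wrapper
            self.items.append([])

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth and data.strip():
            self.items[-1].append(data.strip())

page_html = """
<div id="sidebar">Ignore me</div>
<div class="listing featured"><h2>Blue Guitar</h2><span class="price">$120</span></div>
<div class="listing"><h2>Red Drum</h2><div class="meta"><span class="price">$80</span></div></div>
"""
parser = ItemExtractor("listing")
parser.feed(page_html)
print(parser.items)  # [['Blue Guitar', '$120'], ['Red Drum', '$80']]
```

Each entry in `parser.items` corresponds to one wrapper, so you can loop over them the same way you would loop over records in an API response.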

A note here though: the DOM tree that is presented by the inspector isn’t always the same as the DOM tree represented by the HTML sent back by the website. It’s possible that the DOM you see in the inspector has been modified by JavaScript, or sometimes even by the browser if it’s in quirks mode.

Once you find the right node in the DOM tree, you should always view the source of the page (“right click” > “View Source”) to make sure the elements you need are actually showing up in the raw HTML.

This issue has caused me a number of head-scratchers.

Get a Good HTML Parsing Library

It is probably a horrible idea to try parsing the HTML of the page as a long string (although there are times I’ve needed to fall back on that). Spend some time doing research for a good HTML parsing library in your language of choice.

Most of the code I write is in Python, and I love BeautifulSoup for its error handling and super-simple API. I also love its motto:

    You didn’t write that awful page. You’re just trying to get some data out of it. Beautiful Soup is here to help. :)

You’re going to have a bad time if you try to use an XML parser since most websites out there don’t actually validate as properly formed XML (sorry XHTML!) and will give you a ton of errors.

A good library will read in the HTML that you pull in using some HTTP library (hat tip to the Requests library if you’re writing Python) and turn it into an object that you can traverse and iterate over to your heart’s content, similar to a JSON object.

Some Traps To Know About

I should mention that some websites explicitly prohibit the use of automated scraping, so it’s a good idea to read your target site’s Terms of Use to see if you’re going to make anyone upset by scraping.

For two-thirds of the websites I’ve scraped, the above steps are all you need. Just fire off a request to your “endpoint” and parse the returned data.

But sometimes, you’ll find that the response you get when scraping isn’t what you saw when you visited the site yourself.

When In Doubt, Spoof Headers

Some websites require that your User Agent string is set to something they allow, or you need to set certain cookies or other headers in order to get a proper response.

Depending on the HTTP library you’re using to make requests, this is usually pretty straightforward. I just browse the site in my web browser and then grab all of the headers that my browser is automatically sending. Then I put those in a dictionary and send them along with my request.
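
A minimal sketch of that dictionary approach, using the standard library’s urllib so it runs without extra installs. The header values, cookie, and URL below are placeholders I made up; you’d copy the real ones from your own browser’s network tab.

```python
from urllib.request import Request

# Placeholder values: copy the real ones from your browser's network tab.
browser_headers = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/43.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.8",
    "Cookie": "sessionid=PLACEHOLDER",
}

request = Request("http://example.com/search?q=music", headers=browser_headers)
# urllib.request.urlopen(request) would actually send it; that call is left
# out so the sketch runs offline.
print(request.header_items())
```

Most HTTP libraries accept a headers dictionary in much the same way, so the pattern carries over.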

Note that this might mean grabbing some login or other session cookie, which might identify you and make your scraping less anonymous. It’s up to you how serious of a risk that is.

Content Behind A Login

Sometimes you might need to create an account and log in to access the information you need. If you have a good HTTP library that handles logins and automatically sends session cookies (did I mention how awesome Requests is?), then you just need your scraper to log in before it gets to work.
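
The general shape is: create something that remembers cookies, POST the login form through it, then make your data requests through the same object. Here’s a sketch using only the standard library so it runs offline. The login URL and form field names are assumptions (check the site’s actual login form), and with Requests you’d reach for requests.Session instead.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener

def make_session():
    """Build an opener that stores cookies and resends them automatically."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar

def login_request(login_url, username, password):
    """Build the POST request for a (hypothetical) username/password form."""
    form = urlencode({"username": username, "password": password}).encode("utf-8")
    return Request(login_url, data=form)  # supplying data makes it a POST

opener, jar = make_session()
req = login_request("http://example.com/login", "me@example.com", "s3cret")
# opener.open(req) would log in and fill the cookie jar; later calls to
# opener.open(data_url) would then send the session cookie automatically.
# Both are left out here so the sketch runs without a network connection.
print(req.get_method(), req.data)
```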

Note that this obviously makes you totally non-anonymous to the third party website so all of your scraping behavior is probably pretty easy to trace back to you if anyone on their side cared to look.

Rate Limiting

I’ve never actually run into this issue myself, although I did have to plan for it one time. I was using a web service that had a strict rate limit that I knew I’d exceed fairly quickly.

Since the third party service conducted rate-limiting based on IP address (stated in their docs), my solution was to put the code that hit their service into some client-side Javascript, and then send the results back to my server from each of the clients.

This way, the requests would appear to come from thousands of different places, since each client would presumably have their own unique IP address, and none of them would individually be going over the rate limit.

Depending on your application, this could work for you.

Poorly Formed Markup

Sadly, this is the one condition that there really is no cure for. If the markup doesn’t come close to validating, then the site is not only keeping you out, but also serving a degraded browsing experience to all of their visitors.

It’s worth digging into your HTML parsing library to see if there’s any setting for error tolerance. Sometimes this can help.

If not, you can always try falling back on treating the entire HTML document as a long string and do all of your parsing as string splitting or — God forbid — a giant regex.
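
For what it’s worth, the regex fallback can be fairly painless when the thing you’re after has a predictable shape. A small sketch, with deliberately broken markup that I made up (unclosed cells, inconsistent quoting):

```python
import re

# Unclosed <td> tags and inconsistent attribute quoting.
broken_html = '<td class=price>$120<td class=price>$80</tr><td class="price">$45'

# Match the price class with or without quotes, then capture the number.
prices = re.findall(r'class="?price"?>\s*\$(\d+(?:\.\d+)?)', broken_html)
print(prices)  # ['120', '80', '45']
```

This is brittle by nature: it will break as soon as the site changes its class names, so treat it as a last resort.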

Tuesday, 21 April 2015

Data Mining and Predictive Analysis

Data collection and curation are the core foundation of most businesses. Database building thus is an important function and activity where enterprises invest heavily. With information now available on the Internet and easily obtained, it raises the importance of having professionals who crawl data and offer web scraping services.

Once the data is accessed, though, it is important to filter out the relevant data based on the business need. Although many DaaS providers convert unstructured web data into meaningful structured data, it is recommended to be equipped internally to use the data to its maximum.

This understanding has given rise to the field of Data Mining. Data Mining is designed to explore large amounts of data in search of consistent patterns and connections between the variables and validate the findings by applying the detected patterns to the new sets of the data. Once these connections are established and understood, the end goal is to be able to predict the possible outcomes using predictive analysis techniques.

Together, both Data Mining and predictive analysis aid in making marketing campaigns more efficient. While predictive analysis helps simulate and understand what may happen, data mining helps identify exciting data patterns and connections.

The process of Data Mining and Predictive analysis consists of 3 steps


Once a database is compiled, it needs to be cleaned and analysed, and potential connections need to be built. This process involves filtering the relevant data and identifying the possible predictors. Data Exploration also sets a premise for preliminary feature selection to manage the number of variables. This data is then prepared for statistical analysis using a wide variety of graphical and statistical parameters. This helps identify the most relevant variables and sets up the predictive models to be built.

Data mining process


Next comes building various models and choosing the most relevant ones. This decision is based on their possible predictive performance and their ability to produce stable results across all the samples. Simple as it sounds, to truly get the results, all possible models must be tested with data to simulate scenarios. The model with the most stable statistical features is validated.


Once the relevant models are finalised, they are applied to new data to understand and predict the estimated outcomes. Application of data models is an ongoing and complex process since every new dataset needs to be configured in the model.

Data Mining and predictive analysis essentially involve blending traditional statistical methodology with machine learning and complex algorithms. This greatly increases the need for efficient and skilled data handlers. This could include data analysts and scientists.

See how you can become data scientist here:

Data crunchers use data mining and predictive analysis actively to get an edge in big data management. Database platforms like Hadoop assist in database management and large-scale distribution. But the costs involved in setting up data centres and big data management capacity are high. Budgets allocated within the enterprise are more project-focussed and analytics budgets are usually limited. Quite often, big data and analytics projects fail to launch because of this problem! The other problem is that, to run effective predictive models, data needs to be handled by experienced scientists. Finding and putting together a technologically advanced team is a daunting task that most enterprises outside the tech domain face.

Predictive Analysis model

A predictive analysis model essentially predicts all the possible outcomes from a given set of data. Here are a few steps that can be taken to help build and identify the “ideal” predictive analysis model. These steps more or less mirror the usual statistical methodology of building a test model.

Defining an objective

This is the first and a critical step. Unless the objective is identified and defined there can be no concrete results since there wouldn’t be clarity to compare the final outcome to the expected result. It also helps understand the scope of the project.

Preparing the data

This is more to do with data mining. Historic data used for training the model is scattered across multiple platforms and sources. To compound the problem, data can be unstructured with possible duplicate accounts and missing values! Data quality determines the quality of the model, and thus it becomes imperative that data is healthy and relevant.

Data Sampling

Once mined, data is essentially split into 2 parts. One is the training set, used to build the model, and the second is the ‘test’ set, used to verify the accuracy of the final output. This also helps identify and filter the noise component.
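
As a simple illustration of this split, the sketch below shuffles a dataset and holds back a fraction of it for testing. The 80/20 ratio, the fixed seed, and the function name are assumptions made for the example; the right proportions depend on the project.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of records and split it into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(100))  # stand-in for 100 mined data rows
train, test = train_test_split(records)
print(len(train), len(test))  # 80 20
```

Shuffling before splitting matters when the mined data arrives in some order (by date or by source, for instance), since otherwise the test set would not be representative.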

Model Building

Sampling can equally result in a single algorithm or parallel & connected algorithms. In such a case the data goes through multiple tests and a decision is based on the final output.


Once a model gets finalised, the other teams in the organization need to be involved to build a deployable model and understand its impact on the overall business.

The possibilities with Data mining & Predictive analysis are huge. They also give huge room for learning and experimenting. There are several tools available in the industry to aid through all the steps of data mining and predictive analysis. The combination of human expertise and intellect, the available tools, and cooperation across the multiple channels within the organization ensures a stronger grip on the ability to build a solid predictive model.

When used together, predictive analytics and data mining help marketing professionals anticipate and get ready for customer needs, rather than just reacting to them.

Wednesday, 8 April 2015

The Nasty Problem with Scraping Results from the Engines

One theme that I've been concerned with this week centers around data transparency in the search engine world. Search engines provide information that is critical to the business of optimizing and growing a business on the web, yet barriers to this data currently force many companies to use methods of data extraction that violate the search engines' terms of service.

Specifically, we're talking about two pieces of information that no large-scale, successful web operation should be without. These include rankings (the position of their site(s) vs. their competitors) for important keywords and link data (currently provided most accurately through Yahoo!, but also available through MSN and in lower quality formats from Google).

Why do marketers and businesses need this data so badly? First we'll look at rankings:

•    For large sites in particular, rankings across the board will go up or down based on their actions and the actions of their competition. Any serious company who fails to monitor tweaks to their site, public relations, press and optimization tactics in this way will lose out to competitors who do track this data and, thus, can make intelligent business decisions based on it.

•    Rankings provide a benchmark that helps companies estimate their global reach in the search results and make predictions about whether certain areas of extension or growth make logical sense. If a company must decide on how to expand their content or what new keywords to target or even if they can compete in new markets, the business intelligence that can be extracted from large swaths of ranking data is critical.

•    Rankings can be mapped directly to traffic, allowing companies to consider advertising, extending their reach or forming partnerships.

And, on the link data side:

•    Temporal link information allows marketers to see what effects certain link building, public relations and press efforts have on a site's link profile. Although some of this data is available through referring links in analytics programs, many folks are much more interested in the links that search engines know about and count, which often includes many more than those that pass traffic (and also ignores/doesn't count some that do pass traffic).

•    Link data may provide references for reputation management or tracking of viral campaigns - again, items that analytics don't entirely encompass.

•    Competitive link data may be of critical importance to many marketers - this information can't be tracked any other way.

I admit it. SEOmoz is a search engine scraper - we do it for our free public tools, for our internal research and we've even considered doing it for clients (though I'm seriously concerned about charging for data that's obtained outside TOS). Many hundreds of large firms in the search space (including a few that are 10-20X our size) do it, too. Why? Because search engine APIs aren't accurate.

Let's look at each engine's abilities and data sources individually. Since we've got a few hundred thousand points of data (if not more) on each, we're in a good position to make calls about how these systems are working.

Google (all APIs listed here):

•    Search SOAP API - provides ranking results that are massively different from almost every datacenter. The information is often less than useless; it's actually harmful, since you'll get a false sense of what's happening with your positions.

•    AJAX Search API - This is really designed to be integrated with your website, and the results can be of good quality for that purpose, but it really doesn't serve the job of providing good stats reporting.

•    AdSense & AdWords APIs - In all honesty, we haven't played around with these, but the fact that neither will report the correct order of the ads, nor will they show more than 8 ads at a time tells me that if a marketer needed this type of data, the APIs wouldn't work.

Yahoo! (APIs listed here):

•    Search API - Provides ranking information that is a somewhat accurate map to Yahoo!'s actual rankings, but is occasionally so far off-base that it's not reliable. Our data points show a lot more congruity with Yahoo!'s than Google's, but not nearly enough when compared with scraped results to be valuable to marketers and businesses.

•    Site Explorer API - Shows excellent information as far as number of pages indexed on a site and the link data that Yahoo! knows about. We've been comparing this information with that from scraped Yahoo! search results (for queries like linkdomain: and site:) and those at the Site Explorer page and find that there's very little quality difference in the results returned, though the best estimate numbers can still be found through a last page search of results.

•    Search Marketing API - I haven't played with this one at all, so I'd love to hear comments from those who have.


MSN:

•    Doesn't mind scraping as long as you use the RSS results. We do, we love them and we commend MSN for giving them out - bravo! They've also got a web search SDK program, but we've yet to give it a whirl. The only problem is the MSN estimates, which are so far off as to be useless. The links themselves, though, are useful.


Ask:

•    Though it's somewhat hidden, the XML.Teoma.com page allows for scraping of results and Ask doesn't seem to mind, though they haven't explicitly said anything. Again, bravo! - the results look solid, accurate and match up against the Ask.com queries. Now, if Ask would only provide links.
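Results served as RSS or XML are much easier to turn into ranking data than scraped HTML. The sketch below assumes a generic RSS 2.0 feed; the sample feed and the function name `ranked_results` are made up for illustration and do not reproduce the actual format of either engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed in plain RSS 2.0 form.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Search results for: web scraping</title>
    <item><title>First result</title><link>http://one.example/</link></item>
    <item><title>Second result</title><link>http://two.example/</link></item>
  </channel>
</rss>"""


def ranked_results(feed_xml):
    """Return (position, title, link) for each item, in feed order."""
    root = ET.fromstring(feed_xml)
    results = []
    for position, item in enumerate(root.iter("item"), start=1):
        results.append((position, item.findtext("title"), item.findtext("link")))
    return results


results = ranked_results(SAMPLE_FEED)
```

The position of each item in the feed stands in for its ranking, which is only meaningful if the feed lists results in the same order as the search page.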

I know a lot of you are probably asking:

•    "Rand, if scraping is working, why do you care about the search engines fixing the APIs?"

•    The straight answer is that scraping hurts the search engines, hurts their users and isn't the most practical way to get the data. Let me give you some examples:

•    Scraped queries have to look as much like real users as possible to avoid detection and banning - thus, they affect the query data that search engineers use to improve web search.

•    These queries also hit advertisers - falsifying the number of "real" impressions that advertisers see and lowering their CTRs unnaturally.

•    They take up search engine resources and though even the heaviest scraping barely impacts their server loads, it's still an annoyance.

•    With all these negative elements, and so many positive incentives to have the data, it's clear what's needed - a way for marketers/businesses to get the data they need without hurting the search engines. Here's how they can do it:

•    Provide the search ranking position of a site in the referral string - this works for ranking data, but not for link data and since Yahoo! (and Google) both send referrals through re-directs at times, it wouldn't be a hard piece to add.

•    Make the APIs accurate, complete and unlimited.

•    If the last option is too ambitious, the search engines could charge for API queries - anyone who needs the data would be more than happy to pay for it. This might help with quality control, too.

•    For link data - serve up accurate, holistic data in programs like Google Sitemaps and Yahoo! Search Submit (or even Google Analytics). Obviously, you'd only get information about your own site after verifying.
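If the engines did put the ranking position in the referral string, as the first suggestion above proposes, reading it would take only a few lines. This is a sketch under a stated assumption: the query parameter name `rank` is purely hypothetical, since no engine is described here as sending one.

```python
from urllib.parse import urlparse, parse_qs


def rank_from_referrer(referrer, param="rank"):
    """Return the ranking position carried in a referrer URL, or None.

    The parameter name is an assumption made for this example.
    """
    query = parse_qs(urlparse(referrer).query)
    values = query.get(param)
    if not values:
        return None
    try:
        return int(values[0])
    except ValueError:
        # The parameter was present but did not hold a number.
        return None


position = rank_from_referrer("http://search.example/results?q=web+scraping&rank=7")
```

An analytics program could log this value with each visit and chart positions over time without sending a single automated query.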

I've talked to lots of people at the search engine level about making changes this week (including Jeremy, Priyank, Matt, Adam, Aaron, Brett and more). I can only hope for the best...

Monday, 6 April 2015

How to Generate Sales Leads Using Web Scraping Services

The first stage of any selling process is what is popularly known as “lead generation”. This phase is what most businesses place at the apex of their sales concerns. It is a driving force that governs decision-making at its highest levels, and influences business strategy and planning. If you are about to embark on an outbound sales campaign and are in the process of looking for leads, you would acknowledge the fact that the lead generation process is of extreme importance for any business.

Different lead generation techniques have been used over and over again by companies around the world to satiate this growing business need. Newer, more innovative methods have also emerged to help marketers in this process. One such method of lead generation that is fast catching on, and is poised to play a big role for businesses in the coming years, is web scraping. With web scraping, you can easily get access to multiple relevant and highly customized leads – a perfect starting point for any marketing, promotional or sales campaign.

The prominence of Web Scraping in overall marketing strategy

At present, levels of competition have risen sky high for most businesses. For success, generating leads and gaining insight into customer behavior and preferences are essential business requirements. Web scraping is the process of scraping or mining the internet for information. Different tools and techniques can be used to harvest information from multiple internet sources based on relevance, and the information is then structured and organized in a way that makes sense to your business. Companies that provide web scraping services essentially use web scrapers to generate a targeted lead database that your company can then integrate into its marketing and sales strategies and plans.

The actual process of web scraping involves creating scraping scripts or algorithms which crawl the web for information based on certain preset parameters and options. The scraping process can be customized and tuned towards finding the kind of data that your business needs. The script can extract data from websites automatically, collate and put together a meaningful collection of leads for business development.
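As a rough illustration of a scraping script driven by preset parameters, the sketch below pulls the text out of every element that matches a chosen tag name and class. It uses only Python's standard `html.parser` module; the class name `TagTextExtractor`, the helper `extract` and the sample page are hypothetical, and a real script would also need to download the pages.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text inside every element matching a preset tag and class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self._depth = 0    # greater than 0 while inside a matching element
        self._buffer = []
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Already inside a match: track nested tags of the same name.
            if tag == self.tag:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth and tag == self.tag:
            self._depth -= 1
            if self._depth == 0:
                # Normalise whitespace before storing the collected text.
                self.matches.append(" ".join("".join(self._buffer).split()))

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract(html, tag, css_class):
    parser = TagTextExtractor(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.matches


# Hypothetical page fragment.
PAGE = """
<ul>
  <li class="company">Acme Ltd <b>(London)</b></li>
  <li class="advert">Buy now!</li>
  <li class="company featured">Globex Corp</li>
</ul>
"""

companies = extract(PAGE, "li", "company")
```

Changing the two parameters is what tunes the script towards the kind of data a business needs; the parsing code itself stays the same.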

Lead Generation Basics

At a very high level, any person who has the resources and the intent to purchase your product or service qualifies as a lead. In the present scenario, you need to go far deeper than that. Marketers need to observe behavior patterns and purchasing trends to ensure that a particular person qualifies as a lead. If you have a group of people you are targeting, you need to decide who the viable leads will be, acquire their contact information and store it in a database for further action.

List buying used to be a popular way to get leads, but its efficacy has dwindled over time. Web scraping is fast coming up as a feasible lead generation technique, allowing you to find highly focused and targeted leads in short amounts of time. All you need is a service provider that would carry out the data mining necessary for lead generation, and you end up with a list of actionable leads that you can try selling to.

How Web Scraping makes a substantial difference

With web scraping, you can extract valuable predictive information from websites. Web scraping facilitates high quality data collection and allows you to structure marketing and sales campaigns better. To drive sales and maximize revenue, you need strong, viable leads. To facilitate this, you need critical data which encompasses customer behavior, contact details, buying patterns and trends, willingness and ability to spend resources, and a myriad of other aspects critical to ascertaining the potential of an entity as a rewarding lead. Data mining through web scraping can be a great way to get to these factors and identify the leads that would make a difference for your business.


Crawling through many different web locales using different techniques, web scraping services pick up a wealth of information. This highly relevant and specialized information instantly provides your business with actionable leads. Furthermore, this exercise allows you to fine-tune your data management processes, make more accurate and reliable predictions and projections, arrive at more effective, strategic and marketing decisions and customize your workflow and business development to better suit the current market.

The Process and the Tools

Lead generation, being one of the most important processes for any business, can prove to be an expensive proposition if not handled strategically. Companies spend large amounts of their resources acquiring viable leads they can sell to. With web scraping, you can dramatically cut down the costs involved in lead generation and take your business forward with speed and efficiency. Here are some of the time-tested web scraping tools which can come in handy for lead generation –

•    Website download software – Used to copy entire websites to local storage. All website pages are downloaded and the hierarchy of navigation and internal links is preserved. The stored pages can then be viewed and scoured for information at any later time.

•    Web scraper – Tools that crawl through bulk information on the internet, extracting specific, relevant data using a set of pre-defined parameters.

•    Data grabber – Sifts through websites and databases fast and extracts all the information, which can be sorted and classified later.

•    Text extractor – Can be used to scrape multiple websites or locations for acquiring text content from websites and web documents. It can mine data from a variety of text file formats and platforms.

With these tools, web scraping services scrape websites for lead generation and provide your business with a set of strong, actionable leads that can make a difference.

Covering all Bases

The strength of web scraping and web crawling lies in the fact that it covers all the necessary bases when it comes to lead generation. Data is harvested, structured, categorized and organized in such a way that businesses can easily use the data provided for their sales leads. As discussed earlier, cold and detached lists no longer provide you with enough actionable leads. You need to look at various factors and consider them during your lead generation efforts –

•    Contact details of the prospect

•    Purchasing power and purchasing history of the prospect

•    Past purchasing trends, willingness to purchase and history of buying preferences of the prospect

•    Social markers that are indicative of behavioral patterns

•    Commercial and business markers that are indicative of behavioral patterns

•    Transactional details

•    Other factors including age, gender, demography, social circles, language and interests

All these factors need to be considered in detail if you are to determine whether a lead is viable and actionable. With web scraping you can get enough data about every single prospect, connect all the data collected with the help of onboarding, and ascertain with conviction whether a particular prospect will be viable for your business.

Let us take a look at how web scraping addresses these different factors –

1. Scraping websites

During the scraping process, all websites where a particular prospect has some participation are crawled for data. Seemingly disjointed data can be made into a sensible unit by the use of onboarding, that is, linking user activities with their online identities with the help of user IDs. Documents can be scanned for participation. E-commerce portals can be scanned to find comments and ratings a prospect might have given to certain products. Service providers’ websites can be scraped to find out whether the prospect has given a testimonial for any particular service. All these details can then be accumulated into a meaningful data collection that is indicative of the purchasing power and intent of the prospect, along with important data about buying preferences and tastes.
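The linking step can be pictured as grouping scraped records by a shared user ID. The sketch below assumes each record already carries such an ID; the field names, the function `build_profiles` and the sample records are all hypothetical.

```python
from collections import defaultdict


def build_profiles(records):
    """Group scraped activity records by the user ID they share."""
    profiles = defaultdict(list)
    for record in records:
        profiles[record["user_id"]].append((record["source"], record["activity"]))
    return dict(profiles)


# Hypothetical records gathered from different sites.
scraped = [
    {"user_id": "u42", "source": "shop.example", "activity": "rated a camera 5 stars"},
    {"user_id": "u17", "source": "forum.example", "activity": "asked about laptops"},
    {"user_id": "u42", "source": "service.example", "activity": "left a testimonial"},
]

profiles = build_profiles(scraped)
```

In practice the hard part is establishing that two accounts belong to the same person, which this sketch takes as given.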

2. Social scraping

According to a study, most internet users spend upwards of two hours every day on social networks. Therefore, scraping social networks is a great way to explore prospects in detail. Initially, you can get important identification markers like names, addresses, contact numbers and email addresses. Further, social networks can also supply information about age, gender, demography and language choices. From this basic starting point, further details can be added by scraping social activity over long periods of time and looking for activities which indicate purchasing preferences, trends and interests. This exercise provides highly relevant and targeted information about prospects that can be constructively used while designing sales campaigns.

Check out How to use Twitter data for your business

3. Transaction scraping

Through the scraping of transactions, you get a clear idea about the purchasing power of prospects. If you are looking for certain income groups or leads that invest in certain market sectors or during certain specific periods of time, transaction scraping is the best way to harvest meaningful information. This also helps you with competition analysis and provides you with pointers to fine-tune your marketing and sales strategies.


Using these varied lead generation techniques and finding the right balance and combination is key to securing the right leads for your business. Overall, signing up for web scraping services can be a make or break factor for your business going forward. With a steady supply of valuable leads, you can supercharge your sales, maximize returns and craft the perfect marketing maneuvers to take your business to an altogether new dimension.

Sunday, 29 March 2015

Scraping expert's Amazon Scraper provides huge access to find your desired product on Amazon

Today, with the latest advances in technology, plenty of ecommerce websites offer people products from many categories at an affordable cost. For Amazon, one of the most renowned ecommerce websites, Scraping Expert has launched its Amazon Scraper for the convenience of its customers. The Amazon Scraper is a web scraping tool; web scraping, also called web harvesting, is a software technique for extracting data from websites.

Today anyone can find web scraping tools that are designed for particular websites. For example, Amazon Scraper is a web scraper tool used to crawl, scrape or extract data from Amazon.com, the largest ecommerce website. Scrapingexpert.com offers an Amazon scraper for easily extracting data on plenty of products from the website.


Let us see how the Amazon Scraper works:

How to use:

Step 1) Select the category and enter the keyword, UPC or ASIN

Step 2) Set the delay in seconds

Step 3) Click Start

You can also scrape the following details from Amazon.com:

  •     Product Title & Description
  •     Category & Cost
  •     Manufacturer, QTY
  •     Seller Name, Total Sellers
  •     Shipping Cost, Shipping / Product Weight
  •     ImageURL, IsBuyBoxFBA, Source Link
  •     Stars, Customer Reviews
  •     ASIN, UPC, Model Number
  •     Sales Rank, Sales Rank In Category

Here are some interesting Product Features:

  •     Single-screen dashboard that shows total extracted records, extracted keywords and elapsed time
  •     Filter Search - skip data that does not match phrases or keywords
  •     Compatible with Microsoft Windows XP/Vista/7
  •     Option to set a delay between requests to simulate a human surfing in a browser
  •     Extracted data is stored in CSV format, which you can easily open in Excel

Benefits:

  •     Less Expensive - Our services save you both effort and money. Some of our competitors even outsource their scraping projects to us.
  •     Guaranteed Accurate Results - We aim to deliver reliable, accurate results that would be impractical to collect by hand.
  •     Delivers Fast Results - We promise to get your work done in just a few hours, where doing it another way could take far longer. We save you time, workforce and money and give you an edge over your competitors.

System Requirements:

  •     Operating System - Windows XP, Windows Vista, Windows 7
  •     .NET Framework 2.0
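Three of the features listed above (the delay between requests, the phrase filter and the CSV output) are easy to picture in code. The sketch below is not the tool's implementation, which is not shown here; it is a minimal illustration in which `fake_fetch` stands in for a real page request and every name and record is hypothetical.

```python
import csv
import io
import time


def matches_filter(record, phrases):
    """Keep a record only if its title contains one of the phrases."""
    title = record["title"].lower()
    return any(phrase.lower() in title for phrase in phrases)


def collect_to_csv(fetch, keywords, phrases, delay_seconds, out):
    """Fetch records per keyword, filter them and write them as CSV rows."""
    writer = csv.DictWriter(out, fieldnames=["keyword", "title", "price"])
    writer.writeheader()
    written = 0
    for index, keyword in enumerate(keywords):
        if index:
            time.sleep(delay_seconds)  # pause between consecutive requests
        for record in fetch(keyword):
            if matches_filter(record, phrases):
                writer.writerow({"keyword": keyword, **record})
                written += 1
    return written


def fake_fetch(keyword):
    """Stand-in for a real page fetch; returns canned records."""
    return [
        {"title": f"{keyword} tripod", "price": "19.99"},
        {"title": f"{keyword} lens cap", "price": "4.50"},
    ]


buffer = io.StringIO()
count = collect_to_csv(fake_fetch, ["camera", "phone"], ["tripod"], 0.01, buffer)
```

Passing an open file instead of the in-memory `buffer` would produce a CSV file that opens directly in a spreadsheet.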

Are you searching for a cost-effective program to extract data? If so, we offer Amazon screen scraping as a method of data extraction. In this competitive market many companies claim to offer the best Amazon screen scraping services, and hiring such a service can allow you to scrape almost any data in any format you wish to obtain. At Scrapingexpert.com we study every detail of the scraping project and then provide you with a free quote and a completion date.

In order to get accurate data pertaining to a specific product, you can use our Amazon Scraper tool. It is a very effective tool that will help you extract information about any product from Amazon.

Websitedatascraping.com offers web data scraping, website data scraping, web scraping services, website scraping services, data scraping services, product information scraping and yellowpages data scraping.

Thursday, 26 March 2015

Web Data Extraction

The Internet as we know it today is a repository of information that can be accessed across geographical boundaries. In just over two decades, the Web has moved from a university curiosity to a fundamental research, marketing and communications vehicle that impinges upon the everyday life of most people all over the world. It is accessed by over 16% of the population of the world, spanning over 233 countries.

As the amount of information on the Web grows, that information becomes ever harder to keep track of and use. Compounding the matter is that this information is spread over billions of Web pages, each with its own independent structure and format. So how do you find the information you're looking for in a useful format - and do it quickly and easily without breaking the bank?

Search Isn't Enough

Search engines are a big help, but they can do only part of the work, and they are hard-pressed to keep up with daily changes. For all the power of Google and its kin, all that search engines can do is locate information and point to it. They go only two or three levels deep into a Web site to find information and then return URLs. Search engines cannot retrieve information from the deep web, that is, information that is available only after filling in some sort of registration form and logging in, nor can they store it in a desirable format. In order to save the information in a desirable format or a particular application, after using the search engine to locate data, you still have to do the following tasks to capture the information you need:

• Scan the content until you find the information.

• Mark the information (usually by highlighting with a mouse).

• Switch to another application (such as a spreadsheet, database or word processor).

• Paste the information into that application.

It's not all copy and paste

Consider the scenario of a company that is looking to build up an email marketing list of over 100,000 names and email addresses from a public group. Even if a person manages to copy and paste each name and email in 1 second, the job will take about 28 man-hours, translating to over $500 in wages alone, not to mention the other costs associated with it. The time involved in copying a record is directly proportional to the number of fields of data that have to be copied and pasted.
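The arithmetic behind that estimate is easy to check. The sketch below takes the 100,000 records and 1 second per record from the scenario; the hourly wage of $18 is an assumption chosen to be consistent with the $500 figure and is not stated in the scenario.

```python
def copy_paste_cost(records, seconds_per_record, hourly_wage):
    """Return (hours of work, wage cost) for copying records by hand."""
    hours = records * seconds_per_record / 3600
    return hours, hours * hourly_wage


# hourly_wage is an assumed value, not a figure from the scenario.
hours, wages = copy_paste_cost(records=100_000, seconds_per_record=1, hourly_wage=18.0)
```

Doubling the number of fields per record doubles `seconds_per_record`, and with it both the hours and the wage bill.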

Is there an alternative to copy-paste?

A better solution, especially for companies that are aiming to exploit a broad swath of data about markets or competitors available on the Internet, lies in the use of custom Web harvesting software and tools.

Web harvesting software automatically extracts information from the Web and picks up where search engines leave off, doing the work the search engine can't. Extraction tools automate the reading, copying and pasting necessary to collect information for further use. The software mimics human interaction with the website and gathers data as if the website were being browsed. Web harvesting software navigates the website only to locate, filter and copy the required data, at much higher speeds than is humanly possible. Advanced software is even able to browse the website and gather data silently without leaving footprints of access.
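The locate, filter and copy steps can be sketched in a few lines for the simplest case: names and addresses published as `mailto:` links. This is an illustration only; the regular expression, the function name `harvest_contacts` and the sample page are hypothetical, and a pattern like this is brittle against real-world HTML.

```python
import re

# Matches <a href="mailto:ADDRESS">NAME</a>; assumes this exact attribute layout.
EMAIL_LINK = re.compile(r'<a\s+href="mailto:([^"]+)"\s*>([^<]+)</a>', re.IGNORECASE)


def harvest_contacts(html):
    """Return unique (name, email) pairs in the order they appear."""
    seen = set()
    contacts = []
    for email, name in EMAIL_LINK.findall(html):
        email = email.strip().lower()
        if email not in seen:  # filter out duplicates
            seen.add(email)
            contacts.append((name.strip(), email))
    return contacts


# Hypothetical page fragment.
PAGE = """
<p>Members: <a href="mailto:ann@example.org">Ann Lee</a>,
<a href="mailto:Bob@Example.org">Bob Ray</a> and
<a href="mailto:ann@example.org">Ann Lee</a>.</p>
"""

contacts = harvest_contacts(PAGE)
```

Each tuple corresponds to one row that would otherwise have been marked, copied and pasted by hand.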

The next article of this series will give more details about how such software works and uncover some myths about web harvesting.

Monday, 23 March 2015

Predictive Analytics and Web Scraping

The integration of web scraping and predictive analytics can be used to make the marketing process more efficient. This is possible through a number of techniques such as business intelligence. The main aim of any business is to make a profit, and in this article we look at the roles of web scraping and predictive analytics in marketing your products. Integrating the two processes is quite beneficial for business. Web scraping plays the role of harvesting data, and predictive analytics that of determining the best methods to use in marketing campaigns.

Business intelligence may be regarded as a decision support system where data is harvested for the purposes of predictive analysis. It can also be used for supporting business decisions. For years, business intelligence data was gathered manually. The emergence of the internet has made it possible to gather a lot of data for the purposes of business intelligence. Collecting information from various sources or departments of a company, such as finance, sales and purchasing, consumed a lot of time before such information could be correlated into any meaningful application.

Web scraping plays an important role in collecting data to be used in business intelligence, because the normal web scraping process involves data harvesting, selection and even pre-processing. Web scraping makes business intelligence a reality and a dynamic process, since the business intelligence data needed can be accessed from the internet by scraping. There is absolutely no reason why managers ought to wait a number of months to get data for decision making when they can use specialized companies in the data mining sector such as Loginworks Softwares. These companies have spent a number of years providing these services and have professional staff dedicated to them.

There is a great need for businesses to engage in predictive analytics. Predictive analytics can be defined as a method of using business intelligence for modeling and forecasting. It is a method of predicting patterns and has wide applications in the credit, medical and insurance industries. The most common application of the integration between web scraping and predictive analytics is credit assessment. The use of past events in estimating the future of a business and its markets is an integral part of any business.
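One of the simplest ways to use past events to estimate the future is to fit a straight line through past figures and extend it one step. The sketch below uses ordinary least squares in plain Python; it is only one basic forecasting technique among many, and the monthly order counts are made-up numbers.

```python
def fit_line(ys):
    """Least-squares slope and intercept for points (0, ys[0]), (1, ys[1]), ...

    Needs at least two observations.
    """
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast_next(ys):
    """Extend the fitted line one period past the last observation."""
    slope, intercept = fit_line(ys)
    return slope * len(ys) + intercept


# Hypothetical order counts for four past months.
monthly_orders = [100, 110, 120, 130]
next_month = forecast_next(monthly_orders)
```

Scraped data would supply the past figures; the quality of the forecast depends far more on that data than on the fitting method.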

Web scraping aids the predictive analysis process by providing past data that can be analyzed to predict customer behavior, such as which customers are likely to purchase, renew or buy similar products. Predictive analysis and web scraping are very important for any business's marketing campaigns. Since marketing is an investment by a company, it is necessary for businesses to employ web scraping to get the appropriate data for making business decisions. Predictive analysis narrows your target market and enables you to tailor your campaigns to specific customers. This enables marketing teams to come up with a number of advertisements which may be based on your traffic.

Since web scraping is an integral part of predictive analysis, it is important for a company to invest in the process. Companies need to contact customers who are likely to respond positively. Marketing methods will only become efficient if a company is able to target goods and services that are required by customers at the required time. Predictive analytics plays an important role in reducing the amount of investment needed to make a sale.

Business intelligence plays an important role in helping marketing teams prepare for and anticipate customer needs, rather than reacting to them. Web scraping can present data based on demographics that may have been overlooked in the past. Any combination of customer demographics is useful in determining which platform to use in marketing, which marketing method to apply, and when.

The combination of web scraping and predictive analytics can help managers bring in more sales while spending less. Maximizing profits and minimizing losses is one of the goals of a business. Therefore, whether a business operates online or offline, it is important for it to engage in web scraping and predictive analytics.

