Wyoming Web Scraping

Tuesday, 30 June 2015

Data Scraping - Hand Scraped Hardwood Flooring Gives Your Home That Exclusive Look

Today hand scraped hardwood flooring is becoming extremely popular in the more opulent homes as well as in some commercial properties. Although this type of flooring has only recently become fashionable it has been around for many centuries.

Certainly before the invention of modern sanding techniques all floors where hand scraped at the location where they were to be installed to ensure that the floor would be flat and even. However today this method is used instead to provide texture, richness as well as a unique look and feel to the flooring.

Although manufacturers have produced machines which can provide a scraped look to their flooring it looks cheap compared to the real thing. Unfortunately the main problem with using a machine to scrape the flooring is that it provides a uniform look to the pattern of the wood. Because of this it lacks the natural feel that you would see with a floor which has been scraped by hand.

When done by hand, scraping creates a truly unique look to the floor. However the actual look and feel of each floor will vary as it depends on the skills of the person actually carrying out the work. If there is no control in place whilst the work is being carried out this can result in disastrous look to the finished product.

Many manufacturers who actually provide hand scraped hardwood flooring will either just dent, scoop or rough the floor up. But others will use sanding techniques in order to create a worn and uneven look to the flooring. The more professional teams will scrape the entire surface of the wood in order to create the unique hand made look for their customers.

Many companies will allow their customers to choose what type of scraping takes place on their wood. They can choose between light, medium and heavy. The companies who are really good at hand scraping will be able give the hardwood floor a reclaimed look by including wormholes, splits and other naturally-occurring features within the wood.

If you do decide to choose hand scraped hardwood flooring you will need to factor the costs that are associated with it into your budget. Unfortunately this type of flooring does not come cheap and you can find yourself paying upwards of $15 per sq ft. But once it is installed it will give a room a unique and warm rich feel to it and is certainly going to wow your friends and family when they see it for the first time.

Source: http://ezinearticles.com/?Hand-Scraped-Hardwood-Flooring-Gives-Your-Home-That-Exclusive-Look&id=572577

Tuesday, 23 June 2015

Making data on the web useful: scraping

Introduction

Many times data is not easily accessible – although it does exist. As much as we wish everything was available in CSV or the format of our choice – most data is published in different forms on the web. What if you want to use the data to combine it with other datasets and explore it independently?

Scraping to the rescue!

Scraping describes the method to extract data hidden in documents – such as Web Pages and PDFs and make it useable for further processing. It is among the most useful skills if you set out to investigate data – and most of the time it’s not especially challenging. For the most simple ways of scraping you don’t even need to know how to write code.

This example relies heavily on Google Chrome for the first part. Some things work well with other browsers, however we will be using one specific browser extension only available on Chrome. If you can’t install Chrome, don’t worry the principles remain similar.

Code-free Scraping in 5 minutes using Google Spreadsheets & Google Chrome

Knowing the structure of a website is the first step towards extracting and using the data. Let’s get our data into a spreadsheet – so we can use it further. An easy way to do this is provided by a special formula in Google Spreadsheets.

Save yourselves hours of time in copy-paste agony with the ImportHTML command in Google Spreadsheets. It really is magic!

Recipes

In order to complete the next challenge, take a look in the Handbook at one of the following recipes:

    Extracting data from HTML tables.

    Scraping using the Scraper Extension for Chrome

Both methods are useful for:

    Extracting individual lists or tables from single webpages

The latter can do slightly more complex tasks, such as extracting nested information. Take a look at the recipe for more details.

Neither will work for:

    Extracting data spread across multiple webpages

Challenge

Task: Find a website with a table and scrape the information from it. Share your result on datahub.io (make sure to tag your dataset with schoolofdata.org)

Tip

Once you’ve got your table into the spreadsheet, you may want to move it around, or put it in another sheet. Right click the top left cell and select “paste special” – “paste values only”.

Scraping more than one webpage: Scraperwiki

Note: Before proceeding into full scraping mode, it’s helpful to understand the flesh and bones of what makes up a webpage. Read the Introduction to HTML recipe in the handbook.

Until now we’ve only scraped data from a single webpage. What if there are more? Or you want to scrape complex databases? You’ll need to learn how to program – at least a bit.

It’s beyond the scope of this course to teach how to scrape, our aim here is to help you understand whether it is worth investing your time to learn, and to point you at some useful resources to help you on your way!

Structure of a scraper

Scrapers are comprised of three core parts:

1.    A queue of pages to scrape
2.    An area for structured data to be stored, such as a database
3.    A downloader and parser that adds URLs to the queue and/or structured information to the database.

Fortunately for you there is a good website for programming scrapers: ScraperWiki.com

ScraperWiki has two main functions: You can write scrapers – which are optionally run regularly and the data is available to everyone visiting – or you can request them to write scrapers for you. The latter costs some money – however it helps to contact the Scraperwiki community (Google Group) someone might get excited about your project and help you!.

If you are interested in writing scrapers with Scraperwiki, check out this sample scraper – scraping some data about Parliament. Click View source to see the details. Also check out the Scraperwiki documentation: https://scraperwiki.com/docs/python/

When should I make the investment to learn how to scrape?

A few reasons (non-exhaustive list!):

1.    If you regularly have to extract data where there are numerous tables in one page.

2.    If your information is spread across numerous pages.

3.    If you want to run the scraper regularly (e.g. if information is released every week or month).

4.    If you want things like email alerts if information on a particular webpage changes.

…And you don’t want to pay someone else to do it for you!

Summary:

In this course we’ve covered Web scraping and how to extract data from websites. The main function of scraping is to convert data that is semi-structured into structured data and make it easily useable for further processing. While this is a relatively simple task with a bit of programming – for single webpages it is also feasible without any programming at all. We’ve introduced =importHTML and the Scraper extension for your scraping needs.

Further Reading

1.    Scraping for Journalism: A Guide for Collecting Data: ProPublica Guides

2.    Scraping for Journalists (ebook): Paul Bradshaw

3.    Scrape the Web: Strategies for programming websites that don’t expect it : Talk from PyCon

4.    An Introduction to Compassionate Screen Scraping: Will Larson

Any questions? Got stuck? Ask School of Data!

ScraperWiki has two main functions: You can write scrapers – which are optionally run regularly and the data is available to everyone visiting – or you can request them to write scrapers for you. The latter costs some money – however it helps to contact the Scraperwiki community (Google Group) someone might get excited about your project and help you!.

If you are interested in writing scrapers with Scraperwiki, check out this sample scraper – scraping some data about Parliament. Click View source to see the details. Also check out the Scraperwiki documentation: https://scraperwiki.com/docs/python/

When should I make the investment to learn how to scrape?

A few reasons (non-exhaustive list!):

1.    If you regularly have to extract data where there are numerous tables in one page.

2.    If your information is spread across numerous pages.

3.    If you want to run the scraper regularly (e.g. if information is released every week or month).

4.    If you want things like email alerts if information on a particular webpage changes.

…And you don’t want to pay someone else to do it for you!

Summary:

In this course we’ve covered Web scraping and how to extract data from websites. The main function of scraping is to convert data that is semi-structured into structured data and make it easily useable for further processing. While this is a relatively simple task with a bit of programming – for single webpages it is also feasible without any programming at all. We’ve introduced =importHTML and the Scraper extension for your scraping needs.

Source: http://schoolofdata.org/handbook/courses/scraping/

Thursday, 18 June 2015

Scraping Services - Assuring Scraping Success with Proxy Data Scraping

Have you ever heard of "Data Scraping?" Data Scraping is the process of collecting useful data that has been placed in the public domain of the internet (private areas too if conditions are met) and storing it in databases or spreadsheets for later use in various applications. Data Scraping technology is not new and many a successful businessman has made his fortune by taking advantage of data scraping technology.

Sometimes website owners may not derive much pleasure from automated harvesting of their data. Webmasters have learned to disallow web scrapers access to their websites by using tools or methods that block certain ip addresses from retrieving website content. Data scrapers are left with the choice to either target a different website, or to move the harvesting script from computer to computer using a different IP address each time and extract as much data as possible until all of the scraper's computers are eventually blocked.

Thankfully there is a modern solution to this problem. Proxy Data Scraping technology solves the problem by using proxy IP addresses. Every time your data scraping program executes an extraction from a website, the website thinks it is coming from a different IP address. To the website owner, proxy data scraping simply looks like a short period of increased traffic from all around the world. They have very limited and tedious ways of blocking such a script but more importantly -- most of the time, they simply won't know they are being scraped.

You may now be asking yourself, "Where can I get Proxy Data Scraping Technology for my project?" The "do-it-yourself" solution is, rather unfortunately, not simple at all. Setting up a proxy data scraping network takes a lot of time and requires that you either own a bunch of IP addresses and suitable servers to be used as proxies, not to mention the IT guru you need to get everything configured properly. You could consider renting proxy servers from select hosting providers, but that option tends to be quite pricey but arguably better than the alternative: dangerous and unreliable (but free) public proxy servers.

There are literally thousands of free proxy servers located around the globe that are simple enough to use. The trick however is finding them. Many sites list hundreds of servers, but locating one that is working, open, and supports the type of protocols you need can be a lesson in persistence, trial, and error. However if you do succeed in discovering a pool of working public proxies, there are still inherent dangers of using them. First off, you don't know who the server belongs to or what activities are going on elsewhere on the server. Sending sensitive requests or data through a public proxy is a bad idea. It is fairly easy for a proxy server to capture any information you send through it or that it sends back to you. If you choose the public proxy method, make sure you never send any transaction through that might compromise you or anyone else in case disreputable people are made aware of the data.

A less risky scenario for proxy data scraping is to rent a rotating proxy connection that cycles through a large number of private IP addresses. There are several of these companies available that claim to delete all web traffic logs which allows you to anonymously harvest the web with minimal threat of reprisal. Companies such as offer large scale anonymous proxy solutions, but often carry a fairly hefty setup fee to get you going.

The other advantage is that companies who own such networks can often help you design and implementation of a custom proxy data scraping program instead of trying to work with a generic scraping bot. After performing a simple Google search, I quickly found one company (www.ScrapeGoat.com) that provides anonymous proxy server access for data scraping purposes. Or, according to their website, if you want to make your life even easier, ScrapeGoat can extract the data for you and deliver it in a variety of different formats often before you could even finish configuring your off the shelf data scraping program.

Whichever path you choose for your proxy data scraping needs, don't let a few simple tricks thwart you from accessing all the wonderful information stored on the world wide web!

Source: http://ezinearticles.com/?Assuring-Scraping-Success-with-Proxy-Data-Scraping&id=248993

Saturday, 6 June 2015

Twitter Scraper Python Library

I wanted to save the tweets from Transparency Camp. This prompted me to turn Anna‘s basic Twitter scraper into a library. Here’s how you use it.

Import it. (It only works on ScraperWiki, unfortunately.)

from scraperwiki import swimport

search = swimport('twitter_search').search

Then search for terms.

search(['picnic #tcamp12', 'from:TCampDC', '@TCampDC', '#tcamp12', '#viphack'])

A separate search will be run on each of these phrases. That’s it.

A more complete search

Searching for #tcamp12 and #viphack didn’t get me all of the tweets because I waited like a week to do this. In order to get a more complete list of the tweets, I looked at the tweets returned from that first search; I searched for tweets referencing the users who had tweeted those tweets.

from scraperwiki.sqlite import save, select

from time import sleep

# Search by user to get some more

users = [row['from_user'] + ' tcamp12' for row in \

select('distinct from_user from swdata where from_user where user > "%s"' \

% get_var('previous_from_user', ''))]

for user in users:

    search([user], num_pages = 2)

    save_var('previous_from_user', user)

    sleep(2)

By default, the search function retrieves 15 pages of results, which is the maximum. In order to save some time, I limited this second phase of searching to two pages, or 200 results; I doubted that there would be more than 200 relevant results mentioning a particular user.

The full script also counts how many tweets were made by each user.

Library

Remember, this is a library, so you can easily reuse it in your own scripts, like Max Richman did.

Source: https://scraperwiki.wordpress.com/2012/07/04/twitter-scraper-python-library/

Sunday, 31 May 2015

Data Scraping Services - Things to take care while doing Web Scraping!!!

In the present day and age, web scraping word becomes most popular in data science. Basically web scraping is extracting the information from the websites using pre-written programs and web scraping scripts. Many organizations have successfully used web site scraping to build relevant and useful database that they use on a daily basis to enhance their business interests. This is the age of the Big Data and web scraping is one of the trending techniques in the data science.

Throughout my journey of learning web scraping and implementing many successful scraping projects, I have come across some great experiences we can learn from. In this post, I’m going to discuss some of the approaches to take and approaches to avoid while executing web scraping.

User Proxies: Anonymously scraping data from websites

One should not scrape website with a single IP Address. Because when you repeatedly request the web page for web scraping, there is a chance that the remote web server might block your IP address preventing further request to the web page. To overcome this situation, one should scrape websites with the help of proxy servers (anonymous scraping). This will minimize the risk of getting trapped and blacklisted by a website. Use of Proxies to hide your identity (network details) to remote web servers while scraping data. You may also use a VPN instead of proxies to anonymously scrape websites.

Take maximum data and store it.

Do not follow “process the web page as it comes from the remote server”. Instead take all the information and store it to disk. This approach will be useful when your scraping algorithm breaks in the middle. In this case you don’t have to start scraping again. Never download the same content more than once as you are just wasting bandwidth. Try and download all content to disk in one go and then do the processing.

Follow strict rules in parsing:

Check various rules while parsing the information from the web site. For example if you expect a value to be a date then check that it’s really a date. This may greatly improve the quality of information. When you get unexpected data, then the algorithm need to be changed accordingly.

Respect Robots.txt

Robots.txt specifies the set of rules that should be followed by web crawlers and robots. I strongly advise you to consider and adjust your crawler to fully respect robots.txt. Robots.txt contains instructions on the exact pages that you are allowed to crawl, user-agent, and the requisite intervals between page requests. Following to these instructions minimizes the chance of getting blacklisted and banned from website owner.

Use XPath Smartly

XPath is a nice option to select elements of the HTML document more flexibly than CSS Selectors. Be careful about HTML structure change through page to page so one xpath you made may be failed to extract data on another page due to changes in HTML structure.

Obey Website TOC:

Some websites make it absolutely apparent in their terms and conditions that they are particularly against to web scraping activities on their content. This can make you vulnerable against possible ethical and legal implications.

Test sample scrape and verify the data with actual scrape

Once you are done with web scraping project set up, you need to test it for sometimes. Check the extracted data. If something is not good, find out the cause and make changes accordingly and finally come to a perfect web scraping project.

Source: http://webdata-scraping.com/things-take-care-web-scraping/

Thursday, 28 May 2015

Web Scraping Services : What are the ethics of web scraping?

Someone recently asked: "Is web scraping an ethical concept?" I believe that web scraping is absolutely an ethical concept. Web scraping (or screen scraping) is a mechanism to have a computer read a website. There is absolutely no technical difference between an automated computer viewing a website and a human-driven computer viewing a website. Furthermore, if done correctly, scraping can provide many benefits to all involved.

There are a bunch of great uses for web scraping. First, services like Instapaper, which allow saving content for reading on the go, use screen scraping to save a copy of the website to your phone. Second, services like Mint.com, an app which tells you where and how you are spending your money, uses screen scraping to access your bank's website (all with your permission). This is useful because banks do not provide many ways for programmers to access your financial data, even if you want them to. By getting access to your data, programmers can provide really interesting visualizations and insight into your spending habits, which can help you save money.

That said, web scraping can veer into unethical territory. This can take the form of reading websites much quicker than a human could, which can cause difficulty for the servers to handle it. This can cause degraded performance in the website. Malicious hackers use this tactic in what’s known as a "Denial of Service" attack.

Another aspect of unethical web scraping comes in what you do with that data. Some people will scrape the contents of a website and post it as their own, in effect stealing this content. This is a big no-no for the same reasons that taking someone else's book and putting your name on it is a bad idea. Intellectual property, copyright and trademark laws still apply on the internet and your legal recourse is much the same. People engaging in web scraping should make every effort to comply with the stated terms of service for a website. Even when in compliance with those terms, you should take special care in ensuring your activity doesn't affect other users of a website.

One of the downsides to screen scraping is it can be a brittle process. Minor changes to the backing website can often leave a scraper completely broken. Herein lies the mechanism for prevention: making changes to the structure of the code of your website can wreak havoc on a screen scraper's ability to extract information. Periodically making changes that are invisible to the user but affect the content of the code being returned is the most effective mechanism to thwart screen scrapers. That said, this is only a set-back. Authors of screen scrapers can always update them and, as there is no technical difference between a computer-backed browser and a human-backed browser, there's no way to 100% prevent access.

Going forward, I expect screen scraping to increase. One of the main reasons for screen scraping is that the underlying website doesn't have a way for programmers to get access to the data they want. As the number of programmers (and the need for programmers) increases over time, so too will the need for data sources. It is unreasonable to expect every company to dedicate the resources to build a programmer-friendly access point. Screen scraping puts the onus of data extraction on the programmer, not the company with the data, which can work out well for all involved.

Source: https://quickleft.com/blog/is-web-scraping-ethical/

Monday, 25 May 2015

Data Scraping - One application or multiple?

I have 30+ sources of data I scrape daily in various formats (xml, html, csv). Over the last three years Ive built 20 or so c# console applications that go out, download the data and re-format it into a database. But Im curious what other people are doing for this type of task. Are people building one tool that has a lot of variables and inputs or are people designing 20+ programs to scrape and parse this data. Everything is hard-coded into each console and run through Windows Task Manager.

Added a couple additional thoughts/details:

Of the 30 sources, they all have unique properties, all are uploaded into individual MySQL tables and all have varying frequencies. For example, one data source is hit once a minute, another on 5 minute intervals. Majority are once an hour and once a day.

At current I download the formats (xml, csv, html), parse them into a formatted csv and put them into staging folders. Within that folder, I run an application that reads a config file specific to the folder. When a new csv is added to the folder, the application then uploads the data into the specific MySQL tables designated in the config file.

Im wondering if it is worth re-building all this into a larger complex program that is more capable of dynamically adding content+scrapes and adjusting to format changes.

Looking for outside thoughts.

5 Answers

What you are working on is basically ETL. So at a high level you need an export component (get stuff) a transform component (map to known format) and a load (take known format and put stuff somewhere). If you are comfortable being tied to a RDBMS you could use something like SQL Server SSIS packages. What I would do is create a host application that managed common aspects of the overall process (errors, and pipeline processing). Then make the specifics of the E, T, and L pluggable. A low ceremony way to get this would be to host the powershell runtime and create each seesion with common context objects that the scripts will use to communicate. You get a built in pipe and filter model for scripts and easy, safe extensibility. This design has worked extremely for my team with a similar situation.

Resist the temptation to rewrite.

However, for new code, you could plan for what you know has already happened. Write a retrieval mechanism that you can reuse through configuration. Write a translation mechanism that you can reuse (maybe in a library that you can call with very little code). Write a saving mechanism that can be called or configured.

At this point, you've written #21(+). Now, the following ones can be handled with a tiny bit of code and configuration. Yay!

(You may want to implement this in a service that handles multiple conversions, but weight the benefits of it versus the ability to separate errors in one module from the rest.)

1

It depends - if you need the scrapers to feed into a single application/database and have a uniform data format, it makes sense to have them all in a single program (possibly inheriting from a common base scraper).

If not and they are completely unrelated to each other, might as well keep them separate so changes in one have no effect on another.

Update, following edits to question:

Don't change things just for the sake of change. You have something that works, don't mess with it too much.

Since your data sources and data sinks are all separate from each other, combining them into one application will simply create a very complicated application that will be very difficult to change when needed.

Since the scrapers are separate, keep the separation as you have it now.

As sbrenton said, this most falls in with ETL. You should check out Talend Open Studio. It specializes in handling data flows like I imagine yours are as well as other things like duplicate removal, normalization of fields; tens/hundreds of drag and drop ETL components, you can also write custom code as Talend is a code generator as well, either Java or Perl are options. You can also use Talend to execute system commands. I use it for my ETL work, although not in production, in production we will use SSIS, mostly due to lots of other Microsoft products in house.

You may want to use some good scheduling library, like Quartz.NET.

In a few words, here's what you can expect:

Your tasks are represented by classes and not processes
You can set and forget tasks and scale across multiple servers
You have an out-of-the-box system to actually take care of what is needed to be run when, what failed and needs to be re-run, etc. etc.

Source: http://programmers.stackexchange.com/questions/118077/data-scraping-one-application-or-multiple/118098#118098