Wednesday, 27 August 2014

How do you scrape AJAX pages?

Overview:

All screen scraping starts with a manual review of the page you want to extract data from. When dealing with AJAX, you usually need to analyze a bit more than just the HTML.

When dealing with AJAX, this simply means that the value you want is not in the initial HTML document you requested; instead, JavaScript will be executed that asks the server for the extra information you want.

You can therefore usually just analyze the JavaScript, see which request it makes, and call that URL yourself from the start.

Example:

Take this as an example: assume the page you want to scrape has the following script:

<script type="text/javascript">
function ajaxFunction()
{
var xmlHttp;
try
  {
  // Firefox, Opera 8.0+, Safari
  xmlHttp=new XMLHttpRequest();
  }
catch (e)
  {
  // Internet Explorer
  try
    {
    xmlHttp=new ActiveXObject("Msxml2.XMLHTTP");
    }
  catch (e)
    {
    try
      {
      xmlHttp=new ActiveXObject("Microsoft.XMLHTTP");
      }
    catch (e)
      {
      alert("Your browser does not support AJAX!");
      return false;
      }
    }
  }
  xmlHttp.onreadystatechange=function()
    {
    if(xmlHttp.readyState==4)
      {
      document.myForm.time.value=xmlHttp.responseText;
      }
    }
  xmlHttp.open("GET","time.asp",true);
  xmlHttp.send(null);
  }
</script>

Then all you need to do is make an HTTP request directly to time.asp on the same server. (Example from w3schools.)
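For instance, a minimal sketch in Python 2 (matching the other snippets on this blog) that requests the AJAX endpoint directly; "http://example.com" is a placeholder for the real host:

import urllib2

# Request the endpoint the JavaScript would have called, skipping the page itself.
response = urllib2.urlopen("http://example.com/time.asp")

# This is the same text the script would have written into the form field.
print response.read()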


Source: http://stackoverflow.com/questions/260540/how-do-you-scrape-ajax-pages

Tuesday, 26 August 2014

PDF scraping using R


I have been using the XML package successfully for extracting HTML tables but want to extend to PDFs. From previous questions it does not appear that there is a simple R solution, but I wondered if there had been any recent developments.

Failing that, is there some way in Python (in which I am a complete novice) to obtain and manipulate PDFs so that I could finish the job off with the R XML package?

Extracting text from PDFs is hard, and nearly always requires lots of care.

I'd start with the command line tools such as pdftotext and see what they spit out. The problem is that PDFs can store the text in any order, can use awkward font encodings, and can do things like use ligature characters (the joined up 'ff' and 'ij' that you see in proper typesetting) to throw you.

pdftotext is installable on any Linux system.
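Since the question also asks about Python, here is a rough sketch that drives pdftotext from Python and leaves a plain-text file that R (or anything else) can pick up. It assumes pdftotext is on your PATH, and "report.pdf" stands in for your own file:

import subprocess

# -layout tries to preserve the physical layout of the page,
# which often keeps table columns lined up.
subprocess.check_call(["pdftotext", "-layout", "report.pdf", "report.txt"])

with open("report.txt") as f:
    print f.read()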



Source: http://stackoverflow.com/questions/7918718/pdf-scraping-using-r

Monday, 25 August 2014

Php Scraping data from a website

I am very new to programming and need a little help with getting data from a website and passing it into my PHP script.

The website is http://www.birthdatabase.com/.

I would like to plug in a name (First and Last) and retrieve the result. I know you can query the site by passing the name in the URL, but I am having problems scraping the results.

http://www.birthdatabase.com/cgi-bin/query.pl?textfield=FIRST&textfield2=LAST&age=&affid=

I am using the file_get_contents($URL) function to get the page but need help after that. Specifically, I would like to scrape only the results from a certain state if there are multiple results for that name.



You need the awesome simple_html_dom class.

With this class you can query the webpage's DOM in a similar way to jQuery.

First include the class file in your page, then get the page content with this snippet:

include('simple_html_dom.php'); // the simple_html_dom class file

$html = file_get_html('http://www.birthdatabase.com/cgi-bin/query.pl?textfield=' . $first . '&textfield2=' . $last . '&age=&affid=');

Then you can use CSS selectors to scrape your data (something like this):

$n = 0;
foreach($html->find('table tbody tr td div font b table tbody') as $element) {
    // find() returns an array of matches, so take the first row and read
    // its plain text (the original used @ to silence the resulting error).
    $row[$n]['tr'] = $element->find('tr', 0)->plaintext;
    $n++;
}

// output your data
print_r($row);



Source: http://stackoverflow.com/questions/15601584/php-scraping-data-from-a-website

Obtaining reddit data


I am interested in obtaining data from different reddit subreddits. Does anyone know if there is a reddit (or other) API, similar to what Twitter provides, to crawl all the pages?


Yes, reddit has an API that can be used for a variety of purposes such as data collection, automatic commenting bots, or even to assist in subreddit moderation.

There are a few places to discover information on reddit's API:

    github reddit wiki -- provides the overview and rules for using reddit's API (follow the rules)
    automatically generated API docs -- provides information on the requests needed to access most of the API endpoints
    /r/redditdev -- the reddit community dedicated to answering questions both about reddit's source code and about reddit's API

If there is a particular programming language you are already familiar with, you should check out the existing set of API wrappers for various languages. Despite my bias (I am the package maintainer), I am quite certain PRAW, for Python, has support for the largest number of reddit API features.
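By way of illustration, a minimal PRAW sketch (written against a current PRAW release; the id, secret, and user agent below are placeholders you obtain by registering a script app in your reddit preferences):

import praw

reddit = praw.Reddit(client_id="YOUR_ID",          # placeholder
                     client_secret="YOUR_SECRET",  # placeholder
                     user_agent="research script by /u/yourname")

# Fetch the ten hottest submissions from a subreddit.
for submission in reddit.subreddit("redditdev").hot(limit=10):
    print(submission.title)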



Source: http://stackoverflow.com/questions/14322834/obtaining-reddit-data

Tuesday, 19 August 2014

What is the right way of storing screen-scraping data?


I'm working on a web site that scrapes product details (names, features, prices, etc.) from various web sites, processes them, and displays them. I'm considering running an update script each day to keep the data fresh.

    scrape data
    process them
    store on database
    read(from db) and display them

I'm already storing all the data in an SQL schema, but I'm not sure about it. After each update, all the old records vanish, so if the newly scraped data comes in corrupted somehow, there is nothing to show.

So, is there any common way to archive the old data? Which is more convenient: separate SQL schemas or XML files? Or something else?
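One common pattern (a sketch, not an answer from the thread): tag every scrape run with a timestamp and insert new rows instead of overwriting, so the previous run stays available as a fallback. The table and column names here are made up for illustration; the example uses Python's stdlib sqlite3:

import sqlite3, datetime

conn = sqlite3.connect("products.db")
conn.execute("""CREATE TABLE IF NOT EXISTS products (
                    scraped_at TEXT, name TEXT, price REAL)""")

# Each update run shares one timestamp.
run = datetime.datetime.utcnow().isoformat()
conn.execute("INSERT INTO products VALUES (?, ?, ?)", (run, "Widget", 9.99))
conn.commit()

# Display only the most recent run; older runs remain queryable.
latest, = conn.execute("SELECT MAX(scraped_at) FROM products").fetchone()
rows = conn.execute("SELECT name, price FROM products WHERE scraped_at = ?",
                    (latest,)).fetchall()
print(rows)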

Source: http://stackoverflow.com/questions/13686474/what-is-the-right-way-of-storing-screen-scraping-data

Scraping dynamic data


I am scraping profiles on ask.fm for a research question. The problem is that only the most recent questions are viewable and I have to click "view more" to see the next 15.

The source code for clicking view more looks like this:

<input class="submit-button-more submit-button-more-active" name="commit" onclick="return Forms.More.allowSubmit(this)" type="submit" value="View more" />

What is an easy way of calling this 4 times before scraping it? I want the most recent 60 posts on the site. Python is preferable.

You could probably use Selenium to browse to the website and click on the button/link a few times (see the sketch after these links). You can get it here:

    https://pypi.python.org/pypi/selenium

Or you might be able to do it with mechanize:

    http://wwwsearch.sourceforge.net/mechanize/

I have also heard good things about twill, but never used it myself:

    http://twill.idyll.org/
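For the Selenium route, a rough sketch (the class name comes from the HTML quoted above; the profile URL is a placeholder, and the fixed sleep is a crude stand-in for a proper wait):

import time
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://ask.fm/some_profile")  # placeholder profile URL

# Click "View more" four times, pausing so the next 15 posts can load.
for _ in range(4):
    button = driver.find_element_by_class_name("submit-button-more")
    button.click()
    time.sleep(2)

html = driver.page_source  # should now contain about 60 posts
driver.quit()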



Source: http://stackoverflow.com/questions/19437782/scraping-dynamic-data

Web Scraping data from different sites


I am looking for a few ideas on how can I solve a design problem I'm going to be faced with building a web scraper to scrape multiple sites. Writing the scraper(s) is not the problem, matching the data from different sites (which may have small differences) is.

For the sake of being generic assume that I am scraping something like this from two or more different sites:

    public class Data {
        public int id;
        public String firstname;
        public String surname;
        ....
    }

If I scrape this from two different sites, I will encounter situations where I could have the following:

Site A: id=100, firstname=William, surname=Doe

Site B: id=1974, firstname=Bill, surname=Doe

Essentially, I would like to consider these two sets of data the same (they are the same person but with their name slightly different on each site). I am looking for possible design solutions that can handle this.

The only idea I've come up with is scraping the data from a third location and using it as a reference list. Then, when I scrape site A or B, I can, over time, build up a list of failures and store them in a list for each scraper so that it can know (if I find id=100 then I know that the firstname will be William, etc.). I can't help but feel this is a rubbish idea!
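One possible direction (a sketch, not from the thread): canonicalize names through a nickname table before comparing records, so "Bill Doe" and "William Doe" produce the same key. The nickname map here is a tiny stand-in for a real reference list:

NICKNAMES = {"bill": "william", "will": "william", "bob": "robert"}

def canonical_key(firstname, surname):
    first = firstname.strip().lower()
    # Map known nicknames onto a canonical first name.
    return (NICKNAMES.get(first, first), surname.strip().lower())

site_a = {"id": 100,  "firstname": "William", "surname": "Doe"}
site_b = {"id": 1974, "firstname": "Bill",    "surname": "Doe"}

# Both records yield ('william', 'doe'), so they can be merged.
print(canonical_key(site_a["firstname"], site_a["surname"]) ==
      canonical_key(site_b["firstname"], site_b["surname"]))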

If you need any more info, or if you think my description is a bit naff, let me know!

Thanks,

DMcB


Source: http://stackoverflow.com/questions/23970057/web-scraping-data-from-different-sites

Scrape Data Point Using Python


I am looking to scrape a data point using Python from the URL http://www.cavirtex.com/orderbook.

The data point I am looking to scrape is the lowest bid offer, which at the current moment looks like this:

<tr>
 <td><b>Jan. 19, 2014, 2:37 a.m.</b></td>
 <td><b>0.0775/0.1146</b></td>
 <td><b>860.00000</b></td>
 <td><b>66.65 CAD</b></td>
</tr>

The relevant point being the 860.00000. I am looking to build this into a script which can send me an email to alert me of certain price differentials compared to other exchanges.

I'm quite a noob, so if in your explanations you could offer your thought process on why you've done certain things, it would be very much appreciated.

Thank you in advance!

Edit: This is what I have so far, which returns the title correctly; I'm having trouble grabbing the table data though.

import urllib2, sys
from bs4 import BeautifulSoup

site= "http://cavirtex.com/orderbook"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(site,headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
print soup.title



Here is the code for scraping the lowest bid from the 'Buying BTC' table:

from selenium import webdriver

fp = webdriver.FirefoxProfile()
browser = webdriver.Firefox(firefox_profile=fp)
browser.get('http://www.cavirtex.com/orderbook')

lowest_bid = float('inf')
elements = browser.find_elements_by_xpath('//div[@id="orderbook_buy"]/table/tbody/tr/td')

for element in elements:
    text = element.get_attribute('innerHTML').strip('<b>|</b>')
    try:
        bid = float(text)
        if lowest_bid > bid:
            lowest_bid = bid
    except:
        pass

browser.quit()
print lowest_bid

In order to install Selenium for Python on your Windows-PC, run from a command line:

pip install selenium (or pip install selenium --upgrade if you already have it).

If you want the 'Selling BTC' table instead, then change "orderbook_buy" to "orderbook_sell".

If you want the 'Last Trades' table instead, then change "orderbook_buy" to "orderbook_trades".

Note:

If you consider performance critical, then you can implement the data scraping via a plain URL connection instead of Selenium, and have your program run much faster. However, your code will probably end up a lot "messier", due to the tedious parsing that you'll be obliged to apply...
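A sketch of that browser-free approach (Python 2 like the rest of this answer; it assumes the page uses the same "orderbook_buy" div id as the Selenium version above):

import urllib2
from bs4 import BeautifulSoup

req = urllib2.Request("http://www.cavirtex.com/orderbook",
                      headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(urllib2.urlopen(req).read())

bids = []
buy_div = soup.find("div", id="orderbook_buy")
if buy_div is not None:
    for td in buy_div.find_all("td"):
        try:
            bids.append(float(td.get_text()))
        except ValueError:
            pass  # skip date and other non-numeric cells

if bids:
    print min(bids)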

Here is the code for sending the previous output in an email from yourself to yourself:

import smtplib,ssl

def Print(message):
    # 'Print' is called below but was never defined in the original
    # answer; a plain print shim keeps the snippet runnable.
    print message

def SendMail(username,password,contents):
    server = Connect(username)
    try:
        server.login(username,password)
        server.sendmail(username,username,contents)
    except smtplib.SMTPException,error:
        Print(error)
    Disconnect(server)

def Connect(username):
    serverName = username[username.index("@")+1:username.index(".")]
    while True:
        try:
            server = smtplib.SMTP(serverDict[serverName])
        except smtplib.SMTPException,error:
            Print(error)
            continue
        try:
            server.ehlo()
            if server.has_extn("starttls"):
                server.starttls()
                server.ehlo()
        except (smtplib.SMTPException,ssl.SSLError),error:
            Print(error)
            Disconnect(server)
            continue
        break
    return server

def Disconnect(server):
    try:
        server.quit()
    except smtplib.SMTPException,error:
        Print(error)

serverDict = {
    "gmail"  :"smtp.gmail.com",
    "hotmail":"smtp.live.com",
    "yahoo"  :"smtp.mail.yahoo.com"
}

SendMail("your_username@your_provider.com","your_password",str(lowest_bid))

The above code should work if your email provider is either gmail or hotmail or yahoo.

Please note that depending on your firewall configuration, it may ask your permission upon the first time you try it...



Source: http://stackoverflow.com/questions/21217034/scrape-data-point-using-python


Scraping Data from Table Python


I'm trying to scrape data from a website's table using Python.

from bs4 import BeautifulSoup
from mechanize import Browser

BASE_URL = "http://www.ggp.com/properties/mall-directory"

def main():
    mech = Browser()
    url = "http://www.ggp.com/properties/mall-directory"
    page1 = mech.open(url)
    html1 = page1.read()
    soup1 = BeautifulSoup(html1)
    extract(soup1, 2007)


def extract(soup,year):
    table = soup.find("table")
    for row in table.findAll('option'):
        print row


main()

Row prints out:

<option value="184">Yakima, WA</option>
<option value="896">Yankton, SD</option>
<option value="851">Yazoo City, MS</option>
<option value="113">York-Hanover, PA</option>
<option value="87">Youngstown-Warren, OH-PA</option>
<option value="235">Yuba City, CA</option>
<option value="205">Yuma, AZ</option>
<option value="424">Zanesville, OH</option>

But what I need is

Yakima, WA
Yankton, SD
Yazoo City, MS
York-Hanover, PA
etc...

I've tried row.findAll('option value') but this doesn't work...
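A likely fix (a sketch, not an answer quoted from the thread): findAll('option value') looks for a tag literally named "option value", which doesn't exist. Take the text of each option element instead:

def extract(soup, year):
    table = soup.find("table")
    for row in table.findAll('option'):
        # .get_text() returns the element's text content,
        # e.g. "Yakima, WA" without the surrounding tag.
        print row.get_text()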



Source: http://stackoverflow.com/questions/24124291/scraping-data-from-table-python

What is Web Scraping and is Python the best language to use for this? [closed]


What is Web Scraping and is Python the best language to use for this? If so why is python the best?



    What is Web Scraping

The process of making HTTP requests to websites and then extracting data from HTML documents (as opposed to using an official API)

    and is Python the best language to use for this?

Subjective. Argumentative. Insert favourite language (Perl Perl Perl) here.



Web scraping is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the Web by either implementing low-level Hypertext Transfer Protocol (HTTP), or embedding certain full-fledged Web browsers. Web scraping is closely related to Web indexing, which indexes Web content using a bot and is a universal technique adopted by most search engines. In contrast, Web scraping focuses more on the transformation of unstructured Web content, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Web scraping is also related to Web automation, which simulates human Web browsing using computer software. Uses of Web scraping include online price comparison, weather data monitoring, website change detection, Web research, Web content mashup and Web data integration. (Wikipedia)

Like any language, Python has certain advantages and disadvantages. For example, Perl is an easier language to do regular expressions with. Then again, BeautifulSoup, a module for Python, makes web scraping really, really easy.
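To illustrate just how little code it takes, a self-contained BeautifulSoup example (the inline HTML is made up for the demonstration):

from bs4 import BeautifulSoup

html = "<html><body><h1>Hello</h1><a href='/x'>link</a></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.h1.get_text())     # -> Hello
print(soup.a["href"])         # -> /x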




Web scraping is the process of automatically collecting Web information.

Python and Perl have vast libraries for web scraping. Even though I have a personal preference towards Perl when it comes to data extraction, because of the ease with which you can use regexes, technically Python is no less capable. However, based on the active community, Python would be the better choice.

Python:

BeautifulSoup module which can extract HTML content.

Scrapy Open source framework for web scraping in Python. It is a very elegant framework.

Html5lib HTML parser based on the HTML5 specification.

Perl

WWW::Mechanize -- Very powerful module for fetching and parsing webpages, and easy to use.

Win32::IE::Mechanize -- If you are using Windows and want to scrape JavaScript-based pages.

Mozilla::Mechanize -- Scrape JavaScript-based pages from Linux, but you need to install GTK.

If you want to use Ruby for this task, Nokogiri and Mechanize are probably the right tools for the job.



If you are looking into web scraping with Python, you may also want to look at the lxml package. It has the same ElementTree API as the stdlib parser, but is much more powerful and faster.

Plus, the lxml package provides the brilliant lxml.cssselect module.
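A small lxml sketch showing the ElementTree-style API together with lxml.cssselect (the inline HTML is made up for the demonstration):

from lxml import html

doc = html.fromstring("<table><tr><td class='price'>860.0</td></tr></table>")

# CSS selectors via lxml.cssselect:
for cell in doc.cssselect("td.price"):
    print(cell.text_content())   # -> 860.0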

Depending on your needs, you may want to start with either bare lxml or some web-scraping library like Scrapy, mechanize or SmartHTTP.


Source: http://stackoverflow.com/questions/5926863/what-is-web-scraping-and-is-python-the-best-language-to-use-for-this


web scraping groupon


I want to scrape groupon.com. My problem is that such sites, when you load them for the first time, ask you to join their email service, but when you reload the page they show you the content directly. How do I do that? I am using PHP for my scripting.

Also, if anyone could suggest a framework or library in PHP which makes scraping easy, that would be great.


I would investigate the cURL library for grabbing website content. I'm not sure of the exact information you want to scrape, or whether the refresh will cause an issue, but hopefully this launches your attempt.

We use iMacros. PRO: works in a browser, works with any website. CON: not as fast as cURL. Of course, nothing stops you from using both.



Must you stick with PHP for the scraping? TestPlan makes this type of testing easy. You can either access the page again, or simply use TestPlan to sign up for their email list to gain extended access to their site.

Here's a rough example that takes you to the main page and closes the little popup:

GotoURL http://www.groupon.com/
Click id:step_one

SubmitForm with
    %Params:subscription[email_address]% somewhere@test.domain.xx
end


They have an API http://www.groupon.com/pages/api if that helps.


Source: http://stackoverflow.com/questions/3843733/web-scraping-groupon

Getting Different Results For Web Scraping


I was trying to do web scraping and was using the following code :

import mechanize
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2010/06/19/"

br =  mechanize.Browser()
htmltext = br.open(url).read()

link_dictionary = {}
soup = BeautifulSoup(htmltext)

for tag_li in soup.findAll('li', attrs={"data-section":"Chennai"}):
    for link in tag_li.findAll('a'):
        link_dictionary[link.string] = link.get('href')
        print link_dictionary[link.string]
        urlnew = link_dictionary[link.string]

        brnew =  mechanize.Browser()
        htmltextnew = brnew.open(urlnew).read()

        articletext = ""
        soupnew = BeautifulSoup(htmltextnew)
        for tag in soupnew.findAll('p'):
            articletext += tag.text

        print articletext

I was unable to get any printed values by using this. But on using attrs={"data-section":"Business"} instead of attrs={"data-section":"Chennai"} I was able to get the desired output. Can someone help me?


READ THE TERMS OF SERVICE OF THE WEBSITE BEFORE SCRAPING

If you are using Firebug or Inspect Element in Chrome, you might see some contents that will not be seen if you are using Mechanize or urllib2.

For example, when you view the source code of the page as it was sent to you (right click > View Source in Chrome) and search for the data-section attribute, you won't see any tags with "Chennai". I am not 100% sure, but I would say those contents are populated by JavaScript etc., which requires the functionality of a browser.

If I were you, I would use Selenium to open up the page and then get the source page from there; the HTML collected that way will be more like what you see in a browser.

Cited here

from selenium import webdriver
from bs4 import BeautifulSoup
import time   

driver = webdriver.Firefox()
driver.get("URL GOES HERE")
# I noticed there is an ad here, sleep til page fully loaded.
time.sleep(10)

soup = BeautifulSoup(driver.page_source)
print len(soup.findAll(...))  # e.g. the 'li' selector from the question
# or you can work directly in selenium     
...

driver.close()

And the output for me is 8

Source: http://stackoverflow.com/questions/19918153/getting-different-results-for-web-scraping

Monday, 18 August 2014

Scraping data from National weather services

Hi, I want to get the latest weather data from the link below, meaning the very first row of data. I am trying to scrape the data but have not been successful in doing so. Can anyone please help, or, if you know any better way, point me in that direction? Thanks!

This is what I have:

$data = file_get_contents('http://w1.weather.gov/obhistory/KIKV.html');
$regex = '/<td>(.+?) </td>/';
preg_match($regex,$data,$match);
var_dump($match);
echo $match[1];

I get an error on the regex line saying "Unknown modifier 't'". Any suggestions?


Although parsing HTML with a regex is usually a very bad idea, the error you are receiving is caused by the / character in </td>. You are using that character as the delimiter for your pattern, so you need to escape it or use another delimiter:

$regex = '/<td>(.+?) <\/td>/';

or

$regex = '#<td>(.+?) </td>#';

Source: http://stackoverflow.com/questions/11943420/scraping-data-from-national-weather-services

Monday, 11 August 2014

Collecting Data With Web Scrapers

There is a large amount of data available only through websites. However, as many people have found out, trying to copy data into a usable database or spreadsheet directly out of a website can be a tiring process. Data entry from internet sources can quickly become cost prohibitive as the required hours add up. Clearly, an automated method for collating information from HTML-based sites can offer huge management cost savings.

Web scrapers are programs that are able to aggregate information from the internet. They are capable of navigating the web, assessing the contents of a site, and then pulling data points and placing them into a structured, working database or spreadsheet. Many companies and services will use programs to web scrape, such as comparing prices, performing online research, or tracking changes to online content.

Let's take a look at how web scrapers can aid data collection and management for a variety of purposes.

Improving On Manual Entry Methods

Using a computer's copy and paste function or simply typing text from a site is extremely inefficient and costly. Web scrapers are able to navigate through a series of websites, make decisions on what is important data, and then copy the info into a structured database, spreadsheet, or other program. Software packages include the ability to record macros by having a user perform a routine once and then have the computer remember and automate those actions. Every user can effectively act as their own programmer to expand the capabilities to process websites. These applications can also interface with databases in order to automatically manage information as it is pulled from a website.

Aggregating Information

There are a number of instances where material stored in websites can be manipulated and stored. For example, a clothing company that is looking to bring their line of apparel to retailers can go online for the contact information of retailers in their area and then present that information to sales personnel to generate leads. Many businesses can perform market research on prices and product availability by analyzing online catalogues.

Data Management

Managing figures and numbers is best done through spreadsheets and databases; however, information on a website formatted with HTML is not readily accessible for such purposes. While websites are excellent for displaying facts and figures, they fall short when they need to be analyzed, sorted, or otherwise manipulated. Ultimately, web scrapers are able to take the output that is intended for display to a person and change it to numbers that can be used by a computer. Furthermore, by automating this process with software applications and macros, entry costs are severely reduced.

This type of data management is also effective at merging different information sources. If a company were to purchase research or statistical information, it could be scraped in order to format the information into a database. This is also highly effective at taking a legacy system's contents and incorporating them into today's systems.

Overall, a web scraper is a cost effective user tool for data manipulation and management.

Source: http://ezinearticles.com/?Collecting-Data-With-Web-Scrapers&id=4223877

Data Extraction - A Guideline to Using Scraping Tools Effectively

Many people around the world do not know much about scraping tools. In their view, mining means extracting resources from the earth, but in the internet age the newly mined resource is data. Many data mining software tools are available on the internet to extract specific data from the web. Every company deals with tons of data, and managing and converting that data into a useful form is hectic work. If the right information is not available at the right time, a company loses valuable time in making strategic decisions based on accurate information.

Such situations close off opportunities in the present competitive market. In these situations, data extraction and data mining tools help you make strategic decisions at the right time to reach your goals. These tools bring many advantages: you can store customer information in an orderly manner, learn about your competitors' operations, and gauge your own company's performance. It is critical for every company to have this information at its fingertips when it is needed.

To survive in this competitive business world, data extraction and data mining are critical to a company's operations. A powerful tool called a website scraper is used in online data mining. With this tool, you can filter data on the internet and retrieve information for specific needs. Scraping tools are used in many fields, and their types are numerous. Research, surveillance, and the harvesting of direct marketing leads are just a few of the ways the website scraper assists professionals in the workplace.

A screen scraping tool is another tool useful for extracting data from the web. It is helpful when you work on the internet and want to mine data to your local hard disk. It provides a graphical interface that lets you designate the URL, the data elements to be extracted, and the scripting logic to traverse pages and work with the mined data, and you can run it at periodic intervals. With this tool, you can download data from the internet into your spreadsheets. Another important scraping tool is data mining software, which extracts large amounts of information from the web and compiles it into a useful format. It is used in various sectors of business, especially by those generating leads, establishing budgets, checking competitors' prices, and analyzing online trends. With this tool, information is gathered and immediately put to use for your business needs.

Another useful scraping tool is the email scraping tool, which crawls public email addresses from various web sites. You can easily build a large mailing list with it, and you can use those lists to promote your product online, send proposals and offers for related business, and more. With this tool, you can find customers targeted toward your product, or potential business partners, allowing you to expand your business in the online market.

Many well-established and esteemed organizations provide these features free of cost as a trial offer to customers. If you want a permanent service, you need to pay a nominal fee. You can also download these tools from their web sites.

Source: http://ezinearticles.com/?Data-Extraction---A-Guideline-to-Use-Scrapping-Tools-Effectively&id=3600918

Digging Up Dollars With Data Mining - An Executive's Guide

Introduction

Traditionally, organizations use data tactically - to manage operations. For a competitive edge, strong organizations use data strategically - to expand the business, to improve profitability, to reduce costs, and to market more effectively. Data mining (DM) creates information assets that an organization can leverage to achieve these strategic objectives.

In this article, we address some of the key questions executives have about data mining. These include:
  •     What is data mining?
  •     What can it do for my organization?
  •     How can my organization get started?
Business Definition of Data Mining

Data mining is a new component in an enterprise's decision support system (DSS) architecture. It complements and interlocks with other DSS capabilities such as query and reporting, on-line analytical processing (OLAP), data visualization, and traditional statistical analysis. These other DSS technologies are generally retrospective. They provide reports, tables, and graphs of what happened in the past. A user who knows what she's looking for can answer specific questions like: "How many new accounts were opened in the Midwest region last quarter," "Which stores had the largest change in revenues compared to the same month last year," or "Did we meet our goal of a ten-percent increase in holiday sales?"

We define data mining as "the data-driven discovery and modeling of hidden patterns in large volumes of data." Data mining differs from the retrospective technologies above because it produces models - models that capture and represent the hidden patterns in the data. With it, a user can discover patterns and build models automatically, without knowing exactly what she's looking for. The models are both descriptive and prospective. They address why things happened and what is likely to happen next. A user can pose "what-if" questions to a data-mining model that can not be queried directly from the database or warehouse. Examples include: "What is the expected lifetime value of every customer account," "Which customers are likely to open a money market account," or "Will this customer cancel our service if we introduce fees?"

The information technologies associated with DM are neural networks, genetic algorithms, fuzzy logic, and rule induction. It is outside the scope of this article to elaborate on all of these technologies. Instead, we will focus on business needs and how data mining solutions for these needs can translate into dollars.

Mapping Business Needs to Solutions and Profits

What can data mining do for your organization? In the introduction, we described several strategic opportunities for an organization to use data for advantage: business expansion, profitability, cost reduction, and sales and marketing. Let's consider these opportunities very concretely through several examples where companies successfully applied DM.

Expanding your business: Keystone Financial of Williamsport, PA, wanted to expand their customer base and attract new accounts through a LoanCheck offer. To initiate a loan, a recipient just had to go to a Keystone branch and cash the LoanCheck. Keystone introduced the $5000 LoanCheck by mailing a promotion to existing customers.

The Keystone database tracks about 300 characteristics for each customer. These characteristics include whether the person had already opened loans in the past two years, the number of active credit cards, the balance levels on those cards, and finally whether or not they responded to the $5000 LoanCheck offer. Keystone used data mining to sift through the 300 customer characteristics, find the most significant ones, and build a model of response to the LoanCheck offer. Then, they applied the model to a list of 400,000 prospects obtained from a credit bureau.

By selectively mailing to the best-rated prospects determined by the DM model, Keystone generated $1.6M in additional net income from 12,000 new customers.

Reducing costs: Empire Blue Cross/Blue Shield is New York State's largest health insurer. To compete with other healthcare companies, Empire must provide quality service and minimize costs. Attacking costs in the form of fraud and abuse is a cornerstone of Empire's strategy, and it requires considerable investigative skill as well as sophisticated information technology.

The latter includes a data mining application that profiles each physician in the Empire network based on patient claim records in their database. From the profile, the application detects subtle deviations in physician behavior relative to her/his peer group. These deviations are reported to fraud investigators as a "suspicion index." A physician who performs a high number of procedures per visit, charges 40% more per patient, or sees many patients on the weekend would be flagged immediately from the suspicion index score.

What has this DM effort returned to Empire? In the first three years, they realized fraud-and-abuse savings of $29M, $36M, and $39M respectively.

Improving sales effectiveness and profitability: Pharmaceutical sales representatives have a broad assortment of tools for promoting products to physicians. These tools include clinical literature, product samples, dinner meetings, teleconferences, golf outings, and more. Knowing which promotions will be most effective with which doctors is extremely valuable since wrong decisions can cost the company hundreds of dollars for the sales call and even more in lost revenue.

The reps for a large pharmaceutical company collectively make tens of thousands of sales calls. One drug maker linked six months of promotional activity with corresponding sales figures in a database, which they then used to build a predictive model for each doctor. The data-mining models revealed, for instance, that among six different promotional alternatives, only two had a significant impact on the prescribing behavior of physicians. Using all the knowledge embedded in the data-mining models, the promotional mix for each doctor was customized to maximize ROI.

Although this new program was rolled out just recently, early responses indicate that the drug maker will exceed the $1.4M sales increase originally projected. Given that this increase is generated with no new promotional spending, profits are expected to increase by a similar amount.

Looking back at this set of examples, we must ask, "Why was data mining necessary?" For Keystone, response to the loan offer did not exist in the new credit bureau database of 400,000 potential customers. The model predicted the response given the other available customer characteristics. For Empire, the suspicion index quantified the differences between physician practices and peer (model) behavior. Appropriate physician behavior was a multi-variable aggregate produced by data mining - once again, not available in the database. For the drug maker, the promotion and sales databases contained the historical record of activity. An automated data mining method was necessary to model each doctor and determine the best combination of promotions to increase future sales.

Getting Started

In each case presented above, data mining yielded significant benefits to the business. Some were top-line results that increased revenues or expanded the customer base. Others were bottom-line improvements resulting from cost-savings and enhanced productivity. The natural next question is, "How can my organization get started and begin to realize the competitive advantages of DM?"

In our experience, pilot projects are the most successful vehicles for introducing data mining. A pilot project is a short, well-planned effort to bring DM into an organization. Good pilot projects focus on one very specific business need, and they involve business users up front and throughout the project. The duration of a typical pilot project is one to three months, and it generally requires 4 to 10 people part-time.

The role of the executive in such pilot projects is two-pronged. At the outset, the executive participates in setting the strategic goals and objectives for the project. During the project and prior to roll out, the executive takes part by supervising the measurement and evaluation of results. Lack of executive sponsorship and failure to involve business users are two primary reasons DM initiatives stall or fall short.

In reading this article, perhaps you've developed a vision and want to proceed - to address a pressing business problem by sponsoring a data mining pilot project. Twisting the old adage, we say "just because you should doesn't mean you can." Be aware that a capability assessment needs to be an integral component of a DM pilot project. The assessment takes a critical look at data and data access, personnel and their skills, equipment, and software. Organizations typically underestimate the impact of data mining (and information technology in general) on their people, their processes, and their corporate culture. The pilot project provides a relatively high-reward, low-cost, and low-risk opportunity to quantify the potential impact of DM.

Another stumbling block for an organization is deciding to defer any data mining activity until a data warehouse is built. Our experience indicates that, oftentimes, DM could and should come first. The purpose of the data warehouse is to provide users the opportunity to study customer and market behavior both retrospectively and prospectively. A data mining pilot project can provide important insight into the fields and aggregates that need to be designed into the warehouse to make it really valuable. Further, the cost savings or revenue generation provided by DM can provide bootstrap funding for a data warehouse or related initiatives.

Source: http://ezinearticles.com/?Digging-Up-Dollars-With-Data-Mining---An-Executives-Guide&id=6052872