Sports Site Spyder
Project ID: 1228843656
Project Details
  • Status:
    Closed (Cancelled)
  • Posted:
    12/9/2008 at 12:27 EST
  • Cancelled:
    1/20/2009 at 18:35 EST
  • Project Creator:
  • Budget:
    N/A
  • Description:
    1. Primary purpose is to crawl specific sport sites for news, scores, statistics and photos.

    2. INPUTS –

    a) Crawler – A sophisticated crawler that crawls target
    sites and indexes the listings based on fields such as those
    listed above. Each site being crawled will have a different
    layout, so it must be easy to specify, per site, the layout
    and what to search for and return. As sites change regularly,
    it must be easy to modify the crawler setup from the admin
    console on a per-site basis.

    b) XML Feeds – The crawler must have the ability to accept full
    or partial feeds from target sites. An admin interface must
    enable the matching of RSS/XML feed fields to the
    appropriate database fields and, as with the crawler, the
    scheduling of feed downloads to the database.
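
    As an illustration of the feed field matching described in 2(b), here is a minimal sketch in Python. It assumes an RSS-style feed with <item> elements. The FIELD_MAP dictionary, the database column names and the map_feed_items function are hypothetical, chosen only for this example. In practice the mapping would be stored per site and edited from the admin console.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site mapping: feed element name -> database column name.
FIELD_MAP = {
    "title": "headline",
    "link": "source_url",
    "pubDate": "published_at",
}

def map_feed_items(feed_xml, field_map):
    """Return one dict per <item>, keyed by database column name."""
    root = ET.fromstring(feed_xml)
    rows = []
    for item in root.iter("item"):
        row = {}
        for feed_field, db_column in field_map.items():
            # findtext returns None when the element is missing,
            # which is how a partial feed is tolerated here.
            row[db_column] = item.findtext(feed_field)
        rows.append(row)
    return rows

# Made-up sample feed; it omits pubDate to show the partial-feed case.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <item><title>Final score posted</title><link>http://example.com/a</link></item>
</channel></rss>"""

rows = map_feed_items(SAMPLE_FEED, FIELD_MAP)
```

    Keeping the mapping as plain data rather than code is one way to let an administrator adjust it when a target site changes its feed, without touching the crawler itself.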


    3. ADMIN CONSOLE - The software must have a sophisticated management
    console enabling the following functions:

    a) Automated scheduling of crawlers/feeds for each individual
    target site (hourly, daily, weekly, bi-weekly, etc.), with
    a specific start time and interval for each crawler run

    b) Detailed reporting of crawl progress, results and logs

    c) Exception handling – providing details of items not crawled,
    and listing items that were not matched with location
    entries in our database

    d) Sophisticated duplicate handling, to match and group
    duplicate listings from a number of sites

    e) Sophisticated deletion handling to recognise when
    previously crawled listings are no longer listed on the
    target site, and to handle them accordingly by moving them
    into an inactive or archive table separate from the
    main listings

    f) Backup functions to enable all or part of the database to be
    backed up

    g) The ability to easily search for and edit listing records

    h) It must be easy from the admin console to handle target site
    layout changes, and to specify/edit the layout/target fields
    the crawler is using to gather data.

    i) FEED OUTPUTS – the admin console must have the ability to
    create a number of XML / RSS feeds based on the database.
    The console must allow the creation and management of output
    feeds by specifying the relevant SQL, storing each feed
    setup, and outputting the required results to a specified XML
    file. The scheduling system above must also allow the feed
    outputs to be scheduled automatically.
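
    The duplicate handling in 3(d) and the deletion handling in 3(e) could be approached along the following lines. This is a rough sketch in Python under stated assumptions, not a specification: it assumes each listing is a dict with "site", "title" and "url" keys, treats two listings as duplicates when their titles match after simple normalisation, and treats a listing as removed when its URL was seen in the previous crawl but not in the current one. All names are hypothetical, and a real matcher would likely need fuzzier comparison.

```python
from collections import defaultdict

def normalise(title):
    """Lower-case and collapse whitespace so near-identical titles compare equal."""
    return " ".join(title.lower().split())

def group_duplicates(listings):
    """Group listings (possibly from different sites) sharing a normalised title."""
    groups = defaultdict(list)
    for listing in listings:
        groups[normalise(listing["title"])].append(listing)
    # Only groups with more than one member are duplicates.
    return [group for group in groups.values() if len(group) > 1]

def find_removed(previous_urls, current_urls):
    """Return URLs seen in the previous crawl but absent from the current one."""
    return sorted(set(previous_urls) - set(current_urls))

# Made-up sample data for illustration only.
listings = [
    {"site": "site-a", "title": "Cup Final  Report", "url": "http://a.example/1"},
    {"site": "site-b", "title": "cup final report", "url": "http://b.example/9"},
    {"site": "site-a", "title": "Transfer news", "url": "http://a.example/2"},
]
duplicate_groups = group_duplicates(listings)
removed = find_removed(
    ["http://a.example/1", "http://a.example/2", "http://a.example/3"],
    ["http://a.example/1", "http://a.example/2"],
)
```

    The URLs returned by find_removed are the candidates for moving into the inactive or archive table. A failed crawl would make every listing look removed, so a real system would probably confirm a removal over more than one run before archiving.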
  • Tags:
Project Bids



(10 bids have been placed. garza has chosen to keep all bids for this project hidden.)