Sports Site Spyder
Project ID: 1228843656
Project Details
- Status: Closed (Cancelled)
- Posted: 12/9/2008 at 12:27 EST
- Cancelled: 1/20/2009 at 18:35 EST
- Project Creator:
- Budget: N/A
- Description: 1. The primary purpose is to crawl specific sports sites for news, scores, statistics and photos.
2. INPUTS –
a) Crawler – A sophisticated crawler that crawls target
sites and indexes the listings based on fields such as those listed above. Each site being crawled will have a different layout, so it must be possible to easily specify, per site, the layout and what to search for and return. As sites change regularly, it must be easy to modify the crawler setup on a per-site basis from the admin console.
b) XML Feeds – The crawler must have the ability to accept full
or partial feeds from target sites. An admin interface must
enable the matching of the RSS/XML feed fields to the
appropriate database fields and, as with the crawler, the
ability to schedule the feed downloads to the database.
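
As an illustration of the per-site setup described in 2(a) and 2(b), the sketch below keeps each site's page layout and feed field mapping in one editable structure. It is only a minimal sketch under stated assumptions: the site name, the regular expressions, the database field names and both helper functions are hypothetical, and a production crawler would more likely use a proper HTML parser than regular expressions.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical per-site setup: every name, pattern and field is illustrative.
SITE_CONFIG = {
    "example-sports-site": {
        # Crawled pages: one regex per database field, editable per site.
        "layout": {
            "headline": r"<h1[^>]*>(.*?)</h1>",
            "score": r'<span class="score">(.*?)</span>',
        },
        # Feeds: feed element name -> database field name.
        "feed_map": {
            "title": "headline",
            "pubDate": "published_at",
            "link": "source_url",
        },
    }
}


def extract_from_page(site, html):
    """Apply the site's layout patterns to one page; missing fields become None."""
    record = {}
    for field, pattern in SITE_CONFIG[site]["layout"].items():
        match = re.search(pattern, html, re.DOTALL)
        record[field] = match.group(1).strip() if match else None
    return record


def extract_from_feed(site, xml_text):
    """Map each <item> of an RSS-style feed onto database field names."""
    feed_map = SITE_CONFIG[site]["feed_map"]
    records = []
    for item in ET.fromstring(xml_text).iter("item"):
        record = {}
        for feed_field, db_field in feed_map.items():
            # A partial feed simply yields None for the elements it omits.
            record[db_field] = item.findtext(feed_field)
        records.append(record)
    return records
```

With this shape, handling a layout change on a target site means editing one entry in the configuration rather than changing crawler code, which is the behaviour the admin console is expected to expose.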
3. ADMIN CONSOLE – The software must have a sophisticated management
console enabling the following functions:
a) Automated Scheduling of crawlers/feeds for each individual
target site [every hour, daily, weekly, bi-weekly, etc] and
setting specific time & interval for the crawler to run
b) Detailed reporting of crawl progress, results and logs
c) Exception handling – providing details of items not crawled,
and listing items that were not matched with location
entries in our database
d) Sophisticated duplicate handling, to match and group
duplicate listings from a number of sites
e) Sophisticated deletions handling to recognise that
previously crawled listings are no longer listed on the
target site and to handle these accordingly by moving these
listings into an inactive or archive table separate to the
main listings
f) Backup functions to enable all or part of the database to be
backed up
g) The ability to easily search for and edit listing records
h) It must be easy from the admin console to handle target site
layout changes, and to specify/edit the layout/target fields
the crawler is using to gather data.
i) FEED OUTPUTS – the admin console must have the ability to
create a number of XML / RSS feeds based on the database.
The console must allow the creation/management of output
feeds by specifying the relevant SQL, storing each feed
setup and outputting the required results to a specified XML
file. The scheduling system above must also allow the feed
outputs to be scheduled automatically.
- Tags:
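
The feed output requirement in 3(i) could look roughly like the sketch below: a stored feed setup holds the SQL and the target file, and a function turns the query results into XML. This is a sketch under assumptions, not a specified design: SQLite stands in for whatever database is chosen, and the `listings` table, its columns, the feed name and the output path are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical stored feed setup: name, SQL and output path are illustrative.
FEED_SETUP = {
    "name": "latest-scores",
    "sql": "SELECT headline, score FROM listings WHERE active = 1 ORDER BY id",
    "output_path": "latest-scores.xml",
}


def build_feed_xml(conn, feed):
    """Run the feed's stored SQL and return the rows as an XML string."""
    cursor = conn.execute(feed["sql"])
    columns = [col[0] for col in cursor.description]
    root = ET.Element("feed", name=feed["name"])
    for row in cursor:
        item = ET.SubElement(root, "item")
        for column, value in zip(columns, row):
            # Column names are assumed to be valid XML element names.
            ET.SubElement(item, column).text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")


def write_feed(conn, feed):
    """Write the feed to its configured file; a scheduler could call this."""
    with open(feed["output_path"], "w", encoding="utf-8") as handle:
        handle.write(build_feed_xml(conn, feed))
```

Keeping the SQL and output path together in one stored record per feed means the same scheduler that runs crawlers could also call `write_feed` at the configured interval.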