Script to convert website pages into text file
Project ID: 1328477448
Project Details
  • Status:
    Closed (Cancelled)
  • Posted:
    2/5/2012 at 16:30 EST
  • Cancelled:
    2/13/2012 at 10:50 EST
  • Project Creator:
    (No Feedback Yet)
  • Budget:
    $10-200
  • Description:
    I need a script that can be run periodically to aggregate the HTML source data from the Texas Statutes (http://www.statutes.legis.state.tx.us/) into text files that preserve the structure of the data, consistent with the Title, Chapter, Section sequence displayed on the website. There should be one text file per Code, i.e., 30 separate files, one for each of the Agriculture Code, Alcoholic Beverage Code, Auxiliary Water Laws, etc., through to Vernon's Civil Statutes.
    Additional Info (Added 2/6/2012 at 16:02 EST)...
    I need a script that does the following:

    (A) When first run, it will download the .DOC source data from the Texas Statutes (http://www.statutes.legis.state.tx.us/Download.aspx) onto a remote server and unzip it.
    (B) Once the data is initially downloaded, the script will then periodically check this site to determine if the files have been updated or new ones have been added, and download these changed files. If any of the files are no longer hosted on this website, the script will automatically alert a pre-defined email account of the change.
    Additional Info (Added 2/6/2012 at 17:40 EST)...
    Please disregard the previous task descriptions.

    I need a script that does the following:

    (A) The first time it is run, it will download each of the .zip files that contain .HTML files from the Texas Statutes (http://www.statutes.legis.state.tx.us/Download.aspx) onto a remote Linux server and unzip them. For each unzipped archive, the script will concatenate the text of the .HTML files from that particular .zip into a single .HTML file that preserves the numbering order of the original files.

    Once this task runs, the concatenated .html files (one for each .zip) will be saved, and the .zip files and the non-concatenated .HTML files will be deleted.
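
    The concatenation part of Step A could be sketched roughly as below, in Python. This is a minimal sketch under assumptions, not a finished solution: it assumes the numbering order is carried in the file names (for example a hypothetical "ag.2.htm" before "ag.10.htm"), the function names natural_key and concatenate_zip are made up for illustration, and downloading the .zip files is left out. The archive members are read directly from the .zip, so the individual .HTML files are never written to disk and do not need to be deleted afterwards.

```python
import os
import re
import zipfile


def natural_key(name):
    """Split a name into text and integer parts so 'x.10.htm' sorts after 'x.2.htm'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def concatenate_zip(zip_path, out_path, delete_zip=False):
    """Write every .htm/.html member of zip_path into out_path in natural order.

    Assumption: the numbering order is encoded in the member file names.
    Returns the member names in the order they were written.
    """
    with zipfile.ZipFile(zip_path) as archive:
        members = sorted(
            (m for m in archive.namelist()
             if m.lower().endswith((".htm", ".html"))),
            key=natural_key,
        )
        with open(out_path, "wb") as out:
            for member in members:
                out.write(archive.read(member))
                out.write(b"\n")  # keep a line break between source files
    if delete_zip:
        os.remove(zip_path)  # only the concatenated file is kept
    return members
```

    Note that this is a plain byte-level concatenation, so the output file will contain repeated <html> and <body> elements, one set per source file.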

    (B) Once the data is initially downloaded, the script will then periodically check this site to determine if the files have been updated or new ones have been added, and process these changed files pursuant to Step A, above. If any of the files are no longer hosted on this website, the script will automatically alert a pre-defined email account of this change.
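
    The change detection in Step B could be sketched as below, again in Python and again only as an illustration under assumptions. It assumes each run produces a manifest mapping each .zip file name to a fingerprint (here a SHA-256 hash of the downloaded bytes; a Last-Modified or ETag header could serve instead if the server provides one). The names fingerprint, diff_manifests, build_removal_alert and send_alert are hypothetical, as are the SMTP host and port defaults.

```python
import hashlib
import smtplib
from email.message import EmailMessage


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest used to detect changed downloads."""
    return hashlib.sha256(data).hexdigest()


def diff_manifests(previous, current):
    """Compare two {file name: fingerprint} dicts.

    Returns (added, changed, removed) as sorted lists of file names.
    """
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    changed = sorted(
        name for name in set(previous) & set(current)
        if previous[name] != current[name]
    )
    return added, changed, removed


def build_removal_alert(removed, sender, recipient):
    """Build the email that reports files no longer hosted on the site."""
    message = EmailMessage()
    message["Subject"] = f"Texas Statutes: {len(removed)} file(s) no longer hosted"
    message["From"] = sender
    message["To"] = recipient
    message.set_content(
        "No longer available for download:\n" + "\n".join(removed)
    )
    return message


def send_alert(message, host="localhost", port=25):
    """Send the alert through an SMTP server (host and port are assumptions)."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(message)
```

    Files reported as added or changed would be reprocessed pursuant to Step A, and a non-empty removed list would trigger the alert email. The periodic check itself could be scheduled with cron on the Linux server.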
Project Bids



(14 bids have been placed. iant has chosen to keep all bids for this project hidden.)