------------------------------------------------------------------------------
URL General Syntax  (rfc1738) ...

   <scheme>://<user>:<password>@<host>:<port>/<url-path>

For example
   wget ftp://mylogin:mypassword@unix.server.com/emacs

Some or all of the parts "<user>:<password>@", ":<password>", ":<port>", and
"/<url-path>" may be excluded.  The scheme specific data start with a double
slash "//" to indicate that it complies with the common Internet scheme
syntax.

The user name (and password), if present, are followed by a commercial
at-sign "@".  Within the user and password field, any ":", "@", or "/" must
be encoded.

Note that an empty user name or password is different than no user name or
password; there is no way to specify a password without specifying a user
name.  E.g., <URL:ftp://@host.com/> has an empty user name and no password,
<URL:ftp://host.com/> has no user name, while <URL:ftp://foo:@host.com/> has
a user name of "foo" and an empty password.

------------------------------------------------------------------------------
URL obfuscation...

Watch out for URLs of the form...
   http://www.some_bank.com.au:UserSession=2f4d0&state=Update@208.56.4.83/

Which is actually of the form:
   http://user@host/         (where the real host is 208.56.4.83)

Everything before the "@" is just a user name and password.  That is, the URL
does NOT go where, at first glance, it appears to go.

Not only that, but many browsers (e.g. Internet Explorer) will accept the IP
as a single 32 bit number.  For example 208.56.4.83 will be 3493332051, which
can easily be hidden in what looks like a session key.

For example
   http://216.239.53.99/
can be written as
   http://www.some.host@3639555427/
or
   http://0xd8ef3563/
for Microsoft Internet Explorer.

But you don't even really need "@" format URLs.  You can always have a link
say one thing and do another...
   https://olb.westpac.com.au/validate.asp?fi=WBS

The URL the user was actually directed to was a PHP script, which promptly
opened a new browser window without a menu, address or status bar, showing a
page that looked identical to a "westpac" login page but wasn't.

Of course you can also get a domain that looks roughly like a valid domain.
That is, instead of...
   http://mybank.com/
you get a domain for...
   http://mybank.org.us/
   http://mybank.com.tv/
   http://mybank.com.au/

------------------------------------------------------------------------------
HTML Download

   lynx -source http://www.cit.gu.edu.au/images/Balls/smiley.gif > image.gif
   wget -O image.gif http://www.cit.gu.edu.au/images/Balls/smiley.gif
   curl -o image.gif http://www.cit.gu.edu.au/images/Balls/smiley.gif

or, to save to a file of the same name
   wget http://www.cit.gu.edu.au/images/Balls/smiley.gif
   curl -O http://www.cit.gu.edu.au/images/Balls/smiley.gif

------------------------------------------------------------------------------
RAW (by hand) 'telnet' web requests....

=======8<--------
:::prompt:::> telnet www.cit.gu.edu.au 80       <- the telnet command to server
Trying 132.234.5.1 ...                          <- telnet junk
Connected to kurango.cit.gu.edu.au.
Escape character is '^]'.
HEAD /~davida/ HTTP/1.0                         <- You type this for the doc
                                                <- and a blank line
HTTP/1.0 200 Document follows
Date: Mon, 18 Mar 1996 12:09:03 GMT             <- Current date
Server: NCSA/1.4.2                              <- server being used
Content-type: text/html                         <- document type
Last-modified: Wed, 06 Mar 1996 06:09:14 GMT    <- last modified date of file
Content-length: 848                             <- document size (if known)

Connection closed by foreign host.
=======8<--------

To talk to the web server via HTTPS use 'openssl' instead of telnet, like
this..

=======8<--------
:::prompt:::> openssl s_client -connect www.cit.gu.edu.au:443
CONNECTED(00000003)
...lots of output about server and its certificate...
---
HEAD / HTTP/1.0                                 <- You type this for the doc
                                                <- and a blank line
HTTP/1.1 200 OK
Date: Tue, 27 Oct 2009 05:23:53 GMT
Server: Apache/2.0.52 (CentOS)
Accept-Ranges: bytes
Content-Length: 1532
Connection: close
Content-Type: text/html; charset=iso-8859-1

closed
=======8<--------

To retrieve the document use "GET" instead of "HEAD"

WARNING: Some versions of telnet have two problems.
First, it does not read 'piped' input (especially on Sun Microsystems
computers).
Second, some versions flush their input after connecting, meaning you have to
delay input until after that point.

The "mconnect" command (under Solaris) or "netcat" (often installed as "nc")
or even a perl equivalent (like "tcp_client") can be used instead.

=======8<--------
mconnect -p 80 www.cit.gu.edu.au > output <<-EOF
HEAD /~davida/ HTTP/1.0

EOF
=======8<--------

NOTE: some servers REQUIRE you to specify an "Accept" header line for it to
work, and in HTTP 1.1 you must also specify a "Host" header line.
HTTP 1.1 also allows you to specify a "Range" in bytes to download.

=======8<--------
mconnect -p 80 www.cit.gu.edu.au > output <<-EOF
HEAD / HTTP/1.0
Host: lyrch.cit.gu.edu.au
Range: bytes=6000-

EOF
=======8<--------

Modify the web server used and the "Host:" line appropriately.  AND try it!!!

The only difference when using an HTTP proxy server is that you contact the
proxy instead of the real site, and must give the proxy the full URL you are
wanting.

Here is a proxy retrieve using "mconnect" (a "tcp_client" perl script is used
the same way)

=======8<--------
mconnect -p 8080 211.251.139.129 <<-EOF
GET http://www.gu.edu.au/ HTTP/1.0
User-Agent: http_get/1.0 perl/5.0
Accept: */*

EOF
=======8<--------

You can do this in raw BASH shell scripts too...

=======8<--------
# open network connection (read and write) on file descriptor 4
exec 4<>/dev/tcp/www.griffith.edu.au/80

# send request, ending with a blank line
echo >&4 "HEAD http://www.griffith.edu.au/ HTTP/1.0"
echo >&4 ""

# read response
while IFS= read -r line; do
  echo "$line"
done <&4

# close
exec 4>&-
=======8<--------

------------------------------------------------------------------------------
Downloading Images and other referenced files..

1/ Netscape Save Image
   If you press the rightmost mouse button of your mouse (however many
   buttons that is) over an image, in a moment or two netscape will pop up a
   menu which will allow you to save the image to a file on the local disk.
2/ Figure out URL and download that
   If all else fails you can still download WWW images: look at the source
   of the html referring to that icon to find its URL.

   EG:  http://www.cit.gu.edu.au/images/Balls/Images.html
#=======8<------CUT HERE--------axes/crowbars permitted---------------

Misc Balls

earth
smiley
gold
#=======8<------CUT HERE--------axes/crowbars permitted---------------

   Then to download the smiley ball use the URL...
      http://www.cit.gu.edu.au/images/Balls/smiley.gif

   This will either save the gif image directly to disk (an option under
   mosaic) or display it in an image viewer (default for mosaic and netscape)
   which should allow you to save it to disk.

------------------------------------------------------------------------------
Looking at HTTP protocol requests...

Curl has options to let you see or save the actual request, and the response
headers as well as the data itself...

   curl --verbose --include http://www.cit.gu.edu.au/

It has a lot of options for saving information into files, progress bars,
setting and retrieving cookies, extra headers, lying about what client is
making the request, etc., making it very useful for shell scripts.

There is a perl interface to the curl library, which I use in my own web
spider programs.

------------------------------------------------------------------------------
Extract links from a page!

Quick and dirty using lynx
Not very good -- see my html2urls script instead

  #!/usr/bin/perl -0777
  #
  # Written by Sarang Gupta (sarang@sarangworld.com)
  # Reads a list of page URLs on stdin, prints the href/src links in each.
  for $x ( split(/\n/,<>) ){
    $_ = `lynx -source $x`;
    s/[^<]*(<[^>]*>)/$1/gi;                      # keep only the tags
    s/\n/ /g;
    $x =~ s/^(.*)\/[^\/]*\.htm[a-z]*$/$1/;       # page URL -> its directory
    s/(<[^<>]*>)/$1\n/g;                         # one tag per line
    for(split(/\n/)){
      if( s/^.*(href|src)[^=]*=\s*"([^"]*)".*$/$2/i ) {
        s/^(?!(http|ftp|mailto))/$x\//i;         # make relative links absolute
        print"$_\n";
      }
    }
  }

Alternatively use lynx to dump a reference list

  Directly from the URL
    lynx -dump URL |\
      sed '1,/^References$/d; /^[ 0-9]*\. /!d; s///'

  Indirectly from an already downloaded file
    lynx -dump -force_html -base URL file.html |\
      sed '1,/^References$/d; /^[ 0-9]*\. /!d; s///'

However the perl library modules can do this too...

  Directly
    GET -o links "$1" |\
      sed -n "s/^\(FRAME\|A\|IMG\)[ \t]*//p"

  Indirectly
    GET -o links -b "$1" file.html |\
      sed -n "s/^\(FRAME\|A\|IMG\)[ \t]*//p"

Wget does not provide a way to directly extract URLs.
Though it must do this internally.  You will need to use one of the other
methods above to do this.

------------------------------------------------------------------------------
Html to Text (remove html formatting)

Lynx does this very well
   cat file | lynx -width=80 -force_html -dump -nolist

Remove -width if wrap around is not needed

Other options include...
   -justify          justify the text
   -dont_wrap_pre    do not wrap the <pre> formatted blocks

But it is not perfect!

No other simple solutions are known.

This sed method will remove tags from MOST html
  sed -e :a -e 's/<[^>]*>//g;/</N;//ba'

  ... > mdk-80.iso
  rm mdk-iso.part?
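
The two lines above look like the tail of a recipe for joining a file that
was downloaded in pieces.  A minimal sketch of that final step, assuming the
pieces are named with a single trailing digit (mdk-iso.part1, mdk-iso.part2,
...).  The function name "join_parts" and its arguments are my own invention,
not from any tool.

```shell
# join_parts OUTPUT PREFIX      (hypothetical helper)
# Concatenate PREFIX1, PREFIX2, ... into OUTPUT, and delete the pieces
# only if the concatenation succeeded.
join_parts() {
    out=$1
    prefix=$2
    # The shell expands "$prefix"? in sorted order, so part1 comes before
    # part2 and so on.  This only covers single-character suffixes.
    cat "$prefix"? > "$out" && rm -f "$prefix"?
}
```

Usage would be something like:  join_parts mdk-80.iso mdk-iso.part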

------------------------------------------------------------------------------
WGet Spider downloads...

Download/Mirror a single page and all the parts it needs.
   wget -p --convert-links -nd http://...


Download a whole sub-directory recursively
   wget -nv -nc -r -l inf -nH -np http://...
Just that directory and don't create sub-directories
   wget -nv -nc -r -l2 -np -nd  http://...

Download all the URLs in a file
   wget -nv -i <file>
All files in an index file (fails???)
   ????

All images in sub-directory (not thumbnail directories)
   wget -nv -nc -r -l inf -nH -np -A '.gif,.jpg'  http://...
   find . -depth -type d | xargs rmdir
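
The rmdir clean-up above breaks on directory names containing spaces, and is
noisy about the directories that still hold files.  A sketch of a quieter
variant using only standard find and rmdir; the function name
"prune_empty_dirs" is just a label I made up.

```shell
# prune_empty_dirs DIR      (hypothetical helper)
# Remove every empty directory under DIR, deepest first.
# NOTE: DIR itself is also removed if it ends up empty.
prune_empty_dirs() {
    # -depth visits children before parents, so a directory that only held
    # empty directories is itself empty by the time rmdir reaches it.
    # rmdir refuses to remove non-empty directories; those errors are hidden.
    find "$1" -depth -type d -exec rmdir {} \; 2>/dev/null
    return 0
}
```

For example:  prune_empty_dirs galleries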

Just a directory of images (no sub-directories)
This is equivalent to the ftp globbing  ftp://.../dir/*.gif
   wget -nv -nc -r -l1 -nH -np -A gif http://...
OR a range of suffixes...
   wget -nv -nc -r -l1 -nH -np -A .gif,.jpg http://...

Restrict the download of files to a specific subdirectory
(and remove top path elements)
   wget ...  -I /galleries --cut-dirs=1  ...

Not a specific thumbnail directory  (no pattern match?)
   wget ...  -X /galleries/b/c/thumbnails ...



NOTE that wget does not have a "NOT these URLs" file rejection list,
though it must have such a list internally for recursive download.
We can however fake it (see later)


You can use wget to download files by number
   seq -f %03g 1 100 | xargs -I%   wget http://.../file_%.gif

OR using shell brace expansion (bash, not plain sh)....
   wget -x -nH --cut-dirs=2 \
      http://../s{1,2,3,4}g{3,4,5,6}/{1,2,3,4}.mpg
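
If "seq" is not available, the numbered list can be generated with plain
shell arithmetic and printf.  This is only a sketch: "numbered_urls" is a
made-up name, and the example.com URL in the comment is a stand-in for the
real site.

```shell
# numbered_urls FORMAT FIRST LAST      (hypothetical helper)
# Print one URL per line.  FORMAT is a printf format containing a single
# integer conversion for the file number,
#   e.g.  'http://example.com/file_%03d.gif'
numbered_urls() {
    fmt=$1
    i=$2
    while [ "$i" -le "$3" ]; do
        # FORMAT is deliberately used as the printf format string
        printf "$fmt\n" "$i"
        i=$((i + 1))
    done
}
```

The output can be fed straight to wget, which reads URLs from standard input
when given "-i -":
   numbered_urls 'http://.../file_%03d.gif' 1 100 | wget -nv -i -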

------------------------------------------------------------------------------
For Lynx Spider Downloads

See the updated "CRAWL.announce" document in lynx install documents.

It uses a set of files in the download directory...
"traverse.dat"   List of URLs which have already been downloaded
                 URLs in this list at the start will not be repeated.
"reject.dat"     List of URLs that should be (or were) rejected (not
                 downloaded).
                 A final '*' can be used to ignore a whole sub-tree
"traverse2.dat"  Trace of URLs searched/rejected and their titles
traverse.errors  URLs that failed to be retrieved.
                 These can cause traversal loops, so move them to reject.dat
                 to stop looking at them.

lnk########.dat  Downloaded document as plain text (if -crawl switch is given)
                 With a two line header listing the URL and page TITLE.  All
                 URLs in the doc will be stripped unless a "-number_links"
                 option is added, which will also include a final references
                 list at the end.  Adding a "-nolist" will strip the
                 reference list.
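
To make the "reject.dat" convention concrete, here is a small shell sketch of
how such a list can be checked against a URL.  It only illustrates the rule
as described above (exact match, or prefix match when the entry ends in
'*').  It is NOT lynx's own code, and "is_rejected" is a made-up name.

```shell
# is_rejected URL REJECT_FILE      (hypothetical helper, not part of lynx)
# Succeeds if URL equals a line of REJECT_FILE, or starts with the prefix
# of a line that ends in '*'.
is_rejected() {
    url=$1
    while IFS= read -r entry; do
        case $entry in
            '')  continue ;;
            *\*) prefix=${entry%\*}          # strip the final '*'
                 case $url in "$prefix"*) return 0 ;; esac ;;
            *)   [ "$url" = "$entry" ] && return 0 ;;
        esac
    done < "$2"
    return 1
}
```

For example, with "http://host/private/*" in reject.dat
   is_rejected http://host/private/a/b.html reject.dat && echo skip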

NOTE: lynx will not download the HTML source in this way (see wget), only the
formatted text from a web site, which for some purposes may be better.
Adding a -dump will only download the first 'starting document'.

I also have a script to use lynx crawl to download a web site text and a
second script to rename the downloaded pages back into a directory tree
structure again (as text files).  Mail me if you would like it.

------------------------------------------------------------------------------
Wget "traverse.dat" handling

"traverse.dat" is a file "lynx" uses to record the URLs it has already handled

Using this file, you can 'touch' (create as an empty file) each of the files
you don't want wget to download.  With the -nc option wget will not replace a
file that already exists, so it skips them, much as lynx skips the URLs
listed in the file.

NOTE: this does not work with ".html" files, as wget will traverse them and
probably will re-download them anyway.  But for .txt or image or other data
this prevents a lot of re-downloading of files you already retrieved.

Touch all 'unwanted' files from a "traversal.dat" file.
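  perl -MFile::Path=make_path -ne '
     chomp;
     # split the URL into directory ($1) and file name ($2), host removed
     unless ( m|//[^/]*/(.*)/([^/]*)$| ) { print STDERR "BADURL: $_\n"; next }
     my ($dir, $file) = ($1, $2);
     make_path($dir) unless $dir_done{$dir}++;
     # append mode creates the file but never truncates an existing one
     if ( length $file ) { open my $fh, ">>", "$dir/$file" or warn "$dir/$file: $!\n" }
  ' traverse.dat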
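
The same idea in plain shell, for when perl is not handy.  This is a sketch
only ("touch_placeholders" is a made-up name).  Unlike the perl pattern it
also accepts files at the top level of the site, and it skips URLs that end
in "/" as they have no file name.

```shell
# touch_placeholders < URL-list      (hypothetical helper)
# For each URL on stdin, create an empty file at the URL's path relative to
# the current directory, with scheme and host stripped (the layout that
# "wget -nH" uses).
touch_placeholders() {
    while IFS= read -r url; do
        path=${url#*://}                  # drop "scheme://"
        case $path in
            */*) path=${path#*/} ;;       # drop "host/"
            *)   path= ;;                 # bare host, nothing to create
        esac
        case $path in
            ''|*/) echo "SKIPPED: $url" >&2; continue ;;
        esac
        case $path in
            */*) mkdir -p -- "${path%/*}" ;;
        esac
        touch -- "$path"
    done
}
```

Run it in the download directory:   touch_placeholders < traverse.dat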

Now proceed with wget as above with  -nH (no hostname directory)  option.

Remove empty results (the files we created)...
  find . -type f -size 0 | xargs rm
  find . -depth -type d | xargs rmdir

Fake a "traverse.dat" (cf lynx) file from the newly downloaded files...
   find . -type f > traverse.new
   vi traverse.new
      remove "traverse.new" and "traverse.dat" entries
      Replace "./" prefix with download site URL

Re-add it to the original "traverse.dat" file...
  sort -u -o traverse.dat   traverse.dat traverse.new
  rm traverse.new
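
The hand edit in "vi" can be scripted.  A sketch, assuming the download was
made with -nH so that "./" corresponds directly to the site's base URL; the
name "files_to_urls" and the host in the comment are placeholders of mine.

```shell
# files_to_urls BASE_URL [DIR]      (hypothetical helper)
# List every regular file under DIR (default ".") as BASE_URL/relative/path,
# leaving out the traverse.dat and traverse.new bookkeeping files.
# BASE_URL must not contain "|" or "&" (they are special to the sed below).
files_to_urls() {
    base=${1%/}                 # tolerate a trailing slash on the base URL
    dir=${2:-.}
    ( cd "$dir" && find . -type f ! -name traverse.dat ! -name traverse.new ) |
        sed "s|^\./|$base/|" | sort
}
```

So the "find" and "vi" steps become:
   files_to_urls http://.../ > traverse.new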

------------------------------------------------------------------------------
Access Counters for WWW pages.

See the www laboratory notes on counters
   http://www.cit.gu.edu.au/~anthony/wwwlab/count/notes.html

------------------------------------------------------------------------------