------------------------------------------------------------------------------
URL General Syntax (rfc1738)...

   <scheme>://<user>:<password>@<host>:<port>/<url-path>

For example
   wget ftp://mylogin:mypassword@unix.server.com/emacs

Some or all of the parts "<user>:<password>@", ":<password>", ":<port>", and
"/<url-path>" may be excluded.

The scheme specific data start with a double slash "//" to indicate that it
complies with the common Internet scheme syntax.  The user name (and
password), if present, are followed by a commercial at-sign "@".  Within the
user and password field, any ":", "@", or "/" must be encoded.

Note that an empty user name or password is different than no user name or
password; there is no way to specify a password without specifying a user
name.  ftp://host.com/ has no user name (anonymous), while ftp://@host.com/
has an empty user name and no password, and ftp://foo:@host.com/ has a user
name of "foo" and an empty password.

------------------------------------------------------------------------------
URL encoding

Moved to "info/data/urlencoded.txt"

------------------------------------------------------------------------------
Insecure web sites...

For sites with 'snakeoil' certificates
   wget:  Use  --no-check-certificate
   curl:  Use  -k  or  --insecure

-------------------------------------------------------------------------------
URL obfuscation...

Watch out for URLs of the form...
   http://www.some_bank.com.au:UserSession=2f4d0&state=Update@208.56.4.83/

Which is actually of the form:
   http://user@host/
(where the real host is 208.56.4.83)

IE: the URL does NOT go where, at first glance, it appears to go.

Not only that, but many browsers (EG: IE) will accept the IP address as a
single 32-bit number.  EG: 208.56.4.83 becomes 3493332051, which can easily
be hidden in what looks like a session key.

For example  http://216.239.53.99/  can be written as
   http://www.some.host@3639555427/
or
   http://0xd8ef3563/
for Microsoft Internet Explorer.

But you don't even really need "@" format URLs.  You can always have the
visible link text say one thing while the actual link does another...
   https://olb.westpac.com.au/validate.asp?fi=WBS
Here the link actually directed the user to a PHP script, which promptly
opened a new browser window without a menu, address or status bar,
containing a page that looked identical to the expected login page.

Of course a hacker can also get a domain that looks very similar.
That is, instead of...
   http://mybank.com/
hackers buy and use a domain like...
   http://mybank.org.us/
   http://mybank.com.tv/
   http://mybank.com.au/
   http://my.bank.com/
   http://my-bank.com/

Basically, don't trust that you read a URL correctly; get it from a known
valid source and not from a random web page or email.

-------------------------------------------------------------------------------
Client Method Summary...

Dedicated Web TTY Commands

   wget -O save.txt  http://www.site/file.txt
   curl -o save.txt  http://www.site/file.txt
   lynx -source http://www.cit.gu.edu.au/images/Balls/smiley.gif > smiley.gif
   elinks
   lftp -c "get http://www.site/file.txt -o save.txt"

or to download to a file of the same name
   wget http://www.site/file.txt
   curl -O http://www.site/file.txt
   lftp -c "get http://www.site/file.txt"

Lower level web requests....

EG: telnet-like
=======8<--------
::::> telnet www-server 80        <- the telnet command to server
Trying 132.234.5.1 ...            <- telnet junk
Connected to web_server.example.com
Escape character is '^]'.
HEAD /~davida/ HTTP/1.0           <- You type this for the doc
                                  <- and a blank line

HTTP/1.0 200 Document follows
Date: Mon, 18 Mar 1996 12:09:03 GMT             <- Current date
Server: NCSA/1.4.2                              <- server being used
Content-type: text/html                         <- document type
Last-modified: Wed, 06 Mar 1996 06:09:14 GMT    <- last modified date of file
Content-length: 848                             <- document size (if known)

Connection closed by foreign host.
=======8<--------

To retrieve the actual document use "GET" instead of "HEAD".

WARNING: Telnet traditionally has problems...
  1/ It does not read 'piped' input (newer versions fix this).
  2/ It could flush its input after it has connected, meaning you have to
     delay input until after that point.
  3/ It has a lot of extra junk to be filtered out.
     But then so does the next method...

For Virtual Web servers (multiple DNS aliases pointing to the same IP)
use a full URL in the request line.   (using netcat)
=======8<--------
::::> printf 'HEAD http://www-server/ HTTP/1.0\n\n' |\
        nc www-server 80
HTTP/1.1 301 Moved Permanently
Date: Mon, 30 Mar 2015 01:49:37 GMT
Server: Apache/2.2.15 (Red Hat)
Location: http://www-server.example.com/sub-directory/
Connection: close
Content-Type: text/html; charset=iso-8859-1
=======8<--------

socat is the more advanced method; this does proper EOL handling
=======8<--------
::::> printf 'HEAD http://www.webkb.org/ HTTP/1.0\n\n' |\
        socat TCP:www-server:80,crnl -
HTTP/1.1 200 OK
Date: Thu, 20 Jun 2013 00:04:41 GMT
Server: Apache/2.2.15 (Red Hat)
Last-Modified: Mon, 14 May 2012 10:06:35 GMT
ETag: "a0ac0-265c-4bffc3db860c0"
Accept-Ranges: bytes
Content-Length: 9820
Connection: close
Content-Type: text/html; charset=UTF-8
=======8<--------

Note socat not only links the TTY to the webserver, but will also convert the
EOL newline character to/from the return/newline pair for correct network
handling.  Newer web servers will report "could not understand" errors if an
incorrect end-of-line is sent.

NOTE: some servers REQUIRE you to specify an "Accept" header line for it to
work.  Also for HTTP 1.1 you must also specify a "Host" header line.
HTTP 1.1 also allows you to specify a "Range" in bytes to download.
=======8<--------
::::> socat tcp:www-server:80,crnl -
HEAD / HTTP/1.1
Host: web_server.example.com
User-Agent: bash/4.2.45 socat/1.7.2.2
Accept: */*
Range: bytes=200-

HTTP/1.1 301 Moved Permanently
Date: Mon, 30 Mar 2015 01:51:20 GMT
Server: Apache/2.2.15 (Red Hat)
Location: http://www.example.com/sub-directory/
Connection: close
Content-Type: text/html; charset=iso-8859-1
=======8<--------
Modify the web server used and the Host: line appropriately.  AND try it!!!

An HTTP proxy server is exactly the same, except that you contact the proxy
instead of the real site, and must give the proxy the full URL you want as
part of the request.

Here is a proxy retrieve using a "tcp_client" perl script
=======8<--------
::::> tcp_client 211.251.139.129:8080
GET http://anonymouse.org/cgi-bin/anon-snoop.cgi HTTP/1.0
Accept: */*

=======8<--------

---
Using encrypted HTTPS

openssl is an obvious method
=======8<--------
::::> openssl s_client -connect www.example.com:443
CONNECTED(00000003)
...lots of output about server and its certificate...
HEAD / HTTP/1.0

HTTP/1.1 200 OK
Date: Tue, 27 Oct 2009 05:23:53 GMT
Server: Apache/2.0.52 (CentOS)
Accept-Ranges: bytes
Content-Length: 1532
Connection: close
Content-Type: text/html; charset=iso-8859-1

closed
=======8<--------

socat can also do this without the verbosity
=======8<--------
printf 'HEAD / HTTP/1.0\n\n' |\
  socat openssl:www-server.example.com:443,verify=0,crnl -
=======8<--------

stunnel is another method, but it sets up a 'port forward', so it is not
really recommended.

Using an HTTP proxy with ncat (the proxy here is localhost:3128)
=======8<--------
ncat --proxy localhost:3128 --proxy-type http \
     --proxy-auth "user:pass"  www-server.example.com 80
=======8<--------

---
Simple API Programmed methods....

Bash can do network connections
=======8<--------
# open network connection
exec 4<>/dev/tcp/www-server/80

# send request
echo >&4 "HEAD http://www-server.example.com/ HTTP/1.0"
echo >&4 ""

# read response
while read line; do
  echo "$line"
done <&4

# close the connection
# WARNING: BASH has no concept of 'shutdown()' (though a sub-program can)
# This will always close() both input and output, making the FD invalid.
exec 4>&-
=======8<--------

Gawk can do it too
=======8<--------
::::> gawk -f - <<'EOF'
BEGIN {
  RS = ORS = "\r\n"
  WebServer = "/inet/tcp/0/www-server.example.com/80"
  print "HEAD http://www-server.example.com:80/ HTTP/1.0" |& WebServer
  print |& WebServer
  while ((WebServer |& getline) > 0)
    print $0
  close(WebServer)
}
EOF
=======8<--------

Perl can, of course!
=======8<--------
perl -e '
  use IO::Socket;
  $s = new IO::Socket::INET( PeerAddr => "www-server.example.com:80" );
  print $s "HEAD http://www-server.example.com:80/ HTTP/1.0\r\n\r\n";
  print while ( <$s> );
  close $s;
'
=======8<--------

-------------------------------------------------------------------------------
Curl Hints

For a great set of curl usage notes see
   Curl Commandline Tutorial
      https://curl.haxx.se/docs/httpscripting.html
   Curl Cheat Sheet
      https://opensource.com/article/20/5/curl-cheat-sheet

Add a -L or --location option to have it automatically follow redirects,
and "-o {name}" to save to a given name.

Use "-O" to save into a filename taken from the URL (also called
"--remote-name").
Use "--remote-name-all" to use "-O" on ALL the URLs given.

   -s     Silent
   -k     ignore SSL certificate failure

---
You can curl multiple files from one command (and over one TCP/SSL
connection if keep-alive is enabled)

   curl -sk -o file1 URL1 -o file2 URL2 -o file3 URL3

---
Curl can download a globbed set of URLs!!!!

   curl -O -f 'http://example.com/Miloa[01-99].jpg'

When using '[...]' you can also use the number in the output filenames...

   curl "https://example.com/image_[1-4].jpg" --output "example_images_#1.jpg"

or multiple globs...

   curl "https://example.com/images_00[0-9]/file_[1-4].webp" \
        --output "file_#1-#2.webp"

You can get curl to continue an aborted download (-C or --continue-at)

   curl -C - --remote-name "https://example.com/linux-distro.iso"

---
Send Data with request to CGI Script

   curl --silent --show-error --connect-timeout ${TIMEOUT} --retry ${RETRY} \
        --proxy "$https_proxy" \
        --data "action=$action" \
        --data "user=$username" --data "password=$password" \
        --data-urlencode "text=${Msg}" \
        "$URL"

---
You can test a specific apache virtual server before setting up the DNS for
it...

   h=virtual_server.domain    # the virtual server URL to test
   f=this_server.domain       # the actual server to connect to
   curl -k --connect-to ${h}:443:${f}:443 https://${h}

The above connects to 'this_server' and sets 'Host: virtual_server.domain'
in the request header.
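
If your curl does not have --connect-to, a rough equivalent (untested sketch)
is to contact the real server directly and override the Host header yourself;
the TLS certificate/SNI will then be for the wrong name, hence the -k...

   curl -k -H "Host: ${h}" "https://${f}/"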

---
Curl has options to let you see or save the actual request and the response
headers, as well as the data itself...

   curl --verbose --include http://www.example.com/

To save just the headers use
   --dump-header header_save.txt

---
To dump the raw communications to a file use
   curl --trace-ascii dump.txt http://www.example.com/

It has a lot of options for saving information into files, progress bars,
cookies set and retrieved, extra headers, lying about what client is making
the request, etc., making it very useful for shell scripts.

---
There is a perl interface to the curl library, which I have used in my own
web spider programs.

to save cookies received use    --cookie-jar cookie_jar
to add cookies use              --cookie "key=value"
                          OR    --cookie cookie_jar
You can use Mozilla cookie jar files.

---
Download all PNG images from a web page...
=======8<--------
curl https://example.com |\
  grep --only-matching 'src="[^"]*\.png"' |\
  cut -d\" -f2 |\
  while read i; do \
    curl https://example.com/"${i}" -o "${i##*/}"; \
  done
=======8<--------

-------------------------------------------------------------------------------
Parallel download of pieces (for large files)

If you have lots of bandwidth but are limited by network bottlenecks, you can
download one large file that is available from multiple sources...

   aria2     - CLI multi-connection downloader for HTTP and BitTorrent
                  yum install aria2
   pcurl     - shell script to download in parallel sections using curl
                  http://sourceforge.net/projects/pcurl/
   httrack   (site mirror)

Curl...

   url1=http://ftp.eecs.umich.edu/pub/linux/mandrake/iso/Mandrake80-inst.iso
   url2=http://ftp.rpmfind.net/linux/Mandrake/iso/Mandrake80-inst.iso
   url3=http://ftp.wayne.edu/linux/mandrake/iso/Mandrake80-inst.iso

The length of the file is 677281792 bytes, so initiate three simultaneous
downloads using curl's "--range" option.

   curl -r 0-199999999          -o mdk-iso.part1  $url1 &
   curl -r 200000000-399999999  -o mdk-iso.part2  $url2 &
   curl -r 400000000-           -o mdk-iso.part3  $url3 &
   wait     # for all the curl requests to finish

The "-r" option specifies a sub-range of bytes to extract from the target
file.  When completed, simply cat all three parts together...

   cat mdk-iso.part? > mdk-80.iso
   rm mdk-iso.part?

------------------------------------------------------------------------------
For Lynx Spider Downloads

See the updated "CRAWL.announce" document in the lynx install documents.

   lynx -traversal -crawl -realm -nopause  http://...

       -traversal -crawl     # traverse the tree, output each page to a file
       -realm                # limit traversal to the starting realm
       -nopause              # don't do 'statusline waits'
       -accept_all_cookies   # if needed
       -number_links         # turn on adding ref numbers to links in files
       -dont_wrap_pre        # inhibit wrapping of text in <pre> sections
       --width=120           # display size for text wrapping

   NOTE: -dump outputs formatted text to stdout, and -source does the same
   thing!  That is, neither will preserve the raw HTML during a traverse!


It uses a set of files in the download directory...
"traverse.dat"   List of URLs which have already been downloaded.
                 URLs in this list (on startup) will not be repeated.
                 It will append to this list as it works.
"reject.dat"     List of URLs that are to be ignored (not tried).
                 A final '*' can be used to ignore a whole sub-tree.
"traverse2.dat"  Trace of URLs searched/rejected and their titles.
traverse.errors  URLs that failed to be retrieved.
                 These can cause traversal loops, so move them to reject.dat
                 to stop looking at them.

lnk########.dat  Downloaded document as plain text (if -crawl switch is given)
                 With a two line header listing the URL and page TITLE.  All
                 URLs in the doc will be stripped unless a "-number_links"
                 option is added, which will also include a final references
                 list at the end.  Adding a "-nolist" will strip the
                 reference list.

NOTE: lynx will not download HTML source in this way (see wget), only the
formatted text from a web site.
Adding a -dump will only download the first 'starting document' in HTML.

I also have a script to use lynx crawl to download a web site's text, and a
second script to rename the downloaded pages back into a directory tree
structure again (as plain text files).
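
A rough sketch of that rename idea (untested; it assumes the URL is the last
field on the first header line of each "lnk*.dat" file, so adjust to suit
your lynx version)...

   for f in lnk*.dat; do
     url=$(head -1 "$f" | awk '{print $NF}')   # URL from the two line header
     path=${url#*://*/}                        # strip scheme and hostname
     [ -z "$path" ] && path=index              # the site's top page
     mkdir -p "$(dirname "$path")"             # rebuild the directory tree
     mv "$f" "${path%/}.txt"                   # save the page as plain text
   done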

------------------------------------------------------------------------------
WGet Spider downloads...

Download website

   wget -nd -mk http://example.com

     -nd  flattens directory structure
     -m   mirror site
     -k   converts links to local filesystem links
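
When mirroring someone else's site it is polite to throttle the requests,
for example (a sketch using wget's standard rate options)...

   wget -m -k -np --wait=2 --random-wait --limit-rate=200k  http://example.com/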


Download/Mirror a single page and all parts needed.
   wget -p --convert-links -nd http://...

Download a whole sub-directory recursively
   wget -nv -nc -r -l inf -nH -np http://...
Just that directory and don't create sub-directories
   wget -nv -nc -r -l2 -np -nd  http://...

Download all the URLs listed in a file
   wget -nv -i url_list
All files in an index file (fails???)
   ????
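
One possible workaround for the 'index file' case (an untested sketch; it
assumes a plain directory-index page, and uses lynx to extract the links)...

   lynx -dump -listonly -nonumbers http://.../dir/ |
      grep '/dir/' | wget -nv -i -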

All images in sub-directory (not thumbnail directories)
   wget -nv -nc -r -l inf -nH -np -A '.gif,.jpg,.png'  http://...
   find . -depth -type d | xargs rmdir
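
The rmdir will complain about directories that still hold images; with GNU
rmdir you can silence that...
   find . -depth -type d | xargs rmdir --ignore-fail-on-non-empty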

Just a directory of images (no sub-directories)
This is equivalent to the ftp globbing  ftp://.../dir/*.gif
   wget -nv -nc -r -l1 -nH -np -A '.gif' http://...
OR a range of suffixes...
   wget -nv -nc -r -l1 -nH -np -A '.gif,.jpg,.png' http://...

Restrict downloading of files to a specific subdirectory
(and remove top path elements)
   wget ...  -I /galleries --cut-dirs=1  ...

Exclude a specific thumbnail directory  (no pattern match?)
   wget ...  -X /galleries/b/c/thumbnails ...
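
Putting those together, for example, grab just one gallery tree minus its
thumbnails (a sketch only; the paths are placeholders)...

   wget -nv -nc -r -l inf -np -nH --cut-dirs=1 \
        -I /galleries  -X /galleries/b/c/thumbnails \
        -A '.gif,.jpg,.png'   http://.../galleries/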


NOTE that wget does not have a "NOT these URLs" file rejection list,
though it must have such a list internally, as part of recursive download.
We can however fake it (see later)


You can use wget to download files by number
   seq -f %03g 1 100 | xargs -i%   wget http://.../file_%.gif

OR using shell expansion....
   wget -x -nH --cut-dirs=2 \
      http://.../s{1,2,3,4}g{3,4,5,6}/{1,2,3,4}.mpg
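
The same numbered downloads can also be done with curl's URL globbing (see
the Curl Hints above), which allows zero-padded ranges...

   curl -O -f "http://.../file_[001-100].gif"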

------------------------------------------------------------------------------
Wget "traverse.dat" handling (as per 'lynx' above)

"traverse.dat" is the file "lynx" uses to record the URLs it has already
downloaded.

Using this file, you can 'touch' (create an empty file for) each of the
files we don't want "wget" to download again, in the same way "lynx" does.

NOTE: This does not work with ".html" files, as "wget" will traverse them and
probably will re-download them anyway.  But for ".txt", image files, or other
data this prevents a lot of re-downloading of files you already retrieved.

Touch all the 'unwanted' (already downloaded) files from a "traverse.dat"
file.
  perl -ne "s/'/'\\\\''/g;"'
     m|//[^/]*/(.*)/([^/]*)\n|
        or do { print STDERR "BADURL: $_"; next };
     print "echo '\''$1'\''; mkdir -p '\''$1'\''\n"  unless $dir_done{$1}++;
     print "touch '\''$1/$2'\''\n"  if $2; ' traverse.dat | sh

Now proceed with "wget" as above with  -nH (no hostname directory)  option.


Remove empty results (the files we created)...
  find . -type f -size 0 | xargs rm
  find . -depth -type d | xargs rmdir

Create a "traverse.dat" (cf lynx) file from the newly downloaded files..
   find . -type f > traverse.new
   vi traverse.new
      remove "traverse.new" and "traverse.dat" entries
      Replace "./" prefix with download site URL
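
Or script that edit (a sketch; "http://www.example.com/" stands in for the
real download site)...
   sed -i -e '/traverse\.new/d' -e '/traverse\.dat/d' \
          -e 's|^\./|http://www.example.com/|'  traverse.new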

Merge it back into the original "traverse.dat" file...
  sort -u -o traverse.dat   traverse.dat traverse.new
  rm traverse.new

------------------------------------------------------------------------------
Wget/Curl Multiple files in parallel....

Using Xargs...

  # one wget per URL, 4 at a time; URLs are read (one per line) from
  # "url_list", and any URL that fails is appended to "url_failed"
  xargs -P 4 -I{} sh -c \
      "wget {} >/dev/null 2>&1 || echo {} >> url_failed"  < url_list

Note that wget can use 'keep-alive' connections, so this variation gives
each wget invocation a batch of 20 URLs to work with.

  # GNU parallel: 4 jobs at a time, 20 URLs appended to each wget command
  parallel -j 4 -n 20 wget -nv  < url_list
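
Newer curl (7.66 or later) can also parallelise the downloads itself; a
sketch, again assuming a "url_list" file of URLs with no whitespace in
them...

  curl --parallel --parallel-max 4 --remote-name-all $(cat url_list)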