HTML file Hints and Tips

------------------------------------------------------------------------------
Site News URL

Referring to a newsgroup at a particular site

Example:  snews://secnews.netscape.com/netscape.communicator

------------------------------------------------------------------------------
Special Mime Types

  text/html                   Normal HTML document
  text/plain                  Plain old text, no formatting (even if HTML)
  application/x-msdownload    Force browser to save file to disk

You can force a particular file to be a specific type by adding a line
like the following to a ".htaccess" file...

    AddType text/plain .hints .pl .sh

This overrides the normal "mime-type" suffix lookup. However, client
browsers may ignore the suggested mime-type and look up their own type
based on the suffix of the URL link. :-(

------------------------------------------------------------------------------
Comments

  <!-- This is a comment -->

However double hyphens are not allowed inside comments. As such, for
separators in HTML I suggest using equal signs...

  <!-- ====================================================== -->

This double hyphen rule however can cause problems when commenting shell
code.  For example this fails (openbox xml or HTML)...

  <!-- some_command --option -->

The solution is to use &#45; to replace at least one of the hyphens

  <!-- some_command -&#45;option -->

This will work regardless of whether the code is commented or uncommented,
but it makes copy and paste from the HTML source more difficult.

------------------------------------------------------------------------------
Breaking up Long HTML lines

You can insert newlines which do not appear as white space in the final
html document by placing them INSIDE html tags.

EG:
    <A HREF="..."
      ><IMG SRC="..."
      ></A>

will add newlines to the html source but no whitespace will appear around
the image.

It used to be common practice to CAPITALISE the html tags and options so
they stand out from the arguments.  But that does not appear to be the case
any more.

Adding a newline inside a comment will also work...

    This is one <!--
    -->very long line.

Question.. can this be done automatically?

------------------------------------------------------------------------------
Print contents between tags over multiple lines

EG:   ....<tag>....
      ....
      ....</tag>....

("tag" stands in for whatever element you want.  This needs an awk that
accepts a multi-character RS, such as gawk.)

  awk 'BEGIN{ RS="</tag>"}{gsub(/.*<tag>/,"")}1' file

------------------------------------------------------------------------------
Extract links from a page!

See my "html2urls" script

  #!/usr/bin/perl -0777
  #
  # Written by Sarang Gupta (sarang@sarangworld.com)
  for $x ( split(/\n/,<>) ){
    $_ = `lynx -source $x`;
    s/[^<]*(<[^>]*>)/$1/gi;
    s/\n/ /g;
    $x =~ s/^(.*)\/[^\/]*.htm[a-z]*$/$1/;
    s/(<[^<>]*>)/$1\n/g;
    for(split(/\n/)){
      if( s/^.*(href|src)[^=]*=\s*"([^"]*)".*$/$2/i ) {
        s/^(?!(http|ftp|mailto))/$x\/$2/i;
        print"$_\n";
      }
    }
  }

Alternatively, use lynx to dump a reference list

Get the list of URLs
  lynx -listonly -image_links -dump URL

Directly from the URL
  lynx -dump URL |\
    sed '1,/^References$/d; /^[ 0-9]*\. /!d; s///'

Indirectly from an already downloaded file
  lynx -dump -force_html -base URL file.html |\
    sed '1,/^References$/d; /^[ 0-9]*\. /!d; s///'

However the perl library modules can do this too...

Directly
  GET -o links "$1" |\
    sed -n "s/^\(FRAME\|A\|IMG\)[ \t]*//p"

Indirectly
  GET -o links -b "$1" file.html |\
    sed -n "s/^\(FRAME\|A\|IMG\)[ \t]*//p"

Wget does not provide a way to directly extract URLs, though it must do
this internally.  You will need to use one of the other methods above.

------------------------------------------------------------------------------
HTML to Plain Text  (remove html formatting)

See my "html2txt" script

---
Lynx

  lynx -dump -width=80 -force_html -nolist file://localhost`pwd`/file.html

Or as a pipeline

  lynx -width=80 -nomargins -force_html -dump -nolist -stdin

Remove -width to turn off word wrap

Other options include...
   -justify          justify the text
   -dont_wrap_pre    do not wrap the <PRE> formatted blocks

---
ELinks is similar

  elinks -dump -dump-width 80 -force-html -no-references -no-numbering \
         file://localhost`pwd`/file.html

---
Sed

This will remove tags from MOST html
  sed -e :a -e 's/<[^>]*>//g; /</N; //ba'

This variant replaces each run of tags with a single space
  sed -e :a -e 's/<[^>]*>\( *<[^>]*>\)*/ /g; /</N; //ba'

---
Perl - Simplistic tag removal
Assumes the whole HTML text is in one string

sub text_clean {
  local $_ = shift;

  s/<LI(>| .*?>)/+ /igs;           # Handle Specific HTML Tags (LI assumed)
  s/<\/BR(>| .*?>)/\n/igs;
  s/<\/?P(>| .*?>)/\n\n/igs;
  s///igs;
  s///igs;

  s/<.*?>/ /gs;          # Remove all other HTML Tags

  s/[^\S\n]+/ /g;        # White Space (not newline) compression
  s/^ //gm;
  s/ $//gm;

  s/\&lt;/</g;           # common HTML characters
  s/\&gt;/>/g;
  s/\&nbsp;/ /g;         # no-break space (must be after space compression)
  s/\&amp;/\&/g;         # ampersand last, so "&amp;lt;" ends up as "&lt;"

  s/\n\n+/\n\n/g;        # blank line compression
  s/^\n+//g;
  s/\n+$/\n/g;

  return $_;
}
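
The same idea can be sketched with a real parser instead of regular
expressions.  This is only a rough equivalent in Python, not a port of the
sub above: the names TextExtractor and html_to_text are made up for the
example, and the choice to drop SCRIPT and STYLE contents is my assumption.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of a page, turning <br> and <p> into line breaks."""

    def __init__(self):
        # convert_charrefs=True decodes &amp; &lt; &nbsp; etc. for us
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0          # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")
        elif tag == "p":
            self.parts.append("\n\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "p":
            self.parts.append("\n\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the plain text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[^\S\n]+", " ", text)            # white space compression
    text = re.sub(r"^ | $", "", text, flags=re.M)    # trim each line
    text = re.sub(r"\n\n+", "\n\n", text)            # blank line compression
    return text.strip("\n") + "\n"


if __name__ == "__main__":
    print(html_to_text("<p>Fish &amp; <b>chips</b></p><p>one<br>two</p>"), end="")
```

Because the parser decodes character references itself, the entity
substitutions at the end of the Perl version are not needed here.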

------------------------------------------------------------------------------
See info/perl/www.hints  for

  URL encoding and decoding

  CGI decoding

For hints on matching specific text for data extraction see
  http://lwp.interglacial.com/ch06_02.htm

------------------------------------------------------------------------------
Tables with no spacing

   <TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0>

------------------------------------------------------------------------------
Centering a whole page!

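If you don't mind using frames, here is a cool way to center a page
no matter what resolution the surfer is using:-

  <FRAMESET COLS="*,600,*">
    <FRAME SRC="spacer.htm">
    <FRAME SRC="page.htm">
    <FRAME SRC="spacer.htm">
    <NOFRAMES>
      your alternative (Suggestion, SSI Inclusion page.htm :-)
    </NOFRAMES>
  </FRAMESET>

(600 here just stands in for the pixel width of your own page.)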

This would center your page horizontally - obviously the file spacer.htm
would have just a background or bgcolor.  You could nest a frameset to
also center the page vertically, or simply make the file page.htm
another frameset to do the same thing.

matthew  

------------------------------------------------------------------------------
META elements..

Page Description (to properly describe your page for Search Engines)

   
   
   
   
   

Currently only AltaVista and Infoseek use this information but people
who supply the correct keywords are placed at the top of search hits.

Other than the mail links (last two), this is not a good idea for general
pages, just major starting pages. The LINK tag, for example, is understood
by the lynx text only web browser, for automatic reporting of bad links.


Auto Loading the next page (or sound!)
   <META HTTP-EQUIV="Refresh" CONTENT="20; URL=...">
      I presume this tells Netscape to download the audio file 20 seconds
      after the page has been loaded.

      WARNING: If the page is still downloading graphical elements after
      the twenty seconds have expired, the refresh will be triggered and
      the page layout will be aborted. This has been my experience on the
      Mac side....              --- Pacific Ocean Digital

Specify base for all URLs and Images.
   <BASE HREF="...">
      This gives the original source of this document. Images and other
      relative links will be resolved from this directory and NOT from
      where this file came from!

------------------------------------------------------------------------------
Inline Data URL's

Allow small images (or other data) to be added directly in the downloaded HTML
page. See  http://en.wikipedia.org/wiki/Data_URI_scheme

This is especially efficient for small images (less than 10 Kbytes)

Example
  <img src="data:image/png;base64,..." alt="Red dot">

or
  <img src="data:..." alt="Nose Guy">

Warning: this does not work in IE7 (it does in IE8),
and Firefox 3.5.7 cannot take newlines in the data!
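
Generating the data URL by script avoids the newline problem, as the
encoded string comes out as one line.  A minimal sketch in Python follows.
The helper names data_uri and img_tag, and the tiny SVG used as the
payload, are made up for the example.

```python
import base64
import html


def data_uri(payload: bytes, mime_type: str = "application/octet-stream") -> str:
    """Return a base64 data URL for the given bytes (no newlines in it)."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def img_tag(payload: bytes, mime_type: str, alt: str) -> str:
    """Return an <img> element with the image data inlined."""
    src = data_uri(payload, mime_type)
    return f'<img src="{src}" alt="{html.escape(alt)}">'


if __name__ == "__main__":
    # Hypothetical payload; normally you would read the bytes from a file.
    svg = (b'<svg xmlns="http://www.w3.org/2000/svg" width="10" height="10">'
           b'<circle cx="5" cy="5" r="5" fill="red"/></svg>')
    print(img_tag(svg, "image/svg+xml", "Red dot"))
```

Base64 makes the data about a third larger than the original file, which
is part of why this only pays off for small images.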

CSS
  ul.checklist  li.complete { margin-left: 20px; background:
    url('data:image/png;base64,
      iVBORw0KGgoAAAANSUhEUgAAABAAAAAQAQMAAAAlPW0iAAAABlBMVEUAAAD/
      //+l2Z/dAAAAM0lEQVR4nGP4/5/h/1+G/58ZDrAz3D/McH8yw83NDDeNGe4U
      g9C9zwz3gVLMDA/A6P9/AFGGFyjOXZtQAAAAAElFTkSuQmCC')
    top left no-repeat; }

JavaScript

  window.open('data:text/html;charset=utf-8,%3C%21DOCTYPE%20'+
    'html%3E%0D%0A%3Chtml%20lang%3D%22en%22%3E%0D%0A%3Chead%'+
    '3E%3Ctitle%3EEmbedded%20Window%3C%2Ftitle%3E%3C%2Fhead%'+
    '3E%0D%0A%3Cbody%3E%3Ch1%3E42%3C%2Fh1%3E%3C%2Fbody%3E%0A'+
    '%3C%2Fhtml%3E%0A%0D%0A','_blank','height=300,width=400');
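
The percent-encoded string in that call does not have to be written by
hand.  As a sketch (html_data_uri is a made-up helper name), Python's
urllib.parse.quote produces the same style of encoding:

```python
from urllib.parse import quote, unquote


def html_data_uri(page: str) -> str:
    """Return a percent-encoded text/html data URL for the given page."""
    # safe="" so that "/" and friends are encoded too
    return "data:text/html;charset=utf-8," + quote(page, safe="")


if __name__ == "__main__":
    page = ('<!DOCTYPE html>\r\n<html lang="en">\r\n'
            '<head><title>Embedded Window</title></head>\r\n'
            '<body><h1>42</h1></body>\n</html>\n')
    uri = html_data_uri(page)
    print(uri)
    # Decoding the part after the first comma gives the page back
    assert unquote(uri.split(",", 1)[1]) == page
```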

See also use of 'inline' images in ImageMagick
  http://www.imagemagick.org/Usage/files/#inline

------------------------------------------------------------------------------
Server Side Include Files (file type)

Files which are Server Side Includes are just included text, and may or
may not contain HTML code.

Note however that many HTML checkers such as weblint will object to
an included file, as it does not contain required tags such as <HTML> and
<BODY>. Or the including SSI file will be missing certain tags.

I suggest you use a different suffix which makes it plain that this is a
server side include.  I use the ".phtml" suffix, but another possibility is
".ssi"  (server side include).

------------------------------------------------------------------------------
Execute Bit Hack (for server side includes)

The execute bit method of installing Server Side Includes is a HACK and
should NOT be used!!!  It is deprecated, hacky and error prone.

Instead you should use...

            index.shtml

The .shtml suffix IS THE CONVENTION for both NCSA and Apache Web servers.

Both servers also have a 'DirectoryIndex' configuration option so that
directory index files can be  ".shtml"  or  ".cgi"  files instead of the
normal  ".html"  format.

For example...

    =======8<--------
    # DirectoryIndex: Name of the file to use as a pre-written HTML
    # directory index.  These files are used if a directory is referenced.

    DirectoryIndex index.html index.shtml index.cgi
    =======8<--------
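
As I understand it the names are tried in the order given, and the first
file that exists is used.  A rough sketch of that lookup in Python follows.
It is only an illustration, not how any server actually implements it, and
resolve_directory_index is a made-up name.

```python
import os


def resolve_directory_index(directory,
                            index_names=("index.html", "index.shtml", "index.cgi")):
    """Return the path of the first index file found in directory, or None."""
    for name in index_names:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Shows which index file (if any) would be picked for the current directory
    print(resolve_directory_index("."))
```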

------------------------------------------------------------------------------