HTML file Hints and Tips
------------------------------------------------------------------------------
Site News URL
Referring to a newsgroup at a particular site
Example: snews://secnews.netscape.com/netscape.communicator
------------------------------------------------------------------------------
Special Mime Types
  text/html                  Normal HTML document
  text/plain                 Plain old text, no formatting (even if HTML)
  application/x-msdownload   Force browser to save file to disk
You can force a particular file to be a specific type by adding a line
like the following to a ".htaccess" file...
  AddType text/plain .hints .pl .sh
This overrides the normal "mime-type" suffix lookup. However client
browsers may ignore the suggested mime-type and look up their own type
based on the suffix of the URL. :-(
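The suffix lookup described above can be sketched in Python (a language
this file does not otherwise use) with the standard "mimetypes" module.
The private table only stands in for the server configuration, it does not
read any ".htaccess" file, and the name mime_type_for is made up for this note.

```python
import mimetypes

# A private lookup table, so the process-wide defaults are left alone.
table = mimetypes.MimeTypes()

# Same idea as "AddType text/plain .hints .pl .sh": map suffixes to a type.
for suffix in (".hints", ".pl", ".sh"):
    table.add_type("text/plain", suffix)

def mime_type_for(filename):
    """Return the mime-type chosen purely from the filename suffix."""
    guessed, _encoding = table.guess_type(filename)
    return guessed
```

A browser that ignores the server's header is in effect doing its own
version of this lookup, with its own table.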
------------------------------------------------------------------------------
Comments
  <!-- HTML comments look like this -->
However double hyphens are not allowed inside comments.
As such, for separators in HTML I suggest using equal signs...
  <!-- ======================================== -->
This double hyphen rule can however cause problems when commenting out
shell code. For example a commented out command that uses a long "--option"
fails (openbox XML or HTML).
The solution is to use &#45; to replace at least one of the hyphens,
which will work regardless of whether the code is commented or uncommented,
but makes copy and paste from the HTML source more difficult.
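The hyphen replacement can be done mechanically. A minimal Python sketch,
assuming the text is plain shell code that will end up inside HTML or XML
(the function names are made up for this note). Every hyphen that is
followed by another hyphen becomes &#45;, so no "--" pair survives.

```python
import re

def comment_safe(text):
    """Replace every hyphen that is followed by another hyphen with &#45;."""
    return re.sub(r"-(?=-)", "&#45;", text)

def as_comment(text):
    """Wrap text in an HTML/XML comment that has no inner double hyphen."""
    return "<!-- " + comment_safe(text) + " -->"
```

For example comment_safe("ls --color") gives "ls &#45;-color".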
------------------------------------------------------------------------------
Breaking up Long HTML lines
You can insert newlines which do not appear as white space in the
final HTML document by placing them INSIDE HTML tags.
EG:
  <img
    src="..." alt="..."
  >
will add newlines to the HTML source, but no whitespace will appear
around the image. It used to be common practice to CAPITALISE the HTML tags
and options so they stand out from the arguments, but that does not appear
to be the case any more.
Adding a newline inside a comment will also work...
  This is one <!--
  -->very long line.
Question.. can this be done automatically?
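One possible answer, as a rough Python sketch rather than a finished tool.
It only turns a space into a newline when that space is inside a tag (or
comment) and outside any quoted attribute value, so the rendered page should
not gain white space. It assumes no "<" or ">" appears inside attribute
values. The names break_in_tags and width are made up for this note.

```python
import re

TAG = re.compile(r"<[^<>]+>")
# Quoted strings stay whole, white space runs are the only break candidates.
TOKEN = re.compile(r'"[^"]*"|\'[^\']*\'|\s+|[^\s"\']+|["\']')

def _column_after(column, text):
    """Return the output column after appending text at the given column."""
    newline = text.rfind("\n")
    if newline < 0:
        return column + len(text)
    return len(text) - newline - 1

def break_in_tags(html, width=72):
    """Break long lines by putting newlines inside tags only."""
    out = []
    column = 0
    pos = 0
    for match in TAG.finditer(html):
        between = html[pos:match.start()]   # text outside tags: untouched
        out.append(between)
        column = _column_after(column, between)
        tokens = TOKEN.findall(match.group(0))
        for i, token in enumerate(tokens):
            if token.isspace():
                following = tokens[i + 1] if i + 1 < len(tokens) else ""
                if column + 1 + len(following) > width:
                    token = "\n"            # break here instead of a space
            out.append(token)
            column = _column_after(column, token)
        pos = match.end()
    out.append(html[pos:])
    return "".join(out)
```

Text that contains no tags at all is returned unchanged, however long
the line is.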
------------------------------------------------------------------------------
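Print contents between tags over multiple lines
EG:   <tag> .... </tag> .... <tag> .... </tag>
  awk 'BEGIN{ RS="</tag>" } { gsub(/.*<tag>/,"") } 1' file
Here "tag" stands for the element name. A multi-character RS needs
gawk or mawk.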
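The same idea as a small Python sketch, using a non-greedy match that is
allowed to cross newlines. The element name "tag" is only a placeholder and
the function name is made up for this note. Tags with attributes or nested
tags of the same name are not handled.

```python
import re

def between_tags(text, tag="tag"):
    """Return the text found between each <tag> ... </tag> pair."""
    name = re.escape(tag)
    pattern = re.compile(r"<%s>(.*?)</%s>" % (name, name), re.DOTALL)
    return pattern.findall(text)
```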
------------------------------------------------------------------------------
Extract links from a page!
See my "html2urls" script
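#!/usr/bin/perl -0777
#
# Written by Sarang Gupta (sarang@sarangworld.com)
#
# Reads a list of page URLs (one per line) and prints every href/src
# link found on those pages.
for $x ( split(/\n/,<>) ){
  $_ = `lynx -source $x`;                 # fetch the raw HTML of the page
  s/[^<]*(<[^>]*>)/$1/gi;                 # throw away the text between tags
  s/\n/ /g;                               # join everything into one line
  $x =~ s/^(.*)\/[^\/]*\.htm[a-z]*$/$1/;  # base URL: strip the page filename
  s/(<[^<>]*>)/$1\n/g;                    # put one tag on each line
  for(split(/\n/)){
    if( s/^.*(href|src)[^=]*=\s*"([^"]*)".*$/$2/i ) {
      s/^(?!(http|ftp|mailto))/$x\//i;    # make relative links absolute
      print"$_\n";
    }
  }
}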
Alternatively use lynx to dump a reference list
Get the list of URLs
lynx -listonly -image_links -dump URL
Directly from the URL
lynx -dump URL |\
sed '1,/^References$/d; /^[ 0-9]*\. /!d; s///'
Indirectly from an already downloaded file
lynx -dump -force_html -base URL file.html |\
sed '1,/^References$/d; /^[ 0-9]*\. /!d; s///'
The "GET" command from the perl LWP library (libwww-perl) can do this too...
Directly
GET -o links "$1" |\
sed -n "s/^\(FRAME\|A\|IMG\)[ \t]*//p"
Indirectly
GET -o links -b "$1" file.html |\
sed -n "s/^\(FRAME\|A\|IMG\)[ \t]*//p"
Wget does not provide a way to directly extract URLs, though it must do this
internally. You will need to use one of the other methods above.
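Python's standard library can also do this without lynx or LWP. A minimal
sketch, assuming the page has already been downloaded into a string and that
you know the URL it came from. The names LinkCollector and extract_links are
made up for this note.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect every href and src attribute as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # urljoin leaves absolute URLs (http:, mailto:, ...) alone
                self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    """Return the links found in html, resolved against base_url."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Unlike the regular expression methods this also copes with single quoted
and unquoted attribute values.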
------------------------------------------------------------------------------
HTML to Plain Text (remove html formatting)
See my "html2txt" script
---
Lynx
lynx -dump -width=80 -force_html -nolist file://localhost`pwd`/file.html
Or as a pipeline
lynx -width=80 -nomargins -force_html -dump -nolist -stdin
Remove -width to turn off word wrap
Other options include...
  -justify          justify the text
  -dont_wrap_pre    do not wrap the <PRE> formatted blocks
---
ELinks is similar
elinks -dump -dump-width 80 -force-html -no-references -no-numbering \
file://localhost`pwd`/file.html
---
Sed
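This will remove tags from MOST html
  sed -e :a -e 's/<[^>]*>//g; /</N; //ba'
Or replace each run of tags with a single space
  sed -e 's/<[^>]*>\( *<[^>]*>\)*/ /g'
Or hand each line to a perl tag stripping module, for example
(assuming the HTML::Strip module is installed)
  perl -MHTML::Strip -ne 'print HTML::Strip->new->parse($_)'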
---
Perl - Simplistic tag removal
Assumes the whole HTML text is in one string
sub text_clean {
  local $_ = shift;
  s/<LI(>| .*?>)/+ /igs;       # Handle Specific HTML Tags
  s/<\/?BR(>| .*?>)/\n/igs;
  s/<\/?P(>| .*?>)/\n\n/igs;
  s/<!--.*?-->//gs;            # comments
  s/<(script|style)\b.*?<\/\1\s*>//igs;  # scripts and style sheets
  s/<.*?>/ /gs;                # Remove all other HTML Tags
  s/[^\S\n]+/ /g;              # White Space (not newline) compression
  s/^ //gm;
  s/ $//gm;
  s/\&lt;/</g;                 # common HTML characters
  s/\&gt;/>/g;
  s/\&nbsp;/ /g;               # no-break space (must be after space compression)
  s/\&amp;/\&/g;               # ampersand last, so "&amp;lt;" ends up as "&lt;"
  s/\n\n+/\n\n/g;              # blank line compression
  s/^\n+//g;
  s/\n+$/\n/g;
  return $_;
}
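For comparison, a parser based version as a Python sketch. It is not a
replacement for lynx, just the same simplistic clean up done with the
standard "html.parser" module, which also decodes the character entities.
The set of tags treated as line breaks is my own choice, and the names
TextExtractor and html_to_text are made up for this note.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    SKIP = {"script", "style"}               # contents are thrown away
    BREAK = {"br", "p", "div", "li", "tr"}   # tags that start a new line

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BREAK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "p":
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def html_to_text(html):
    """Return the visible text of html with white space compressed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[^\S\n]+", " ", text)    # white space (not newline)
    text = re.sub(r" ?\n ?", "\n", text)     # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)   # blank line compression
    return text.strip() + "\n"
```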
------------------------------------------------------------------------------
See info/perl/www.hints for
URL encoding and decoding
CGI decoding
For hints on matching specific text for data extraction see
http://lwp.interglacial.com/ch06_02.htm
------------------------------------------------------------------------------
Tables with no spacing
  <table border=0 cellspacing=0 cellpadding=0>
------------------------------------------------------------------------------
Centering a whole page!
If you don't mind using frames, here is a cool way to center a page
no matter what resolution the surfer is using:-
  <frameset cols="*,WIDTH,*">
    <frame src="spacer.htm">
    <frame src="page.htm">
    <frame src="spacer.htm">
    <noframes>
      your alternative (Suggestion, SSI Inclusion page.htm :-)
    </noframes>
  </frameset>
where WIDTH is the pixel width of your page.
This would center your page horizontally - obviously the file spacer.htm
would have just a background or bgcolor. You could nest a frameset to
also center the page vertically, or simply make the file page.htm
another frameset to do the same thing.
matthew
------------------------------------------------------------------------------
META elements..
Page Description (to properly describe your page for Search Engines)
  <meta name="description" content="...">
  <meta name="keywords" content="...">
Currently only AltaVista and Infoseek use this information, but people
who supply the correct keywords are placed at the top of search hits.
Other than the mail links (last two), LINK tags are not a good idea for
general pages, just major starting pages. The LINK tag is, for example,
understood by the lynx text-only web browser, for automatic reporting of
bad links.
Auto Loading the next page (or sound!)
  <meta http-equiv="refresh" content="20; URL=...">
I presume this tells Netscape to download the audio file 20 seconds
after the page has been loaded.
WARNING: If the page is still downloading graphical elements after
twenty seconds have expired, the refresh will be called and the page
layout will be aborted. This has been my experience on the Mac
side.... --- Pacific Ocean Digital
Specify base for all URLs and Images.
  <base href="...">
Set the href to the original source of this document. Images and other
relative links will then be taken from that directory and NOT from where
this file came from!
------------------------------------------------------------------------------
Inline Data URL's
Allows small images (or other data) to be included directly in the downloaded
HTML page. See http://en.wikipedia.org/wiki/Data_URI_scheme
This is especially efficient for small images (less than 10 Kbytes)
Example
  <img src="data:image/png;base64,..." alt="...">
Warning: does not work for IE7 (does for IE8),
and Firefox 3.5.7 cannot take newlines in the data!
CSS
ul.checklist li.complete { margin-left: 20px; background:
url('data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAABAAAAAQAQMAAAAlPW0iAAAABlBMVEUAAAD/
//+l2Z/dAAAAM0lEQVR4nGP4/5/h/1+G/58ZDrAz3D/McH8yw83NDDeNGe4U
g9C9zwz3gVLMDA/A6P9/AFGGFyjOXZtQAAAAAElFTkSuQmCC')
top left no-repeat; }
JavaScript
window.open('data:text/html;charset=utf-8,%3C%21DOCTYPE%20'+
'html%3E%0D%0A%3Chtml%20lang%3D%22en%22%3E%0D%0A%3Chead%'+
'3E%3Ctitle%3EEmbedded%20Window%3C%2Ftitle%3E%3C%2Fhead%'+
'3E%0D%0A%3Cbody%3E%3Ch1%3E42%3C%2Fh1%3E%3C%2Fbody%3E%0A'+
'%3C%2Fhtml%3E%0A%0D%0A','_blank','height=300,width=400');
See also use of 'inline' images in ImageMagick
http://www.imagemagick.org/Usage/files/#inline
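Building such a URL by hand is fiddly, so here is a minimal Python sketch
using only the standard "base64" module. The function name and the default
mime-type are my own choices. The result is a single line, which avoids the
newline problem noted above.

```python
import base64

def make_data_uri(data, mime_type="image/png"):
    """Return a one-line data: URL holding the given bytes."""
    encoded = base64.b64encode(data).decode("ascii")
    return "data:%s;base64,%s" % (mime_type, encoded)
```

For a real image, read the file in binary mode first, for example
make_data_uri(open("icon.png", "rb").read()), where icon.png is whatever
small image you want to inline.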
------------------------------------------------------------------------------
Server Side Include Files (file type)
Files which are Server Side Includes are just included text, and may or
may not contain HTML code.
Note however that many HTML checkers such as weblint will object to
such an included file as it does not contain the required <HTML>, <HEAD>,
<BODY> tags. Or the including SSI file will be missing certain tags.
I suggest you use a different suffix which makes it plain that this is a
server side include. I use the ".phtml" suffix, but another possibility is
".ssi" (server side include).
------------------------------------------------------------------------------
Execute Bit Hack (for server side includes)
The execute bit method of installing Server Side Includes is a HACK and
should NOT be used!!! It is deprecated, hacky and error prone.
Instead you should use...
index.shtml
The .shtml suffix IS THE CONVENTION for both NCSA and Apache Web servers.
Both servers also have a 'DirectoryIndex' configuration option so that
directory index files can be ".shtml" or ".cgi" files instead of the
normal ".html" format.
For example...
=======8<--------
# DirectoryIndex: Name of the file to use as a pre-written HTML
# directory index. These files are used if a directory is referenced.
DirectoryIndex index.html index.shtml index.cgi
=======8<--------
------------------------------------------------------------------------------