[Zope] Dumping Zope (CMF) site to file system

Eugene el-spam at yandex.ru
Thu Jul 8 05:19:34 EDT 2004


Hello David,

DCS> I want to set up a process for dumping my Zope CMF site to the
DCS> filesystem, to be served by Apache. I'm interested in anyone who's doing
DCS> this - what tools are you using. I'm trying Wget, but the main problem
DCS> is dealing with absolute URLs. I can use the Wget --convert-links
DCS> option, which removes the href attribute from the <base> tag and makes
DCS> internal links relative. However, I still have a problem with folders.
DCS> The absolute_url() method does not return a trailing slash for folders.
DCS> Wget downloads the URL folder_name as a file called folder_name, but it
DCS> downloads folder_name/ as folder_name/index.html. I have already written
DCS> a relativeURL() script based on portal_url.getRelativeUrl(), but it
DCS> doesn't return a trailing slash either, so I'll have to add one.

I solved this problem recently. Here is what I did:
1. Make all your URLs end with a slash.
   At first I did it manually, by correcting some lists in portlets;
   later I found out how to redefine the absolute_url() function.
   Please look for it here:
2. Run wget (I run it from my Zope in response to a user action,
   but it can also be done with a shell script like the one below).
   Convert the links in the downloaded files and erase the <base ..> tag.
   I also edit the HTML files to delete 'index.html' from links, so
   every URL now ends with '/'. (*)
   If you wish, you can shrink the files by stripping whitespace; I found
   that whitespace takes up about 30-40% of an HTML file.
3. Publish your files.
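
The optional whitespace step from point 2 can be sketched as a small
shell function. This is only an illustration: the name squeeze_html is
made up, and it assumes your pages have no <pre> blocks or inline
scripts that depend on line breaks, because everything ends up on one line.

```shell
#!/bin/sh
# Hypothetical helper (the name is an assumption, not part of mirror.sh).
# Rewrites the file given as $1 in place: tabs, carriage returns and
# newlines become spaces, runs of spaces are squeezed to one, and the
# leading/trailing space is trimmed.
squeeze_html() {
  { tr '\t\r\n' '   ' < "$1" | tr -s ' '; echo; } |
    sed -e 's/^ //' -e 's/ $//' > "$1.tmp" &&
    mv "$1.tmp" "$1"
}
```

It would be called once per downloaded file, for example
squeeze_html index.html, after the links have been fixed.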


Here's the mirror script:

el at test[<<debug-1/bin]%cat mirror.sh
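#!/bin/sh
# Mirror the site with wget, then clean up the links in the downloaded HTML.

param=$1
if test -z "$param"; then
  # No argument: fetch everything listed in wget-list, one level deep.
  param='-r -l 1 -i ../etc/wget-list'
else
  # Argument given: fetch just that path.
  param="http://www.test/$param"
fi
# $param is left unquoted on purpose so the option string splits into words.
wget -v -nH -k -p -X images -x -R index_html $param

find . -name '*.html' | while IFS= read -r i
do
  # Join the file into one line and squeeze runs of whitespace (the old
  # unquoted "echo $infa" did this as a side effect), then rewrite the links:
  # "dir/index.html" -> "dir/", "index.html" -> "./", and drop the <base> tag.
  { tr '\t\r\n' '   ' < "$i" | tr -s ' '; echo; } |
  sed -e 's|href="\([a-zA-Z0-9._/-]*\)/index\.html"|href="\1/"|g' \
      -e 's|="index\.html"|="./"|g' \
      -e 's|<base href=""[^/]*/>|<!--here was base tag-->|' > "$i.tmp" &&
  mv "$i.tmp" "$i"
done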
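
To see what the link rewriting does without running a full mirror, the
substitutions can be tried on a few sample lines. This is a sketch only:
the function name fix_links and the sample HTML are made up, and the
expressions are the ones from mirror.sh written with | as the sed
delimiter so the slashes need no escaping.

```shell
#!/bin/sh
# Hypothetical filter (name is an assumption): reads HTML on stdin and
# applies the three substitutions from mirror.sh, writing to stdout.
fix_links() {
  sed -e 's|href="\([a-zA-Z0-9._/-]*\)/index\.html"|href="\1/"|g' \
      -e 's|="index\.html"|="./"|g' \
      -e 's|<base href=""[^/]*/>|<!--here was base tag-->|'
}

# Sample input, invented for illustration.
printf '%s\n' \
  '<base href="" />' \
  '<a href="news/index.html">News</a>' \
  '<a href="index.html">Home</a>' | fix_links
```

With that input the <base> line turns into the comment
<!--here was base tag-->, href="news/index.html" becomes href="news/",
and href="index.html" becomes href="./".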

======
The file wget-list lists the extra files that need to be downloaded:
el at test[<<debug-1/bin]%cat ../etc/wget-list
http://www.test/
http://www.test/xtra/head.css
http://www.test/xtra/default.css
http://www.test/xtra/inside.css

====

P.S.
(*) - This is a personal obsession of mine. I hate URLs with a lot of junk, like
     http://site/print1.html?foo=bar&sid=4759436545&vasya=pupkine&junks=true&nothing=many-many....
     The best URL format is the one proposed by Tim Berners-Lee:
     http://site/section/subsection/page/

     
-- 
Best regards,
 Eugene                            mailto:el-spam at yandex.ru


