0.5.5
08 Jul 2004
WebFetcher is a module designed to facilitate fetching large numbers of documents and images from the web. With WebFetcher it is easy to do such things as downloading all the images on a page. You can also do more complex tasks, such as fetching all the pages linked to by a certain page and all the images on those pages. You can save the documents in a tree structure mirroring the server layout or flat in a single directory. In either case, WebFetcher can translate the links on the pages so that all relationships are preserved.
Save a copy of Programming Ruby to your hard drive:
require 'webfetcher'
book = WebFetcher::Page.url('http://www.rubycentral.com/book/')
book.recurse.save('pickaxe')
The latest version of WebFetcher is available at http://www.acc.umu.se/~r2d2/programming/ruby/webfetcher/.
Run
ruby install.rb
or copy the file webfetcher.rb to the desired destination manually.
WebFetcher is released under the same license as Ruby.
Much of what this module does is a bit heuristic. It doesn't use a full SGML parser, for example. Even if it did, it wouldn't work well with the many web pages that contain errors. It uses a real hack to try to deal with links in JavaScript code. Dealing with links that end in "/" is also a bit difficult, because we don't know the name of the file that is actually returned. Sometimes the web server returns a content-location header with the name of the file, but not always.
Because of this, I can't be 100% sure that this module works with every web page (though I've tried to do as well as I can). If you find a page where it doesn't work, I would love to hear about it, since it would help me improve the module further.
The Internet has a lot of standards. Obviously a program like this cannot support all of them. (I am not attempting to write a full blown web browser.) Some of the things not currently supported by this module are:
ftp -- ftp-links are currently ignored.
basic authentication -- pages requiring authentication are ignored. Links including authentication information are not handled.
SSL -- https-links are ignored.
JavaScript -- the module has a little hack for finding links in JavaScript code, but it doesn't attempt to be a real JavaScript interpreter, and it never will.
I will probably add some of these things in the future, but I'd like to make sure that the basic functionality is as stable as possible before I start adding new things. Send me a list of your favorite features.
Feel free to send comments, bug reports and feature requests to the author. Here is the author's latest email address.
WebFetcher deals with two kinds of objects, pages and collections of pages. To create a new page, use Page.url.
include WebFetcher
p = Page.url('http://www.rubycentral.com/book/')
This does not immediately start a download of the page. The download is delayed until the content is really needed, for example when you call the content method, which returns the content of the web page, or the save method, which saves it to a file.
puts p.content
p.save('index.html')
You will see when the page is downloaded, because the program prints GET http://www.rubycentral.com/book/ to $stderr.
If you want to download a large number of documents, it is easier to use a PageCollection. A PageCollection is basically an array of pages with some convenience methods. You can get a PageCollection by calling PageCollection.new with an array of pages, or you can get one from a page by using one of the methods links, images, rich_page, extract and recurse.
links returns a page collection consisting of every document that is linked from the current page.
# Print google matches for ruby (and other links on the google page)
pc = Page.url('http://www.google.com/search?q=ruby').links
puts pc
images returns a page collection consisting of every image on the current page.
# Get google images of rubies
pc = Page.url('http://images.google.com/images?q=ruby').images
puts pc
rich_page returns a page collection consisting of the current page and any element that is directly visible on it. That includes layers, frames, images and style sheets.
puts Page.url('http://www.rubycentral.com/').rich_page
extract is the most versatile way of creating a page collection. It allows you to specify exactly what types of links you want to extract to the collection. See the documentation for more details.
puts Page.url('http://www.cnn.com/').extract('a', :external)
recurse is the most useful method when you want to download a large number of documents. It returns a collection consisting of the current page, all documents linked by it, the documents linked by those pages and so on. If you specify an integer argument to recurse, that argument tells the function how deep it should recurse. If you specify nil the program will recurse until it cannot find any more pages.
By default, recurse only returns documents in the same directory as the current page and its subdirectories. That prevents it from downloading the entire web. But you can give options to recurse in the same way as you give options to extract.
book = Page.url('http://www.rubycentral.com/book/').recurse
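For example, a depth-limited recursion that also follows links anywhere on the same server might look like the sketch below (the depth value and the :server option are illustrative; recurse accepts the same options as extract, as described in the reference section).
# Recurse two levels deep, following links anywhere on the server
book = Page.url('http://www.rubycentral.com/book/').recurse(2, :server)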
If you use one of the methods links, images, rich_page, extract and recurse on a PageCollection, you will get the same result as if you used the method on each member of the collection and then joined the results. The most useful combination is probably to first call recurse and then call rich_page on the result, to make sure that you have every document visible on the pages.
page = Page.url('http://www.scottmccloud.com/comics/mi/mi.html')
pages = page.recurse.rich_page
Once you have a page collection, you probably want to save it. You can do that by using the save method and specifying the destination directory. By default, save will convert all the links on the pages so that they point to your locally saved documents when available, and to external web documents otherwise. So the pages will work just as usual, but run faster and offline. You can change the behavior of save by specifying options, see the documentation.
Page.url('http://www.rubycentral.com/book/').recurse.save('pickaxe')
Downloading documents from the Internet can be a time-consuming and error-prone procedure. WebFetcher allows you to deal with this by defining error handlers and progress monitors.
An error handler is called when an exception occurs during a potentially lengthy operation, such as recurse. Propagating the exception in this case is not always a good idea, since it might abort the download and destroy hours of work. For this reason, WebFetcher never propagates an exception; instead it sends it to an error handler which you can define yourself. In your error handler, you can re-raise the exception or deal with it as you wish.
page.error_handler = proc {|x| File.open('log', 'a') {|f| f.puts(x)} }
The error handler is inherited by all pages derived from the page. By default, the system uses the following error handler:
proc {|x| $stderr.puts(x)}
PageCollections can have error handlers too. They are called if an exception occurs during save.
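As a hedged sketch (the logging behaviour here is just an example), a collection-level handler can be installed through the collection's error_handler= method or passed as a block to PageCollection.new:
pages = Page.url('http://www.rubycentral.com/book/').recurse
# Illustrative handler: report the problem and keep saving the other pages
pages.error_handler = proc {|x| $stderr.puts "save failed: #{x}"}
pages.save('pickaxe')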
Progress monitors are used to keep track of program progress. A progress monitor attached to a page is called, with the page as argument, when the page is downloaded.
page.progress_monitor = proc {|page| puts "i'm fetching #{page.url}"}
Progress monitors are inherited by derived pages. By default, pages use the following progress monitor:
proc {|page| $stderr.puts "GET #{page.url}"}
Set progress_monitor to nil to disable the output.
You can also add progress monitors to recurse and save. The recurse monitor is called for each document that is examined and the save monitor is called for each document that is saved.
RECURSE = proc {|page, i, total| puts "recurse #{page} (#{i} of #{total})"}
SAVE = proc {|file, page, i, total| puts "saving #{file} (#{i} of #{total})"}
book = Page.url('http://www.rubycentral.com/book/')
book.recurse(&RECURSE).save('pickaxe', &SAVE)
Classes: Page, PageCollection, UnhandledSchemeError, UnreachableDocumentError
A module that facilitates fetching documents (images and HTML pages) from the Internet using the HTTP protocol. It makes it easy to download all the images on a page or an entire tree of documents rooted at a certain point.
require 'webfetcher'
include WebFetcher

book = Page.url('http://www.rubycentral.com/book/')
pages = book.recurse(10).save('pickaxe')
Methods: new, url, ==, attr, basename, content, content_translated, dirname, error_handler=, ext, extract, host, html?, image?, images, largest_image, link, links, name, path, port, progress_monitor=, proxy_host, proxy_host=, proxy_port, proxy_port=, recurse, resp, rich_page, save, save_translated, tag, to_s, true_path, url
This class represents a downloadable HTTP document.
Creates a new page at the specified host, path and port. proxy_host and proxy_port specify the location of the proxy. Use nil for proxy_host if you do not want to use a proxy. tag and attr are used to set the tag and attr attributes. name sets the name attribute.
If a block is given it is used as progress_monitor=, otherwise the default progress monitor is used.
p = Page.new('www.acc.umu.se', '/')
Most of the time it is probably simpler to use Page.url.
Creates a new page from a URL. The remaining arguments are as for new.
Raises UnhandledSchemeError if the link is a scheme not handled by this program (such as mailto).
p = Page.url('http://www.acc.umu.se/')
Two pages are considered equal if they have the same URL. (This definition is also used for hashing.)
If the document was generated from a tag, this returns a hash with the attributes of the tag. Hash keys are always lower case. If the document was not generated from a tag, attr is {}.
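For instance (a sketch with illustrative attribute values, not actual output), a page that was extracted from an img tag might report its attributes like this:
img = Page.url('http://www.rubycentral.com/').images.first
img.attr   # e.g. {'src' => 'pickaxe.gif', 'width' => '100'} -- keys are lower case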
Returns the file name part of path.
Page.url('http://a.a.a/x/x.html').basename >> 'x.html'
Returns the content of the document. If the document has not been downloaded yet, a download is initiated.
Returns the page content as a string with translated URLs. Arguments are the same as for save_translated.
Returns the directory part of path.
Page.url('http://a.a.a/x/x.html').dirname >> '/x'
Sets the error handler for this document. The error handler is called if an exception occurs during a lengthy operation, such as recurse. This can happen, for example, if you get a network error. In that case, raising an exception is not always a good idea, since it can abort a very lengthy download. So instead of raising an exception, the program calls the error handler (a Proc) with the exception.
p.error_handler = proc {|x| raise unless x.class==RuntimeError}
The default error handler prints the error to $stderr, but does not abort the download. If you want to change this behavior, you must set error_handler. Note that the error handler is inherited when you do link or extract.
Returns a suitable file name extension for storing this object.
For fetched documents, the mime-type is used, otherwise the extension is extracted from the URL. If the URL looks like a link to a CGI-script the document is fetched to determine the mime-type.
Page.url('http://a.a.a/x.gif').ext >> 'gif'
Returns a set of pages extracted from the links on this page.
options determines how the links are extracted. There are two types of options: options specifying which types of links should be extracted (anchors, images, etc.), and options specifying where the documents we are interested in are located.
A type option is either the special name :all_types, which specifies that all types of links should be extracted, or the name of a tag (a String) whose links should be extracted. You can specify as many type options as you want. Tags currently supported by this module are: a, img, layer, bgsound, area, embed, body, frame, script, applet and link.
The location option can be :external, :server, :subdir, :dir, :all_locations or any combination of these. :external denotes links to other servers. :server denotes links anywhere on the current server. :subdir denotes links to the current directory and all its subdirectories. :dir denotes links to the current directory only. :all_locations specifies that all locations should be extracted.
If you include a Page in the options, all location arguments, such as :external, :server, etc, will be relative to that page instead of the current page.
The special keyword :all is the same as specifying :all_locations and :all_types.
If you do not specify a type, :all_types is assumed. If you do not specify a location, :server is assumed.
The extracted links are returned as a PageCollection.
images = page.extract('img', :subdir)
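As a further sketch (the option combinations and the other_page variable are chosen for illustration), several type options can be mixed with a location option, and a reference Page can be supplied for the location test:
# Anchors and images pointing outside the current server
external = page.extract('a', 'img', :external)
# Images located under another page's directory (other_page is a hypothetical reference page)
relative = page.extract('img', :subdir, other_page)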
The name of the host where the document is located.
Page.url('http://a.a.a/x/x.html').host >> 'a.a.a'
Returns true if the current page is an HTML page.
WebFetcher tries to determine the type of the page by looking at the link (to see if it ends in ".html", etc). If the result of this is inconclusive, this method will fetch the page to get a content-type header.
Returns true if this page is an image.
This method looks at the extension, the tag attribute and (if the document has been fetched) the content-type. Unlike html?, it never initiates a download.
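A small sketch of how the two predicates might be used to split up a collection (note that html? may trigger a download when the URL alone is inconclusive):
pages = Page.url('http://www.rubycentral.com/book/').recurse
docs = pages.select {|p| p.html?}     # may fetch pages with inconclusive URLs
images = pages.select {|p| p.image?}  # never fetches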
Returns a PageCollection with all the images on this page.
This is equivalent to calling extract('img', :all_locations)
Returns the largest image on this page. This method does not actually download the images to check their sizes, it only looks at the height and width attributes. It returns nil if there are no images with specified height and width.
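A hedged usage sketch (the file name is illustrative; remember that the method returns nil when no image declares its size):
img = Page.url('http://www.rubycentral.com/').largest_image
img.save("banner.#{img.ext}") if img   # skip pages without sized images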
Creates a new page from a link URL on the current page. The URL can be either an absolute or a relative URL. The progress_monitor= and error_handler= of the current page are inherited by the new page. tag and attr sets the tag and attr attributes of the new page.
If the link is not an http-link, UnhandledSchemeError is raised.
page = current_page.link('../index.html')
Returns a PageCollection with all the documents linked from this page.
This is equivalent to calling extract('a', 'area', :all_locations)
If the document was derived from a #-link, this is the part after the #.
The path of this document.
Page.url('http://a.a.a:1200/x/x.html').path >> 'x/x.html'
The port on the host where the document is located.
Page.url('http://a.a.a:1200/x/x.html').port >> 1200
Sets the progress monitor (a Proc object) for this document. The progress monitor is called when the page is downloaded, with the page as argument.
p.progress_monitor = proc {|p| puts "Downloading #{p}"}
If you do not specify a progress monitor a default progress monitor is used which prints "GET page_name" to $stderr. To disable this, set progress_monitor to nil.
Note that the progress_monitor property is inherited when you run extract or link, so the progress monitor will monitor the download of those pages too.
The name of the proxy to use for downloading documents or nil if you do not want to use a proxy.
The name of the proxy to use for downloading documents or nil if you do not want to use a proxy.
The port of the proxy that should be used to download documents.
The port of the proxy that should be used to download documents.
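A hedged sketch of routing downloads through a proxy (the host name and port below are placeholders); the same values can also be passed directly to Page.new or Page.url as described above:
page = Page.url('http://www.rubycentral.com/book/')
page.proxy_host = 'proxy.example.com'   # placeholder proxy host
page.proxy_port = 8080                  # placeholder proxy port
puts page.content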
Returns a PageCollection consisting of this page and all the pages reachable from it by recursively following the links.
level specifies how deep the recursion should nest. If level is 0, only this page is returned. If level is 1, this page and all links on it are returned, and so on. If you specify nil for level, the recursion will visit every linked page. You should never use nil and the :external option simultaneously. If you do, the program will most likely try to download the entire web.
options are the same as for extract, but the default options are different. The default type option is :all_types and the default location option is :subdir. The current page is used as reference page for all location arguments.
If you call the method with a block, progress updates will be sent to the block. page is the page that the recursion is currently looking at. i is the index of that page; it starts at 0 and increases as the recursion progresses. total is the total number of documents in the recursion queue. When i reaches total the recursion will stop, but note that total will increase as the recursion progresses.
pc = page.recurse(10) {|p,i,t| puts "#{p} (#{i} of #{t})"}
Returns the HTTP response header retrieved when the document was fetched. If the document has not been fetched yet, a download is initiated.
Returns a PageCollection consisting of this page and all the documents directly visible on it: frames, stylesheets, images, etc.
Saves this document to the specified path.
Saves this document to the specified path, using url_map to translate the URLs in the document to local links. The url_map should be a hash associating URLs with the paths where they are stored locally.
If absolutize is true, all the relative links in the file that point to documents not in url_map are converted to absolute links. This ensures that all links work as previously, even though the file is stored locally.
You can use this method to convert the links on the documents you are downloading so that they point to the downloaded documents on your hard drive instead of to external servers. But it is probably easier to use PageCollection#save.
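A hedged sketch of how the url_map might look (the file names and the positional absolutize argument are assumptions based on the description above):
url_map = {'http://www.rubycentral.com/book/intro.html' => 'pickaxe/intro.html'}
page.save_translated('pickaxe/index.html', url_map, true)   # true: absolutize the remaining links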
If the document was derived from a tag, this is the tag name (i.e., 'img' for an image tag). (The tag is always in lower case.) If the document was not generated from a link, tag is nil.
Returns a string representation of this document. Equivalent to url.
Returns the true path of the document.
Sometimes the path of a document cannot be determined from its URL. If the URL points to a directory, the document could be a document in that directory named "index.html", "index.htm", "index.php" or something else.
The only way of determining the true path of the document is to download it and check the content-location header. The path method does not do that, since it is a potentially costly operation.
Sometimes, however, you need to know the true path of a document. This method checks the URL and if it looks like it points to a directory, it downloads the document to get a content-location header. It then returns the true path. (Note that some webservers do not send these headers even though they ought to, so this method is not guaranteed to return a correct result.)
After you have called true_path once, the result is cached, so subsequent calls to path and url will use the true path.
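A small sketch of the difference (the returned value is illustrative and depends on what the server sends):
page = Page.url('http://www.rubycentral.com/book/')
page.path        # the path as given in the URL, ending in '/'
page.true_path   # e.g. '/book/index.html', if the server sends a content-location header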
Returns the url of this page. If with_name is true the name part is included in the url.
Methods: new, urls, error_handler=, extract, images, largest_image, links, recurse, rich_page, save
This class is used to represent a collection of pages. It is basically an Array of pages with some added convenience methods.
There is currently some debate about what you should do when you create a subclass of Array. Should you reimplement all or some of the methods in Array so that they return your new subtype instead of an Array? See RCR #38 on RubyGarden.
I do no such reimplementations in this class. I trust that Matz will make the right decision about which methods to change and which not to, and I think it is best to follow whatever standard he sets, unless there are special reasons not to.
If you want to do array operations, you must convert the result back to a PageCollection manually.
pages = PageCollection.new(page1.links + page2.images)
Creates a new page collection. pages are the pages in the collection. If a block is given it is used as error_handler=.
Creates a page collection from an array of urls. proxy and proxy_port specify the proxy to use if any.
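A hedged sketch (the URLs are illustrative; per the description, proxy and proxy_port can be supplied when a proxy is needed):
pc = PageCollection.urls(['http://www.rubycentral.com/', 'http://www.ruby-lang.org/'])
pc.save('mirror')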
The error handler works as Page#error_handler=.
Creates a new page collection by calling Page#extract on each document in this collection.
Creates a new page collection by calling Page#images on each document in this collection.
Creates a new page collection by calling Page#largest_image on each document in this collection.
Creates a new page collection by calling Page#links on each document in this collection.
Creates a new page collection by calling Page#recurse on each document in this collection.
Creates a new page collection by calling Page#rich_page on each document in this collection.
Saves the pages in this collection to the directory rootdir. options can be used to control how the pages should be saved.
With the option :translate (default), any internal links between the pages in the collection are translated to point to the file where the page is saved instead of the original URL. Relative links to pages not in the collection are converted to absolute links to ensure that they still work. This is usually what you want. All links will work as normal, but the pages in the collection will be cached on your hard drive allowing fast access whether a network connection is available or not. Use the option :notranslate to save the pages just as they are without translating the links.
With :flat (default) all the files are saved directly under the specified directory. This is usually what you want. Note that this does not break the links unless you use :notranslate. If you specify :tree instead, the files are saved in a structure mirroring the layout on the server.
If you specify :rename (default), the files are renamed if there already exists a file with that name in the directory. Note that this doesn't break the links as long as :translate is on. With :overwrite, files with the same name are overwritten. If there are several pages in the collection with the same name they will overwrite each other. If you use :rename_all, all the pages in the collection are renamed before they are saved. This can be useful if the files have strange names.
When files are renamed they are given the names 1.html, 2.html, etc. The extension is adapted to the file type. Regardless of which rename setting is used, strange links, such as CGI-script links, are always renamed.
If a block is given, it is called for each document in the collection as it is saved. path is the path where the document was saved and page the page that was saved. i is the index of the saved page and size the total number of pages.
# Save flat, translated in current directory
pages.save('.') {|f,p,i,s| puts "#{f} (#{i} of #{s})"}
# Save in a tree structure with no translation
pages.save('.', :tree, :overwrite, :notranslate)
This exception is raised if a link containing an unhandled scheme such as mailto or ftp is encountered. This exception is not propagated by methods such as extract and recurse. Instead, it is captured internally and ignored.
The scheme of the url
The url itself
This exception is raised when a document cannot be reached. It is propagated to the exception handler of the class.
The response returned by the server
The url that we are trying to reach