WBM

Formatted documentation for the WBM function.

This function acts as an API for the Wayback Machine (web.archive.org).

Description

With this function you can download captures from the Internet Archive that match a date pattern. If the current time matches the pattern and there is no valid capture, a capture will be generated. The WBM time stamps are in UTC, so a switch allows you to provide the date-time pattern in local time, which will be converted to UTC internally.

This code enables you to use a specific web page in your data processing, without the need to check if the page has changed its structure or is not available at all.
You can also redirect all outputs (errors only partially) to a file or a graphics object, so you can more easily use this function in a GUI or allow it to write to a log file.

Usage instructions for the syntax of the WBM interface are derived from a Wikipedia help page. The fuzzy date matching behavior is based on this archive.org page.

NB: Please be gentle with the WBM. Usage of the WBM doesn't require an API key, logging in or anything like that. If you abuse it, you will spoil it for everyone. Please make sure your usage does not break this Nice Thing™. If you use it often, please consider donating, so the WBM will stick around.

Syntax

WBM(filename,url_part)
WBM(___,Name,Value)
WBM(___,options)
outfilename=WBM(___)
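
A minimal usage sketch of the call forms above. The target file name and the URL are hypothetical, and the call is guarded so the snippet also runs when WBM is not on the path. The empty-output check follows the description of outfilename.

```matlab
% Hypothetical target file and URL, chosen only for illustration.
filename = fullfile(tempdir,'wbm_example.html');
url_part = 'http://example.com';

outfilename = '';                 % stays empty if WBM is unavailable
if exist('WBM','file')            % skip the call if WBM is not on the path
    % Request a capture from 2019 (see the date_part parameter).
    outfilename = WBM(filename,url_part,'date_part','2019');
end

% An empty output means the download failed (or was skipped).
if isempty(outfilename)
    disp('No capture was downloaded.')
else
    fprintf('Capture saved to %s\n',outfilename);
end
```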

Output arguments

outfilename Full path of the output file. The variable is empty if the download failed.

Input arguments

filename The target filename in any format that websave (or urlwrite) accepts. If this file already exists, it will be overwritten in most cases.
url_part This URL will be searched for on the WBM. The URL might be changed (e.g. ':80' is often added).
Name,Value The settings below can be entered with a Name,Value syntax.
options Instead of the Name,Value, parameters can also be entered in a struct. Missing fields will be set to the default values.
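
The two ways of passing settings can be sketched as follows. The chosen values are illustrative only, and the conversion from a struct to a Name,Value list is a general Matlab idiom, not something WBM requires.

```matlab
% Settings collected in a struct; fields left out keep their defaults.
options = struct;
options.date_part = '202001';   % captures from January 2020
options.verbose   = 1;          % only warn about connection problems
options.tries     = [3 2 2];    % load attempts, save attempts, timeouts

% The same settings as a Name,Value list.
names  = fieldnames(options);
values = struct2cell(options);
NameValue = [names.';values.'];  % 2xN cell: names in row 1, values in row 2

if exist('WBM','file')           % skip the call if WBM is not on the path
    fn = fullfile(tempdir,'wbm_example.html');   % hypothetical target file
    outfilename = WBM(fn,'http://example.com',options);
    % Equivalent call: WBM(fn,'http://example.com',NameValue{:});
end
```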

Name,Value pairs

date_part A string with the date of the capture. It must be in the yyyymmddHHMMSS format, but doesn't have to be complete. Note that this is represented in UTC.
If incomplete, the Wayback Machine will return a capture that is as close to the midpoint of the matching range as possible. So for date_part='2' the range is 2000-01-01 00:00:00 to 2999-12-31 23:59:59, meaning the WBM will attempt to return the capture closest to 2499-12-31 23:59:59.
default='2';
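
The range arithmetic for an incomplete date_part can be illustrated with a short sketch. This is not the code WBM uses internally. It assumes the lower bound is found by padding the pattern with zeros and the upper bound by padding with nines, after which each field is clamped to a valid calendar value.

```matlab
date_part = '2019';                            % hypothetical partial pattern
pad = 14-numel(date_part);                     % digits missing from yyyymmddHHMMSS
lo_str = [date_part repmat('0',1,pad)];
hi_str = [date_part repmat('9',1,pad)];

% Split a 14-digit string into [year month day hour minute second].
fields6 = @(s) [str2double(s(1:4))  str2double(s(5:6))   str2double(s(7:8)) ...
                str2double(s(9:10)) str2double(s(11:12)) str2double(s(13:14))];
lo = fields6(lo_str);
hi = fields6(hi_str);

% Clamp to valid calendar values.
lo(2:3) = max(lo(2:3),1);                      % months and days start at 1
hi(2)   = min(hi(2),12);
hi(3)   = min(hi(3),eomday(hi(1),hi(2)));      % last day of that month
hi(4:6) = min(hi(4:6),[23 59 59]);

mid = (datenum(lo)+datenum(hi))/2;             % midpoint of the matching range
fprintf('Range: %s to %s\n',datestr(datenum(lo),31),datestr(datenum(hi),31));
fprintf('Midpoint: %s\n',datestr(mid,31));
```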
UseLocalTime A scalar logical. Interpret the date_part in local time instead of UTC. This has the practical effect of the upper and lower bounds of the matching date being shifted by the timezone offset.
default=false;
tries A 1x3 vector. The first value is the total number of times an attempt to load the page is made, the second value is the number of save attempts and the last value is the number of timeouts allowed.
default=[5 4 4];
verbose A scalar denoting the verbosity. Level 0 will hide all errors that are caught. Level 1 will enable only warnings about the internet connection being down. Level 2 includes errors NOT matching the usual pattern as well, and level 3 includes all other errors that get rethrown as warnings.
Octave uses libcurl, which makes catching errors a bit more difficult. This will result in more HTTP errors being rethrown as warnings under Octave than under Matlab.
default=3;
if_UTC_failed This is a char array with the intended behavior for when this function is unable to determine the UTC. The options are 'error', 'warn_0', 'warn_1', 'warn_2', 'warn_3', and 'ignore'. For the options starting with warn_, a warning will be triggered if the 'verbose' parameter is set to this level or higher (so 'warn_0' will trigger a warning if 'verbose' is set to 0).
If this parameter is not set to 'error', the valid time range is expanded by -12 and +14 hours to account for all possible time zones, and the midpoint is shifted accordingly.
default='warn_3';
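
The effect of the -12 and +14 hour expansion on the midpoint can be checked with a few lines of arithmetic. The bounds below are hypothetical, and the sketch only illustrates the description above, not the internal implementation.

```matlab
% Hypothetical matching range: one full day.
lo = datenum(2021,4,22,0,0,0);
hi = datenum(2021,4,22,23,59,59);
mid = (lo+hi)/2;

% Expand the range to cover all possible time zones.
lo_x = lo - 12/24;            % 12 hours earlier
hi_x = hi + 14/24;            % 14 hours later
mid_x = (lo_x+hi_x)/2;

% The asymmetric expansion moves the midpoint by (14-12)/2 hours.
shift_hours = (mid_x-mid)*24;
fprintf('The midpoint shifts by %.1f hour(s).\n',shift_hours);
```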
print_to_con A logical that controls whether warnings and other output will be printed to the command window. Errors can't be turned off.
default=true; if print_to_fid, print_to_obj, or print_to_fcn is specified then default=false;
print_to_fid The file identifier where console output will be printed. Errors and warnings will be printed including the call stack. You can provide the fid for the command window (fid=1) to print warnings as text. Errors will be printed to the specified file before being actually thrown.
If print_to_fid, print_to_obj, and print_to_fcn are all empty, this will have the effect of suppressing every output except errors.
This parameter does not affect warnings or errors during input parsing.
Array inputs are allowed.
default=[];
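
A sketch of writing the output to a log file through print_to_fid. The log file location, target file, and URL are hypothetical, and the call is skipped if WBM is not on the path.

```matlab
logfile = fullfile(tempdir,'wbm_example.log');   % hypothetical log location
fid = fopen(logfile,'a');                        % append to an existing log
if fid<0
    error('Could not open %s for writing.',logfile);
end
fprintf(fid,'--- WBM run started %s ---\n',datestr(now,31));

try
    if exist('WBM','file')
        WBM(fullfile(tempdir,'wbm_example.html'),'http://example.com',...
            'print_to_fid',fid);
    end
catch err
    fclose(fid);     % make sure the file is closed before the error propagates
    rethrow(err)
end
fclose(fid);
```

Because print_to_fid is set, print_to_con defaults to false, so warnings end up in the log file instead of the command window.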
print_to_obj The handle to an object with a String property, e.g. an edit field in a GUI where console output will be printed. Messages with newline characters (ignoring trailing newlines) will be returned as a cell array. This includes warnings and errors, which will be printed without the call stack. Errors will be written to the object before the error is actually thrown.
If print_to_fid, print_to_obj, and print_to_fcn are all empty, this will have the effect of suppressing every output except errors.
This parameter does not affect warnings or errors during input parsing.
Array inputs are allowed.
default=[];
print_to_fcn A struct with a function handle, anonymous function or inline function in the 'h' field and optionally additional data in the 'data' field. The function should accept three inputs: a char array (either 'warning' or 'error'), a struct with the message, id, and stack, and the optional additional data. The function(s) will be run before the error is actually thrown.
If print_to_fid, print_to_obj, and print_to_fcn are all empty, this will have the effect of suppressing every output except errors.
This parameter does not affect warnings or errors during input parsing.
Array inputs are allowed.
default=[];
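
A sketch of a print_to_fcn struct. The handler below is hypothetical. It assumes that the message text is stored in a field called message, and it is first tried out on a hand-made struct so it can be checked without calling WBM.

```matlab
% Formatter: builds one line of text from the three inputs described above.
fmt = @(type,ME,data) sprintf('[%s] %s: %s',data.tag,upper(type),ME.message);

logger = struct;
logger.h    = @(type,ME,data) disp(fmt(type,ME,data));  % the handler itself
logger.data = struct('tag','WBM');                      % optional extra data

% Dry run with a hand-made message struct (field names are assumptions).
ME = struct('message','example message','identifier','demo:example','stack',[]);
feval(logger.h,'warning',ME,logger.data);

if exist('WBM','file')           % skip the call if WBM is not on the path
    WBM(fullfile(tempdir,'wbm_example.html'),'http://example.com',...
        'print_to_fcn',logger);
end
```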
m_date_r A string describing the response to the date missing in the downloaded web page. Usually, either the top bar will be present (which contains links), or the page itself will contain links, so this situation may indicate a problem with the save to the WBM. Allowed values are 'ignore', 'warning' and 'error'. Be aware that non-page content (such as images) will set off this response. Flags other than '*' will also set off this response.
default='warning'; if flags~='*' then default='ignore';
response The response variable is a cell array, where each row encodes one sequence of HTTP errors and the appropriate next action. The syntax of each row is as follows:
#1 If there is a sequence of failures that fits the first cell,
#2 and the HTTP error codes of the sequence are equal to the second cell,
#3 then respond as per the third cell.
The sequence of failures is encoded like this:
t1: failed attempt to load, t2: failed attempt to save, tx: either failed to load, or failed to save.
The error code list must consist of HTTP status codes. The Matlab timeout error is encoded with 4080 (analogous to the HTTP 408 timeout error code). The error is extracted from the identifier, which is not always possible, especially in the case of Octave.
The response in the third cell is either 'load', 'save', 'exit', or 'pause_retry'. Load and save set the preferred type. If the preferred response is not allowed by the remaining tries, the other response (save or load) is attempted instead, until sum(tries(1:2))==0. If the response is set to exit, or there is still no successful download after tries has been exhausted, the output file will be deleted and the script will exit. The pause_retry response is intended for use with a 429 error. See the err429 parameter for more options.
default={'tx',404,'load';'txtx',[404 404],'save';'tx',403,'save';'t2t2',[403 403],'exit';'tx',429,'pause_retry'};
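
The default response table is easier to read when written out one rule per row. The sketch below reproduces the default and appends one hypothetical extra rule. Whether the position of a rule in the table matters is not stated here, so appending at the end is an assumption.

```matlab
% Default rules: failure sequence, matching status codes, next action.
response = {...
    'tx'  ,404      ,'load';...
    'txtx',[404 404],'save';...
    'tx'  ,403      ,'save';...
    't2t2',[403 403],'exit';...
    'tx'  ,429      ,'pause_retry'};

% Hypothetical extra rule: give up after two consecutive 503 responses.
response(end+1,:) = {'txtx',[503 503],'exit'};

% Print the table in a readable form.
for n=1:size(response,1)
    fprintf('%-5s %-10s -> %s\n',...
        response{n,1},mat2str(response{n,2}),response{n,3});
end

if exist('WBM','file')           % skip the call if WBM is not on the path
    WBM(fullfile(tempdir,'wbm_example.html'),'http://example.com',...
        'response',response);
end
```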
err429 Sometimes the webserver will return a 429 status code. This should trigger a waiting period of a few seconds. This parameter controls the behavior of this function in case of a 429 status code. It is a struct with the following fields.
The CountsAsTry field (logical) describes if the attempt should decrease the tries counter.
The TimeToWait field (double) contains the time in seconds to wait before retrying.
The PrintAtVerbosityLevel field (double) contains the verbosity level at which a text should be printed, showing the user the function did not freeze.
default=struct('CountsAsTry',false,'TimeToWait',15,'PrintAtVerbosityLevel',3);
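
A sketch of a more conservative err429 setting. The values are illustrative, and all three fields are supplied because it is not stated whether a partial struct is accepted.

```matlab
% Wait longer after a 429 and let each such attempt count as a try.
err429 = struct(...
    'CountsAsTry',true,...            % a 429 reduces the tries counter
    'TimeToWait',30,...               % seconds to pause before retrying
    'PrintAtVerbosityLevel',1);       % report the pause at verbosity 1 or higher

if exist('WBM','file')                % skip the call if WBM is not on the path
    WBM(fullfile(tempdir,'wbm_example.html'),'http://example.com',...
        'err429',err429,'verbose',1);
end
```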
ignore The ignore variable is a vector with the same type of error codes as in the response variable. Ignored errors will only be ignored for the purposes of the response; they will not prevent the tries vector from decreasing.
default=4080;
flag The flags can be used to specify an explicit version of the archived page. The options are 'id' (identical), 'js' (Javascript), 'cs' (CSS), 'im' (image), 'fw'/'if' (iFrame), or '*' (explicitly expand date, shows calendar in browser mode). With the 'id' flag the page is shown as captured (i.e. without the WBM banner, making it ideal for e.g. exe files). With the 'id' and '*' flags the date check will fail, so the missing date response (m_date_r) will be invoked. For the 'im' flag you can circumvent this by first loading in the normal mode ('*'), and then extracting the image link from that page. That way you can enforce a date pattern and still get the image. The Wikipedia page suggests that a flag syntax requires a full date, but this seems not to be the case, as the date can still auto-expand.
default='*';
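
A sketch of downloading a file exactly as captured with the 'id' flag. The URL, date, and target file are hypothetical. The URL layout in the comment is the form commonly seen for Wayback Machine links with a flag and is shown only for orientation. How WBM builds its requests internally is not documented here.

```matlab
url_part  = 'http://example.com/setup.exe';   % hypothetical file to retrieve
date_part = '20200101000000';                 % hypothetical full time stamp
flag      = 'id';                             % return the capture unmodified

% Assumed layout: https://web.archive.org/web/<date><flag>_/<url>
wbm_url = sprintf('https://web.archive.org/web/%s%s_/%s',date_part,flag,url_part);
disp(wbm_url)

if exist('WBM','file')           % skip the call if WBM is not on the path
    outfilename = WBM(fullfile(tempdir,'setup.exe'),url_part,...
        'date_part',date_part,'flag',flag);
end
```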
waittime This value controls the maximum time that is spent waiting on the internet connection for each call of this function. This does not include the time waiting as a result of a 429 error. The input must be convertible to a scalar double. This is the time in seconds.
NB: Setting this to inf will cause an infinite loop if the internet connection is lost.
default=60;
timeout This value is the allowed timeout in seconds. It is ignored if it isn't supported. The input must be convertible to a scalar double.
default=10;

Compatibility, version info, and licence

Compatibility considerations:

Test suite result Windows XP/7/10 Ubuntu 20.04 LTS MacOS 10.15 Catalina
Matlab R2021a W10 : Pass
Matlab R2020b W10 : Pass
Matlab R2020a W10 : Pass
Matlab R2018a W10 : Pass Pass
Matlab R2015a W10 : Pass Pass
Matlab R2013b W10 : Pass
Matlab R2012b W10 : Pass
Matlab R2011a W10 : Pass Pass
Matlab R2010b Pass
Matlab R2010a W7 : Pass
Matlab R2007b W10 : Pass
Matlab 7.1 (R14SP3) XP : Pass
Matlab 6.5 (R13) W10 : Pass
Octave 6.2.0 W10 : Pass
Octave 5.2.0 W10 : Pass Pass
Octave 4.4.1 W10 : Pass Pass

Version: 3.0
Date:    2021-04-22
Author:  H.J. Wisselink
Licence: CC by-nc-sa 4.0 ( https://creativecommons.org/licenses/by-nc-sa/4.0 )
Email = 'h_j_wisselink*alumnus_utwente_nl';
Real_email = regexprep(Email,{'*','_'},{'@','.'})

Test suite

This tester is included so you can test if your own modifications would introduce any bugs. These tests form the basis for the compatibility table above.

Note that some of the functions in this tester might be different from the functions included in the actual function. Usually this is done to allow triggering of certain errors.

Even without comments or blank lines and compressing the functions down as much as possible, the tester function is too large for this page. The full tester function (including all comments) can be found here.