Wayback Machine API

Version 4.1.0 (147 KB) by Rik
This function acts as an API for the Wayback Machine (web.archive.org).
54 downloads
Updated 10 Apr 2024

WBM documentation



Description

With this function you can download captures from the Internet Archive that match a date pattern. If the current time matches the pattern and there is no valid capture, a new capture will be generated. The WBM time stamps are in UTC; a switch lets you provide the date-time pattern in local time, which is then converted to UTC internally.

This code enables you to use a specific web page in your data processing, without the need to check if the page has changed its structure or is not available at all.

You can also redirect all outputs (errors only partially) to a file or a graphics object, so you can more easily use this function in a GUI or allow it to write to a log file.

The usage instructions for the syntax of the WBM interface are derived from a Wikipedia help page. The fuzzy date matching behavior is based on this archive.org page.

If the Wayback Machine is useful to you, please consider donating. Based on a yearly operating cost of $35M and approximately 6k hits/s, please donate at least $1 for every 5000 requests. They don't block API access, require a login, or anything similar. If you abuse it, they may have to change that and you will have spoiled it for everyone. Please make sure your usage doesn't break this Nice Thing™.

Generally, each call to this function will result in two requests. A counter will be stored in a file. The WBMRequestCounterFile optional input can be used to interact with the file. Run WBM([],[],'WBMRequestCounterFile','read') to read the current count.
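The donation guideline can be checked with a few lines of arithmetic. The sketch below reproduces the rough cost per 5000 requests from the figures quoted above and turns a request count into a suggested amount. Rounding up to whole dollars is an assumption made here, the count of 12345 is a hypothetical placeholder, and the counter is assumed to be returned as a number.

```matlab
% Figures quoted in the text above.
yearly_cost      = 35e6;   % USD per year
hits_per_second  = 6e3;    % approximate request rate
requests_per_year = hits_per_second*365*24*3600;
cost_per_5000 = 5000*yearly_cost/requests_per_year;
fprintf('Approximate cost of 5000 requests: $%.2f\n',cost_per_5000);

% Hypothetical count, used when WBM is not on the path.
count = 12345;
if exist('WBM','file')
    % Documented call; the output is assumed to be numeric.
    count = WBM([],[],'WBMRequestCounterFile','read');
end

% Assumption: round up to whole dollars ("at least $1 for every 5000 requests").
suggested_donation = ceil(count/5000);
fprintf('%d requests so far; suggested donation: at least $%d\n', ...
    count,suggested_donation);
```

The computed cost comes out slightly below one dollar per 5000 requests, which is consistent with the suggested minimum.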

(These statistics are based on HTTP responses shown on this 60-day average chart and the 990 IRS forms posted by ProPublica.)

Matlab/Octave

Syntax

WBM(filename,url_part)
WBM(___,Name,Value)
WBM(___,options)
outfilename = WBM(___)
[outfilename,FileCaptureInfo] = WBM(___)
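
As an illustration of the syntax forms listed above, here is a minimal sketch of a call with an options struct and the equivalent Name,Value list. The target file name and the URL are hypothetical placeholders, and the call is only made when WBM is actually on the path.

```matlab
% Hypothetical target file and URL, chosen for illustration only.
filename = fullfile(tempdir,'example_capture.html');
url_part = 'http://example.com';

% Settings as a Name,Value list ...
name_value = {'date_part','2019','verbose',1};
% ... and the same settings as an options struct.
options = struct('date_part','2019','verbose',1);

if exist('WBM','file')
    % Both forms are documented; the struct form is used here.
    [outfilename,FileCaptureInfo] = WBM(filename,url_part,options);
    if isempty(outfilename)
        disp('The download failed.')
    else
        disp(outfilename)
        disp(FileCaptureInfo)
    end
end
```

Checking outfilename for emptiness is the documented way to detect a failed download.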

Output arguments

Argument Description
outfilename Full path of the output file. The variable is empty if the download failed.
FileCaptureInfo A struct containing the information about the downloaded file. It contains the timestamp of the file (in the 'timestamp' field), the flag used ('flag'), and the base URL ('url'). In short, all elements needed to form the full URL of the capture.
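
The three fields of FileCaptureInfo can be combined into the URL of the capture. The sketch below uses a hand-made struct with hypothetical values. It assumes that the flag is stored without its trailing underscore and that the common web.archive.org/web/ URL layout applies; neither is confirmed by this documentation, and the '*' flag is not handled.

```matlab
% Hypothetical capture info, mimicking the documented field names.
info = struct('timestamp','20240410123456', ...
              'flag','id', ...
              'url','http://example.com/');

% Assumption: the flag is stored without the trailing underscore that the
% Wayback Machine expects between the flag and the slash.
if isempty(info.flag)
    flag_part = '';
else
    flag_part = [info.flag '_'];
end

capture_url = ['https://web.archive.org/web/' info.timestamp flag_part '/' info.url];
disp(capture_url)
```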

Input arguments

Argument Description
filename The target filename in any format that websave (or urlwrite) accepts. If this file already exists, it will be overwritten in most cases.
url_part This URL will be searched for on the WBM. The URL might be changed (e.g. :80 is often added).
Name,Value The settings below can be entered with a Name,Value syntax.
options Instead of Name,Value pairs, parameters can also be entered as a struct. Missing fields will be set to the default values.

Name,Value pairs

Name Value
date_part A string with the date of the capture. It must be in the yyyymmddHHMMSS format, but doesn't have to be complete. Note that this is represented in UTC.
If incomplete, the Wayback Machine will return a capture that is as close to the midpoint of the matching range as possible. So for date_part='2' the range is 2000-01-01 00:00 to 2999-12-31 23:59:59, meaning the WBM will attempt to return the capture closest to 2499-12-31 23:59:59.
default='2';
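To make the range expansion concrete, the following sketch expands an incomplete date_part into its lower and upper bound. It is not the code used inside WBM. It assumes the pattern ends inside the year or on a field boundary (so 1 to 4, 6, 8, 10, 12, or 14 digits), and the midpoint is a plain arithmetic mean that may differ slightly from the value the function uses internally.

```matlab
date_part = '2';  % the documented default

% Assumption for this sketch: the pattern stops inside the year or on a
% field boundary.
n = numel(date_part);
assert(n>=1 && n<=14 && (n<=4 || mod(n,2)==0), ...
    'Unsupported pattern length for this sketch.');

lo_pad = '00000101000000';  % smallest value of every field
hi_pad = '99991231235959';  % largest value of every field
lower_bound = [date_part lo_pad(n+1:end)];
upper_bound = [date_part hi_pad(n+1:end)];

% If the day was not provided, use the real length of the month.
if n < 8
    y = str2double(upper_bound(1:4));
    m = str2double(upper_bound(5:6));
    upper_bound(7:8) = sprintf('%02d',eomday(y,m));
end

% Convert a full yyyymmddHHMMSS string to a datenum.
to_num = @(s) datenum(str2double(s(1:4)),str2double(s(5:6)), ...
    str2double(s(7:8)),str2double(s(9:10)), ...
    str2double(s(11:12)),str2double(s(13:14)));

% Plain arithmetic mean of the two bounds.
midpoint = (to_num(lower_bound)+to_num(upper_bound))/2;

fprintf('lower bound: %s\nupper bound: %s\n',lower_bound,upper_bound);
fprintf('midpoint   : %s\n',datestr(midpoint,'yyyy-mm-dd HH:MM:SS'));
```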
target_date Normally, the Wayback Machine will return the capture closest to the midpoint between the earliest valid date matching the date_part and the latest date matching the date_part. This parameter allows setting a different target, while still allowing a broad range of results. This can be used to skew the preference when loading a page. Like date_part, it must be in the yyyymmddHHMMSS format, and doesn't have to be complete. An example would be to provide 'date_part','2','target_date','20220630'. That way, if a capture is available from 2022, that will be loaded, but any result from 2000 to 2999 is allowed. If left empty, the midpoint determined by date_part will be used as the target. If the target is in the future (which will be determined by parsing the target to bounds and determining the midpoint), it will be cropped to the current local time minus 14 hours to avoid errors in the Wayback Machine API call.
default='';
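The cropping of a future target can be sketched in a few lines. This illustrates the rule described above and is not the internal implementation; the future target value is a hypothetical example.

```matlab
% The Name,Value example from the description above.
name_value = {'date_part','2','target_date','20220630'};

% Hypothetical target that lies in the future.
target = datenum(2999,6,30,12,0,0);

% Crop to the current local time minus 14 hours, as described above.
latest_allowed = now - 14/24;
cropped_target = min(target,latest_allowed);

fprintf('target used: %s\n',datestr(cropped_target,'yyyy-mm-dd HH:MM:SS'));
```

A target in the past, such as the midpoint of 2022-06-30, is left unchanged by the same min() call.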
UseLocalTime A scalar logical. Interpret the date_part in local time instead of UTC. This has the practical effect of the upper and lower bounds of the matching date being shifted by the timezone offset.
default=false;
tries A 1x3 vector. The first value is the total number of times an attempt to load the page is made, the second value is the number of save attempts and the last value is the number of timeouts allowed.
default=[5 4 4];
verbose A scalar denoting the verbosity. Level 0 will hide all errors that are caught. Level 1 will enable only warnings about the internet connection being down. Level 2 also includes errors NOT matching the usual pattern, and level 3 includes all other errors that get rethrown as warnings.
Octave uses libcurl, which makes error catching a bit more difficult. This will result in more HTTP errors being rethrown as warnings under Octave than under Matlab.
default=3;
if_UTC_failed This is a char array with the intended behavior for when this function is unable to determine the UTC. The options are 'error', 'warn_0', 'warn_1', 'warn_2', 'warn_3', and 'ignore'. For the options starting with warn_, a warning will be triggered if the 'verbose' parameter is set to this level or higher (so 'warn_0' will trigger a warning if 'verbose' is set to 0).
If this parameter is not set to 'error', the valid time range is expanded by -12 and +14 hours to account for all possible time zones, and the midpoint is shifted accordingly.
default='warn_3';
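The interaction between if_UTC_failed and verbose, and the widening of the time range, can be sketched as follows. This simulates the situation where the UTC time could not be determined. The bounds are a hypothetical one-day range and the logic is an interpretation of the description above, not the code inside WBM.

```matlab
if_UTC_failed = 'warn_3';  % documented default
verbose = 3;               % documented default

% Hypothetical bounds for a one-day pattern.
lower_bound = datenum(2024,4,10,0,0,0);
upper_bound = datenum(2024,4,10,23,59,59);

% A 'warn_N' setting triggers a warning if verbose is N or higher.
if strncmp(if_UTC_failed,'warn_',5)
    warn_level = str2double(if_UTC_failed(6:end));
    show_warning = verbose >= warn_level;
else
    show_warning = false;  % 'error' and 'ignore' do not produce a warning
end

% Unless an error is requested, widen the range to cover all time zones.
if ~strcmp(if_UTC_failed,'error')
    lower_bound = lower_bound - 12/24;
    upper_bound = upper_bound + 14/24;
end
midpoint = (lower_bound+upper_bound)/2;  % ends up one hour later than before
```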
m_date_r A string describing the response to the date missing in the downloaded web page. Usually, either the top bar will be present (which contains links), or the page itself will contain links, so this situation may indicate a problem with the save to the WBM. Allowed values are 'ignore', 'warning' and 'error'. Be aware that non-page content (such as images) will set off this response. Flags other than the default will also set off this response.
default='warning'; if flags is not default then default='ignore';
response The response variable is a cell array, where each row encodes one sequence of HTTP errors and the appropriate next action. The syntax of each row is as follows:
#1 If there is a sequence of failures that fits the first cell,
#2 and the HTTP error codes of the sequence are equal to the second cell,
#3 then respond as per the third cell.
The sequence of failures is encoded like this:
t1: failed attempt to load, t2: failed attempt to save, tx: either failed to load, or failed to save.
The error code list must contain HTTP status codes. The Matlab timeout error is encoded with 4080 (analogous to the HTTP 408 timeout error code). The error code is extracted from the identifier, which is not always possible, especially in the case of Octave.
The response in the third cell is either 'load', 'save', 'exit', or 'pause_retry'. Load and save set the preferred type. If a response is not allowed by 'tries' left, the other response (save or load) is tried, until sum(tries(1:2))==0. If the response is set to exit, or there is still no successful download after tries has been exhausted, the output file will be deleted and the script will exit. The pause_retry is intended for use with an error 429. See the err429 parameter for more options.
default={'tx',404,'load';'txtx',[404 404],'save';'tx',403,'save';'t2t2',[403 403],'exit';'tx',429,'pause_retry';'t2t2t2',429,'exit'};
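To illustrate how such a table could be read, the sketch below matches the default rows against a hypothetical failure history. This is one interpretation of the rules above, not the matching code used by WBM. It assumes that each row is compared against the most recent failures, that a scalar code applies to every step of the sequence, and that a later matching row overrides an earlier one.

```matlab
% Default table from the documentation.
response = {'tx',404,'load'; 'txtx',[404 404],'save'; 'tx',403,'save'; ...
    't2t2',[403 403],'exit'; 'tx',429,'pause_retry'; 't2t2t2',429,'exit'};

% Hypothetical history: two failed load attempts, both with a 404.
types = {'t1','t1'};
codes = [404 404];

action = '';
for r = 1:size(response,1)
    % Split e.g. 't2t2' into one row per step: ['t2';'t2'].
    pattern = reshape(response{r,1},2,[]).';
    k = size(pattern,1);
    if k > numel(codes), continue, end  % history is too short for this row

    wanted = response{r,2};
    if numel(wanted)==1
        wanted = repmat(wanted,1,k);  % assumption: scalar applies to all steps
    end

    recent_types = types(end-k+1:end);
    recent_codes = codes(end-k+1:end);

    % 'tx' matches either type; 't1' and 't2' must match exactly.
    type_ok = true;
    for s = 1:k
        step = pattern(s,:);
        type_ok = type_ok && (strcmp(step,'tx') || strcmp(step,recent_types{s}));
    end

    if type_ok && isequal(recent_codes(:),wanted(:))
        action = response{r,3};  % assumption: later rows take precedence
    end
end

fprintf('next action: %s\n',action);
```

With this history both the first and the second row match, so under the stated precedence assumption the result is a save attempt.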
err429 Sometimes the webserver will return a 429 status code. This should trigger a waiting period of a few seconds. If this status code is returned 3 times for a save, that probably means the maximum number of saves has been exceeded. Disable saves when retrying within 24 hours, as they will keep leading to this error code.
This parameter controls the behavior of this function in case of a 429 status code. It is a struct with the following fields:
The CountsAsTry field (logical) describes if the attempt should decrease the tries counter.
The TimeToWait field (double) contains the time in seconds to wait before retrying.
The PrintAtVerbosityLevel field (double) contains the verbosity level at which a text should be printed, showing the user the function did not freeze.
default=struct('CountsAsTry',false,'TimeToWait',15,'PrintAtVerbosityLevel',3);
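A possible configuration for a retry within 24 hours of hitting the save limit is sketched below. Setting the second element of tries to 0 is assumed here to be the way to disable saves, based on the description of the tries parameter. The file name, URL, and the chosen waiting time are hypothetical.

```matlab
options = struct();
% Assumption: 0 save attempts disables saving; load attempts and timeouts
% keep their default values.
options.tries = [5 0 4];
% Wait longer than the default 15 seconds and report the pause at a lower
% verbosity level.
options.err429 = struct('CountsAsTry',false, ...
                        'TimeToWait',30, ...
                        'PrintAtVerbosityLevel',1);

if exist('WBM','file')
    % Hypothetical file name and URL.
    outfilename = WBM(fullfile(tempdir,'retry_capture.html'), ...
        'http://example.com',options);
end
```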
ignore The ignore variable is a vector with the same type of error codes as in the response variable. Ignored errors will only be ignored for the purposes of the response; they will not prevent the tries vector from decreasing.
default=4080;
flag The flags can be used to specify an explicit version of the archived page. The options are '', '*', 'id' (identical), 'js' (JavaScript), 'cs' (CSS), 'im' (image), 'fw'/'if' (iFrame). An empty flag will only expand the date. Providing '*' used to explicitly expand the date and only show the calendar view when using a browser, but it now also seems to load the calendar with websave/urlwrite. With the 'id' flag the page is shown as captured (i.e. without the WBM banner, making it ideal for e.g. exe files). With the 'id' and '*' flags the date check will fail, so the missing date response (m_date_r) will be invoked. For the 'im' flag you can circumvent this by first loading in the normal mode (''), and then extracting the image link from that page. That way you can enforce a date pattern and still get the image. The Wikipedia page suggests that the flag syntax requires a full date, but this seems not to be the case, as the date can still auto-expand.
default='';
waittime This value controls the maximum time that is spent waiting on the internet connection for each call of this function. This does not include the time waiting as a result of a 429 error. The input must be convertible to a scalar double. This is the time in seconds.
NB: Setting this to inf will cause an infinite loop if the internet connection is lost.
default=60;
timeout This value is the allowed timeout in seconds. It is ignored if it isn't supported. The input must be convertible to a scalar double.
default=10;
WBMRequestCounterFile This must be empty, or a char containing 'read' or 'reset'. If it is provided, all other inputs are ignored, except the exception redirection. That means count=WBM([],[],'WBMRequestCounterFile','read'); is a valid call. For the 'read' input, the output will contain the number of requests posted to the Wayback Machine. This counter is intended to cover all releases of Matlab and GNU Octave. Using the 'reset' switch will reset the counter back to 0.
default='';
print_to_con An attempt is made to also use this parameter for warnings or errors during input parsing.
A logical that controls whether warnings and other output will be printed to the command window. Errors can't be turned off.
default=true;
Specifying print_to_fid, print_to_obj, or print_to_fcn will change the default to false, unless parsing of any of the other exception redirection options results in an error.
print_to_fid An attempt is made to also use this parameter for warnings or errors during input parsing.
The file identifier where console output will be printed. Errors and warnings will be printed including the call stack. You can provide the fid for the command window (fid=1) to print warnings as text. Errors will be printed to the specified file before the error is actually thrown.
If print_to_fid, print_to_obj, and print_to_fcn are all empty, this will have the effect of suppressing every output except errors.
Array inputs are allowed.
default=[];
print_to_obj An attempt is made to also use this parameter for warnings or errors during input parsing.
The handle to an object with a String property, e.g. an edit field in a GUI where console output will be printed. Messages with newline characters (ignoring trailing newlines) will be returned as a cell array. This includes warnings and errors, which will be printed without the call stack. Errors will be written to the object before the error is actually thrown.
If print_to_fid, print_to_obj, and print_to_fcn are all empty, this will have the effect of suppressing every output except errors.
Array inputs are allowed.
default=[];
print_to_fcn An attempt is made to also use this parameter for warnings or errors during input parsing.
A struct with a function handle, anonymous function or inline function in the 'h' field and optionally additional data in the 'data' field. The function should accept three inputs: a char array (either 'warning' or 'error'), a struct with the message, id, and stack, and the optional additional data. The function(s) will be run before the error is actually thrown.
If print_to_fid, print_to_obj, and print_to_fcn are all empty, this will have the effect of suppressing every output except errors.
Array inputs are allowed.
default=[];
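
A minimal sketch of a print_to_fcn handler is shown below. The handler takes the three documented inputs. The field name 'message' of the second input and the 'prefix' field of the additional data are assumptions made for this example, as are the file name and URL in the guarded call.

```matlab
% Formatter: turns one warning or error into a single log line.
format_line = @(type,ME,data) ...
    sprintf('[%s] %s: %s',data.prefix,upper(type),ME.message);

% Handler with the three documented inputs; here it only prints the line.
handler = @(type,ME,data) fprintf('%s\n',format_line(type,ME,data));

% 'h' holds the function handle, 'data' the optional additional data.
print_to_fcn = struct('h',handler,'data',struct('prefix','WBM'));

% Dry run with a hand-made message struct (hypothetical content).
example = struct('message','connection is down','id','WBM:example','stack',[]);
print_to_fcn.h('warning',example,print_to_fcn.data);

if exist('WBM','file')
    % Hypothetical file name and URL.
    WBM(fullfile(tempdir,'logged_capture.html'),'http://example.com', ...
        'print_to_fcn',print_to_fcn);
end
```

Keeping the formatting in a separate anonymous function makes it easy to reuse the same line format for a log file or a GUI element.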

Compatibility, version info, and licence

Compatibility considerations:

  • HTTP error codes are harder to catch on Octave. Depending on the selected verbosity level, this means more warnings will be shown.
  • The duration of a timeout can only be set with websave. This means that for larger files or less stable internet connections, a timeout error will be more likely when using older releases or Octave.
Test suite result Windows Linux MacOS
Matlab R2024a W11 : Pass ubuntu_22.04 : Pass Monterey : Pass
Matlab R2023b W11 : Pass ubuntu_22.04 : Pass Monterey : Pass
Matlab R2023a W11 : Pass
Matlab R2022b W11 : Pass ubuntu_22.04 : Pass Monterey : Pass
Matlab R2022a W11 : Pass
Matlab R2021b W11 : Pass ubuntu_22.04 : Pass Monterey : Pass
Matlab R2021a W11 : Pass
Matlab R2020b W11 : Pass ubuntu_22.04 : Pass Monterey : Pass
Matlab R2020a W11 : Pass
Matlab R2019b W11 : Pass ubuntu_22.04 : Pass Monterey : Pass
Matlab R2019a W11 : Pass
Matlab R2018b W11 : ubuntu_22.04 : Pass Monterey : Pass
Matlab R2018a W11 : Pass
Matlab R2017b W11 : Pass ubuntu_22.04 : Pass Monterey : Pass
Matlab R2016b W11 : Pass ubuntu_22.04 : Pass Monterey : Pass
Matlab R2015a W11 : Pass
Matlab R2013b W11 : Pass
Matlab R2007b W11 : Pass
Matlab 6.5 (R13) W11 : Pass
Octave 8.4.0 W11 : Pass
Octave 7.2.0 W11 : Pass
Octave 6.2.0 W11 : Pass raspbian_11 : Pass Monterey : Pass
Octave 5.2.0 W11 : Pass
Octave 4.4.1 W11 : Pass
Version: 4.1.0
Date:    2024-04-10
Author:  H.J. Wisselink
Licence: CC by-nc-sa 4.0 ( https://creativecommons.org/licenses/by-nc-sa/4.0 )
Email = 'h_j_wisselink*alumnus_utwente_nl';
Real_email = regexprep(Email,{'*','_'},{'@','.'})

Test suite

The tester is included so you can test if your own modifications would introduce any bugs. These tests form the basis for the compatibility table above. Note that functions may be different between the tester version and the normal function. Make sure to apply any modifications to both.

Cite as

Rik (2024). Wayback Machine API (https://github.com/thrynae/WBM/releases/tag/v4.1.0), GitHub. Retrieved .

MATLAB release compatibility
Created with R2024a
Compatible with any release from R13SP1 onward
Platform compatibility
Windows macOS Linux

Version Published Release notes
4.1.0

See release notes for this release on GitHub: https://github.com/thrynae/WBM/releases/tag/v4.1.0

4.0.2

See release notes for this release on GitHub: https://github.com/thrynae/WBM/releases/tag/v4.0.2

4.0.1.0

See release notes for this release on GitHub: https://github.com/thrynae/WBM/releases/tag/v4.0.1

4.0.0

See release notes for this release on GitHub: https://github.com/thrynae/WBM/releases/tag/v4.0.0

3.1.0

See release notes for this release on GitHub: https://github.com/thrynae/WBM/releases/tag/v3.1.0

3.0.0

See release notes for this release on GitHub: https://github.com/thrynae/WBM/releases/tag/v3.0.0

2.0.0

See release notes for this release on GitHub: https://github.com/thrynae/WBM/releases/tag/v2.0.0

1.6

See release notes for this release on GitHub: https://github.com/thrynae/WBM/releases/tag/1.6

1.5.0.1

uploaded wrong file

1.5.0.0

more input options, much better robustness (including handling of changes in archive.org), and updated dependencies

1.4.0.0

test function is now included, new input parsing, more robust method for file reading, and minor tweaks

1.3.0.0

small bugfixes

1.2.0.0

minor improvements
added compatibility for R13 (MATLAB 6.5)

1.1.0.0

Connection check failed on Ubuntu; this is now fixed

1.0.0.0

To view or report issues with this GitHub add-on, visit the GitHub repository.