D*/gei_digital Time Series: Help

Introduction

The D* time series web-service provides a RESTful API and a simple browser-based user interface for acquisition and display of time-series data from an associated DDC corpus search engine, with optional smoothing and outlier detection. See "User Interface" for details on the browser-based user interface, and see "REST API" for details on the underlying RESTful API.

User Interface

Upon accessing the top-level service URL for a given corpus (http://diacollo.gei.de/gei-digital-2020/hist.perl) in a web browser, the user is presented with a graphical interface in which queries can be constructed and submitted to the underlying DDC server. This section describes the various elements of that interface.
Query Form
The top of the user interface contains a simple HTML query form with input widgets corresponding to the various parameters of the underlying REST API. Hovering the mouse over an input widget should cause a tooltip to be displayed briefly describing the corresponding parameter's function. Changes made to the parameters in the query form will only cause the plot area to be updated after submitting the form by clicking on the "submit" button or pressing the "Enter" key when a text input widget is focused.
Button Bar
The bottom of the header area contains a number of text-mode buttons for navigation and export of the data-set underlying the current plot display (if any).
Plot Area
The main body of the user interface is reserved for the plot image data returned by the underlying REST API for the parameter set last submitted from the query form.
The bottom of the user interface contains a footer with some administrative information.

REST API

The underlying REST API is responsible for data acquisition and optional plot image generation, and can be accessed directly by querying http://diacollo.gei.de/gei-digital-2020/dhist-plot.perl with appropriate parameters.

Parameters

The web-service accepts a set of parameter=value pairs for each request and returns a corresponding time series as either a raw data-set or a two-dimensional plot image. Parameters are passed to the service via the URL query string or an HTTP POST request, as for a standard web form (e.g. using the browser-based user interface).
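
As a rough illustration of passing parameter=value pairs in the URL query string, the following Python sketch assembles a request URL for the endpoint named in the "REST API" section. The helper name build_request_url and the sample query term "Schule" are assumptions made for this example only; they are not part of the service.

```python
from urllib.parse import urlencode

# Endpoint as given in the "REST API" section.
BASE_URL = "http://diacollo.gei.de/gei-digital-2020/dhist-plot.perl"


def build_request_url(query, **params):
    """Return a GET URL for the time series service (hypothetical helper).

    `query` is the required DDC context query; any further keyword
    arguments are passed through unchanged as parameter=value pairs.
    """
    pairs = {"query": query}
    pairs.update(params)
    # urlencode() escapes spaces and special characters in the values.
    return BASE_URL + "?" + urlencode(pairs)


# "Schule" is only a sample query term chosen for this sketch.
url = build_request_url("Schule", slice="10", norm="date", pformat="json")
print(url)
```

The same pairs could equally be sent as the body of an HTTP POST request, as for a standard web form.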

Basic Parameters

The following parameters influence the data acquisition and smoothing processes used to generate the final time series data set.
query
Aliases: query, qu, q, lemma, lem, l
REQUIRED

Target context query in the DDC query language. The query expression Q specified to this service should be a "Context Query" (query_conditions); the actual request sent to the underlying DDC server will be something like: COUNT(Q #SEP) #BY[date/1, textClass]

slice
Aliases: sliceby, slice, sl, s
Default: 5

Size S in years of date intervals ("epochs", "slices") into which time series data are to be partitioned, with optional offset suffix +O or -O, where 0 ≤ |O| < S. If no offset term ±O is explicitly specified but the xrange parameter contains a non-trivial minimum date (xrange=xmin:xmax with xmin ≠ *), O is assigned a default value (O = xmin mod S) such that the initial epoch begins exactly at xmin; otherwise O defaults to 0 (zero). See "Epoch Partitioning" for details.

norm
Aliases: normalize, norm, n
Default: date+class

Categorization mode for selecting the frequency-per-million scaling coefficient $C_{x,z}$. Accepted values: date+class, date, class, corpus, none. See "Scaling & Normalization" for details.

window
Aliases: window, win, w
Default: 1

Number W of adjacent epochs (before and/or after) to include in the moving-average smoothing window for each data point. See "Moving-Average Smoothing" for details.
wbase
Aliases: wbase, wb, W
Default: 1

Inverse-distance smoothing base (real number B) for weighted moving-average smoothing. If the parameter is passed as wbase=0, the default value (B=1) is used, resulting in a uniform weighting scheme for all epochs contributing to $y_{x,z}$. May also be passed as wbase=e to use the natural exponent B = e = exp(1) = 2.71828…. See "Moving-Average Smoothing" for details.
logavg
Aliases: logavg, loga, la, lognorm, logn, log, ln
Default: 0

If true (nonzero), moving averages will be computed on a logarithmic scale. See "Moving-Average Smoothing" for details.
totals
Aliases: totals, tot, T
Default: 0

If true (nonzero), total corpus sizes will be plotted rather than number of hits for the specified query. Should be equivalent to setting query=*.

single
Aliases: single, sing, sg
Default: 0

If true (nonzero), consolidates all corpus-encoded text-classes into a single universal class ("Gesamt"), and displays only the single time series for this pseudo-class.

grand
Aliases: grand, gr, g
Default: 0

If true (nonzero), displays a grand-average curve with pseudo-class "Gesamt" (analogous to that generated by specifying single=1) in addition to the curves for individual text classes.

gaps
Aliases: gaps, gap
Default: 0

If true (nonzero), missing data-points count(x,z) in the data returned by the DDC query (indicating no hits in any year assigned to the epoch x) will not be generated or passed to gnuplot. This behavior can lead to unexpected smoothing phenomena, since gnuplot interpolates over missing x-axis values. The default behavior (gaps=0) generates explicit data points count(x,z)=0 (zero) for missing values.

prune
Aliases: prune, pr
Default: 0

Inverse confidence level for outlier detection (0: no pruning, .05: 95% confidence level). See "Outlier Detection" for details.

Format Parameters

The following parameters influence only the formatting of the result set, and have no effect on the dataset itself. The pformat parameter selects the result format, the pretty parameter enables or disables pretty-printing for JSON-mode output, and the remaining parameters act as wrappers for gnuplot options used to generate plot images.
pformat
Aliases: pformat, pfmt, pf, format, fmt, f
Default: svg

Specifies the desired output format. Accepted values: eps, epsmono, gp, json, pdf, pdfmono, png, ps, psmono, svg, text. See "Output Formats" for details.
xlabel
Aliases: xlabel, xlab, xl
Default: date

x-axis label; specify xlabel=none to suppress

ylabel
Aliases: ylabel, ylab, yl
Default: (depends on the requested normalization mode)

gnuplot y-axis label; specify ylabel=none to suppress

xrange
Aliases: xrange, xr
Default: *:*

gnuplot x-axis range xmin:xmax to be plotted, where xmin and xmax are either the minimum and maximum values to be plotted (respectively), or an asterisk character (*) indicating that the limit should be auto-determined by gnuplot.

yrange
Aliases: yrange, yr
Default: 0:*

gnuplot y-axis range ymin:ymax to be plotted, syntax analogous to the xrange parameter.

logscale
Aliases: logscale, lscale, ls, logy, ly
Default: 0

Enable or disable log-scale base for gnuplot y-axis. A value of 0 (zero, default) disables log scaling, a value of 1 enables base-2 log scaling, and any other value logscale=b enables base-b log scaling.

title
Default: (auto-generated)

gnuplot title; specify title=none to suppress.

size
Aliases: psize, psiz, psz, size, siz, sz
Default: 840,480

gnuplot output image size WIDTH,HEIGHT (in pixels).

key
Aliases: key, legend, leg
Default: inside right top

gnuplot legend location; specify key=off or key=none to suppress.

smooth
Aliases: smooth, sm
Default: none

gnuplot smoothing method, used between supplied data points. Accepted values are none, csplines, and bezier. The default value smooth=none results in simple linear interpolation between supplied data points.

style
Default: lines

gnuplot curve plotting style. Accepted values are lines, points, and linespoints.

grid
Default: 0

If true (nonzero), enable gnuplot grid for axis tic-lines.

bare
Default: 0

If true (nonzero), produce a compact "bare" plot for use by www.dwds.de.

pretty
Default: 0

If true (nonzero), JSON format output will be pretty-printed. Primarily useful for debugging.

Output Formats

The REST API supports a number of different output formats for time series data-sets and plot images. The desired output format can be specified with the pformat parameter. See "Dataset Formats" for a list of supported raw data formats, and see "Plot Formats" for a list of supported image formats.

Dataset Formats

The following pformat options return the (smoothed) time-series dataset in a form suitable for further automated processing. They will not be displayed correctly in the browser-based user interface.
gp
Aliases: gnuplot, gp

Returns a standalone gnuplot script including data block(s) for the smoothed data. Used internally to generate the plot formats.

json
Returns the data-set encoded as a flat JSON array, where each element represents a single data point as an object of the form:
{
  "date" : "1900",              /* epoch label (x) */
  "class" : "Belletristik",     /* text class (z) */
  "val" : 123.45,               /* smoothed frequency (y) */
  "raw" : 109                   /* raw sample count */
}
text
Aliases: text, txt, tab, tsv, csv, dat

Returns the dataset as TAB-separated, UTF-8 encoded text. The returned data set contains one line for each data point, and each line is divided into three TAB-separated columns. The first column is the (smoothed) frequency value (y), the second column is the epoch label (x), and the final column is the associated text-class (z).
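
For automated processing, the three-column text format described above can be read with a few lines of Python. This is a minimal sketch: the function name parse_text_dataset and the sample rows are invented for illustration and do not come from the service.

```python
def parse_text_dataset(text):
    """Parse TAB-separated output into (value, epoch, text_class) tuples.

    Column order follows the description of the text format:
    smoothed frequency (y), epoch label (x), text-class (z).
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        value, epoch, text_class = line.split("\t")
        rows.append((float(value), epoch, text_class))
    return rows


# Made-up sample response body with two data points.
sample = "123.45\t1900\tBelletristik\n98.7\t1910\tBelletristik\n"
rows = parse_text_dataset(sample)
print(rows)
```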

Plot Formats

The following pformat options return two-dimensional time-series plot images for the given request parameters. Plot images are generated by the gnuplot program using the script returned by the gp format option. Most of these formats will not be displayed correctly in the browser-based user interface without an appropriate browser plug-in.
eps
encapsulated postscript plot
epsmono
encapsulated postscript plot, monochrome
pdf
PDF plot (gnuplot, download)
pdfmono
PDF plot, monochrome
png
portable network graphics plot (browser-safe)
ps
postscript plot
psmono
postscript plot, monochrome
svg
scalable vector graphics plot (browser-safe, default)

Gory Details

This section describes the details of the data acquisition and smoothing process.

Notation

The following table shows variables & notation used in this section.
Symbol                    Description
D                         set of all dates (years) in the corpus
Z                         set of all text classes in the corpus
ℕ                         set of all natural numbers (non-negative integers)
Q                         DDC query expression as given by the query parameter
⟦Q⟧                       set of all corpus hits (tokens) for the query Q
h                         single corpus hit (token)
date(h)                   date (year) of hit h as returned by DDC
class(h)                  text-class of hit h as returned by DDC
|ɑ|                       absolute value (ɑ numeric) or set size (ɑ a set)
$f_Q$                     raw corpus-frequency function
S                         epoch size in years as given by the slice parameter
O                         epoch offset in years, as given by the slice or xrange parameter
epoch                     epoch partitioning function (date → epoch)
$\mathrm{epoch}^{-1}$     inverse epoch partitioning function (epoch → date(s))
$f_{Q,S,O}$               epoch-wise corpus frequency function
$C_{x,z}$                 frequency-per-million scaling factor, selected by the norm parameter
$\tilde{f}_{Q,S,O}$       frequency-per-million distribution
W                         size of moving-average smoothing window in epochs as given by the window parameter
$\tilde{f}_{Q,S,O,W}$     smoothed epoch-wise frequency function with uniform weighting
$\tilde{f}_{Q,S,O,W,B}$   smoothed epoch-wise frequency function with exponential discounting using wbase
x                         epoch label, x-axis plot coordinate, independent variable
y                         smoothed pseudo-frequency, y-axis plot coordinate, dependent variable
z                         text class, sub-plot identifier, independent variable

Raw Frequency Data

The raw sample data is extracted from a DDC count()-query of the form COUNT(Q #SEP) #BY[date/1, textClass]. The raw sample data is essentially a partial frequency function $f_Q : D \times Z \to \mathbb{N} : (d,z) \mapsto |\{h \in ⟦Q⟧ : \mathrm{date}(h)=d \;\&\; \mathrm{class}(h)=z\}|$ mapping independent variable pairs of date d and text-class z to the raw number of attested hits (tokens) for query Q in the associated corpus.

Epoch Partitioning (Date-Slicing)

The slice parameter size S and offset O together determine how raw sample dates (years) are aggregated into independent date intervals ("epochs") for purposes of smoothing and plotting. A single aggregated epoch-frequency $f_{Q,S,O}(x,z)$ will be computed for each pair of epoch (x) and text-class (z) in the selected range, whereby each epoch label is congruent to the offset O (usually zero) modulo S. Raw sample dates d are mapped to epoch-labels x = epoch(d), and raw yearly frequencies are summed within each epoch x to define the epoch-frequency function $f_{Q,S,O}(x,z)$:
\[ \mathrm{epoch}(d) = O + S \times \left\lfloor \frac{d - O}{S} \right\rfloor \]
\[ f_{Q,S,O}(x,z) = \sum_{d \in \mathrm{epoch}^{-1}(x)} f_Q(d,z) \]
For example, specifying slice=10 or slice=10+0 requests zero-offset decade epochs, and would result in epoch-labels 1900, 1910, 1920, etc. representing the date intervals 1900-1909, 1910-1919, 1920-1929, etc. Requesting slice=100+5 would result in century epochs offset by 5 years with labels such as 1705 (~1705-1804), 1805 (~1805-1904), 1905 (~1905-2004), etc.
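
The partitioning rules can be tried out with a short Python sketch. The function names (parse_slice, epoch, epoch_frequencies) and the dictionary layout for the raw counts are assumptions made for this illustration; the service itself may be implemented differently.

```python
from collections import defaultdict


def parse_slice(spec, xmin=None):
    """Split a slice value such as "10", "10+0" or "100-5" into (S, O)."""
    spec = str(spec)
    for sign in ("+", "-"):
        if sign in spec:
            size, offset = spec.split(sign, 1)
            return int(size), int(sign + offset)
    size = int(spec)
    # No explicit offset: align the first epoch with xmin if one was given.
    return size, (xmin % size if xmin is not None else 0)


def epoch(d, S, O=0):
    """Map a year d to its epoch label: O + S * floor((d - O) / S)."""
    return O + S * ((d - O) // S)


def epoch_frequencies(raw, S, O=0):
    """Sum raw yearly counts {(year, cls): n} into epoch counts {(x, cls): n}."""
    out = defaultdict(int)
    for (year, cls), n in raw.items():
        out[(epoch(year, S, O), cls)] += n
    return dict(out)


print(epoch(1909, 10))      # zero-offset decade containing 1909
print(epoch(1804, 100, 5))  # century epoch offset by 5 years
```

With slice=10 the years 1900 to 1909 all map to the label 1900, and with slice=100+5 the years 1705 to 1804 all map to the label 1705, matching the examples above.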

Scaling & Normalization

To facilitate comparability of plotted values across (sub)corpora of varying size, raw epoch frequency counts may be scaled to frequency-per-million-tokens values by a simple linear projection as requested by the norm parameter. Formally, the norm parameter is used to select a scaling factor $C_{x,z}$ by which raw epoch frequencies $f_{Q,S,O}(x,z)$ are mapped to normalized pseudo-frequencies $\tilde{f}_{Q,S,O}(x,z)$:
\[ \tilde{f}_{Q,S,O}(x,z) = \frac{f_{Q,S,O}(x,z)}{C_{x,z}} \]

The accepted values for the norm parameter populate $C_{x,z}$ as follows, for $f_*$ the raw global corpus size function (the corresponding DDC count query is given in parentheses):

  • date+class (default): normalize by date interval and text-class; $C_{x,z} = f_{*,S,O}(x,z) / 10^6$ (DDC: COUNT(* #SEP #ASC_DATE[x,x+S] #HAS[textClass,z]) / 10^6)
  • date: normalize by date interval only (over all text-classes); $C_{x,z} = \sum_{z' \in Z} f_{*,S,O}(x,z') / 10^6$ (DDC: COUNT(* #SEP #ASC_DATE[x,x+S]) / 10^6)
  • class: normalize by text-class only (over all dates); $C_{x,z} = \sum_{d \in D} f_*(d,z) / 10^6$ (DDC: COUNT(* #SEP #HAS[textClass,z]) / 10^6)
  • corpus: normalize corpus-globally (over all date intervals and text-classes); $C_{x,z} = \sum_{d \in D,\, z' \in Z} f_*(d,z') / 10^6$ (DDC: COUNT(* #SEP) / 10^6)
  • none: do not normalize at all, but operate on raw absolute frequency counts; $C_{x,z} = 1$
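
The five normalization modes can be compared side by side in a small Python sketch. It assumes that corpus sizes are available as a dictionary keyed by (year, text-class) and that the epoch labelled x covers the years x to x+S-1; the function names and this data layout are hypothetical and chosen only for illustration.

```python
def scaling_factor(sizes, x, z, S, norm="date+class"):
    """Return C_{x,z} for corpus sizes given as {(year, text_class): tokens}."""
    if norm == "none":
        return 1.0

    def in_epoch(d):
        # Assumption: epoch x covers the years x .. x+S-1.
        return x <= d < x + S

    if norm == "date+class":
        total = sum(n for (d, c), n in sizes.items() if in_epoch(d) and c == z)
    elif norm == "date":
        total = sum(n for (d, c), n in sizes.items() if in_epoch(d))
    elif norm == "class":
        total = sum(n for (d, c), n in sizes.items() if c == z)
    elif norm == "corpus":
        total = sum(sizes.values())
    else:
        raise ValueError(f"unknown norm mode: {norm!r}")
    return total / 1e6


def per_million(count, C):
    """Scale a raw epoch frequency to a frequency-per-million value."""
    return count / C


# Made-up corpus sizes (tokens per year and text-class).
sizes = {(1900, "A"): 2_000_000, (1905, "A"): 1_000_000,
         (1900, "B"): 1_000_000, (1910, "A"): 4_000_000}
for mode in ("date+class", "date", "class", "corpus", "none"):
    print(mode, scaling_factor(sizes, 1900, "A", 10, mode))
```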

Outlier Detection

If the prune parameter is specified and nonzero, an error-distribution for the normalized (but unsmoothed) sample points $\tilde{f}_{Q,S,O}(x,z)$ with respect to a double-exponential filtered "expectation function" will be computed using two calls to PDL::Stats::TS::filter_exp() (right-to-left and left-to-right) and averaging the results. The observed "errors" are converted to p-values assuming a normal (Gaussian) distribution, and all sample points with p-values outside of the specified confidence interval are treated as outliers. Sample points $\tilde{f}_{Q,S,O}(x,z)$ thus identified as outliers are removed and replaced by linear interpolation over their nearest non-outlier neighbors. See Jurish (2016) for an example (in German).
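
The following Python sketch walks through the same steps on a single series. It is not the service's implementation: a plain exponential filter stands in for PDL::Stats::TS::filter_exp(), and the smoothing constant alpha=0.5, the use of the population standard deviation, the two-sided p-value, and the handling of outliers at the series boundary are all assumptions made for this illustration.

```python
import math


def exp_filter(values, alpha):
    """Simple exponential smoothing: s[t] = alpha*v[t] + (1-alpha)*s[t-1]."""
    out, s = [], values[0]
    for v in values:
        s = alpha * v + (1 - alpha) * s
        out.append(s)
    return out


def prune_outliers(values, prune=0.05, alpha=0.5):
    """Replace points whose two-sided p-value falls below `prune`."""
    fwd = exp_filter(values, alpha)               # left-to-right pass
    bwd = exp_filter(values[::-1], alpha)[::-1]   # right-to-left pass
    expected = [(f + b) / 2 for f, b in zip(fwd, bwd)]
    errors = [v - e for v, e in zip(values, expected)]
    mean = sum(errors) / len(errors)
    sd = math.sqrt(sum((e - mean) ** 2 for e in errors) / len(errors))
    if sd == 0:
        return list(values)
    # Two-sided p-value of each error under a normal distribution.
    pvals = [math.erfc(abs(e - mean) / (sd * math.sqrt(2))) for e in errors]
    keep = [i for i, p in enumerate(pvals) if p >= prune]
    out = list(values)
    for i, p in enumerate(pvals):
        if p >= prune:
            continue
        left = max((k for k in keep if k < i), default=None)
        right = min((k for k in keep if k > i), default=None)
        if left is None and right is None:
            continue  # nothing to interpolate from
        if left is None:
            out[i] = values[right]   # assumption: copy nearest neighbor at the edge
        elif right is None:
            out[i] = values[left]
        else:
            t = (i - left) / (right - left)
            out[i] = values[left] + t * (values[right] - values[left])
    return out


# Made-up flat series with a single spike at position 10.
series = [10] * 20
series[10] = 100
print(prune_outliers(series, prune=0.05))
```

With prune=0 no point can fall below the threshold, so the series is returned unchanged, which matches the "no pruning" default.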

Moving-Average Smoothing

To minimize visual interference ("zig-zags") arising from sparse sample data, the pseudo-frequency distribution data $\tilde{f}_{Q,S,O}(x,z)$ are passed through a moving-average smoothing filter over the immediately adjacent (preceding and following) W epochs of the corresponding text class as specified by the window parameter, with optional exponential discounting as requested by the wbase parameter, resulting in a smoothed pseudo-frequency distribution $\tilde{f}_{Q,S,O,W}(\cdot,\cdot)$.

Requesting window=0 disables moving-average smoothing: $y_{x,z} = y_{x,z}^{(0)} = \tilde{f}_{Q,S,O,0}(x,z) = \tilde{f}_{Q,S,O}(x,z)$

If window=1, only immediately adjacent epochs of the current text class z contribute to $\tilde{f}_{Q,S,O,W}(x,z)$:

\[ y_{x,z} = y_{x,z}^{(1)} = \tilde{f}_{Q,S,O,1}(x,z) = \mathrm{avg}\,\{ y_{x-S,z}^{(0)},\; y_{x,z}^{(0)},\; y_{x+S,z}^{(0)} \} = \frac{\tilde{f}_{Q,S,O}(x-S,z) + \tilde{f}_{Q,S,O}(x,z) + \tilde{f}_{Q,S,O}(x+S,z)}{3} \]

In general, for window=W, slice=S, and wbase=B with B ∈ {0,1}:

\[ y_{x,z} = y_{x,z}^{(W)} = \tilde{f}_{Q,S,O,W}(x,z) = \operatorname*{avg}_{i=-W}^{W} \{ y_{x+iS,z}^{(0)} \} = \frac{1}{1+2W} \sum_{i=-W}^{W} \tilde{f}_{Q,S,O}(x+iS,z) \]
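
A minimal Python sketch of the uniform moving average for one text class follows. The function name and the dictionary representation are assumptions for this example, as is the treatment of epochs missing from the series (for instance beyond the first or last epoch), which are counted as zero here; the service may handle boundaries differently.

```python
def moving_average(series, S, W):
    """Uniform moving average over the epochs x-W*S .. x+W*S.

    `series` maps epoch labels to pseudo-frequencies for one text class.
    Assumption: epochs missing from `series` contribute 0.
    """
    return {
        x: sum(series.get(x + i * S, 0.0) for i in range(-W, W + 1)) / (1 + 2 * W)
        for x in series
    }


# Made-up pseudo-frequencies for three decade epochs.
series = {1900: 3.0, 1910: 6.0, 1920: 9.0}
print(moving_average(series, S=10, W=1))
```

With W=0 the window contains only the target epoch, so the series is returned unchanged.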

Exponential Discounting

If a nontrivial wbase parameter wbase=B is specified (B≠1), neighboring epochs' contributions to the data-point $\tilde{f}_{Q,S,O,W}(x,z)$ are weighted by inverse distance to the target epoch along the x axis:
\[ y_{x,z} = y_{x,z}^{(W)} = \tilde{f}_{Q,S,O,W,B}(x,z) = \mathop{\mathbb{E}}_{i=-W}^{W}\left[ B^{-|i|} \right] y_{x+iS,z}^{(0)} = \frac{\sum_{i=-W}^{W} B^{-|i|}\, \tilde{f}_{Q,S,O}(x+iS,z)}{1 + 2 \sum_{i=1}^{W} B^{-i}} \]
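
The weighting scheme can be sketched in Python as follows. As before, the function name, the dictionary representation, and the choice to count epochs missing from the series as zero are assumptions for illustration only.

```python
def weighted_moving_average(series, S, W, B=1.0):
    """Moving average with weights B**-|i| for the epoch at distance i.

    `series` maps epoch labels to pseudo-frequencies for one text class.
    Assumption: epochs missing from `series` contribute 0.
    """
    if B == 0:
        B = 1.0  # wbase=0 selects the default base B=1 (uniform weights)
    weights = {i: B ** -abs(i) for i in range(-W, W + 1)}
    # Sum of all weights, i.e. 1 + 2 * sum(B**-i for i in 1..W).
    norm = sum(weights.values())
    return {
        x: sum(w * series.get(x + i * S, 0.0) for i, w in weights.items()) / norm
        for x in series
    }


# Made-up series with a single non-zero epoch.
series = {1900: 0.0, 1910: 8.0, 1920: 0.0}
print(weighted_moving_average(series, S=10, W=1, B=2))
```

For B=2 and W=1 the weights are 0.5, 1, and 0.5, so the central epoch keeps half of its value and each neighbor receives a quarter of it.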

Log Averaging

If the logavg option is specified and nonzero, moving averages will be computed for the natural logarithms of the scaled pseudo-frequencies and subsequently re-projected onto the original value domain, using a constant $\varepsilon$ to avoid singularities (default $\varepsilon = 0.5$):
\[ y_{x,z}^{(0)} = \log\left( \tilde{f}_{Q,S,O}(x,z) + \varepsilon \right) \]
\[ y_{x,z} = \exp\left( y_{x,z}^{(W)} \right) - \varepsilon \]
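
A Python sketch of log averaging with uniform weights is given below. The function name is hypothetical, and epochs missing from the series are assumed to have a pseudo-frequency of zero, so that they contribute log(ɛ) to the average.

```python
import math


def log_moving_average(series, S, W, eps=0.5):
    """Uniform moving average computed on log(value + eps), then mapped back.

    `series` maps epoch labels to pseudo-frequencies for one text class.
    Assumption: missing epochs have frequency 0 and contribute log(eps).
    """
    logs = {x: math.log(v + eps) for x, v in series.items()}
    out = {}
    for x in series:
        total = sum(logs.get(x + i * S, math.log(eps)) for i in range(-W, W + 1))
        out[x] = math.exp(total / (1 + 2 * W)) - eps
    return out


# Made-up series chosen so that value + eps is 1, 8, and 64.
series = {1900: 0.5, 1910: 7.5, 1920: 63.5}
print(log_moving_average(series, S=10, W=1))
```

Averaging on the log scale amounts to a geometric rather than an arithmetic mean of the shifted values, which dampens the influence of isolated large values on their neighbors.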

Plot Data

The final data to be plotted are triples $(x, y_{x,z}, z)$ of epoch label x, smoothed pseudo-frequency y, and text-class z. In gnuplot-generated image formats, adjacent data points of each text-class z will be connected by line segments.

See Also

D* OpenSearch API version 0.58
Projekt GEI-Digital-2020
Collection: GEI-Digital
Corpus sources provided by the Georg-Eckert-Institut - Leibniz-Institut für internationale Schulbuchforschung.
Corpus processing and infrastructure development by the Zentrum für digitale Lexikographie der deutschen Sprache at the Berlin-Brandenburg Academy of Sciences and Humanities.