Introduction
The D* time series web-service provides a RESTful API and a simple
browser-based user interface for acquisition and display of
time-series data from an associated DDC corpus search engine,
with optional smoothing and outlier detection.
See
"User Interface" for details on the
browser-based user interface, and see
"REST API"
for details on the underlying RESTful API.
User Interface
Upon accessing the top-level service URL for a given corpus
(http://diacollo.gei.de/gei-digital-2020/hist.perl) in a web browser,
the user is presented with a graphical interface in which queries can be constructed and
submitted to the underlying DDC server.
This section describes the various elements of that interface.
- Query Form
-
The top of the user interface contains a simple HTML query form with input widgets corresponding
to the various parameters of the underlying REST API.
Hovering the mouse over an input widget should cause a tooltip to be displayed
briefly describing the corresponding parameter's function. Changes made to the parameters in
the query form will only cause the plot area to be updated after submitting the form
by clicking on the "submit" button or pressing the "Enter" key when a text input widget
is focused.
- Button Bar
-
The bottom of the header area contains a number of text-mode buttons for navigation and
export
of the data-set underlying the current plot display (if any).
- Plot Area
-
The main body of the user interface is reserved for the plot image data returned
by the underlying REST API for the parameter set last submitted
from the query form.
-
The bottom of the user interface contains a footer with some administrative information.
Parameters
The web-service accepts a set of parameter=value pairs for each request and returns a corresponding time series
as either a raw data-set or a two-dimensional plot image. Parameters are passed to the service via the
URL query string or an HTTP POST request, as for a standard web form
(e.g. using the browser-based user interface).
Basic Parameters
The following parameters influence the data acquisition and smoothing processes
used to generate the final time series data set.
- query
-
Aliases: query, qu, q, lemma, lem, l
REQUIRED
Target context query in the DDC query language.
The query expression Q specified to this service should be a "Context Query" (query_conditions);
the actual request sent to the underlying DDC server will be something like:
COUNT(Q #SEP) #BY[date/1, textClass]
- slice
-
Aliases: sliceby, slice, sl, s
Default: 5
Size S in years of the date intervals ("epochs", "slices") into which time series data are to be partitioned,
with an optional offset suffix +O or -O, where 0 ≤ |O| < S.
If no offset term ±O is explicitly specified
but the xrange parameter contains a non-trivial minimum date
(xrange=xmin:xmax with xmin≠*),
O will be assigned a default value (O = xmin mod S) such that
the initial epoch begins exactly at xmin; otherwise O defaults to 0 (zero).
See "Epoch Partitioning" for details.
- norm
-
Aliases: normalize, norm, n
Default: date+class
Categorization mode for selecting the frequency-per-million scaling coefficient. Accepted values:
date+class, date, class, corpus, none.
See "Scaling & Normalization" for details.
- window
-
Aliases: window, win, w
Default: 1
Number W of adjacent epochs (before and/or after) to include in the
moving-average smoothing window
for each data point.
See "Moving-Average Smoothing" for details.
- wbase
-
Aliases: wbase, wb, W
Default: 1
Inverse-distance smoothing base (real number B) for weighted moving-average smoothing.
If the parameter is passed as wbase=0, the default value (B=1) is used,
resulting in a uniform weighting scheme for all epochs contributing to $y_{x,z}$.
May also be passed as wbase=e to use the base of the natural logarithm,
B = e = exp(1) = 2.71828….
See "Moving-Average Smoothing" for details.
- logavg
-
Aliases: logavg, loga, la, lognorm, logn, log, ln
Default: 0
If true (nonzero), moving averages will be computed on a logarithmic scale.
See "Moving-Average Smoothing" for details.
- totals
-
Aliases: totals, tot, T
Default: 0
If true (nonzero), total corpus sizes will be plotted rather than the number of hits for the
specified query.
Should be equivalent to setting query=*.
- single
-
Aliases: single, sing, sg
Default: 0
If true (nonzero), consolidates all corpus-encoded text-classes into a single
universal class ("Gesamt"), and displays only the single time series for this
pseudo-class.
- grand
-
Aliases: grand, gr, g
Default: 0
If true (nonzero), displays a grand-average curve with pseudo-class "Gesamt"
(analogous to that generated by specifying single=1)
in addition to the curves for individual text classes.
- gaps
-
Aliases: gaps, gap
Default: 0
If true (nonzero), missing data-points count(x,z) in the data returned by the DDC query
(indicating no hits in any year assigned to the epoch x) will not be generated or passed
to gnuplot. This behavior can lead to unexpected smoothing phenomena, since gnuplot interpolates
over missing x-axis values. The default behavior (gaps=0) generates explicit
data points count(x,z)=0 (zero) for missing values.
- prune
-
Aliases: prune, pr
Default: 0
Inverse confidence level for outlier detection (0: no pruning, .05: 95% confidence level).
See "Outlier Detection" for details.
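As a rough illustration of how the parameters above travel in a URL query string, the following Python sketch assembles a GET request URL. The base URL is the one quoted in the "User Interface" section; the helper name build_request_url and the example query "Haus" are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlencode

# Service URL as quoted in the "User Interface" section; adjust for your corpus.
BASE_URL = "http://diacollo.gei.de/gei-digital-2020/hist.perl"

def build_request_url(query, base_url=BASE_URL, **params):
    """Return a GET URL carrying the given parameter=value pairs.

    Hypothetical helper: only the parameter names (query, slice, norm, ...)
    come from the documentation above.
    """
    if not query:
        raise ValueError("the 'query' parameter is required")
    pairs = {"query": query}
    pairs.update(params)
    # urlencode() percent-escapes reserved characters such as '+',
    # so an offset suffix like "10+0" survives the round trip.
    return base_url + "?" + urlencode(pairs)

# "Haus" is an arbitrary example query, not taken from the documentation.
url = build_request_url("Haus", slice="10+0", norm="date+class", window=1, wbase="e")
print(url)
```

Note that a literal "+" in a query string would be decoded as a space by most servers, which is why the sketch relies on urlencode() rather than string concatenation.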
Gory Details
This section describes the details of the data acquisition and smoothing process.
Notation
The following table shows variables & notation used in this section.
Symbol | Description |
D | set of all dates (years) in the corpus |
Z | set of all text classes in the corpus |
ℕ | set of all natural numbers (non-negative integers) |
Q | DDC query expression as given by the query parameter |
⟦Q⟧ | set of all corpus hits (tokens) for the query Q |
h | single corpus hit (token) |
date(h) | date (year) of hit h as returned by DDC |
class(h) | text-class of hit h as returned by DDC |
|ɑ| | absolute value (ɑ numeric) or set size (ɑ a set) |
$f_Q$ | raw corpus-frequency function |
S | epoch size in years as given by the slice parameter |
O | epoch offset in years, as given by the slice or xrange parameter |
epoch | epoch partitioning function (date → epoch) |
$\mathrm{epoch}^{-1}$ | inverse epoch partitioning function (epoch → date(s)) |
$f_{Q,S,O}$ | epoch-wise corpus frequency function |
$C_{x,z}$ | frequency-per-million scaling factor, selected by the norm parameter |
$\tilde{f}_{Q,S,O}$ | frequency-per-million distribution |
W | size of moving-average smoothing window in epochs as given by the window parameter |
$\tilde{f}_{Q,S,O,W}$ | smoothed epoch-wise frequency function with uniform weighting |
$\tilde{f}_{Q,S,O,W,B}$ | smoothed epoch-wise frequency function with exponential discounting using wbase |
x | epoch label, x-axis plot coordinate, independent variable |
y | smoothed pseudo-frequency, y-axis plot coordinate, dependent variable |
z | text class, sub-plot identifier, independent variable |
Raw Frequency Data
The raw sample data is extracted from a DDC count()-query of the form
COUNT(Q #SEP) #BY[date/1, textClass]
. The raw sample data is essentially a partial frequency function
\[
f_Q : D \times Z \to \mathbb{N} : (d,z) \mapsto \left| \{ h \in \llbracket Q \rrbracket : \mathrm{date}(h)=d \ \&\ \mathrm{class}(h)=z \} \right|
\]
mapping independent variable pairs of date d and text-class z to the raw number of attested
hits (tokens) for query Q in the associated corpus.
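The raw frequency function can be pictured as a simple tally over hits. The sketch below assumes each hit has already been reduced to a (date, text-class) pair; the function name raw_frequencies and the text-class labels are made up for the example and say nothing about how DDC represents its results.

```python
from collections import Counter

def raw_frequencies(hits):
    """Count hits per (date, text_class) pair.

    `hits` is any iterable of (date, text_class) tuples, i.e. the
    values date(h) and class(h) for every hit h of the query.
    """
    return Counter((date, text_class) for date, text_class in hits)

# Hypothetical hits; the class names are illustrative only.
hits = [
    (1901, "Zeitung"),
    (1901, "Zeitung"),
    (1901, "Belletristik"),
    (1905, "Zeitung"),
]
f_Q = raw_frequencies(hits)
print(f_Q[(1901, "Zeitung")])
```

A Counter reports 0 for pairs that never occur, which is a convenient stand-in for the undefined points of the partial function.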
Epoch Partitioning (Date-Slicing)
The slice parameter size S and offset O together determine how raw sample dates (years)
are aggregated into independent date intervals ("epochs") for purposes of smoothing and plotting. A single
aggregated epoch-frequency $f_{Q,S,O}(x,z)$ will be computed for each pair of epoch (x) and text-class (z)
in the selected range, whereby each epoch-label x satisfies x mod S = O mod S (usually zero).
Raw sample dates d are mapped to epoch-labels x = epoch(d), and raw yearly frequencies are summed within each
epoch x to define the epoch-frequency function $f_{Q,S,O}(x,z)$:
\[
\begin{aligned}
\mathrm{epoch}(d) &= O + S \times \left\lfloor \frac{d - O}{S} \right\rfloor \\
f_{Q,S,O}(x,z) &= \sum_{d \in \mathrm{epoch}^{-1}(x)} f_Q(d,z)
\end{aligned}
\]
For example, specifying slice=10 or slice=10+0 requests zero-offset decade epochs,
and would result in epoch-labels 1900, 1910, 1920, etc. representing the date intervals
1900-1909, 1910-1919, 1920-1929, etc. Requesting slice=100+5 would result in
century epochs offset by 5 years with labels such as 1705 (~1705-1804), 1805 (~1805-1904), 1905 (~1905-2004), etc.
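The partitioning rule is easy to check numerically. The following sketch implements the epoch mapping, the default offset derived from xmin, and the per-epoch summation; the function names and the sample counts are illustrative assumptions, not part of the service.

```python
from collections import defaultdict

def epoch(d, S, O=0):
    """Map a year d to its epoch label: O + S * floor((d - O) / S)."""
    return O + S * ((d - O) // S)

def default_offset(xmin, S):
    """Offset used when none is given but xrange has a minimum date."""
    return xmin % S

def epoch_frequencies(f_Q, S, O=0):
    """Sum yearly counts {(year, cls): n} into epoch counts {(label, cls): n}."""
    out = defaultdict(int)
    for (d, z), n in f_Q.items():
        out[(epoch(d, S, O), z)] += n
    return dict(out)

# slice=10: decades labelled by their first year.
print(epoch(1909, 10))        # 1900
# slice=100+5: centuries offset by five years.
print(epoch(1804, 100, 5))    # 1705
# xrange=1873:* with slice=10 and no explicit offset.
print(default_offset(1873, 10))  # 3

# Hypothetical yearly counts for a single class "a".
counts = {(1901, "a"): 2, (1905, "a"): 1, (1910, "a"): 4}
print(epoch_frequencies(counts, 10))
```

Python's floor division rounds toward negative infinity, so the same expression also behaves as expected for negative offsets.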
Scaling & Normalization
To facilitate comparability of plotted values across (sub)corpora of varying size,
raw epoch frequency counts may be scaled to frequency-per-million-tokens values by
a simple linear projection as requested by the norm parameter.
Formally, the norm parameter is used to select a scaling factor $C_{x,z}$
by which raw epoch frequencies $f_{Q,S,O}(x,z)$ are mapped to normalized pseudo-frequencies
$\tilde{f}_{Q,S,O}(x,z)$:
\[
\tilde{f}_{Q,S,O}(x,z) = \frac{f_{Q,S,O}(x,z)}{C_{x,z}}
\]
The accepted values for the norm parameter populate $C_{x,z}$ as follows,
for $f_*$ the raw global corpus size function:
-
date+class (default): normalize by date interval and text-class;
\[
C_{x,z} \approx \frac{\texttt{COUNT(* \#SEP \#ASC\_DATE[x,x+S] \#HAS[textClass,z])}}{10^6}
\]
-
date: normalize by date interval only (over all text-classes);
\[
C_{x,z} \approx \frac{\texttt{COUNT(* \#SEP \#ASC\_DATE[x,x+S])}}{10^6}
\]
-
class: normalize by text-class only (over all dates);
\[
C_{x,z} \approx \frac{\texttt{COUNT(* \#SEP \#HAS[textClass,z])}}{10^6}
\]
-
corpus: normalize corpus-globally (over all date intervals and text-classes);
\[
C_{x,z} = \frac{\sum_{d \in D,\, z' \in Z} f_*(d,z')}{10^6}
\]
-
none: do not normalize at all, but operate on raw absolute frequency counts;
\[
C_{x,z} = 1
\]
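A small numeric sketch may help here. It assumes that the normalized value is the raw epoch frequency divided by the scaling factor, and that the scaling factor is the relevant (sub)corpus size in tokens divided by one million; the function names and the figures are invented for the example.

```python
def scaling_factor(corpus_tokens):
    """Scaling factor C: (sub)corpus size expressed in millions of tokens."""
    if corpus_tokens <= 0:
        raise ValueError("corpus size must be positive")
    return corpus_tokens / 1e6

def per_million(hits, corpus_tokens=None):
    """Scale a raw epoch frequency to hits per million tokens.

    Passing corpus_tokens=None mimics norm=none (C = 1), i.e. the raw
    count is returned unchanged.
    """
    c = 1.0 if corpus_tokens is None else scaling_factor(corpus_tokens)
    return hits / c

# Hypothetical figures: 50 hits in a 2-million-token slice of the corpus.
print(per_million(50, 2_000_000))  # 25.0
print(per_million(50))             # 50.0
```

Which token count is passed in depends on the norm mode: the size of the epoch and class for date+class, of the epoch alone for date, and so on.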
Outlier Detection
If the prune parameter is specified and nonzero,
an error distribution is computed for the normalized (but unsmoothed) sample points
$\tilde{f}_{Q,S,O}(x,z)$ with respect to a double-exponential filtered "expectation function".
The expectation function is obtained by averaging the results of two calls to
PDL::Stats::TS::filter_exp() (right-to-left and left-to-right).
The observed "errors" are converted to p-values assuming a normal (Gaussian) distribution,
and all sample points with p-values outside of the specified
confidence interval are treated as outliers.
Sample points $\tilde{f}_{Q,S,O}(x,z)$ thus identified as outliers
are removed and replaced by linear interpolation over their nearest non-outlier neighbors.
See Jurish (2016) for an example (in German).
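The procedure can be approximated in a few lines of plain Python. This is only a sketch under several assumptions that the text does not settle: simple exponential smoothing with factor alpha=0.5 stands in for filter_exp(), the errors are tested two-sidedly against a normal distribution fitted with the population standard deviation, and outliers at the edges copy their single nearest non-outlier neighbor. None of these choices is claimed to match the service's actual behavior.

```python
import math
from statistics import mean, pstdev

def exp_filter(values, alpha):
    """Simple exponential smoothing, left to right (stand-in for filter_exp)."""
    smoothed = [values[0]]
    for v in values[1:]:
        smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed

def prune_outliers(values, prune=0.05, alpha=0.5):
    """Replace outliers in one text class's series by linear interpolation."""
    if prune <= 0 or len(values) < 3:
        return list(values)
    # Expectation: average of a forward and a backward exponential filter.
    forward = exp_filter(values, alpha)
    backward = exp_filter(values[::-1], alpha)[::-1]
    expected = [(f + b) / 2 for f, b in zip(forward, backward)]
    errors = [v - e for v, e in zip(values, expected)]
    mu, sigma = mean(errors), pstdev(errors)
    if sigma == 0:
        return list(values)
    # Two-sided p-value of each error under the fitted normal distribution.
    p_values = [math.erfc(abs(e - mu) / (sigma * math.sqrt(2))) for e in errors]
    keep = [i for i, p in enumerate(p_values) if p >= prune]
    if not keep:
        return list(values)
    result = list(values)
    for i, p in enumerate(p_values):
        if p >= prune:
            continue
        left = max((k for k in keep if k < i), default=None)
        right = min((k for k in keep if k > i), default=None)
        if left is None:
            result[i] = values[right]
        elif right is None:
            result[i] = values[left]
        else:
            t = (i - left) / (right - left)
            result[i] = values[left] + t * (values[right] - values[left])
    return result

# Hypothetical per-epoch values with one obvious spike.
series = [10, 11, 10, 12, 100, 11, 10, 12, 11, 10]
cleaned = prune_outliers(series, prune=0.05)
print(cleaned)
```

Here prune=0.05 corresponds to the 95% confidence level mentioned for the prune parameter; smaller values flag fewer points.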
Moving-Average Smoothing
To minimize visual interference ("zig-zags") arising from sparse sample data, the pseudo-frequency distribution
data $\tilde{f}_{Q,S,O}(x,z)$ are passed through a moving-average smoothing filter
over the immediately adjacent (preceding and following)
W epochs of the corresponding text class as specified by the
window parameter, with optional exponential discounting as requested by the
wbase parameter,
resulting in a smoothed pseudo-frequency distribution $\tilde{f}_{Q,S,O,W}(\cdot,\cdot)$.
Requesting window=0 disables moving-average smoothing:
\[
y_{x,z} = y^{(0)}_{x,z} = \tilde{f}_{Q,S,O,0}(x,z) = \tilde{f}_{Q,S,O}(x,z)
\]
If window=1, only immediately adjacent epochs of the current text class z
contribute to $\tilde{f}_{Q,S,O,W}(x,z)$:
\[
\begin{aligned}
y_{x,z} = y^{(1)}_{x,z} = \tilde{f}_{Q,S,O,1}(x,z)
 &= \operatorname{avg} \left\{ y^{(0)}_{x-S,z},\ y^{(0)}_{x,z},\ y^{(0)}_{x+S,z} \right\} \\
 &= \frac{\tilde{f}_{Q,S,O}(x-S,z) + \tilde{f}_{Q,S,O}(x,z) + \tilde{f}_{Q,S,O}(x+S,z)}{3}
\end{aligned}
\]
In general, for window=W, slice=S, and wbase=B with B ∈ {0,1}:
\[
\begin{aligned}
y_{x,z} = y^{(W)}_{x,z} = \tilde{f}_{Q,S,O,W}(x,z)
 &= \operatorname{avg}_{i=-W}^{W} \left\{ y^{(0)}_{x+iS,z} \right\} \\
 &= \frac{1}{2W+1} \sum_{i=-W}^{W} \tilde{f}_{Q,S,O}(x+iS,z)
\end{aligned}
\]
Exponential Discounting
If a nontrivial wbase parameter wbase=B is specified (B≠1),
neighboring epochs' contributions to the data-point $\tilde{f}_{Q,S,O,W}(x,z)$
are weighted by inverse distance to the target epoch along the x axis:
\[
\begin{aligned}
y_{x,z} = y^{(W)}_{x,z} = \tilde{f}_{Q,S,O,W,B}(x,z)
 &= \mathbb{E}_{[B^{-|i|}]}\ y^{(0)}_{x+iS,z} \\
 &= \frac{\sum_{i=-W}^{W} B^{-|i|}\, \tilde{f}_{Q,S,O}(x+iS,z)}{1 + 2 \sum_{i=1}^{W} B^{-i}}
\end{aligned}
\]
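The weighted average can be sketched as follows, with the uniform case recovered for B=1. The treatment of epochs near the edges of the series is an assumption: neighbors that do not exist are skipped and the weights are renormalized over those present, which the text above does not specify. The function name and sample values are illustrative.

```python
def smooth(series, S, W=1, B=1.0):
    """Weighted moving average over epochs x + i*S for i in -W..W.

    `series` maps epoch labels to normalized values for one text class.
    Each neighbor at distance |i| gets weight B ** -|i|; B=1 (or wbase=0,
    which falls back to the default) gives a uniform average.
    """
    if B == 0:
        B = 1.0
    out = {}
    for x in series:
        num = den = 0.0
        for i in range(-W, W + 1):
            neighbor = x + i * S
            if neighbor in series:  # assumption: skip missing neighbors
                weight = B ** -abs(i)
                num += weight * series[neighbor]
                den += weight
        out[x] = num / den
    return out

# Hypothetical per-million values for three decades.
values = {1900: 3.0, 1910: 6.0, 1920: 9.0}
print(smooth(values, S=10, W=1))         # uniform weights
print(smooth(values, S=10, W=1, B=2.0))  # neighbors count half as much
```

For an interior epoch with all 2W+1 neighbors present, the denominator accumulated here equals 1 + 2(B^-1 + ... + B^-W), matching the formula above.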
Log Averaging
If the logavg option is specified and nonzero, moving averages will be computed
for the natural logarithms of the scaled pseudo-frequencies and subsequently re-projected
onto the original value domain, using a constant ε to avoid singularities (default ε=0.5):
\[
\begin{aligned}
y^{(0)}_{x,z} &= \log\left( \tilde{f}_{Q,S,O}(x,z) + \varepsilon \right) \\
y_{x,z} &= \exp\left( y^{(W)}_{x,z} \right) - \varepsilon
\end{aligned}
\]
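In effect, log averaging replaces the arithmetic mean over the window with a shifted geometric mean. A minimal sketch, assuming uniform weights and the same skip-missing-neighbors edge handling as an illustration only (function name and data are hypothetical):

```python
import math

def log_smooth(series, S, W=1, eps=0.5):
    """Moving average computed on log(value + eps), then mapped back."""
    logs = {x: math.log(v + eps) for x, v in series.items()}
    out = {}
    for x in logs:
        window = [logs[x + i * S]
                  for i in range(-W, W + 1)
                  if x + i * S in logs]  # assumption: skip missing neighbors
        out[x] = math.exp(sum(window) / len(window)) - eps
    return out

# Hypothetical values chosen so that value + eps is 1, 4, 16.
values = {1900: 0.5, 1910: 3.5, 1920: 15.5}
smoothed = log_smooth(values, S=10, W=1)
print(smoothed[1910])
```

Because the constant ε is added before taking logarithms, epochs with zero hits remain well defined.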
Plot Data
The final data to be plotted are triples ($x$, $y_{x,z}$, $z$) of epoch label x,
smoothed pseudo-frequency y, and text-class z.
In gnuplot-generated image formats, adjacent data points
of each text-class z will be connected by line segments.