D*/gei_digital: DiaCollo Help

Introduction

DiaCollo (pronounced /diːˈakəloʊ/, "dee-ah-kə-loh", analogous to the well-known juggling prop) is a tool for efficient extraction of diachronic collocations from an underlying text corpus. Unlike other collocation extractors such as DWDS Wortprofil, Sketch Engine, or the UCS toolkit, DiaCollo is suitable for extraction and analysis of diachronic collocation data, i.e. collocations whose significance depends on the date of their occurrence. By tracking changes in a word's typical collocates over time and applying J. R. Firth's famous principle that "you shall know a word by the company it keeps", DiaCollo can help to provide a clearer picture of diachronic changes in the word's usage, in particular those related to semantic shift.

Requests & Parameters

DiaCollo is designed as a request-oriented service: it accepts a user request as a set of parameter=value pairs and returns a corresponding profile for the term(s) queried. Parameters are passed to the DiaCollo web service RESTfully, either via the URL query string or via an HTTP POST request as for a standard web form. The URL for the low-level request including all user parameters is displayed in the DiaCollo web front-end as a hyperlink labelled "Raw URL" at the top of the data display area.
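As a rough illustration of the parameter=value convention, the following Python sketch assembles a request URL from a dictionary of parameters. The base URL is a placeholder and not a real DiaCollo endpoint; the parameter names are those documented under Parameters, and the helper name build_request_url is hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical base URL: substitute the address of an actual DiaCollo instance.
BASE_URL = "https://example.org/diacollo/profile"

def build_request_url(params, base_url=BASE_URL):
    """Return base_url with params encoded as a URL query string."""
    return base_url + "?" + urlencode(params)

url = build_request_url({
    "query": "Krise",      # target lemma
    "date": "1900:1999",   # date range MIN:MAX
    "slice": 10,           # epoch size
    "score": "ld",         # score function
    "kbest": 10,           # items per epoch
    "format": "json",      # output format
})
print(url)
```

The same dictionary could equally be sent as the body of an HTTP POST request; only the transport differs, not the parameters.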

Profiles & Diffs

The results of a simple DiaCollo user request are returned as a tabular profile of the k-best collocates for the queried word(s) in the requested date-range, aggregated into sub-intervals ("epochs", e.g. decades) as specified by the slice request parameter. Alternatively, the user may request a comparison or "diff" profile in order to highlight the most prominent differences between two quasi-independent queries, e.g. between two different words or between occurrences of the same word in different date intervals.

Indices & Attributes

For maximum efficiency, DiaCollo uses an internal "native" index structure over the input corpus content words to compute collocations. Each indexed word is treated as a tuple of linguistically salient attributes in addition to the document date. By default, the attributes "Lemma" and "Pos" (part-of-speech) are indexed. User query and groupby request parameters are interpreted as logical conjunctions of restrictions over these attributes, selecting the precise token tuple(s) to be profiled. For finer-grained selection of profiling targets, DiaCollo supports boolean query expressions and document meta-data filters via the tdf and diff-tdf profile types, and the full range of the DDC query language via the ddc and diff-ddc profile types.

Scores & Formats

DiaCollo offers several different score functions for ranking candidate profile collocates, as well as various output formats for returning profile results; see below for details.

Source Code & Services

DiaCollo is implemented as a Perl library, and distributed under the same terms as Perl itself. Source code is available from CPAN, e.g. at http://metacpan.org/release/DiaColloDB. In addition to the source code, a number of DiaCollo instances are accessible by means of an online RESTful web-service plugin for the DDC/D* corpus management framework. A list of publicly available DiaCollo corpora can be found here. A list of all DiaCollo indices hosted by the DWDS project at the BBAW is maintained here (click on the DiaCollo icon in the "Tools" column to access the DiaCollo GUI for a particular corpus).

Parameters

query
Target LEMMA(s) or /REGEX/ or DDC QUERY (aliases: query q lemmata lemmas lemma lem l; REQUIRED). See Query Syntax for details.
date
Target DATE(s) or /REGEX/ or range MIN:MAX (aliases: dates date d; default=all). In date-ranges, either or both of MIN and MAX may be specified as an asterisk ("*", ASCII 0x2A) to represent the minimum or maximum date in the stored index, respectively; thus "*:*" represents the entire date range of the indexed corpus. The DDC and diff-DDC profile types currently do not support date regexes; see Profile Types for details.
slice
Target epoch size or "0" (zero) for global profile (aliases: dslice slice ds sl s; default=10). DiaCollo returns up to $KBEST items for each date sub-interval in the requested range. Date intervals (also called "epochs" or "slices") are labelled in DiaCollo result sets by their minimum element, i.e. epoch(YEAR) = SLICE*⌊YEAR/SLICE⌋. Epochs in diff profiles are labelled by the epochs of the aligned sub-profile slices, separated by a hyphen character, i.e. diffEpoch(pa,pb) = "EPOCHa-EPOCHb".
bquery
Diff target query (aliases: bquery bq blemmata blemmas blemma blem bl; REQUIRED for diff profiles). See Query Syntax for details.
bdate
Diff target date(s) (aliases: bdates bdate bd; default=$DATE)
bslice
Diff target epoch size (aliases: bdslice bslice bds bsl bs; default=$SLICE)
groupby
Aggregation attribute list with optional restrictions (aliases: groupby group gr gb g; default=l,p). See Query Syntax and Grouping for details.
score
Score function, one of (f fm lf lfm mi1 mi3 milf ld ll), where mi is an alias for milf (aliases: score sc sf; default=ld). See Score Functions for details.
kbest
Number of items to return per epoch (aliases: kbest kb k; default=10)
cutoff
Score cutoff per epoch (aliases: cutoff cut co; default=none). Currently has no effect for diff profiles.
diff
Score aggregation function for diff profiles (aliases: diffop diff D; default=adiff). See Diff Operations for details.
global
Boolean indicating whether to prune profiles globally or locally for each epoch (aliases: global glob glo gl G; default=0 (disabled)). If this option is in effect, each epoch returned should contain exactly the same set of collocate items w2; otherwise (default) the set of collocates may differ between epochs.
1pass
Boolean indicating whether to use approximate single-pass f2 acquisition method for native collocations and diff:collocations profiles. (Aliases: onepass 1pass 1p; default=0 (disabled)). DiaColloDB versions <= v0.08.006 used this method by default. As of DiaColloDB v0.10.000, single-pass f2 acquisition is still supported, but the speed benefits are minimal, and single-pass profiles may in fact be slower to compute than full dual-pass profiles.
debug
Debug mode? (aliases: debug dbg; default=0)
profile
Profile type to compute (aliases: profile prof prf pr p; default=2). See Profile Types for details.
format
Output format (aliases: format fmt f; default=html). See Output Formats for details.
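The epoch labelling rule given for the slice parameter can be sketched in a few lines of Python. The function names are hypothetical, and the label returned for slice=0 (global profile) is an assumption, since the text does not say how a global profile is labelled.

```python
def epoch(year, slice_size):
    """Label of the epoch containing `year`: SLICE * floor(YEAR / SLICE)."""
    if slice_size == 0:
        # slice=0 requests one global profile; labelling it 0 is an assumption.
        return 0
    return slice_size * (year // slice_size)

def diff_epoch(epoch_a, epoch_b):
    """Label of an aligned epoch pair in a diff profile: "EPOCHa-EPOCHb"."""
    return f"{epoch_a}-{epoch_b}"

print(epoch(1914, 10))                                   # decade containing 1914
print(diff_epoch(epoch(1914, 10), epoch(1991, 10)))      # aligned diff label
```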

Query Syntax

DiaCollo supports both a "native" shorthand query syntax appropriate for simple queries and the DDC Query Language (since DiaCollo v0.06.004), although not all profile types support all DDC query operations.

Native Query Syntax

A native query is simply a list of search criteria of the form ATTR=VALUE, interpreted as a logical conjunction of the specified conditions for a single matching token, with multiple request clauses separated by commas (,) or whitespace:
q_native  ::= qn_clause ([\s,]+ qn_clause)*
qn_clause ::= "$"? qnc_attr "=" qnc_value
            | qnc_value
qnc_value ::= "/" REGEX "/" ([gimsadlux]*)
            | STRING ("|" STRING)*
  • If the attribute name (qnc_attr) is omitted from a restriction clause, a default attribute is used (currently "lemma").
  • Native groupby requests are defined analogously, but allow omission of the value-part of the clause (qnc_value) rather than the attribute name; see Grouping for details.
  • Special characters in regular expressions or strings can be escaped by preceding them with a backslash (\).
  • See the perlre manpage for details on the regular expressions supported by DiaCollo.
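To make the grammar above more concrete, here is a deliberately simplified Python sketch of a parser for native queries. It is not DiaCollo's own parser: it ignores backslash escapes, assumes that regexes and strings contain no whitespace or commas, and the function name and tuple layout are hypothetical.

```python
import re

# "$"? qnc_attr "=" qnc_value
CLAUSE_RE = re.compile(r'^\$?(?P<attr>\w+)=(?P<value>.*)$', re.S)
# "/" REGEX "/" flags
REGEX_RE = re.compile(r'^/(?P<re>.*)/(?P<flags>[gimsadlux]*)$', re.S)

def parse_native_query(q, default_attr="lemma"):
    """Split q into clauses and return (attr, kind, value, flags) tuples."""
    clauses = []
    for tok in re.split(r'[\s,]+', q.strip()):
        if not tok:
            continue
        m = CLAUSE_RE.match(tok)
        # A clause without an attribute name falls back to the default attribute.
        attr, value = (m.group('attr'), m.group('value')) if m else (default_attr, tok)
        r = REGEX_RE.match(value)
        if r:
            clauses.append((attr, 'regex', r.group('re'), r.group('flags')))
        else:
            # STRING ("|" STRING)* : a list of alternative literal values
            clauses.append((attr, 'strings', value.split('|'), ''))
    return clauses

print(parse_native_query("Krise, $p=NN"))
print(parse_native_query("l=/^Kris/i"))
```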

DDC Query Syntax

As of v0.06.004, DiaCollo additionally supports DDC query syntax, although not all profile types support all DDC query operations; see Profile Types for the restrictions that apply to each profile type.

See the DDC Query Language manpage for details on the core DDC query language. DiaCollo's DDC dialect additionally supports the following shorthand aliases:

qc_word (" "|",")+ qc_word
Commas or spaces separating qc_word sub-expressions are mapped to WITH-clauses, i.e. token-local logical conjunction of independent restriction clauses, analogous to the native syntax.
#LIMIT[N]
Requests that DDC retrieve at most N items; useful for speeding up response times for large result-sets
#SAMPLE[N]
Requests that generated DDC count()-queries operate on a random sample of at most N tokens. Can actually wind up taking longer than without a #SAMPLE clause, since this requires hits to be (randomly) sorted.
#DMAX[D]
Set maximum proximity distance (+1) for implicit NEAR() queries. Note that the value for D should be 1 greater than the value passed to the DDC NEAR() operator itself. The default value depends on the DiaCollo index, but is usually 5 (up to 4 intervening tokens between w1 and w2).
#FMIN[N]
Set minimum frequency of collocate targets to be profiled. Useful for reducing network I/O overhead between the client and the DDC server. Default value depends on the DiaCollo index, but is usually 2. Higher values should result in shorter running times, but may filter out some interesting results. Note that this frequency threshold applies on a DDC subcorpus ("shard") basis.
#FCOEF[C]
Override the relation-specific frequency scaling coefficient for this query. For formal reasons, the independent frequencies f1, f2, and N are scaled up by a query-specific factor when computing score-functions from DDC count data, in order to ensure that the quantities involved can be interpreted as probabilities. The scaling coefficient is usually automatically guessed from the DDC query (e.g. C=2(N+1) for a query of the form NEAR(* =2,X,N), C=N+1 for a query of the form "* =2 #<N X", and C=1 for "* =2 #=N X").
"[" l_countkeys "]"
A groupby request can be explicitly wrapped in square brackets to force its interpretation as a DDC l_countkeys count-key list as opposed to a native groupby request. Potentially useful if you want/need to use alternative target offsets, bibliographic metadata fields, or regex transformations on the result tuple attributes.
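The automatic guess for the #FCOEF scaling coefficient described above can be written out as a small lookup. This Python sketch only restates the three cases named in the text; the function name and the labels "near", "lt" and "eq" for the three query forms are hypothetical, and how DiaCollo actually recognizes the query form is not shown.

```python
def guess_fcoef(form, n):
    """Scaling coefficient C for the three query forms named in the text."""
    if form == "near":   # NEAR(* =2,X,N)
        return 2 * (n + 1)
    if form == "lt":     # "* =2 #<N X"
        return n + 1
    if form == "eq":     # "* =2 #=N X"
        return 1
    raise ValueError(f"unknown query form: {form!r}")

print(guess_fcoef("near", 4))
```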

Grouping

The groupby parameter can be used to specify which indexed attributes of the candidate collocates are to be projected and to impose optional restrictions on the values of those attributes. It can be informally understood as a combination of SQL's GROUP BY and HAVING clauses.

The value of the groupby parameter is a comma-separated list of grouping expressions gb_expr:

q_groupby ::= gb_expr ([\s,]+ gb_expr)*
gb_expr   ::= "$"? qnc_attr
            | "$"? qnc_attr "=" qnc_value

Only the attributes qnc_attr explicitly specified in the groupby clause are projected from candidate collocates, so that if you request for example groupby: Lemma, then a result-set will include at most one entry for the lemma "flood", even if that lemma occurs in your corpus with multiple part-of-speech tags (e.g. as both a noun and a verb). If instead you request groupby: Lemma, Pos, then the result-set will treat distinct (Lemma,Pos) pairs as distinct collocate items.

If the groupby expression is of the form ATTR=HAVING, the HAVING expression (qnc_value) is interpreted as a restriction on the candidate collocates' values for the associated projected attribute ATTR (qnc_attr). For example, groupby: Lemma, Pos=NN will return only those collocates with the PoS-tag "NN" (common noun), and groupby: Lemma, Pos=/^ADJ/ will return only collocates whose PoS-tag begins with "ADJ".
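The GROUP BY / HAVING analogy can be illustrated with a toy Python sketch. The token data below is invented for illustration, and the function is a stand-in for the idea of projection plus restriction, not DiaCollo's implementation.

```python
import re
from collections import Counter

# Invented toy collocate tokens, each a tuple of indexed attributes.
TOKENS = [
    {"Lemma": "flood", "Pos": "NN"},
    {"Lemma": "flood", "Pos": "VB"},
    {"Lemma": "flood", "Pos": "NN"},
    {"Lemma": "great", "Pos": "ADJA"},
    {"Lemma": "great", "Pos": "ADJD"},
]

def group_by(tokens, attrs, having=None):
    """Count tokens by the projected attrs, keeping only tokens that
    satisfy every restriction in having ({attribute: predicate})."""
    having = having or {}
    counts = Counter()
    for tok in tokens:
        if all(pred(tok[a]) for a, pred in having.items()):
            counts[tuple(tok[a] for a in attrs)] += 1
    return counts

by_lemma = group_by(TOKENS, ["Lemma"])                      # groupby: Lemma
by_lemma_pos = group_by(TOKENS, ["Lemma", "Pos"])           # groupby: Lemma, Pos
nouns = group_by(TOKENS, ["Lemma", "Pos"],
                 {"Pos": lambda v: v == "NN"})              # groupby: Lemma, Pos=NN
adjectives = group_by(TOKENS, ["Lemma", "Pos"],
                      {"Pos": re.compile(r"^ADJ").search})  # groupby: Lemma, Pos=/^ADJ/
print(by_lemma, nouns, adjectives, sep="\n")
```

Projecting only Lemma merges the noun and verb readings of "flood" into one item, while projecting (Lemma, Pos) keeps them apart.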

Note that the bubble and cloud formats only display the first projected collocate attribute by default, although the entire projected collocate n-tuple should be available through the "details" popup window display after (double-)clicking on a collocate item in the main display canvas.

Score Functions

DiaCollo assigns each collocate in a unary profile a real-valued score by means of a user-specified score function. Currently, DiaCollo supports the following score functions:

f Raw collocation frequency, score_f = f12. Despite its immediate and intuitive interpretability, ranking by raw frequency alone does not usually provide a very good picture of collocations' "significance", since high-frequency items such as determiners tend to get ranked highest simply by virtue of their (uninteresting) high overall likelihood, rather than any particular (and potentially interesting) affinity for the search term(s) in question. While the native DiaCollo profile types filter out determiners (and all other function words) by default, the basic problem of uninteresting high-frequency collocates (e.g. "Herr") remains for raw frequency rankings.
fm Collocation frequency per million tokens, score_fm = 1000000 * f12 / N. This is just a linear normalized variant of raw collocation frequency.
lf Collocation log-frequency, score_lf = log2(f12 + ɛ). This is just a logarithmic variant of raw collocation frequency.
lfm Collocation log-frequency per million, score_lfm = log2(1000000 * (f12+ɛ) / (N+ɛ)). This is just a logarithmic variant of the normalized collocation frequency.
mi1 Raw pointwise mutual information, score_mi1 = log2( ((f12+ɛ)*(N+ɛ)) / ((f1+ɛ)*(f2+ɛ)) ). Attempts to address the shortcomings of raw-frequency rankings by estimating the change in code-lengths for jointly encoded collocation pairs versus independent encoding of each collocate. It generally works well for high- and mid-frequency collocation pairs, but tends to return disproportionately large values for low-frequency collocates.
mi3 Pointwise mutual information variant using the cube of the raw co-occurrence frequency f12 to boost association scores for high-frequency pairs, score_mi3 = log2( ((f12+ɛ)^3 * (N+ɛ)) / ((f1+ɛ)*(f2+ɛ)) ). Heuristic score function investigated by both Evert (2004) and Rychlý (2008), attributed to Daille (1994).
milf (alias: mi) Pointwise mutual information * log-frequency product as described by Rychlý (2008), score_milf = log2( ((f12+ɛ)*(N+ɛ)) / ((f1+ɛ)*(f2+ɛ)) ) * log2( f12+ɛ ). Multiplying the raw PMI by the log-frequency of the collocation pair is a post-hoc attempt to ameliorate raw MI's preference for low-frequency collocates, but this strategy is not always successful.
ld Scaled log-Dice coefficient, score_ld = 14 + log2( 2*(f12+ɛ) / ((f1+ɛ) + (f2+ɛ)) ). Suggested by Rychlý (2008) as an association score for collocations and related to the intersection of fuzzy sets, the Dice coefficient is less susceptible to low-frequency outliers than pointwise mutual information or the PMI * log-frequency product while still managing to filter out most "chance" collocations with (uninteresting) high-frequency items returned by a raw-frequency ranking. This is the default score function used by the synchronic collocation profiler DWDS Wortprofil as described by Didakowski & Geyken (2013), and is also currently the default score function for the DiaCollo web front-end.
ll Variant of the popular binomial log likelihood ratio as suggested by Dunning (1993): score_ll = sgn_ll * log(1 + log λ), where sgn_ll = (f12 < f1*f2/N ? -1 : 1) and log λ = f12*log(f12/(f1*f2/N)) + (f1-f12)*log((f1-f12)/(f1*(N-f2)/N)) + (f2-f12)*log((f2-f12)/((N-f1)*f2/N)) + (N-f1-f2+f12)*log((N-f1-f2+f12)/((N-f1)*(N-f2)/N)). The first term sgn_ll is a sign coefficient implementing a one-sided association measure following Evert (2008), which assigns non-negative values only to "attracting" collocate pairs which co-occur more often than expected, whereas the "pure" log-likelihood ratio also assigns large values to "repelling" pairs which co-occur less often than expected. Raw log-likelihood ratio values log λ tend to vary much more widely than e.g. scaled log-Dice coefficients, leading visualizations based on this quantity to over-emphasize a small number of very strong collocates and relegating weaker associations to the background. To ameliorate this effect, DiaCollo reports and scales based on the quantity log(1 + log λ). Values are unbounded, but are typically in the range [-10:10].

... where the variables in the above definitions are interpreted per epoch as:
w1: target token matching the user query request
w2: collocate token matching the user groupby request
N: total number of collocation relations in the current corpus epoch
f12: frequency of the collocation (w1,w2) in the current corpus epoch
f1: total frequency of the query term (w1) in the selected profile type for the current corpus epoch
f2: total frequency of the collocate term (w2) in the selected profile type for the current corpus epoch
ɛ: smoothing constant, by default zero.
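The score definitions above translate fairly directly into code. The following Python sketch is a transcription of those formulas for a single (w1,w2) pair in a single epoch, not DiaCollo's own implementation. The function names are hypothetical, and treating a zero-count term of log λ as contributing 0 (the usual 0*log(0) convention) is an assumption the text does not spell out.

```python
import math

EPS = 0.0  # smoothing constant ɛ (zero by default, as stated in the text)

def score_f(f12, f1, f2, N, eps=EPS):
    return f12

def score_fm(f12, f1, f2, N, eps=EPS):
    return 1000000 * f12 / N

def score_lf(f12, f1, f2, N, eps=EPS):
    return math.log2(f12 + eps)  # undefined (raises) for f12 = 0 with eps = 0

def score_lfm(f12, f1, f2, N, eps=EPS):
    return math.log2(1000000 * (f12 + eps) / (N + eps))

def score_mi1(f12, f1, f2, N, eps=EPS):
    return math.log2(((f12 + eps) * (N + eps)) / ((f1 + eps) * (f2 + eps)))

def score_mi3(f12, f1, f2, N, eps=EPS):
    return math.log2(((f12 + eps) ** 3 * (N + eps)) / ((f1 + eps) * (f2 + eps)))

def score_milf(f12, f1, f2, N, eps=EPS):
    return score_mi1(f12, f1, f2, N, eps) * math.log2(f12 + eps)

def score_ld(f12, f1, f2, N, eps=EPS):
    return 14 + math.log2(2 * (f12 + eps) / ((f1 + eps) + (f2 + eps)))

def _term(observed, expected):
    # observed * log(observed / expected); assumed to be 0 when observed is 0
    return observed * math.log(observed / expected) if observed > 0 else 0.0

def score_ll(f12, f1, f2, N, eps=EPS):
    expected12 = f1 * f2 / N
    log_lambda = (_term(f12, expected12)
                  + _term(f1 - f12, f1 * (N - f2) / N)
                  + _term(f2 - f12, (N - f1) * f2 / N)
                  + _term(N - f1 - f2 + f12, (N - f1) * (N - f2) / N))
    sign = -1 if f12 < expected12 else 1
    # max() guards against tiny negative values caused by rounding
    return sign * math.log(1 + max(log_lambda, 0.0))

print(score_mi1(8, 64, 32, 1024), score_ld(10, 100, 100, 1000))
```

In this sketch a pair that co-occurs exactly as often as expected under independence gets score_ll = 0, attracting pairs score above zero, and repelling pairs score below zero.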

Diff Operations

In comparison ("diff") mode, DiaCollo computes an aggregate score diff_OP(sa,sb) for a pair of independent item scores sa and sb by applying the binary "diff operation" OP dictated by the diff request parameter. In addition to determining the aggregate score to be associated with a pair of independent score operands, the choice of diff operation also determines the method by which returned items are ranked and selected for return (e.g. by the kbest parameter), as well as the domain over which the diff operation is applied: pre-trimmed diff operations act only on the up to 2*KBEST items in (dom(kbest(pa)) ∪ dom(kbest(pb))), restricted diff operations act on the domain intersection (dom(pa) ∩ dom(pb)), while non-trimmed operations act on the entire domain union (dom(pa) ∪ dom(pb)), where pa and pb are the unary profiles resulting from independent evaluation of the query and bquery request parameters qa and qb, respectively. Currently, DiaCollo supports the following diff operations:

diff Raw score difference (pre-trimmed, asymmetric): diff_diff(sa,sb) = sa-sb. Useful for selecting collocates strongly associated only with qa.
adiff Pseudo-absolute score difference (pre-trimmed, symmetric). Selects based on diff_adiff(sa,sb) = |sa-sb|, but returns raw differences as for raw diff. This is the default diff operation, which selects the most extreme differences among the prominent collocates of qa and qb, regardless of the direction of those differences, which is itself expressed as the sign of the returned diff score.
sum Score sum (symmetric): diff_sum(sa,sb) = sa+sb. Selects strong associations for either qa or qb, preferring shared associations, but not very sensitive to non-uniform operand values (e.g. diff_sum(0,8) = diff_sum(4,4) = 8, but only the latter configuration indicates similar collocation behavior of the associated collocates). Returned rankings are equivalent to the avg operation.
min Score minimum (restricted, symmetric): diff_min(sa,sb) = min(sa,sb). Punishes non-uniform operand values by selecting only the weaker of the operand association scores. Highly sensitive to sparse data problems, since missing data are assigned scores of 0 (zero).
max Score maximum (pre-trimmed, symmetric): diff_max(sa,sb) = max(sa,sb). Selects only the stronger of the operand association scores. Potentially useful for discovering interesting target collocations for further investigation.
avg Score average (restricted, symmetric): diff_avg(sa,sb) = avg(sa,sb) = (sa+sb)/2. Suffers from the same drawbacks as the sum operation, to which the returned rankings are equivalent, although values are more immediately comparable to those returned for unary profiles.
havg Pseudo-harmonic average (restricted, symmetric): diff_havg(sa,sb) ~ havg(sa,sb) = 2*(sa*sb)/(sa+sb). Selects collocates with strong associations to both qa and qb. Attempts to address the shortcomings of the sum and avg diff operations by penalizing dissimilar operand values. In order to avoid singularities resulting from sparse data, this operation actually computes the arithmetic average of the harmonic and raw arithmetic means; i.e.
  havg(sa,sb) := 0 if sa≤0 or sb≤0 (singularity)
  havg(sa,sb) := 2*(sa*sb)/(sa+sb) otherwise (harmonic mean)
  diff_havg(sa,sb) := avg( havg(sa,sb), avg(sa,sb) ) (arithmetic average of harmonic and arithmetic means)
gavg Pseudo-geometric average (restricted, symmetric): diff_gavg(sa,sb) ~ gavg(sa,sb) = √(sa*sb). Selects collocates with strong associations to both qa and qb, similar to the havg operation. To avoid singularities resulting from sparse data, this operation actually computes the arithmetic average of the geometric and the raw arithmetic means, analogous to the method used to compute the havg diff score.
lavg Pseudo-logarithmic average (restricted, symmetric): diff_lavg(sa,sb) ~ exp( avg(log(sa),log(sb)) ). Selects collocates with strong associations to both qa and qb, penalizing dissimilar operand values logarithmically. To avoid negative log values, the target values are forced into the range [1,∞) before averaging, i.e.:
  Δ(sa,sb) := 1-min(sa,sb) if min(sa,sb) < 1 (avoid negative logarithms)
  Δ(sa,sb) := 0 otherwise (safe operands)
  diff_lavg(sa,sb) := exp( avg(log(sa+Δ(sa,sb)), log(sb+Δ(sa,sb))) ) - Δ(sa,sb) (adjusted log-average)
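The diff operations can be sketched in Python as follows. This is an illustration under stated assumptions rather than DiaCollo's code: the adjusted log-average is read as shifting both operands by Δ before taking logarithms, gavg is guarded against non-positive operands by analogy with havg, and missing collocates are given a score of 0 throughout (the text states this only for min). All function names are hypothetical.

```python
import math

def avg(sa, sb):
    return (sa + sb) / 2

def havg(sa, sb):
    # harmonic mean, with 0 returned at the singularity
    return 0.0 if sa <= 0 or sb <= 0 else 2 * sa * sb / (sa + sb)

def gavg(sa, sb):
    # geometric mean; the guard for non-positive operands is an assumption
    return 0.0 if sa <= 0 or sb <= 0 else math.sqrt(sa * sb)

def diff_lavg(sa, sb):
    m = min(sa, sb)
    delta = 1 - m if m < 1 else 0   # shift operands into [1, inf)
    return math.exp(avg(math.log(sa + delta), math.log(sb + delta))) - delta

DIFF_OPS = {
    "diff": lambda sa, sb: sa - sb,
    "adiff": lambda sa, sb: sa - sb,   # raw difference; ranking uses |sa-sb|
    "sum": lambda sa, sb: sa + sb,
    "min": min,
    "max": max,
    "avg": avg,
    "havg": lambda sa, sb: avg(havg(sa, sb), avg(sa, sb)),
    "gavg": lambda sa, sb: avg(gavg(sa, sb), avg(sa, sb)),
    "lavg": diff_lavg,
}

def kbest(profile, k):
    """Keys of the k highest-scoring items of a unary profile (dict)."""
    return sorted(profile, key=profile.get, reverse=True)[:k]

def adiff_profile(pa, pb, k):
    """Pre-trimmed adiff: rank by |sa-sb| over kbest(pa) ∪ kbest(pb),
    but return the signed differences."""
    domain = set(kbest(pa, k)) | set(kbest(pb, k))
    diffs = {w: pa.get(w, 0.0) - pb.get(w, 0.0) for w in domain}
    best = sorted(diffs, key=lambda w: abs(diffs[w]), reverse=True)[:k]
    return {w: diffs[w] for w in best}

print(DIFF_OPS["sum"](0, 8), DIFF_OPS["havg"](0, 8), DIFF_OPS["havg"](4, 4))
print(adiff_profile({"a": 9, "b": 5, "c": 1}, {"a": 8, "b": 1, "d": 7}, k=2))
```

With the operand pairs (0,8) and (4,4) from the description of sum, the sum operation cannot tell the two apart, whereas havg, gavg and lavg all score the uniform pair (4,4) higher.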

Profile Types

DiaCollo offers several different methods for acquiring raw frequency data on the basis of which to score, rank, and select "significant" collocations. These methods are referred to here as "profile types", and the data returned as "profiles". Currently, DiaCollo supports the following profile types:
collocations
(aliases: colloc cof c f12 2)
Native collocation profile. Retrieves and ranks all content words (w2) occurring together with the search term (w1) within a context window of dmax content words and without an intervening boundary of the selected DDC "break collection". Only collocation tuple-pairs with a minimum frequency of fmin are considered. Note that for reasons of efficiency, the frequency threshold fmin, the context-window size dmax, and the boundary DDC break collection must be specified at compile-time, and cannot be changed by the user. The default DiaCollo configuration for corpora at the BBAW uses sentence-break boundaries, dmax = 5, and fmin = 5.
unigrams
(aliases: ug u f1 1)
Native unigram profile. Retrieves and ranks all terms matching the search query (w1). Can be useful together with prefix-, suffix-, or regular-expression queries in order to profile e.g. stems in compounds. Currently, DiaCollo cannot acquire an independent value for the variable f2 for unigram profiles (since this would entail very large prefix or suffix indices and/or regular expression operations unsupported by the underlying library), so that f2 = f12 for each item returned. As a consequence, the mutual information and log-Dice score function rankings are isomorphic to the raw frequency score function for this profile type.
tdf
(aliases: tdf tdm TDF TDM)
Native collocation profile based on an underlying term-document frequency matrix. Retrieves and ranks all content words (w2) occurring together with the search term (w1) within a single "document" as defined by the DDC "break collection" specified via the compile-time option -dbreak (by default, source paragraphs are used as "document" boundaries for matrix construction). Allows more flexible queries and result-set aggregation than the simple collocations profiles, but generally slower to evaluate and less sensitive to proximity effects.
ddc
(aliases: ddc DDC)
Advanced profile using count() queries submitted to an independent DDC search engine to acquire raw frequency data. Unlike the collocations or unigrams profile types, DDC profiles are not limited to the default break collection or a fixed context window size (although by default DDC profiles approximate the collocations profile type by means of the NEAR() operator). Indeed, DDC profiling can use the full range of the DDC query language to express search queries, provided that an explicit match-id =2 is included to indicate the position(s) of the collocate target(s) for which a profile is to be computed. See Query Syntax for more details. Note that the DDC profile type does not currently support regular expression date restrictions. Note also that DDC profiles are usually much slower and more memory-intensive than their native counterparts, and should be avoided if possible.
diff-collocations
(aliases: diff-colloc diff-cof diff-c diff-f12 diff-2 d2)
Comparison of two native collocation profiles. "Diff" queries compute independent profiles pa and pb for the query and bquery parameters, respectively. After ranking according to the selected score function, a comparison profile p(a-b) is computed as score(p(a-b)) = diff_OP(score(pa), score(pb)), where OP is the selected diff operation, and the $KBEST items are selected and returned based on their aggregate diff scores.
diff-unigrams
(aliases: diff-ug diff-u diff-f1 diff-1 d1)
Comparison of two native unigram profiles. See diff-collocations for details on comparison profiles.
diff-tdf
(aliases: diff-tdf diff-TDF dtdf dTDF)
Comparison of two TDF profiles. See diff-collocations for details on comparison profiles.
diff-ddc
(aliases: diff-DDC dDDC dddc)
Comparison of two DDC profiles. See diff-collocations for details on comparison profiles. Note that DDC (diff) profiles are usually slower and more memory-intensive than their native counterparts, and should be avoided if possible.

Output Formats

DiaCollo currently supports the following output formats:
txt
(aliases: text txt t tsv csv)
TAB-separated UTF-8 plain text output, suitable for importing into the spreadsheet application of your choice, e.g. Gnumeric or LibreOffice Calc.
json
(aliases: json js j)
Native JSON format suitable for further automated processing, web-services, etc.
html
(aliases: html htm)
Simple HTML table format, used for live display in the demo interface. In addition to a tabular display of the profile data, the web front-end HTML display uses JavaScript to generate hyperlinks to (close approximations of) underlying corpus hits ("KWIC-links") as well as a color-coded representation of the association score (or score difference, for diff profiles) associated with each row. Due to the implicit compile-time filtering of native index data by content words and the index parameters dmax and fmin, the number of hits returned by the KWIC-links for native collocation profiles may differ somewhat from the f12 pair frequency reported in the table. DDC profiles however should report f12 values identical to the number of corpus hits returned by the associated KWIC-links.
storable
(aliases: storable sto bin)
Binary format using the Perl Storable module, suitable for further automated processing with Perl.
gmotion
(aliases: gmotion gm)
Online visualization using Google Motion Chart. Requires Flash Player. For best results, it is recommended that you set the global parameter to a true value (e.g. 1 (one)) when using this output format. See also Martin Hilpert's motion chart resource page for some examples, use cases, and discussion.
hichart
(aliases: hc hi chart hichart highchart highcharts)
Online visualization using the Highcharts JavaScript library. For best results, it is recommended that you set the global parameter to a true value (e.g. 1 (one)) when using this output format. Clicking on any data point causes a popup window to be displayed containing hyperlinks to corpus hits for the corresponding collocation pairs, as for the HTML format; the caveats noted there apply here as well.
bubble
(aliases: b bub bubble bubbles)
Online interactive visualization using the D3.js JavaScript library force-layout. Collocates are displayed as labelled circles whose radii and color represent the corresponding association score. Node colors are the same hues as those used in the HTML table format, but may appear "washed-out" or "pastel-ized" due to their (partial) transparency. For best results, it is recommended that you set the global parameter to a true value (e.g. 1 (one)) when using this output format.
cloud
(aliases: c cl cld cloud)
Online interactive visualization using Jason Davies' d3-cloud layout for the D3.js JavaScript visualization library. Collocates are displayed as text labels whose size and color represent the corresponding association score. Node colors are the same hues as those used in the HTML table format, but will appear somewhat darker ("dirty" or "shadowed") for better legibility on a white background. For best results, it is recommended that you set the global parameter to a true value (e.g. 1 (one)) when using this output format.

Keyboard Bindings

The online visualizations based on the D3.js JavaScript library (bubble, cloud) support the following keyboard shortcuts whenever the display canvas has the keyboard focus (as indicated by a drop-shadow around the canvas itself as well as the keyboard icon in the header area):

Key(s)              Action
space               toggle playback animation
up-arrow            increase playback speed (coarse)
down-arrow          decrease playback speed (coarse)
shift+up-arrow      increase playback speed (fine)
shift+down-arrow    decrease playback speed (fine)
number 1 or 0       reset playback speed to default (1×)
number 2-9          set playback speed to N × default
shift+number 2-9    set playback speed to 1/N × default
Home                snap to first epoch
End                 snap to final epoch
left-arrow          snap to previous whole epoch
right-arrow         snap to next whole epoch
shift+left-arrow    interpolate backward by one quarter epoch
shift+right-arrow   interpolate forward by one quarter epoch
x                   export snapshot of current display canvas as SVG

Examples

Diff Examples

TDF Examples

DDC Examples

More Examples

... are welcome; please drop me a line if you find a good one!

Fiendishly Awkward Questions

Corpora

Can I use DiaCollo on my own corpus?
What languages are supported?
Pretty much any written language ought to work: DiaCollo aspires to be language-agnostic.
What corpus formats are supported?
Why must I tokenize and annotate my corpus myself?
  • one tool ⇔ one job
  • language agnosia ↝ flexibility
  • DiaCollo is not an all-singing+dancing, one-stop-shopping text analysis tool (and almost certainly never will be)
  • consider using CLARIN-D WebLicht if you want a generic corpus annotation framework
Can I use DiaCollo to directly compare different corpora?
What is "DDC" and why might I care?
  • DDC ("DiaLing/DWDS Concordancer") is an open-source corpus search engine used by the DWDS, DTA, and ZDL projects at the BBAW.
  • "D*" ("dstar") is the corpus management framework used by the DWDS, DTA, and ZDL projects at the BBAW. It employs DDC as a low-level corpus search engine, and optionally includes additional auxiliary indices (including DiaCollo itself) and RESTful web-wrappers (chances are high that you're accessing this document through a D* web wrapper).
  • Both DiaCollo and DDC indices can be managed by the D* corpus management framework, and any D* DiaCollo instance at the BBAW should have an associated DDC index and server.
  • DDC is not "part of" DiaCollo and is not included with the DiaCollo source distribution; nor is DiaCollo "part of" DDC. DiaCollo and DDC are independent modular software packages which - when correctly configured - can play together nicely.
  • A functional DDC corpus index and a running ddc_daemon server are required for evaluating DiaCollo's KWIC-link approximations and DDC profile relation.
  • Configuration, compilation, maintenance, and usage of D*, DDC, and/or ddc_daemon is beyond the scope of this FAQ.
How large does my corpus need to be in order to get reliable results?
  • Epoch (slice) size is more relevant than total corpus size: if you just want a larger sample for a particular query, try increasing the value of the slice parameter (↝ reduce diachronic granularity).
  • A "good" epoch size depends on both the absolute and relative frequency of the target phenomenon as specified by the query, groupby, and date parameters. Since frequencies of natural language phenomena can vary widely, there is no reliable "one-size-fits-all" strategy for determining a "good" epoch size independently of the corpus distribution and the other user query parameters; see Gabrielatos et al. (2012) for a more detailed discussion.
  • Beware compile-time filters (-tfmin, -lfmin, -Opgood, etc.), server-side pruning thresholds, and k-best pruning
    • the indexing option -use-all-the-data disables all compile-time thresholds for the native and TDF relations
    • the #FMIN operator disables runtime pruning for the DDC relation
  • corpus artefacts are always possible (e.g. "Pferdebuckel", Krise→Tolstoj)
  • completely subjective, non-rigorous, & informal recommendation:
    • your chances are pretty good if min{f1,f2} ≥ 100 (using the variables described above) ... but DiaCollo can also produce interesting results from small corpora and/or epochs well below this threshold!

Runtime

Can I download DiaCollo results for offline use?
Probably yes: most output formats supported by DiaCollo can be saved to your local computer and used offline.
  • Static tabular formats such as txt, html, json, and storable can be downloaded by clicking on the "Raw URL" link above the main display canvas and using your browser's "Save As" function (typically invoked by right-clicking on a blank area of the page display and choosing "Save As" from the context menu).
  • A snapshot of a single epoch visualization in the bubble or cloud format can be exported by clicking on the download icon in the upper right corner of the display canvas.
  • An interactive GUI snapshot for the bubble or cloud visualization formats can be saved to your local computer by using your browser's "Save As (Web-page, complete)" function (typically invoked by right-clicking on a blank area of the page display and choosing "Save As" from the context menu). Note that you will not be able to submit changes to any query parameters from such an offline snapshot.
  • Google motion charts don't support offline use at all.
How can I restrict the profiled collocates to immediate predecessors?
Use the DDC profile relation together with a phrase query, e.g. "*=2 Mann" #FMIN 1; see also this example.
Why does my collocant appear as a collocate for itself?
Self-collocations are never counted for a single token. Co-occurrences between multiple tokens of a single type are counted twice (yes, this is a wart, but it's not the wart you probably think it is). Consider for example the sentence "Flies flew". This sentence contains 2 tokens (the individual words "Flies" and "flew"). Both of these are instances of a single lemma type (let's call it "FLY"), so that at indexing time, DiaCollo would index the sentence as "FLY FLY", which would result in 2 co-occurrences being counted for the lemma-pair ("FLY","FLY"): one occurrence left-to-right ("Flies","flew") and one occurrence right-to-left ("flew","Flies"). If only lemmata are being indexed, these would be true identity pairs and could be handled by a special exception at indexing time. Usually however, additional attributes will be indexed (such as part-of-speech tag or surface form), so the actual co-occurrence pairs are more likely to be complex and not strictly identical; in this case something like ("FLY/NNS:Flies","FLY/VBD:flew"). Further, it is entirely possible that lexical items do in fact self-collocate, as in the sentence "Fly, fly away!".
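The counting behaviour described above can be illustrated with a minimal Python sketch. This is not DiaCollo's actual indexing code: the function name count_cooccurrences is hypothetical, and treating the whole sentence as a single co-occurrence window is an assumption made purely for illustration.

```python
from collections import Counter

def count_cooccurrences(types):
    """Count ordered co-occurrence pairs between distinct token positions.

    `types` lists the indexed types of all tokens in one co-occurrence
    window (assumed here to be a whole sentence).
    """
    pairs = Counter()
    for i, a in enumerate(types):
        for j, b in enumerate(types):
            if i != j:  # a single token is never paired with itself
                pairs[(a, b)] += 1
    return pairs

# "Flies flew" indexed by lemma only: both tokens share the type "FLY"
lemma_only = count_cooccurrences(["FLY", "FLY"])

# The same sentence indexed with additional attributes (PoS, surface form)
with_attrs = count_cooccurrences(["FLY/NNS:Flies", "FLY/VBD:flew"])
```

In this sketch lemma_only[("FLY", "FLY")] is 2 (one count per direction), while with_attrs holds two distinct pairs with a count of 1 each.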
Why does my collocate item g "disappear" in epoch e?
  • It may not be among the k-best collocates in epoch e. By default, DiaCollo prunes results to the k-best items in each epoch independently.
    • Raising the kbest parameter will allow more collocates to be displayed per epoch.
    • Setting the global option to a true value should cause the exact same set of collocates to be displayed in every epoch (possibly with f12 = 0).
    • If you just want to check the association scores (or indexed frequencies) for a particular item (or items), you can use the groupby parameter to select the collocate(s) of interest.
  • It may have been omitted from the native index by compile-time filters (-cfmin, -tfmin, -Opgood etc.). DiaCollo cannot return or display data which hasn't been indexed.
  • It may have been suppressed by server-side pruning for the DDC profile relation. Try specifying the #FMIN 1 query operator.
  • It may not occur at all (or not co-occur with your collocant) in the specified epoch. DiaCollo indices only store frequency data for attested phenomena (f>0). Try raising the value of the slice parameter or setting it to 0 (zero) for a corpus-global profile.
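The first cause listed above (independent k-best pruning per epoch) can be sketched in a few lines of Python. The function name kbest_per_epoch, the collocates, and all score values are invented for illustration; this is not DiaCollo's own pruning code.

```python
def kbest_per_epoch(scores, k):
    """Keep the k highest-scoring collocates in each epoch independently."""
    return {
        epoch: sorted(items, key=items.get, reverse=True)[:k]
        for epoch, items in scores.items()
    }

# Hypothetical association scores per epoch (made-up numbers)
scores = {
    1980: {"Krieg": 9.1, "Frieden": 7.4, "Welt": 3.2},
    1990: {"Krieg": 8.0, "Welt": 6.5, "Frieden": 2.1},
}

pruned = kbest_per_epoch(scores, k=2)
```

With k=2, the collocate "Frieden" is retained for epoch 1980 but pruned from epoch 1990, even though it is attested in both epochs; raising k to 3 would keep it everywhere.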
Why does the D3 date-slider (bubble, cloud) "snap" to epoch boundaries?
A DiaCollo profile includes at most one data point per candidate collocate per epoch, where epochs are treated as opaque ordinal classification categories; see DiaColloDB::Profile::Multi(3pm). For visual clarity, size and color of collocate items in DiaCollo's D3.js visualization formats (bubble and cloud) are linearly interpolated between discrete epochs by the JavaScript GUI code, causing them to smoothly grow and/or shrink in the GUI animation rather than suddenly "jumping" to a new state. Manually adjusting ("dragging") the position of the date-slider will however cause it to "snap" to the nearest epoch boundary, since that is the closest ordinal category for which the profile has actual data points.
Why does the collocation pair (q,q) appear at epoch e (even though I know it doesn't really occur until later)?
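The difference between interpolated display values and "snapped" slider positions can be sketched as follows. The real GUI code is JavaScript; this Python version only illustrates the idea, and the function names interpolate and snap as well as the example sizes are hypothetical.

```python
def interpolate(epochs, values, t):
    """Linearly interpolate a display value at slider position t (a year)."""
    if t <= epochs[0]:
        return values[0]
    if t >= epochs[-1]:
        return values[-1]
    for i in range(len(epochs) - 1):
        e0, e1 = epochs[i], epochs[i + 1]
        if e0 <= t <= e1:
            frac = (t - e0) / (e1 - e0)
            return values[i] + frac * (values[i + 1] - values[i])

def snap(epochs, t):
    """Return the epoch label nearest to slider position t."""
    return min(epochs, key=lambda e: abs(e - t))

epochs = [1980, 1990, 2000]   # epoch labels for slice=10
sizes = [10.0, 20.0, 5.0]     # invented bubble sizes, one per epoch

animated_size = interpolate(epochs, sizes, 1985)  # shown during animation
snapped_epoch = snap(epochs, 1987)                # where a dragged slider lands
```

Only the values stored in sizes correspond to actual profile data; everything interpolate returns between two epochs is a visual aid with no data point behind it, which is why a dragged slider is moved to an epoch label.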
Epochs are labelled by their minimum possible element (year), so for slice=E, the epoch e represents the date interval [e .. e+E-1] (e.g. for slice=10, the epoch "1980" represents the interval 1980-1989).
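The labelling convention can be made concrete with a short Python sketch. It assumes that epochs are aligned to multiples of the slice size (as in the 1980-1989 example) and that slice is a positive integer; the function names epoch_label and epoch_interval are hypothetical and not part of DiaCollo.

```python
def epoch_label(year, slice_size):
    """Label of the epoch containing `year`: its minimum possible year.

    Assumes epochs are aligned to multiples of `slice_size` (> 0).
    """
    if slice_size <= 0:
        raise ValueError("this sketch only handles slice > 0")
    return (year // slice_size) * slice_size

def epoch_interval(label, slice_size):
    """Inclusive date interval [e .. e+E-1] represented by epoch `label`."""
    return (label, label + slice_size - 1)

# A pair first attested in 1987 is reported under the epoch "1980" for slice=10
label = epoch_label(1987, 10)
interval = epoch_interval(label, 10)
```

Under these assumptions, a collocation pair first attested in 1987 is reported at epoch 1980, whose interval is (1980, 1989).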
Why don't the corpus KWIC links always return exactly f12 hits?
  • DiaCollo itself does not create or maintain a full-text index (one tool ⇔ one job); retrieval of actual corpus hits is performed by an independent DDC server.
  • DiaCollo's KWIC links are nothing more or less than dynamically generated DDC search queries which – when evaluated by an appropriate DDC server (typically when you click on such a link from the DiaCollo GUI) – should closely approximate the corpus hits for the associated collocation pair.
  • The DDC queries generated by DiaCollo for its KWIC links are only approximations in the case of native profile relations, since there are no equivalent DDC query expressions for many of DiaCollo's compile-time filters (e.g. content-word filtering).
  • If you need to ensure exact results, use the DDC relation together with the #FMIN 1 query operator.

Errors

Error: DiaColloDB::Document::CLASS: cannot load file ...
This error message can be emitted at indexing time by dcdb-create.perl if the corpus input data you supplied does not appear to be formatted correctly. Ensure that your data is formatted properly and that you have specified the correct -dclass=CLASS option to dcdb-create.perl.
Error: No 'query' parameter specified
This message indicates that your request did not include a collocant query specification by means of the query parameter (or, respectively, the QUERY1 argument to dcdb-query.perl). It can appear in the HTML GUI before any request has been submitted. The query parameter is required.
Error: No data to display
No index entries matched your request. Underlying causes may vary, but possible reasons include:
Error: You cannot submit queries from an offline data set
This error indicates that you attempted to submit a new profile request to a static GUI snapshot, which is not supported. Submit new requests to a "live" index wrapper instead.
Error: Variable 'ddc_url_root' not set: KWIC links disabled
This indicates that the DiaCollo index is not associated with any running DDC/D* REST API. If this error appears for a DDC/D* DiaCollo instance hosted by the BBAW, it probably indicates a corpus configuration bug: please inform the corpus maintainer. If you have indexed your own corpus, you will need to set up a D*-compatible DDC server and set DiaCollo's ddcServer option if you want to support DDC profiles and KWIC links.
Error: 500 Internal Server Error
This is just an HTTP status code, and not an error message (and also not very informative). Keep reading for some (hopefully) more useful diagnostics.
Error: ttk_process(): template error: undef error - [MESSAGE]
Something went wrong using the DiaColloDB::WWW GUI (still not very informative). The actual diagnostic message begins with [MESSAGE].
Error: ... called at FILE.pm line XYZ
This is a stack trace of the actual error. Only the first few lines are likely to be informative.
Error: parseQuery(): ... could not parse query: syntax error: ...
Indicates that the query parameter could not be parsed. Ensure that your query is syntactically valid; see "Query Syntax" for details.
Error: align(): cannot align non-trivial multi-profiles of unequal size
You tried to compare two profiles with incompatible epoch partitions. Diff comparisons (diff, diff-tdf, etc.) require both operand profiles to contain the same number of epochs or only a single epoch (e.g. slice=0).
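The compatibility condition can be written down as a one-line check. This is a sketch of one reading of the rule above (operands are compatible if they have the same number of epochs, or if at least one of them has only a single epoch); the function name can_align is hypothetical and the code is not taken from DiaCollo.

```python
def can_align(n_epochs_a, n_epochs_b):
    """True if two profiles with the given epoch counts can be compared.

    Assumed rule: equal epoch counts, or at least one single-epoch
    ("trivial") operand such as a profile requested with slice=0.
    """
    return n_epochs_a == n_epochs_b or n_epochs_a == 1 or n_epochs_b == 1

same_partition = can_align(5, 5)    # e.g. both operands use the same date and slice
one_trivial = can_align(5, 1)       # e.g. one operand requested with slice=0
incompatible = can_align(5, 3)      # would trigger the align() error
```

If you run into this error, requesting both operands with the same date range and slice value, or with slice=0, is the simplest way to obtain compatible epoch partitions.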
Error: ... abstract method called
I probably forgot to implement something; please let me know!
Error: ... timeout elapsed
This usually means that the independent DDC server took too long to respond for a DDC profile. Your query may be too complex to be handled gracefully: try raising the value of #FMIN, lowering the value of #DMAX, and/or profiling only a corpus sample. Please do not immediately re-submit a query which has caused a timeout in the hope that this error will have been magically resolved, and please do wait at least for the "timeout elapsed" message to appear before submitting a new query to an unresponsive server.
Error: no 'ddcServer' key defined
This indicates that you tried to use the DDC profile relation without declaring an associated DDC server to provide frequency data. If this error appears for a DDC/D* DiaCollo instance hosted by the BBAW, it probably indicates a corpus configuration bug: please inform the corpus maintainer. If you have a running DDC server for your own corpus, you can EITHER Replace HOST and PORT in the examples above with values appropriate for your DDC server.

References

DiaCollo

General

  • Jörg Didakowski and Alexander Geyken (2013). "From DWDS corpora to a German Word Profile – methodological problems and solutions." In: Network Strategies, Access Structures and Automatic Extraction of Lexicographical Information. 2nd Work Report of the Academic Network "Internet Lexicography". Mannheim, Institut für Deutsche Sprache. (OPAL - Online publizierte Arbeiten zur Linguistik X/2012), pages 43-52. (PDF)
  • Béatrice Daille (1994). Approche mixte pour l'extraction automatique de terminologie: statistiques lexicales et filtres linguistiques. Ph.D. thesis, Université Paris 7.
  • Ted Dunning (1993). "Accurate methods for the statistics of surprise and coincidence." Computational Linguistics 19(1), 61-74. (PDF)
  • Stefan Evert (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. Ph.D. thesis, Universität Stuttgart. (PDF)
  • Stefan Evert (2008). "Corpora and collocations." In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, article 58, Berlin, Mouton de Gruyter, pages 1212-1248. (PDF [extended manuscript])
  • C. Gabrielatos, T. McEnery, P. J. Diggle, and P. Baker (2012). "The peaks and troughs of corpus-based contextual analysis." International Journal of Corpus Linguistics 17(2), pages 151–175. (DOI, PDF [post-print])
  • Pavel Rychlý (2008). "A lexicographer-friendly association score." In P. Sojka and A. Horák (eds.), Proceedings of Recent Advances in Slavonic Natural Language Processing. RASLAN 2008, pages 6-9. (PDF, PDF(2))
  • Adam Kilgarriff and David Tugwell (2002). "Sketching words." In M.-H. Corréard (ed.), Lexicography and Natural Language Processing: A Festschrift in Honour of B. T. S. Atkins. EURALEX, pages 125-137. (PDF)

DiaColloDB v0.12.019 / DiaColloDB::WWW v0.02.005
Projekt GEI-Digital-2020
Collection: GEI-Digital
Corpus sources provided by the Georg-Eckert-Institut - Leibniz-Institut für internationale Schulbuchforschung.
Corpus processing and infrastructure development by the Zentrum für digitale Lexikographie der deutschen Sprache at the Berlin-Brandenburg Academy of Sciences and Humanities.