Introduction
DiaCollo (pronounced /diːˈakəloʊ/, "dee-ah-kə-loh", analogous to the well-known juggling prop) is a tool for efficient extraction of diachronic collocations from an underlying text corpus. Unlike other collocation extractors such as DWDS Wortprofil, Sketch Engine, or the UCS toolkit, DiaCollo is suitable for extraction and analysis of diachronic collocation data, i.e. collocations whose significance depends on the date of their occurrence. By tracking changes in a word's typical collocates over time and applying J. R. Firth's famous principle that "you shall know a word by the company it keeps", DiaCollo can help to provide a clearer picture of diachronic changes in the word's usage, in particular those related to semantic shift.
Requests & Parameters
DiaCollo is designed as a request-oriented service: it accepts a user request as a set of parameter=value pairs and returns a corresponding profile for the term(s) queried. Parameters are passed to the DiaCollo web-service in RESTful fashion via the URL query string or an HTTP POST request, as for a standard web form. The URL for the low-level request including all user parameters is displayed in the DiaCollo web front-end as a hyperlink labelled "Raw URL" at the top of the data display area.
Profiles & Diffs
The results of a simple DiaCollo user request are returned as a tabular profile of the k-best collocates for the queried word(s) in the requested date-range, aggregated into sub-intervals ("epochs", e.g. decades) as specified by the slice request parameter. Alternatively, the user may request a comparison or "diff" profile in order to highlight the most prominent differences between two quasi-independent queries, e.g. between two different words or between occurrences of the same word in different date intervals.
Indices & Attributes
For maximum efficiency, DiaCollo uses an internal "native" index structure over the input corpus content words to compute collocations. Each indexed word is treated as a tuple of linguistically salient attributes in addition to the document date. By default, the attributes "Lemma" and "Pos" (part-of-speech) are indexed. User query and groupby request parameters are interpreted as logical conjunctions of restrictions over these attributes, selecting the precise token tuple(s) to be profiled. For finer-grained selection of profiling targets, DiaCollo supports boolean query expressions and document meta-data filters via the tdf and diff-tdf profile types, and the full range of the DDC query language via the ddc and diff-ddc profile types.
Scores & Formats
DiaCollo offers several different score functions for ranking candidate profile collocates, as well as various output formats for returning profile results; see below for details.
Source Code & Services
DiaCollo is implemented as a Perl library, and distributed under the same terms as Perl itself. Source code is available from CPAN, e.g. at http://metacpan.org/release/DiaColloDB. In addition to the source code, a number of DiaCollo instances are accessible by means of an online RESTful web-service plugin for the DDC/D* corpus management framework. A list of publicly available DiaCollo corpora can be found here. A list of all DiaCollo indices hosted by the DWDS project at the BBAW is maintained here (click on the DiaCollo icon in the "Tools" column to access the DiaCollo GUI for a particular corpus).
Other Useful Links
- DiaCollo Tutorial (in German)
- CLARIN-D Showcase for DiaCollo (in German)
- DiaCollo Smorgasboard
- The DiaCollo References section contains a list of selected DiaCollo-related publications.
Parameters
- query
- Target LEMMA(s) or /REGEX/ or DDC QUERY (aliases: query q lemmata lemmas lemma lem l; REQUIRED). See Query Syntax for details.
- date
- Target DATE(s) or /REGEX/ or range MIN:MAX (aliases: dates date d; default=all). In date-ranges, either or both of MIN and MAX may be specified as an asterisk ("*", ASCII 0x2A) to represent the minimum (resp. maximum) date in the stored index, so that "*:*" represents the entire date range of the indexed corpus. The DDC and diff-DDC profile types currently do not support date regexes; see Profile Types for details.
- slice
- Target epoch size or "0" (zero) for global profile (aliases: dslice slice ds sl s; default=10). DiaCollo returns up to $KBEST items for each date sub-interval in the requested range. Date intervals (also called "epochs" or "slices") are labelled in DiaCollo result sets by their minimum element, i.e. epoch(YEAR) = SLICE*⌊YEAR/SLICE⌋. Epochs in diff profiles are labelled by the epochs of the aligned sub-profile slices, separated by a hyphen character, i.e. diffEpoch(pa,pb) = "EPOCHa-EPOCHb".
- bquery
- Diff target query (aliases: bquery bq blemmata blemmas blemma blem bl; REQUIRED for diff profiles). See Query Syntax for details.
- bdate
- Diff target date(s) (aliases: bdates bdate bd; default=$DATE)
- bslice
- Diff target epoch size (aliases: bdslice bslice bds bsl bs; default=$SLICE)
- groupby
- Aggregation attribute list with optional restrictions (aliases: groupby group gr gb g; default=l,p). See Query Syntax and Grouping for details.
- score
- Score function, one of (f fm lf lfm mi ld ll) (aliases: score sc sf; default=ld). See Score Functions for details.
- kbest
- Number of items to return per epoch (aliases: kbest kb k; default=10)
- cutoff
- Score cutoff per epoch (aliases: cutoff cut co; default=none). Currently has no effect for diff profiles.
- diff
- Score aggregation function for diff profiles (aliases: diffop diff D; default=adiff). See Diff Operations for details.
- global
- Boolean indicating whether to prune profiles globally or locally for each epoch (aliases: global glob glo gl G; default=0 (disabled)). If this option is in effect, each epoch returned should contain exactly the same set of collocate items w2; otherwise (default) the set of collocates may differ between epochs.
- 1pass
- Boolean indicating whether to use approximate single-pass f2 acquisition method for native collocations and diff:collocations profiles. (Aliases: onepass 1pass 1p; default=0 (disabled)). DiaColloDB versions <= v0.08.006 used this method by default. As of DiaColloDB v0.10.000, single-pass f2 acquisition is still supported, but the speed benefits are minimal, and single-pass profiles may in fact be slower to compute than full dual-pass profiles.
- debug
- Debug mode? (aliases: debug dbg; default=0)
- profile
- Profile type to compute (aliases: profile prof prf pr p; default=2). See Profile Types for details.
- format
- Output format (aliases: format fmt f; default=html). See Output Formats for details.
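The epoch labelling rule given for the slice parameter can be illustrated with a minimal Python sketch. The function names `epoch` and `diff_epoch` are hypothetical and are not part of DiaCollo; the label returned for slice=0 (a single global profile) is an assumption made only so that the function is total.

```python
def epoch(year: int, slice_size: int) -> int:
    """Label of the epoch containing `year`: SLICE * floor(YEAR / SLICE)."""
    if slice_size == 0:
        # Assumption: slice=0 requests one global profile; we label it 0 here.
        return 0
    return slice_size * (year // slice_size)


def diff_epoch(epoch_a: int, epoch_b: int) -> str:
    """Label of an aligned pair of slices in a diff profile: "EPOCHa-EPOCHb"."""
    return f"{epoch_a}-{epoch_b}"


print(epoch(1987, 10))                                # 1980
print(epoch(1987, 25))                                # 1975
print(diff_epoch(epoch(1912, 25), epoch(1987, 25)))   # 1900-1975
```

With slice=10, every year from 1980 through 1989 falls into the epoch labelled 1980, which is why decade profiles are labelled by their first year.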
Query Syntax
DiaCollo supports both a "native" shorthand query syntax appropriate for simple queries and the DDC Query Language (since DiaCollo v0.06.004), although not all profile types support all DDC query operations.
Native Query Syntax
A native query is simply a list of search criteria of the form ATTR=VALUE, interpreted as a logical conjunction of the specified conditions on a single matching token, with multiple request clauses separated by commas (,) or whitespace:
q_native | ::= | qn_clause ([\s,]+ qn_clause)* |
qn_clause | ::= | "$"? qnc_attr "=" qnc_value |
| | qnc_value | |
qnc_value | ::= | "/" REGEX "/" ([gimsadlux]*) |
| | STRING ("|" STRING)* |
- If the attribute name (qnc_attr) is omitted from a restriction clause, a default attribute is used (currently "lemma").
- Native groupby requests are defined analogously, but allow omission of the value-part of the clause (qnc_value) rather than the attribute name; see Grouping for details.
- Special characters in regular expressions or strings can be escaped by preceding them with a backslash (\).
- See the perlre manpage for details on the regular expressions supported by DiaCollo.
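The grammar above can be approximated by a small Python sketch that splits a native query into clauses. This is an illustration only, not DiaCollo's own parser: the function name `parse_native_query` and the returned dictionary layout are hypothetical, and the sketch deliberately ignores backslash escapes as well as regular expressions that contain commas or whitespace.

```python
import re

# "$"? qnc_attr "=" qnc_value   |   qnc_value
CLAUSE = re.compile(r'^\$?(?:(?P<attr>\w+)=)?(?P<value>.+)$')
# "/" REGEX "/" ([gimsadlux]*)
REGEX_VALUE = re.compile(r'^/(?P<pattern>.*)/(?P<flags>[gimsadlux]*)$')


def parse_native_query(query: str, default_attr: str = "lemma") -> list:
    """Split a native query into clauses; all clauses apply to one token."""
    clauses = []
    for token in re.split(r'[\s,]+', query.strip()):
        if not token:
            continue
        match = CLAUSE.match(token)
        # A clause without an attribute name falls back to the default attribute.
        attr = match.group("attr") or default_attr
        value = match.group("value")
        regex = REGEX_VALUE.match(value)
        if regex:
            clauses.append({"attr": attr, "kind": "regex",
                            "pattern": regex.group("pattern"),
                            "flags": regex.group("flags")})
        else:
            # STRING ("|" STRING)* : a disjunction of literal values
            clauses.append({"attr": attr, "kind": "strings",
                            "values": value.split("|")})
    return clauses


print(parse_native_query("Mann p=NE"))
print(parse_native_query("/panzer$/"))
```

The first call yields one clause on the default attribute and one on `p`; the second yields a single regular-expression clause on the default attribute.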
DDC Query Syntax
As of v0.06.004, DiaCollo additionally supports DDC query syntax, although not all profile types support all DDC query operations. In particular:
- The native profile types collocations, unigrams, diff:collocations, and diff:unigrams only support single-token queries over the natively indexed token attributes (by default "lemma" and "pos").
- The (term x document) matrix profile types tdf and diff-tdf additionally support Boolean query expressions and document meta-data filters. These profile types may also employ indexed document metadata fields in the groupby clause.
- The DDC profile types ddc and diff-ddc require that an explicit match-id =2 is included to indicate the position(s) of the collocate target(s) for which a profile is to be computed, otherwise a default NEAR() query is automatically generated.
- qc_word (" "|",")+ qc_word
- Commas or spaces separating qc_word sub-expressions are mapped to WITH-clauses, i.e. token-local logical conjunction of independent restriction clauses, analogous to the native syntax.
- #LIMIT[N]
- Requests that DDC retrieve at most N items; useful for speeding up response times for large result-sets
- #SAMPLE[N]
- Requests that generated DDC count()-queries operate on a random sample of at most N tokens. Can actually wind up taking longer than without a #SAMPLE clause, since this requires hits to be (randomly) sorted.
- #DMAX[D]
- Set maximum proximity distance (+1) for implicit NEAR() queries. Note that the value for D should be 1 greater than the value passed to the DDC NEAR() operator itself. The default value depends on the DiaCollo index, but is usually 5 (up to 4 intervening tokens between w1 and w2).
- #FMIN[N]
- Set minimum frequency of collocate targets to be profiled. Useful for reducing network I/O overhead between the client and the DDC server. Default value depends on the DiaCollo index, but is usually 2. Higher values should result in shorter running times, but may filter out some interesting results. Note that this frequency threshold applies on a DDC subcorpus ("shard") basis.
- #FCOEF[C]
- Override the relation-specific frequency scaling coefficient for this query. For formal reasons, the independent frequencies f1, f2, and N are scaled up by a query-specific factor when computing score-functions from DDC count data, in order to ensure that the quantities involved can be interpreted as probabilities. The scaling coefficient is usually automatically guessed from the DDC query (e.g. C=2(N+1) for a query of the form NEAR(* =2,X,N), C=N+1 for a query of the form "* =2 #<N X", and C=1 for "* =2 #=N X").
- "[" l_countkeys "]"
- A groupby request can be explicitly wrapped in square brackets to force its interpretation as a DDC l_countkeys count-key list as opposed to a native groupby request. Potentially useful if you want/need to use alternative target offsets, bibliographic metadata fields, or regex transformations on the result tuple attributes.
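The arithmetic behind #DMAX and the automatically guessed #FCOEF values can be written out as a short Python sketch. The function names and the labels "near", "lt" and "eq" for the three query forms are hypothetical shorthand introduced here; only the numeric relations are taken from the descriptions above.

```python
def near_distance_from_dmax(dmax: int) -> int:
    """#DMAX[D] is one greater than the distance passed to NEAR() itself."""
    return dmax - 1


def guess_fcoef(query_form: str, n: int) -> int:
    """Frequency scaling coefficient C for the three query forms listed above.

    query_form is a hypothetical label:
      "near" : NEAR(* =2,X,N)  ->  C = 2(N+1)
      "lt"   : "* =2 #<N X"    ->  C = N+1
      "eq"   : "* =2 #=N X"    ->  C = 1
    """
    if query_form == "near":
        return 2 * (n + 1)
    if query_form == "lt":
        return n + 1
    if query_form == "eq":
        return 1
    raise ValueError(f"unknown query form: {query_form!r}")


# The usual default #DMAX[5] corresponds to a NEAR() distance of 4.
print(near_distance_from_dmax(5))   # 4
print(guess_fcoef("near", 4))       # 10
```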
Grouping
The groupby parameter can be used to specify which indexed attributes of the candidate collocates are to be projected and to impose optional restrictions on the values of those attributes. It can be informally understood as a combination of SQL's GROUP BY and HAVING clauses.
The value of the groupby parameter is a comma-separated list of grouping expressions gb_expr:
q_groupby | ::= | gb_expr ([\s,]+ gb_expr)* |
gb_expr | ::= | "$"? qnc_attr |
| | "$"? qnc_attr "=" qnc_value |
Only the attributes qnc_attr explicitly specified in the groupby clause are projected from candidate collocates. If you request, for example, groupby: Lemma, then a result-set will include at most one entry for the lemma "flood", even if that lemma occurs in your corpus with multiple part-of-speech tags (e.g. as both a noun and a verb). If instead you request groupby: Lemma, Pos, then the result-set will treat distinct (Lemma,Pos) pairs as distinct collocate items.
If the groupby expression is of the form ATTR=HAVING, the HAVING expression (qnc_value) is interpreted as a restriction on the candidate collocates' values for the associated projected attribute ATTR (qnc_attr). For example, groupby: Lemma, Pos=NN will return only those collocates with the PoS-tag "NN" (common noun), and groupby: Lemma, Pos=/^ADJ/ will return only collocates whose PoS-tag begins with "ADJ".
Note that the bubble and cloud formats only display the first projected collocate attribute by default, although the entire projected collocate n-tuple should be available through the "details" popup window display after (double-)clicking on a collocate item in the main display canvas.
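The projection and restriction behaviour of groupby described above can be mimicked on toy data with the following Python sketch. The data layout, the function name `group_collocates`, and the tag "VVFIN" for the verbal reading are assumptions made for illustration; DiaCollo itself evaluates groupby requests against its index rather than against Python objects.

```python
import re
from collections import Counter


def _matches(value, having):
    """Check one attribute value against an optional HAVING restriction."""
    if having is None:                      # attribute is projected, not restricted
        return True
    if isinstance(having, re.Pattern):      # restriction written as /REGEX/
        return having.search(value) is not None
    return value in having                  # restriction written as STRING|STRING...


def group_collocates(tokens, groupby):
    """Aggregate frequencies by the projected attributes.

    tokens  : iterable of (attribute dict, frequency) pairs
    groupby : list of (attribute name, restriction or None)
    """
    counts = Counter()
    for attrs, freq in tokens:
        if all(_matches(attrs[name], having) for name, having in groupby):
            counts[tuple(attrs[name] for name, _ in groupby)] += freq
    return dict(counts)


tokens = [
    ({"Lemma": "flood", "Pos": "NN"}, 3),
    ({"Lemma": "flood", "Pos": "VVFIN"}, 2),
    ({"Lemma": "green", "Pos": "ADJA"}, 4),
]
print(group_collocates(tokens, [("Lemma", None)]))
print(group_collocates(tokens, [("Lemma", None), ("Pos", {"NN"})]))
print(group_collocates(tokens, [("Lemma", None), ("Pos", re.compile(r"^ADJ"))]))
```

Grouping by Lemma alone merges the nominal and verbal readings of "flood" into one item, while adding Pos=NN or Pos=/^ADJ/ both projects the part-of-speech tag and filters on it.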
Score Functions
DiaCollo assigns each collocate in a unary profile a real-valued score by means of a user-specified score function. Currently, DiaCollo supports the following score functions:
Function | Description |
---|---|
f | Raw collocation frequency, score_f = f12. Despite its immediate and intuitive interpretability, ranking by raw frequency alone does not usually provide a very good picture of collocations' "significance", since high-frequency items such as determiners tend to get ranked highest simply by virtue of their (uninteresting) high overall likelihood, rather than any particular (and potentially interesting) affinity for the search term(s) in question. While the native DiaCollo profile types filter out determiners (and all other function words) by default, the basic problem of uninteresting high-frequency collocates (e.g. "Herr") remains for raw frequency rankings. |
fm | Collocation frequency per million tokens, score_fm = 1000000 * f12 / N. This is just a linear normalized variant of raw collocation frequency. |
lf | Collocation log-frequency, score_lf = log2(f12 + ɛ). This is just a logarithmic variant of raw collocation frequency. |
lfm | Collocation log-frequency per million, score_lfm = log2(1000000 * (f12+ɛ) / (N+ɛ)). This is just a logarithmic variant of the normalized collocation frequency. |
mi1 | Raw pointwise mutual information, score_mi1 = log2( ((f12+ɛ)*(N+ɛ)) / ((f1+ɛ)*(f2+ɛ)) ). Attempts to address the shortcomings of raw-frequency rankings by estimating the change in code-lengths for jointly encoded collocation pairs versus independent encoding of each collocate. It generally works well for high- and mid-frequency collocation pairs, but tends to return disproportionately large values for low-frequency collocates. |
mi3 | Pointwise mutual information variant using the cube of the raw co-occurrence frequency f12 to boost association scores for high-frequency pairs, score_mi3 = log2( ((f12+ɛ)^3 * (N+ɛ)) / ((f1+ɛ)*(f2+ɛ)) ). Heuristic score function investigated by both Evert (2004) and Rychlý (2008), attributed to Daille (1994). |
milf (alias: mi) | Pointwise mutual information * log-frequency product as described by Rychlý (2008), score_milf = log2( ((f12+ɛ)*(N+ɛ)) / ((f1+ɛ)*(f2+ɛ)) ) * log2( f12+ɛ ). Multiplying the raw PMI by the log-frequency of the collocation pair is a post-hoc attempt to ameliorate raw MI's preference for low-frequency collocates, but this strategy is not always successful. |
ld | Scaled log-Dice coefficient, score_ld = 14 + log2( 2*(f12+ɛ) / ((f1+ɛ) + (f2+ɛ)) ). Suggested by Rychlý (2008) as an association score for collocations and related to the intersection of fuzzy sets, the Dice coefficient is less susceptible to low-frequency outliers than pointwise mutual information or the PMI * log-frequency product while still managing to filter out most "chance" collocations with (uninteresting) high-frequency items returned by a raw-frequency ranking. This is the default score function used by the synchronic collocation profiler DWDS Wortprofil as described by Didakowski & Geyken (2013), and is also currently the default score function for the DiaCollo web front-end. |
ll | Variant of the popular binomial log likelihood ratio as suggested by Dunning (1993): score_ll = sgn_ll * log(1 + log λ), where sgn_ll = -1 if f12 < f1*f2/N and +1 otherwise, and log λ = f12*log(f12/(f1*f2/N)) + (f1-f12)*log((f1-f12)/(f1*(N-f2)/N)) + (f2-f12)*log((f2-f12)/((N-f1)*f2/N)) + (N-f1-f2+f12)*log((N-f1-f2+f12)/((N-f1)*(N-f2)/N)). The first term sgn_ll is a sign coefficient implementing a one-sided association measure following Evert (2008), which assigns non-negative values only to "attracting" collocate pairs which co-occur more often than expected, whereas the "pure" log-likelihood ratio also assigns large values to "repelling" pairs which co-occur less often than expected. Raw log-likelihood ratio values log λ tend to vary much more widely than e.g. scaled log-Dice coefficients, leading visualizations based on this quantity to over-emphasize a small number of very strong collocates and relegating weaker associations to the background. To ameliorate this effect, DiaCollo reports and scales based on the quantity log(1 + log λ). Values are unbounded, but are typically in the range [-10:10]. |
In the formulas above:
- w1: target token matching the user query request
- w2: collocate token matching the user groupby request
- N: total number of collocation relations in the current corpus epoch
- f12: frequency of the collocation (w1,w2) in the current corpus epoch
- f1: total frequency of the query term (w1) in the selected profile type for the current corpus epoch
- f2: total frequency of the collocate term (w2) in the selected profile type for the current corpus epoch
- ɛ: smoothing constant, by default zero.
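The score formulas above translate fairly directly into code. The following Python sketch is a transcription for experimentation, not DiaCollo's implementation: the function names are hypothetical, the convention that a zero observed count contributes 0 to log λ is an assumption, and so is the clamping of log λ at zero to guard against floating-point rounding.

```python
import math


def score_f(f12, f1, f2, n, eps=0.0):
    return float(f12)


def score_fm(f12, f1, f2, n, eps=0.0):
    return 1_000_000 * f12 / n


def score_lf(f12, f1, f2, n, eps=0.0):
    # Requires f12 + eps > 0; with the default eps=0 a zero frequency is an error.
    return math.log2(f12 + eps)


def score_lfm(f12, f1, f2, n, eps=0.0):
    return math.log2(1_000_000 * (f12 + eps) / (n + eps))


def score_mi1(f12, f1, f2, n, eps=0.0):
    return math.log2(((f12 + eps) * (n + eps)) / ((f1 + eps) * (f2 + eps)))


def score_mi3(f12, f1, f2, n, eps=0.0):
    return math.log2(((f12 + eps) ** 3 * (n + eps)) / ((f1 + eps) * (f2 + eps)))


def score_milf(f12, f1, f2, n, eps=0.0):
    return score_mi1(f12, f1, f2, n, eps) * math.log2(f12 + eps)


def score_ld(f12, f1, f2, n, eps=0.0):
    return 14 + math.log2(2 * (f12 + eps) / ((f1 + eps) + (f2 + eps)))


def _term(observed, expected):
    """observed * log(observed / expected), taking 0 * log(0) as 0 (assumption)."""
    return 0.0 if observed == 0 else observed * math.log(observed / expected)


def score_ll(f12, f1, f2, n, eps=0.0):
    expected12 = f1 * f2 / n
    log_lambda = (
        _term(f12, expected12)
        + _term(f1 - f12, f1 * (n - f2) / n)
        + _term(f2 - f12, (n - f1) * f2 / n)
        + _term(n - f1 - f2 + f12, (n - f1) * (n - f2) / n)
    )
    sign = -1.0 if f12 < expected12 else 1.0
    # Clamp at 0 to guard against tiny negative values from rounding (assumption).
    return sign * math.log(1.0 + max(log_lambda, 0.0))


SCORES = {"f": score_f, "fm": score_fm, "lf": score_lf, "lfm": score_lfm,
          "mi1": score_mi1, "mi3": score_mi3, "milf": score_milf, "mi": score_milf,
          "ld": score_ld, "ll": score_ll}

# Hypothetical counts for one collocation pair in one epoch.
for name in ("f", "fm", "mi1", "ld", "ll"):
    print(name, SCORES[name](f12=10, f1=100, f2=50, n=10_000))
```

A pair that always co-occurs (f12 = f1 = f2) reaches the log-Dice maximum of 14, which makes the scaled value easy to sanity-check.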
Diff Operations
In comparison ("diff") mode, DiaCollo computes an aggregate score diff_diff(sa,sb) for a pair of independent item scores sa and sb by applying a binary "diff operation" as dictated by the diff request parameter. In addition to determining the aggregate score to be associated with a pair of independent score operands, the choice of diff operation also determines the method by which returned items are to be ranked and selected for return e.g. by the kbest parameter, as well as the domain over which the diff operation is to be applied: pre-trimmed diff operations act only on the up to 2*KBEST items in (dom(kbest(pa)) ∪ dom(kbest(pb))), restricted diff operations act on the domain intersection (dom(pa) ∩ dom(pb)), while non-trimmed operations act on the entire domain union (dom(pa) ∪ dom(pb)), where pa and pb are the unary profiles resulting from independent evaluation of the query and bquery request parameters qa and qb, respectively. Currently, DiaCollo supports the following diff operations:
Operation | Description |
---|---|
diff | Raw score difference (pre-trimmed, asymmetric): diff_diff(sa,sb) = sa-sb. Useful for selecting collocates strongly associated only with qa. |
adiff | Pseudo-absolute score difference (pre-trimmed, symmetric). Selects based on diff_adiff(sa,sb) = |sa-sb|, but returns raw differences as for raw diff. This is the default diff operation, which selects the most extreme differences among the prominent collocates of qa and qb, regardless of the direction of those differences, which itself is expressed as the sign of the returned diff score. |
sum | Score sum (symmetric): diff_sum(sa,sb) = sa+sb. Selects strong associations for either qa or qb, preferring shared associations, but not very sensitive to non-uniform operand values (e.g. diff_sum(0,8) = diff_sum(4,4) = 8, but only the latter configuration indicates similar collocation behavior of the associated collocates). Returned rankings are equivalent to the avg operation. |
min | Score minimum (restricted, symmetric): diff_min(sa,sb) = min(sa,sb). Punishes non-uniform operand values by selecting only the weaker of the operand association scores. Highly sensitive to sparse data problems, since missing data are assigned scores of 0 (zero). |
max | Score maximum (pre-trimmed, symmetric): diff_max(sa,sb) = max(sa,sb). Selects only the stronger of the operand association scores. Potentially useful for discovering interesting target collocations for further investigation. |
avg | Score average (restricted, symmetric): diff_avg(sa,sb) = avg(sa,sb) = (sa+sb)/2. Suffers from the same drawbacks as the sum operation, to which the returned rankings are equivalent, although values are more immediately comparable to those returned for unary profiles. |
havg | Pseudo-harmonic average (restricted, symmetric): diff_havg(sa,sb) ~ havg(sa,sb) = 2*(sa*sb)/(sa+sb). Selects collocates with strong associations to both qa and qb. Attempts to address the shortcomings of the sum and avg diff operations by penalizing dissimilar operand values. In order to avoid singularities resulting from sparse data, this operation actually computes the arithmetic average of the harmonic and raw arithmetic means, i.e. diff_havg(sa,sb) = avg( havg(sa,sb), avg(sa,sb) ). |
gavg | Pseudo-geometric average (restricted, symmetric): diff_gavg(sa,sb) ~ gavg(sa,sb) = sqrt(sa*sb). Selects collocates with strong associations to both qa and qb, similar to the havg operation. To avoid singularities resulting from sparse data, this operation actually computes the arithmetic average of the geometric and the raw arithmetic means, analogous to the method used to compute the havg diff score. |
lavg | Pseudo-logarithmic average (restricted, symmetric): diff_lavg(sa,sb) ~ exp( avg(log(sa),log(sb)) ). Selects collocates with strong associations to both qa and qb, penalizing dissimilar operand values logarithmically. To avoid negative log values, the target values are forced into the range [1,∞] before averaging. |
Profile Types
DiaCollo offers several different methods for acquiring raw frequency data on the basis of which to score, rank, and select "significant" collocations. These methods are referred to here as "profile types", and the data returned as "profiles". Currently, DiaCollo supports the following profile types:
- collocations
- (aliases: colloc cof c f12 2)
Native collocation profile. Retrieves and ranks all content words (w2) occurring together with the search term (w1) within a context window of dmax content words and without an intervening boundary of the selected DDC "break collection". Only collocation tuple-pairs with a minimum frequency of fmin are considered. Note that for reasons of efficiency, the frequency threshold fmin, the context-window size dmax, and the boundary DDC break collection must be specified at compile-time, and cannot be changed by the user. The default DiaCollo configuration for corpora at the BBAW uses sentence-break boundaries, dmax = 5, and fmin = 5.
- unigrams
- (aliases: ug u f1 1)
Native unigram profile. Retrieves and ranks all terms matching the search query (w1). Can be useful together with prefix-, suffix-, or regular-expression queries in order to profile e.g. stems in compounds. Currently, DiaCollo cannot acquire an independent value for the variable f2 for unigram profiles (since this would entail very large prefix- resp. suffix-indices and/or regular expression operations unsupported by the underlying library), so that f2 = f12 for each item returned. As a consequence, the mutual information and log-Dice score function rankings are isomorphic to the raw frequency score function for this profile type.
- tdf
- (aliases: tdf tdm TDF TDM)
Native collocation profile based on an underlying term-document frequency matrix. Retrieves and ranks all content words (w2) occurring together with the search term (w1) within a single "document" as defined by the DDC "break collection" specified via the compile-time option -dbreak (by default, source paragraphs are used as "document" boundaries for matrix construction). Allows more flexible queries and result-set aggregation than the simple collocations profiles, but is generally slower to evaluate and less sensitive to proximity effects.
- ddc
- (aliases: ddc DDC)
Advanced profile using count() queries submitted to an independent DDC search engine to acquire raw frequency data. Unlike the collocations or unigrams profile types, DDC profiles are not limited to the default break collection or a fixed context window size (although by default DDC profiles approximate the collocations profile type by means of the NEAR() operator). Indeed, DDC profiling can use the full range of the DDC query language to express search queries, provided that an explicit match-id =2 is included to indicate the position(s) of the collocate target(s) for which a profile is to be computed. See Query Syntax for more details. Note that the DDC profile type does not currently support regular expression date restrictions. Note also that DDC profiles are usually much slower and more memory-intensive than their native counterparts, and should be avoided if possible.
- diff-collocations
- (aliases: diff-colloc diff-cof diff-c diff-f12 diff-2 d2)
Comparison of two native collocation profiles. "Diff" queries compute independent profiles pa and pb for the query and bquery parameters, respectively. After ranking according to the selected score function, a comparison-profile pa-b is computed as score(pa-b) = diff_diff(score(pa), score(pb)) by applying the selected diff operation, and the $KBEST items are selected and returned based on their aggregate diff scores.
- diff-unigrams
- (aliases: diff-ug diff-u diff-f1 diff-1 d1)
Comparison of two native unigram profiles. See diff-collocations for details on comparison profiles.
- diff-tdf
- (aliases: diff-tdf diff-TDF dtdf dTDF)
Comparison of two TDF profiles. See diff-collocations for details on comparison profiles.
- diff-ddc
- (aliases: diff-DDC dDDC dddc)
Comparison of two DDC profiles. See diff-collocations for details on comparison profiles. Note that DDC (diff) profiles are usually slower and more memory-intensive than their native counterparts, and should be avoided if possible.
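To make the roles of dmax, fmin, and the break boundary in the native collocations profile more concrete, here is a toy Python sketch of window-based co-occurrence counting. It is not how DiaCollo builds its index: the function name `count_collocations` is hypothetical, sentences are assumed to be pre-filtered lists of content-word lemmas, and dmax is interpreted as a maximum positional distance between w1 and w2 (so dmax = 5 allows up to 4 intervening tokens).

```python
from collections import Counter


def count_collocations(sentences, w1, dmax=5, fmin=1):
    """Count f12 for every w2 within dmax positions of w1, never crossing a sentence."""
    f12 = Counter()
    for sentence in sentences:              # each sentence is one "break" unit
        for i, token in enumerate(sentence):
            if token != w1:
                continue
            lo = max(0, i - dmax)
            hi = min(len(sentence), i + dmax + 1)
            for j in range(lo, hi):
                if j != i:
                    f12[sentence[j]] += 1
    # Only pairs reaching the minimum frequency fmin are kept.
    return {w2: f for w2, f in f12.items() if f >= fmin}


sentences = [
    ["Maus", "Katze", "jagen"],
    ["Maus", "Tastatur", "klicken"],
    ["Katze", "Maus", "fangen"],
]
print(count_collocations(sentences, "Maus", dmax=1, fmin=2))   # {'Katze': 2}
```

Because the window never extends past a sentence boundary and rare pairs are dropped, the counts are generally lower than the number of raw corpus hits for the same word pair.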
Output Formats
DiaCollo currently supports the following output formats:
- txt
- (aliases: text txt t tsv csv)
TAB-separated UTF-8 plain text output, suitable for importing into the spreadsheet application of your choice, e.g. Gnumeric or LibreOffice Calc.
- json
- (aliases: json js j)
Native JSON format suitable for further automated processing, web-services, etc.
- html
- (aliases: html htm)
Simple HTML table format, used for live display in the demo interface. In addition to a tabular display of the profile data, the web front-end HTML display uses JavaScript to generate hyperlinks to (close approximations of) underlying corpus hits ("KWIC-links") as well as a color-coded representation of the association score (resp. score difference for diff profiles) associated with each row. Due to the implicit compile-time filtering of native index data by content words and the index parameters dmax and fmin, the number of hits returned by the KWIC-links for native collocation profiles may differ somewhat from the f12 pair frequency reported in the table. DDC profiles, however, should report f12 values identical to the number of corpus hits returned by the associated KWIC-links.
- storable
- (aliases: storable sto bin)
Binary format using the Perl Storable module, suitable for further automated processing with Perl.
- gmotion
- (aliases: gmotion gm)
Online visualization using Google Motion Chart. Requires Flash Player. For best results, it is recommended that you set the global parameter to a true value (e.g. 1 (one)) when using this output format. See also Martin Hilpert's motion chart resource page for some examples, use cases, and discussion.
- hichart
- (aliases: hc hi chart hichart highchart highcharts)
Online visualization using the Highcharts JavaScript library. For best results, it is recommended that you set the global parameter to a true value (e.g. 1 (one)) when using this output format. Clicking on any data point causes a popup window to be displayed containing hyperlinks to corpus hits for the corresponding collocation pairs, as for the HTML format; see the HTML format description for the associated caveats.
- bubble
- (aliases: b bub bubble bubbles)
Online interactive visualization using the D3.js JavaScript library force-layout. Collocates are displayed as labelled circles whose radii and color represent the corresponding association score. Node colors are the same hues as those used in the HTML table format, but may appear "washed-out" or "pastel-ized" due to their (partial) transparency. For best results, it is recommended that you set the global parameter to a true value (e.g. 1 (one)) when using this output format.
- cloud
- (aliases: c cl cld cloud)
Online interactive visualization using Jason Davies' d3-cloud layout for the D3.js JavaScript visualization library. Collocates are displayed as text labels whose size and color represent the corresponding association score. Node colors are the same hues as those used in the HTML table format, but will appear somewhat darker ("dirty" or "shadowed") for better legibility on a white background. For best results, it is recommended that you set the global parameter to a true value (e.g. 1 (one)) when using this output format.
Keyboard Bindings
The online visualizations based on the D3.js JavaScript library (bubble, cloud) support the following keyboard shortcuts whenever the display canvas has the keyboard focus (as indicated by a drop-shadow around the canvas itself as well as the icon in the header area):
Key(s) | Action |
---|---|
space | toggle playback animation |
up-arrow | increase playback speed (coarse) |
down-arrow | decrease playback speed (coarse) |
shift+up-arrow | increase playback speed (fine) |
shift+down-arrow | decrease playback speed (fine) |
number 1 or 0 | reset playback speed to default (1×) |
number 2-9 | set playback speed to N × default |
shift+number 2-9 | set playback speed to 1/N × default |
Home | snap to first epoch |
End | snap to final epoch |
left-arrow | snap to previous whole epoch |
right-arrow | snap to next whole epoch |
shift+left-arrow | interpolate backward by one quarter epoch |
shift+right-arrow | interpolate forward by one quarter epoch |
x | export snapshot of current display canvas as SVG |
Examples
Basic Examples
- {q:Kohl ; slice:10 ; score:f ; kbest:1 ; p:2} The most frequent collocates of "Kohl" by decade in the current corpus.
- {q:Kohl ; slice:10 ; score:ld ; kbest:1 ; p:2} "Kohl"-collocates in the current corpus ranked by log Dice-coefficient instead of raw frequency.
- {q:Mann ; slice:0 ; score:ld ; kbest:10 ; p:2} The 10 best collocates of "Mann" over the entire corpus (slice=0).
- {q:/panzer$/ ; slice:5 ; score:f ; kbest:1 ; p:1} The most frequent "-panzer" unigrams in 5-year epochs.
- {q:/mann$/ ; slice:1 ; date:1980:1989 ; score:f ; kbest:1 ; p:1} The most frequent "-mann" unigrams by year in the 1980s of the DWDS Kerncorpus.
- {q:Maus ; slice:1 ; date:1970:1999 ; score:ld ; kbest:2} The 2 best collocates of "Maus" in 1-year epochs between 1970 and 1999 ranked by log Dice-coefficient in the DWDS Kerncorpus.
- {q:Maus ; slice:1 ; date:1970:1999 ; score:ld ; kbest:10 ; f:gm} ... as above, with kbest=10 as a Google motion chart; watch e.g. "Katze" and "Tastatur".
- {q:Krise ; slice:10 ; score:ld ; kbest:10 ; f:cloud ; gb:l,p=NE} 10-best proper name collocates of the noun Krise ("crisis") in 10-year epochs over the German weekly newspaper DIE ZEIT, displayed as a d3 tag-cloud.
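Since parameters are passed as ordinary parameter=value pairs, the brace notation used in these examples maps naturally onto a URL query string. The Python sketch below shows one way to do the conversion; the helper name `example_to_params` is hypothetical, the parsing assumes that parameter values contain no semicolons, and the address of the web-service to which the query string would be appended depends on the DiaCollo instance and is deliberately left out.

```python
from urllib.parse import urlencode


def example_to_params(example: str) -> dict:
    """Turn "{key:value ; key:value ...}" notation into a parameter dict."""
    body = example.strip()
    if body.startswith("{") and body.endswith("}"):
        body = body[1:-1]
    params = {}
    for part in body.split(";"):
        # Split on the first colon only, so that date ranges like 1980:1989 survive.
        key, sep, value = part.strip().partition(":")
        if sep:
            params[key.strip()] = value.strip()
    return params


params = example_to_params("{q:Kohl ; slice:10 ; score:f ; kbest:1 ; p:2}")
print(params)
print(urlencode(params))   # q=Kohl&slice=10&score=f&kbest=1&p=2
```

`urlencode` percent-encodes reserved characters, so a date range such as 1980:1989 is sent as `1980%3A1989`.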
Diff Examples
- {q:wissen ; bq:glauben|denken|vermuten ; slice:0 ; p:d2} Compare collocates of "wissen" versus any of {"glauben","denken","vermuten"} over the entire current corpus.
- {q:wissen ; bq:@{glauben,denken,vermuten} ; slice:0 ; p:d2} ... same as above, using DDC query syntax.
- {q:Mann ; bq:Frau ; slice:25 ; p:d2} Compare collocates of "Mann" vs. "Frau" in 25-year epochs.
- {q:Mann ; bq:Frau ; slice:25 ; p:d2 ; f:gm} ... same as above, as a Google motion chart in the DTA corpus: "Kind" (child) is interesting to watch here.
- {q:Mann ; bq:Frau ; date:1700:1899 ; slice:25 ; p:d2 ; diff:havg ; f:cloud ; global:1} ... same as above, as a d3 tag-cloud, using the global option and the harmonic mean diff operation to highlight invariant similarities in the targets' collocation behavior.
- {q:Einheit ; bq:Demokratie ; slice:5 ; p:d2 ; gb:l,p=ADJA ; f:gm} Adjective collocates of "Einheit" versus "Demokratie" as a Google motion chart in the Spiegel (print) corpus: watch "sozialistisch" and "arabisch".
- {q:Mann ; bq:Mann ; slice:0 ; adate:1900:1959 ; bdate:1960:1999 ; p:d2} Compare collocates of "Mann" in the interval 1900-1959 vs. the interval 1960-1999 in the DWDS Kerncorpus.
- {q:/^Ehe./ ; bq:/.mann$/ ; slice:0 ; p:d1} Compare unigrams matching "Ehe-" vs. those matching "-mann". These results are somewhat difficult to interpret, but the intersection ("Ehemann") should have a score difference of 0 (zero), and will probably not be displayed at all.
Attribute Examples
- {q:Mann p=NE ; slice:0 ; kbest:10} 10 best collocates of "Mann" occurring as a proper name.
- {q:$l=@Mann with $p=NE ; slice:0 ; kbest:10} ... same as above, using DDC query syntax.
- {q:Mann p=NN ; bq:Frau p=NN ; date:1600:2000 ; bdate:1600:2000 ; slice:100 ; kbest:10 ; gb:l,p=ADJA ; p:d2} Compare 10 best adjective collocates of "Mann" versus "Frau" as common nouns in 100-year epochs over the DTA+DWDS Corpus, aggregating results by LEMMA,POS pairs.
TDF Examples
- {q:Maus ; slice:10 ; gb:l,p=NN ; p:tdf ; global:1} Noun collocates of "Maus" by selected document-break in the current corpus, using the tdf profile class.
- {q:Mann #has[textClass,Wissenschaft*] ; bq:Mann #has[textClass,Belletristik*] ; slice:0 ; gb:l,p=ADJA ; p:diff-tdf} Adjective document-collocates of "Mann" in text-class "Wissenschaft" versus text-class "Belletristik" in the DTA corpus.
- {q:Vernunft !#has[textClass,/^$/] ; gb:genre ; date:1600:1899 ; slice:50 ; p:tdf ; f:hichart} Profile 17th- to 19th-century use of the term `Vernunft' ("reason") by primary text-class using the DTA corpus and displaying the results as a static 2d Highcharts plot.
- {q:Vernunft ; date:1700:1899 ; slice:50 ; k:4 ; gb:author ; p:tdf ; f:cloud} Profile 18th- to 19th-century authors' predilection for the term `Vernunft' ("reason") using the DTA corpus and displaying the results as a d3 tag-cloud.
- {q:Vernunft #has[author,/Kant/] ; bq:Vernunft !#has[author,/Kant/] ; slice:0 ; gb:l,p=ADJA ; p:diff-tdf} Paragraph-level adjective collocates of `Vernunft' ("reason") in the works of Immanuel Kant versus other authors in the DTA corpus.
- {q:* #has[author,/Kant/] ; bq:* #has[author,/Hegel/] ; slice:0 ; gb:l,p=NN ; p:diff-tdf ; k:50 ; f:cloud} Compare nouns typically employed by Immanuel Kant versus Georg W. F. Hegel in the DTA corpus, displaying the results as a d3 tag-cloud.
- {q:* #has[author,/Kant/] ; bq:* #has[author,/Hegel/] ; slice:0 ; gb:l,p=NN ; p:diff-tdf ; D:havg ; k:50 ; f:cloud} As above, highlighting characteristic similarities via harmonic average rather than differences as returned by the default absolute difference comparison.
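The contrast between a difference comparison and the harmonic average (havg) used in the examples above can be illustrated with a small Python sketch. This is not DiaCollo's implementation: the function name, the operation label "difference", the treatment of missing collocates as score 0, and the toy scores are all assumptions made for illustration, and the harmonic average is only meaningful here for non-negative scores.

```python
def compare_profiles(a, b, op="difference", k=10):
    """Combine two {collocate: score} profiles and keep the k items
    with the largest absolute combined score.

    Assumption: a collocate missing from one profile scores 0.0 there.
    """
    def difference(x, y):
        return x - y

    def havg(x, y):
        # Harmonic average; defined as 0.0 when both scores are 0.
        return 0.0 if x + y == 0 else 2 * x * y / (x + y)

    combine = {"difference": difference, "havg": havg}[op]
    combined = {
        item: combine(a.get(item, 0.0), b.get(item, 0.0))
        for item in a.keys() | b.keys()
    }
    # Sort by descending magnitude, then by name for a stable order.
    ranked = sorted(combined.items(), key=lambda kv: (-abs(kv[1]), kv[0]))
    return ranked[:k]


# Toy scores, invented for this sketch.
mann = {"Kind": 8.0, "Bart": 6.0}
frau = {"Kind": 8.0, "Rock": 4.0}
print(compare_profiles(mann, frau, "difference", k=2))
print(compare_profiles(mann, frau, "havg", k=1))
```

With these toy values, the shared collocate "Kind" has a difference of 0 and is pruned from the difference ranking, whereas it ranks first under the harmonic average, which rewards collocates that score highly for both targets.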
DDC Examples
- {q:Mann ; slice:0 ; gb:l,p=ADJA ; p:ddc} Adjective collocates of "Mann" in the current corpus using DDC back-end. Note that the frequencies returned will likely never be identical to those returned by the collocations profile-type, due to lack of content-word filtering in DDC queries.
- {q:near(* =2, Mann, 4) ; slice:0 ; gb:l,p=ADJA ; p:ddc} ... same as above, with explicit DDC NEAR() operator and match-id for the collocate target term.
- {q:"* =2 Mann" ; slice:0 ; gb:l,p=ADJA ; p:ddc} Immediate adjective predecessors of "Mann", using implicit restriction of candidate collocates via the groupby clause.
- {q:"$p=ADJA =2 Mann" ; slice:0 ; gb:l ; p:ddc} ... as above, grouping collocates by lemma only and expressing the collocate restriction to attributive adjectives (ADJA) directly in the query clause.
- {q:"$p=ADJA =2 Mann" ; slice:0 ; gb:[$l,textClass] ; p:ddc} ... as above, using DDC count-by syntax in the groupby clause and including the "textClass" metadata attribute in the profile target tuples.
- {q:"(Getränk|gn-sub WITH $p=NN)=2 (trinken WITH $p={VVINF,VVPP})" #FMIN 1 ; date:1600:1899 ; slice:25 ; p:ddc ; f:bubble ; global:1} Consumption of potable liquids in the DTA corpus, using GermaNet term expansion via the gn-sub pipeline as a d3 bubble chart.
- {q:"* =2 Mann" #has[textClass,Wissenschaft*] ; bq:"* =2 Mann" #has[textClass,Belletristik*] ; slice:0 ; gb:l,p=ADJA ; p:diff-ddc} Adjective collocates of "Mann" in primary text-class "Wissenschaft" (science) versus "Belletristik" (belles lettres) in the DTA corpus.
- {q:Vernunft #has[author,/Kant/] ; bq:Vernunft !#has[author,/Kant/] ; slice:0 ; gb:l,p=ADJA ; p:diff-ddc} Adjective collocates of `Vernunft' ("reason") in the works of Immanuel Kant versus other authors in the DTA corpus.
More Examples
... are welcome; please drop me a line if you find a good one!
Fiendishly Awkward Questions
Corpora
- Can I use DiaCollo on my own corpus?
- Sure - check out the DiaColloDB and DiaColloDB::WWW distributions on CPAN.
- Consider using cpanm for batch installations.
- DiaCollo assumes a UNIX-like environment (various flavors of Linux work great).
- KWIC-links and DDC profiles require an independently configured DDC index and server.
- What languages are supported?
- pretty much any written language ought to work: DiaCollo aspires to be language-agnostic.
- What corpus formats are supported?
- input data must be encoded in UTF-8
- only pre-tokenized and pre-annotated formats are supported; see SUBCLASSES in DiaColloDB::Document(3pm) for a list.
- Why must I tokenize and annotate my corpus myself?
- one tool ⇔ one job
- language agnosia ↝ flexibility
- DiaCollo is not an all-singing+dancing, one-stop-shopping text analysis tool (and almost certainly never will be)
- consider using CLARIN-D WebLicht if you want a generic corpus annotation framework
- Can I use DiaCollo to directly compare different corpora?
- ... on the command line:
- pass a list:// URL to dcdb-query.perl or dcdb-www-server.perl
- beware the fudge and extend properties!
- ... from the dstar WWW GUI: only for pre-aggregated corpora
- What is "DDC" and why might I care?
- DDC ("DiaLing/DWDS Concordancer") is an open-source corpus search engine used by the DWDS, DTA, and ZDL projects at the BBAW.
- "D*" ("dstar") is the corpus management framework used by the DWDS, DTA, and ZDL projects at the BBAW. It employs DDC as a low-level corpus search engine, and optionally includes additional auxiliary indices (including DiaCollo itself) and RESTful web-wrappers (chances are high that you're accessing this document through a D* web wrapper).
- Both DiaCollo and DDC indices can be managed by the D* corpus management framework, and any D* DiaCollo instance at the BBAW should have an associated DDC index and server.
- DDC is not "part of" DiaCollo and is not included with the DiaCollo source distribution; nor is DiaCollo "part of" DDC. DiaCollo and DDC are independent modular software packages which - when correctly configured - can play together nicely.
- A functional DDC corpus index and a running ddc_daemon server are required for evaluating DiaCollo's KWIC-link approximations and DDC profile relation.
- Configuration, compilation, maintenance, and usage of D*, DDC, and/or ddc_daemon is beyond the scope of this FAQ.
- How large does my corpus need to be in order to get reliable results?
- Epoch (slice) size is more relevant than total corpus size: if you just want a larger sample for a particular query, try increasing the value of the slice parameter (↝ reduce diachronic granularity).
- A "good" epoch size depends on both the absolute and relative frequency of the target phenomenon as specified by the query, groupby, and date parameters. Since frequencies of natural language phenomena can vary widely, there is no reliable "one-size-fits-all" strategy for determining a "good" epoch size independently of the corpus distribution and the other user query parameters; see Gabrielatos et al. (2012) for a more detailed discussion.
- Beware compile-time filters (-tfmin, -lfmin, -Opgood, etc.), server-side pruning thresholds, and k-best pruning
- the indexing option -use-all-the-data disables all compile-time thresholds for the native and TDF relations
- the #FMIN operator disables runtime pruning for the DDC relation
- corpus artefacts are always possible (e.g. "Pferdebuckel", Krise→Tolstoj)
- completely subjective, non-rigorous, & informal recommendation:
- your chances are pretty good if min{f1,f2} ≥ 100 (using the variables described above) ... but DiaCollo can also produce interesting results from small corpora and/or epochs well below this threshold!
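To see why very low frequencies make association scores fragile, consider the log Dice coefficient as defined by Rychlý (2008), which the score:ld examples refer to. The Python sketch below implements only that published formula, 14 + log2(2·f12 / (f1 + f2)); DiaCollo's own computation may differ in detail (e.g. smoothing), and the function name is hypothetical.

```python
import math


def log_dice(f12, f1, f2):
    """Log Dice after Rychly (2008): 14 + log2(2*f12 / (f1 + f2)).

    f12 is the co-occurrence frequency of the pair, f1 and f2 are the
    independent frequencies of its members; requires f12 > 0.
    """
    return 14 + math.log2(2 * f12 / (f1 + f2))


# With tiny frequencies, one additional co-occurrence moves the score
# by a full point ...
print(log_dice(1, 2, 2), log_dice(2, 2, 2))

# ... whereas with larger frequencies the same change barely matters.
print(log_dice(101, 1000, 1000) - log_dice(100, 1000, 1000))
```

The larger min{f1,f2} is, the less a single chance co-occurrence can distort the ranking, which is the intuition behind the informal threshold above.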
Runtime
- Can I download DiaCollo results for offline use?
- Probably yes: most output formats supported by DiaCollo can be saved to your local computer and used offline.
- Static tabular formats such as txt, html, json, and storable can be downloaded by clicking on the "Raw URL" link above the main display canvas and using your browser's "Save As" function (typically invoked by right-clicking on a blank area of the page display and choosing "Save As" from the context menu).
- A snapshot of a single epoch visualization in the bubble or cloud format can be exported by clicking on the download icon in the upper right corner of the display canvas.
- An interactive GUI snapshot for the bubble or cloud visualization formats can be saved to your local computer by using your browser's "Save As (Web-page, complete)" function (typically invoked by right-clicking on a blank area of the page display and choosing "Save As" from the context menu). Note that you will not be able to submit changes to any query parameters from such an offline snapshot.
- Google motion charts don't support offline use at all.
- How can I restrict the profiled collocates to immediate predecessors?
- Use the DDC profile relation together with a phrase query, e.g. "*=2 Mann" #FMIN 1; see also the corresponding phrase-query example under "DDC Examples".
- Why does my collocant appear as a collocate for itself?
- Self-collocations are never counted for a single token. Co-occurrences between multiple tokens of a single type are counted twice (yes, this is a wart, but it's not the wart you probably think it is). Consider for example the sentence "Flies flew". This sentence contains 2 tokens (the individual words "Flies" and "flew"). Both of these are instances of a single lemma type (let's call it "FLY"), so that at indexing time, DiaCollo would index the sentence as "FLY FLY", which would result in 2 co-occurrences being counted for the lemma-pair ("FLY","FLY"): one occurrence left-to-right ("Flies","flew") and one occurrence right-to-left ("flew","Flies"). If only lemmata are being indexed, these would be true identity pairs and could be handled by a special exception at indexing time. Usually however, additional attributes will be indexed (such as part-of-speech tag or surface form), so the actual co-occurrence pairs are more likely to be complex and not strictly identical; in this case something like ("FLY/NNS:Flies","FLY/VBD:flew"). Further, it is entirely possible that lexical items do in fact self-collocate, as in the sentence "Fly, fly away!".
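The counting behavior described for "Flies flew" can be reproduced with a few lines of Python. This is only a sketch of the principle (ordered pairs over distinct token positions within one sentence); it ignores content-word filtering and any co-occurrence window, and the function name is hypothetical.

```python
from collections import Counter
from itertools import permutations


def cooccurrences(tokens):
    """Count ordered co-occurrence pairs between distinct token
    positions, so a token never co-occurs with itself."""
    pairs = Counter()
    for i, j in permutations(range(len(tokens)), 2):
        pairs[(tokens[i], tokens[j])] += 1
    return pairs


# Indexing lemmata only: both tokens collapse to the type "FLY",
# and the pair ("FLY", "FLY") is counted once per direction.
print(cooccurrences(["FLY", "FLY"]))

# Indexing additional attributes: the two pairs are no longer identical.
print(cooccurrences(["FLY/NNS:Flies", "FLY/VBD:flew"]))
```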
- Why does my collocate item g "disappear" in epoch e?
- It may not be among the k-best collocates in epoch e. By default, DiaCollo prunes results to the k-best items in each epoch independently.
- Raising the kbest parameter will allow more collocates to be displayed per epoch.
- Setting the global option to a true value should cause the exact same set of collocates to be displayed in every epoch (possibly with f12 = 0).
- If you just want to check the association scores (or indexed frequencies) for a particular item (or items), you can use the groupby parameter to select the collocate(s) of interest.
- It may have been omitted from the native index by compile-time filters (-cfmin, -tfmin, -Opgood, etc.). DiaCollo cannot return or display data which hasn't been indexed.
- If you have compiled your own DiaCollo index, try lowering the compile-time filter thresholds or specifying the -use-all-the-data flag to dcdb-create.perl.
- It may have been suppressed by server-side pruning for the DDC profile relation. Try specifying the #FMIN 1 query operator.
- It may not occur at all (or not co-occur with your collocant) in the specified epoch. DiaCollo indices only store frequency data for attested phenomena (f>0). Try raising the value of the slice parameter or setting it to 0 (zero) for a corpus-global profile.
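The difference between independent per-epoch pruning and a shared collocate set can be sketched in Python as follows. The scores are invented toy data, the function names are hypothetical, and the rule used to pick the shared set (rank each collocate by its best score in any epoch) is an assumption for illustration, not a description of how the global option is actually implemented.

```python
def kbest_per_epoch(profile, k):
    """Keep the k highest-scoring collocates of each epoch independently."""
    return {
        epoch: sorted(scores, key=scores.get, reverse=True)[:k]
        for epoch, scores in profile.items()
    }


def kbest_shared(profile, k):
    """Keep one shared set of k collocates for all epochs.

    Assumption: collocates are ranked by their best score in any epoch.
    """
    best = {}
    for scores in profile.values():
        for item, score in scores.items():
            best[item] = max(score, best.get(item, score))
    keep = sorted(best, key=best.get, reverse=True)[:k]
    return {epoch: list(keep) for epoch in profile}


# Toy profile: {epoch: {collocate: score}}
profile = {
    1970: {"Katze": 9.0, "Loch": 7.0},
    1990: {"Tastatur": 10.0, "Katze": 8.0, "Loch": 3.0},
}
print(kbest_per_epoch(profile, 1))  # "Katze" drops out of epoch 1990
print(kbest_shared(profile, 2))     # same two collocates in both epochs
```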
- Why does the D3 date-slider (bubble, cloud) "snap" to epoch boundaries?
- A DiaCollo profile includes at most one data point per candidate collocate per epoch, where epochs are treated as opaque ordinal classification categories; see DiaColloDB::Profile::Multi(3pm). For visual clarity, size and color of collocate items in DiaCollo's D3.js visualization formats (bubble and cloud) are linearly interpolated between discrete epochs by the JavaScript GUI code, causing them to smoothly grow and/or shrink in the GUI animation rather than suddenly "jumping" to a new state. Manually adjusting ("dragging") the position of the date-slider will however cause it to "snap" to the nearest epoch boundary, since that is the closest ordinal category for which the profile has actual data points.
- Why does the collocation pair (q,q) appear at epoch e (even though I know it doesn't really occur until later)?
- Epochs are labelled by their minimum possible element (year), so for slice=E, the epoch e represents the date interval [e .. e+E-1] (e.g. for slice=10, the epoch "1980" represents the interval 1980-1989).
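The labelling convention can be written out as a short Python sketch. The interval [e .. e+E-1] is taken directly from the description above; that a year is assigned to its epoch by rounding down to a multiple of the slice size is an assumption consistent with the 1980 example, and both function names are hypothetical. Only slice sizes greater than zero are handled.

```python
def epoch_label(year, slice_size):
    """Label of the epoch containing `year`, assuming epochs start at
    multiples of slice_size (slice_size > 0)."""
    return year - year % slice_size


def epoch_interval(label, slice_size):
    """Date interval [e .. e+E-1] covered by epoch `label`."""
    return (label, label + slice_size - 1)


# A pair first attested in 1987 is reported under the epoch "1980"
# when slice=10, because that epoch covers 1980-1989.
label = epoch_label(1987, 10)
print(label, epoch_interval(label, 10))
```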
- Why don't the corpus KWIC links always return exactly f12 hits?
- DiaCollo itself does not create or maintain a full-text index (one tool ⇔ one job); retrieval of actual corpus hits is performed by an independent DDC server.
- DiaCollo's KWIC links are nothing more or less than dynamically generated DDC search queries which – when evaluated by an appropriate DDC server (typically when you click on such a link from the DiaCollo GUI) – should closely approximate the corpus hits for the associated collocation pair.
- The DDC queries generated by DiaCollo for its KWIC links are only approximations in the case of native profile relations, since there are no equivalent DDC query expressions for many of DiaCollo's compile-time filters (e.g. content-word filtering).
- If you need to ensure exact results, use the DDC relation together with the #FMIN 1 query operator.
Errors
- Error: DiaColloDB::Document::CLASS: cannot load file ...
- This error message can be emitted at indexing time by dcdb-create.perl if the corpus input data you supplied does not appear to be formatted correctly. Ensure that your data is formatted properly and that you have specified the correct -dclass=CLASS option to dcdb-create.perl.
- Error: No 'query' parameter specified
- This message indicates that your request did not include a collocant query specification by means of the query parameter (or, on the command line, the QUERY1 argument to dcdb-query.perl). It can appear in the HTML GUI before any request has been submitted. The query parameter is required.
- Error: No data to display
- No index entries matched your request. Underlying causes may vary, but possible reasons include:
- Typographical errors in your request's parameters, in particular query or date.
- Matching co-occurrences were omitted from the index due to compile-time filtering of native index data (-tfmin, -lfmin, -Opgood, etc.). Check the compile-time parameters using dcdb-info.perl or the info.perl URL to see the compile-time thresholds used to compile a DiaCollo index, and see DiaColloDB(3pm) for details on what the various compile-time properties mean.
- Matching co-occurrences were omitted from the requested DDC profile due to server-side pruning. Try re-submitting your request using the #FMIN 1 query operator.
- Error: You cannot submit queries from an offline data set
- This error indicates that you attempted to submit a new profile request to a static GUI snapshot, which is not supported. Submit new requests to a "live" index wrapper instead.
- Error: Variable 'ddc_url_root' not set: KWIC links disabled
- This indicates that the DiaCollo index is not associated with any running DDC/D* REST API. If this error appears for a DDC/D* DiaCollo instance hosted by the BBAW, it probably indicates a corpus configuration bug: please inform the corpus maintainer. If you have indexed your own corpus, you will need to set up a D*-compatible DDC server and set DiaCollo's ddcServer option if you want to support DDC profiles and KWIC links.
- Error: 500 Internal Server Error
- This is just an HTTP status code, and not an error message (and also not very informative). Keep reading for some (hopefully) more useful diagnostics.
- Error: ttk_process(): template error: undef error - [MESSAGE]
- Something went wrong using the DiaColloDB::WWW GUI (still not very informative). The actual diagnostic message begins with [MESSAGE].
- Error: ... called at FILE.pm line XYZ
- This is a stack trace of the actual error. Only the first few lines are likely to be informative.
- Error: parseQuery(): ... could not parse query: syntax error: ...
- Indicates that the query parameter could not be parsed. Ensure that your query is syntactically valid; see "Query Syntax" for details.
- Error: align(): cannot align non-trivial multi-profiles of unequal size
- You tried to compare two profiles with incompatible epoch partitions. Diff comparisons (diff, diff-tdf, etc.) require both operand profiles to contain the same number of epochs or only a single epoch (e.g. slice=0).
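The alignment condition can be expressed as a one-line check. The Python sketch below is an interpretation of the error message and the explanation above (profiles are alignable if they have the same number of epochs, or if at least one of them consists of a single epoch); the function name is hypothetical and the actual implementation may apply further conditions.

```python
def can_align(n_epochs_a, n_epochs_b):
    """True if two multi-profiles with the given epoch counts can be
    compared: equal sizes, or at least one trivial (single-epoch) profile."""
    return n_epochs_a == n_epochs_b or 1 in (n_epochs_a, n_epochs_b)


print(can_align(1, 1))  # e.g. both operands profiled with slice=0
print(can_align(6, 4))  # e.g. date ranges of different length, same slice
```

Comparing date intervals of different lengths is therefore safest with slice=0, as in the "Mann" 1900-1959 vs. 1960-1999 example.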
- Error: ... abstract method called
- I probably forgot to implement something; please let me know!
- Error: ... timeout elapsed
- This usually means that the independent DDC server took too long to respond for a DDC profile. Your query may be too complex to be handled gracefully: try raising the value of #FMIN, lowering the value of #DMAX, and/or profiling only a corpus sample. Please do not immediately re-submit a query which has caused a timeout in the hope that this error will have been magically resolved, and please do wait at least for the "timeout elapsed" message to appear before submitting a new query to an unresponsive server.
- Error: no 'ddcServer' key defined
- This indicates that you tried to use the DDC profile relation without declaring an associated DDC server to provide frequency data. If this error appears for a DDC/D* DiaCollo instance hosted by the BBAW, it probably indicates a corpus configuration bug: please inform the corpus maintainer. If you have a running DDC server for your own corpus, you can either:
- edit your index header.json file and add a line "ddcServer":"HOST:PORT", or
- use the -OddcServer=HOST:PORT option to dcdb-query.perl when querying your DiaCollo index.
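If you prefer to script the header.json edit rather than change the file by hand, a minimal Python sketch might look like the following. It assumes that header.json is a plain JSON object; the function name, the file path, and the HOST:PORT value in the usage line are hypothetical placeholders. Back up the file before modifying it.

```python
import json


def set_ddc_server(header_path, host_port):
    """Add or overwrite the "ddcServer" key in an index header.json,
    leaving all other keys untouched."""
    with open(header_path, encoding="utf-8") as fh:
        header = json.load(fh)
    header["ddcServer"] = host_port
    with open(header_path, "w", encoding="utf-8") as fh:
        json.dump(header, fh, indent=2, ensure_ascii=False)
        fh.write("\n")
    return header


# Hypothetical usage; replace the path and HOST:PORT with your own values:
# set_ddc_server("/path/to/index/header.json", "localhost:50000")
```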
References
DiaCollo
- Bryan Jurish and Maret Nieländer (2019). "Using DiaCollo for historical research." In: K. Simov and M. Eskevich (eds.), Selected Papers from the CLARIN Annual Conference 2019, Linköping Electronic Conference Proceedings 172 (2020), pages 33-40. (DOI [proceedings], DOI [article], PDF [article])
- Bryan Jurish (2018). "Diachronic Collocations, Genre, and DiaCollo." In R. J. Whitt (editor), Diachronic Corpora, Genre, and Language Change. Amsterdam, John Benjamins, pages 42–64. (Online, Draft)
- Bryan Jurish, Alexander Geyken, and Thomas Werneke (2016). "DiaCollo: diachronen Kollokationen auf der Spur." In DHd 2016: Modellierung - Vernetzung - Visualisierung, Leipzig, 7th-12th March, pages 172-175. (PDF, official [contains typesetting errors])
- Bryan Jurish (2015). "DiaCollo: On the trail of diachronic collocations." In K. De Smedt (editor), Proceedings of the CLARIN Annual Conference 2015, Wrocław, Poland, 15th-17th October, pages 28-31. (PDF, poster)
General
- Jörg Didakowski and Alexander Geyken (2013). "From DWDS corpora to a German Word Profile – methodological problems and solutions." In: Network Strategies, Access Structures and Automatic Extraction of Lexicographical Information. 2nd Work Report of the Academic Network "Internet Lexicography". Mannheim, Institut für Deutsche Sprache. (OPAL - Online publizierte Arbeiten zur Linguistik X/2012), pages 43-52. (PDF)
- Béatrice Daille (1994). Approche mixte pour l'extraction automatique de terminologie: statistiques lexicales et filtres linguistiques. Ph.D. thesis, Université Paris 7.
- Ted Dunning (1993). "Accurate methods for the statistics of surprise and coincidence." Computational Linguistics 19(1), 61-74. (PDF)
- Stefan Evert (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. Ph.D. thesis, Universität Stuttgart. (PDF)
- Stefan Evert (2008). "Corpora and collocations." In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, article 58, Berlin, Mouton de Gruyter, pages 1212-1248. (PDF [extended manuscript])
- C. Gabrielatos, T. McEnery, P. J. Diggle, and P. Baker (2012). "The peaks and troughs of corpus-based contextual analysis." International Journal of Corpus Linguistics 17(2), pages 151–175. (DOI, PDF [post-print])
- Pavel Rychlý (2008). "A lexicographer-friendly association score". In P. Sojka and A. Horák (eds.) Proceedings of Recent Advances in Slavonic Natural Language Processing. RASLAN 2008, pages 6-9. (PDF, PDF(2))
- Adam Kilgarriff and David Tugwell (2002). "Sketching words". In M.-H. Corréard (ed.) Lexicography and Natural Language Processing: A Festschrift in Honour of B. T. S. Atkins. EURALEX, pages 125-137. (PDF)