Distant Co-occurrence Patterns of Connectives: a Corpus Study of Formulaicity in Japanese

Using corpus research methods, this study aims to establish whether there are two-item and, more generally, multi-item distant co-occurrence patterns of connectives in written Japanese, and further, to clarify the role these patterns play in discourse. The study is based on a hybrid corpus of written Japanese including Humanities and social science papers, Science and technology papers, and general written language data. The co-occurrence threshold was set at co-occurrence frequency > 10, PMI value > 2, and Dice coefficient > 0.01. The distribution of the observed co-occurring pairs differed according to the genre. Visualization of the connectivity potential of co-occurring pairs as directed graphs showed that these co-occurring pairs constitute longer co-occurrence chains which can be interpreted as ready-made co-occurrence patterns. Two-item and multi-item co-occurrence patterns are considered a type of Bourdieu’s habitus and contribute to both discourse development and discourse prediction.


Background
Ready-made patterns in discourse have been studied for a long time. The most typical of such patterns are syntax and collocations. Regarding the various structures in syntax, de Beaugrande and Dressler (1981) point out that they act as an 'early warning system' for the listener/reader, facilitating processing.
Knowledge of formulaic language is also an important part of speakers' linguistic knowledge, and its study has a long tradition. Within this framework, more recently, Wray (2017) has focused on the systematic co-occurrence of various elements in linguistic data and discussed the important role this phenomenon has in the load reduction of language processing. There are also studies focusing on the Japanese language. One of these, Kaneyasu (2012), deals with systematically occurring morpheme sequences in Japanese conversation. Cognition-related findings from the study of formulaic expressions are important, but they are mostly limited to the patterns of occurrence of adjacent morphemes and their various functions. On the other hand, Ishiguro (2008, Chap. 10) discusses connectives from the perspective of 'strategic usage' (senryakuteki shiyō). These patterns can be observed at the discourse level and are based on systematically occurring chains of connectives.

Aims of the present study
This study is conceived as an exploratory study, focusing on the aforementioned 'strategic usage' patterns, and aims to investigate their reality and the role they play in discourse. Specifically, the aim is to investigate the systematic distant co-occurrence of connectives occurring at the beginning of sentences in a corpus of general and academic texts written in Japanese. For this purpose, the following research questions will be addressed.
RQ1: Is it possible to identify the most frequent and prominent patterns of distant co-occurrence of connectives in general and academic texts?
RQ2: If such identification is possible, is it then possible for multiple connectives to co-occur systematically?
RQ3: If systematic multiple co-occurrences are possible, what role do such co-occurrence patterns play in the actual discourse?

Previous research
There is a long tradition in Japan of studying the patterning of elements that are quite far apart syntactically. Minami's (1974) study of the hierarchical structure of Japanese clauses is a good example. Various original studies have also been conducted since then. Minami himself further statistically supported his earlier results in Minami (1993). Kudo (2000) corroborated the systematic nature of distant co-occurrence between sentence-initial adverbs and sentence-final modal expressions. This is an interesting result suggesting a kind of agreement phenomenon at the semantic level. This result by Kudo was further supported by Srdanović et al. (2009) in a large corpus of data.
Inspired by Noda (1995) and Kudo (2000), Bekeš (2008, Chap. 5) investigates the role of bracket structures formed by adverbs and co-occurring sentence-final modality expressions or adverbs and some toritate (focusing) particles and their role in discourse.
With the focus of the study shifting to connectives, the scope of analysis moves to the level of discourse. Within the framework of discourse research in Japan, there are many important findings. For example, Sakuma (2012, 2019) attempts to elucidate the role of connectives in the rhetorical structure of texts, based on studies such as Ichikawa (1978) and others, and on the establishment of detailed criteria for identifying discourse units (i.e., written content paragraphs bundan, and spoken content paragraphs wadan).
There are also interesting studies on the systematic distant co-occurrence of connectives themselves. For example, Ishiguro (2008) points out the close relationship between connectives and sentence-final modality expressions (Ishiguro, 2008, Chap. 7). He also points to the possibility of systematic co-occurrence of multiple connectives and the existence of so-called 'strategic usage' in discourse development (Ishiguro, 2008, Chap. 10). Inspired by Ishiguro's work, Wang Jinbo (2015a, 2015b) investigates the systematic co-occurrence of adversative (gyakusetsu) and additive (junsetsu) conjunctions in editorials and other genres, and classifies them according to their semantic properties. In order to elucidate the role they play in discourse, she further examines the correlation of such co-occurring pairs of connective expressions with their position in the discourse.

Data
In this study the following data are used: the 'Science and technology' papers (hereafter shortened to 'ST papers'), the 'Humanities and social science' papers (hereafter shortened to 'HS papers'), and a partially modified BCCWJ*, representing general use [1]. As the connectives listed in various dictionaries and other sources proved to be insufficient, the 523 connectives employed in Abekawa et al. (2020) were used in the analysis. Since their identification in sentences is very difficult in some cases, the analysis was limited to the most typical usage of connectives, i.e., those appearing at the beginning of sentences. The basic data of the relevant corpora used in this study are presented in Table 1. In addition to these corpora, a small corpus consisting of 300 Asahi Shimbun editorials and opinion articles was used for validation.

Method of analysis
The extraction of co-occurrence patterns relied on two measures of association, i.e., PMI [2] (pointwise mutual information) and the Dice coefficient [3], both of which are used to indicate the degree of association of co-occurring items in traditional corpus studies (see Kolesnikova, 2016; Petrovic et al., 2006). The semantic relations underlying distant co-occurrence at the discourse level differ from those found in collocation studies, which focus on local relations within a sentence. Therefore, at this stage of the study, it was empirically decided to use both measures together. Co-occurrence extraction was restricted to pairs of connective expressions that co-occurred within a range of one to four sentences of each other. No attempt was made to extract sequences of more than two connectives (n-grams), as the larger the number, the less reliable the results, due to the increased distance between individual items. Instead, typical co-occurrence pairs of the extracted connectives were considered as arcs in a directed graph, concatenated into longer co-occurrence chains, and further validated on the basis of actual data.

[1] [...] Association of Nippon Medical School), 'Kankyō shigen kōgaku kai' (The Resources Processing Society of Japan) and 'Denki gakkai' (The Institute of Electrical Engineers of Japan). In the case of the journal of The Association for Natural Language Processing, data include papers and proceedings of annual conferences, while in the case of other societies' journals, papers alone were collected. The total number of papers and proceedings included in this corpus is 4,865.
Humanities and social sciences papers. From J-STAGE, a general academic e-journal site, we independently collected up to 20 papers per relevant academic society from the academic journals they publish by specifying the search field as Humanities and Social Sciences. The total number of papers collected in this way is 1,508.
BCCWJ*. BCCWJ* is a sub-corpus of the Balanced Corpus of Contemporary Written Japanese (BCCWJ), a corpus built by the National Institute for Japanese Language and Linguistics (NINJAL) in order to provide a comprehensive picture of written modern Japanese language use, and divided into several media types/genres. In order to limit the data to written prose texts that are considered to have been thoroughly proofread, material taken from 'Yahoo! Chiebukuro', 'Yahoo! blog', and Diet Minutes, as well as, for genre reasons, poetry, was excluded. See also Abekawa et al. (2020).
[2] A statistical measure of association that compares the probability of two events occurring together to the probability of them occurring independently. For two outcomes of random variables x and y, PMI is defined as PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ), where P(x, y) is the probability of x and y occurring together and P(x), P(y) are the probabilities of x and y occurring independently. The higher the value, the stronger the degree of association between x and y.
[3] Another statistical measure of association, the Dice coefficient is defined as Dice(x, y) = 2 f(xy) / (f(x) + f(y)), where f(x) and f(y) are the frequencies of words x and y, and f(xy) is the frequency of words x and y occurring together. The higher the value, the stronger the degree of association between x and y.
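The extraction step and the two association measures described above can be illustrated with a minimal Python sketch. It is not the procedure actually used in the study: the input representation (one entry per sentence holding its sentence-initial connective, or None), the function names, and the choice of normalising all probabilities by a single total n are assumptions made here for illustration. The sketch also counts every ordered pair within the window, whether or not another connective intervenes, which the text does not specify.

```python
import math
from collections import Counter


def extract_pairs(initial_connectives, max_gap=4):
    """Count ordered pairs (X, Y) of sentence-initial connectives.

    initial_connectives: list with one entry per sentence, holding the
    sentence-initial connective or None (hypothetical input format).
    Y must start a sentence 1..max_gap sentences after X.
    """
    pairs = Counter()
    n = len(initial_connectives)
    for i, x in enumerate(initial_connectives):
        if x is None:
            continue
        for gap in range(1, max_gap + 1):
            j = i + gap
            if j >= n:
                break
            y = initial_connectives[j]
            if y is not None:
                pairs[(x, y)] += 1
    return pairs


def pmi(f_xy, f_x, f_y, n):
    """PMI(x, y) = log2(P(x, y) / (P(x) P(y))).

    Assumption: all probabilities are relative frequencies over the
    same total n.
    """
    return math.log2((f_xy / n) / ((f_x / n) * (f_y / n)))


def dice(f_xy, f_x, f_y):
    """Dice(x, y) = 2 f(xy) / (f(x) + f(y))."""
    return 2 * f_xy / (f_x + f_y)
```

With a toy sequence such as ["shikashi", None, "sokode"], extract_pairs records one occurrence of the pair (shikashi, sokode), since the two connectives are two sentences apart.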

Co-occurrence and genre of connectives
The first conjunctive expression appearing in co-occurrence is labeled as X and the second as Y. Their co-occurrence frequency is denoted as f(XY). Following Sakuma (2012, 2019), discourse units (content-based paragraphs) are referred to as dan. For easier recognition of co-occurrences, dan is assumed to consist of one or more sentences (S). A content-based paragraph realized by two co-occurring connectives can be denoted as follows.
(1) ［dan0＝S0］-X-［dan1＝S1］-Y-［dan2＝S2］
In other words, two co-occurring connectives represent a relationship between three dan content paragraphs. The relationship may be parallel or hierarchical. Following the custom in corpus analysis, to eliminate rare and thus atypical cases of co-occurrence, only cases with frequency f(XY) > 10 are included in the analysis. On the other hand, in order to widen the range of potential co-occurrence candidates, the threshold value of PMI is set to PMI > 2, which is lower than the threshold value of PMI > 4 customarily used in corpus analysis (cf. Petrovic et al., 2006). Finally, an additional condition, i.e., Dice coefficient > 0.01, is added to the co-occurrence recognition criteria, to compensate for the tendency of PMI to be higher for low-frequency co-occurrences. The co-occurrences meeting these criteria are tabulated in Table 2.
It is clear from the table that the percentage of connectives that co-occur with other connectives within a certain range (four sentences) is low, around 0.5%, regardless of the corpus. This means that in the corpora studied, the proportion of connectives that explicitly indicate the relationship between two dan content paragraphs is low. On the other hand, the proportion of co-occurrences C that exceed the PMI value and Dice coefficient thresholds among all co-occurrences K in their respective corpora is significantly lower in BCCWJ* at 3.21%, compared to 23.57% in 'Humanities and social sciences papers' and 19.57% in 'Science and technology papers'. This means that in both academic corpora, the relationship between the three dan content paragraphs shown in (1) is significantly more likely to be systematically made explicit by connectives than in BCCWJ*.
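The three-part recognition criterion stated above (f(XY) > 10, PMI > 2, Dice coefficient > 0.01) amounts to a simple conjunctive filter. The following Python sketch shows one way to express it; the PairStats record and the function names are hypothetical and introduced here only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PairStats:
    """Statistics for one ordered pair X -> Y (hypothetical record type)."""
    x: str
    y: str
    freq: int      # co-occurrence frequency f(XY)
    pmi: float     # pointwise mutual information
    dice: float    # Dice coefficient


def meets_thresholds(p, min_freq=10, min_pmi=2.0, min_dice=0.01):
    """True only if all three thresholds are strictly exceeded."""
    return p.freq > min_freq and p.pmi > min_pmi and p.dice > min_dice


def select_pairs(pairs):
    """Keep the pairs that satisfy the co-occurrence criteria."""
    return [p for p in pairs if meets_thresholds(p)]
```

Because the comparisons are strict, a pair with exactly ten co-occurrences is excluded however high its PMI value, which reflects the role of the frequency threshold in removing rare cases.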

Top 20 co-occurrence examples in PPM and PMI
To further clarify the systematic co-occurrence of connectives by genre, let us compare the top 20 co-occurrences in PPM (occurrences per million cases) and PMI respectively. It is immediately apparent from Table 3 that a high or low co-occurrence frequency expressed as PPM does not necessarily correlate with a high or low PMI value, with the mata → shikashi combination in BCCWJ* being a striking example of this. In Table 3, PMI values below the threshold value used here, i.e., < 2, observed in BCCWJ*, are highlighted with boldface. In both academic corpora, however, all PMI values are above the threshold, and the correlation between PPM and PMI values appears to be more consistent.
On the other hand, the PMI of the co-occurrence shikashi (however) → sokode (therefore) is consistently high, regardless of genre, with a PMI value of 2.7 for BCCWJ*, 5.2 for Humanities and social sciences papers and 5.9 for Science and technology papers. The PMI values of this co-occurrence are significantly higher in the academic corpora than in BCCWJ*. This can be attributed to the fact that the range of observed combinations of connectives is more limited and more formulaic in the academic corpora than in BCCWJ*. This is also clearly visible in the directed graph visualization discussed in section 4.3. In Table 3, all co-occurrences that combine a PMI value above the threshold with a high PPM, regardless of genre, could intuitively be considered established co-occurrence patterns.
Finally, let us have a look at the use of shikashi (however) → sokode (therefore), one of the cases with the highest frequency expressed as PPM, using an editorial as an example.
(2) With regard to Japan Post, the Democratic Party of Japan (DPJ) government passed a law freezing the sale of shares, while a bill to fundamentally review postal reform was submitted to the ordinary session of the Diet last year. The government will draw up arrangements to lift the ban on asset sales when the bill is passed. However (shikashi), the revised bill faces strong opposition from opposition parties, including the Liberal Democratic Party (LDP), which had decisively privatized the postal service under the Koizumi administration. Therefore (sokode), there are glimpses of a desire to hasten its passage by tying it to the compression of tax hikes for reconstruction.
(Asahi Shimbun, 17 Sep 2011, morning edition, editorial)
The PPM of the co-occurring pair shikashi (however) → sokode (therefore), illustrated in example (2), is as follows in the different corpora: BCCWJ* 171.1; HS 511.6; ST 713.1. PPM is significantly higher in academic data (about four times higher than in BCCWJ* in 'Science and technology papers' and about three times higher in 'Humanities and social sciences papers'). Example (2) shows a development in which, in a given situation, an adversative conjunctive expression such as shikashi (however) introduces an inconvenient situation and sokode (therefore) introduces a response to that situation (here, 'hastening the passage of the bill'). This co-occurrence pattern is discussed in detail in Wang (2015a, b).
Next, let us look at the top 20 co-occurrences with the highest PMI values. At high PMI values, once the first connective X is selected, the subsequent connective Y is predictable to a significant degree. In Table 4, the co-occurrence frequencies (number of co-occurrences) in the academic paper data are all greater than 10. However, in BCCWJ*, which has the largest amount of data, 12 out of 20 co-occurrences have PPM values below 10, while their PMI values are high, some even very high (> 10). The co-occurrence with the lowest PPM value but still considerably high PMI (> 9) is hitotsuwa (one) → zenshawa (the previous one) and the second lowest is aruiwa (or) → izureniseyo (in any case).
In all three corpora, some co-occurrences with relatively low frequencies but PPM values of 10 or more are also considered highly predictable because of their rather high PMI values: in BCCWJ* there are 8 such co-occurrences among the top 20, in Humanities and social sciences papers 15 among the top 20, and in Science and technology papers 9 among the top 20. Most of these high PMI co-occurrences seem to correspond to examples classified as 'enumerations (seiri-rekkyo)' in Ishiguro (2008), such as hitotsuwa (one) → mōhitotsuwa (the other) above. Although only used in a limited number of contexts, these are examples with a very high degree of formulaicity.
Let us now have a look at an example of hitotsuwa (one) → mōhitotsuwa (the other) from an opinion article.
(3) The Ministry of the Environment becoming the prime regulator of nuclear power has two implications. One (hitotsu) is the end of the promotion of nuclear power as a national policy. <A detailed description spanning over six sentences follows.> The other (mōhitotsu) is the end of the safety myth that has underpinned nuclear power.
In (3), the co-occurring pair hitotsuwa (one) → mōhitotsuwa (the other) has the following PMI values: BCCWJ* 10.4; HS 10.4; ST 11.1. This is a clear and relatively common example of 'enumeration'. The pair of connectives is introduced by the cataphoric reference of 'two implications', and the relationship between the first relatively long dan content paragraph and the subsequent dan content paragraph is also clearly indicated by both connectives. In this case, the high PMI value also implies a developed formulaicity of the co-occurring pair.
Three other interesting co-occurrence patterns are found in the Humanities and social sciences data. These are tashikani (certainly) → shikashi (however), mochiron (of course) → daga (but), and mochiron (of course) → shikashinagara (however). All three have fairly similar meanings, and the sequence signals a rhetorical pattern, i.e., 'acceptance of (the interlocutor's) proposition, followed by an alternative proposal'. This pattern appears to be used as a strategy for expressing cautious disagreement or for introducing additional alternatives. Such co-occurrence examples do not seem to be used very often in the Science and technology papers data, but in the Humanities and social sciences data, the PMI values and PPM of these cases are twice as high as in BCCWJ*. For example, for the co-occurrence pair mochiron (of course) → daga (but) the PMI values are as follows: BCCWJ* 2.64; HS 6.6; ST N/A. Let us have a look at an example, again from an editorial.
(4) There are persistent voices asking whether lay prosecution jurors and lay jurors are capable of making the right decision.
Of course (mochiron), this is not to say that they will never make a mistake. But (daga) before we can say anything about the competence of the public, the fact that the right evidence is hidden or improperly guided by experts leads to erroneous conclusions. ...
(Asahi Shimbun, 18 Dec 2011, morning edition, editorial)
In (4), first the doubt about the ability of lay jurors is presented. Of course (mochiron) then introduces the acceptance of the possibility that they could make mistakes. The argument is then countered in the next sentence, introduced by but (daga), which emphasizes the fact that it is actually the specialists who in many cases mishandle the evidence, leading to erroneous conclusions.
On the other hand, there are co-occurrence pairs in the top 20 PMI values that are only found in the Science and technology papers. These are the most frequently co-occurring pairs among the top 20 PMI co-occurrences, i.e., shikashi (however) → sokode (therefore), and its also frequent alternative shikashinagara (however) → sokode (therefore). As we have already seen, the first of the pairs, shikashi (however) → sokode (therefore), is relatively frequent and found in all corpora in the top 20 PPM co-occurrences.
In addition, there are two other frequent co-occurrence pairs including shikashi (however) and shikashinagara (however) as the first member, i.e., shikashi (however) → sono tame (therefore) and shikashinagara (however) → sono tame (therefore). Both pairs introduce 'another aspect of a given situation followed by its consequences'.
It is not only between the BCCWJ* data and the academic corpora that significant differences in the distribution of the high-frequency co-occurrence pairs can be observed. Interestingly, such differences are also found between the Humanities and social sciences data and Science and technology data. This suggests that differences arise not only between the genres such as general and academic use but also between different academic disciplines.
On the basis of Tables 3 and 4, we can conclude that when the co-occurring pair of connectives meets the threshold values for frequency, PMI, and additionally, the Dice coefficient, those pairs displaying either high frequency and medium PMI values, or relatively low frequency but high PMI values, can be considered intuitively as having developed into formulaic pairs. This finding allows RQ1 to be answered in the affirmative.

Identifying longer co-occurrence chains by directed graphs
As indicated in (1), the role of connectives is to explicitly indicate semantic relations between dan content paragraphs in a sequence of dan content paragraphs. Thus, behind the sequence of connectives, there is in fact a sequence of dan content paragraphs in the larger discourse unit that contains them. As the discourse unfolds in time (or space in the case of a written text), the pairs of connectives can be thought of as directed graphs. The connectives X and Y are nodes in the graph and 'X→Y' is the direction from X to Y. As one connective may co-occur with a number of other connectives in a context, co-occurring pairs can be linked into even larger chains. By identifying those chains of connectives that occur systematically in the context, i.e., Ishiguro's 'strategic usage' patterns, conjunctive relations behind them can be explored.
In the rest of this section, we discuss the possibility of identifying potential chains of connectives by representing the identified co-occurrence pairs as directed graphs. For this purpose, we use 'Pajek', a graph exploration software (see de Nooy et al., 2005).
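The representation of co-occurring pairs as a directed graph can be sketched in a few lines of Python. This is an illustration only, not the workflow followed in the study: the function names are hypothetical, the arc weight stands for whichever value (PPM or PMI) is being visualized, and the export function assumes the plain-text Pajek network layout with a '*Vertices' section followed by an '*Arcs' section.

```python
def build_graph(weighted_pairs):
    """Turn {(X, Y): weight} into an adjacency mapping {X: {Y: weight}}.

    Every connective becomes a node, including those that only ever
    appear as the second member of a pair.
    """
    graph = {}
    for (x, y), weight in weighted_pairs.items():
        graph.setdefault(x, {})[y] = weight
        graph.setdefault(y, {})
    return graph


def to_pajek_net(graph):
    """Serialize the graph as text in the assumed Pajek .net layout."""
    nodes = sorted(graph)
    index = {name: i for i, name in enumerate(nodes, start=1)}
    lines = [f"*Vertices {len(nodes)}"]
    lines += [f'{index[name]} "{name}"' for name in nodes]
    lines.append("*Arcs")
    for x in nodes:
        for y, weight in sorted(graph[x].items()):
            lines.append(f"{index[x]} {index[y]} {weight}")
    return "\n".join(lines)
```

Keeping the graph as a plain adjacency mapping makes it straightforward to look up which connectives can follow a given connective, which is the operation needed when pairs are linked into longer chains.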
Based on the co-occurrence data in Tables 3 and 4, visualization of potential, longer co-occurrence chains containing multiple co-occurring pairs is shown in Figure 1a and Figure 1b. Depending on one's point of view, this visualization can be interpreted as the potential knowledge of the use of connectives in a given community of language users (including a community of experts), i.e., it represents an aspect of what de Saussure (1916/1966) calls langue, or what Bourdieu (1991, 1994) calls habitus. Figure 1a and Figure 1b visualize all the top 20 PPM and PMI co-occurrence pairs by integrating them into directed graphs. As the graphs in Figure 1a and Figure 1b reflect co-occurrence frequencies (PPM) or PMI values based on corpus data, they do not show actual linkage relations in specific contexts, but only potential co-occurrence patterns of language use within the general user community and within specific scientific communities.
In the graphs, the thickness of the arcs connecting the nodes is proportional to the frequency (PPM) of co-occurrence or to the PMI value. In Figure 1a, based on the high frequency (PPM) co-occurrences shown in Table 3, in all three genres, the bidirectionally connected shikashi (however) and mata (also) (in Figure 1a marked by circles) form a central pair of nodes with which various relations of connectives can be formed. Not only in terms of their position but also in terms of the frequency of co-occurrence, the connectives can be linked into longer chains around shikashi (however) and mata (also). However, the details vary from genre to genre.
On the other hand, for co-occurrences with high PMI values (Figure 1b), two types of co-occurrence sequences emerge. One type is potential sequences (marked with arrows) that clearly represent Ishiguro's multiple 'enumerating' strategy patterns mentioned above. The thickness of the arcs reflects the predictability of co-occurrence, thicker arcs representing higher PMI values and thus higher predictability of co-occurrence. The other type of patterns that emerge clearly in both academic corpora are the patterns (marked by circles) that appear to be used to develop the argumentation, although the PMI values involved in them are somewhat lower. In Humanities and social sciences data, this is the pattern mochiron (of course) → daga (but), discussed in section 4.2, and in Science and technology data, the pattern of shikashi (however) → sokode (therefore), also mentioned in 4.2.
In several places in Figure 1a and Figure 1b, there is a node called 'LOOP'. This node represents cases where, according to the specification of the 'Pajek' software used, one conjunctive expression co-occurs repeatedly with itself, such as de (and) → de (and), etc. LOOP is not combined with other connectives with a very high PPM (frequency) or high PMI value (predictability), except for mata (also).
In this section, we have looked at the co-occurrence of the most frequently occurring connectives and potential argumentation patterns based on them. On the basis of the above findings, the answer to RQ2 can be regarded as affirmative.
These potential patterns, which, with the help of visualization, suggest strategies for the development of longer discourse segments, can be said to be the realization of the linking potential possessed by the two-item co-occurrence examples presented in Tables 3 and 4. Let us now look at the potential chains containing two or more connectives that we obtained from Figure 1a and Figure 1b, i.e., chains based on high co-occurrence frequency (PPM) and high PMI values. In order to clarify the relationship between the potential chains including two or more connectives and the dan content paragraphs, having the chain schema in (1) as a starting point, a more general form of the chain of connectives is shown in (5).
In general, dan content paragraphs are not necessarily sentences but can consist of a group of sentences. The semantic relationship between dan_i and dan_(i+1) generally need not be explicitly indicated with the connective X_i. The unexpressed X_i is denoted as Ø for convenience, as a 'fill-in'. In fact, examples of conjunctive relations between sentences that are not explicitly signaled (i.e., are realized as 'Ø') account for about two-thirds or more of all co-occurrences identified in all three corpora examined in this study.
As it is very difficult to deal with relations between dan content paragraphs that are not explicitly indicated using corpus linguistics methods, the analysis in this study is limited to chains of fully expressed connectives at this stage of the research. Among the co-occurrence patterns found in Tables 3 and 4, shikashi (however) → sokode (therefore), or more generally, 'adversative connective → additive connective', has already been examined in detail in Wang (2015a, b), as mentioned above.
From the visualization of potential co-occurrence chains in Figure 1a and Figure 1b, we can further extract longer potential chain patterns that are involved in discourse development.
In the present study, the extraction was restricted to the top 20 PPM and top 20 PMI-valued co-occurring connectives. If all co-occurring examples that meet the co-occurrence condition threshold were included, the number of potential linkage patterns would increase further, but this analysis is left as a future task.
Shikashi (however) and mata (also), which occupy a central position in the three corpora in terms of their potential for co-occurrence with other connectives, are also central to several longer co-occurrence chain patterns.
Table 5. Prominent and possibly formulaic potential connective patterns found in Figure 1a and Figure 1b
If a connective in a chain links two overlapping co-occurring pairs, such as shikashi (however) in the chain [mochiron (indeed) → {shikashi (however)} → tsumari (namely)], then it is reasonable to assume that a chain consisting of three connectives, linked by the centrally occurring connective, can occur in actual discourse. Table 5 shows the specific possibilities of such potential linkages. The respective ranges of two or more overlapping co-occurrence pairs are indicated by underlining and boldface. The '→' indicates the direction (order) of linkage of the conjunctive expression.
Among the top 20 PMI values, there are highly formulaic 'enumerating' patterns, such as daiichiwa (firstly) → dainiwa (secondly) → daisanwa (thirdly), which are self-explanatory enough and have therefore been omitted from Table 5.
Among the various potential concatenation patterns in Table 5, there are a number of concatenation patterns of co-occurrence pairs, such as the overlapping co-occurrence pairs seen in the top 20 examples of PPM, which are formed around mochiron (of course) and tashikani (indeed). These patterns are involved in the development of argumentation and are seen in BCCWJ* and Humanities and social sciences data. On the other hand, among the top 20 PMI values, the two concatenation patterns of co-occurrence pairs formed around shikashi/shikashinagara (however) → sokode (therefore) are only found in the Science and technology data.
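The linking of two overlapping co-occurrence pairs through a shared central connective can be sketched as follows. This is a minimal Python illustration under stated assumptions rather than the procedure used to compile Table 5: the function name is hypothetical, the pairs in the usage note are chosen for illustration, and chains that would repeat a connective are excluded by design choice here.

```python
def three_item_chains(pairs):
    """Link X -> Y and Y -> Z into potential chains X -> Y -> Z.

    pairs: iterable of ordered (first, second) connective pairs.
    Chains in which a connective would occur twice are skipped.
    """
    successors = {}
    for x, y in pairs:
        successors.setdefault(x, set()).add(y)

    chains = []
    for x, seconds in successors.items():
        for y in seconds:
            for z in successors.get(y, ()):
                if len({x, y, z}) == 3:
                    chains.append((x, y, z))
    return sorted(chains)
```

Given the pairs (mochiron, shikashi) and (shikashi, tsumari), the function returns the single potential chain (mochiron, shikashi, tsumari). Such chains remain potential patterns only and still need to be checked against actual discourse.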
In the following Section 5, the potential possibilities of concatenation patterns of co-occurrence pairs presented in this section will be examined in actual discourse, namely in newspaper editorial articles.

Verification of co-occurrence chains in actual discourse
This section verifies the potential sequential patterns listed in Table 5 using concrete examples of their use in discourse. For this purpose, various academic monographs and a small corpus of 300 Asahi Shimbun editorials and opinion articles were used. First, we checked the extent to which the 76 co-occurrence pairs of connectives mentioned above overlapped with the chains of connectives found in the examples of actual discourse. Of the 76 co-occurrence pairs, 18 were either used as single co-occurrence pairs or appeared as part of a chain of multiple connectives. For example, in addition to the examples of the single co-occurrence pairs shikashi (however) → sokode (therefore), hitotsuwa (one) → mōhitotsuwa (the other) and mochiron (of course) → daga (but), seen in (2)-(4), there are also long chains of co-occurrence pairs in editorials, such as nanishiro (anyhow) → tada (just) → shikashi (however) → sorewa (that is) → mazu (first) → soshite (then) and shikashi (however) → mazu (first) → tsugini (next) → sonouede (moreover) → tatoeba (for example) → sonotame (therefore). Some of the 18 co-occurrence pairs mentioned above, such as tatoeba (for example) → sonotame (therefore), are included in these chains and in other examples of long chains. As has been said before, these 18 co-occurrence pairs fulfill the co-occurrence threshold conditions and seem to be used as 'ready-made parts' in the development of discourse.
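The overlap check described above can be illustrated by breaking an observed chain into its adjacent pairs and looking each pair up in the inventory of pairs that meet the threshold conditions. The following Python sketch uses hypothetical function names, and the inventory in the usage note contains only the single pair named in the text.

```python
def adjacent_pairs(chain):
    """Break a chain of connectives into its ordered adjacent pairs."""
    return list(zip(chain, chain[1:]))


def attested_pairs(chain, inventory):
    """Return the adjacent pairs of the chain found in the inventory.

    inventory: set of (first, second) pairs meeting the thresholds.
    """
    return [pair for pair in adjacent_pairs(chain) if pair in inventory]
```

For the editorial chain shikashi → mazu → tsugini → sonouede → tatoeba → sonotame and an inventory containing tatoeba → sonotame, attested_pairs returns that one pair, marking it as the ready-made part of the chain.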
Some examples of chains containing multiple connectives are given below. The overlapping status of the ready-made co-occurrence pairs in the examples is highlighted by underlining and boldface. The examples extracted from the actual discourse data are shown in the order of the increasing complexity of the chains in which they appear. The first example is a chain formed by just one co-occurrence pair. In example (6), PMI values for the co-occurrence pair mazu (first) → soshite (then) are BCCWJ* n/a; HS 4.6; ST 5.8. (6) is structured as follows. The two reasons supporting the assertion in the first paragraph, i.e., '...civil war will flare up', are introduced by mazu (first), which, according to Ishiguro (2008) belongs to the 'organizing-enumerating' type of connectives, and soshite (then), which belongs to the 'organizing-coordinating' category. This is a co-occurrence pattern often found in academic papers, but similar patterns such as mazu (first) → tsugini (next) are also found in the BCCWJ* top 50 PMI examples.

The argument in
There are also examples of ready-made co-occurrence pairs embedded in longer chains, for example shikashi (however) → sokode (therefore) → sonouede (moreover) in the following chain.
(7) Suppose that on a crowdsourcing platform, <...>, it is easier for the ordering party to find a suitable order taker. However (shikashi), even if the appropriate ordering party is selected, <...> it is still necessary to look at the specifics of the interaction. This chapter therefore (sokode) examines the actual interactions between ordering parties and order takers in the crowdsourcing ordering process, and clarifies the issues that arise in the communication between the two parties, referring to the opinions of the ordering party who actually placed the order. Moreover (sonouede), it attempts to present points for improvement in the exchange documents that can be presented by the ordering party. (Ishiguro ed. 2020, Chap. 14)
The long chain in example (7) is first broken down into pairs of co-occurring connectives to check the PMI values.
Pair 1 shikashi (however) → sokode (therefore): PMI value: BCCWJ* 2.7; HS 5.2; ST 5.9.
Pair 2 sokode (therefore) → sonouede (moreover): PMI value: no examples meeting threshold conditions.
Example (7) is a good example of a combination of patterned and non-patterned parts in a chain of connectives. The chain opens with the ready-made, high-frequency co-occurrence pair shikashi (however) → sokode (therefore). Shikashi (however) introduces the need for verification of the specific interaction between the order taker and the ordering party, and information on how to do this in concrete terms is presented by sokode (therefore). Finally, a new, more specific response, added to the first response, is introduced by sonouede (moreover): 'to present improvements in the exchange documents that can be presented by the order taker and the ordering party'.
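The decomposition applied to example (7) can be illustrated with a minimal sketch. It splits a chain into adjacent ordered pairs and looks each pair up in a table of PMI values. The table holds only the values reported above for example (7); the function names are hypothetical, and checking PMI alone is a simplification, since the full threshold in this study also involves co-occurrence frequency and the Dice coefficient.

```python
# PMI values reported in the text for example (7). A pair absent from the
# table is treated as not meeting the threshold in any corpus.
PMI_BY_PAIR = {
    ("shikashi", "sokode"): {"BCCWJ*": 2.7, "HS": 5.2, "ST": 5.9},
}

# Only the PMI part of the threshold is checked here (a simplification).
PMI_THRESHOLD = 2.0


def adjacent_pairs(chain):
    """Split a chain of connectives into its consecutive ordered pairs."""
    return list(zip(chain, chain[1:]))


def classify_chain(chain, pmi_by_pair=PMI_BY_PAIR, threshold=PMI_THRESHOLD):
    """Label each adjacent pair as 'ready-made' or 'ad hoc'.

    A pair counts as 'ready-made' if its PMI exceeds the threshold in at
    least one corpus; the corpora where it does are returned as well.
    """
    labels = []
    for pair in adjacent_pairs(chain):
        values = pmi_by_pair.get(pair, {})
        genres = sorted(g for g, v in values.items() if v > threshold)
        labels.append((pair, "ready-made" if genres else "ad hoc", genres))
    return labels


chain_7 = ["shikashi", "sokode", "sonouede"]
for pair, label, genres in classify_chain(chain_7):
    print(" → ".join(pair), label, genres)
```

Under these assumptions, the first pair is labeled 'ready-made' in all three corpora and the second pair 'ad hoc', which mirrors the patterned and non-patterned parts described for example (7).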
(8) Bunseki no kekka, hatchūsha kara mita juchūsha to no yaritori bunsho no mondaiten ga akiraka ni natta. Ika, 3-ten jun ni agete iku. The analysis reveals problems regarding the exchange of documents with the order taker, as seen from the point of view of the ordering party. Hereinafter (ika) three points are listed in order. 4.1 Gaps in the perception of the work environment First (mazu), in the correspondence between the ordering party and the order taker, there was a gap in the perception of the work environment. As a concrete example, the ... instruction document by A is shown in Example 1, and the subsequent gap in ... recognition is shown in Example 2. Moreover (nao), hereafter, the ordering party's wording is shown as ... In addition (mata), the underlined parts in the ordering party's instruction document have been filled in by the author. (Ishiguro ed. 2020, Chap. 14)
First, the long chain of co-occurring connective pairs in example (8), ika (hereinafter) → mazu (first) → nao (moreover) → mata (again/in addition), is broken down to check the PMI values.
(8) is an example of an 'enumerate' chain consisting of two rather weak ready-made co-occurrence pairs. Following ika (hereinafter), the enumeration is also announced more specifically by the expression '3-ten jun ni agete iku (three points listed in order)', whose explicit cataphoric reference helps to bridge the section boundary. The pair ika (hereinafter) → mazu (first) has a moderately high PMI value in the academic data but does not meet the threshold condition in the BCCWJ*. This means that being 'ready-made' is also related to genre. On the other hand, the last pair, nao (moreover) → mata (again/in addition), meets the threshold condition in all three genres. The two ready-made pairs are connected by mazu (first) → nao (moreover), a pair that does not satisfy the threshold condition for co-occurrence in any of the genres examined. This is not surprising, since the functions of mazu (first), i.e., organizing-enumerating, and nao (moreover), i.e., understanding-supplementing, are in conflict.
The next example (9) also contains a long chain of connectives.
(9) How much is known about this scientifically? This is to be clarified first and then radiation countermeasures are proposed to restore a safe environment for people to live in. First (mazu), the effect of the lowest level of radiation that is scientifically proven and internationally accepted is that exposure to 100 millisieverts increases the risk of dying from cancer by 0.5%. At lower levels, the effects on health are not scientifically proven. However (shikashi), the International Commission on Radiological Protection (ICRP) assumes from the standpoint of protecting health that the risk of cancer increases in proportion to the dose and calls for protective measures to be taken and doses to be reduced. Japan follows this approach. Again (mata), it is internationally recognized that the risk is greater if the same amount of radiation is received over a short period of time, and that the effects of external and internal exposure are the same if the same amount is received. <...> The decontamination process should be prioritized, with goals set up and carried out step by step. Then (soshite), he stated that considering that many people are concerned about the health of children, priority should be given to the decontamination of environments where children are present. (Asahi Shimbun, 17 Dec 2011, morning edition, editorial)
The long chain of connectives in (9), mazu (first) → shikashi (however) → mata (again) → soshite (then), is introduced by a cataphoric reference in the immediately preceding paragraph, i.e., 'hōshasen taisaku o teian shite iru (radiation countermeasures are proposed)'.
Considered from the point of view of text organization, this chain actually consists of only three directly interacting connectives. The pair mazu (first) → shikashi (however) functions only within the local dan (content paragraph), and its PMI value does not satisfy the threshold condition. It is therefore an ad hoc co-occurrence of connectives.
The actual chain is thus mazu (first) → mata (again) → soshite (then). This chain has first to be broken down into co-occurring pairs so that the PMI values can be checked.
Pair 1 mazu (first) → mata (again): PMI values: BCCWJ* 1.2; HS 3.8; ST 3.7. (In BCCWJ* the pair does not meet the threshold condition and is therefore not considered a co-occurring pair in this genre.)
Pair 2 mazu (first) → mata (again): PMI value: HS 4.2. The pair does not satisfy the threshold condition in BCCWJ* and ST.
Pair 3 mata (again) → soshite (then): PMI value: HS 3.6. The pair does not satisfy the threshold condition in BCCWJ* and ST.
In example (9), in order to specify the 'proposed radiation measures' by the Government, the text is basically organized as a 'triple jump' of segments, introduced by mazu (first), mata (again), and finally soshite (then), all connectives being of the 'organize-enumerate' type.
In the dan content paragraph introduced by mazu (first), a discussion leading to a standard limit on radiation doses is presented. In contrast to the 'national standards', the 'international standards' are introduced locally, within the same dan content paragraph, by shikashi (however). Shikashi (however) thus does not form a content paragraph-based co-occurring pair with mazu (first) and is not a part of the rest of the chain. Instead, it is mata (again), of the 'organize-coordinate' type, that can be regarded as co-occurring with mazu (first). Mata (again) introduces the dan content paragraph about a link between the radiation dose and the time of exposure, and further also a specific measure based on that link.
Finally, in contrast to the dan content paragraph introduced by mata (again), soshite (then) of the 'organize-coordinate' type introduces the last dan content paragraph which deals with the decontamination of the environment in which the children are located.
Based on the relatively high PMI values, here the chain mazu (first) → mata (again) → soshite (then) is formed by the overlapping mazu (first) → mata (again) and mata (again) → soshite (then), both of which are 'ready-made' co-occurrence pairs. So, this chain can be regarded as being formed directly by 'ready-made' co-occurrence pairs.
In the two-item co-occurrence pairs in examples (2), (3), (4) and (6), seen in the previous section and above, the PMI values are at least 3.5, almost twice the co-occurrence threshold of 2. This means that these co-occurrence pairs are often used as ready-made elements. They may be used not only as single pairs but also as a part of longer chains. Examples (7) and (8) are cases where the chain contains one or two ready-made pairs. Example (9), on the other hand, is an example where the ready-made pairs mazu (first) → mata (again) and mata (again) → soshite (then) overlap at mata (again). Here too, ready-made co-occurrence pairs are used to develop the discourse. Based on the PMI values involved in all these patterns, we can consider that the resulting overlapping pattern with three connectives, i.e., mazu (first) → mata (again) → soshite (then), and other co-occurrence chains formed with high PMI values are also 'ready-made' co-occurrence patterns.
Needless to say, from the speaker/writer's point of view 'ready-made' patterns are useful for discourse development because they reduce the discourse planning load. Because of their formulaicity, they also contribute to the predictability of discourse development from the listener's/reader's point of view. In other words, they reduce the cognitive load in both production and processing, thus contributing to the fluency of linguistic exchange.
The above observations thus point to the existence of 'ready-made' co-occurrence chains that are longer than 'ready-made' co-occurrence pairs, and they also clarify the overall role such chains play in discourse. Based on this, the answers to RQ2 and RQ3 can also be considered affirmative.

Discussion and conclusions
The distant co-occurrence of connectives has received increasing attention over the last ten or fifteen years. The present study is an exploratory study, aimed at determining the presence or absence of two-item distant co-occurrence patterns (RQ1), as well as the presence or absence of multiple-item distant co-occurrence patterns (RQ2), and the role of these patterns in discourse, especially in relation to the cognitive load needed to process the incoming discourse (RQ3). In order to identify potentially formulaic co-occurrences, we used general written material (the BCCWJ* corpus) and academic paper material (the Humanities and Social Sciences papers corpus and the Science and Technology papers corpus). The conditions for distant co-occurrence were somewhat more relaxed than in traditional collocation studies, with a co-occurrence frequency > 10, PMI value > 2, and Dice coefficient > 0.01. As for the co-occurrence cases meeting these conditions, 87 were found in BCCWJ*, 181 in Science and Technology paper data, and 202 in Humanities and Social Sciences papers data.
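As a minimal sketch, the three threshold conditions can be written as a single check. The PMI and Dice formulas below are the standard definitions. The logarithm base, the counting unit behind `total`, and all names and counts are assumptions made for illustration and are not taken from the study.

```python
import math


def pmi(pair_count, first_count, second_count, total):
    """Pointwise mutual information (base 2, an assumption) from raw counts.

    `total` is the number of counting units (e.g. text windows); how these
    units are defined for distant co-occurrence is an assumption here.
    """
    p_pair = pair_count / total
    p_first = first_count / total
    p_second = second_count / total
    return math.log2(p_pair / (p_first * p_second))


def dice(pair_count, first_count, second_count):
    """Dice coefficient: 2 * f(x, y) / (f(x) + f(y))."""
    return 2 * pair_count / (first_count + second_count)


def meets_threshold(pair_count, first_count, second_count, total,
                    min_freq=10, min_pmi=2.0, min_dice=0.01):
    """Apply the study's thresholds: frequency > 10, PMI > 2, Dice > 0.01."""
    return (
        pair_count > min_freq
        and pmi(pair_count, first_count, second_count, total) > min_pmi
        and dice(pair_count, first_count, second_count) > min_dice
    )


# Hypothetical counts, chosen only to illustrate the calculation.
print(pmi(40, 400, 500, 100_000))              # log2(20), about 4.32
print(dice(40, 400, 500))                      # 80 / 900, about 0.089
print(meets_threshold(40, 400, 500, 100_000))  # True
```

With these hypothetical counts, a pair observed exactly 10 times would be rejected by the frequency condition regardless of its PMI and Dice values, since all three conditions are strict inequalities.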
In order to identify reliable co-occurrences, in the present study, only the top 20 examples from each corpus in terms of PPM and PMI values were included in the analysis.
In terms of co-occurrences with high PMI values, the BCCWJ* data showed a low correlation with the PPM value, while the same correlation was relatively high in both academic corpora. This suggests that the range of combinations of co-occurring items in each of the academic corpora is narrower than in the general data represented in BCCWJ* and that the degree of formulaicity is consequently higher.
The vast majority of co-occurrence pairs that meet the aforementioned co-occurrence threshold conditions in the three corpora can intuitively be regarded as 'ready-made' co-occurrence pairs. Among the top 20 PMI values, typical co-occurrence patterns of connectives such as hitotsuwa (one) → mōhitotsuwa (the other), both belonging to the 'organize-enumerating' type, are especially prominent.
In addition to the top 20 PMI cases examined here, the majority of other co-occurrence cases that meet the co-occurrence threshold conditions also appear to be valid as two-item 'ready-made' co-occurrence pairs. The examination of the top 20 PPM cases revealed a more diverse pattern: in addition to many similarities with the top 20 PMI values, genre-specific differences were also noticeable. The answer to RQ1 is therefore in the affirmative. There is a need to investigate these differences in more detail in the future, taking into account, for example, teaching Japanese as a second language.
The multiple connective co-occurrences, i.e., chains of co-occurrences, were then identified based on the visualization of co-occurrence pairs by means of directed graphs. The visualization revealed similarities and differences between the top 20 PPM cases and the top 20 PMI cases. The similarities between genres, i.e., general vs. academic, are more pronounced in the top 20 PPM co-occurrences. In particular, shikashi (however) and mata (again) were found to be two centers around which a large number of co-occurrences of connectives are formed. At the same time, many other combinations of connectives were also present.
On the other hand, the top 20 PMI values are dominated by examples of longer chains of the 'organize-enumerate' type of connective co-occurrences. In the BCCWJ* data, there are few other types of co-occurrences, and the potential for combination into longer chains of connectives seems to be limited. In contrast, in both academic corpora, potential opportunities for forming longer chains of types other than 'organize-enumerate' were revealed.
The chains of connectives visualized in the directed graphs in Figure 1a and Figure 1b, created on the basis of the co-occurrence data, are to be understood as potential chains that can be used in the actual development of discourse, as also shown in Example (8) in Section 5. More specifically, they can be seen as 'ready-made' patterns that can potentially be used in argumentative prose such as academic papers and editorials.
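How potential chains arise from co-occurrence pairs can be sketched as follows: each pair becomes a directed edge, and chains are read off as paths through the graph. The edge list is a small illustrative subset of pairs discussed in Section 5, not the content of Figure 1a or Figure 1b, and the function names and the length limit are hypothetical.

```python
from collections import defaultdict

# Illustrative subset of ordered pairs discussed in the text (first → second).
PAIRS = [
    ("ika", "mazu"),
    ("mazu", "mata"),
    ("mata", "soshite"),
    ("mazu", "soshite"),
    ("nao", "mata"),
    ("shikashi", "sokode"),
]


def build_graph(pairs):
    """Build an adjacency list: connective -> connectives that may follow it."""
    graph = defaultdict(list)
    for first, second in pairs:
        graph[first].append(second)
    return graph


def chains_from(graph, start, max_len=4):
    """Enumerate potential chains starting at `start`.

    A chain is a path of at least two connectives in which no connective
    is repeated; `max_len` caps the number of connectives per chain.
    """
    results = []

    def extend(path):
        if len(path) > 1:
            results.append(tuple(path))
        if len(path) == max_len:
            return
        for nxt in graph.get(path[-1], []):
            if nxt not in path:
                extend(path + [nxt])

    extend([start])
    return results


graph = build_graph(PAIRS)
for chain in chains_from(graph, "mazu"):
    print(" → ".join(chain))
```

Starting from mazu, this subset yields mazu → mata, mazu → mata → soshite, and mazu → soshite. These are potential chains only; whether they occur in actual discourse has to be checked against text data.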
Next, these 'ready-made' patterns, belonging to the realm of the possible, were examined in actual discourse data. Specifically, instead of the three corpora, a small corpus of editorial and opinion articles from the Asahi Shimbun (300 articles) and a specialist humanities monograph were used to examine chains containing multiple co-occurring connectives.
The extraction of multiple connective co-occurrence chains from the Asahi Shimbun data yielded 18 co-occurrence pairs that were included in the set of 76 co-occurrence pairs extracted based on the top 20 PPM and PMI values. Some of these 18 co-occurrence pairs were frequently found to partially overlap in the extracted chains. This suggests the existence of longer systematic co-occurrence chains and also sheds light on the strategies for forming longer chains. In other words, the co-occurrence pairs of connectives as 'ready-made' parts play an important role in the formation of longer chains. The answer to RQ2 is therefore also affirmative.
In terms of interpretation and positioning, such chains of systematically co-occurring connectives are in some ways similar to what Wray (2002) and others refer to as formulaic expressions. They can be seen as a kind of formulaicity and therefore contribute to discourse development. However, they differ from the conventional notion of formulaicity in that they are observed at the discourse level, whereas the scope of formulaic expressions treated in conventional studies, such as Wray (2002) and Tanaka (2016), is limited to a single sentence. Therefore, many interesting findings from the conventional research on formulaic expressions cannot be directly applied to the distant co-occurrence of connectives.
To put the regularities observed in distant co-occurrence of connectives into proper perspective, Ishiguro's (2008) view of 'strategic usage' patterns is one reasonable way of looking at the phenomenon. At the same time, Bourdieu's notion of habitus also seems to provide a valid framework for its interpretation (see Bourdieu, 1991, 1994). In Bourdieu's terms, the 'ready-made' chains of systematically co-occurring connectives in discourse reflect argumentative patterns internalized by the writer/speaker in the course of linguistic activity. Therefore, while the co-occurrence chain patterns observed here are part of the individual habitus, they can also be interpreted as forming part of the collective habitus of a particular linguistic community, since listeners/readers also internalize these patterns. Peers in an academic discipline are a good example of such a community. Such internalized 'ready-made' patterns contribute to the ease of organization and development of discourse on the part of the author and to the ease of comprehension on the part of the reader/listener. On the other hand, there is a negative aspect to this phenomenon. Namely, by directing the flow of thought into predictable habitual channels, these ready-made patterns of argumentation can hinder the conception of new ideas and understanding.
As for the diversity of the usage of connectives, greater diversity is found in the general BCCWJ* data than in the academic data. The reason is the higher need for accuracy in the transmission of academic content, as compared to general use. This is also one of the conceivable motives behind the more pronounced formulaicity in academic communication.
In conclusion, the aim of the present study was very limited: to test Ishiguro's (2008) predictions about the 'strategic usage' patterning of connectives and to ascertain the potential contribution of such patterning to discourse development. A tentative conclusion can be drawn that 'strategic usage' patterning is indeed a widely recognized systematic phenomenon that contributes to discourse development and understanding. The present study has also shown that directed graph visualization has good potential for identifying such 'strategic usage' patterning. These preliminary results need to be further tested and elaborated, both quantitatively and qualitatively, using additional linguistic material. The findings are expected to have applications in language teaching, particularly academic writing and teaching Japanese as a second language, in critical discourse analysis, and in language theory in general.