When people converse, multiple levels of speech and language representations have been found to converge. Recently, a new form of convergence called complexity matching was found in the nested clustering of acoustic events, and was shown to reflect convergence of hierarchical temporal structure in lexical, phrasal, and prosodic units. Complexity matching predicts that convergence in hierarchical temporal structure should not require interlocutors to utter the same phonemes, words, or syntactic structures. Instead, convergence is expected to occur in the distributional properties of speech sounds over the relatively long timescales of intonation and prosody. Complexity matching is also more general, in that it predicts convergence of other power-law distributions in speech and language. In the present study, we tested these predictions by examining convergence both within and across two different languages, as expressed in two different measures of speech. Pairs of bilingual speakers conversed exclusively in English, exclusively in Spanish, or with one partner speaking English while the other spoke Spanish (the Mixed condition). Results showed amounts of convergence in hierarchical temporal structure in the Mixed condition comparable to those in the English-only and Spanish-only conditions. Convergence was also found in the frequency distributions of lemmas spoken in the Mixed condition. These results show that convergence in the form of complexity matching does not require direct matching of linguistic tokens in conversation. Altogether, the results provide evidence for complexity matching as a basic principle of both monolingual and bilingual language interaction.