This paper presents the results of a study on information extraction from unrestricted Turkish
text using statistical language processing methods. In languages like English, a given root word
has only a small number of possible word forms. However, languages like Turkish
have very productive agglutinative morphology. Thus, it is difficult to build statistical models
for specific tasks using the surface forms of the words, mainly because of the data sparseness
problem. In order to alleviate this problem, we used additional syntactic information, i.e., the
morphological structure of the words. We have successfully applied statistical methods using
both the lexical and morphological information to sentence segmentation, topic segmentation,
and name tagging tasks. For sentence segmentation, we modeled the final inflectional
groups of the words, combined this model with the lexical model, and decreased the error rate
to 4.34%, which is 21% better than the result obtained using only the surface forms of the
words. For topic segmentation, the stems of the words (especially nouns) were found to
be more effective than the surface forms of the words, and we achieved a segmentation error
rate of 10.90% on our test set according to the weighted TDT-2 segmentation cost
metric. This is 32% better than the word-based baseline model. For name tagging, we used
four different information sources to model names. Our first information source is based on the
surface forms of the words. Then we combined contextual cues with the lexical model and
obtained some improvement. After this, we modeled the morphological analyses of the words,
and finally we modeled the tag sequence, reaching an F-measure of 91.56% according
to the MUC evaluation criteria. Our results are important because they show that using linguistic
information, i.e., the morphological analyses of the words, together with a corpus large enough to
train a statistical model, significantly improves these basic information extraction tasks for Turkish.