Other news - Monday, 8 January, 2018

"It is a text mining tool that students can easily use"

The web application, in fact a special data processing software, has been developed specifically for general medicine students within the framework of a university innovation project. It directly accesses MedLine, the most frequently used database in the field of medicine (commonly referred to as PubMed), and it can process the literature in real time: based on a keyword search, the software reads the abstracts on the given topic. The results are shown in an easily understandable form that reveals the links between concepts and topics, and the application also presents them graphically. The program is still under development, but a trial version is available from the server of the Knowledge Centre and can already be used in practice. The developers' aim is not only to familiarise students with this new opportunity but also to involve them actively in the next steps. No similar application aimed directly at students exists at Hungarian or foreign universities for now, although the PubMed database is widely used for detailed data processing by research groups and private companies alike. The program was demonstrated by Dr. Ádám Feldmann, senior lecturer in the Department of Behavioural Sciences and member of the Big Data Research Group in the Szentágothai Research Centre. Dániel Orosz, innovation manager at the University of Pécs Chancellery's Technology Transfer Office, was also present at the interview.


Written by Rita Schweier


- First, let’s talk about your research group. What do you do exactly?

- We work on text and data processing algorithms, not only at the Medical School but also, in an interdisciplinary research group, at the Faculty of Business and Economics and at the Department of Sociology; the group belongs to the Big Data Cluster of the Szentágothai Research Centre. We call ourselves the Duo-Mining research group because we apply data analysis and text mining solutions simultaneously in our work. Our members include mathematicians, IT specialists, sociologists and psychologists, and recently we have successfully involved several Student Researcher Society students in the joint work. In the past year we submitted two data processing innovations at the university, both of which were accepted by the Innovation Committee; one of them, the MedMiner application, is the topic of this conversation. It is still in a test phase, but it can already be tried.


- In the course of your scientific career, what turned you towards text analysis?

- I have always been interested in data analysis, statistical thinking and computerised data processing. Few people know that in experimental psychology, at least a basic knowledge of computer programming, along with solid statistical and methodological skills, is an essential requirement beyond ordinary user-level computing.

- How did this new program become relevant?

- In my everyday work I frequently have to search for material on the Internet in special scientific databases, and for this purpose the most obvious source in the field of medicine is the PubMed database. When we use search words and keyword combinations as filtering conditions, many relevant results show up, but often there is too much information and we need further selection and refinement. Everybody who has done such work knows this phenomenon.

We would like to obtain the relevant information in a way that makes it possible to surf intuitively between the data, instead of facing only a vast indexed list. Therefore we have developed a program with which we can access PubMed through an API. Not all the data that can be downloaded from there are public, so we focus only on the abstracts, because those can be accessed by everyone. An abstract is a summary of approximately 150 words that briefly describes the aim of the study, its methodology, results and conclusions, along with the author and the institution. The entire medical literature database can be accessed through PubMed.
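The public programming interface behind PubMed is NCBI's E-utilities service. As an illustrative sketch (in Python rather than the R the program itself is written in, and with helper names of our own), this is roughly how a keyword search and an abstract download could be assembled:

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(term, retmax=500):
    """Build the E-utilities query that returns PubMed IDs for a keyword."""
    return f"{EUTILS}/esearch.fcgi?" + urlencode(
        {"db": "pubmed", "term": term, "retmax": retmax})

def efetch_url(pmids):
    """Build the query that downloads plain-text abstracts for those IDs."""
    return f"{EUTILS}/efetch.fcgi?" + urlencode(
        {"db": "pubmed", "id": ",".join(pmids),
         "rettype": "abstract", "retmode": "text"})

# The two requests a "depression" search for 500 abstracts would issue:
search = esearch_url("depression", retmax=500)
fetch = efetch_url(["29301234", "29305678"])  # PMIDs here are placeholders
```

The two-step pattern (search for IDs first, then fetch the records) is how E-utilities is designed to be used; only the abstract text is requested, matching the program's public-data-only approach.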

Our program, called MedMiner (short for MedLine Miner), was created in the high-level programming language R. It has a very simple, dashboard-based graphical interface. Dashboard-based visualisation shows complex information in a visual form, from which even a lay person can often read the connections. On the program's interface we enter the search expressions and other conditions, such as date or the authors' names, then we can set how many abstracts the program should read and in what time interval it should search. Once all this is set, a single click makes the program download the matching abstracts through the API. In the next step an NLP (natural language processing) module reads and processes them all: it removes the stop words, articles, sentence markers and empty characters, which are not closely connected to the topic.

This pre-screening is done very fast, and then the essence comes: the program checks which words occur in the abstracts, how often, and in what connections. Every abstract becomes a document with an accompanying table that shows which words appear in it and how often. Since the program builds this table for every document, the combined table grows enormous. For its visualisation we created a very simple interface based on the word cloud known to everyone: the more frequent a word, the bigger it appears in the cloud, or it appears in italics, in another font, or underlined. /In the meantime he shows the word set on the screen, with words of greater and lesser size and in bold, in various font types./
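The per-document tables and the frequency-to-size mapping can be sketched as follows; the function names and the linear size scale are our own illustrative choices, not the program's:

```python
from collections import Counter

def term_table(documents):
    """One word-frequency table per document, plus a merged
    corpus-wide table across all abstracts."""
    per_doc = [Counter(doc) for doc in documents]
    corpus = Counter()
    for table in per_doc:
        corpus += table
    return per_doc, corpus

def font_size(freq, max_freq, smallest=10, largest=60):
    """Scale a word's display size linearly with its frequency,
    so the most frequent word appears largest in the cloud."""
    return smallest + (largest - smallest) * freq / max_freq

docs = [["depression", "anxiety", "depression"],
        ["depression", "treatment"]]
per_doc, corpus = term_table(docs)
# "depression" occurs 3 times in total, so it gets the largest size.
```

Each `Counter` plays the role of one abstract's table; summing them yields the "enormous" combined table the word cloud is drawn from.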

There is another innovation, namely that it is interactive. If we click on a phrase that the program brings up from the abstracts, then on another page the word's network of connections appears, showing which phrases the word is connected with and how strong these connections are. If a connection is close, the connecting edge is thick. We believe this information can be informative for a student in itself, but we would also like to test its effectiveness. We would like to know to what extent they can use it for studying, and how this note-free way of gaining information could be integrated into their everyday lives. The program is already available from the Knowledge Centre's servers; we work in close cooperation with the centre in this project.
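A natural way to obtain such edge weights is co-occurrence counting: two words are connected more strongly the more abstracts they appear in together. A minimal sketch, assuming this simple counting scheme (the program's actual weighting may differ):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(documents):
    """Count how often each pair of words appears in the same abstract;
    the count becomes the edge weight (thicker edge = stronger link)."""
    edges = Counter()
    for doc in documents:
        for a, b in combinations(sorted(set(doc)), 2):
            edges[(a, b)] += 1
    return edges

docs = [{"depression", "anxiety", "sleep"},
        {"depression", "anxiety"},
        {"depression", "stress"}]
edges = cooccurrence_edges(docs)
# ("anxiety", "depression") co-occurs twice here, so its edge is thickest.
```

Clicking a word in the interface then amounts to selecting the edges incident to it and drawing them with widths proportional to these counts.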

- When did you start the development?

- The development has been going on since May 2015 and has now reached the phase where a "beta test" can be run: the students can try it, and through a questionnaire currently being prepared they can also give their opinions about it. Our aim is that the students themselves tell us what kind of tool they need, rather than us inventing it for them. Based on feedback from my own student groups and acquaintances, the procedure is useful, although it still has minor errors. We are waiting for the students' ideas in order to correct them.

- Let's make it more illustrative, from a practical point of view, what this program actually does. Let's search for a keyword.

- The keyword is going to be the well-tried depression. I set the program to download the last 500 abstracts. There is a "date range" setting (it cannot be adjusted yet, but soon it will be), then we click update. We find the frompubmed button, and if we press it, we can see a little "progress bar" below, indicating that the relevant word cloud is being created from the abstracts. This is where it starts to download the 500 abstracts. When the "progress bar" disappears, the program has downloaded and processed them and created the table. The essence comes next, because there are basic settings such as the minimum word frequency: it means the word cloud contains only words that have occurred at least 20 times in the abstracts. This can of course be made stricter, and the maximum number of words in the cloud can also be set, so that it is neither too crowded nor too sparse.
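The two settings described here, minimum word frequency and maximum word count, can be sketched as one filtering step (the function name and defaults are our own illustration):

```python
def cloud_words(corpus_counts, min_freq=20, max_words=100):
    """Apply the two word-cloud settings: keep only words seen at least
    min_freq times, then cap the cloud at the max_words most frequent."""
    frequent = [(w, n) for w, n in corpus_counts.items() if n >= min_freq]
    frequent.sort(key=lambda wn: wn[1], reverse=True)
    return frequent[:max_words]

counts = {"depression": 500, "patient": 310, "rare": 3}
print(cloud_words(counts, min_freq=20))
# [('depression', 500), ('patient', 310)]  -- "rare" is filtered out
```

Raising `min_freq` makes the cloud stricter; lowering `max_words` keeps it from getting too crowded, exactly the trade-off mentioned above.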

If someone is not familiar with the medical literature of depression and starts using this program, he can see the word depression in the middle, in a large size. Another phrase is patient, then in bold there are health, treatment and anxiety as well. Depression often comes with anxiety, so that is an extra piece of information. We can see that it is also connected with sleeping and stress. Let's click on anxiety. It shows a network of connections and tells us that in these last 500 abstracts anxiety occurred 428 times.

If I step back with the button, one by one, I can see which phrases have the strongest connection with depression. From this we can also see what new keywords we can search for in addition to depression to get to know it better, and this leads to new paths in our search.

We can refine the search if we write functional MRI next to depression. This way we can gain targeted information as well. Here, from the 500 imaging articles, it shows the cortical and subcortical structures that depression is connected to. Among the keywords the phrase cortex appears as well, and if I click on it, I can see that it occurred 384 times in the fMRI articles. Let's see the fields: prefrontal, medial prefrontal and dorsal parts, then the serotonergic, pregenual and dorsolateral regions appear as phrases on the screen, meaning that we can gain further targeted information in a second and third step. In a later version, below, in the part which is currently empty under the info table, the articles that remain from the 500 abstracts after a given click will be visible.

In the test versions more options work than those we can see right now. One of them is the so-called "topic analysis" procedure. Think of a fictional online newspaper as an example, where 70% of the content is domestic politics, 10% is foreign policy, and the rest is sport, culture and advertisements, so the articles fall into roughly 5-6 topics. Any article "can be mixed" from these topics, and working backwards, we can recover the topics statistically without knowing in advance what they are. These statistical procedures have been developed over the last 20 years. Scientific abstracts can be grouped, weighted and prioritised in a similar way according to their topics. With the development we would also like to be able to recommend to the students the articles worth reading in their chosen topics.
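The "mixing" intuition can be made concrete with a tiny forward example. The topic names, words and weights below are invented purely for illustration; topic models such as LDA estimate these weights in the reverse direction, from the texts alone:

```python
# Hypothetical topic-word distributions for the newspaper example.
topics = {
    "domestic": {"parliament": 0.6, "law": 0.4},
    "sport":    {"match": 0.7, "goal": 0.3},
}

def mix_article(weights, topics):
    """An article's word distribution as a weighted blend of topic
    distributions; a topic model infers the weights backwards."""
    blend = {}
    for name, share in weights.items():
        for word, p in topics[name].items():
            blend[word] = blend.get(word, 0.0) + share * p
    return blend

article = mix_article({"domestic": 0.7, "sport": 0.3}, topics)
# "parliament" accounts for 0.7 * 0.6 = 0.42 of this article's words.
```

Given only many such blended articles, the statistical procedures mentioned above recover both the topics and each article's mixing weights, which is what lets abstracts be grouped and prioritised by topic.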

As a further development direction we are going to connect the authors and the keywords. There is a new machine learning model developed at Google, the so-called "word embedding" model. It examines a word's semantic environment, the "halo" belonging to the word: the most representative phrases connected with each other, and the pathways and networks among them. All the procedures mentioned above reveal different aspects of the texts under examination.
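In an embedding model each word becomes a vector, and the "halo" of a word is the set of words whose vectors point in a similar direction. A minimal sketch with made-up three-dimensional vectors (real embeddings such as word2vec have hundreds of dimensions learned from context):

```python
import math

# Toy vectors invented for illustration only.
vectors = {
    "depression": [0.9, 0.1, 0.2],
    "anxiety":    [0.8, 0.2, 0.3],
    "cortex":     [0.1, 0.9, 0.1],
}

def cosine(u, v):
    """Cosine similarity: close to 1 for words used in similar contexts."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# "depression" sits closer to "anxiety" than to "cortex" in this toy space.
sim_anxiety = cosine(vectors["depression"], vectors["anxiety"])
sim_cortex = cosine(vectors["depression"], vectors["cortex"])
```

Ranking all words by this similarity to a clicked word would yield exactly the kind of representative-phrase network described above.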

- As a teacher and developer, where do you see the program's practical utility?

- Firstly, in the fact that students can receive the relevant pieces of information. As an expert I can use it less, because I already have in my mind the connections that the program would outline. If, however, I need a periodic overview, or I research connections and schemes unknown to me, then it can be useful. If, for example, I set the program to read all the abstracts in a given field (which can be hundreds of thousands), then there will surely be new knowledge, even new scientific results, about which neither I nor others have known.

Compared to notes it also has the advantage that you can see, from the perspective of several years, how a given scientific standpoint or the meaning of a phrase has changed, not to mention the broad international outlook and the unlimited scope. Our aim was to provide a text mining tool that can be used very easily. The steps to be performed are all familiar to the students. When we click on a word cloud, we can also see intuitively which phrase stands out, and anyone can interpret a simpler network of connections, even without prior training.

This program also enables discursive education; it can be used in classes too, and it offers active, interactive, exploratory learning. Besides building on prior knowledge, learning can also be supervised by placing different subprojects in the program: different fields of medicine (for example surgery or internal medicine) and their subfields, divided into diseases or other topics. These can be collected in advance, the abstracts downloaded, and the pathways found in them checked by experts of the given field. We are now working on an analysis in which we collect hundreds of thousands of abstracts and, with the help of a special text classifying procedure (so-called "garbage-collecting" topics), filter out the phrases that cannot be used. With this method we make access to even more accurate results easier.


Dániel Orosz, who works as an innovation manager at the University of Pécs Chancellery, Technology Transfer Office, is also present at the interview. The task of the office is the utilisation of the university's research results and the transfer of technology. In connection with this, they also provide legal help and business advice, in line with market demands.


- How can the researchers get in touch with you?

- In many ways, but the simplest is through our website – – or at the University Innovation Day organised by our office, in the framework of which the researchers and the students could apply for the Innovation Award for the first time this year. Besides this, we have announced an internal innovation tender as well, in which researchers and teachers employed by the university could win a gross 1.5 million HUF in support to develop their research projects and facilitate their utilisation. This is of course a drop in the ocean for most research projects, but it is enough to get from A at least to Á.

Getting in touch is, however, only the first step in the six-phase university technology transfer process. As soon as it happens, a pre-assessment starts according to two aspects: one is industrial property rights, the other is the business aspect, because it is worth having a market focus from the beginning so that a competitive advantage can be demonstrated at the end of the process. Development and financing run parallel to each other, and we work on them together with the researchers until the process reaches the Innovation Committee, which decides whether to accept the research into the "university intellectual property" portfolio. The chair of the committee is the vice rector for science and innovation; its members come from the faculties and the chancellery. The industrial property rights procedure starts once the committee has accepted the development. In this phase the costs are financed by the university, which is a great relief for the researchers, and we also actively help in the business negotiations needed for the utilisation.

- In the case of this project, at what stage is the cooperation?

- The Innovation Committee accepted the development, currently we are looking for Hungarian and international partners. Our main goal with the development is to make it a marketable product outside the scope of the university.

- Based on what criteria does the Innovation Committee decide about the acceptance of the development?

- After getting in touch with the researcher, two assigned specialists from the Technology Transfer Office, a lawyer and an economist, start consulting about the research project. They then report the research result, together with the researcher, to the university's Intellectual Property Administration System and prepare a pre-assessment of the intellectual property based on three main aspects: legal, market and university. The committee then receives the assessment with the two specialists' recommendation, and the inventor introduces the intellectual property in a presentation.

In the case of this project the utilisation and the university aspects have been strongly present, as the Medical School attracts the most students. It is a strategic goal to reach 5000 foreign students, and for this goal to succeed we need a higher level of infrastructure and education. An important part of the latter is this internal service as well.
