Home

Description

This repository provides tools for extracting and visualizing information from scientific papers in XML format. Using GROBID for document processing, the scripts generate keyword clouds, produce charts of the number of figures per document, and extract links from the XML files.

Info

For any issues or questions, please open an issue in the project's issue tracker.

Features

Given an XML file (or a directory containing several), the tool extracts the data and produces:

- Keyword Cloud: a keyword cloud built from the abstract text.
- Charts: a chart showing the number of figures per article.
- Links: a list of the links found in each paper, excluding those in the references.
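The three extraction steps can be sketched with the standard library alone. This is a minimal illustration, not the code in `scripts/`: the function names, the sample document, and the word filter are hypothetical, and it assumes GROBID-style TEI output in which the abstract sits in an `abstract` element, figures are `figure` elements (tables being `figure` elements with `type="table"`), and the bibliography is wrapped in `listBibl`.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# TEI namespace used by GROBID output.
TEI = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI}

# Tiny hand-written stand-in for a GROBID result (illustrative only).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><profileDesc>
    <abstract><p>Open science tools make open research reproducible.</p></abstract>
  </profileDesc></teiHeader>
  <text>
    <body>
      <p>Code: <ref type="url" target="https://example.org/code">link</ref>,
         see <ref type="bibr" target="#b0">[1]</ref>.</p>
      <figure><head>Figure 1</head></figure>
      <figure><head>Figure 2</head></figure>
      <figure type="table"><head>Table 1</head></figure>
    </body>
    <back><div type="references"><listBibl>
      <biblStruct><ptr target="https://example.org/cited"/></biblStruct>
    </listBibl></div></back>
  </text>
</TEI>"""


def abstract_word_counts(root):
    """Count words of four or more letters in the abstract (input for a keyword cloud)."""
    abstract = root.find(".//tei:abstract", NS)
    text = "".join(abstract.itertext()) if abstract is not None else ""
    return Counter(re.findall(r"[a-z]{4,}", text.lower()))


def count_figures(root):
    """Count figure elements, skipping those marked as tables."""
    return sum(1 for fig in root.findall(".//tei:figure", NS)
               if fig.get("type") != "table")


def extract_links(root):
    """Return http(s) targets that are not inside the bibliography."""
    # ElementTree has no parent pointers, so collect bibliography nodes first.
    in_refs = {id(el)
               for bibl in root.iter(f"{{{TEI}}}listBibl")
               for el in bibl.iter()}
    return [el.get("target") for el in root.iter()
            if id(el) not in in_refs
            and el.get("target", "").startswith(("http://", "https://"))]


root = ET.fromstring(SAMPLE_TEI)
print(abstract_word_counts(root).most_common(3))
print(count_figures(root))
print(extract_links(root))
```

The word counts could then feed a word-cloud library and the per-file figure counts a bar chart; which plotting libraries the scripts actually use is not stated here.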

Project Structure

├── papers/              # Example research papers
├── data/                # Example XML files
├── results/             # Example directory for generated files
├── scripts/             # Python scripts for data extraction and visualization
│   ├── keywordCloud.py  # Generates a keyword cloud from abstracts
│   ├── charts.py        # Creates charts showing the number of figures per document
│   └── list.py          # Extracts links from XML files (excluding references)
├── docs/                # Additional documentation
└── tests/               # Tests to check functionality

Used Technologies and Standards

GROBID

We use GROBID to process scientific documents and extract their XML files for use as data input.
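How this repository invokes GROBID is not described here. As one possibility, a running GROBID server accepts a PDF at its `/api/processFulltextDocument` REST endpoint (multipart form field `input`) and returns TEI XML. The sketch below only builds such a request with the standard library; the server address `http://localhost:8070` and the helper name `build_grobid_request` are assumptions for illustration.

```python
import uuid
import urllib.request

# Assumed address of a locally running GROBID server.
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"


def build_grobid_request(pdf_name, pdf_bytes, url=GROBID_URL):
    """Build a multipart/form-data POST carrying the PDF in the 'input' field."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        (f'Content-Disposition: form-data; name="input"; '
         f'filename="{pdf_name}"\r\n').encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Build a request for a placeholder PDF payload; nothing is sent here.
req = build_grobid_request("paper.pdf", b"%PDF-1.4 placeholder")
print(req.get_method(), req.full_url)

# To actually send it (requires a running server):
# with urllib.request.urlopen(req) as resp:
#     tei_xml = resp.read().decode("utf-8")
```

The returned XML could then be saved under `data/` and used as input for the scripts.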