Matches in Nanopublications for { <http://purl.org/np/RAjyLA7-iEjl-Gbtz8AnROdFEAkzLvMXmH7OHk4N5lZrU#paragraph> ?p ?o ?g. }
Showing items 1 to 2 of 2 with 100 items per page.
- paragraph type Paragraph assertion.
- paragraph hasContent "Wikipedia dumps are packaged as XML documents and contain text formatted according to the MediaWiki markup syntax, with templates to be transcluded. Hence, a pre-processing step is required to obtain a raw text representation of the dump. To achieve this, we leverage WikiExtractor, a third-party tool that retains the text and expands the templates of a Wikipedia XML dump, while discarding other data such as tables, references, and images. We note that the tool is not completely robust with respect to template expansion. Such a drawback is expected for two reasons: first, new templates are constantly defined, thus requiring regular maintenance of the tool; second, Wikipedia editors do not always comply with the specifications of the templates they include. Therefore, we could not obtain a fully cleaned Wikipedia plain text corpus, and noticed gaps in its content, probably due to template expansion failures. Nevertheless, we argue that the loss of information is not significant and can be neglected despite the recall cost. From the entire Italian Wikipedia corpus, we slice the use case subset by querying the Italian DBpedia chapter for the Wikipedia article IDs of relevant entities." assertion.
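
The slicing step described in the paragraph can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the extracted corpus wraps each article in a `<doc id="..." url="..." title="...">` element, and that the article IDs have already been retrieved from the Italian DBpedia chapter. The function name `slice_corpus`, the sample text, and the hard-coded ID set are hypothetical.

```python
import re
from typing import Dict, Set

# Assumption: each extracted article is wrapped as
#   <doc id="..." url="..." title="..."> plain text </doc>
DOC_RE = re.compile(
    r'<doc id="(?P<id>\d+)"[^>]*>\n?(?P<text>.*?)</doc>',
    re.DOTALL,
)


def slice_corpus(extracted: str, wanted_ids: Set[str]) -> Dict[str, str]:
    """Return {article_id: plain_text} for articles whose ID is in wanted_ids."""
    return {
        match.group("id"): match.group("text").strip()
        for match in DOC_RE.finditer(extracted)
        if match.group("id") in wanted_ids
    }


# Hypothetical two-article sample standing in for the extracted corpus.
sample = (
    '<doc id="12" url="https://it.wikipedia.org/wiki?curid=12" title="Esempio A">\n'
    "Esempio A\n\nTesto del primo articolo.\n"
    "</doc>\n"
    '<doc id="34" url="https://it.wikipedia.org/wiki?curid=34" title="Esempio B">\n'
    "Esempio B\n\nTesto del secondo articolo.\n"
    "</doc>\n"
)

# Hypothetical ID set; in the described pipeline it would come from the
# query against the Italian DBpedia chapter.
subset = slice_corpus(sample, {"34"})
```

Matching on article IDs rather than titles avoids ambiguity from redirects and renamed pages. An article whose text was damaged by a template expansion failure still appears in the subset, only with gaps in its content.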