CORD-19 Challenge

COVID-19 Open Research Dataset Challenge (CORD-19)


I am new to Kaggle, and in this post I will share how I am approaching my first Kaggle challenge.

I have created two notebooks on Kaggle: one for input data exploration (CORD19 — Input Data Exploration) and a second one for ML (CORD19).

1. CORD19 — Input Data Exploration



  • Input files are present in the following directory: “kaggle/input/CORD-19-research-challenge/”
  • The input folder contains the following 8 directories/files: json_schema.txt, metadata.csv, comm_use_subset, COVID.DATA.LIC.AGMT.pdf, noncomm_use_subset, metadata.readme, custom_license, biorxiv_medrxiv
  • Of these, only 4 folders contain JSON files: comm_use_subset, noncomm_use_subset, custom_license, and biorxiv_medrxiv
  • The number of JSON files in each folder, as of 28 March, is as follows:
    — Number of JSON files in noncomm_use_subset: 2350
    — Number of JSON files in biorxiv_medrxiv: 1053
    — Number of JSON files in comm_use_subset: 9315
    — Number of JSON files in custom_license: 20657
    — Total JSON files: 33375
  • No key is missing from any JSON file, but in all 4 folders some files have missing values for the abstract, ref_entries, and back_matter keys. The number of JSON files with a missing value for each key is as follows:
    folder_name         paper_id  metadata  abstract  body_text  bib_entries  ref_entries  back_matter
    noncomm_use_subset         0         0       646          0            0          153          807
    biorxiv_medrxiv            0         0       136          0            0            6          496
    comm_use_subset            0         0       942          0            0          330         1835
    custom_license             0         0      6824          0            0         1300         7976
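
The counts above can be reproduced with a short script along the following lines. This is only a minimal sketch, not the exact code from my notebook: the function names (summarize_folder, report) are made up for illustration, and it assumes that a "missing value" means the key is present but its value is empty (an empty string, list, or dict). The folder names and the 7 top-level keys are the ones listed above.

```python
import json
from collections import Counter
from pathlib import Path

# The 4 folders that contain JSON files, as listed in the post.
JSON_FOLDERS = (
    "noncomm_use_subset",
    "biorxiv_medrxiv",
    "comm_use_subset",
    "custom_license",
)

# Top-level keys of each paper's JSON, taken from the table header above.
TOP_LEVEL_KEYS = (
    "paper_id",
    "metadata",
    "abstract",
    "body_text",
    "bib_entries",
    "ref_entries",
    "back_matter",
)


def summarize_folder(folder):
    """Return (file_count, missing_keys, empty_values) for one folder.

    missing_keys counts files where a key is absent altogether.
    empty_values counts files where the key exists but its value is empty
    (assumption: "missing value" means an empty string, list, or dict).
    """
    missing_keys = Counter()
    empty_values = Counter()
    # Search recursively, since the JSON files may sit in nested sub-folders.
    files = sorted(Path(folder).rglob("*.json"))
    for path in files:
        with open(path, encoding="utf-8") as fh:
            paper = json.load(fh)
        for key in TOP_LEVEL_KEYS:
            if key not in paper:
                missing_keys[key] += 1
            elif not paper[key]:
                empty_values[key] += 1
    return len(files), missing_keys, empty_values


def report(root):
    """Print per-folder counts and return the total number of JSON files."""
    root = Path(root)
    total = 0
    for name in JSON_FOLDERS:
        n_files, missing, empty = summarize_folder(root / name)
        total += n_files
        print(f"Number of JSON files in {name}: {n_files}")
        print(f"  files with a missing key:   {dict(missing)}")
        print(f"  files with an empty value:  {dict(empty)}")
    print(f"Total JSON files: {total}")
    return total
```

Calling report() with the input directory mentioned above prints one block per folder. The exact path prefix depends on where the dataset is mounted in your environment.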

I will keep updating this post as I make progress in the challenge.



