GSoC 2020: TensorFlow (Part-II)

2020, Aug 27    

GSoC 2020 with TensorFlow Datasets

This summer was super fun! And it was only possible because of my mentors Etienne Pot, Marcin Michalski, and Pierre Ruyssen.

Special shout-out to Etienne Pot. Without his constant code reviews and input, my work would not have been half as good.

Summary

Now it’s time to wrap up and share my contributions to TensorFlow Datasets.
I have summarized the key ones here:

  • Load datasets without reading the dataset generation code.
    Advantages:
    • Unlocks backward compatibility (generated dataset files can still be read, even if the dataset code doesn’t exist anymore).
    • Cross-language support (other languages could load the dataset directly from the generated dataset files).
  • Added folder-based methods to prepare custom image datasets, translation datasets, etc.
  • Added a command-line tool to simplify adding new datasets to TFDS.
  • Simplified the procedure to add datasets outside TFDS.
  • Fixed bugs to improve Windows compatibility.
  • Improved the TFDS dataset catalog by displaying dataset samples and indicating new versions and configs.
  • Added scripts to benchmark datasets and to clean and maintain TFDS.

And many more!
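
To make the folder-based idea above a bit more concrete, here is a minimal sketch of the directory convention I have in mind for custom image datasets: `root/<split>/<label>/<image>`. It uses only the Python standard library and is not the TFDS implementation. The function names (`scan_image_folder`, `make_demo_tree`) and the list of accepted extensions are made up for this illustration. In TFDS itself, the folder-based image builder is, as far as I know, exposed as `ImageFolder`, which handles all of this for you.

```python
import os
import tempfile

# Assumption for this sketch: which file extensions count as images.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def scan_image_folder(root):
    """Return {split: {label: [image paths]}} for a root/split/label/image tree."""
    examples = {}
    for split in sorted(os.listdir(root)):
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            continue  # Ignore stray files next to the split folders.
        for label in sorted(os.listdir(split_dir)):
            label_dir = os.path.join(split_dir, label)
            if not os.path.isdir(label_dir):
                continue
            images = sorted(
                os.path.join(label_dir, name)
                for name in os.listdir(label_dir)
                if name.lower().endswith(IMAGE_EXTENSIONS)
            )
            examples.setdefault(split, {})[label] = images
    return examples


def make_demo_tree(root):
    """Create a tiny fake dataset (empty files) to exercise the scanner."""
    layout = {
        ("train", "cat"): ["0.jpg", "1.jpg", "notes.txt"],  # notes.txt is skipped
        ("train", "dog"): ["0.png"],
        ("test", "cat"): ["0.jpg"],
    }
    for (split, label), names in layout.items():
        label_dir = os.path.join(root, split, label)
        os.makedirs(label_dir, exist_ok=True)
        for name in names:
            open(os.path.join(label_dir, name), "w").close()


if __name__ == "__main__":
    demo_root = tempfile.mkdtemp()
    make_demo_tree(demo_root)
    for split, labels in scan_image_folder(demo_root).items():
        for label, images in labels.items():
            print(split, label, len(images))
```

The point of the convention is that the split names and labels come from the folder names, so no dataset-specific generation code has to be written.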

Pull Requests

New Features and Enhancements

#2354: Script to detect Dead URLs
#2346: Get generated dataset file location by builder name
#2322: Load dataset without reading its generation code
#2326: Add config_name field in dataset_info
#2301: Add generate_statistics function
#2284: Show Supported Python versions
#2278: Add subsplit API for Better auto-sharding
#2221: Load and dump FeatureConnectors
#2194: Add new TFDS CLI command
#2186: Use f-strings to generate dataset doc
#2185: Add TFDS CLI
#2144: Add overwrite flag in generate_visualization
#2142: CleanUp use tf.nest.map_structure
#2141: Support datasets in folders with checksums & fake_data
#2094: Add Custom Translate Datasets template
#2088: Script to Benchmark dataset
#2068: Support for Custom Image Datasets
#1947: Add try_import context manager
#1832: Avoid registering tests datasets
#1831: Update document_datasets to indicate which datasets are tfds-nightly only
#1826: Script for updating all stable versions of tfds datasets
#1717: Add new Image Classification section
#1586: Test that new dataset has correctly set the checksums

Bug Fixes

#2269: Append path hash to extracted files
#2193: tf.io.gfile fails for gcs paths
#2099: Image Encoding
#2074: Fix list_directory function
#1916: Fix tfds/core bugs on windows
#1915: Renaming files on windows fails sometimes
#1914: Fix remaining tfds datasets bugs on windows
#1913: Fix tfds/image_classification bugs on windows
#1912: Don’t mock some os functions on windows
#1889: Fix regex error for windows
#1877: Fix kaggle downloader silent compression to ‘.zip’ files
#1857: list_full_names should not yield datasets with None version value
#1750: iter_archive() should not return directory path name
#1734: Download specified version through download_and_prepare script
#1712: Download dataset with given version

Dataset Updates

#1929: Update tf_flower dataset to iterate over archive
#1706: Read CelebA pictures from archive directly.
#1583: Add visual question-answer feature in clevr dataset
#1427: Add affNIST dataset to tfds.

Documentation

#2223: Update uc_merced Docstring
#2140: Update the source code link on dataset Catalog
#2112: Improve Kaggle API documentation and error msgs
#1702: Improve define the dataset outside TFDS doc
#1611: More beginner friendly documentation

Others

#2274: Update Caltech Birds dataset download urls
#2229: Update plant_village download url
#2183: Update Citrus leaves Dataset
#2117: Minor changes to GitHub workflow
#2006: Update xnli and multi_nli checksums
#1980: Update checksums of oxford_flowers102 and dtd datasets
#1954: Improve GitHub workflow
#1850: Minor typos
#1793: Remove unnecessary files in caltech_birds2011 dataset
#1774: Fix pylint errors in Image Classification datasets
#1773: Fix some minor typos
#1737: Update Wikipedia checksums and dataset
#1733: Update README
#1731: Fix pylint for tfds script and proto
#1658: Fix pylint errors for tensorflow_dataset/testing
#1619: Cleanup legacy code
#1553: Remove legacy Python 2 code and add pytype Python3 support
#1522: Update datasets to clean up legacy code
#1507: Update kitti dataset checksum files
#1400: Fix some broken links