Introduction


     The growth of social media platforms such as Twitter and Facebook for communication between people has led to the creation of huge amounts of user-generated data. This is creating new challenges and perspectives in language technology research. Automatic processing of such text requires new methodologies. There is thus a great need to develop automatic systems for information extraction, information retrieval, machine translation, and higher-level Natural Language Processing (NLP) applications such as anaphora resolution and co-reference resolution that can be applied to social media text.



The objectives of SocAnaRes-IL are:

  • Create benchmark data for anaphora resolution in Indian-language text from social media sources such as Facebook, Twitter, and chat conversations.
  • Encourage researchers to develop novel systems for anaphora resolution.
  • Give researchers an opportunity to compare different techniques.




Task Description


     There are various challenges in anaphora resolution on this type of text. One of the main challenges is that Facebook (FB) posts and tweets are generally very short and thus often lack sufficient context to determine the antecedent of an anaphor without the aid of background or world knowledge. In particular, when resolving third person pronominals such as “he/they” (woh, ve, vo), in at least 20% of cases the antecedent is not mentioned in the current tweet or FB post; it either appears in an earlier post, for example one from the day before, or is understood from world knowledge.

Example Tweet:
 
 HI: “@preety bank wale usiko loan dete hain immandaar ko nahin”

 (“@preety Banks give loans to him, not to the honest”)

     In this tweet, “usiko” is the third person pronoun, and it refers to a person, Nirav Modi, who is corrupt and cheated the bank. The antecedent of this pronoun can be identified only with world knowledge.

     In comparison with English, Indian languages have more dialectal variation. These dialects are mainly influenced by region and community, which leads to different styles of writing. Some of the main issues in handling social media text such as tweets are: i) spelling errors; ii) abbreviated new vocabulary such as “b4u” for “before you”; iii) use of symbols such as emoticons/emojis; iv) use of meta tags and hashtags; v) code mixing. The data needs to be preprocessed to normalize the abbreviated vocabulary by providing expansions.
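
     The normalization of abbreviated vocabulary mentioned above could be sketched roughly as follows. This is only an illustrative sketch, not the organizers' preprocessing: the lexicon, the function name normalize, and the choice to leave @-mentions and hashtags untouched are all assumptions; only the “b4u” entry comes from the task description.

```python
import re

# Hypothetical expansion lexicon. Only "b4u" -> "before you" is taken from
# the task description; the other entries are illustrative assumptions.
EXPANSIONS = {
    "b4u": "before you",
    "gr8": "great",
    "thx": "thanks",
}

# A token is any run of non-whitespace characters.
TOKEN_RE = re.compile(r"\S+")


def normalize(text, lexicon=EXPANSIONS):
    """Expand known abbreviations, keeping whitespace and punctuation."""

    def expand(match):
        token = match.group(0)
        # Assumption: mentions and hashtags are kept exactly as written.
        if token.startswith(("@", "#")):
            return token
        # Split off trailing punctuation so "thx!" still matches "thx".
        core = token.rstrip(".,!?")
        tail = token[len(core):]
        return lexicon.get(core.lower(), core) + tail

    return TOKEN_RE.sub(expand, text)


print(normalize("@preety call me b4u leave, thx!"))
```

     A lookup table like this only covers abbreviations that have been listed in advance; spelling errors and code mixing would need separate handling.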

     The task is to identify an anaphor and its antecedent in a given text, where each text is a tweet.



Training Corpus


Training data was released on August 5th, 2020.

For more details on the training data, such as the data format, please click here.

Please register your team following the procedure given in the Registration section and request the data. The data will be mailed to the team contact person.



Registration


Registration is now open!
     Please register by sending an email to sobha@au-kbc.org with the subject line "Registration for SocAnaRes-IL 2020" and the following details:
"Team Leader Name:"
"Team Affiliation (Proper full Address of the Organization):"
"Team Contact Person name:" and "Email ID:"
"Languages for which participating:"
"Team Members Names:"
(PS: A maximum of 4 members is allowed per team)


Submission Format


     Participants have to submit their test runs in the same format as the training data.
Note: There should be no changes or alterations to the format of the test run submission file.
Each team can submit a maximum of 3 test runs for each language.


Evaluation Criteria


    We plan to use the standard evaluation metrics of Precision, Recall and F-measure. More details will be provided later.
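
    As a rough illustration of how these metrics are commonly computed, the sketch below scores a system run against gold annotations. The matching criterion is an assumption, since the official details are not given here: a predicted (anaphor, antecedent) pair counts as correct only if it exactly matches a gold pair. The function name and the integer identifiers in the example are hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted (anaphor, antecedent) pairs against gold pairs.

    Assumption: exact match of the whole pair counts as correct. The
    official evaluation may use a different matching criterion.
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical (anaphor_id, antecedent_id) pairs for one test run.
gold_pairs = {(1, 0), (5, 3), (9, 7)}
system_pairs = {(1, 0), (5, 2)}

p, r, f = precision_recall_f1(gold_pairs, system_pairs)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```

    In this example one of the two predicted pairs is correct and one of the three gold pairs is found, so precision is 1/2, recall is 1/3, and the F-measure (harmonic mean of the two) is 0.4.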

Task Coordinators - Organizing Committee


Computational Linguistics Research Group (CLRG),
AU-KBC Research Centre



Sobha Lalitha Devi, AU-KBC Research Centre, Chennai, India.