SES Embedding Method on French Social Graph - Liv Tollånes Stage FOCOS 2024

This repository contains the scripts, outputs, and report from my time as an intern at the FOCOS group during 2024.

Table of Contents

  1. Project Summary
  2. Repository Structure
  3. Data Locations on the Server
  4. The CORG-approach
  5. Contact Information

1. Project Summary

My work explored the application of the SES embedding method (He & Tsvetkova, 2023), an approach developed to infer socioeconomic status (SES) from social network data, in a French context. The hope was to reproduce their finding of a reasonably clear relationship between users' estimated positions and their income. The validation procedure consisted of correlation and regression analyses between users' coordinates in the first three dimensions and their associated income. Income was obtained by retrieving self-reported job titles via an N-gram frequency analysis and a token-matching procedure. Users with matching tokens (i.e., who mentioned one of the job titles in our list) were then manually inspected to weed out incorrectly assigned job titles.
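To illustrate the token-matching step, here is a minimal sketch. The keyword list, file, and column names are hypothetical; the actual job-group-to-keyword mapping lives in Overview_title_keywords.csv (see Section 3).

```python
import pandas as pd

# Hypothetical job-title keyword list; the real mapping is in
# annotations/Overview_title_keywords.csv on the server.
JOB_KEYWORDS = {
    "lawyer": "avocat",       # job group -> token to look for in bios
    "teacher": "professeur",
    "nurse": "infirmier",
}

def match_job_titles(bios: pd.DataFrame, text_col: str = "bio") -> pd.DataFrame:
    """Tag each user with the first job group whose keyword appears in their bio.

    This sketch does lowercase single-token matches; the real procedure
    used N-gram frequencies, as described above.
    """
    tokens = bios[text_col].str.lower().str.split()
    for group, keyword in JOB_KEYWORDS.items():
        hit = tokens.apply(lambda ts: keyword in ts if isinstance(ts, list) else False)
        bios.loc[hit & bios["job_group"].isna(), "job_group"] = group
    return bios

# Example usage on hypothetical data
users = pd.DataFrame({"bio": ["Avocat à Paris", "Je suis professeur de maths"]})
users["job_group"] = pd.NA
print(match_job_titles(users))
```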

2. Repository Structure

├── Code
│   ├── Graph preparation
│   │   ├── CORG_wrangling.ipynb
│   │   └── Wrangling2.ipynb
│   ├── Utility files
│   │   ├── ca_pipeline.py
│   │   ├── corg_pipeline.py
│   │   ├── model_comparison.py
│   │   └── utils2.py
│   ├── Validation
│   │   ├── CA_all_models.ipynb
│   │   ├── CORG.ipynb
│   │   ├── Diminspection.ipynb
│   │   ├── Income_val.ipynb
│   │   └── Income_val_prep.ipynb
│   ├── poetry.lock
│   └── pyproject.toml
├── Outputs
│   └── Outputs.pdf
└── README.md

Folder structure

  • Graph preparation: Contains notebooks for preparing the data for the rest of the analysis.
  • Utility files: Modules that support various parts of the project.
  • Validation: Contains notebooks for performing the Correspondence Analysis (CA) and all code for validating the results.
  • Outputs: Contains a PDF of all figures and tables used in the actual thesis.

File Descriptions

  • Graph preparation

    • Wrangling2.ipynb: Notebook for initial data cleaning: filtering the data, retaining French users only, and creating an informative edgelist for further analyses.
    • CORG_wrangling.ipynb: Notebook for labelling markers as high (H) or low (L) SES for input to the CORG approach.
  • Utility files

    • ca_pipeline.py: Pipeline script for running the Correspondence Analysis (CA). Includes a class with methods to create a bipartite graph, run various graph checks, and perform CA on the French edgelist (a sketch of the core CA computation appears after this list).
    • model_comparison.py: Contains a class to perform a WLS regression analysis and produce model comparison metrics.
    • utils2.py: Contains a collection of functions used throughout the project for various purposes.
    • corg_pipeline.py: Contains a class implementing both functionalities 1 and 2 of the CORG method.
  • Validation

    • CA_all_models.ipynb: Notebook for fitting the CA pipeline and obtaining estimate files for all nine models. Includes graph checks.
    • Diminspection.ipynb: Various inspections of the CA estimates.
    • Income_val.ipynb: Preparation of the validation data. Includes the N-gram frequency analysis of bios. The final file contains the user data with bios, job titles, and associated income.
    • Income_val_prep.ipynb: Prepares the validation data: fetching users' job titles and income via the N-gram frequency analysis, then cleaning the manually inspected job title files in preparation for model comparison and user estimate validation. The final result is the finished file of users with their job titles and income.
    • CORG.ipynb: Performs the full CORG approach: identifying the dimension that best separates the labelled markers, then projecting all data points onto the new dimension.
  • poetry.lock: Dependency lock file generated by Poetry.

  • pyproject.toml: Configuration file for the Poetry package manager, specifying dependencies and project metadata.

  • requirements_backup.txt: A backup file for an old virtual environment.
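Since the CA step in ca_pipeline.py is central to the project, here is a minimal, self-contained sketch of classical Correspondence Analysis via SVD on a toy follow matrix. It illustrates the technique only and is not the pipeline's actual code.

```python
import numpy as np

def correspondence_analysis(N: np.ndarray, n_dims: int = 3):
    """Classical CA of a users-x-markers contingency (follow) matrix N.

    Returns principal row and column coordinates in the first n_dims dimensions.
    """
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    # Standardized residuals; subtracting the outer product removes the trivial dimension
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates: rescale singular vectors by singular values and masses
    rows = (U[:, :n_dims] * sv[:n_dims]) / np.sqrt(r)[:, None]
    cols = (Vt.T[:, :n_dims] * sv[:n_dims]) / np.sqrt(c)[:, None]
    return rows, cols

# Toy binary follow matrix: 5 users x 3 markers
N = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [1, 0, 0],
              [0, 1, 1]], dtype=float)
user_coords, marker_coords = correspondence_analysis(N, n_dims=2)
print(user_coords)
```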

3. Data Locations on the Server

The data collection took place in March 2023, and access was granted for use in the current study through my affiliations with the Formal Computational Socio-Politics Group at the Learning Planet Institute in Paris and the Institut des Systèmes Complexes de Paris Île-de-France (ISC-PIF). The data are not included in this repository, meaning that the code cannot be run without applying for data access.

All the files I used throughout the project are currently on the SSH server under /home/livtollanes/NewData. However, on request, I have collected the most important data files in the folder /home/livtollanes/final_data. That folder has the following structure:

├── annotations
│   ├── Overview_title_keywords.csv
│   └── onlygreens_cleaned.csv
├── coordinates
│   ├── m1_column_coordinates.csv
│   ├── m1_jobs_rowcoords.csv
│   ├── m3_column_coordinates.csv
│   ├── m3_jobs_rowcoords.csv
│   ├── m7_column_coordinates.csv
│   └── m7_jobs_rowcoords.csv
└── data
    ├── followers_bios_french_updated.csv
    └── labeled_edgelist_hl.csv

Data and folder descriptions

  • Overview_title_keywords.csv: A collection of all job groups that occurred in the final validation data, together with the keywords associated with each job group.
  • onlygreens_cleaned.csv: The validation data after manual annotation. In other words, this file contains the users, their bios, job and income information, and other metadata, for the users whose job titles could be identified via the token-matching procedure and were not deleted during manual inspection.
  • Coordinate files: The row and column coordinates for the three selected models (described in the thesis). The job coordinates (*_jobs_rowcoords.csv) contain coordinates only for users with identified job titles. The column coordinates contain the full marker coordinates for the same models.
  • followers_bios_french_updated.csv: The bios of all French users after filtering.
  • labeled_edgelist_hl.csv: The fully filtered French edgelist, with added high/low (H/L) SES labels for the CORG approach. Used as input to the CA (see the loading sketch below).
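For orientation, here is a minimal sketch of loading the labeled edgelist and pivoting it into the sparse user-by-marker matrix that CA expects. The column names are assumptions, not the file's actual schema.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Column names are assumed; check the actual header of labeled_edgelist_hl.csv.
edges = pd.read_csv("labeled_edgelist_hl.csv", usecols=["follower_id", "marker_id"])

# Map raw ids to contiguous integer indices
users = edges["follower_id"].astype("category")
markers = edges["marker_id"].astype("category")

# Sparse binary user-by-marker follow matrix: the contingency table for CA
follow = csr_matrix(
    (np.ones(len(edges)), (users.cat.codes.to_numpy(), markers.cat.codes.to_numpy())),
    shape=(users.cat.categories.size, markers.cat.categories.size),
)
print(follow.shape, follow.nnz)
```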

Raw data location

All the data in unfiltered form are located in /home/livtollanes/SocialMarkers.

  • markers_followers_2023-05-19.csv = the raw edgelist
  • markers_followers_bios_2023-05-19.csv = follower metadata + bios
  • MarkersFrenchTwitter.xlsx = information about the selected markers, including their type
  • markers_bios_2023-05-19.csv = marker bios and metadata

4. The CORG-approach

I attempted to implement the CORG approach as a step in model selection. The aim was to label markers that we could confidently classify as either high or low SES, use these labels to identify the dimension that best separated the classified markers, and project all data points onto this newly identified dimension. The approach is the CORG approach created by the FOCOS group: https://github.com/pedroramaciotti/CORG/blob/main/tutorial/CORG_quickstart.ipynb
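The CORG package implements this properly (see the tutorial linked above). As a rough illustration of the underlying idea only, here is a sketch using scikit-learn's linear discriminant analysis to find a separating direction in the CA space and project all users onto it. The variable names, random inputs, and the LDA substitute are my assumptions, not the CORG API.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical inputs: CA coordinates for labelled markers and for all users
rng = np.random.default_rng(0)
marker_coords = rng.normal(size=(40, 3))            # markers in first 3 CA dims
marker_labels = np.array(["H"] * 20 + ["L"] * 20)   # manually assigned H/L SES labels
user_coords = rng.normal(size=(1000, 3))            # all users, same CA space

# Fit a one-dimensional discriminant separating H from L markers
lda = LinearDiscriminantAnalysis(n_components=1)
lda.fit(marker_coords, marker_labels)

# Project every user onto the separating direction: a 1-D SES-like score
user_scores = lda.transform(user_coords).ravel()
print(user_scores[:5])
```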

However, the results ended up being ambiguous. High F-scores were obtained for many models, and it was hard to select between them. We attributed this to the difficulty of labelling markers as either high or low SES, given the ambiguity inherent to the concept. SES is not like political parties, where one can be fairly sure in advance that the labelling is correct. Another issue was that we were unable to label enough of the markers to have adequate representation of H/L markers for all models, which further impeded using this approach for model selection.

5. Contact Information

e-mail: liv.tollanes@gmail.com
