Europarl Scraper: 24 Languages of Politics, at your fingertips

I participated in a two-day PyData Berlin hackathon in early October and decided to build a scraper for the European Parliament. This was after I found the Europarl parallel corpus a bit underwhelming: it is messy and not tagged by party, speaker or topic (which is understandable, as it is primarily used as a multilingual training corpus for machine-translation models).

At the hackathon, many folks were working on really interesting projects to analyze bias, framing and different word usage depending on party. Since I know a bit of web scraping, I built a scraper for the current European Parliament site. The data from the scraper is also available via a public bucket on S3.

All of the folks involved in the hackathon shared their findings at last night's PyData Berlin meetup. It was really interesting! Felix Biessmann, David Batista and Jirka Lewandowski all found correlations between word choices and party. I encourage you to check out their slides!
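To give a rough idea of what "word choices by party" can look like in code, here is a minimal sketch. The records, party names and field names (`party`, `text`) are made up for illustration and are not the scraper's actual output format, and plain word counting is only a starting point, not the method any of the presenters used.

```python
import re
from collections import Counter, defaultdict

# Toy records invented for this example; the real scraped data
# may use different field names and structure.
speeches = [
    {"party": "Party A", "text": "We must protect our borders and our jobs."},
    {"party": "Party A", "text": "Jobs and borders come first."},
    {"party": "Party B", "text": "We must protect refugees and the climate."},
]


def word_counts_by_party(records):
    """Count lowercase word tokens separately for each party."""
    counts = defaultdict(Counter)
    for record in records:
        # Very naive tokenizer: runs of ASCII letters only.
        tokens = re.findall(r"[a-z]+", record["text"].lower())
        counts[record["party"]].update(tokens)
    return counts


counts = word_counts_by_party(speeches)
for party, counter in sorted(counts.items()):
    print(party, counter.most_common(3))
```

Real speeches would need proper tokenization per language and some normalization for how much each party speaks before any comparison means much.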

I hope we can have another PyData Berlin hackathon soon, and that my data will be useful for further research into political language bias. Although I spent a lot of time in my slides making jokes (as I don't have much analysis to present, and talking about web scraping is a bit boring), I do strongly believe that democracy is hard. The more folks we have who are "good at data" helping to analyze, keep watch and collaborate with those who understand politics, the better.

Here are my slides. Feel free to reach out if you have questions about the data or if you do anything interesting with it! 🙌